Naive Bayes Classifier on News Dataset

Manpreet Singh
Apr 19, 2022
Importing the Dataset:

Dropping the columns that won't be needed for Naive Bayes, i.e. the link and date columns.

Checking for null and NaN values in the dataset.
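As a rough illustration of the two steps above, here is a minimal sketch using pandas. The tiny DataFrame and the column names headline and category are assumptions made for the example; only the link and date columns are named in this post.

```python
import pandas as pd

# Hypothetical stand-in for the news dataset (column names are assumptions).
df = pd.DataFrame({
    "headline": ["Stocks Rally As Markets Open", None],
    "category": ["BUSINESS", "SPORTS"],
    "link": ["https://example.com/a", "https://example.com/b"],
    "date": ["2022-01-01", "2022-01-02"],
})

# Drop the columns the classifier does not use.
df = df.drop(columns=["link", "date"])

# Count null/NaN values per remaining column.
null_counts = df.isnull().sum()
print(null_counts)
```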

Data pre-processing:

Converting the string values to lowercase and removing redundant characters from the strings using regex functions.

Applying the same method to the category column and replacing its values with the cleaned ones.
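The cleaning step could look roughly like the sketch below. The function name clean_text and the exact patterns are assumptions for illustration; the original regex may have kept or removed different characters.

```python
import re

def clean_text(text):
    """Lowercase the text and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Stocks RALLY, as markets open!!"))  # stocks rally as markets open
```

The same function can be applied to the category values so that labels compare consistently.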

Removing Stopwords from dataset:
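A minimal sketch of stopword removal, assuming a small hand-written stopword set. The set below is only an example; a fuller list from an NLP library could be swapped in.

```python
# Hypothetical, deliberately short stopword list for illustration.
STOPWORDS = {"a", "an", "the", "is", "as", "of", "to", "in", "and", "on"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Split a cleaned sentence on whitespace and drop the stopwords."""
    return [word for word in sentence.split() if word not in stopwords]

print(remove_stopwords("the team wins the final"))
```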

Finding the probability of a word in a sentence: P(word) = Number of documents containing the word / Total number of documents

Conditional probability: P(word | class) = Number of documents in the class containing the word / Total number of documents in that class
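The two formulas above can be sketched as follows. The data layout (a list of (tokens, category) pairs), the function names, and the add-alpha smoothing with 2 * alpha in the denominator are assumptions for this example; setting alpha=0 gives the plain ratios.

```python
def word_probability(word, docs, alpha=1):
    """P(word): share of documents that contain the word, with optional smoothing."""
    containing = sum(1 for tokens, _ in docs if word in tokens)
    return (containing + alpha) / (len(docs) + 2 * alpha)

def conditional_probability(word, category, docs, alpha=1):
    """P(word | class): share of documents in the class that contain the word."""
    in_class = [tokens for tokens, label in docs if label == category]
    containing = sum(1 for tokens in in_class if word in tokens)
    return (containing + alpha) / (len(in_class) + 2 * alpha)

# Toy documents, already cleaned and tokenized.
docs = [
    (["stocks", "rally"], "business"),
    (["team", "wins"], "sports"),
    (["stocks", "fall"], "business"),
]
print(word_probability("stocks", docs, alpha=0))
print(conditional_probability("stocks", "business", docs))
```

Smoothing keeps a word that never appears in a class from getting a probability of exactly zero.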

Top 10 and bottom 10 words by probability, based on their counts:
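One way to get such a ranking is with collections.Counter, as in this sketch. The helper names are hypothetical, and counting each word once per document is an assumption that matches the document-based probability used in this post.

```python
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each word appears."""
    counts = Counter()
    for tokens in docs:
        counts.update(list(dict.fromkeys(tokens)))  # each word once per document
    return counts

def top_and_bottom(counts, n=10):
    """Return the n most frequent and the n least frequent (word, count) pairs."""
    ranked = counts.most_common()
    return ranked[:n], ranked[:-n - 1:-1]

docs = [["stocks", "rally"], ["stocks", "fall"], ["team", "wins"]]
top, bottom = top_and_bottom(document_frequency(docs), n=2)
print(top)
print(bottom)
```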

Naive Bayes Classifier:

Calculating the probability using Bayes' theorem and then returning the class that has the highest probability for that word. This function takes a dataset and calculates the total probability of a word with respect to the total number of documents. It then passes the word to the conditional probability function, which calculates the probability with respect to a specific class. In the end it returns the class with the highest probability, which is the category the word most likely belongs to.
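The sketch below shows one possible shape for such a classifier, not the exact code from this post. It assumes documents are (tokens, category) pairs, uses add-alpha smoothing (alpha must be greater than 0), and scores a whole list of words by summing log probabilities, which is an extension of the single-word description above. P(word) is the same for every class, so it is left out of the comparison.

```python
import math
from collections import Counter, defaultdict

def train(docs, alpha=1):
    """Collect per-class document counts and per-class word document counts."""
    class_docs = Counter(label for _, label in docs)
    word_docs = defaultdict(Counter)
    for tokens, label in docs:
        word_docs[label].update(set(tokens))  # each word once per document
    return class_docs, word_docs, alpha

def predict(tokens, model):
    """Return the class with the highest log P(class) + sum of log P(word | class)."""
    class_docs, word_docs, alpha = model
    total = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total)
        for word in set(tokens):
            p = (word_docs[label][word] + alpha) / (n_docs + 2 * alpha)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    (["stocks", "rally"], "business"),
    (["team", "wins"], "sports"),
    (["stocks", "fall"], "business"),
]
model = train(docs)
print(predict(["stocks"], model))
print(predict(["team"], model))
```

Working in log space avoids multiplying many small probabilities together, which can underflow on longer sentences.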

Dividing the data into k-folds:
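A k-fold split can be written in plain Python along these lines. The function name, the default k=5, and the fixed seed are assumptions for this sketch; the implementation referenced in this post may differ.

```python
import random

def k_fold_split(rows, k=5, seed=0):
    """Shuffle the rows and deal them into k folds of nearly equal size."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return [rows[i::k] for i in range(k)]

folds = k_fold_split(range(10), k=5)
print([len(fold) for fold in folds])  # [2, 2, 2, 2, 2]
```

Each fold is held out once for evaluation while the remaining folds are used for training.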

Finding the accuracy:

The model gives 6% accuracy on the development dataset using the cross-validated splits. The number of sentences was small, which is why the accuracy was low; if we run this classifier on test data, it successfully gives us the correct class.
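For reference, cross-validated accuracy can be computed roughly as below. This is a generic sketch: train_fn and predict_fn are hypothetical callables, each row is assumed to be a (tokens, category) pair, every fold is assumed to be non-empty, and a majority-class baseline stands in for the classifier so that the example runs on its own.

```python
from collections import Counter

def cross_validated_accuracy(folds, train_fn, predict_fn):
    """Hold out each fold once, train on the rest, and average the accuracies."""
    scores = []
    for i, held_out in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        correct = sum(1 for tokens, label in held_out
                      if predict_fn(tokens, model) == label)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Majority-class baseline used only to make the sketch self-contained.
def train_majority(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def predict_majority(tokens, model):
    return model

folds = [
    [(["a"], "x"), (["b"], "x")],
    [(["c"], "x"), (["d"], "y")],
]
accuracy = cross_validated_accuracy(folds, train_majority, predict_majority)
print(accuracy)
```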

Contribution:

I coded most of the program myself to calculate the probability of a word and the conditional probability, and to define a Naive Bayes classifier. I mostly used dictionaries instead of dataframes, so it took me a while to understand and adapt the logic to that data structure. I created several functions that manipulate the dataframe to provide the desired functionality. These include a function that gets the length of dataframe['category'] and a function that takes a folded list from the kfold function and returns a dataframe with two columns. For the K-fold cross-validation, I took a reference from the internet. After calculating everything, my final custom function takes the dataframe and checks the accuracy of the model on the development dataset. For the smoothing part, I used Google to find the smoothing formula and how smoothing works, then implemented it inside the probability functions. I used basic probability formulas, such as the probability of a word in a document and the conditional probability, hardcoded them inside functions, and returned the value as a probability. I dug deeper into regex functions in Python and used them in the program to clean the sentences. For the final accuracy, I check whether a word is in a dataset document or not; if the word is present, I increment a counter, and in the end I calculate the probability as occurrences of the word / number of documents.

Challenges:

It was hard to understand how to do the K-fold split of the dataset. I watched a couple of videos to understand the concept of the K-fold split and came across a coding implementation of k-fold, which I will mention in the references. Secondly, I mostly used traditional Python rather than dataframes for manipulation, which took extra time both in coding and in implementing on the platform.