Manpreet Singh UTA ID:1001873982
Importing the dataset:
The titanic dataset consists of 3 files:
train.csv
test.csv
gender_submission.csv
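Below is a minimal sketch of loading the three files with pandas, assuming they sit in the current working directory (the Kaggle notebook reads them from its own input path):

import pandas as pd

# Load the three Titanic CSV files (paths assumed to be the working directory)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
gender_submission = pd.read_csv('gender_submission.csv')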
Let's explore our dataset:
The train.csv file contains 12 different columns, which give us all the information about each passenger.
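A quick way to see those columns, their types, and their non-null counts (continuing from the train DataFrame loaded above):

# Column names, dtypes, and non-null counts for the 12 columns
train.info()

# Peek at the first few rows
print(train.head())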
Computing the null or NaN values in the dataset:
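A one-liner along these lines reports the missing values per column (in train.csv the gaps are in Age, Cabin, and Embarked):

# Number of null/NaN entries in each column of the training set
print(train.isnull().sum())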
Calculating the total number of survivors:
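For example, counting survivors directly from the Survived column:

# Counts of survivors (1) vs. non-survivors (0) and the overall survival rate
print(train['Survived'].value_counts())
print(train['Survived'].mean())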
Survivors with respect to gender:
Investigating further, we find that almost three-quarters of the female passengers survived the sinking.
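A grouped mean and a seaborn count plot make this split visible; the exact plots in the notebook may differ, but a sketch looks like this:

import seaborn as sns
import matplotlib.pyplot as plt

# Survival rate per gender; the female rate is roughly 0.74, matching the
# "almost three-quarters" observation above
print(train.groupby('Sex')['Survived'].mean())

# Survival counts split by gender
sns.countplot(data=train, x='Survived', hue='Sex')
plt.show()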
On further investigation we find that passengers travelling in first class had a higher probability of survival than those in the other classes.
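The same grouping by passenger class shows this effect (reusing the seaborn and matplotlib imports from the previous snippet); again a sketch, not the exact notebook code:

# Survival rate per passenger class; first class is noticeably higher
print(train.groupby('Pclass')['Survived'].mean())

# Mean survival per class as a bar plot
sns.barplot(data=train, x='Pclass', y='Survived')
plt.show()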
To complete our dataset, we fill the NaN values in the Age column with the mean of all the ages, i.e. about 29.
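In pandas this fill can be done with fillna; whether the mean (as stated here) or the median (as in the Contribution section below) is used, the pattern is the same:

# Replace missing ages with the mean age (about 29 in train.csv)
train['Age'] = train['Age'].fillna(train['Age'].mean())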
After training the Random Forest classifier with the following parameters, we saw the accuracy of our model increase by about 0.4 percent.
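The exact parameter values are those reported in the notebook; the snippet below is only a sketch of such a setup, where the 'Age' feature and max_depth=4 come from the Contribution section and the remaining feature names and hyperparameters (n_estimators, random_state) are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Feature list: 'Age' added per the Contribution section; the other names
# are assumed from the usual Kaggle starter features
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Age']

# One-hot encode the categorical 'Sex' column
X = pd.get_dummies(train[features])
y = train['Survived']

# max_depth=4 as described below; n_estimators and random_state are assumed
model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=1)
model.fit(X, y)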
Contribution:
I explored the data using the matplotlib and seaborn Python libraries, plotting graphs to look for hidden patterns and missing values. I used pandas to check whether any column in the dataset contains null values. I replaced the null values in the 'Age' column with the median of the other ages and wrote the result back into the original dataset. For our Random Forest classifier I added another parameter to the feature list, 'Age', and then changed the hyperparameters of the classifier as well: max_depth went from 5 to 4. Doing all of this increased the accuracy from 77.511% to 77.99%.
Link to the notebook: https://www.kaggle.com/pocomano/datamining-lab1?scriptVersionId=88760817
Download the Python file:
References:
Matplotlib, for plots and graphs:
Pandas, for manipulating and cleaning data:
Seaborn, for categorical data:
Data exploration tutorial: