Kaggle Writeup: Titanic Dataset
Alex Klibisz, 8/6/2016
Recently I've spent a fair amount of time learning machine learning theory and techniques. In terms of specific material, I completed the Stanford Machine Learning Course on Coursera, and I've been reading the Machine Learning Mastery series.
This weekend I took some time to submit an entry to the Kaggle.com Titanic tutorial dataset. My solution, available here on Github, drew heavily from the techniques introduced in Jason Brownlee's Machine Learning with Python.
From a high level, my approach was as follows:
- Read in data, remove noisy features (passenger id, name, ticket id), recode categorical data to integers, split the data 70/30 for training and validation. (code, code)
- Spot-check several algorithms offered by scikit-learn, measuring their accuracy. The algorithms: logistic regression, linear discriminant analysis, K-neighbors classification, decision tree classification, Gaussian naive Bayes, support vector classifier. (code)
- Spot-check the same algorithms, but with scaled data. I used the scikit-learn StandardScaler class to standardize the data. (code)
- Spot-check several ensemble algorithms offered by scikit-learn, measuring their accuracy. This was done without scaling the data. The algorithms: Adaboost classifier, gradient boost classifier, random forest classifier, extra trees classifier. (code)
- The support vector classifier proved to be the best non-ensemble method, so I used the scikit-learn GridSearchCV class to tune the penalty parameter (C) and the kernel. This improved the support vector classifier slightly, but it still wasn't stronger than the gradient boost classifier. (code)
- At this point I had the results for each of the models stored, so I plotted them. The box-plots represent the accuracy of each algorithm as it was run through ten-fold cross validation. (code)
- Loop over the results and pick the model with the highest mean accuracy. This turned out to be the gradient boost classifier. (code)
- Evaluate the classification accuracy on the 30% validation data. This turned out to be about 80%. (code)
- Read in the Kaggle.com test data (no labels provided), use the trained gradient boost classifier model to predict survival for this data, save the predictions to a file. (code, code, code)
- Submit predictions to Kaggle.
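The steps above can be sketched end-to-end. The snippet below is a minimal illustration, not the writeup's actual code: it uses synthetic data from make_classification as a stand-in for the Titanic features, spot-checks only three of the models, and the grid values for the SVC are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (train_test_split, KFold,
                                     cross_val_score, GridSearchCV)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the cleaned Titanic features, split 70/30.
X, y = make_classification(n_samples=300, n_features=7, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=7)

# Spot-check a few models with ten-fold cross validation;
# the SVC gets scaled inputs via a pipeline.
models = {
    'LR': LogisticRegression(max_iter=1000),
    'SVC (scaled)': make_pipeline(StandardScaler(), SVC()),
    'GBM': GradientBoostingClassifier(random_state=7),
}
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = {name: cross_val_score(m, X_train, y_train, cv=kfold,
                                 scoring='accuracy')
           for name, m in models.items()}

# Tune the SVC's penalty parameter C and kernel with a grid search.
grid = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                    {'svc__C': [0.1, 1, 10],
                     'svc__kernel': ['linear', 'rbf']},
                    cv=kfold, scoring='accuracy')
grid.fit(X_train, y_train)

# Pick the model with the highest mean CV accuracy and
# evaluate it on the 30% hold-out split.
best_name = max(results, key=lambda n: results[n].mean())
best = models[best_name].fit(X_train, y_train)
print(best_name, round(best.score(X_val, y_val), 3))
```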
My submission had a score of 77% on Kaggle's holdout data. That's fairly mediocre, but not bad considering the small amount of time spent on this one. Interestingly, the scikit-learn Gradient Boost algorithm I used was almost identical to simply splitting on gender (all women survive, all men don't).
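To check how close a model comes to that gender split, one can compare its predictions to the rule directly. The arrays below are made up for illustration, with sex coded 0 for female and 1 for male (LabelEncoder's alphabetical ordering of 'female'/'male'):

```python
import numpy as np

# Hypothetical example rows: `sex` as encoded by LabelEncoder,
# `preds` a classifier's survival predictions on the same rows.
sex = np.array([0, 1, 1, 0, 1, 0])
preds = np.array([1, 0, 0, 1, 0, 1])

# The gender rule: all women (0) survive, all men (1) don't.
gender_rule = 1 - sex

# Fraction of rows where the model agrees with the gender split.
agreement = np.mean(preds == gender_rule)
print(agreement)
```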
There are some things I may try to improve this:
- Use some method of feature selection to make sure I can rule out the "noisy" columns I mentioned above. Maybe there's some rhyme or reason to the ticket numbers?
- Get a better understanding for the ensemble algorithms and tune their parameters.
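For the second item, GridSearchCV works for the ensemble models just as it did for the SVC. A minimal sketch on synthetic data, with an illustrative parameter grid rather than one actually used here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the example.
X, y = make_classification(n_samples=200, random_state=7)

# Search a small grid over the main gradient boosting knobs.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=7),
    {'n_estimators': [50, 100],
     'learning_rate': [0.05, 0.1],
     'max_depth': [2, 3]},
    cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```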
The scikit-learn Imputer class makes it trivial to impute missing values in a dataset. Be careful to impute the training and testing data separately to avoid data leakage.

```python
# Impute missing values separately for train and test
# (X_train and X_validate have already been read from the CSV and split)
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_train = imp.fit_transform(X_train)
X_validate = imp.fit_transform(X_validate)
```
The scikit-learn LabelEncoder class makes it trivial to recode arbitrary categories to integers.

```python
# Recode categorical variables to integers using LabelEncoder
# (df is the pandas data frame containing all data from the CSV)
from sklearn.preprocessing import LabelEncoder

le_sex = LabelEncoder()
df['sex'] = le_sex.fit_transform(df['sex'].astype('str'))
le_cabin = LabelEncoder()
df['cabin'] = le_cabin.fit_transform(df['cabin'].astype('str'))
le_embarked = LabelEncoder()
df['embarked'] = le_embarked.fit_transform(df['embarked'].astype('str'))
```