Titanic Survival Prediction

Written by Lily Gates
May 2025

Description

This project predicts the survival outcomes of passengers aboard the Titanic using machine learning. It investigates how factors such as age, gender, passenger class, and fare influenced survival chances. Of the three models compared, the Random Forest classifier performed best, outperforming Logistic Regression and a Decision Tree on the key evaluation metrics.

Methodology

The analysis uses supervised learning and compares three classification models: Logistic Regression, Decision Tree, and Random Forest.

The methodology includes:

Data Preprocessing: Handling missing values, encoding categorical variables, and scaling numerical features.
Model Training: The dataset is split into training and test sets, and each model is fit on the training portion only.
Model Evaluation: The models are evaluated using performance metrics like accuracy, precision, recall, and F1-score. A confusion matrix is also used to assess model performance.
Feature Importance: The models rank features based on their contribution to the survival prediction.
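
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in titanic_machine_learning.py: the toy data generator, the choice of the columns Age, Fare, Pclass, and Sex, the median imputation, and the 80/20 split are all assumptions made so the example runs without train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

NUMERIC = ["Age", "Fare"]          # assumed numeric features
CATEGORICAL = ["Pclass", "Sex"]    # assumed categorical features


def make_toy_passengers(n=400, seed=0):
    """Build a synthetic stand-in for train.csv (not real Titanic data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Pclass": rng.integers(1, 4, n),
        "Sex": rng.choice(["male", "female"], n),
        "Age": rng.normal(30, 12, n).clip(1, 80),
        "Fare": rng.gamma(2.0, 15.0, n),
    })
    # Blank out roughly 20% of ages to mimic missing values.
    df.loc[rng.random(n) < 0.2, "Age"] = np.nan
    # Toy label rule: women and higher classes survive more often.
    logit = 1.5 * (df["Sex"] == "female") - 0.6 * (df["Pclass"] - 1) - 0.2
    prob = 1.0 / (1.0 + np.exp(-logit))
    df["Survived"] = (rng.random(n) < prob).astype(int)
    return df


def build_preprocessor():
    """Impute and scale numeric columns; one-hot encode categorical ones."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", numeric_steps, NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])


def evaluate_models(df, seed=42):
    """Fit the three classifiers and score each on a held-out test set."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["Survived"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        # Preprocessing lives inside the pipeline, so it is fit on training data only.
        pipe = Pipeline([("prep", build_preprocessor()), ("model", model)])
        pipe.fit(X_train, y_train)
        pred = pipe.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        }
    return results


results = evaluate_models(make_toy_passengers())
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Wrapping the imputer, scaler, and encoder in a Pipeline is one way to keep test-set statistics from leaking into training. Scores on the toy data say nothing about the real results.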

Required Dependencies

To run the project, the following Python libraries are required:

pandas
numpy
scikit-learn
matplotlib
seaborn

Output

The script generates:

Feature Importance Plots: Visualizations showing the most influential factors in predicting survival (e.g., age, gender, fare).
Confusion Matrix: For each model, visualizing true positives, false positives, true negatives, and false negatives.
Model Performance Metrics: Including accuracy, precision, recall, and F1-score for each model.
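
As a small worked example of how the reported metrics relate to the confusion matrix, the sketch below derives accuracy, precision, recall, and F1-score from the four cell counts. The label vectors are hypothetical and are not output from the project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of all predictions that are correct
precision = tp / (tp + fp)               # share of predicted survivors who survived
recall = tp / (tp + fn)                  # share of actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here the matrix is [[3, 1], [1, 3]], so all four metrics come out to 0.75. To draw the matrix rather than print it, scikit-learn's ConfusionMatrixDisplay.from_predictions is one option.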

Limitations

Despite the Random Forest model outperforming the other models in key metrics, there are several limitations:

Limited Feature Set: The model was trained on a limited set of features, excluding potentially important variables like cabin location, family identifiers, or group ticket information. This simplification may have overlooked crucial survival patterns.
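Overfitting and Bias: Although a Random Forest is less prone to overfitting than a single Decision Tree, it can still overfit a small dataset, particularly when its trees are grown to full depth. Its impurity-based feature importances are also biased toward features with many distinct values, such as fare or age, so the rankings may overstate those features relative to low-cardinality ones like class or gender.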
Contextual Factors: The analysis does not include critical contextual factors such as proximity to lifeboats, crew behavior, or personal connections, all of which were likely influential during the Titanic disaster.
Generalizability: The model was validated only on a holdout portion of the same dataset, so its performance on data from other sources or in different scenarios remains untested.
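
The importance bias noted in the limitations can be illustrated with a small experiment. The sketch below is not part of the project: it uses synthetic data with one informative binary column and one unrelated continuous column (both names are hypothetical), and compares the forest's built-in impurity-based importances with permutation importances computed on held-out data, a technique swapped in here as a cross-check.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "informative_binary": rng.integers(0, 2, n),  # drives the label
    "noise_continuous": rng.normal(size=n),       # unrelated, many distinct values
})
# The label equals the binary feature, with about 15% of labels flipped.
flip = rng.random(n) < 0.15
y = np.where(flip, 1 - X["informative_binary"], X["informative_binary"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances come from the training data and sum to 1.
impurity = pd.Series(forest.feature_importances_, index=X.columns)

# Permutation importance: drop in test accuracy when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

In a setup like this, the impurity column may give the noise feature noticeable weight because fully grown trees split on it to fit the flipped labels, while shuffling it on held-out data should change accuracy very little.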

Future Improvements

Experiment with different scaling methods for numerical features.
Test different tree depth levels for the Decision Tree model to avoid overfitting.
Explore alternative methods for addressing missing "Age" values.
Generate predictions for the Kaggle test.csv file with the Random Forest model trained on train.csv, and compare the resulting score with the holdout performance.
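
One possible way to approach the tree depth item is to compare cross-validated accuracy across several max_depth values. The sketch below uses scikit-learn's make_classification as a synthetic stand-in for the preprocessed Titanic features; the candidate depths and the 5-fold setting are assumptions, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features (assumption).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

depth_scores = {}
for depth in [2, 4, 6, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
    depth_scores[depth] = scores.mean()

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean CV accuracy={score:.3f}")

# Depth with the highest mean cross-validated accuracy.
best_depth = max(depth_scores, key=depth_scores.get)
print(f"best max_depth: {best_depth}")
```

If the unrestricted tree scores lower than a shallow one, that is a sign the deeper tree is fitting noise in the training folds.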
