Chris M Wang edited this page May 30, 2017 · 3 revisions

Titanic - Goals

The reports, which consist of 3 parts, aim to meet the following goals.

  1. work with the Titanic data set from Kaggle at https://www.kaggle.com/c/titanic
  2. build source code in Scala or Python that runs in Spark 2.0.2 to analyze the Titanic data set.
  3. answer the question: “for subgroups of people boarding the Titanic, how would you maximize their individual probability of survival?”. You must define meaningful subgroups. You should submit your predictions in a file that clearly labels the identity of each person and the corresponding prediction.
  4. build at least two of {Naïve Bayes, logistic regression, random forests, support vector machines, neural networks} using the libraries of Spark MLlib only. Explain your choice; plot learning curves; explain observed behavior; investigate which features are most informative; do at least one round of error analysis to maximize your chosen metric (F1, accuracy, weighted F1); explain your choice of metric.
  5. complete an analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
  6. convey your analysis in writing and with supporting visualizations.
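
The candidate metrics named in goal 4 (accuracy, F1, weighted F1) can be compared on a small example before choosing one. The following is a minimal plain-Python sketch, not the Spark implementation used in the reports; the function names and the six-passenger label lists are made up for illustration only (1 = survived, 0 = did not survive).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_label(y_true, y_pred, label) * count
        for label, count in support.items()
    ) / total


# Hypothetical labels for six passengers (illustrative, not real results).
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]

print("accuracy    :", accuracy(y_true, y_pred))
print("F1 (survived):", f1_for_label(y_true, y_pred, 1))
print("weighted F1 :", weighted_f1(y_true, y_pred))
```

Because fewer passengers survived than died, accuracy can look acceptable even when the survivor class is predicted poorly. Reporting the F1 of the survivor class alongside weighted F1 makes that gap visible, which is one reason to state the choice of metric explicitly.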