Home
Chris M Wang edited this page May 30, 2017
·
3 revisions
The report, which consists of three parts, aims to address the following tasks.
- work with the Titanic data set from Kaggle at https://www.kaggle.com/c/titanic
- build source code in Scala or Python that runs in Spark 2.0.2 to analyze the Titanic data set.
- answer the question: “for subgroups of people boarding the Titanic, how would you maximize their individual probability of survival?”. You must define meaningful subgroups. You should submit your predictions in a file that clearly labels the identity of each person and the corresponding prediction.
- build at least two of {Naïve Bayes, logistic regression, random forests, support vector machines, neural networks} using the libraries of Spark.MLLib only. Explain your choice; plot learning curves; explain observed behavior; investigate which features are most informative; do at least one round of error analysis to maximize your chosen metric (F1, accuracy, weighted F1); explain your choice of metric.
- complete an analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
- convey your analysis in writing and with supporting visualizations
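
The three candidate metrics named in the task list (accuracy, F1, weighted F1) can disagree on the same predictions, which is why the choice has to be explained. The following is a minimal sketch in plain Python, not the Spark evaluation code used in the report. The function names and the label convention (1 = survived, 0 = did not survive) are assumptions made for illustration, and the sample labels are invented rather than taken from the Titanic data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the class never appears in the labels or predictions.
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_score(y_true, y_pred, positive=label) * count / total
        for label, count in support.items()
    )


# Invented example labels (assumption: 1 = survived, 0 = did not survive).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print("accuracy   :", accuracy(y_true, y_pred))
print("F1 (class 1):", f1_score(y_true, y_pred, positive=1))
print("weighted F1:", weighted_f1(y_true, y_pred))
```

Because survivors are the minority class, F1 on the survived class penalizes missed survivors more visibly than accuracy does, while weighted F1 balances the per-class scores by how often each class occurs.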