Commit 3dc8f21
[SYSTEMML-1185] SystemML Breast Cancer Project
This is the initial commit of the SystemML breast cancer project! Please reference the attached `README.md` for an overview, background information, goals, our approach, etc. At a high level, this PR introduces the following new files/folders:

* `README.md`: Project information, etc.
* `Preprocessing.ipynb`: PySpark notebook for preprocessing our histopathology slides into an appropriate `DataFrame` for consumption by SystemML.
* `MachineLearning.ipynb`: PySpark/SystemML notebook for our machine learning approach thus far. We started simple, and are currently in need of engine improvements in order to proceed.
* `softmax_clf.dml`: Basic softmax model (multiclass logistic regression with normalized probabilities) as a sanity check.
* `convnet.dml`: Our current deep convnet model. We are starting simple with a slightly extended "LeNet"-like network architecture. The goal will be to improve engine performance so that this model can be efficiently trained, and then move on to larger, more recent model architectures.
* `hyperparam_tuning.dml`: A separate script for performing a hyperparameter search for our current convnet model. This has been extracted from the notebook because the current `parfor` engine implementation is not yet sufficient for this type of job.
* `data`: A placeholder folder into which the data can be downloaded.
* `nn`: A softlink that will point to the SystemML-NN library.
* `approach.svg`: Image of our overall pipeline, used in `README.md`.

Overall, this project aims to serve as a large-scale, end-to-end machine learning project that can drive necessary core improvements for SystemML.

Closes apache#347
1 parent 42ebc96 commit 3dc8f21

File tree

10 files changed: +2368 -0 lines changed

docs/img/projects/breast_cancer/approach.svg

Lines changed: 4 additions & 0 deletions

projects/breast_cancer/MachineLearning.ipynb

Lines changed: 561 additions & 0 deletions

projects/breast_cancer/Preprocessing.ipynb

Lines changed: 904 additions & 0 deletions

projects/breast_cancer/README.md

Lines changed: 137 additions & 0 deletions
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Predicting Breast Cancer Proliferation Scores with Apache Spark and Apache SystemML

Note: This project is still a **work in progress**.

## Overview
The [Tumor Proliferation Assessment Challenge 2016 (TUPAC16)](http://tupac.tue-image.nl/) is a "Grand Challenge" that was created for the [2016 Medical Image Computing and Computer Assisted Intervention (MICCAI 2016)](http://miccai2016.org/en/) conference. In this challenge, the goal is to develop state-of-the-art algorithms for automatic prediction of tumor proliferation scores from whole-slide histopathology images of breast tumors.

## Background
Breast cancer is the leading cause of cancer-related death among women in less-developed countries, and is the second leading cause of cancer-related death among women in developed countries, accounting for 29% of all cancers in women within the U.S. [1]. Survival rates increase with earlier detection, giving pathologists and the medical world at large an incentive to develop improved methods for even earlier detection [2]. There are many forms of breast cancer, including Ductal Carcinoma in Situ (DCIS), Invasive Ductal Carcinoma (IDC), Tubular Carcinoma of the Breast, Medullary Carcinoma of the Breast, Invasive Lobular Carcinoma, Inflammatory Breast Cancer, and several others [3]. Across all of these forms of breast cancer, the rate at which breast cancer cells grow (proliferation) is a strong indicator of a patient’s prognosis. Although there are many means of determining the presence of breast cancer, tumor proliferation speed has been proven to help pathologists determine the treatment for the patient. The most common technique for determining proliferation speed is the mitotic count (mitotic index) estimate, in which a pathologist counts the dividing cell nuclei in hematoxylin and eosin (H&E) stained slide preparations to determine the number of mitotic bodies. Given this count, the pathologist produces a proliferation score of either 1, 2, or 3, ranging from better to worse prognosis [4]. Unfortunately, this approach is known to have reproducibility problems due to variability in counting, as well as the difficulty of distinguishing between different grades.

References:
[1] http://emedicine.medscape.com/article/1947145-overview#a3
[2] http://emedicine.medscape.com/article/1947145-overview#a7
[3] http://emedicine.medscape.com/article/1954658-overview
[4] http://emedicine.medscape.com/article/1947145-workup#c12

## Goal & Approach
In an effort to automate the process of classification, this project aims to develop a large-scale deep learning approach for predicting tumor scores directly from the pixels of whole-slide histopathology images. Our proposed approach is based on a recent research paper from Stanford [1]. Starting with 500 extremely high-resolution tumor slide images with accompanying score labels, we aim to make use of Apache Spark in a preprocessing step to cut and filter the images into smaller square samples, generating 4.7 million samples for a total of ~7TB of data [2]. We then utilize Apache SystemML on top of Spark to develop and train a custom, large-scale, deep convolutional neural network on these samples, making use of the familiar linear algebra syntax and automatically-distributed execution of SystemML [3]. Our model takes as input the pixel values of the individual samples, and is trained to predict the correct tumor score classification for each one. In addition to distributed linear algebra, we aim to exploit task-parallelism via parallel for-loops for hyperparameter optimization, as well as hardware acceleration for faster training via a GPU-backed runtime. Ultimately, we aim to develop a model that is sufficiently stronger than existing approaches for the task of breast cancer tumor proliferation score classification.

References:
[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
[2] See [`Preprocessing.ipynb`](Preprocessing.ipynb).
[3] See [`MachineLearning.ipynb`](MachineLearning.ipynb), [`softmax_clf.dml`](softmax_clf.dml), and [`convnet.dml`](convnet.dml).

![Approach](https://apache.github.io/incubator-systemml/img/projects/breast_cancer/approach.svg)
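
The full preprocessing pipeline lives in [`Preprocessing.ipynb`](Preprocessing.ipynb); as a rough illustration of the tiling step described above, the sketch below cuts a slide into square samples with OpenSlide and parallelizes the work across slides with Spark. The tile size, the example slide path, and the column names are illustrative assumptions, not the project's actual settings.

```
# A minimal sketch of the Spark-based tiling step (illustrative only; see Preprocessing.ipynb
# for the real pipeline). Assumes OpenSlide, openslide-python, numpy, and pyspark are installed.
import numpy as np
import openslide
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("breast_cancer_tiling_sketch").getOrCreate()

def tile_slide(slide_path, tile_size=256):
    """Cut a whole-slide image into square RGB tiles at the highest-resolution level."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions
    for x in range(0, width - tile_size + 1, tile_size):
        for y in range(0, height - tile_size + 1, tile_size):
            # read_region returns an RGBA PIL image; drop the alpha channel.
            tile = np.asarray(slide.read_region((x, y), 0, (tile_size, tile_size)))[:, :, :3]
            # Flatten each tile into a row vector of pixel values for the DataFrame.
            yield (slide_path, x, y, tile.reshape(-1).tolist())

# Parallelize over slide paths so each executor tiles a subset of the slides,
# then collect the samples into a DataFrame for consumption by SystemML.
slide_paths = ["data/training_image_data/TUPAC-TR-001.svs"]  # example path from the layout below
tiles = spark.sparkContext.parallelize(slide_paths).flatMap(tile_slide)
df = tiles.toDF(["slide", "x", "y", "pixels"])
df.printSchema()
```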

---

## Setup (*All nodes* unless otherwise specified):
* System Packages:
  * `sudo yum update`
  * `sudo yum install gcc ruby`
* Python 3:
  * `sudo yum install epel-release`
  * `sudo yum install -y https://centos7.iuscommunity.org/ius-release.rpm`
  * `sudo yum install -y python35u python35u-libs python35u-devel python35u-pip`
  * `ln -s /usr/bin/python3.5 ~/.local/bin/python3`
  * `ln -s /usr/bin/pip3.5 ~/.local/bin/pip3`
  * Prepend `~/.local/bin` to the `PATH`.
* OpenSlide:
  * `sudo yum install openslide`
* Python packages:
  * `pip3 install -U matplotlib numpy pandas scipy jupyter ipython scikit-learn scikit-image flask openslide-python`
* SystemML (driver only):
  * `git clone https://github.com/apache/incubator-systemml.git`
  * `cd incubator-systemml`
  * `mvn clean package`
  * `pip3 install -e src/main/python`
* Create a `data` folder with the following contents (same location on *all* nodes):
  * `training_image_data` folder with the training slides.
  * `testing_image_data` folder with the testing slides.
  * `training_ground_truth.csv` file containing the tumor & molecular scores for each slide.
* Create a project folder (e.g. `breast_cancer`) with the following contents (driver only):
  * All notebooks (`*.ipynb`).
  * All DML scripts (`*.dml`).
  * SystemML-NN installed as an `nn` folder containing the contents of `$SYSTEMML_HOME/scripts/staging/SystemML-NN/nn` (either copy & paste, or use a softlink).
  * The `data` folder (or a softlink pointing to it).
* Layout:

  ```
  - MachineLearning.ipynb
  - Preprocessing.ipynb
  - ...
  - data/
    - training_ground_truth.csv
    - training_image_data
      - TUPAC-TR-001.svs
      - TUPAC-TR-002.svs
      - ...
    - testing_image_data
      - TUPAC-TE-001.svs
      - TUPAC-TE-002.svs
      - ...
  ```

* Adjust the Spark settings in `$SPARK_HOME/conf/spark-defaults.conf` using the following examples, depending on the job being executed:
  * All jobs:

    ```
    # Use most of the driver memory.
    spark.driver.memory 70g
    # Remove the max result size constraint.
    spark.driver.maxResultSize 0
    # Increase the message size.
    spark.akka.frameSize 128
    # Extend the network timeout threshold.
    spark.network.timeout 1000s
    # Set up some extra Java options for performance.
    spark.driver.extraJavaOptions -server -Xmn12G
    spark.executor.extraJavaOptions -server -Xmn12G
    # Set up local directories on separate disks for intermediate read/write performance, if running
    # on Spark Standalone clusters.
    spark.local.dirs /disk2/local,/disk3/local,/disk4/local,/disk5/local,/disk6/local,/disk7/local,/disk8/local,/disk9/local,/disk10/local,/disk11/local,/disk12/local
    ```

  * Preprocessing:

    ```
    # Save 1/2 executor memory for Python processes
    spark.executor.memory 50g
    ```

  * Machine Learning:

    ```
    # Use all executor memory for JVM
    spark.executor.memory 100g
    ```

* Start Jupyter + PySpark with the following command (one could also use YARN in client mode with `--master yarn --deploy-mode client`); a short sanity-check sketch for the SystemML setup follows this list:

  ```
  PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://MASTER_URL:7077 --driver-class-path $SYSTEMML_HOME/target/SystemML.jar --jars $SYSTEMML_HOME/target/SystemML.jar
  ```

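To confirm that the SystemML Python bindings and the SystemML jar are both visible from the notebook, a small smoke test can be run in the first cell. This is only a sketch (it is not part of the project notebooks or scripts); it assumes the `systemml` package installed via `pip3` above and the `sc` SparkContext provided by the PySpark session.

```
# Sanity-check sketch for the SystemML setup (not part of the project scripts).
# Run inside the Jupyter + PySpark session started above, where `sc` is the SparkContext.
from systemml import MLContext, dml

ml = MLContext(sc)

# Execute a tiny DML program and pull the result back into Python.
prog = dml("s = sum(seq(1, 100))").output("s")
s = ml.execute(prog).get("s")
print(s)  # expect 5050.0
```
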
## Create a Histopath slide “lab” to view the slides (just driver):
- `git clone https://github.com/openslide/openslide-python.git`
- Host locally:
  - `python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py path/to/data/`
- Host on server:
  - `python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -l HOSTING_URL_HERE path/to/data/`
- Open local browser to `HOSTING_URL_HERE:5000`.
