This is a data science project that performs binary classification on all US stocks. The goal is to compare a traditional machine learning approach with a deep learning approach for time series classification.
A stock ticker is labeled "1" (positive) if its price has "fallen to the ground", whatever that means, and "0" (negative) otherwise. Such low-price stocks are very risky, but price patterns can be very diverse, so it is difficult to filter them out with any fixed rule. We manually labeled 414 positive tickers and 616 negative tickers, randomly chosen from a pool of 6000+ tickers, for a total of 1030 labeled tickers. Labels are stored in the label.json file.
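The internal layout of label.json isn't documented here; as a minimal sketch, assuming it is a flat ticker-to-label mapping (that layout is an assumption), the label counts could be checked like this:

```python
import json

# Load the manually assigned labels. The layout is assumed to be a flat
# mapping from ticker symbol to 0/1, e.g. {"AAPL": 0, "XYZ": 1}.
with open("label.json") as f:
    labels = json.load(f)

positives = sum(1 for y in labels.values() if y == 1)
print(f"{positives} positive / {len(labels) - positives} negative tickers")
```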
download the repository, create a new Python environment, and install dependencies.
```
pip install -r requirements.txt
```

① download historical day-level price data for all US stocks. The list of tickers (data/tickets.txt) comes from Nasdaq and may not be 100% complete or up-to-date.
```
cd data
python download.py
```

Each ticker's history will be downloaded as a separate CSV file in the data/csv folder. You can use the `--num=1000` flag to download data for only a certain number of tickers at a time.
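The data source used by download.py isn't specified in this README; as a rough sketch, assuming the yfinance package (not necessarily what the project actually uses), a per-ticker download could look like:

```python
import yfinance as yf
from pathlib import Path

# Hypothetical sketch: fetch daily history for each ticker in tickets.txt and
# write one CSV per ticker under csv/. The real download.py may use a
# different data provider and different options.
tickers = Path("tickets.txt").read_text().split()
out_dir = Path("csv")
out_dir.mkdir(exist_ok=True)

for ticker in tickers[:1000]:  # mirrors the idea behind the --num flag
    history = yf.Ticker(ticker).history(period="max", interval="1d")
    if not history.empty:
        history.to_csv(out_dir / f"{ticker}.csv")
```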
② plot all the data.
```
python plot.py
```

The plots will be saved in the data/plots folder.
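plot.py's details aren't given; a minimal sketch, assuming matplotlib and a "Close" column in each CSV (the column name is an assumption), might look like:

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sketch: save one price chart per ticker under plots/.
# The real plot.py may plot other columns or style the charts differently.
Path("plots").mkdir(exist_ok=True)
for csv_path in Path("csv").glob("*.csv"):
    df = pd.read_csv(csv_path)
    plt.figure(figsize=(8, 3))
    plt.plot(df["Close"])  # assumes a "Close" column exists
    plt.title(csv_path.stem)
    plt.savefig(Path("plots") / f"{csv_path.stem}.png")
    plt.close()
```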
① extract features from data
```
cd ml
python extract_features.py
```

The output will be two CSV tables in the ml folder, one for labeled tickers and one for unlabeled tickers.
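The specific features are not listed in this README; as a sketch of the general idea, assuming simple hand-crafted features such as the drawdown from the all-time high and the latest price relative to the historical mean (these particular features, file names, and paths are assumptions):

```python
import json
from pathlib import Path

import pandas as pd

def extract_features(close: pd.Series) -> dict:
    """Toy hand-crafted features on a daily close series (illustrative only)."""
    peak = close.cummax()
    return {
        "drawdown_from_peak": float(1.0 - close.iloc[-1] / peak.iloc[-1]),
        "last_over_mean": float(close.iloc[-1] / close.mean()),
    }

labels = json.loads(Path("../label.json").read_text())  # label.json location is an assumption
rows = []
for csv_path in Path("../data/csv").glob("*.csv"):
    close = pd.read_csv(csv_path)["Close"]  # assumes a "Close" column
    rows.append({"ticker": csv_path.stem,
                 **extract_features(close),
                 "label": labels.get(csv_path.stem)})

df = pd.DataFrame(rows)
# The real output file names may differ; these two are hypothetical.
df[df["label"].notna()].to_csv("labeled_features.csv", index=False)
df[df["label"].isna()].to_csv("unlabeled_features.csv", index=False)
```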
② fit model
```
python fit.py --model=lr
```

Available models:

- `--model=lr` for logistic regression (default)
- `--model=tree` for decision trees
- `--model=boost` for gradient boosting
You should be able to get around 95% test accuracy. The model file will be saved as model.pkl in the same folder.
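fit.py's internals aren't shown; a minimal sketch, assuming scikit-learn and the hypothetical labeled_features.csv from the sketch above, could look like:

```python
import argparse
import pickle

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sketch of the --model flag; the real fit.py may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--model", default="lr", choices=["lr", "tree", "boost"])
args = parser.parse_args()

models = {
    "lr": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4),
    "boost": GradientBoostingClassifier(),
}

df = pd.read_csv("labeled_features.csv")  # file name from the earlier sketch
X = df.drop(columns=["ticker", "label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = models[args.model].fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
```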
③ classify 5000+ unlabeled tickers
```
python pred.py
```

Predictions will be saved to a prediction.json file.
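pred.py's internals aren't given either; a minimal sketch, reusing model.pkl and the hypothetical unlabeled_features.csv from the sketches above:

```python
import json
import pickle

import pandas as pd

# Hypothetical sketch: load the saved model and score the unlabeled tickers.
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

df = pd.read_csv("unlabeled_features.csv")  # file name from the earlier sketch
preds = clf.predict(df.drop(columns=["ticker", "label"]))

with open("prediction.json", "w") as f:
    json.dump(dict(zip(df["ticker"], map(int, preds))), f, indent=2)
```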
① prepare data for training with
```
cd cnn
python prepare.py
```

The output is two CSV files, training_data.csv and unlabeled_data.csv, in the cnn folder.
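prepare.py's exact preprocessing isn't described; a rough sketch, assuming each price series is resampled to a fixed length and min-max scaled so that the CNN gets equally sized inputs (this preprocessing choice and the paths are assumptions):

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd

SEQ_LEN = 256  # hypothetical fixed input length

def to_fixed_length(close: pd.Series, n: int = SEQ_LEN) -> np.ndarray:
    """Resample a close-price series to n points and scale it into [0, 1]."""
    x = np.interp(np.linspace(0, len(close) - 1, n), np.arange(len(close)), close.to_numpy())
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros(n)

labels = json.loads(Path("../label.json").read_text())  # label.json location is an assumption
rows_labeled, rows_unlabeled = [], []
for csv_path in Path("../data/csv").glob("*.csv"):
    close = pd.read_csv(csv_path)["Close"]  # assumes a "Close" column
    ticker = csv_path.stem
    features = to_fixed_length(close).tolist()
    if ticker in labels:
        rows_labeled.append([ticker, labels[ticker], *features])
    else:
        rows_unlabeled.append([ticker, *features])

pd.DataFrame(rows_labeled).to_csv("training_data.csv", index=False)
pd.DataFrame(rows_unlabeled).to_csv("unlabeled_data.csv", index=False)
```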
② train the model with
```
python train.py
```

wandb is used for logging with the project name "stock-cls", so please create an account and log in before training. Alternatively, you can comment out the logger variable in train.py. After training, model weights will be saved in the cnn/stock-cls/[some-name]/checkpoints/ folder.
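The checkpoint path layout suggests a PyTorch Lightning-style trainer, but that is an assumption; a minimal sketch of how the wandb logger could be wired up (and where to comment it out):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Hypothetical sketch; the real train.py defines its own model and dataloaders.
logger = WandbLogger(project="stock-cls")  # comment this line out to skip wandb
trainer = pl.Trainer(max_epochs=50, logger=logger)
# trainer.fit(model, train_dataloader)     # model/dataloader come from train.py
```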
③ classify unlabeled tickers with
```
python pred.py
```

The output will be a prediction.json file in the cnn folder.
point 0. defining the problem is often hard.
If you had a precise definition of your problem, you would already have solved it. There are many known and unknown variations that we deem to belong to the same category, which is why we label data instead of writing fixed rules. If you are a client-facing consultant, you'll find that clients often don't know what they want until you show them your work. In such situations, it is important to encourage clients to clarify their needs early on.
point 1. no feature can perfectly distinguish classes.
Otherwise, that single feature could be used as a classifier on its own. For consistency, it is recommended that all features share the same scaling, for example values in the range [0, 1].
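As a small sketch of that kind of scaling (column-wise min-max normalization; applying it per column is an assumption, not necessarily how the project does it):

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each feature column into [0, 1]; constant columns become 0."""
    lo, hi = df.min(), df.max()
    span = (hi - lo).replace(0, 1)  # avoid division by zero on constant columns
    return (df - lo) / span
```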
point 2. machine learning approach is very explainable.
During feature selection, you can even discover labeling errors in the data by examining individual features, which is hardly possible with the deep learning approach. When the ML model works well, you know exactly why it works: you solved the problem with human intelligence, and the whole process is transparent.
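A small sketch of that kind of check, assuming a labeled feature table like the hypothetical one above: sort by one feature and inspect the tickers whose label disagrees with where they fall.

```python
import pandas as pd

# Hypothetical sketch: tickers labeled 0 that sit among the most "fallen"
# examples on a single feature are good candidates for a labeling review.
df = pd.read_csv("labeled_features.csv")  # file name from the earlier sketch
suspicious = df.sort_values("drawdown_from_peak", ascending=False).head(50)
print(suspicious[suspicious["label"] == 0][["ticker", "drawdown_from_peak"]])
```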
point 3. deep learning is powerful but hard to control.
Deep learning models have powerful representations: you can reach high accuracy quickly, without going through feature engineering. However, setting up and training neural networks is a heavy process, which makes it less flexible when you want to update or change something later. Training can be unstable and volatile, and model performance is sensitive to hyperparameters. Yet hyperparameter tuning doesn't give you much insight into the original problem.
