Commit 8034ba4

Authored Nov 5, 2024
Add Hunting Anomalies in the Stock Market scripts (#780)
* Add Hunting Anomalies in the Stock Market scripts
* Fix lint
* Ignore type and fix typo
* Fix type def from linter
* Removed json dump

1 parent: 53808f8

File tree: 5 files changed, +506 −0 lines changed

 
`README.md` (new file, +49 lines)
# Hunting Anomalies in the Stock Market

This repository contains all the necessary scripts and data directories used in the [Hunting Anomalies in the Stock Market](https://polygon.io/blog/hunting-anomalies-in-stock-market/) tutorial, hosted on Polygon.io's blog. The tutorial demonstrates how to detect statistical anomalies in historical US stock market data through a comprehensive workflow that involves downloading data, building a lookup table, querying for anomalies, and visualizing them through a web interface.

### Prerequisites

- Python 3.8+
- Access to Polygon.io's historical data via Flat Files
- An active Polygon.io API key, obtainable by signing up for a Stocks paid plan
### Repository Contents

- `README.md`: This file, outlining setup and execution instructions.
- `aggregates_day/`: Directory where downloaded CSV data files are stored.
- `build-lookup-table.py`: Python script to build a lookup table from the historical data.
- `query-lookup-table.py`: Python script to query the lookup table for anomalies.
- `gui-lookup-table.py`: Python script for a browser-based interface to explore anomalies visually.
### Running the Tutorial

1. **Ensure Python 3.8+ is installed:** Check your Python version and ensure the required third-party libraries (`polygon-api-client` and `pandas`) are installed; `pickle` and `argparse` ship with Python's standard library and need no installation.
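For example (a minimal check, assuming `pip` targets your Python 3.8+ interpreter):
```bash
python3 --version   # should report 3.8 or newer
pip install polygon-api-client pandas
```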
2. **Set up your API key:** Make sure you have an active paid Polygon.io Stocks subscription for accessing Flat Files. Set up your API key in your environment or directly in the scripts where required.
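For example, to provide the key via the environment (`gui-lookup-table.py` constructs `RESTClient()` with no explicit key, so it reads `POLYGON_API_KEY`):
```bash
export POLYGON_API_KEY=YOUR_API_KEY   # replace with your actual key
```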
3. **Download Historical Data:** Use the MinIO client (`mc`) to download historical stock market data. Adjust the commands and paths based on the data you are interested in.
```bash
mc alias set s3polygon https://files.polygon.io YOUR_ACCESS_KEY YOUR_SECRET_KEY
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/08/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/09/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/10/ ./aggregates_day/
gunzip ./aggregates_day/*.gz
```
4. **Build the Lookup Table:** This script processes the downloaded data and builds a lookup table, saving it as `lookup_table.pkl`.
```bash
python build-lookup-table.py
```
5. **Query Anomalies:** Replace `2024-10-18` with the date you want to analyze for anomalies.
```bash
python query-lookup-table.py 2024-10-18
```
6. **Run the GUI:** Start the web server, then open `http://localhost:8888` in your browser to explore the anomalies visually.
```bash
python gui-lookup-table.py
```
For a complete step-by-step guide on each phase of the anomaly detection process, including additional configurations and troubleshooting, refer to the detailed [tutorial on our blog](https://polygon.io/blog/hunting-anomalies-in-stock-market/).
`aggregates_day/` (new file, +1 line)

Download flat files into here.
`build-lookup-table.py` (new file, +91 lines)

```python
import os
import pickle
from collections import defaultdict
from typing import Any, BinaryIO, DefaultDict, Dict

import pandas as pd  # type: ignore

# Directory containing the daily CSV files
data_dir = "./aggregates_day/"

# Initialize a dictionary to hold trades data
trades_data = defaultdict(list)

# List all CSV files in the directory
files = sorted([f for f in os.listdir(data_dir) if f.endswith(".csv")])

print("Starting to process files...")

# Process each file (assuming files are named in order)
for file in files:
    print(f"Processing {file}")
    file_path = os.path.join(data_dir, file)
    df = pd.read_csv(file_path)
    # For each stock, store the date and relevant data
    for _, row in df.iterrows():
        ticker = row["ticker"]
        date = pd.to_datetime(row["window_start"], unit="ns").date()
        trades = row["transactions"]
        close_price = row["close"]  # Ensure 'close' column exists in your CSV
        trades_data[ticker].append(
            {"date": date, "trades": trades, "close_price": close_price}
        )

print("Finished processing files.")
print("Building lookup table...")

# Now, build the lookup table with rolling averages and percentage price change
lookup_table: DefaultDict[str, Dict[str, Any]] = defaultdict(
    dict
)  # Nested dict: ticker -> date -> stats

for ticker, records in trades_data.items():
    # Convert records to DataFrame
    df_ticker = pd.DataFrame(records)
    # Sort records by date
    df_ticker.sort_values("date", inplace=True)
    df_ticker.set_index("date", inplace=True)

    # Calculate the percentage change in close_price
    df_ticker["price_diff"] = (
        df_ticker["close_price"].pct_change() * 100
    )  # Multiply by 100 for percentage

    # Shift trades to exclude the current day from rolling calculations
    df_ticker["trades_shifted"] = df_ticker["trades"].shift(1)
    # Calculate rolling average and standard deviation over the previous 5 days
    df_ticker["avg_trades"] = df_ticker["trades_shifted"].rolling(window=5).mean()
    df_ticker["std_trades"] = df_ticker["trades_shifted"].rolling(window=5).std()
    # Store the data in the lookup table
    for date, row in df_ticker.iterrows():
        # Use an ISO-format date string ("YYYY-MM-DD") as the key
        date_str = date.strftime("%Y-%m-%d")
        # Ensure rolling stats are available
        if pd.notnull(row["avg_trades"]) and pd.notnull(row["std_trades"]):
            lookup_table[ticker][date_str] = {
                "trades": row["trades"],
                "close_price": row["close_price"],
                "price_diff": row["price_diff"],
                "avg_trades": row["avg_trades"],
                "std_trades": row["std_trades"],
            }
        else:
            # Store data without rolling stats if not enough data points
            lookup_table[ticker][date_str] = {
                "trades": row["trades"],
                "close_price": row["close_price"],
                "price_diff": row["price_diff"],
                "avg_trades": None,
                "std_trades": None,
            }

print("Lookup table built successfully.")

# Convert defaultdict to a regular dict before pickling
lookup_table_dict = dict(lookup_table)

# Save the lookup table to a file for later use
with open("lookup_table.pkl", "wb") as f:  # type: BinaryIO
    pickle.dump(lookup_table_dict, f)

print("Lookup table saved to 'lookup_table.pkl'.")
```
`gui-lookup-table.py` (new file, +302 lines)

```python
import http.server
import json
import pickle
import socketserver
from urllib.parse import parse_qs, urlparse

from polygon import RESTClient
from polygon.rest.models import Agg

PORT = 8888

# Load the lookup_table
with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)


class Handler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # Parse the path and query parameters
        parsed_path = urlparse(self.path)
        path = parsed_path.path
        query_params = parse_qs(parsed_path.query)

        if path == "/":
            # Handle the root path
            # Get the date parameter if provided
            date_param = query_params.get("date", [None])[0]

            # Get all dates from the lookup table
            all_dates = set()
            for ticker_data in lookup_table.values():
                all_dates.update(ticker_data.keys())
            all_dates = sorted(all_dates)

            # If date is None, get the latest date from the lookup table
            if date_param is None:
                if all_dates:
                    latest_date = max(all_dates)
                else:
                    self.send_response(200)
                    self.send_header("Content-type", "text/html")
                    self.end_headers()
                    html_content = (
                        "<html><body><h1>No data available.</h1></body></html>"
                    )
                    self.wfile.write(html_content.encode())
                    return
            else:
                latest_date = date_param

            # Ensure latest_date is in all_dates
            if latest_date not in all_dates:
                # Handle the case where the provided date is invalid
                self.send_response(400)
                self.send_header("Content-type", "text/html")
                self.end_headers()
                error_html = f"<html><body><h1>Error: No data available for date {latest_date}</h1></body></html>"
                self.wfile.write(error_html.encode())
                return

            # Now, get the anomalies for the latest_date
            anomalies = []
            for ticker, date_data in lookup_table.items():
                if latest_date in date_data:
                    data = date_data[latest_date]
                    trades = data["trades"]
                    avg_trades = data["avg_trades"]
                    std_trades = data["std_trades"]
                    if (
                        avg_trades is not None
                        and std_trades is not None
                        and std_trades > 0
                    ):
                        z_score = (trades - avg_trades) / std_trades
                        threshold_multiplier = 3  # Adjust as needed
                        if z_score > threshold_multiplier:
                            anomalies.append(
                                {
                                    "ticker": ticker,
                                    "date": latest_date,
                                    "trades": trades,
                                    "avg_trades": avg_trades,
                                    "std_trades": std_trades,
                                    "z_score": z_score,
                                    "close_price": data["close_price"],
                                    "price_diff": data["price_diff"],
                                }
                            )
            # Sort anomalies by trades in descending order
            anomalies.sort(key=lambda x: x["trades"], reverse=True)
            # Generate the HTML to display the anomalies
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.end_headers()
            # Build the HTML content (Bootstrap for styling, Tablesort for sortable columns)
            html_content = '<html><head><title>Anomalies for {}</title><link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-QWTKZyjpPEjISv5WaRU9OFeRpok6YctnYmDr5pNlyT2bRjXh0JMhjY6hW+ALEwIH" crossorigin="anonymous"><script src="https://cdnjs.cloudflare.com/ajax/libs/tablesort/5.2.1/tablesort.min.js" integrity="sha512-F/gIMdDfda6OD2rnzt/Iyp2V9JLHlFQ+EUyixDg9+rkwjqgW1snpkpx7FD5FV1+gG2fmFj7I3r6ReQDUidHelA==" crossorigin="anonymous" referrerpolicy="no-referrer"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/tablesort/5.2.1/sorts/tablesort.number.min.js" integrity="sha512-dRD755QRxlybm0h3LXXIGrFcjNakuxW3reZqnPtUkMv6YsSWoJf+slPjY5v4lZvx2ss+wBZQFegepmA7a2W9eA==" crossorigin="anonymous" referrerpolicy="no-referrer"></script></head><body>'.format(
                latest_date
            )
            html_content += '<div id="container" style="padding:4px;"><h1>Anomalies for {}</h1>'.format(
                latest_date
            )
            # Add navigation links (prev and next dates)
            current_index = all_dates.index(latest_date)
            prev_date = all_dates[current_index - 1] if current_index > 0 else None
            next_date = (
                all_dates[current_index + 1]
                if current_index < len(all_dates) - 1
                else None
            )
            html_content += "<p>"
            if prev_date:
                html_content += '<a href="/?date={}">Previous Date</a> '.format(
                    prev_date
                )
            if next_date:
                html_content += '<a href="/?date={}">Next Date</a> '.format(next_date)
            html_content += "</p>"
            # Display the anomalies in a table
            html_content += (
                '<table id="anomalies" class="table table-striped table-hover">'
            )
            html_content += "<thead><tr>"
            html_content += "<th>Ticker</th>"
            html_content += "<th>Trades</th>"
            html_content += "<th>Avg Trades</th>"
            html_content += "<th>Std Dev</th>"
            html_content += "<th>Z-score</th>"
            html_content += "<th>Close Price</th>"
            html_content += "<th>Price Diff</th>"
            html_content += "<th>Chart</th>"
            html_content += "</tr></thead><tbody>"
            for anomaly in anomalies:
                html_content += "<tr>"
                html_content += "<td>{}</td>".format(anomaly["ticker"])
                html_content += "<td>{}</td>".format(anomaly["trades"])
                html_content += "<td>{:.2f}</td>".format(anomaly["avg_trades"])
                html_content += "<td>{:.2f}</td>".format(anomaly["std_trades"])
                html_content += "<td>{:.2f}</td>".format(anomaly["z_score"])
                html_content += "<td>{:.2f}</td>".format(anomaly["close_price"])
                html_content += "<td>{:.2f}</td>".format(anomaly["price_diff"])
                # Add a link to the chart
                html_content += (
                    '<td><a href="/chart?ticker={}&date={}">View Chart</a></td>'.format(
                        anomaly["ticker"], latest_date
                    )
                )
                html_content += "</tr>"
            html_content += '</tbody></table><script>new Tablesort(document.getElementById("anomalies"));</script>'
            html_content += "</div></body></html>"
            self.wfile.write(html_content.encode())
        elif path == "/chart":
            # Handle the chart page
            # Get 'ticker' and 'date' from query parameters
            ticker = query_params.get("ticker", [None])[0]
            date = query_params.get("date", [None])[0]
            if ticker is None or date is None:
                # Return an error page
                self.send_response(400)
                self.send_header("Content-type", "text/html")
                self.end_headers()
                error_html = "<html><body><h1>Error: Missing ticker or date parameter</h1></body></html>"
                self.wfile.write(error_html.encode())
            else:
                # Fetch minute aggregates for the ticker and date
                client = RESTClient(
                    trace=True
                )  # POLYGON_API_KEY environment variable is used
                try:
                    aggs = []
                    date_from = date
                    date_to = date
                    for a in client.list_aggs(
                        ticker,
                        1,
                        "minute",
                        date_from,
                        date_to,
                        limit=50000,
                    ):
                        aggs.append(a)
                    # Prepare data for the chart
                    data = []
                    for agg in aggs:
                        if isinstance(agg, Agg) and isinstance(agg.timestamp, int):
                            new_record = [
                                agg.timestamp,
                                agg.open,
                                agg.high,
                                agg.low,
                                agg.close,
                            ]
                            data.append(new_record)
                    # Generate the HTML for the chart page
                    chart_html = """
                    <!DOCTYPE HTML>
                    <html>
                    <head>
                    <style>
                    #container {
                        height: 750px;
                        min-width: 310px;
                    }
                    </style>
                    <script src="https://code.highcharts.com/stock/highstock.js"></script>
                    <script src="https://code.highcharts.com/stock/modules/data.js"></script>
                    <script src="https://code.highcharts.com/stock/modules/exporting.js"></script>
                    <script src="https://code.highcharts.com/stock/modules/accessibility.js"></script>
                    <script src="https://code.highcharts.com/moment/moment.js"></script>
                    <script src="https://code.highcharts.com/moment-timezone/moment-timezone.js"></script>
                    </head>
                    <body>
                    <div id="container">
                    <script type="text/javascript">
                    Highcharts.setOptions({
                        global: {
                            timezone: 'America/New_York'
                        }
                    });
                    var data = %s;
                    Highcharts.stockChart('container', {
                        exporting: {
                            url: 'http://localhost:7801', // Set your local server as the exporting server
                            enabled: true // Make sure exporting is enabled
                        },
                        rangeSelector: {
                            enabled: false,
                            selected: 1
                        },
                        navigator: {
                            //enabled: false
                        },
                        scrollbar: {
                            //enabled: false
                        },
                        xAxis: {
                            labels: {
                                //enabled: true // This hides the time labels under the chart
                            }
                        },
                        title: {
                            text: '%s Price Data on %s'
                        },
                        series: [{
                            type: 'candlestick',
                            name: '%s',
                            data: data,
                            color: 'red', // Color for downward movement
                            lineColor: 'red', // Line color for downward movement
                            upColor: 'green', // Color for upward movement
                            upLineColor: 'green', // Line color for upward movement
                            dataGrouping: {
                                units: [[
                                    'minute',
                                    [1]
                                ]]
                            }
                        }]
                    });
                    </script>
                    </div>
                    </body>
                    </html>
                    """ % (
                        json.dumps(data),
                        ticker,
                        date,
                        ticker,
                    )
                    self.send_response(200)
                    self.send_header("Content-type", "text/html")
                    self.send_header("Access-Control-Allow-Origin", "*")
                    self.end_headers()
                    self.wfile.write(chart_html.encode())
                except Exception as e:
                    # Handle exceptions
                    self.send_response(500)
                    self.send_header("Content-type", "text/html")
                    self.end_headers()
                    error_html = "<html><body><h1>Error fetching data: {}</h1></body></html>".format(
                        str(e)
                    )
                    self.wfile.write(error_html.encode())
        else:
            # Serve files from the current directory
            super().do_GET()


def run_server():
    with socketserver.TCPServer(("", PORT), Handler) as httpd:
        print("serving at port", PORT)
        try:
            httpd.serve_forever()
        except KeyboardInterrupt:
            print("\nExiting gracefully...")
            httpd.shutdown()
            httpd.server_close()


if __name__ == "__main__":
    run_server()
```
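With the server running (`python gui-lookup-table.py`), the two routes implemented above can be exercised directly from another shell; the ticker and date values here are illustrative:
```bash
# Anomaly table for a specific date (defaults to the latest date if omitted)
curl "http://localhost:8888/?date=2024-10-18"
# Minute-bar candlestick chart page for one ticker on that date
curl "http://localhost:8888/chart?ticker=AAPL&date=2024-10-18"
```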
`query-lookup-table.py` (new file, +63 lines)

```python
import pickle
import argparse

# Parse command-line arguments
parser = argparse.ArgumentParser(description="Anomaly Detection Script")
parser.add_argument("date", type=str, help="Target date in YYYY-MM-DD format")
args = parser.parse_args()

# Load the lookup_table
with open("lookup_table.pkl", "rb") as f:
    lookup_table = pickle.load(f)

# Threshold for considering an anomaly (e.g., 3 standard deviations)
threshold_multiplier = 3

# Date for which we want to find anomalies
target_date_str = args.date

# List to store anomalies
anomalies = []

# Iterate over all tickers in the lookup table
for ticker, date_data in lookup_table.items():
    if target_date_str in date_data:
        data = date_data[target_date_str]
        trades = data["trades"]
        avg_trades = data["avg_trades"]
        std_trades = data["std_trades"]
        if avg_trades is not None and std_trades is not None and std_trades > 0:
            z_score = (trades - avg_trades) / std_trades
            if z_score > threshold_multiplier:
                anomalies.append(
                    {
                        "ticker": ticker,
                        "date": target_date_str,
                        "trades": trades,
                        "avg_trades": avg_trades,
                        "std_trades": std_trades,
                        "z_score": z_score,
                        "close_price": data["close_price"],
                        "price_diff": data["price_diff"],
                    }
                )

# Sort anomalies by trades in descending order
anomalies.sort(key=lambda x: x["trades"], reverse=True)

# Print the anomalies with aligned columns
print(f"\nAnomalies Found for {target_date_str}:\n")
print(
    f"{'Ticker':<10}{'Trades':>10}{'Avg Trades':>15}{'Std Dev':>10}{'Z-score':>10}{'Close Price':>12}{'Price Diff':>12}"
)
print("-" * 79)  # matches the 79-character header row above
for anomaly in anomalies:
    print(
        f"{anomaly['ticker']:<10}"
        f"{anomaly['trades']:>10.0f}"
        f"{anomaly['avg_trades']:>15.2f}"
        f"{anomaly['std_trades']:>10.2f}"
        f"{anomaly['z_score']:>10.2f}"
        f"{anomaly['close_price']:>12.2f}"
        f"{anomaly['price_diff']:>12.2f}"
    )
```
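For intuition about the z-score test used in both the query script and the GUI, a worked example with invented numbers:
```python
# Invented numbers for illustration: a ticker averaged 1,000 trades/day over
# the prior five days (std dev 100) and suddenly prints 1,500 trades.
trades, avg_trades, std_trades = 1500, 1000.0, 100.0
z_score = (trades - avg_trades) / std_trades  # (1500 - 1000) / 100 = 5.0
print(z_score > 3)  # True -> flagged at the default threshold_multiplier of 3
```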
