Commit c3e61c4

Refactoring
1 parent 8505c8f commit c3e61c4

45 files changed (+44640, -138 lines changed)

.pre-commit-config.yaml

+2 -2
@@ -1,4 +1,4 @@
-exclude: ^(data/|resources/)
+exclude: ^(data/|resources/|configs/)
 default_stages: [ commit ]
 
 repos:
@@ -13,7 +13,7 @@ repos:
     rev: 22.8.0
     hooks:
       - id: black
-        exclude: ^(data/|resources/)
+        exclude: ^(data/|resources/|configs/)
         language_version: python3
 
   - repo: https://github.com/PyCQA/flake8
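
pre-commit treats `exclude` as a Python regular expression that is searched against each file path relative to the repository root. The following is a minimal sketch of what the widened pattern changes. The helper name `is_excluded` and the path `src/main.py` are illustrative assumptions; the two patterns and the config file name are taken from this commit.

```python
import re

# Patterns copied from the diff above.
OLD_EXCLUDE = re.compile(r"^(data/|resources/)")
NEW_EXCLUDE = re.compile(r"^(data/|resources/|configs/)")


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """Return True if a hook using `pattern` as `exclude` would skip `path`."""
    # re.search is used here because the pattern anchors itself with ^.
    return pattern.search(path) is not None


if __name__ == "__main__":
    for path in (
        "configs/metacurate_news_2022_1.json",
        "data/metacurate_news_2022.csv",
        "src/main.py",  # hypothetical path, for illustration only
    ):
        print(path, is_excluded(OLD_EXCLUDE, path), is_excluded(NEW_EXCLUDE, path))
```

With the new pattern, the JSON files added under `configs/` are skipped by the hooks, while files outside the three listed directories are still checked.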

README.md

+63 -14
@@ -1,23 +1,72 @@
-TODO
+# metacurate.io: Top _N_ AI/ML/data science news of 2022.
 
-* Viz: have plotly set width of image, but set height explicitly
-* Produce report/list of clusters w top n urls for Medium (markdown?)
-* pre-commit, black, linting
-* Integrate viz into main.
-* Make sure it works end-to-end.
-* Refactor code.
-* Add command line argument for selecting other config file.
+This repository contains the code required to generate...
 
-## Set up chart studio
+Live Plotly graph here...
 
-https://jennifer-banks8585.medium.com/how-to-embed-interactive-plotly-visualizations-on-medium-blogs-710209f93bd
+Link to list with top N news stories here...
 
+## TODO
+* Run final version of clustering, description, viz, report.
+* README
+* Medium/LinkedIn article:
+  * Top list
+  * Behind the scenes w code
 
+## Install
+This section contains instructions for how to install the code, resources, and dependencies
+needed to reproduce the clustering of the news headlines available in
+[metacurate_news_2022.csv](data/metacurate_news_2022.csv).
+
+### Requirements
+
+* git
+* Python (this repo was developed using Python 3.9)
+* pip
+* virtualenv
+* An API key from Cohere
+* Optional: Plotly Chart Studio credentials
+
+### Create and activate a virtual environment
+
+### Clone this repository
+
+### Install dependencies
+
+### Get and set up a Cohere API Key
+
+In order to use [Topically](link) to describe the clusters, you need to have an API key
+from Cohere. Get a free API account/key for Cohere here. Take note of the key, and set
+the environment variable `COHERE_API_KEY` like so:
+
+```bash
+export COHERE_API_KEY=<your_key>
 ```
-import chart_studio
 
-username = "<username>"
-api_key = "<api_key>"
 
-chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
+### Optional: Get and set up Plotly Chart Studio credentials
+In order to publish the generated Plotly plot to the web (Plotly Chart studio), you need to
+have an account and set up the credentials locally. Follow the instructions for getting an
+account
+[here](https://jennifer-banks8585.medium.com/how-to-embed-interactive-plotly-visualizations-on-medium-blogs-710209f93bd)
+and edit the file [set_up_plotly_credentials.py](src/set_up_plotly_credentials.py) to include
+your `username` and `api_key`.
+
+Run the file:
+
+```bash
+python chart_studio.py
 ```
+
+to generate and store the credentials. This only has to be done once.
+
+## Run
+
+To run the code, simply issue the following:
+
+````bash
+python main.py
+````
+
+NOTE that this is a long-running process: the vectorization step will take a long time (up to an
+hour) if you're running on a CPU, and the clustering takes quite some time too.
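
The new README asks the user to export `COHERE_API_KEY` before running the code. How the repository reads that variable is not shown in this view, so the following is only a sketch of one way to fail early with a clear message when the key is missing. The function name `get_cohere_api_key` is hypothetical.

```python
import os
from typing import Mapping


def get_cohere_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Cohere API key from the environment, or raise if it is unset.

    Hypothetical helper: the variable name COHERE_API_KEY comes from the README,
    everything else is an assumption made for illustration.
    """
    key = env.get("COHERE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "COHERE_API_KEY is not set; run `export COHERE_API_KEY=<your_key>` first."
        )
    return key
```

Checking the key before the vectorization step starts avoids discovering a missing credential only after the long-running stages have finished.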

config.json

-25
This file was deleted.

configs/metacurate_news_2022_1.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.75,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_1"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_1/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_1/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_1/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "leaf",
+    "min_cluster_size": 10,
+    "min_samples": 2,
+    "cluster_selection_epsilon":0.05,
+    "memory": "./data/transient/.cache"
+  }
+}
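
Each config file groups its settings into the sections `params`, `data`, `resources`, `vectorizer`, and `clusterer`. The code that consumes these files is not visible in this view, so the following is a hedged sketch of how such a file could be loaded and checked. The names `parse_config`, `load_config`, and `REQUIRED_SECTIONS` are hypothetical. The keys under `clusterer` match keyword arguments of `hdbscan.HDBSCAN`, and `model_name_or_path` matches the sentence-transformers constructor argument, but the use of those libraries here is an inference, not something the diff states.

```python
import json
from pathlib import Path

# Section names taken from the config files in this commit.
REQUIRED_SECTIONS = ("params", "data", "resources", "vectorizer", "clusterer")


def parse_config(text: str) -> dict:
    """Parse config JSON and verify that all expected sections are present."""
    config = json.loads(text)
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return config


def load_config(path: str) -> dict:
    """Read a config file such as configs/metacurate_news_2022_1.json."""
    return parse_config(Path(path).read_text(encoding="utf-8"))


# Possible usage (assumption: the clusterer section is passed through as kwargs):
#   config = load_config("configs/metacurate_news_2022_1.json")
#   clusterer_kwargs = config["clusterer"]   # e.g. hdbscan.HDBSCAN(**clusterer_kwargs)
```

Keeping the clusterer settings as a plain dictionary means a new experiment only needs a new JSON file, which is how the six configs in this commit differ from one another.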

configs/metacurate_news_2022_2.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.75,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_2"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_2/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_2/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_2/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "leaf",
+    "min_cluster_size": 20,
+    "min_samples": 2,
+    "cluster_selection_epsilon":0.05,
+    "memory": "./data/transient/.cache"
+  }
+}

configs/metacurate_news_2022_3.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.75,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_3"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_3/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_3/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_3/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "leaf",
+    "min_cluster_size": 15,
+    "min_samples": 2,
+    "cluster_selection_epsilon":0.05,
+    "memory": "./data/transient/.cache"
+  }
+}

configs/metacurate_news_2022_4.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.9,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_4"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_4/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_4/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_4/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "leaf",
+    "min_cluster_size": 50,
+    "min_samples": 25,
+    "cluster_selection_epsilon":0.05,
+    "memory": "./data/transient/.cache"
+  }
+}

configs/metacurate_news_2022_5.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.9,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_5"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_5/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_5/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_5/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "leaf",
+    "min_cluster_size": 3,
+    "min_samples": 1,
+    "cluster_selection_epsilon":0.2,
+    "memory": "./data/transient/.cache"
+  }
+}

configs/metacurate_news_2022_6.json

+35
@@ -0,0 +1,35 @@
+{
+  "params": {
+    "visualize_top_n": 50,
+    "report_top_n": 200,
+    "cluster_probability": 0.9,
+    "title": "Top AI/ML/data science and related news of 2022",
+    "publish_to_plotly": true,
+    "plotly_file_name": "metacurate_top_ai_ml_news_2022_6"
+  },
+  "data": {
+    "raw": "./data/metacurate_news_2022.csv",
+    "normalized": "./data/transient/normalized.csv",
+    "clustered": "./data/transient/clustered.csv",
+    "cluster_info": "./data/transient/cluster_info.csv",
+    "cluster_descriptions": "./data/transient/cluster_descriptions.csv",
+    "cluster_viz_data": "./data/output/2022_6/cluster_viz_data.csv",
+    "cluster_viz_html": "./data/output/2022_6/metacurate_news_viz_2022.html",
+    "cluster_report": "./data/output/2022_6/metacurate_news_report_2022.md",
+    "cache": "./data/transient/.cache"
+  },
+  "resources": {
+    "omit_strings": "./resources/omit_strings.csv"
+  },
+  "vectorizer": {
+    "model_name_or_path": "all-mpnet-base-v2"
+  },
+  "clusterer": {
+    "metric": "precomputed",
+    "cluster_selection_method": "eom",
+    "min_cluster_size": 3,
+    "min_samples": 1,
+    "cluster_selection_epsilon":0.2,
+    "memory": "./data/transient/.cache"
+  }
+}
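
The six config files are nearly identical, so the settings that actually vary between runs are easy to miss when reading them side by side. The following sketch flattens two configs and reports only the keys whose values differ. The function names `flatten` and `diff_configs` are hypothetical, and the two dictionaries are trimmed to the `clusterer` values that appear in `metacurate_news_2022_5.json` and `metacurate_news_2022_6.json`.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Turn nested sections into dotted keys, e.g. 'clusterer.min_samples'."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def diff_configs(a: dict, b: dict) -> dict:
    """Map each differing dotted key to its (value in a, value in b) pair."""
    flat_a, flat_b = flatten(a), flatten(b)
    keys = sorted(flat_a.keys() | flat_b.keys())
    return {k: (flat_a.get(k), flat_b.get(k)) for k in keys if flat_a.get(k) != flat_b.get(k)}


# Trimmed to the clusterer values shown in configs 5 and 6 of this commit.
config_5 = {"clusterer": {"cluster_selection_method": "leaf", "min_cluster_size": 3,
                          "min_samples": 1, "cluster_selection_epsilon": 0.2}}
config_6 = {"clusterer": {"cluster_selection_method": "eom", "min_cluster_size": 3,
                          "min_samples": 1, "cluster_selection_epsilon": 0.2}}

print(diff_configs(config_5, config_6))
# {'clusterer.cluster_selection_method': ('leaf', 'eom')}
```

Applied to the full files, the same comparison would also list `params.plotly_file_name` and the three `data` output paths, since each config writes to its own `data/output/2022_N/` directory.
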
File renamed without changes.
