Newspapers we scraped (Israel-related keywords only):
cnn.com: 11272
WashingtonPost.com: 1887
Reuters.com: 4042
sfchronicle.com: 558
www.thetimes.com: 653
https://www.aljazeera.com/: 86341
https://www.cbc.ca/: 316
https://www.bbc.com: 710
https://www.nytimes.com: 1872
nypost.com: 4304
https://news.sky.com: 328
https://www.msnbc.com: 101
https://www.cnbc.com: 333
Newspapers in CC-NEWS that we can use:
bnnbreaking.com 216557
dailymail.co.uk 181288
timesofindia.indiatimes.com 97887
yahoo.com 94407
hindustantimes.com 72798
thehindu.com 56829
cbsnews.com 41332
forbes.com 38805
independent.co.uk 34090
ca.news.yahoo.com 26618
dailystar.co.uk 26009
nzherald.co.nz 24162
telegraph.co.uk 24014
foxnews.com 21362
bnnbloomberg.ca 20453
thedailystar.net 19059
globalnews.ca 15547
irishtimes.com 13155
jpost.com 13013
theglobeandmail.com 12834
irishmirror.ie 12217
pressandjournal.co.uk 11266
tribuneindia.com 9797
inews.co.uk 9782
indiatimes.com 9344
thestatesman.com 8780
nottinghampost.com 8630
india.com 8486
abcnews.go.com 8051
israelnationalnews.com 7387
bostonglobe.com 7353
toronto.citynews.ca 7124
timeslive.co.za 6700
pressherald.com 6651
cyprus-mail.com 6382
thestar.co.uk 6322
euronews.com 6159
aljazeera.com 5995
kyivindependent.com 5712
sputnikglobe.com 5676
dailysabah.com 5399
azernews.az 5385
chicago.suntimes.com 5315
heraldscotland.com 5243
nbcnews.com 5234
japannews.yomiuri.co.jp 4984
thefederal.com 4935
huffpost.com 4895
u.today 4778
edinburghlive.co.uk 4455
cnn.com 4306
northernirelandworld.com 4175
cbc.ca 4130
gazettelive.co.uk 4006
japantimes.co.jp 4004
newsday.com 3837
thesun.my 3582
thesouthafrican.com 3450
thedailymeal.com 3367
eurasiareview.com 3324
hurriyetdailynews.com 3297
jewishpress.com 3155
abcactionnews.com 3072
nysun.com 3059
time.com 3051
huffingtonpost.co.uk 3051
dw.com 3015
standard.net.au 2839
politico.eu 2755
dailypost.co.uk 2557
cambridge-news.co.uk 2545
northumberlandgazette.co.uk 2530
theweek.com 2291
npr.org 2279
cnbc.com 2218
kyivpost.com 2209
ntv.ca 2069
the-independent.com 1882
chicagotribune.com 1868
africanews.com 1792
palestinechronicle.com 1747
turan.az 1738
romania-insider.com 1707
independent.com 1552
sfstandard.com 1525
To Do:
- choose between word2vec and fasttext: decided on Word2Vec
- simplify the portrayal words, choose better portrayal words
- Arbitrary threshold: at least 4000 articles required to train a word embedding model on a newspaper
- All of the newspapers that we scraped (only for Israel articles) that pass the threshold are also either available on CC-NEWS or covered by a general scrape we did
- Combine the cnn Israel corpus with CC-NEWS cnn articles from 2022-2024 (be mindful of how many articles you will end up with)
- Do the same for aljazeera and cnbc
- We have an extra 18k nyt articles scraped by Davi; create a unified NYT dataset
- nypost, washingtonpost, sfchronicle, thetimes: I don't think we have enough articles for these, and I don't think they are in CC-NEWS; find a solution or skip
- we have enough articles for reuters, but they are only from the Israeli context; maybe find a general reuters corpus, or create one
- Make sure each individual newspaper corpus covers the same standard date range
(you cannot have 4000 articles from 2007 for cnn, and 4k articles from 2024 for nyt)
- Choose a total of 20 newspapers to train individual word embedding models on (choose some from CC-NEWS and some from the ones we scraped)
- Train a word embedding model on each of the 20 newspapers + get the word counts for the necessary words
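A rough sketch of how the corpus-building and training tasks above could fit together; it is not the project's actual pipeline. The article fields (`url`, `date`, `text`), the simple regex tokenizer, the choice to treat each article as one training "sentence", and all hyperparameters are assumptions made for illustration. Training uses gensim's `Word2Vec`, imported lazily so the helper functions run without it.

```python
from collections import Counter
from datetime import date
import re

MIN_ARTICLES = 4000  # the arbitrary threshold from the notes above


def merge_corpora(sources, start, end):
    """Merge several article lists into one corpus.

    Keeps only articles dated within [start, end] and drops repeated URLs,
    so a scraped corpus and a CC-NEWS corpus can be combined safely.
    """
    seen, merged = set(), []
    for articles in sources:
        for art in articles:
            if not (start <= art["date"] <= end):
                continue
            if art["url"] in seen:
                continue
            seen.add(art["url"])
            merged.append(art)
    return merged


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_counts(articles, words):
    """Count how often each word of interest occurs in the corpus."""
    counts = Counter()
    for art in articles:
        counts.update(tokenize(art["text"]))
    return {w: counts[w] for w in words}


def train_word2vec(articles, **overrides):
    """Train one Word2Vec model on one newspaper's corpus (needs gensim)."""
    from gensim.models import Word2Vec  # third-party dependency

    if len(articles) < MIN_ARTICLES:
        raise ValueError(f"only {len(articles)} articles, need {MIN_ARTICLES}")
    # Each article is treated as a single token sequence (an assumption).
    sentences = [tokenize(art["text"]) for art in articles]
    params = dict(vector_size=100, window=5, min_count=5, workers=4, seed=0)
    params.update(overrides)
    return Word2Vec(sentences=sentences, **params)
```

For example, `merge_corpora([scraped_cnn, ccnews_cnn], date(2022, 1, 1), date(2024, 12, 31))` would build a unified cnn corpus limited to 2022-2024, where `scraped_cnn` and `ccnews_cnn` are hypothetical lists of article dicts.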
(the following tasks can be done before or after the previous tasks)
- implement a word embedding model in PyTorch to enable GPU acceleration
- decide on how to do year-by-year word embedding analysis on CC-NEWS (i.e. the dataset is too big; are you going to divide/filter it, and if so how?) & train the models
- country by country
- check out doc2vec
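One possible answer to the "divide/filter" question for the year-by-year analysis is to bucket articles by year and cap each bucket with reservoir sampling, so every year gets a corpus of bounded size without loading everything at once. This is only a sketch of that option. The ISO `date` string field, the cap value, and the function name are assumptions, not something the notes specify.

```python
import random
from collections import defaultdict


def sample_by_year(articles, per_year_cap, seed=0):
    """Stream over articles and keep at most per_year_cap per year.

    Uses reservoir sampling (Algorithm R) per year, so each article of a
    given year has the same chance of ending up in that year's sample.
    Assumes art["date"] is an ISO string such as "2023-05-14".
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)   # year -> sampled articles
    seen = defaultdict(int)       # year -> number of articles seen so far
    for art in articles:
        year = art["date"][:4]
        seen[year] += 1
        bucket = buckets[year]
        if len(bucket) < per_year_cap:
            bucket.append(art)
        else:
            # Replace a random slot with probability per_year_cap / seen[year].
            j = rng.randrange(seen[year])
            if j < per_year_cap:
                bucket[j] = art
    return dict(buckets)
```

Each per-year bucket could then be used to train a separate model. The same grouping idea would apply to the country-by-country split if articles carried a country field.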
Choose 25 newspapers:
bnnbreaking.com 216557
dailymail.co.uk 181288
timesofindia.indiatimes.com 97887
yahoo.com 94407
hindustantimes.com 72798
thehindu.com 56829
cbsnews.com 41332
forbes.com 38805
independent.co.uk 34090
ca.news.yahoo.com 26618
dailystar.co.uk 26009
nzherald.co.nz 24162
telegraph.co.uk 24014
foxnews.com 21362
bnnbloomberg.ca 20453
thedailystar.net 19059
globalnews.ca 15547
irishtimes.com 13155
jpost.com 13013
theglobeandmail.com 12834
irishmirror.ie 12217
pressandjournal.co.uk 11266
tribuneindia.com 9797
inews.co.uk 9782
indiatimes.com 9344
thestatesman.com 8780
nottinghampost.com 8630
india.com 8486
abcnews.go.com 8051
israelnationalnews.com 7387
bostonglobe.com 7353
toronto.citynews.ca 7124
timeslive.co.za 6700
pressherald.com 6651
cyprus-mail.com 6382
thestar.co.uk 6322
euronews.com 6159
aljazeera.com 5995
kyivindependent.com 5712
sputnikglobe.com 5676
dailysabah.com 5399
azernews.az 5385
chicago.suntimes.com 5315
heraldscotland.com 5243
nbcnews.com 5234
japannews.yomiuri.co.jp 4984
thefederal.com 4935
huffpost.com 4895
edinburghlive.co.uk 4455
cnn.com 4306
northernirelandworld.com 4175
cbc.ca 4130
gazettelive.co.uk 4006
japantimes.co.jp 4004
newsday.com 3837
thesun.my 3582
thesouthafrican.com 3450
thedailymeal.com 3367
eurasiareview.com 3324
hurriyetdailynews.com 3297
jewishpress.com 3155
abcactionnews.com 3072
nysun.com 3059
time.com 3051
huffingtonpost.co.uk 3051
dw.com 3015
standard.net.au 2839
politico.eu 2755
npr.org 2279
cnbc.com 2218
kyivpost.com 2209
ntv.ca 2069
chicagotribune.com 1868
africanews.com 1792
palestinechronicle.com 1747
independent.com 1552
sfstandard.com 1525
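A small sketch for narrowing down the candidate list above. It parses `domain count` lines, keeps outlets at or above the 4000-article threshold from the to-do notes, and returns the largest ones. Ranking by article count is only one possible criterion, since the notes do not say how the final newspapers should be chosen. The function names and the default limit are assumptions.

```python
def parse_counts(text):
    """Parse lines of the form 'domain count' into a {domain: count} dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headings and anything that is not exactly 'domain count'.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def shortlist(counts, min_articles=4000, limit=25):
    """Return up to `limit` (domain, count) pairs with count >= min_articles,
    largest first."""
    eligible = [(d, n) for d, n in counts.items() if n >= min_articles]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:limit]


if __name__ == "__main__":
    sample = """Choose 25 newspapers:
jpost.com 13013
cnn.com 4306
newsday.com 3837
"""
    for domain, n in shortlist(parse_counts(sample)):
        print(domain, n)
```

In practice the selection would probably also need to balance countries and include the scraped outlets, which a size-only ranking ignores.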