GitHub repository: https://github.com/McGill-AI-Lab/news-bias-w2v
From https://huggingface.co/datasets/stanford-oval/ccnews, we downloaded all of the parquet files for 2024. In parquet2csv.ipynb:
- Listed the publishers across the collection of all parquet files using polars, along with how many times each publisher appears in the dataset.
- We filtered for publishers with more than 1500 articles.
- We then manually chose publishers we thought could be interesting to look at (e.g. publishers that are popular, based in conflict regions, or likely to be biased).
- We then created a dataframe with all of the articles from the chosen publishers and saved it as a CSV, newspaper2024.csv. In almost all of these steps, polars' lazy scanning (`pl.scan_parquet`) was immensely helpful in speeding up the process (see the sketch below).
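A minimal sketch of this pipeline, assuming the parquet files sit under a local `ccnews/2024/` directory and expose a `publisher` column (both assumptions); it ends at the per-article CSV described below, and the aggregation into newspaper2024.csv is sketched after the column list further down. The chosen publisher list here is illustrative, not the repo's actual selection:

```python
import polars as pl

# Lazily scan every 2024 parquet file; pl.scan_parquet builds a LazyFrame,
# so nothing is read into memory until .collect() is called.
lf = pl.scan_parquet("ccnews/2024/*.parquet")  # assumed local path

# Count how many articles each publisher has and keep those above 1500.
frequent = (
    lf.group_by("publisher")              # column name is an assumption
      .agg(pl.len().alias("n_articles"))
      .filter(pl.col("n_articles") > 1500)
      .collect()
)
print(frequent.sort("n_articles", descending=True))

# Hand-picked subset (illustrative names only).
chosen = ["abcactionnews.com", "some-other-outlet.com"]

# Materialize the chosen publishers' articles and write them out.
chosen_df = lf.filter(pl.col("publisher").is_in(chosen)).collect()
chosen_df.write_csv("2024.csv")  # the per-article CSV described below
```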
2024.csv includes all English articles from 2024 by the chosen newspaper outlets in the CC_NEWS dataset, with one row per article.
newspaper2024.csv collects all of the articles for each newspaper and stores them in a single row per newspaper. It has the following columns:

header = ["Publisher", "Year", "Political Alignment" *, "Articles", "Article Count", "Corpus" *, "Corpus Word Count" *, "Unique Word Count" *]

Note: * means that the column is not filled out yet.
Note: We should add a Country column and fill out the political alignment for each newspaper.
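A sketch of how the per-article CSV can be collapsed into this one-row-per-newspaper layout, assuming `publisher` and `plain_text` column names (both assumptions, not confirmed by the repo):

```python
import polars as pl

# Lazily scan the per-article CSV produced earlier.
articles = pl.scan_csv("2024.csv")

per_paper = (
    articles.group_by("publisher")               # column name assumed
    .agg(
        pl.col("plain_text").alias("Articles"),  # gather article bodies per paper
        pl.len().alias("Article Count"),
    )
    .with_columns(
        pl.lit(2024).alias("Year"),
        # CSV cells cannot hold list columns, so join the articles into one string.
        pl.col("Articles").list.join("\n\n"),
    )
    .rename({"publisher": "Publisher"})
    .collect()
)
per_paper.write_csv("newspaper2024.csv")
```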
Inside bias_analysis.ipynb, we preprocessed our articles for each newspaper and saved them as a JSON file with the following fields:
newspapers.json
{"Publisher": "abcactionnews.com", "Year": 2024, "Political Alignment": "(not filled yet)", "Articles": [], "Article Count": 0, "Corpus": "[[][]..]", "Corpus Word Count": 0, "Unique Word Count": 0}
This JSON is newline-delimited (NDJSON), so we don't have to load the whole file to parse through it; it can be streamed one publisher per line (see the sketch below). We will share the CSV and the JSON through Hugging Face, as each of them is around 10 GB. Email [email protected] if you would like access to them.
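A minimal sketch of streaming the NDJSON one record at a time:

```python
import json

# Because the file is newline-delimited, each line is a complete JSON
# record, so we can stream it instead of loading ~10 GB at once.
with open("newspapers.json", "r", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)
        print(paper["Publisher"], paper["Article Count"])
```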
Inside bias_analysis.ipynb, we iterate through each line (one publisher) of the JSON and train a word2vec model on its corpus. We then save the word2vec model and the word vectors to the /models/2024/ folder.
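A sketch of that loop using gensim, assuming "Corpus" deserializes to a list of tokenized sentences (an assumption about the preprocessing output); the hyperparameters are illustrative, not necessarily the notebook's:

```python
import json
from gensim.models import Word2Vec

with open("newspapers.json", "r", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)
        sentences = paper["Corpus"]  # assumed: list of lists of tokens

        # Illustrative hyperparameters; the notebook's settings may differ.
        model = Word2Vec(
            sentences=sentences,
            vector_size=100,
            window=5,
            min_count=5,
            workers=4,
        )

        # Save both the full model and the standalone word vectors.
        name = paper["Publisher"]
        model.save(f"models/2024/{name}.model")
        model.wv.save(f"models/2024/{name}.wordvectors")
```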
We calculate how each geopolitical entity is portrayed in a corpus as:

portrayal_word = cosine_similarity(good, geopolitical_entity) - cosine_similarity(bad, geopolitical_entity)

We then save the portrayal scores for each geopolitical entity and each newspaper into a JSON file.
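With gensim's KeyedVectors the score is a one-liner; the seed words and the entity token here are assumptions about the models' vocabulary:

```python
from gensim.models import KeyedVectors

# Load one newspaper's saved vectors (path follows the layout above).
wv = KeyedVectors.load("models/2024/abcactionnews.com.wordvectors")

def portrayal(entity: str, good: str = "good", bad: str = "bad") -> float:
    # Positive scores mean the entity sits closer to "good" than to "bad"
    # in the embedding space; KeyedVectors.similarity is cosine similarity.
    return float(wv.similarity(good, entity) - wv.similarity(bad, entity))

print(portrayal("israel"))  # assumed lowercase token in the vocabulary
```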
Here are the visualizations of how each newspaper portrays the Israel-Palestine and Russia-Ukraine conflicts:
Contributors:
- Emir Sahin: CC_NEWS + the research
- Jacob Leader: Repo maintenance + scraping
- Oscar, Jacob S, Dory: Scraping