In public debate, including political debate, facts are often at the core of arguments for a given position. The misuse of facts and sources can therefore mislead the public. With the growth of the internet, and thus of the number of sources, it has become difficult for the average person to know which sources to trust. If public debate is to be grounded in facts, we need strong tools to help the public distinguish what is fact and what is not. Automated fact-checking has recently been proposed as such a tool: a system that uses machine learning to automatically determine whether a statement or article is factual. A subproblem of this is to use natural language processing (NLP) to detect where there are claims that can be fact-checked.
In this project we study the task of identifying fact-checkable claims, as done in the DR2 TV show Detektor, in collaboration with the Danish Broadcasting Corporation (DR). We use the DR2 programme "Debatten" as a case study to see whether we can build a data-driven system that detects claims from the show's subtitles. The project has two parts: data collection and analysis.
With help from DR, we will scrape subtitles from all available episodes of "Debatten". Employees from Detektor will then label all claims that are "interesting" in the context of fact-checking; the sentence "Copenhagen is the capital of Denmark" is a claim, for example, just not an interesting one in this context.
This will be an ongoing process throughout the project period.
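As a working assumption for the analysis, we expect the labelled data to end up as one record per paragraph with a binary label. The field names and file format below are a hypothetical sketch, not DR's actual export format.

```python
# Minimal sketch of an assumed labelled-data format (field and file
# names are hypothetical, not the actual subtitle/labelling export).
import json

def load_paragraphs(path):
    """Yield (text, label) pairs from a JSON-lines file where each line
    looks like {"paragraph": "...", "contains_claim": 0 or 1}."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["paragraph"], record["contains_claim"]

# Example usage (hypothetical file name):
# data = list(load_paragraphs("debatten_labelled.jsonl"))
```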
After data collection, the subtitles will be split into "paragraphs" (a few consecutive sentences, as defined in the subtitling software), and the machine learning task we pose is binary classification: does a paragraph contain a claim or not? We will investigate different ways of representing the text data (a baseline combining two of these representations is sketched after the list), e.g.,
- Bag-of-Words
- Character-based
- Part-of-speech tagging
- Word2Vec (or GloVe)
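As a concrete starting point, the snippet below combines two of these representations (bag-of-words and character n-grams) in a scikit-learn pipeline with a logistic-regression classifier. All parameter choices are illustrative rather than tuned, and this is only a baseline sketch, not the final system.

```python
# A minimal sketch of one data-driven baseline: bag-of-words and character
# n-gram features feeding a logistic-regression classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

claim_classifier = Pipeline([
    ("features", FeatureUnion([
        # Bag-of-words over word unigrams and bigrams
        ("bow", CountVectorizer(ngram_range=(1, 2))),
        # Character n-grams, more robust to subtitle typos and Danish morphology
        ("char", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# texts: list of paragraph strings, labels: 0/1 for "contains a claim"
# claim_classifier.fit(texts, labels)
# predictions = claim_classifier.predict(new_texts)
```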
One way to incorporate all of the above representations is illustrated in the figure below. We will compare our approach to a rule-based model.
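For comparison, here is a hedged sketch of the kind of rule-based baseline we have in mind: flag a paragraph if it contains a digit or one of a handful of quantitative cue words. The cue list is a hypothetical example, not Detektor's actual rules.

```python
# Sketch of a simple rule-based baseline: flag a paragraph as claim-bearing
# if it contains a number or a quantitative cue word. The cue-word list is
# an illustrative assumption.
import re

CUE_WORDS = {"flere", "færre", "stigning", "fald", "procent", "milliarder"}

def rule_based_claim(paragraph: str) -> bool:
    """Return True if simple surface cues suggest a checkable claim."""
    if re.search(r"\d", paragraph):  # any digit, e.g. "40 %" or "2019"
        return True
    tokens = paragraph.lower().split()
    return any(tok.strip(".,") in CUE_WORDS for tok in tokens)
```

With data, representations, and baselines in place, the central questions we want to answer are: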
- What representations (BoW, w2v, char,...) are relevant for detecting claims?
- Can a data-driven algorithm outperform a rule-based approach?
- How can we visualize what the model has learned?
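For the last question, a simple first step for a linear model is to inspect its largest feature weights. The sketch below assumes the pipeline from the earlier snippet (its "features" and "clf" step names) and is only one possible way to look at what the model has learned.

```python
# List the n features with the largest positive weights, assuming the
# fitted scikit-learn pipeline sketched above.
import numpy as np

def top_claim_features(pipeline, n=15):
    """Print the n features most indicative of "contains a claim"."""
    feature_names = pipeline.named_steps["features"].get_feature_names_out()
    weights = pipeline.named_steps["clf"].coef_[0]
    for idx in np.argsort(weights)[::-1][:n]:
        print(f"{feature_names[idx]:30s} {weights[idx]:+.3f}")

# top_claim_features(claim_classifier)  # after fitting
```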