Thesis Aleksandar

Semi-supervised learning for biomedical named-entity recognition

Abstract

As the volume of published research in the biomedical domain increases, the need for effective information extraction systems grows in parallel. In this context, the task of named-entity recognition (NER) is essential. NER is defined as the classification of words in free text that represent predefined categories such as genes, proteins or other entities.

As a specific application of NER, the main focus of this thesis is the recognition of mutation mentions from the biomedical literature. More specifically we aim to create a model able to recognize mutation mentions expressed in natural language. The current state-of-the-art method, tmVar [1] is only able to recognize a small subset of standard or semi-standard mentions. Our method both outperforms tmVar on those types of mentions and is also able to recognize natural language (NL) mentions. Previously no other method considered NL mutation mentions.

The performance of NER machine learning models is intrinsically limited by the availability of high-quality-annotated corpora. The construction of such corpora is costly – specially when expert annotators are required. In the biomedical domain, the difficulty of the task is even greater, since the number of possible named entities is higher and keeps growing with new discoveries.

To combat the lack of large annotated corpora, we turn to the exploitation of large volumes of unlabeled text, applying a semi-supervised learning approach. Using techniques for unsupervised feature learning we aim to increase the performance of traditional NER models. More specifically, this thesis focuses on augmenting common conditional random field (CRF) approaches combined with novel word representation features learned from large bodies of biomedical text. Furthermore, using an active learning approach we extend an existing corpus of mutation mentions (IDP4 [2]) with additional NL mentions. Finally, and in support of evaluating our semi-supervised learning approach, we develop a complete pipeline for biomedical named-entity recognition including preprocessing steps, feature generation, model learning and normalized predictions. Our extended corpus, NER tool and pipeline framework are all open sourced on GitHub (https://github.com/Rostlab/nala/).

Deliverables

Timeframe

Actual start: @March 26th, 2015
Official start: @June 15th, 2015
Official end: @December 15th, 2015
Actual end: TODO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thesis Aleksandar

Semi-supervised learning for biomedical named-entity recognition

Abstract

Deliverables

Timeframe

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally