Thesis Aleksandar
As the volume of published research in the biomedical domain increases, the need for effective information extraction systems grows in parallel. In this context, the task of named-entity recognition (NER) is essential. NER is the task of locating words or phrases in free text and classifying them into predefined categories such as genes, proteins or other entities.
As a specific application of NER, the main focus of this thesis is the recognition of mutation mentions in the biomedical literature. More specifically, we aim to create a model able to recognize mutation mentions expressed in natural language. The current state-of-the-art method, tmVar [1], recognizes only a small subset of mentions, namely those written in standard or semi-standard form. Our method outperforms tmVar on those types of mentions and also recognizes natural language (NL) mentions. No previous method considered NL mutation mentions.
The performance of NER machine learning models is intrinsically limited by the availability of high-quality annotated corpora. The construction of such corpora is costly, especially when expert annotators are required. In the biomedical domain, the difficulty of the task is even greater, since the number of possible named entities is higher and keeps growing with new discoveries.
To combat the lack of large annotated corpora, we exploit large volumes of unlabeled text by applying a semi-supervised learning approach. Using techniques for unsupervised feature learning, we aim to increase the performance of traditional NER models. More specifically, this thesis focuses on augmenting common conditional random field (CRF) approaches with novel word representation features learned from large bodies of biomedical text. Furthermore, using an active learning approach, we extend an existing corpus of mutation mentions (IDP4 [2]) with additional NL mentions. Finally, in support of evaluating our semi-supervised learning approach, we develop a complete pipeline for biomedical named-entity recognition that includes preprocessing steps, feature generation, model learning and normalized predictions. Our extended corpus, NER tool and pipeline framework are all open-sourced on GitHub (https://github.com/Rostlab/nala/).
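
To illustrate the idea of augmenting CRF features with word representations, the following is a minimal sketch and not the code used in the thesis. It builds one feature dictionary per token, the general input shape accepted by CRF toolkits such as python-crfsuite, and appends the dimensions of a pre-trained word vector to the usual surface features. The names `TOY_EMBEDDINGS`, `token_features` and `sentence_features`, the embedding values, and the choice of surface features are all assumptions made for this example.

```python
import re

# Hypothetical 2-dimensional word vectors. In practice these would be
# learned without supervision from a large body of unlabeled biomedical text.
TOY_EMBEDDINGS = {
    "substitution": [0.8, -0.1],
    "glycine": [0.2, 0.7],
    "alanine": [0.3, 0.6],
}


def word_shape(token):
    """Map digits to '0', uppercase letters to 'A', lowercase letters to 'a'."""
    shape = re.sub(r"\d", "0", token)
    shape = re.sub(r"[A-Z]", "A", shape)
    return re.sub(r"[a-z]", "a", shape)


def token_features(tokens, i, embeddings):
    """Build the feature dictionary for the token at position i."""
    token = tokens[i]
    features = {
        # Traditional surface features.
        "lower": token.lower(),
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        # Context features, with sentence-boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Word representation features: one real-valued feature per dimension.
    # Tokens without a known vector simply get no embedding features.
    vector = embeddings.get(token.lower())
    if vector is not None:
        for dim, value in enumerate(vector):
            features[f"emb_{dim}"] = value
    return features


def sentence_features(tokens, embeddings):
    """Return one feature dictionary per token of the sentence."""
    return [token_features(tokens, i, embeddings) for i in range(len(tokens))]


if __name__ == "__main__":
    sentence = ["The", "V600E", "substitution"]
    for token, feats in zip(sentence, sentence_features(sentence, TOY_EMBEDDINGS)):
        print(token, feats)
```

The resulting list of dictionaries, together with a parallel list of labels, could then be passed to a CRF trainer. Because the embedding features come from unlabeled text, they are available even for words that never appear in the annotated corpus.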
- Actual start: @March 26th, 2015
- Official start: @June 15th, 2015
- Official end: @December 15th, 2015
- Actual end: TODO