
Inferring the source of official texts: can SVM beat ULMFiT?

This repo holds the dataset and source code described in the paper named above.

We kindly request that users cite our paper in any publication that results from the use of our code or our dataset.

A snapshot of the code used to generate the results in the paper is available on the project's static page at https://cic.unb.br/~teodecampos/KnEDLe/propor2020.

Update (27/05/20)

The pre-trained language model used in this work was not originally released together with its tokenizer model and vocabulary data, so our fine-tuned model and classifier could not leverage the subword embeddings trained on general-domain Portuguese data. This has since been amended: we re-ran all experiments using the pre-trained vocabulary data. This repo contains the updated ULMFiT training notebook and the updated results.
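For illustration, the snippet below is a minimal sketch (assuming fastai v1) of how a locally stored language model checkpoint and its vocabulary file can be passed to the learner so that fine-tuning reuses the pre-trained embeddings. The file names, dataset path, and data-loading details are placeholders rather than the project's actual ones, and the sketch omits the released subword tokenizer used in the notebooks.

```python
# Hedged sketch (fastai v1): load a pre-trained LM checkpoint plus its vocab
# from the model/ directory at the repository root. File names are placeholders.
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

df = pd.read_csv("data/texts.csv")  # hypothetical dataframe with a "text" column
data_lm = TextLMDataBunch.from_df(".", train_df=df, valid_df=df, text_cols="text")

# Passing the vocabulary (itos) file together with the weights is what lets the
# fine-tuned model reuse embeddings learned on general-domain Portuguese data.
learn = language_model_learner(
    data_lm, AWD_LSTM,
    pretrained=False,                                   # skip the default English weights
    pretrained_fnames=["pt_lm_weights", "pt_lm_itos"],  # placeholder names in model/
    model_dir="model",
)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the language model on the target corpus
```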

Requirements

Reproducing results

  • Download the pre-trained language model and place it in a model directory at the repository root
  • Run train_ulmfit.ipynb
  • Run train_baseline.ipynb (a scripted sketch of these steps follows this list)
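The script below is a minimal sketch of the steps above, assuming the pre-trained language model has already been downloaded; the empty-directory check and the use of nbconvert are assumptions, not part of the original instructions.

```python
# Hedged sketch: create the model directory, verify it is not empty, and execute
# both notebooks in place with nbconvert.
import subprocess
from pathlib import Path

model_dir = Path("model")
model_dir.mkdir(exist_ok=True)  # step 1: model directory at the repository root
assert any(model_dir.iterdir()), "Place the pre-trained language model files in model/ first."

# Steps 2 and 3: run the notebooks.
for nb in ["train_ulmfit.ipynb", "train_baseline.ipynb"]:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```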
