This repo holds the dataset and source code described in the paper below:
- Pedro H. Luz de Araujo, Teófilo E. de Campos, Marcelo M. Silva de Sousa.
  Inferring the source of official texts: can SVM beat ULMFiT?
  International Conference on the Computational Processing of Portuguese (PROPOR), Évora, Portugal, March 2-4, 2020.
We kindly request that you cite our paper in any publication resulting from the use of our code or dataset.
A snapshot of the code that was used to generate the results of the paper above is available from the static page of this project at https://cic.unb.br/~teodecampos/KnEDLe/propor2020.
The pre-trained language model used in this work was not originally released with its tokenizer model and vocabulary data, so our fine-tuned model and classifier could not leverage subword embeddings trained on general-domain Portuguese data. This has since been amended, and we re-ran all experiments using the pre-trained vocabulary data. This repo contains the updated ULMFiT training notebook and the updated results.
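For orientation, loading pre-trained weights together with their released vocabulary in fastai v1 looks roughly like the sketch below. This is a minimal illustration under assumed names, not the notebook's exact code: the file names (`pt_wt`, `pt_itos`), data paths, column names, and hyperparameters are placeholders, and the notebook itself works with the released subword vocabulary rather than fastai's default tokenization.

```python
# Illustrative sketch (not the notebook's exact code): fine-tuning a pre-trained
# language model with its released weights and vocabulary using fastai v1.
# File, path, and column names below are placeholders.
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

train_df = pd.read_csv('train.csv')  # placeholder path
valid_df = pd.read_csv('valid.csv')  # placeholder path

# Build a language-model data bunch from the text column.
data_lm = TextLMDataBunch.from_df('.', train_df, valid_df, text_cols='text')

# `pretrained_fnames` expects ./models/pt_wt.pth (weights) and
# ./models/pt_itos.pkl (vocabulary); fastai remaps the pre-trained
# embeddings onto the corpus vocabulary when loading.
learn = language_model_learner(
    data_lm, AWD_LSTM,
    pretrained=False,
    pretrained_fnames=['pt_wt', 'pt_itos'],  # placeholder file names
    drop_mult=0.3,
)

# Gradual fine-tuning: train the new head first, then unfreeze everything.
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(2, 1e-3)
learn.save_encoder('fine_tuned_enc')  # encoder later reused by the classifier
```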
- Download the pre-trained language model and place it in a model directory at the repository root
- Run `train_ulmfit.ipynb`
- Run `train_baseline.ipynb` (an illustrative sketch of such a baseline follows below)
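As a rough picture of the kind of bag-of-words SVM baseline compared against ULMFiT in the paper, a TF-IDF + linear SVM pipeline can be sketched as follows. This is a minimal illustration, not the notebook's exact pipeline; the CSV paths, column names, and hyperparameters are assumptions.

```python
# Minimal illustration of a TF-IDF + linear SVM text-classification baseline.
# Paths, column names, and hyperparameters are placeholders, not the values
# used in train_baseline.ipynb.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_df = pd.read_csv('train.csv')  # placeholder path
test_df = pd.read_csv('test.csv')    # placeholder path

pipeline = Pipeline([
    # Word-level TF-IDF features over the document text.
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    # Linear SVM predicting the source (label) of each text.
    ('svm', LinearSVC(C=1.0)),
])

pipeline.fit(train_df['text'], train_df['label'])
preds = pipeline.predict(test_df['text'])
print(classification_report(test_df['label'], preds))
```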