Is your trained English model available? #17

colingoldberg · 2019-04-05T13:30:02Z

Hi,

I was wondering if the English trained model behind your demo is available for others to use. I hope this is the case.

Colin Goldberg

svirpioj · 2019-07-31T13:57:08Z

Sorry, the demo models are not currently available for download. We'll look into it, but there might be some compatibility issues with the current version.

However, most of the models can be easily retrained with the Morpho Challenge data sets. For example, the unsupervised English model should be nearly identical to the output of these commands:

wget http://morpho.aalto.fi/events/morphochallenge2009/data/wordlist.eng.gz
morfessor-train -s unsup_model.bin --traindata-list wordlist.eng.gz

And the English semi-supervised model (based on the parameters shown on the demo page):

wget http://morpho.aalto.fi/events/morphochallenge2010/data/goldstd_trainset.segmentation.eng
morfessor-train -s semisup_model.bin --traindata-list wordlist.eng.gz -A goldstd_trainset.segmentation.eng -w 0.83 -W 361.32

anttttti · 2019-08-02T06:07:29Z

Could you provide a developer-friendly interface and trained models built from an open data source such as Wikipedia dumps? There's a use case for off-the-shelf decompounding and morphological splitting tools, but Morfessor doesn't ship trained models, so it's not convenient enough for developers to try. Right now, even if you know how to use Morfessor, there often isn't time to train and tune models for a project where it could be useful.

Ideally, splitting with Morfessor would be as easy as this:

import morfessor
morfessor_model = morfessor.read_model("finnish_model.pkl")
morfessor_model.split("Lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas")

Better yet, follow the scikit-learn API for the model, so that it is accessed through .fit() and .transform() methods. That would make it accessible to a wider community.

pabs3 · 2021-06-01T02:50:38Z

I would suggest treating model files the way you would treat compiled executables. Store the open-source-licensed source data for each model in its own GitHub repository (possibly using git-lfs to reduce disk usage for updates), add a Makefile or similar to train the model automatically, and attach the model binaries to each source data release. If multiple models share source data, you could create one GitHub repository containing all of it.
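As one possible shape for the "Makefile or similar" step, here is a small Python sketch that reuses the download and training commands given earlier in this thread. The function names `build_steps` and `run_steps` are made up for illustration, and the script defaults to a dry run that only prints the commands. Actually running them assumes `wget` and `morfessor-train` are installed.

```python
import subprocess

# URLs and flags are copied from the commands posted earlier in this thread.
DATA_URL = "http://morpho.aalto.fi/events/morphochallenge2009/data/wordlist.eng.gz"
ANNOT_URL = (
    "http://morpho.aalto.fi/events/morphochallenge2010/data/"
    "goldstd_trainset.segmentation.eng"
)


def build_steps():
    """Return the download and training commands as argv lists."""
    wordlist = DATA_URL.rsplit("/", 1)[-1]
    annotations = ANNOT_URL.rsplit("/", 1)[-1]
    return [
        ["wget", DATA_URL],
        ["morfessor-train", "-s", "unsup_model.bin", "--traindata-list", wordlist],
        ["wget", ANNOT_URL],
        ["morfessor-train", "-s", "semisup_model.bin", "--traindata-list", wordlist,
         "-A", annotations, "-w", "0.83", "-W", "361.32"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run_steps(build_steps())  # dry run: prints the commands without executing them
```

Calling `run_steps(build_steps(), dry_run=False)` would execute the steps, and the resulting `.bin` files could then be attached to a release.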

svirpioj added the question label Jul 31, 2019
