Hello, I need to reproduce the results on a subset of your dataset, and I ran into several problems: the process being killed during parsing, an ASCII error in create_*_1.py, and a KeyError in create_*_2.py. Some of them are the same as those reported by @wt123u in another issue.
I deleted the & before line 40 in create_*.sh to solve the killed-process problem.
I added sys.setdefaultencoding('utf-8') to solve the ASCII error.
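For reference, a minimal sketch of the encoding change, assuming the scripts run under Python 2. The helper name force_utf8_default is hypothetical and not from the repository. On Python 2, sys.setdefaultencoding is removed at startup, so sys has to be reloaded before it can be called. On Python 3 the default is already UTF-8 and the helper does nothing.

```python
import sys


def force_utf8_default():
    """Set the default string encoding to UTF-8 on Python 2; do nothing on Python 3.

    Hypothetical helper for illustration only. Returns True if the default
    encoding was changed, False otherwise.
    """
    if sys.version_info[0] == 2:
        # site.py deletes sys.setdefaultencoding at startup; reload(sys) restores it.
        reload(sys)  # noqa: F821 (builtin on Python 2 only)
        sys.setdefaultencoding('utf-8')
        return True
    return False


force_utf8_default()
```

This is a process-wide workaround. Decoding the input explicitly as UTF-8 where it is read would be a narrower change, but I have not tried that here.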
Then I hit the KeyError in create_*_2.py. I tried to solve it by moving x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1) into the try block, and I finally got a db file of nearly 70 GB. When I train the model, it reports "Pairs without paths: 1549, all dataset: 20314", i.e. roughly 7.6% of the dataset. Continuing to train with these pairs missing could hurt the results, so the comparison would be unfair.
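A minimal sketch of the try-block change described above, so the effect is clear. Plain dicts stand in for term_to_id_db and path_to_id_db (an assumption; the real script uses on-disk databases), and the function name map_triplets and the skipped counter are hypothetical additions for illustration. Any triplet whose x or y term is missing from the term index is silently dropped, which may be how pairs end up without paths.

```python
def map_triplets(triplets, term_to_id_db, path_to_id_db):
    """Map (x, y, path) string triplets to id triplets.

    Triplets whose x or y term is not in term_to_id_db are skipped and
    counted. Unknown paths are mapped to -1 rather than skipped.
    """
    mapped, skipped = [], 0
    for x, y, path in triplets:
        try:
            # The lookup that raised KeyError, now inside the try block.
            x_id, y_id, path_id = (term_to_id_db[x],
                                   term_to_id_db[y],
                                   path_to_id_db.get(path, -1))
        except KeyError:
            skipped += 1
            continue
        mapped.append((x_id, y_id, path_id))
    return mapped, skipped


# Toy data, not taken from the real corpus.
term_to_id = {"cat": 0, "animal": 1}
path_to_id = {"X/is/Y": 5}
triplets = [("cat", "animal", "X/is/Y"), ("cat", "dog", "X/is/Y")]
mapped, skipped = map_triplets(triplets, term_to_id, path_to_id)
print(mapped, skipped)
```

Logging the skipped count this way would at least show how many triplets the workaround discards.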
I am using the 20181201 version of the wiki dump and spaCy 1.9.0. Could the different versions or the changes above be the cause of the KeyError? What can I do to get fair results? Thanks!