KeyError: '\xf0\x93\x86\x8e\xf0\x93\x85\x93\xf0\x93\x8f\x8f\xf0\x93\x8a\x96' #7

Hello, I need to reproduce the results on a subset of your dataset, and I ran into several problems: `pid killed` during parsing, an `ascii error` in `create_*_1.py`, and a `KeyError` in `create_*_2.py`. Some of them are the same as those @wt123u reported in another issue.

I deleted the `&` before line 40 in `create_*.sh` to solve the `pid killed` problem.

I added `sys.setdefaultencoding('utf-8')` to solve the `ascii error`.

Then I hit the `KeyError` in `create_*_2.py`. I tried to work around it by moving `x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1)` into the `try` block, and I finally got a db file of nearly 70GB. When I train the model, it reports `Pairs without paths: 1549 , all dataset: 20314`. Training on data with this many missing paths could hurt the results, so the comparison would be unfair.
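To make the workaround concrete, here is a minimal sketch of what moving the lookup into a `try` block amounts to. It is not the repository's code: `resolve_ids` is a hypothetical helper, and plain dicts stand in for `term_to_id_db` and `path_to_id_db`. Triplets whose term is missing are skipped and collected, so they can be counted and inspected instead of disappearing silently.

```python
def resolve_ids(triplets, term_to_id_db, path_to_id_db):
    """Map (x, y, path) triplets to ids, skipping triplets with unknown terms.

    Hypothetical helper: the two *_db arguments are assumed to behave like
    dicts (item lookup raising KeyError, plus a .get method).
    """
    resolved, skipped = [], []
    for x, y, path in triplets:
        try:
            # A term missing from the vocabulary raises KeyError here.
            x_id, y_id = term_to_id_db[x], term_to_id_db[y]
        except KeyError:
            skipped.append((x, y, path))
            continue
        # Unknown paths fall back to -1, as in the line quoted above.
        path_id = path_to_id_db.get(path, -1)
        resolved.append((x_id, y_id, path_id))
    return resolved, skipped


# Toy data, made up for illustration only.
term_to_id = {"cat": 0, "animal": 1}
path_to_id = {"X/is/Y": 7}
triplets = [
    ("cat", "animal", "X/is/Y"),
    ("cat", "unknown_term", "X/is/Y"),
    ("cat", "animal", "X/like/Y"),
]
resolved, skipped = resolve_ids(triplets, term_to_id, path_to_id)
print("resolved:", resolved)
print("skipped:", len(skipped), "of", len(triplets))
```

Logging the skipped triplets this way would show how many pairs lose paths because of the missing keys.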

I am using the `20181201` version of the wiki dump and `spacy 1.9.0`. Could the different versions or the changes above be the cause of the `KeyError`? What can I do to get fair results? Thanks!
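For reference, the key in the error message is a UTF-8 encoded byte string. The short sketch below (meant for Python 3; the variable names are only for illustration) decodes it and lists the code points. It only shows what the key contains and does not by itself explain why the key is missing from the term database.

```python
# The key from the traceback, written as a bytes literal.
raw_key = b'\xf0\x93\x86\x8e\xf0\x93\x85\x93\xf0\x93\x8f\x8f\xf0\x93\x8a\x96'

# Decode the bytes as UTF-8 and list each character's code point.
text = raw_key.decode('utf-8')
code_points = ['U+%05X' % ord(ch) for ch in text]
print(code_points)

# All four characters are 4-byte sequences in the Egyptian Hieroglyphs
# block (U+13000 to U+1342F), i.e. outside the Basic Multilingual Plane.
in_hieroglyph_block = all(0x13000 <= ord(ch) <= 0x1342F for ch in text)
print(in_hieroglyph_block)
```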

