Skip to content

KeyError: '\xf0\x93\x86\x8e\xf0\x93\x85\x93\xf0\x93\x8f\x8f\xf0\x93\x8a\x96' #7

@zhixiaochuan12

Description

@zhixiaochuan12

Hello, I need to reproduce the results on a subset of your dataset and I met some problems including pid killed in parsing, ascii error in create_*_1.py and key error in create_*_2.py. Some of them are the same as @wt123u in another issue.

I delete & before line 40 in create_*.sh to solve the pid killed problem.

I add sys.setdefaultencoding('utf-8') to solve the ascii error.

Then I met the KeyError in create_*_2.py, I tried to solve it by putting x_id, y_id, path_id = term_to_id_db[x], term_to_id_db[y], path_to_id_db.get(path, -1) to the try block, finally I got a db file nearly 70GB. When I train the model, it shows Pairs without paths: 1549 , all dataset: 20314. Continuing to train can damage the results, so it would be unfair.

I am using the 20181201 version of wiki dump and spacy 1.9.0, can the different versions or the above changes be the reason of KeyError? What can I do to get fair results? Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions