Sub-Language Filtering and Sub-Language Handling Improvements for lex… #563

Open
wants to merge 13 commits into
base: main
from

Conversation

axif0
Collaborator

@axif0 axif0 commented Jan 28, 2025

…eme dump

Contributor checklist


Description

Updated the sub-language handling logic for forms. After the update we get more missing forms, as expected. I also added the sub-language filter you mentioned earlier.

The sub-language structure is a bit different, which is why these forms weren't added before.

Sub-language lexeme for Hindustani:

{
{'type': 'lexeme', 'id': 'L580094', 'lemmas': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}...
...
'forms': [{'id': 'L580094-F1', 'representations': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}, 'grammaticalFeatures': ['Q115526488', 'Q1751855'], 'claims': {}}, {'id': 'L580094-F2', 'representations': {'hi': {'language': 'hi', 'value': '५'}, 'ur': {'language': 'ur', 'value': '۵'}}, 'grammaticalFeatures': [], 'claims': {}}, {'id': 'L580094-F3', 'representations': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}, 'grammaticalFeatures': ['Q115526488', 'Q1233197'], 'claims': {}}, ...
...
}

English verb "windsurf" (L4), from a language without sub-languages:

{"type":"lexeme","id":"L4","lemmas":{"en":{"language":"en","value":"windsurf"}},"lexicalCategory":"Q24905","language":"Q1860"....

....
"forms":[{"id":"L4-F1","representations":{"en":{"language":"en","value":"windsurfing"}},"grammaticalFeatures":["Q10345583"],"claims":{}},{"id":"L4-F3","representations":{"en":{"language":"en","value":"windsurfs"}},"grammaticalFeatures":["Q110786","Q3910936","Q51929074"],"claims":{}}
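
To make the structural difference concrete, here is a minimal sketch of reading form representations from either kind of lexeme. The helper name `iter_form_representations` and its `sub_languages` filter argument are hypothetical and not taken from the PR; the sample dictionaries are trimmed versions of the two lexemes above.

```python
from typing import Iterable, Iterator, Optional


def iter_form_representations(
    lexeme: dict, sub_languages: Optional[Iterable[str]] = None
) -> Iterator[tuple]:
    """Yield (form_id, language_code, value, grammatical_features).

    Hypothetical helper: a form may carry one representation ("en") or
    several ("hi" and "ur"), so every entry of "representations" is visited.
    """
    wanted = set(sub_languages) if sub_languages is not None else None
    for form in lexeme.get("forms", []):
        features = form.get("grammaticalFeatures", [])
        for code, rep in form.get("representations", {}).items():
            # Keep everything when no filter is given, else only wanted codes.
            if wanted is None or code in wanted:
                yield form["id"], code, rep["value"], features


# Trimmed sample data based on the lexemes shown above.
hindustani = {
    "id": "L580094",
    "forms": [
        {
            "id": "L580094-F1",
            "representations": {
                "hi": {"language": "hi", "value": "पॉंच"},
                "ur": {"language": "ur", "value": "پانچ"},
            },
            "grammaticalFeatures": ["Q115526488", "Q1751855"],
        }
    ],
}

english = {
    "id": "L4",
    "forms": [
        {
            "id": "L4-F1",
            "representations": {"en": {"language": "en", "value": "windsurfing"}},
            "grammaticalFeatures": ["Q10345583"],
        }
    ],
}

all_hindustani = list(iter_form_representations(hindustani))
urdu_only = list(iter_form_representations(hindustani, sub_languages=["ur"]))
all_english = list(iter_form_representations(english))
```

With a filter, only the requested sub-language representation of each form is returned; without one, the Hindustani form contributes two rows while the English form contributes one.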

Sub-language forms are updated as well:

image

You can check the PR here.

Related issue


github-actions bot commented Jan 28, 2025

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@henrikth93
Member

Looks good

Member

@wkyoshida wkyoshida left a comment

🚀 🙌

Comment on lines 241 to 251
except json.decoder.JSONDecodeError:
    rprint(
        "[bold red]Error: The Wikidata query service returned an invalid response. This usually happens when the query is too large or the service is temporarily unavailable. Please try again later or consider using a Wikidata dump instead.[/bold red]"
    )
except (urllib.error.HTTPError, EndPointInternalError):
    rprint(
        "[bold red]Error: The Wikidata query service is currently experiencing issues (HTTP 500). This could be due to:[/bold red]"
    )
    rprint("[red]1. The query is too complex or large[/red]")
    rprint("[red]2. The Wikidata service is temporarily overloaded[/red]")
    rprint("[red]3. Server-side issues at Wikidata[/red]")
Member

Just thinking here - is there a way to differentiate between the scenarios where it's a client error and where it's a server error? What I mean is - are the errors or status codes that we get different, for instance, if it's due to the query being too large or if due to Wikidata being down?

I'm mostly asking because it'd be good to immediately tell the user what to do: whether they could wait for Wikidata to come back up or whether they really should just try the dump (as opposed to telling them it could be either one).

Maybe there is no way to tell and this has been looked at already, but was wondering 😊

Collaborator Author

where it's a client error and where it's a server error? What I mean is - are the errors or status codes that we get different, for instance, if it's due to the query being too large or if due to Wikidata being down?

Could we do something like this: if the error code is between 400 and 499, show a client-side error, and if e.code >= 500, show a Wikidata server-side error?

Member

Yep! Yeah, that's what I was wondering if we could do. Do we know what typically happens in the scenario where the query is too large? Do we get a 400 Bad Request error or some other 4XX code in that case?

But yeah, as you pointed out, @axif0, could make sense to do:

  • 4XX for client error
  • 5XX for server error

Collaborator Author

@axif0 axif0 Feb 1, 2025

Do we know what typically happens in the scenario where the query is too large?

When the query is too large, we mostly get errors like #549 or an EndPointInternalError (error code: 500).

Member

Hmm gotcha, I see 🤔

How about then:

  • if we get a JSONDecodeError or an EndPointInternalError, we simply suggest right away that they try a dump instead, since we know those errors are often indicative of the query being too large
  • if we get any other 4XX error, we let them know it's some client error
  • if we get any other 5XX error, we let them know it's some server error
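
As a rough sketch of that ordering: the function name `describe_query_error` and the message strings are hypothetical, and `EndPointInternalError` is redefined locally as a stand-in for the SPARQLWrapper exception only so the snippet runs on its own.

```python
import io
import json
import urllib.error


class EndPointInternalError(Exception):
    """Local stand-in for SPARQLWrapper's EndPointInternalError (assumption:
    defined here only so the sketch runs without SPARQLWrapper installed)."""


def describe_query_error(error: Exception) -> str:
    """Map an exception from a Wikidata query to a user-facing hint."""
    # Errors that, per the discussion, usually mean the query is too large.
    if isinstance(error, (json.JSONDecodeError, EndPointInternalError)):
        return "query-too-large: consider using a Wikidata dump instead"
    # Any other HTTP error is split on its status code.
    if isinstance(error, urllib.error.HTTPError):
        if 400 <= error.code < 500:
            return f"client-error ({error.code}): check the request"
        if error.code >= 500:
            return f"server-error ({error.code}): try again later"
    return "unknown-error"


# Example inputs built by hand; a real run would catch these from the query call.
decode_hint = describe_query_error(json.JSONDecodeError("Expecting value", "", 0))
client_hint = describe_query_error(
    urllib.error.HTTPError("https://example.org", 429, "Too Many Requests", None, io.BytesIO())
)
server_hint = describe_query_error(
    urllib.error.HTTPError("https://example.org", 503, "Service Unavailable", None, io.BytesIO())
)
```

The too-large check comes first so that a 500 raised as EndPointInternalError leads to the dump suggestion rather than the generic server-error message.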

Would this make sense perhaps? Let me know if not though (CC @andrewtavis)


P.S. It could be that perhaps #156 potentially obsoletes this issue with the query being too large 🤔 , but we'll get there when we do 😊

@andrewtavis
Member

af38f12 has some standard changes from what I saw at a quick glance, and also puts the default export directory back in. @axif0: One thing to note for MARK: let's not put a period at the end, as it's a header, and let's capitalize all the words. Let's also try to keep them as short as possible: if we make them long, people can't read them in their minimaps and then they're not helping as much :)

@andrewtavis
Member

Check out the changes in af38f12 to get an impression of them :) Beyond that, let's finalize the discussion here and then I'll go through for a final review. Would be great if we could get this in within the next few days!

@andrewtavis
Member

Note that we need to fix check_query_forms here :) Might also be good to have lastModified be after the lexemeID?

@axif0
Collaborator Author

axif0 commented Feb 3, 2025

Adjusted the date_modified_pattern regex and added those to where_vars.

Also, since lastModified is not present in lexeme_form_labels_order, I append it statically at the end if it is present.
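
A minimal sketch of that idea, under assumptions: the function `order_form_labels` and the example contents of `labels_order` are hypothetical, and the real `lexeme_form_labels_order` in the project may be structured differently.

```python
def order_form_labels(labels, labels_order, trailing="lastModified"):
    """Sort labels by their position in labels_order and put `trailing` last.

    Labels missing from labels_order keep their relative order after the
    known ones, because sorted() is stable.
    """
    rank = {label: index for index, label in enumerate(labels_order)}
    body = [label for label in labels if label != trailing]
    ordered = sorted(body, key=lambda label: rank.get(label, len(rank)))
    # The trailing label is not part of labels_order, so it is appended statically.
    if trailing in labels:
        ordered.append(trailing)
    return ordered


# Hypothetical ordering list, for illustration only.
labels_order = ["lexemeID", "lemma", "singular", "plural"]

ordered = order_form_labels(
    ["lastModified", "plural", "lexemeID", "singular"], labels_order
)
```

If lastModified should instead follow lexemeID, the same approach works by inserting it into the ordering list at that position rather than appending it.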

@axif0
Collaborator Author

axif0 commented Feb 3, 2025

After merging the PR in test-scribe, I tried to fix the errors generated by check_query_forms.

However, I ran into an issue. For example, for the Italian adjective "fine" (L410), the form's grammatical features are masculine, singular. That order is correct and can be verified against lexeme_form_metadata.json.

In the dump, the order is reversed, i.e. "grammaticalFeatures":["Q110786","Q499327"], so the query checker / form extraction shows singularMasculine, which causes an error in check_query_forms.

image

  • Please see block 1 of the Google Colab notebook or the specific lexeme line in the dump.

@andrewtavis
Member

Are we able to use the dictionary in lexeme_form_metadata.json to order the forms that we're getting back from the dumps within the check? Might be ok? I don't think that the Wikidata community has a specific naming order that they stick to, so maybe just order them with our way of doing it and then do the check?
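
A minimal sketch of what that could look like. `FEATURE_ORDER` is a hypothetical, simplified stand-in for the ordering information in lexeme_form_metadata.json (whose real layout may differ), and `form_label` is an illustrative name, not a function from the codebase.

```python
# Hypothetical stand-in for the metadata: QID -> (position, label).
FEATURE_ORDER = {
    "Q499327": (1, "masculine"),  # assumed to come first (gender)
    "Q110786": (2, "singular"),   # assumed to come second (number)
}


def form_label(qids):
    """Build a camelCase form label from QIDs, ordered by FEATURE_ORDER.

    The order in which the dump lists the QIDs is ignored, so both
    orderings of the same features give the same label.
    """
    known = sorted(FEATURE_ORDER[qid] for qid in qids if qid in FEATURE_ORDER)
    labels = [label for _, label in known]
    if not labels:
        return ""
    head, rest = labels[0], labels[1:]
    return head + "".join(label[0].upper() + label[1:] for label in rest)


from_dump = form_label(["Q110786", "Q499327"])      # order as found in the dump
from_metadata = form_label(["Q499327", "Q110786"])  # order as in the metadata
```

Both calls produce the same label, so the check would compare like with like regardless of how the dump orders the features.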

@axif0
Collaborator Author

axif0 commented Feb 5, 2025

  • Add form normalization for consistent QID sorting.
  • Prevent duplicate form labels in query generation.
  • Optimize the missing forms detection logic so that the generated queries don't cause errors in check_query_forms.py.
  • Refactor query generation to handle label uniqueness.

@andrewtavis
Member

I need to be able to react to comments with more than one rocket for some of the work you're doing here, @axif0 😊🚀🚀🚀 Are we ready for a final review?

@axif0
Collaborator Author

axif0 commented Feb 5, 2025

Are we ready for a final review?

Yes. I tried index-based stable sorting and other sorting based on QID, but the check_forms_order checker is based on the label, so I couldn't get the full check to pass.

Four files in total cause the error. Please check the Test Workflow.

  1. french/verbs/query_verbs
  2. japanese/adjectives/query_adjectives_1.sparql
  3. russian/adjectives/query_adjectives_1.sparql
  4. russian/verbs/query_verbs_1.sparql


4 participants