Sub-Language Filtering and Sub-Language Handling Improvements for lex… #563

axif0 · 2025-01-28T17:33:07Z

…eme dump

Contributor checklist

This pull request is on a separate branch and not the main branch
I have tested my code with the pytest command as directed in the testing section of the contributing guide

Description

Updated the sub-language logic for forms. After the update we get more missing forms, as expected. I also added the sub-language filter you mentioned earlier.

The sub-language structure is a bit different, which is why these forms weren't being added.

Sub-language example for Hindustani:

{
{'type': 'lexeme', 'id': 'L580094', 'lemmas': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}...
...
'forms': [{'id': 'L580094-F1', 'representations': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}, 'grammaticalFeatures': ['Q115526488', 'Q1751855'], 'claims': {}}, {'id': 'L580094-F2', 'representations': {'hi': {'language': 'hi', 'value': '५'}, 'ur': {'language': 'ur', 'value': '۵'}}, 'grammaticalFeatures': [], 'claims': {}}, {'id': 'L580094-F3', 'representations': {'hi': {'language': 'hi', 'value': 'पॉंच'}, 'ur': {'language': 'ur', 'value': 'پانچ'}}, 'grammaticalFeatures': ['Q115526488', 'Q1233197'], 'claims': {}}, ...
...
}

English verb windsurf (L4), from a language without sub-languages:

{"type":"lexeme","id":"L4","lemmas":{"en":{"language":"en","value":"windsurf"}},"lexicalCategory":"Q24905","language":"Q1860"....

....
"forms":[{"id":"L4-F1","representations":{"en":{"language":"en","value":"windsurfing"}},"grammaticalFeatures":["Q10345583"],"claims":{}},{"id":"L4-F3","representations":{"en":{"language":"en","value":"windsurfs"}},"grammaticalFeatures":["Q110786","Q3910936","Q51929074"],"claims":{}}

Updated the sub-language forms as well.

You can check the PR here.
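To illustrate the structural difference between the two dump excerpts above, here is a minimal sketch of pulling the form representations for one sub-language out of a lexeme entry. The function name `forms_for_sub_language` and the trimmed sample dictionaries are hypothetical and not taken from this PR's code; only the key names (`forms`, `representations`, `value`, `grammaticalFeatures`) come from the excerpts.

```python
def forms_for_sub_language(lexeme: dict, iso: str) -> list[tuple[str, list[str]]]:
    """Return (value, grammaticalFeatures) pairs for forms represented in `iso`."""
    results = []
    for form in lexeme.get("forms", []):
        # A Hindustani form carries both "hi" and "ur" keys; an English form only "en".
        representation = form.get("representations", {}).get(iso)
        if representation is None:
            continue
        results.append((representation["value"], form.get("grammaticalFeatures", [])))
    return results


# Trimmed copies of the excerpts above (hypothetical sample data for the sketch).
hindustani = {
    "id": "L580094",
    "forms": [
        {
            "id": "L580094-F1",
            "representations": {
                "hi": {"language": "hi", "value": "पॉंच"},
                "ur": {"language": "ur", "value": "پانچ"},
            },
            "grammaticalFeatures": ["Q115526488", "Q1751855"],
        }
    ],
}
english = {
    "id": "L4",
    "forms": [
        {
            "id": "L4-F1",
            "representations": {"en": {"language": "en", "value": "windsurfing"}},
            "grammaticalFeatures": ["Q10345583"],
        }
    ],
}

urdu_forms = forms_for_sub_language(hindustani, "ur")
english_forms = forms_for_sub_language(english, "en")
```

Filtering by the representation key, rather than by the lexeme's top-level language, is what lets one Hindustani lexeme contribute forms to both Hindi and Urdu.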

Related issue

…eme dump

github-actions · 2025-01-28T17:33:36Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

henrikth93 · 2025-01-31T13:34:21Z

Looks good

wkyoshida

🚀 🙌

wkyoshida · 2025-02-01T01:42:10Z

src/scribe_data/cli/get.py

+        except json.decoder.JSONDecodeError:
+            rprint(
+                "[bold red]Error: The Wikidata query service returned an invalid response. This usually happens when the query is too large or the service is temporarily unavailable. Please try again later or consider using a Wikidata dump instead.[/bold red]"
+            )
+        except (urllib.error.HTTPError, EndPointInternalError):
+            rprint(
+                "[bold red]Error: The Wikidata query service is currently experiencing issues (HTTP 500). This could be due to:[/bold red]"
+            )
+            rprint("[red]1. The query is too complex or large[/red]")
+            rprint("[red]2. The Wikidata service is temporarily overloaded[/red]")
+            rprint("[red]3. Server-side issues at Wikidata[/red]")


Just thinking here - is there a way to differentiate between the scenarios where it's a client error and where it's a server error? What I mean is - are the errors or status codes that we get different, for instance, if it's due to the query being too large or if due to Wikidata being down?

More so asking because it would be good to immediately inform the user what to do: whether they could wait for Wikidata to come back up or whether they really should just try the dump (as opposed to telling them it could be either one).

Maybe there is no way to tell and this has been looked at already, but was wondering 😊

where it's a client error and where it's a server error? What I mean is - are the errors or status codes that we get different, for instance, if it's due to the query being too large or if due to Wikidata being down?

Could we do something like this: if the error code is within 400 to 499, show a client-side error; if e.code >= 500, show a Wikidata server-side error?

Yep! Yeah, that's what I was wondering if we could do. Do we know what typically happens in the scenario where the query is too large? Do we get a 400 Bad Request error or some other 4XX code in that case?

But yeah, as you pointed out, @axif0, could make sense to do:

4XX for client error

5XX for server error

Do we know what typically happens in the scenario where the query is too large?

Mostly we get errors like #549 or EndPointInternalError (error code: 500) when the query is too large.

Hmm gotcha, I see 🤔

How about then:

if we get a JSONDecodeError or an EndPointInternalError, we simply suggest that they try a dump instead, since we know those errors are often indicative of the query being too large

if we get any other 4XX error, we let them know it's some client error

if we get any other 5XX error, we let them know it's some server error

Would this make sense perhaps? Let me know if not though (CC @andrewtavis)

P.S. It could be that perhaps #156 potentially obsoletes this issue with the query being too large 🤔 , but we'll get there when we do 😊
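The three-way split proposed in this thread could be sketched roughly as follows. This is not the code merged in get.py: `classify_query_error` and its return strings are hypothetical, and `EndPointInternalError` is defined here as a local stand-in for the SPARQLWrapper exception of the same name so that the sketch runs without that dependency.

```python
import json
import urllib.error


class EndPointInternalError(Exception):
    """Local stand-in for SPARQLWrapper's exception (assumption for this sketch)."""


def classify_query_error(exc: Exception) -> str:
    """Map an exception from the query step to the kind of advice to show."""
    # Per the discussion, these two usually indicate that the query is too large.
    if isinstance(exc, (json.JSONDecodeError, EndPointInternalError)):
        return "try_dump"
    if isinstance(exc, urllib.error.HTTPError):
        if 400 <= exc.code < 500:
            return "client_error"
        if exc.code >= 500:
            return "server_error"
    return "unknown"


not_found = urllib.error.HTTPError("https://example.org", 404, "Not Found", None, None)
unavailable = urllib.error.HTTPError("https://example.org", 503, "Unavailable", None, None)
kinds = [classify_query_error(e) for e in (not_found, unavailable, EndPointInternalError())]
```

Checking the "too large" exceptions first matters here, because EndPointInternalError corresponds to an HTTP 500 and would otherwise be reported as a generic server error.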

src/scribe_data/check/check_missing_forms/check_missing_forms.py

andrewtavis · 2025-02-01T07:41:29Z

af38f12 has some standard changes from what I saw on a quick glance, as well as putting back the default export directory. @axif0: One thing to note for MARK: comments is that we shouldn't put a period at the end, as it's a header, and we should capitalize all the words. Let's also try to keep all of them as short as possible, as if we make them long then people can't read them in their minimaps and then they're not helping as much :)

andrewtavis · 2025-02-01T07:42:50Z

Check out the changes in af38f12 to get an impression of them :) Beyond that, let's finalize the discussion here and then I'll go through for a final review. Would be great if we could get this in within the next few days!

andrewtavis · 2025-02-02T23:30:36Z

Note that we need to fix check_query_forms here :) Might also be good to have lastModified be after the lexemeID?


axif0 · 2025-02-03T13:06:08Z

Adjusted the date_modified_pattern regex and added those to where_vars.

Also, as lastModified is not present in lexeme_form_labels_order, I added it statically at the end if present.
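One way to read the "statically at the end" approach is sketched below. The function `order_labels` and the sample label order are hypothetical; the actual contents of lexeme_form_labels_order and the real checker code are not shown in this thread.

```python
def order_labels(labels: list[str], known_order: list[str]) -> list[str]:
    """Sort labels by a known order, then append lastModified last if it is present."""
    rank = {label: index for index, label in enumerate(known_order)}
    # Labels missing from the known order sort after all known ones (stable sort).
    ordered = sorted(
        (label for label in labels if label != "lastModified"),
        key=lambda label: rank.get(label, len(rank)),
    )
    if "lastModified" in labels:
        ordered.append("lastModified")
    return ordered


ordered = order_labels(["lastModified", "plural", "singular"], ["singular", "plural"])
```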

axif0 · 2025-02-03T17:30:52Z

After merging the PR in test-scribe, I tried to solve the errors generated by check_query_forms.

But I ran into a situation. Example:
For the Italian adjective fine (L410), the grammatical features of the forms are
masculine, singular, which is right and can be verified with lexeme_form_metadata.json.

In the dump they are in reverse order, like "grammaticalFeatures":["Q110786","Q499327"], so the query checker / form extraction shows singularMasculine, which causes an error in check_query_forms.

Please see block 1 of the Google Colab or the specific lexeme line in the dump.

andrewtavis · 2025-02-04T11:16:05Z

Are we able to use the dictionary in lexeme_form_metadata.json to order the forms that we're getting back from the dumps within the check? Might be ok? I don't think that the Wikidata community has a specific naming order that they stick to, so maybe just order them with our way of doing it and then do the check?
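A rough sketch of that idea: sort the QIDs from the dump by a project-side ordering before building the label, so that both input orders produce the same name. `FEATURE_ORDER`, `FEATURE_LABELS` and `normalized_form_label` are hypothetical; the real ordering would be derived from lexeme_form_metadata.json, whose structure is not shown here. The QID-to-label mapping is inferred from the singularMasculine example earlier in the thread.

```python
# Hypothetical ordering and labels for this sketch (masculine before singular).
FEATURE_ORDER = ["Q499327", "Q110786"]
FEATURE_LABELS = {"Q499327": "masculine", "Q110786": "singular"}


def normalized_form_label(features: list[str]) -> str:
    """Build a camelCase label from QIDs sorted by FEATURE_ORDER."""
    rank = {qid: index for index, qid in enumerate(FEATURE_ORDER)}
    # Unknown QIDs sort after known ones and fall back to the raw QID as their label.
    ordered = sorted(features, key=lambda qid: rank.get(qid, len(rank)))
    words = [FEATURE_LABELS.get(qid, qid) for qid in ordered]
    if not words:
        return ""
    return words[0] + "".join(word[0].upper() + word[1:] for word in words[1:])


from_dump = normalized_form_label(["Q110786", "Q499327"])
from_query = normalized_form_label(["Q499327", "Q110786"])
```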

axif0 · 2025-02-05T21:22:02Z

Add form normalization for consistent QID sorting.
Prevent duplicate form labels in query generation.
Optimize missing forms detection logic so that the generated queries don't cause errors in check_query_forms.py.
Refactor query generation to handle label uniqueness.
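The label-uniqueness step listed above might look something like this minimal sketch. `dedupe_forms` is a hypothetical helper, not the function from this PR; it keeps the first form seen for each label and drops later ones with the same label.

```python
def dedupe_forms(forms: list[tuple[str, list[str]]]) -> list[tuple[str, list[str]]]:
    """Keep only the first (label, qids) pair for each label, preserving order."""
    seen = set()
    unique = []
    for label, qids in forms:
        if label in seen:
            continue
        seen.add(label)
        unique.append((label, qids))
    return unique


deduped = dedupe_forms(
    [
        ("masculineSingular", ["Q499327", "Q110786"]),
        ("masculineSingular", ["Q110786", "Q499327"]),
        ("plural", ["Q146786"]),
    ]
)
```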

andrewtavis · 2025-02-05T21:23:30Z

I need to be able to react to comments with more than one rocket for some of the work you're doing here, @axif0 😊🚀🚀🚀 Are we ready for a final review?

axif0 · 2025-02-05T23:13:17Z

Are we ready for a final review?

Yes. I tried index-based stable sorting and other sorting based on QID, but the check_forms_order checker is based on the label, so it couldn't fully pass the full check.

A total of 4 files caused the error. Please check the Test Workflow.

Sub-Language Filtering and Sub-Language Handling Improvements for lex…

d9d5836

…eme dump

andrewtavis requested review from andrewtavis, wkyoshida and henrikth93 January 28, 2025 17:52

andrewtavis mentioned this pull request Jan 31, 2025

Chore: Update missing Wikidata query forms #560

Closed

wkyoshida reviewed Feb 1, 2025

View reviewed changes

wkyoshida and others added 2 commits January 31, 2025 22:44

Merge branch 'main' into fix_dump

2495bdf

Minor formatting fixes as well as using default export directory

af38f12

axif0 added 3 commits February 2, 2025 13:26

Optimize error handling in query_data execution

8a62c25

add modified date in lexeme dump form & translation. ref scribe-org#562

64597f9

Add last modified date to lexeme query generation

e363282

andrewtavis mentioned this pull request Feb 2, 2025

Format Query Data adding Last Modified Date #566

Merged

2 tasks

Merge branch 'main' into fix_dump

c01436e

axif0 added 3 commits February 3, 2025 14:15

Fix query generation syntax for last modified date

084594b

Merge branch 'fix_dump' of https://github.com/axif0/Scribe-Data into …

a7f4a8a

…fix_dump

Adjust lastModified placement query form validation

35030b1

Update query comment to clarify form selection

72de05e

andrewtavis mentioned this pull request Feb 5, 2025

Refactor by removing existing_files check before query_result in get.py #537

Open

2 tasks

Improve form query generation and deduplication

73b90c4

improve unique form logic

51ec2e3
