Migrate Place Info and Series Dates to V2 API #6047
SandeepTuniki wants to merge 25 commits into `master`
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request re-implements key Data Commons service functions, `get_place_info` and `get_series_dates`, on top of the V2 API.
Code Review
This pull request refactors get_place_info and get_series_dates to use V2 APIs instead of V1, which is a significant change involving the reimplementation of V1 endpoint logic by composing multiple V2 API calls. The changes also include new unit tests and a live verification script.
My review identified several issues that need attention:
- A correctness bug in `get_series_dates` related to filtering child places by type.
- An inefficiency in `get_place_info` that can be improved for better performance.
- The unit tests for `get_series_dates` use incorrect mock data that doesn't match the real API response, which unfortunately hides the implementation bug.
- The new live verification script contains a bug in its assertion logic, rendering it ineffective.
I have provided specific comments and code suggestions to address these points. While the overall direction of the changes is good, the implementation requires these fixes to ensure correctness and robustness.
/gemini review
Code Review
This pull request successfully migrates get_place_info and get_series_dates from v1 to v2 APIs. The new implementations in datacommons.py are more complex to accommodate the v2 API structure but appear correct and are well-tested with new verification tests. The related changes, such as removing old v1 endpoints and updating numerous golden files, are consistent with this migration. I have one minor suggestion to improve efficiency in the get_series_dates implementation. Overall, this is a solid refactoring effort.
juliawu left a comment
Hooray! Nice work. Just left some suggestions for extra comments/cleanup.
| "CensusZipCodeTabulationArea": 1, | ||
| "AdministrativeArea5": 2, | ||
| "AdministrativeArea4": 2, | ||
| "Village": 5, | ||
| "City": 5, | ||
| "Town": 5, | ||
| "Borough": 5, | ||
| "AdministrativeArea3": 5, | ||
| "County": 10, | ||
| "AdministrativeArea2": 10, | ||
| "EurostatNUTS3": 10, | ||
| "CensusDivision": 15, | ||
| "State": 20, | ||
| "AdministrativeArea1": 20, | ||
| "EurostatNUTS2": 20, | ||
| "EurostatNUTS1": 20, | ||
| "Country": 30, | ||
| "CensusRegion": 35, | ||
| "GeoRegion": 38, | ||
| "Continent": 40, | ||
| "Place": 50, |
Should this list also contain the other StatisticalArea place types? See https://datacommons.org/browser/StatisticalArea
```python
resp = v2node(fetch_dcids, '->containedInPlace')
data = resp.get('data', {})

current_frontier = set()
for dcid in fetch_dcids:
  visited.add(dcid)
  node_data = data.get(dcid, {})

  arcs_obj = node_data.get('arcs', {}).get('containedInPlace', {})
  nodes_list = arcs_obj.get('nodes', []) if isinstance(arcs_obj, dict) else []

  parents = [x['dcid'] for x in nodes_list if 'dcid' in x]
  if parents:
    parent_graph[dcid] = parents
    current_frontier.update(parents)
```
nit: Could you add a few more inline comments here to explain what the loop is doing? I just want it to be easier for future maintainers to parse if we ever decide to update the implementation down the road.
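As a sketch of what those comments might document, here is a self-contained version of the frontier loop: a breadth-first traversal of `containedInPlace` arcs with one batched V2 call per level. The `v2node` stub and its sample graph are illustrative only, not the real V2 API client:

```python
def v2node(dcids, prop):
  """Stub of the V2 node API call; the sample graph is illustrative only."""
  graph = {
      "geoId/06085": ["geoId/06"],  # Santa Clara County -> California
      "geoId/06": ["country/USA"],  # California -> United States
      "country/USA": [],
  }
  return {
      "data": {
          d: {
              "arcs": {
                  "containedInPlace": {
                      "nodes": [{"dcid": p} for p in graph.get(d, [])]
                  }
              }
          } for d in dcids
      }
  }


def build_parent_graph(start_dcids):
  """BFS over containedInPlace arcs, one batched v2node call per level."""
  visited = set()
  parent_graph = {}
  frontier = set(start_dcids)
  while frontier:
    # Only fetch nodes we have not already resolved.
    fetch_dcids = sorted(frontier - visited)
    if not fetch_dcids:
      break
    data = v2node(fetch_dcids, "->containedInPlace").get("data", {})
    frontier = set()
    for dcid in fetch_dcids:
      visited.add(dcid)  # never re-fetch this node
      arcs = data.get(dcid, {}).get("arcs", {}).get("containedInPlace", {})
      nodes = arcs.get("nodes", []) if isinstance(arcs, dict) else []
      parents = [n["dcid"] for n in nodes if "dcid" in n]
      if parents:
        parent_graph[dcid] = parents
        frontier.update(parents)  # parents form the next BFS level
  return parent_graph


print(build_parent_graph(["geoId/06085"]))
```

The loop terminates because every fetched dcid enters `visited`, so the frontier shrinks to empty once all ancestors have been seen.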
```python
def get_all_values(resp, dcid, prop, key='dcid'):
  node_data = resp.get('data', {}).get(dcid, {})
  arcs_obj = node_data.get('arcs', {}).get(prop, {})
  if not arcs_obj:
    # Try checking without arrow if key mismatch
    arcs_obj = node_data.get('arcs', {}).get(prop.replace('->', ''), {})

  nodes_list = arcs_obj.get('nodes', []) if isinstance(arcs_obj, dict) else []
  return [n.get(key, '') for n in nodes_list if key in n]


def get_best_type(types_list):
  if not types_list:
    return ''

  # Sort types by rank (highest rank first)
  # If ranks are tied, prefer types that don't start with 'AdministrativeArea'
  def sort_key(t):
    rank = PLACE_TYPE_RANK.get(t, 0)
    is_admin = 1 if t.startswith('AdministrativeArea') else 0
    return (rank, -is_admin)

  return sorted(types_list, key=sort_key, reverse=True)[0]
```
These nested functions don't look like closure functions to me. What do you think of extracting these out to _get_all_values() and _get_best_type() at the top of this file instead? Because get_place_info() is already a very long function, I think this would help improve readability.
```python
for dcid in dcids:
  self_types = get_all_values(types_resp, dcid, 'typeOf')
  self_names = get_all_values(names_resp, dcid, 'name', 'value')

  # Skip DCIDs that don't exist in the graph (bogus places)
  if not self_types and not self_names:
    continue

  self_type = get_best_type(self_types)
  self_name = self_names[0] if self_names else ''

  parents = []
  for anc_dcid in ancestors_map.get(dcid, []):
    if anc_dcid == dcid:
      continue

    anc_types = get_all_values(types_resp, anc_dcid, 'typeOf')
    anc_type = get_best_type(anc_types)
    anc_names = get_all_values(names_resp, anc_dcid, 'name', 'value')
    anc_name = anc_names[0] if anc_names else ''

    if anc_type in PLACE_TYPE_RANK:
      parents.append({
          'dcid': anc_dcid,
          'type': anc_type,
          'name': anc_name,
          'rank': PLACE_TYPE_RANK[anc_type]
      })

  parents.sort(key=lambda x: x['rank'])
  for p in parents:
    del p['rank']

  result_data.append({
      'node': dcid,
      'info': {
          'self': {
              'dcid': dcid,
              'type': self_type,
              'name': self_name
          },
          'parents': parents
      }
  })
```
Similarly, could you add some more inline comments explaining these code blocks?
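As one shape those comments could take, here is a compact, commented walk-through of the assembly step with stub inputs (the helper name `assemble_place_info` and the sample data are illustrative, not from the PR):

```python
# Trimmed stand-in for the module-level rank table.
PLACE_TYPE_RANK = {"County": 10, "State": 20, "Country": 30}


def assemble_place_info(dcid, self_type, self_name, ancestors):
  """Build one entry of the get_place_info result.

  `ancestors` maps ancestor dcid -> (type, name). Ancestors with types
  outside PLACE_TYPE_RANK are dropped, and parents are ordered
  most-specific-first before the helper 'rank' key is stripped.
  """
  parents = [
      {"dcid": a, "type": t, "name": n, "rank": PLACE_TYPE_RANK[t]}
      for a, (t, n) in ancestors.items()
      if t in PLACE_TYPE_RANK
  ]
  parents.sort(key=lambda p: p["rank"])  # lower rank = more specific
  for p in parents:
    del p["rank"]  # rank was only needed for ordering
  return {
      "node": dcid,
      "info": {
          "self": {"dcid": dcid, "type": self_type, "name": self_name},
          "parents": parents,
      },
  }


info = assemble_place_info(
    "geoId/06085", "County", "Santa Clara County",
    {"country/USA": ("Country", "United States"),
     "geoId/06": ("State", "California")})
# State (20) should precede Country (30) in the parents list.
print([p["dcid"] for p in info["info"]["parents"]])
```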
This PR migrates two backend service functions, `get_place_info` and `get_series_dates`, from the v1 API to the v2 API.

Notes:
- Some place names differ between responses: `United States` vs `United States of America`, `New York` vs `New York City`, etc.

Files changed:
- `server/services/datacommons.py` (main logic changes are here)
- `server/services/discovery.py` (removed the migrated v1 endpoints)
- `shared/lib/constants.py`
- `server/tests/migration_verification_test.py`

Also, I was having trouble updating golden files locally (the script was erroring out due to timeout issues in staging), so as a workaround I triggered cloud builds through `build/ci/cloudbuild.update_nl_goldens.yaml` and brought the updated goldens into this PR.