Data rearchitecture for dashboard #6195
base: master
Conversation
…ance (#5815) * Add deployment file for the new data-rearchitecture-project-test instance
…project-test deployment (#5817) * Bump nokogiri to 1.16.5 to see if this fixes the data-rearchitecture-project deployment
…yment (#5819) * Update Gemfile.lock to avoid having mini_portile2 as a dependency, because apparently it is no longer necessary * Add PASSENGER_INSTANCE_REGISTRY_DIR environment variable to the data-rearchitecture-project-test deploy stage
…rWikiTimeslice` models and migrations (#5807) * Create CourseWikiTimeslice model and migration for it * Create ArticleCourseTimeslice model and migration for it * Create CourseUserWikiTimeslice model and migration for it
This PR adds a proof of concept for the data rearchitecture project. There are a lot of TODO comments that need to be handled in the future. * Add a new RevisionDataManager class that fetches revisions and revision scores and keeps them in memory (replacing the previous Revision table). * Add specs for RevisionDataManager class. * Implement cache updating for ArticleCourseTimeslice. * Add article_course_timeslices migration to remove default value for user_ids field. * Add a new UpdateCourseStatsTimeslice class, the "timeslice" version of the existing UpdateCourseStats class. Instead of using the Revisions table, this new class works with previous timeslices and revisions in memory. * Quick fix for RevisionScoreImporter specs. This should be a temporary fix. * Implement update_all_caches_from_timeslices for ArticlesCourses class and specs for it. * Call ArticlesCourses.update_all_caches_from_timeslices in UpdateCourseStatsTimeslice class and add specs for it.
Merge with master
…ster Data re-architecture project: sync with master
I'm working on rebuilding and reshuffling the older servers on Wikimedia Cloud. The new location of the replica tool is going to be https://replica-revision-tools.wmcloud.org/
…-replica-endpoint [Data Rearchitecture Project] Switch to new replica endpoint server
* Implement CourseUserWikiTimeslice.update_cache_from_revisions and specs for it * Add update_all_caches_from_timeslices method for CoursesUsers and specs for it * Update UpdateCourseStatsTimeslice class to call update_cache_from_revisions for course user wiki timeslices and add specs for it * Switch to new replica endpoint server I'm working on rebuilding and reshuffling the older servers on Wikimedia Cloud. The new location of the replica tool is going to be https://replica-revision-tools.wmcloud.org/ * Fix UpdateCourseStatsTimeslice spec * Format fix --------- Co-authored-by: Sage Ross <[email protected]>
…re-for-dahboard-sync-with-master
…-sync-with-master [Data Rearchitecture for dahboard] sync with master
* Add update_cache_from_revisions for CourseWikiTimeslice class and specs for it * Add CourseCacheManager.update_cache_from_timeslices and specs for it * Update course and course wiki timeslices caches in UpdateCourseStatsTimeslice class. Add specs for it
…cords from revisions in RAM (#5893) * Implement update_from_course_revisions class method to create new ArticlesCourses records from course revisions in RAM. TODO in the future: the method should also remove existing ArticlesCourses that do not correspond to the course anymore. * Call the new update_from_course_revisions method to create new ArticlesCourses records when running UpdateCourseStatsTimeslice. Fix UpdateCourseStatsTimeslice specs now that view counts are updated.
… don't use join fields (#5899) * Refactor ArticleCourseTimeslices model. Instead of having an article_course_id field, now it has two separate fields: article_id and course_id. Update specs. * Refactor CourseUserWikiTimeslices model. Instead of having a single course_user_id field, now it has two separate fields: course_id and user_id. Update specs. * Refactor CourseWikiTimeslices model. Instead of having a single course_wiki_id field, now it has two separate fields: course_id and wiki_id. Update specs.
- Hide 'Previous' and 'Next' buttons when there is only one page of articles - Do not show 'Previous' button on the first page - Do not show 'Next' button on the last page - Resolves regression issue #5873
Addresses issue #5882 where articles with entirely removed edits would indefinitely display "loading authorship data".
- Replace createStore with configureStore from Redux Toolkit - Remove explicit thunk middleware (now included by default) - Simplify middleware setup and add conditional checks - Enable Redux DevTools only in non-production environments - Remove manual compose enhancers configuration This update improves maintainability and adopts Redux best practices.
…re-sync-with-master
…ster [Data rearchitecture] sync with master
* Pass in the right update_service to UpdateCourseWikiTimeslices * Pass in update_service to UpdateCourseWikiTimeslices * Pass in the same debugger instance * Add some basic logs through Rails.logger to identify when course metadata changed * Make UpdateCourseWikiTimeslices return number of timeslices processed and reprocessed. * Log the number of processed and reprocessed timeslices
…#6152) * Reduce the number of logs and sort them to make them easier to analyze.
* Add error field to LiftWingApi response * Add error field to ReferenceCounterAPI response * Do not set needs_update timeslice field to false if some revision is marked with error * Specify wiki in ReferenceCounterApi spec * Populate error in RevisionScoreApiHandler if some API call failed * Turn on ithenticate_id if there was an error when getting revisions * Add spec for UpdateCourseWikiTimeslices
…f the request returns 400 bad request, since it's not a transient error. (#6157)
* Improve comments * Remove unnecessary min_max and max_min methods * Move methods from TimesliceManager to new TimesliceCleaner class * Use methods from TimesliceCleaner * Removed unused methods from TimesliceManager class. * Fix reset_timeslices_that_need_update_from_article_timeslices when passing in wiki * Add new specs for TimesliceCleaner
* Clean up CourseWikiTimeslice model * Clean up CourseUserWikiTimeslice model * Clean up ArticleCourseTimeslice model * Remove revision_count from ArticlesCourses * No longer use the revision_count field from ArticlesCourses in UpdateWikiNamespaceStatsTimeslice
…nd unused code (#6169) * Delete CleanArticlesCoursesWorker because update_from_course is already being called * Stop importing wikidata summaries in daily update. * Remove unnecessary TODO comments * Remove commented code
…in reference-counter and liftwing APIs (#6176) * Add 403 forbidden responses to non-transient errors for reference counter API. Usually, this error means that we lack permission to access the revision, possibly because it was deleted or hidden. That is not a transient error, so we don't want to mark the timeslice to be reprocessed. * Add 404 not found responses to non-transient errors for reference counter API. This error occurs for some deleted revisions. * Do not return error field in LiftWingAPI response if the revision was deleted. That's not a transient error, so continuing to retry the LiftWing request doesn't make sense.
…re-sync-with-master
…tentType` liftwing API response (#6177) * Add UnexpectedContentType as non transient error for LiftWingApi * Minor refactor to LiftWingApi spec
…ster [Data rearchitecture] Sync with master
… course dates. Now when the previous course period and the current one have no overlap, delete the existing timeslices and generate new ones. Add basic specs for it. (#6185)
* Add resolve_duplicates_for_timeslices method in DuplicateArticleDeleter class * Resolve duplicate articles after fetching new revisions and importing articles
…nd there is no timeslice for the start ingestion date (#6188)
* Fix bug in timeslice creation when there is a new, earlier start date * Fix specs to work no matter how the TIMESLICE_DURATION env variable is set * Add a log entry when there is a drastic change
…is set to 7 days. If the course start is less than one week ago, we don't want to try to retrieve a timeslice for 7 days ago, because that timeslice won't exist. Because of that, we now fix the start of the timeframe as the later of the course start and 7 days ago. (#6186)
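As a minimal sketch of the idea described in that commit (assuming a course record with a start date and ActiveSupport loaded; names are illustrative, not the actual implementation):

# Never look further back than the course start: if the course began less
# than 7 days ago, use the course start instead of 7.days.ago.
timeframe_start = [course.start, 7.days.ago].max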
Just a few little things I spotted
@@ -235,6 +235,13 @@ def manual_update
    redirect_to "/courses/#{@course.slug}"
  end

  def manual_update_timeslice
I guess this was added just to make it easy to compare updates via the old method versus the new one?
During the cleanup phase, we'll want to remove this method and make manual_update use the timeslice approach.
I guess this was added just to make it easy to compare updates via the old method versus the new one?
Yes, exactly. There are two routes, /manual_update and /manual_update_timeslice, to easily compare metrics.
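As a rough, hypothetical sketch of how two such parallel endpoints could be wired up in config/routes.rb (the controller name and slug parameter here are assumptions, not the project's actual routes):

# Two endpoints for the same course, one per update path, so metrics from
# the old and the timeslice-based update can be compared side by side.
get 'courses/:slug/manual_update' => 'courses#manual_update'
get 'courses/:slug/manual_update_timeslice' => 'courses#manual_update_timeslice'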
  def filter_revisions(revisions)
    filtered_data = revisions.select do |_, details|
      wiki = Wiki.find(details['article']['wiki_id'])
This seems like it might do a lot of duplicate Wiki queries. It looks like this is only ever called for one wiki at a time, via RevisionDataManager, so maybe the wiki can just be passed in as an argument?
Yes, I think you're right.
Done on 14357c0
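For context, a hedged sketch of the kind of change being discussed here (the actual code in 14357c0 may differ): the caller looks up the Wiki once and passes it in, rather than querying per revision.

# Hypothetical refactor: take the wiki as a parameter so Wiki.find is not
# called once per revision inside the select block.
def filter_revisions(wiki, revisions)
  revisions.select do |_, details|
    article_title = details['article']['title']
    formatted_title = ArticleUtils.format_article_title(article_title, wiki)
    scoped_article_titles.include?(formatted_title)
  end
end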
@@ -75,4 +75,22 @@ def use_start_and_end_times
  def multiple_roles_allowed?
    true
  end

  def filter_revisions(revisions)
I would prefer not to duplicate this method across multiple course types. I guess that the different implementations of scoped_article_titles are the only difference, so perhaps this method can be extracted somewhere else?
hmm I understand your point because the following definition is duplicated:
def filter_revisions(revisions)
  filtered_data = revisions.select do |_, details|
    wiki = Wiki.find(details['article']['wiki_id'])
    article_title = details['article']['title']
    formatted_article_title = ArticleUtils.format_article_title(article_title, wiki)
    scoped_article_titles.include?(formatted_article_title)
  end
  filtered_data
end
However, except for article scoped and visiting scholarship courses, the filter_revisions method does not have to filter anything. There is no scoped_article_titles definition that makes sense for basic courses, I think, because it would have to contain all the article titles. I could extract the code for filter_revisions used in article scoped and visiting scholarship courses somewhere to avoid duplication, but I think that for the course parent class, the filter_revisions method has to avoid doing any query. WDYT? (One possible shape for that extraction is sketched below.)
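A sketch only, with illustrative names (not the actual implementation): the parent class keeps a no-op that returns every revision and runs no queries, while the scoped course types share the filtering logic through a mixin.

# Hypothetical: in the parent Course class, keep a no-op that returns all
# revisions and performs no database queries.
def filter_revisions(revisions)
  revisions
end

# Hypothetical shared module for the article scoped and visiting scholarship
# course types, both of which define scoped_article_titles.
module ScopedRevisionFiltering
  def filter_revisions(revisions)
    revisions.select do |_, details|
      wiki = Wiki.find(details['article']['wiki_id'])
      title = ArticleUtils.format_article_title(details['article']['title'], wiki)
      scoped_article_titles.include?(title)
    end
  end
end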
lib/revision_data_manager.rb
Outdated
      new_article: string_to_boolean(rev_data['new_article']),
      system: string_to_boolean(rev_data['system']),
      wiki_id: rev_data['wiki_id'],
      views: scoped_revision?(scoped_sub_data, mw_page_id, rev_data['mw_rev_id'])
Why is the views field used for this? If it's just a hack so that the Revision model can be used as the in-memory representation of a revision without saving it to the database, then, since we weren't using the views field for its original purpose anymore, we could put attr_accessor :scoped into Revision. Then Revision.new(scoped: value) will create a Revision object with an accessible scoped value that just lives in memory.
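A minimal sketch of that suggestion (assuming Revision is an ordinary ActiveRecord model):

class Revision < ApplicationRecord
  # Plain Ruby accessor, not a database column: the value is never persisted.
  attr_accessor :scoped
end

revision = Revision.new(scoped: true)
revision.scoped  # => true, lives only in memory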
lib/revision_stat_timeslice.rb
Outdated
#= Provides a count of recent revisions by a user(s)
class RevisionStatTimeslice
  REVISION_TIMEFRAME = 7
This constant name should make it clear that the value is in days.
Done on 96ea93e
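For illustration only (the actual rename in 96ea93e may differ), the point is to encode the unit in the constant name:

# Before: REVISION_TIMEFRAME = 7
# After: the name makes the unit explicit.
REVISION_TIMEFRAME_DAYS = 7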
…d according to index stats (#6198)
…nused views field
…enticate_id field
…re-synch-with-master
…aster [Data rearchitecture] Sync with master
…6203) * Add routes, controller and views to create a page to change timeslice duration * Link timeslice_duration view from admin view * Add basic specs for new controller