
Data rearchitecture for dashboard #6195

Open · wants to merge 150 commits into master

Conversation

ragesoss (Member)

No description provided.

gabina and others added 30 commits June 5, 2024 17:17
…ance (#5815)

* Add deployment file for the new data-rearchitecture-project-test instance
…project-test deployment (#5817)

* Bump nokogiri to 1.16.5 to see if this fixes the data-rearchitecture-project deployment
…yment (#5819)

* Update Gemfile.lock to avoid having mini_portile2 as a dependency, because it is apparently no longer necessary

* Add PASSENGER_INSTANCE_REGISTRY_DIR environment variable to the data-rearchitecture-project-test deploy stage
…rWikiTimeslice` models and migrations (#5807)

* Create CourseWikiTimeslice model and migration for it

* Create ArticleCourseTimeslice model and migration for it

* Create CourseUserWikiTimeslice model and migration for it
This PR adds a proof of concept for the data rearchitecture project. There are a lot of TODO comments that need to be handled in the future.

* Add a new RevisionDataManager class that fetches revisions and revision scores and keeps them in memory (replacing the previous Revision table).

* Add specs for RevisionDataManager class.

* Implement cache updating for ArticleCourseTimeslice.

* Add article_course_timeslices migration to remove the default value for the user_ids field.

* Add a new UpdateCourseStatsTimeslice class, the "timeslice" version of the existing UpdateCourseStats class. Instead of using the Revisions table, this new class works with previous timeslices and revisions in memory.

* Quick fix for RevisionScoreImporter specs. This should be a temporary fix.

* Implement update_all_caches_from_timeslices for ArticlesCourses class and specs for it.

* Call ArticlesCourses.update_all_caches_from_timeslices in UpdateCourseStatsTimeslice class and add specs for it.
…ster

Data re-architecture project: sync with master
I'm working on rebuilding and reshuffling the older servers on Wikimedia Cloud. The new location of the replica tool is going to be https://replica-revision-tools.wmcloud.org/
…-replica-endpoint

[Data Rearchitecture Project] Switch to new replica endpoint server
* Implement CourseUserWikiTimeslice.update_cache_from_revisions and specs for it

* Add update_all_caches_from_timeslices method for CoursesUsers and specs for it

* Update UpdateCourseStatsTimeslice class to call update_cache_from_revisions for course user wiki timeslices and add specs for it

* Switch to new replica endpoint server

I'm working on rebuilding and reshuffling the older servers on Wikimedia Cloud. The new location of the replica tool is going to be https://replica-revision-tools.wmcloud.org/

* Fix UpdateCourseStatsTimeslice spec

* Format fix

---------

Co-authored-by: Sage Ross <[email protected]>
…-sync-with-master

[Data Rearchitecture for dashboard] sync with master
* Add update_cache_from_revisions for CourseWikiTimeslice class and specs for it

* Add CourseCacheManager.update_cache_from_timeslices and specs for it

* Update course and course wiki timeslices caches in UpdateCourseStatsTimeslice class. Add specs for it
…cords from revisions in RAM (#5893)

* Implement update_from_course_revisions class method to create new ArticlesCourses records from course revisions in RAM. TODO for the future: the method should also remove existing ArticlesCourses records that no longer correspond to the course.

* Call the new update_from_course_revisions method to create new ArticlesCourses records when running UpdateCourseStatsTimeslice. Fix UpdateCourseStatsTimeslice specs now that view counts are updated.
… don't use join fields (#5899)

* Refactor ArticleCourseTimeslices model. Instead of having an article_course_id field, now it has two separate fields: article_id and course_id. Update specs.

* Refactor CourseUserWikiTimeslices model. Instead of having a single course_user_id field, now it has two separate fields: course_id and user_id. Update specs.

* Refactor CourseWikiTimeslices model. Instead of having a single course_wiki_id field, now it has two separate fields: course_id and wiki_id. Update specs.
- Hide 'Previous' and 'Next' buttons when there is only one page of articles
- Do not show 'Previous' button on the first page
- Do not show 'Next' button on the last page
- Resolves regression issue #5873
Addresses issue #5882 where articles with entirely removed edits
would indefinitely display "loading authorship data".
- Replace createStore with configureStore from Redux Toolkit
- Remove explicit thunk middleware (now included by default)
- Simplify middleware setup and add conditional checks
- Enable Redux DevTools only in non-production environments
- Remove manual compose enhancers configuration

This update improves maintainability and adopts Redux best practices.
gabina added 19 commits January 16, 2025 23:03
…ster

[Data rearchitecture] sync with master
* Pass in right update_service to UpdateCourseWikiTimeslices

* Pass in update_service to UpdateCourseWikiTimeslices

* Pass in the same debugger instance

* Add some basic logs through Rails.logger to identify when course metadata changed

* Make UpdateCourseWikiTimeslices return the number of timeslices processed and reprocessed.

* Log the number of processed and reprocessed timeslices
…#6152)

* Reduce the number of logs and sort them to make them easier to analyze.
* Add error field to LiftWingApi response

* Add error field to ReferenceCounterAPI response

* Do not set needs_update timeslice field to false if some revision is marked with error

* Specify wiki in ReferenceCounterApi spec

* Populate error in RevisionScoreApiHandler if some API call failed

* Turn on ithenticate_id if there was an error when getting revisions

* Add spec for UpdateCourseWikiTimeslices
…f the request returns 400 Bad Request, since it's not a transient error. (#6157)
* Improve comments

* Remove unnecessary min_max and max_min methods

* Move methods from TimesliceManager to new TimesliceCleaner class

* Use methods from TimesliceCleaner

* Removed unused methods from TimesliceManager class.

* Fix reset_timeslices_that_need_update_from_article_timeslices when passing in wiki

* Add new specs for TimesliceCleaner
* Clean up CourseWikiTimeslice model

* Clean up CourseUserWikiTimeslice model

* Clean up ArticleCourseTimeslice model

* Remove revision_count from ArticlesCourses

* No longer use the revision_count field from ArticlesCourses in UpdateWikiNamespaceStatsTimeslice
…nd unused code (#6169)

* Delete CleanArticlesCoursesWorker because update_from_course is already being called

* Stop importing wikidata summaries in daily update.

* Removed unnecessary TODO comments

* Remove commented code
…in reference-counter and liftwing APIs (#6176)

* Add 403 Forbidden responses to the non-transient errors for the reference counter API. Usually, this error means we lack permission to access the revision, possibly because it was deleted or hidden. That is not a transient error, so we don't want to mark the timeslice to be reprocessed.

* Add 404 Not Found responses to the non-transient errors for the reference counter API. This error occurs for some deleted revisions.

* Do not return the error field in the LiftWingApi response if the revision was deleted. That's not a transient error, so retrying the Lift Wing request doesn't make sense.
…tentType` liftwing API response (#6177)

* Add UnexpectedContentType as a non-transient error for LiftWingApi

* Minor refactor to LiftWingApi spec
…ster

[Data rearchitecture] Sync with master
… course dates. Now when the previous course period and the current one have no overlap, delete the existing timeslices and generate new ones. Add basic specs for it. (#6185)
* Add resolve_duplicates_for_timeslices method in DuplicateArticleDeleter class

* Resolve duplicate articles after fetching new revisions and importing articles
…nd there is no timeslice for the start ingestion date (#6188)
* Fix bug in timeslice creation when new previous start date

* Fix specs to work no matter how the TIMESLICE_DURATION env variable is set

* Add log when drastic change
…is set to 7 days. If the course start is less than one week ago, we don't want to try to retrieve a timeslice from 7 days ago, because that timeslice won't exist. Because of that, we now fix the start of the timeframe as the max between the course start and 7 days ago. (#6186)
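The clamping described in that commit message can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual implementation: `timeframe_start`, `course_start`, and `REVISION_TIMEFRAME_DAYS` are illustrative names.

```ruby
require 'date'

# Illustrative constant; the unit (days) is named explicitly, as the
# review below also asks for.
REVISION_TIMEFRAME_DAYS = 7

# Clamp the stats timeframe: never look further back than the course
# start, since no timeslice exists before it.
def timeframe_start(course_start, today = Date.today)
  [course_start, today - REVISION_TIMEFRAME_DAYS].max
end

# Course started 3 days ago: use the course start, not 7 days ago.
timeframe_start(Date.new(2025, 1, 1), Date.new(2025, 1, 4))
```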
@ragesoss (Member, Author) left a comment:
Just a few little things I spotted

@@ -235,6 +235,13 @@ def manual_update
redirect_to "/courses/#{@course.slug}"
end

def manual_update_timeslice
ragesoss (Member, Author):

I guess this was added just to make it easy to compare updates via the old method versus the new one?

During the cleanup phase, we'll want to remove this method and make manual_update use the timeslice approach.

Member:

> I guess this was added just to make it easy to compare updates via the old method versus the new one?

Yes, exactly. There are two routes: /manual_update and /manual_update_timeslice to easily compare metrics.


def filter_revisions(revisions)
filtered_data = revisions.select do |_, details|
wiki = Wiki.find(details['article']['wiki_id'])
ragesoss (Member, Author):

This seems like it might do a lot of duplicate Wiki queries. It looks like this is only ever called for one wiki at a time, via RevisionDataManager, so maybe the wiki can just be passed in as an argument?
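The suggested refactor could look roughly like this. It is a self-contained sketch, not the code actually merged: `ArticleUtils.format_article_title` is a simplified stand-in for the Dashboard helper, and the shape of the `revisions` hash is assumed from the snippet quoted below.

```ruby
# Simplified stand-in for the real helper, which also applies
# per-wiki title formatting rules.
module ArticleUtils
  def self.format_article_title(title, _wiki)
    title.tr(' ', '_')
  end
end

# The wiki is resolved once by the caller (e.g. RevisionDataManager)
# and passed in, so Wiki.find no longer runs once per revision.
def filter_revisions(wiki, revisions, scoped_article_titles)
  revisions.select do |_, details|
    formatted = ArticleUtils.format_article_title(details['article']['title'], wiki)
    scoped_article_titles.include?(formatted)
  end
end

revisions = {
  'r1' => { 'article' => { 'title' => 'Ada Lovelace' } },
  'r2' => { 'article' => { 'title' => 'Unrelated Page' } }
}
filter_revisions(:enwiki, revisions, ['Ada_Lovelace']).keys # keeps only 'r1'
```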

Member:

Yes, I think you're right.

Member:

Done on 14357c0

@@ -75,4 +75,22 @@ def use_start_and_end_times
def multiple_roles_allowed?
true
end

def filter_revisions(revisions)
ragesoss (Member, Author):

I would prefer not to duplicate this method across multiple course types. I guess the different implementations of scoped_article_titles are the only difference, so perhaps this method can be extracted somewhere else?

Member:

Hmm, I understand your point, because the following definition is duplicated:

  def filter_revisions(revisions)
    filtered_data = revisions.select do |_, details|
      wiki = Wiki.find(details['article']['wiki_id'])
      article_title = details['article']['title']
      formatted_article_title = ArticleUtils.format_article_title(article_title, wiki)
      scoped_article_titles.include?(formatted_article_title)
    end
    filtered_data
  end

However, except for article-scoped and visiting scholarship courses, the filter_revisions method does not have to filter anything. I think there is no scoped_article_titles definition that makes sense for basic courses, because it would have to contain all article titles.

I could extract the filter_revisions code used in article-scoped and visiting scholarship courses somewhere to avoid duplication, but I think that for the Course parent class, filter_revisions has to avoid doing any query. WDYT?

new_article: string_to_boolean(rev_data['new_article']),
system: string_to_boolean(rev_data['system']),
wiki_id: rev_data['wiki_id'],
views: scoped_revision?(scoped_sub_data, mw_page_id, rev_data['mw_rev_id'])
ragesoss (Member, Author):

Why is the views field used for this? If it's just a hack so that the Revision model can be used as the in-memory representation of a revision without saving it to the database, then since we weren't using the views field for its original purpose anymore, we could put attr_accessor :scoped into Revision; then Revision.new(scoped: value) will create a Revision object with an accessible scoped value that just lives in memory.
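The attr_accessor idea can be sketched in plain Ruby. The class below is a stand-in, not the real ActiveRecord Revision model; it only illustrates how attr_accessor yields an in-memory attribute that is never persisted.

```ruby
# Stand-in for the Dashboard's Revision model (the real one inherits
# from ApplicationRecord, whose initializer assigns attributes the
# same way: via the attribute writers).
class Revision
  attr_accessor :mw_rev_id, :scoped

  def initialize(attrs = {})
    # Assign each keyword through its writer, mimicking Revision.new(scoped: value).
    attrs.each { |key, value| public_send("#{key}=", value) }
  end
end

revision = Revision.new(mw_rev_id: 123, scoped: true)
revision.scoped # accessible in memory, never written to the database
```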

Member:

Yes, it's a hack, as you mentioned.

I used the attr_accessor strategy in c9510a5.

In addition, I followed the same approach for the error field (I was using ithenticate_id) in 66f123c.


#= Provides a count of recent revisions by a user(s)
class RevisionStatTimeslice
REVISION_TIMEFRAME = 7
ragesoss (Member, Author):

This constant name should make it clear that the value is in days.

Member:

Done on 96ea93e

Labels: none yet
Projects: none yet
10 participants