Plan of study scraper #23

Open
denniwang wants to merge 10 commits into main from plan-of-study-scraper

@denniwang

Contributor

Scrapes plans of study. Given a plan-of-study HTML table, it can scrape all of the properly formatted classes. Automated scraping for a list of URLs still needs to be implemented.

@denniwang denniwang requested a review from KobeZ123 February 5, 2025 13:46
todo: need to ignore empty json files (web pages with no plan of study)
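
One way the todo above might be handled, as a minimal sketch: the helper name `isEmptyScrape` is hypothetical, and it assumes a page with no plan of study produces `null`, an empty array, or an empty object. The actual output shape of the scraper is not shown in this PR, so the check would need adjusting to match it.

```typescript
// Hypothetical helper, not part of this PR.
// Assumption: "empty" means null/undefined, [] or {}.
function isEmptyScrape(value: unknown): boolean {
  if (value === null || value === undefined) return true;
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object") return Object.keys(value).length === 0;
  return false; // strings, numbers, etc. are treated as non-empty
}

// Keep only results that actually contain something before writing files.
function dropEmptyScrapes<T>(results: T[]): T[] {
  return results.filter((r) => !isEmptyScrape(r));
}
```

Filtering before the write step would avoid creating the empty JSON files in the first place, rather than cleaning them up afterwards.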
Comment thread src/urls/urls.ts Outdated
"https://nextcatalog.northeastern.edu/",
startYear === currentYear
? ""
: `archive/${startYear}-${startYear + 1}/#planofstudytext`,

Contributor

Is this the same code as scrapeMajorLinks()? If so, and only the suffix changes, we can merge the two methods by adding something like a suffix parameter:
scrapeMajorLinks('') and scrapeMajorLinks('#plan-of-study')
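
A rough sketch of the suffix-parameter idea, assuming the two methods differ only in how they build the catalog URL. The helper name `buildCatalogUrl`, its signature, and the placement of the suffix after the archive path are assumptions for illustration; only the base URL and the archive path pattern come from the snippet above.

```typescript
// Hypothetical helper, not the code in this PR.
const CATALOG_BASE = "https://nextcatalog.northeastern.edu/";

function buildCatalogUrl(
  startYear: number,
  currentYear: number,
  suffix = "", // e.g. "" for major links, "#planofstudytext" for plans of study
): string {
  // The current catalog lives at the root; earlier years live under archive/.
  const archivePath =
    startYear === currentYear ? "" : `archive/${startYear}-${startYear + 1}/`;
  return `${CATALOG_BASE}${archivePath}${suffix}`;
}
```

With a helper like this, both scrapers could share one code path and pass only the suffix they need.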

Contributor

Just a note: we don't want these files in the final version.

Comment thread src/runtime/index.ts Outdated
//PhaseLabel.ScrapeMajorLinks,
//scrapeMajorPlanLinks(year, currentYear),
//)
//.then(addPhase(spin, PhaseLabel.Classify, classify))

Contributor

Are the classify and tokenize phases removed? It would be useful to log the scraped tokens before parsing.
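
As a sketch of what logging tokens before parsing could look like in a promise chain: `logTokens` is a hypothetical pass-through step, and the token shape and the injectable `log` function are assumptions, not part of the existing pipeline.

```typescript
// Hypothetical pass-through step, not part of this PR.
// It logs each token and returns the input array unchanged,
// so it can sit between two phases of a promise chain.
function logTokens<T>(
  label: string,
  log: (msg: string) => void = console.log,
) {
  return (tokens: T[]): T[] => {
    log(`${label}: ${tokens.length} tokens`);
    for (const token of tokens) {
      log(JSON.stringify(token));
    }
    return tokens;
  };
}
```

It could then be dropped in with something like `.then(logTokens("tokenize"))` ahead of the parse phase, and removed again without touching the neighboring phases.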

@denniwang denniwang force-pushed the plan-of-study-scraper branch from bed9498 to 04b755c Compare March 11, 2025 17:11
2 participants