Crawler browses the student portfolio websites acting as user and stores the scraped data to the database for the research purposes.
install dependencies
npm istart crawl
npm run startcontinue last crawl
CRAWLEE_PURGE_ON_START=0 npm startCrawler stores data to portfolio_pages table in our Supabase Postgres database.
| column | description |
|---|---|
| url | crawled url used as unique key |
| data | scraped data in jsonb |
| created_at | timestamp of first creation |
| portfolio_id | uuid of the portfolio |
| updated_at | timestamp of last update (last crawl ended) |
The scraped data are stored as json data structure, see its documentation.
Crawled data are stored as jsonb type in the postgres database. There is variety of build-in functions and operators to query the json or even index its parameters.
get all pages with successfully parsed post-content (the where clause excludes pages such as 404, search, archive, tag etc.)
select * from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?| array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';get the plaintext of scraped posts
select data->'post-content'->'text' as text from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?| array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';get the title, description and open graph image of scraped posts
select data->>'title' as title, data->>'description' as description, data->'og'->'image'->'imageValue' as og_image from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?| array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';get the number of posts with and without parsed post-content
below are the stats from last crawl
| metric | count |
|---|---|
| total_crawled | 3837 |
| total_single | 1537 |
| total_page | 472 |
| single_without_tree | 0 |
| single_with_tree | 1537 |
| single_with_tree_published_at | 1442 |
| page_without_tree | 16 |
| page_with_tree | 456 |
| with_with_tree_published_at | 368 |
select
count(*) as total_crawled,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['single']
) as total_single,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['page']
) as total_page,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['single']
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') = 'null'
) as single_without_tree,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['single']
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') != 'null'
) as single_with_tree,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['single']
and json_typeof(data->'published-at') != 'null'
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') != 'null'
) as single_with_tree_published_at,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['page']
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') = 'null'
) as page_without_tree,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['page']
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') != 'null'
) as page_with_tree,
sum(1) filter (
where (
data->>'wordpress-pagetypes')::jsonb ?& array['page']
and json_typeof(data->'published-at') != 'null'
and json_typeof(data->'post-content') != 'null'
and json_typeof(data->'post-content'->'tree') != 'null'
) as with_with_tree_published_at
from
portfolio_pages