Advanced use cases

Sohom Datta edited this page May 22, 2024 · 1 revision

While the crawler works for most cases, sometimes you might want to customize it to do things it can't do by default. Here is a list of ways we have modified the crawler for past projects.

Using a different version of VisibleV8

To use a different instrumented Chrome binary (or a different version of VisibleV8), edit https://github.com/rekap-ncsu/vv8-crawler-slim/blob/main/celery_workers/vv8_worker.dockerfile and make the following modifications:

```diff
 # Copy chromium with VV8
-COPY --from=visiblev8/vv8-base:latest /opt/chromium.org/chromium/* /opt/chromium.org/chromium/
+COPY ./chrome_installer.deb .
+RUN apt install -y ./chrome_installer.deb
```

Using a custom variant of the vv8-postprocessor binary

Choose the option to build the postprocessors locally in the setup script.

Navigate to the celery_workers/visiblev8 folder and make your changes to the local checkout of the postprocessors. Once you are done making the changes, re-run the setup script.

If you plan on using a custom PostgreSQL schema to store your data, you should also edit vv8_backend_database/init/postgres_schema.sql to initialize the schema you want to dump data to.
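
As an illustration, such an addition to postgres_schema.sql could look like the sketch below. The table and column names here are hypothetical; they should match whatever your postprocessor variant actually emits.

```sql
-- Hypothetical example: a table for a custom postprocessor's output.
-- Adjust the names and types to match what your postprocessor writes.
CREATE TABLE IF NOT EXISTS my_custom_features (
    id           SERIAL PRIMARY KEY,
    visit_url    TEXT NOT NULL,      -- page the feature was observed on
    feature_name TEXT NOT NULL,      -- e.g. an instrumented API name
    use_count    INTEGER DEFAULT 0   -- how often it was used during the visit
);
```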

Using extensions

You can copy your extension directory into the celery_workers/vv8_worker/vv8_crawler directory and add the following two lines to celery_workers/vv8_worker/vv8_crawler/crawler.js:

```diff
     const default_crawler_args = [
                     "--disable-setuid-sandbox",
                     "--no-sandbox",
+                    `--load-extension=/app/node/extension_name`,
+                    `--disable-extensions-except=/app/node/extension_name`,
                     //'--enable-logging=stderr',
                     '--enable-automation',
                     //'--v=1'
                 ];
```
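
The effect of those two lines can be sketched as a small helper. The /app/node path and the extension_name directory are assumptions carried over from the diff above; substitute your own extension's directory name.

```javascript
// Sketch: build the Chromium launch arguments, optionally side-loading an
// unpacked extension that was copied into the crawler directory.
// The /app/node/<name> path assumes the container layout described above.
function buildCrawlerArgs(extensionName) {
  const args = [
    '--disable-setuid-sandbox',
    '--no-sandbox',
    '--enable-automation',
  ];
  if (extensionName) {
    const extPath = `/app/node/${extensionName}`;
    // Load the extension and disable every other one, so the crawl
    // environment stays predictable.
    args.push(`--load-extension=${extPath}`);
    args.push(`--disable-extensions-except=${extPath}`);
  }
  return args;
}
```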

Note: If your extension loads content scripts, you will also want to pass the --no-headless flag when crawling, like so:

```shell
python3 ./scripts/vv8-cli.py crawl -u 'https://google.com' -pp 'Mfeatures' --no-headless
```

Adding a custom user-agent/having the browser perform a custom action

To add a custom user agent, or to have the browser perform a custom action, edit celery_workers/vv8_worker/vv8_crawler/crawler.js. The crawler.js file is meant to be edited and customized, and it is mounted as a volume in the Docker container, so changes take effect immediately.
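
As a sketch of what such an edit could look like: the function name, the user-agent string, and the scroll action below are all illustrative, and `page` is assumed to be the Puppeteer page object that crawler.js already creates.

```javascript
// Illustrative hook to call from crawler.js after the page is created.
// `page` is assumed to expose the Puppeteer API (setUserAgent, evaluate).
async function preparePage(page, userAgent) {
  // Override the browser's default user agent before navigating.
  await page.setUserAgent(userAgent);
  // Example custom action: scroll to the bottom of the page so that
  // lazily-loaded scripts execute and get captured by VisibleV8.
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
}
```

Because crawler.js is mounted as a volume, saving the file is enough; the next crawl picks up the change without rebuilding the image.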

Sequential crawling

When running the setup script, you can set the number of workers to 1. This ensures only one Celery worker gets created, which disables concurrent crawling. The order of the crawls will follow their ordering in the csv/txt file passed to the crawl script.