Commit c851b23

MQ37 and matyascimbulka authored
fix: number of google pages results, limited perms, improvements from WCC (#81)
* fix: number of google pages results
* Update src/crawlers.ts
  Co-authored-by: Matyáš Cimbulka <[email protected]>
* Update src/crawlers.ts
  Co-authored-by: Matyáš Cimbulka <[email protected]>
* refactor: user data handling for the serp pagination
* refactor: remove duplicate code
* fix: update request handler to use addRequests for search crawler
* make code more typescripty
* add var for better readability
* feat: update libs for limited perms, fix and speed up Dockerfile build, fix wrong imports, firefox patches and policies, NY TZ (#82)
* feat: update libs for limited perms, fix and speed up Dockerfile build, fix wrong imports
* magic number to const
* lint
* fix lint, update crawlee
* add firefox policies, patches, set NY TZ
* add ghostery blocker
* Squashed commit of the following:
  commit 7aadecd
  Author: MQ37 <[email protected]>
  Date: Wed Nov 26 11:14:09 2025 +0100
      add var for better readability
  commit fd02b0a
  Author: MQ37 <[email protected]>
  Date: Wed Nov 26 11:09:57 2025 +0100
      make code more typescripty
* fix playwright version, fix types.ts
* fix playwright version for the patch
* fix merge issue
* Update src/utils.ts
  Co-authored-by: Matyáš Cimbulka <[email protected]>
---------
Co-authored-by: Matyáš Cimbulka <[email protected]>
1 parent c7e66e7 commit c851b23

16 files changed: +3615 -2401 lines changed

.actor/Dockerfile

Lines changed: 22 additions & 15 deletions
@@ -1,7 +1,10 @@
 # Specify the base Docker image. You can read more about
 # the available images at https://crawlee.dev/docs/guides/docker-images
 # You can also use any other image from Docker Hub.
-FROM apify/actor-node-playwright-chrome:22-1.46.0 AS builder
+# use node base image as builder to speed up the build step instead of using the full playwright image
+FROM apify/actor-node:22 AS builder
+# override the default working directory set in the base image
+WORKDIR /home/myuser
 
 # Copy just package.json and package-lock.json
 # to speed up the build using Docker layer cache.
@@ -18,12 +21,17 @@ COPY --chown=myuser . ./
 # Don't audit to speed up the installation.
 RUN npm run build
 
+# Build Ghostery blockers for content filtering
+RUN npm run build:playwright-blockers
+
 # Create final image
-FROM apify/actor-node-playwright-firefox:22-1.46.0
+FROM apify/actor-node-playwright-firefox:22-1.54.1
 
 # Copy just package.json and package-lock.json
 # to speed up the build using Docker layer cache.
 COPY --chown=myuser package*.json ./
+COPY --chown=myuser policies.json ./
+COPY --chown=myuser patches ./patches
 
 # Install NPM packages, skip optional and development dependencies to
 # keep the image small. Avoid logging too much and print the dependency
@@ -38,28 +46,27 @@ RUN npm --quiet set progress=false \
 && npm --version \
 && rm -r ~/.npm
 
-# Remove the existing firefox installation
-RUN rm -rf ${PLAYWRIGHT_BROWSERS_PATH}/*
-
-# Install all required playwright dependencies for firefox
-RUN npx playwright install firefox
-# symlink the firefox binary to the root folder in order to bypass the versioning and resulting browser launch crashes.
-RUN ln -s ${PLAYWRIGHT_BROWSERS_PATH}/firefox-*/firefox/firefox ${PLAYWRIGHT_BROWSERS_PATH}/
-
-# Overrides the dynamic library used by Firefox to determine trusted root certificates with p11-kit-trust.so, which loads the system certificates.
-RUN rm $PLAYWRIGHT_BROWSERS_PATH/firefox-*/firefox/libnssckbi.so
-RUN ln -s /usr/lib/x86_64-linux-gnu/pkcs11/p11-kit-trust.so $(ls -d $PLAYWRIGHT_BROWSERS_PATH/firefox-*)/firefox/libnssckbi.so
-
 # Copy built JS files from builder image
 COPY --from=builder --chown=myuser /home/myuser/dist ./dist
 
+# Copy Ghostery blockers from builder image
+COPY --from=builder --chown=myuser /home/myuser/blockers ./blockers
+
 # Next, copy the remaining files and directories with the source code.
 # Since we do this after NPM install, quick build will be really fast
 # for most source file changes.
 COPY --chown=myuser . ./
 
+# Edit the TZ environment variable to set the timezone in the container.
+# Most of the proxy traffic is from the US, so we set the timezone to New York,
+# which can help with the bot-detection mechanisms of some websites.
+ENV TZ=America/New_York
+
+# Configure Firefox policies
+ENV PLAYWRIGHT_FIREFOX_POLICIES_JSON="/home/myuser/policies.json"
+
 # Disable experimental feature warning from Node.js
 ENV NODE_NO_WARNINGS=1
 
 # Run the image.
-CMD npm run start:prod --silent
+CMD ["npm", "run", "start:prod", "--silent"]
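
The TZ change above pins the container's wall clock to New York. As a rough illustration only (not part of this commit), the sketch below shows what that zone means for a given instant. The helper name `wallClockIn` is hypothetical, and the sample instant is the UTC equivalent of the timestamp on squashed commit 7aadecd. It passes the zone explicitly to `Intl.DateTimeFormat` rather than reading the `TZ` variable, so it behaves the same on any machine.

```typescript
// Hypothetical helper, not taken from the repository: returns the HH:MM
// wall-clock time that a given instant has in a given IANA timezone.
function wallClockIn(timeZone: string, instant: Date): string {
    const parts = new Intl.DateTimeFormat('en-US', {
        timeZone,
        hour12: false,
        hour: '2-digit',
        minute: '2-digit',
    }).formatToParts(instant);

    let hour = '';
    let minute = '';
    for (const part of parts) {
        if (part.type === 'hour') hour = part.value;
        if (part.type === 'minute') minute = part.value;
    }
    return `${hour}:${minute}`;
}

// 11:14:09 +0100 on 26 Nov 2025 is 10:14:09 UTC.
const instant = new Date('2025-11-26T10:14:09Z');
console.log(wallClockIn('America/New_York', instant)); // 05:14 (EST, UTC-5)
console.log(wallClockIn('UTC', instant)); // 10:14
```

A process that honors `TZ=America/New_York` would be expected to report the first value as its local time. Whether this actually affects a site's bot detection is the commit author's assumption and is not verified here.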

.dockerignore

Lines changed: 4 additions & 0 deletions
@@ -16,3 +16,7 @@ node_modules
 data
 src/storage
 dist
+
+# Ghostery blockers (will be rebuilt in Docker)
+blockers
+

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -14,3 +14,6 @@ storage
 # Actor run input
 input.json
 INPUT.json
+
+# Ghostery blockers (generated during build)
+blockers/**
