Description
My Ubuntu server has a 4-core CPU and 8 GB of RAM. In scrapyd.conf, I set the following:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 500
dbs_dir = dbs
max_proc = 60
#max_proc_per_cpu = 8
max_proc_per_cpu = 12
finished_to_keep = 100
#poll_interval = 5.0 # default
poll_interval = 1.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
max_proc_per_cpu is set to 12 and the server has 4 cores, so in theory up to 48 processes should be running on the CPU at the same time. The reason I have a server with this configuration is that I wanted to run as many spiders on it at the same time as possible.
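For reference, here is a minimal sketch of how Scrapyd derives its process cap according to its documentation (it mirrors the documented behaviour of max_proc and max_proc_per_cpu, not Scrapyd's actual source). Note that, per the docs, a non-zero max_proc takes precedence over max_proc_per_cpu, so with the config above the cap would be 60 rather than 48:

from multiprocessing import cpu_count

def effective_max_proc(max_proc: int, max_proc_per_cpu: int) -> int:
    # A non-zero max_proc wins; otherwise fall back to per-CPU scaling.
    if max_proc:
        return max_proc
    return cpu_count() * max_proc_per_cpu

print(effective_max_proc(60, 12))  # -> 60 with the config above
print(effective_max_proc(0, 12))   # -> 48 on a 4-core machine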
However, when I checked Scrapyd, I noticed that there are typically fewer than 15 jobs running at the same time. After some additional observation, I noticed that if I restart the server, the number of running jobs is in the range 45-60 (60 being the max_proc), but after some time (60-90 minutes), the number of running jobs drops from 45-60 to around 10-15.
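Something like the following could log the job counts over time and show exactly when the drop happens (a sketch; it assumes Scrapyd is reachable at localhost:6800 as configured, and uses the daemonstatus.json endpoint listed above):

import json
import time
from urllib.request import urlopen

# Sample the daemon status once a minute and print the job counts.
while True:
    with urlopen("http://localhost:6800/daemonstatus.json") as resp:
        status = json.load(resp)
    print(time.strftime("%H:%M:%S"),
          "running:", status["running"],
          "pending:", status["pending"])
    time.sleep(60)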
In htop, I checked the server's usage:
- Core 0: 100%
- Core 1: 100%
- Core 2: 99%
- Core 3: 100%
- Mem: 4.05GB/7.75GB
- Swp: 1.15GB/4.00GB
I thought that the reason only 10-15 jobs are running instead of 48 might be that the server is running out of memory, but that does not seem to be the case.
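To double-check the memory hypothesis, a small psutil sketch like the one below could sum the memory actually used by the Scrapy processes (it assumes psutil is installed; matching on "scrapy" in the command line is just a heuristic):

import psutil

total_rss = 0
count = 0
for proc in psutil.process_iter(["cmdline", "memory_info"]):
    cmdline = proc.info["cmdline"] or []
    # Heuristic: count any process whose command line mentions scrapy.
    if any("scrapy" in part for part in cmdline):
        total_rss += proc.info["memory_info"].rss
        count += 1

print(f"{count} scrapy processes, {total_rss / 2**20:.0f} MiB RSS total")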
The CPUs are used to their full extent, which is what I wanted: to fully utilize them.
If I restart scrapyd, then for the next hour or so there will again be 48-60 jobs running at the same time, and after another hour it drops back to 10-15.
Why is that? Ideally, I would like the server to be processing 48 jobs at any given time. Why is that not happening? What am I missing here?