
After a while, Scrapyd stops picking up the maximum number of jobs for processing defined in "max_proc_per_cpu" #534

Closed
aaronm137 opened this issue Nov 6, 2024 · 5 comments
Labels: type: question (a user support question)

Comments

aaronm137 commented Nov 6, 2024

My Ubuntu server has a 4-core CPU and 8 GB of RAM. In scrapyd.conf, I set the following:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 500
dbs_dir     = dbs
max_proc    = 60
#max_proc_per_cpu = 8
max_proc_per_cpu = 12
finished_to_keep = 100
#poll_interval = 5.0 # default
poll_interval = 1.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

max_proc_per_cpu is set to 12 and the server has 4 cores, so in theory up to 48 processes should be running at the same time. I chose a server of this configuration because I wanted to run as many spiders simultaneously as possible.

However, when I checked Scrapyd, there were typically fewer than 15 jobs running at the same time. After some additional observation, I noticed that if I restart the server, the number of running jobs is in the range 45-60 (60 being max_proc), but after some time (60-90 minutes) it drops to around 10-15.

In htop, I checked the usage of the server:

- Core 0: 100%
- Core 1: 100%
- Core 2: 99%
- Core 3: 100%
- Mem: 4.05GB/7.75GB
- Swp: 1.15GB/4.00GB

I thought the reason only 10-15 jobs run instead of 48 might be that the server is running out of memory, but that does not seem to be the case.
The CPUs are used to their full extent, which is what I wanted: to fully utilize them.

If I restart Scrapyd, then for the next hour or so there will again be 48-60 jobs running at the same time, and after another hour it drops back to 10-15.

Why is that? Ideally, the server would be processing 48 jobs at any given time. What am I missing here?
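To put numbers on the drop, a minimal polling sketch can log the counts reported by daemonstatus.json once a minute (assuming Scrapyd listens on localhost:6800 as in the config above, and that the requests package is installed):

import time
import requests

# Poll Scrapyd's daemonstatus.json and timestamp the job counts, so the
# moment the running count drops from ~48 to ~10-15 can be pinpointed.
while True:
    status = requests.get("http://localhost:6800/daemonstatus.json", timeout=10).json()
    print(time.strftime("%H:%M:%S"),
          "running:", status.get("running"),
          "pending:", status.get("pending"))
    time.sleep(60)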

@jpmckinney (Contributor)

What's your Scrapyd version?

@jpmckinney added the label type: question (a user support question) on Nov 6, 2024
@jpmckinney (Contributor)

Also, max_proc_per_cpu has no effect if max_proc is set. You can try max_proc = 0, which causes max_proc_per_cpu to be respected.
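
In other words, the effective cap is computed roughly like this (a simplified sketch of the behavior described above, not Scrapyd's actual source):

import multiprocessing

def effective_max_proc(max_proc: int, max_proc_per_cpu: int) -> int:
    # A non-zero max_proc wins outright; max_proc_per_cpu is ignored.
    if max_proc:
        return max_proc
    # Only when max_proc = 0 does the per-CPU setting apply.
    return multiprocessing.cpu_count() * max_proc_per_cpu

print(effective_max_proc(60, 12))  # 60: the config in this issue caps at max_proc
print(effective_max_proc(0, 12))   # 48 on a 4-core machine: 4 * 12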

@aaronm137 (Author)

The Scrapyd version is 1.4.3. Testing it now with

max_proc    = 0
max_proc_per_cpu = 12

and I will post the progress here.

@jpmckinney (Contributor)

Okay. Ideally, you should also try version 1.5.0.

@jpmckinney (Contributor)

  • Core 0: 100%
  • Core 1: 100%
  • Core 2: 99%
  • Core 3: 100%

Your server is already at full utilization. Getting Scrapyd to run more Scrapy processes won't make anything go faster. Your options are:

  • Review your Scrapy spiders so that they use less CPU (see the sketch after this list)
  • Add more CPUs

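One way to see where the CPU time is going is to sample the Scrapy processes directly, for example with psutil (a hypothetical helper, not something from this thread; requires the psutil package):

import time
import psutil

# Find running Scrapy crawl processes by their command line.
procs = [p for p in psutil.process_iter(["cmdline"])
         if any("scrapy" in arg for arg in (p.info["cmdline"] or []))]

for p in procs:
    p.cpu_percent(None)  # first call primes the per-process counter
time.sleep(1.0)          # sampling interval

usage = []
for p in procs:
    try:
        usage.append((p.cpu_percent(None), p.pid, " ".join(p.info["cmdline"][:4])))
    except psutil.NoSuchProcess:
        continue  # process finished between samples

# Heaviest spiders first; these are the candidates to optimize.
for cpu, pid, cmd in sorted(usage, reverse=True):
    print(f"{pid:6d} {cpu:5.1f}% {cmd}")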