Skip to content

Conversation

@Gallaecio
Copy link
Contributor

To do:
  1. Docs:

    1. Read the entire scrapy-crawl-maps docs and complete it.

    2. Write docs about writing custom nodes.

    3. Document that inputs may be sent to multiple nodes, and so mutable
      inputs should not be modified in place. Recommend using copy() or
      deepcopy() as needed.

    4. Document how it is possible to download from a non-fetch node using
      crawler.engine.download(). Recommend using a fetch node instead to let
      spider middlewares work as usual, but recommend it for complex cases
      where it is needed, e.g. combining multiple requests in a node.

      Test that combining multiple requests in a node works as expected this
      way.

    5. In crawl map definition docs, document how to find supported node
      types (link to the reference of built-in docs, explain how to read
      the reference (node type ID, ports, args), and give an example.
      Also mention the ability of writing your own node types. And of
      course, explain what a node type is, as opposed to a node.

    6. Make sure that all APIs from the global __init__ are covered in
      the reference docs.

    7. Document how to read stats.

    8. Document the io logic: how outputs are duplicated when ports point
      to multiple nodes, and how spiderless nodes are executed only once
      or once per request depending on where on the map they are.

    9. Update the docs to include actual diagrams along with JSON examples, as a
      visual aid.

      Use https://www.sphinx-doc.org/en/master/usage/extensions/graphviz.html

      See if it is possible to automate the graphviz code generation from the
      JSON, to make things easier going forward.

      Consider rendering the node in a separate tab. Figure out if we can build
      some custom Sphinx directive that handles all of this automatically.

    10. Move away from having CSS selectors on crawl maps. Remove the node
      types and instead promote writing custom nodes.

    11. Document supported port types: request, response, item, string.

  2. Changes related to the schema and crawl map builders:

    1. Design a JSON-based language to support exposing to the UI all
      supported nodes, links and settings through spider metadata, and
      implement it in all spiders minimizing duplication (i.e. try to reuse
      code, e.g. from I/O specs used at the code level).

      1. Support assigning groups to global options. The initial plan is to have
        2 groups, Configuration and Job options, the latter of which will
        include additional options specific to Scrapy Cloud (e.g. SC units).
        Groups render as tabs in the UI next to the main tab containing the
        visual crawl map (Builder).

        Also allow marking an option to prompt for overriding it when running a
        job.

    2. Allow nodes to define extra metadata, e.g.

      • A documentation URL

      • An example output item

    3. Support short syntax for port links when there are multiple ports but
      only 1 has a matching type.

    4. Support hard-coding an initial crawl map for CrawlMapSpider,
      implementing a spider based on the Scrapy tutorial.

      In tests, make sure that those initial crawl maps are valid crawl maps.

  3. Work on an API to make existing spiders expose a crawl map and log stats
    based on their callbacks, e.g. with some decotators.

  4. Have CrawlMap validate as much as possible as part of Pydantic
    validation, before instantiation.

  5. Support inputless fetch nodes.

  6. Enforce and test new restrictions:

    • Disallow subgraphs involving multiple response input ports, on same
      or different nodes. That includes:

      • A node with 2 input ports of response type.

      • A node with 1 input port of response type that directly or
        indirectly, without a fetch node (request → response) in the
        middle, outputs a request or item to another node that also has
        a response as input that it gets from a different fetch.

        and another one of item type.

      However, allow:

      • Response processing nodes (response → response).

      • A node with 1 input port of response type that directly or
        indirectly, without a fetch node (request → response) in the
        middle, outputs a request or item to another node that also has
        a response as input that it gets from the same fetch node.

  7. Add more node types:

    1. Add a UrlRegexSplitter node that supports input requests and has an
      output port for matches and an output port for non-matches. Include a
      test for it.
  8. Improve test coverage:

    1. Add a test where the same output port is connected to 2 input ports,
      causing the crawl map to send the same data to both input ports.

    2. Add a test that checks that checks which requests are sent from start()
      and which requests are sent from a callback.

      urls1 → fetch1 → parser1 → fetch1 → parser3
      urls2 → fetch2 → parser2 → fetch3 → parser3

      Urls from urls1 and urls2 should be start requests, while URls from
      parser1 and parser2 sould be sent from a callback.

    3. Add a test for a next page link node that allows limiting the
      pagination.

    4. Test all combinations (that make sense) of port types in single-input,
      single-output nodes.

      | Consider (skip the ones that don't make sense):
      | request → request
      | request → response
      | request → item
      | request → string
      | response → request
      | response → response
      | response → item
      | response → string
      | item → request
      | item → response
      | item → item
      | item → string
      | string → request
      | string → response
      | string → item
      | string → string

    5. Test all combinations of multiple inputs and single output or multiple
      outputs from a single input that make sense.

    6. Test a scenario where strings and URLs are mixed in a node to build
      requests, as a proof of concept for such nodes, testing how the
      implementation would have to be done so that requests are sent as soon
      as possible, but ensuring that all requests are eventually sent. It
      probably requires some special handling of the input queue of strings,
      since it should be only read once but its output should be reused for
      every input request. Maybe we can implement a utility function for
      that.

    7. Test scenarios where nodes fail to defined the required methods, define
      them incorrectly or raise exceptions.

    8. Test that a crawl map spec defining more than 1 node type with the same
      type ID fails (e.g. 2 “fetch” node types, corresponding to different
      node classes that both use the same type string).

    9. Test persistent request metadata across 2 fetch nodes.

      urls → fetch1 → node1 → fetch2 → node2 → fetch1

      If node1 sets metadata, it should be available in requests from fetch2.

    10. Test that a spider node that outputs requests can link to itself.

    11. Review warnings triggered by tox -e py.

    12. Test passing a crawl map as a file path or as a Python dict
      instead of a string. Also test passing an invalid file path, a path
      to an unexistign file, a path to a folder, and other error
      scenarios. Include testing for leading spaces when passing a
      string.

  9. Code cleanup:

    1. Cleanly discard outputs not connected to anything, do not keep them in
      memory.

    2. Try to use _resolve_deps in _iter_output to reduce its complexity.

    3. Cache _resolve_deps.

    4. Refactor the implementation, specially _iter_output, to make it more
      readable.

    5. Switch to Ruff, e.g. to get all sorted automatically.

    6. Solve ignored PLR0912 and PLR0915 issues

  10. Self review:

    1. Rendered docs.

    2. PR.

@Gallaecio Gallaecio requested review from Copilot, kmike and wRAR May 21, 2025 14:34
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces the initial implementation of documentation, CI/CD workflows, pre-commit hooks, and configuration files to support the scrapy-crawl-maps project.

  • Added Sphinx documentation configuration and a custom Sphinx extension for cross references
  • Configured ReadTheDocs, pre-commit hooks (Ruff and Blacken-docs), and GitHub workflows for testing and publishing
  • Included a Codecov configuration for coverage reporting

Reviewed Changes

Copilot reviewed 18 out of 31 changed files in this pull request and generated no comments.

Show a summary per file
File Description
docs/conf.py Sphinx configuration for project documentation
docs/_ext/init.py Setup for custom Sphinx cross-reference directives
.readthedocs.yml ReadTheDocs configuration for building the docs
.pre-commit-config.yaml Pre-commit hook configurations for code quality tools
.github/workflows/test.yml CI configuration for running tests across multiple Python versions
.github/workflows/publish.yml CI configuration for publishing releases to PyPI
.codecov.yml Codecov configuration for managing code coverage reports
Files not reviewed (13)
  • CHANGES.rst: Language not supported
  • LICENSE: Language not supported
  • README.rst: Language not supported
  • docs/Makefile: Language not supported
  • docs/api.rst: Language not supported
  • docs/changes.rst: Language not supported
  • docs/customize.rst: Language not supported
  • docs/index.rst: Language not supported
  • docs/make.bat: Language not supported
  • docs/map.rst: Language not supported
  • docs/requirements.txt: Language not supported
  • docs/setup.rst: Language not supported
  • docs/tutorial.rst: Language not supported
Comments suppressed due to low confidence (1)

.pre-commit-config.yaml:11

  • The YAML indentation for the blacken-docs hook appears to be off, which might cause parsing errors. Please adjust the indentation to align with the required structure.
- id: blacken-docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant