Conversation

@vannestn (Contributor) commented Oct 1, 2025

Some pending edits we can skip for this first review:

  • The example schema for "### Why MongoDB for Newsletter Pipeline" is incorrect and needs to be fixed.
  • Some bulleting can be reduced.

Note

Adds a Jupyter notebook that scrapes AI content, processes it via Unstructured into MongoDB, and generates summaries/executive briefs; also ignores .DS_Store.

  • Notebook: notebooks/Agentic-Weekly-AI-News-TLDR.ipynb
    • Data collection: Scrapes ArXiv PDFs and AI company blogs (Hugging Face, OpenAI, DeepLearning.AI, Anthropic) via Firecrawl; uploads to S3.
    • Processing pipeline: Creates Unstructured S3 source and MongoDB destination connectors; builds workflow (hi_res partition + chunk_by_page); runs and monitors jobs.
    • MongoDB integration: Prepares/clears target collection; stores structured elements/chunks with metadata.
    • Newsletter generation: Retrieves processed content; produces per-document summaries and a ~700-word executive brief using OpenAI (LangChain) with configurable prompts.
  • Config/Deps: Installs required packages; loads env vars (.env support) and validates required keys (see the sketch below).
  • Housekeeping: Updates .gitignore to ignore .DS_Store.

Written by Cursor Bugbot for commit 5a67621.
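For illustration, here is a minimal sketch of the configuration check and S3 upload steps summarized above. The environment variable names, bucket, and file paths are placeholders, not the notebook's actual values:

    import os

    import boto3
    from dotenv import load_dotenv

    # Load environment variables from a local .env file, if present.
    load_dotenv()

    # Fail fast if any key the pipeline depends on is missing.
    # The exact variable names here are illustrative, not the notebook's.
    REQUIRED_KEYS = ["UNSTRUCTURED_API_KEY", "FIRECRAWL_API_KEY", "OPENAI_API_KEY", "MONGODB_URI"]
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {missing}")

    # Upload a scraped document to S3 so the Unstructured workflow can pick it up.
    # Bucket and object key are placeholders.
    s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
    s3.upload_file(
        Filename="scraped/example_arxiv_paper.pdf",
        Bucket="my-ai-news-bucket",
        Key="weekly/example_arxiv_paper.pdf",
    )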


@vannestn vannestn requested a review from MKhalusova October 2, 2025 14:40
@vannestn (Contributor, Author) commented Oct 2, 2025

Images like these and the references to them will be removed in the next draft.

[Image: example document]

[Image: example output]

@vannestn (Contributor, Author) commented Oct 2, 2025

Will remove this redundant explanation:

Production Deployment: In a real implementation, you would schedule these scraping scripts to run daily (e.g., via cron job, AWS Lambda, or GitHub Actions). Each day's content would accumulate in S3, and at the end of the week, you'd run the processing and summarization pipeline to generate your newsletter.

For This Demo: We're scraping 7 days of content in one batch to simulate a week's worth of daily collection. This gives us enough diverse content to demonstrate the full pipeline without waiting a week.

@MKhalusova (Collaborator) left a comment

Hi Nick!
Thank you for working on this notebook. In addition to what you pointed out as pending edits, here are a few things that I would like to note.

Some cleanup is needed:

  • Images are missing from the markdown in two places: Example Output Data Structure and Example Document Content. Instead, a broken image placeholder is rendered.
  • def verify_customer_support_results(job_id=None) - the function name is a leftover from the previous notebook and doesn't match this new topic; please rename it to something generic (see the sketch after this list).
  • Please refactor to remove this legacy function:
def run_verification_with_images(job_id):
    """
    Legacy wrapper function - now just calls verify_customer_support_results with job_id.
    Use verify_customer_support_results(job_id) directly instead.
    """
    verify_customer_support_results(job_id)
  • Remove comments like this from the final version: AWS_REGION = os.getenv("AWS_REGION") # No default value as requested.
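One possible shape of the requested rename, sketched here with a hypothetical MongoDB-based check; the database and collection names are placeholders, not the notebook's actual code:

    import os

    from pymongo import MongoClient

    def verify_processing_results(job_id=None):
        """Generic replacement for verify_customer_support_results: report how many
        processed elements landed in the target MongoDB collection."""
        client = MongoClient(os.getenv("MONGODB_URI"))
        collection = client["newsletter"]["processed_elements"]  # placeholder names
        count = collection.count_documents({})
        print(f"Job {job_id}: {count} processed elements in MongoDB")
        return count

With a generic name like this, the legacy run_verification_with_images wrapper quoted above can simply be deleted.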

Broader issue:
The original requirement was to have an agentic setup. See the ticket description: https://app.asana.com/1/1205771759461250/project/1211227331310166/task/1211385625012202?focus=true
Here’s a quote:

When user asks the Agent to provide a weekly report:
1. the Agent should check if there are documents in the S3 bucket. If there are documents there, trigger a job for the pre-configured workflow via API into the permanent AstraDB collection. It should wait for the docs to be processed. Once the docs are processed, retrieve all the documents that were processed that week (using corresponding metadata - see the docs) to generate a report. Also, once docs are processed, empty the bucket.
2. This should be a two-agent setup. One agent is an orchestrator: it checks for documents and runs workflows; another agent creates summaries. Handle the case where all of the documents over the week won't fit into the context window.

While you do generate a newsletter with several LLM calls, I do not see the requested agentic setup.
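For reference, a minimal sketch of the S3-checking half of the quoted requirement. The bucket name and workflow ID are placeholders, and the workflow trigger is left as a stub, since the actual call depends on the pre-configured workflow setup:

    import boto3

    BUCKET = "weekly-ai-news-inbox"  # placeholder bucket name

    def trigger_workflow(workflow_id: str) -> None:
        # Stub: in the real setup this would run the pre-configured
        # processing workflow via its API and wait for the job to finish.
        raise NotImplementedError

    def orchestrate_weekly_report() -> None:
        s3 = boto3.client("s3")
        listing = s3.list_objects_v2(Bucket=BUCKET)
        objects = listing.get("Contents", [])
        if not objects:
            print("No documents in the bucket; nothing to process.")
            return
        trigger_workflow("pre-configured-workflow-id")  # placeholder ID
        # Once processing completes, empty the bucket as the requirement describes.
        s3.delete_objects(
            Bucket=BUCKET,
            Delete={"Objects": [{"Key": obj["Key"]} for obj in objects]},
        )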

@vannestn (Contributor, Author) commented Oct 3, 2025

Quoting the review above: "While you do generate a newsletter with several LLM calls, I do not see the requested agentic setup."

Thanks for the follow-up! I have more details about our rescoping call in the Asana ticket. Let's touch base there.


@vannestn (Contributor, Author) commented Oct 3, 2025

I've included the print statements from running the notebook. On GitHub, the block of response text in the Agentic Execution sections is very large and distracting, but in Google Colab we do not see large blocks of text. I am going to leave this as-is, given that Google Colab is our target publication destination.

https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/agentic-weekly-ai-news-tldr/notebooks/Agentic-Weekly-AI-News-TLDR.ipynb
[Screenshot, 2025-10-03]

@MKhalusova (Collaborator) left a comment

Hi @vannestn! Thank you for adding the agentic setup that was missing in the first iteration. There are a few polishing touches that the notebook still requires before we can merge it.

  1. There are duplicated imports throughout the notebook; please remove the duplicates. For example, from unstructured_client import UnstructuredClient appears twice, and there are others.

  2. The manual workflow creation section has duplicates as well.
    The section "Creating Your Document Processing Workflow" is not needed and can be removed: def create_image_workflow_nodes() configures the partitioner and chunker nodes and is used in create_single_workflow(), but def create_single_workflow() (a function to create a workflow) is never called anywhere. The following sections, "Starting Your Document Processing Job", "Monitoring Your Document Processing Progress", and "Pipeline Execution Summary", assume a workflow was created, but it wasn't. In fact, you actually create a workflow later on.

  3. Section "Orchestrating Your Complete Document Processing Pipeline": "We'll now execute the pipeline in distinct steps, allowing you to monitor progress at each stage: preprocessing, connector setup, workflow creation, execution, and results validation." - this is where the workflow setup actually happens, so the sections mentioned in #2 are not needed and can be cleaned up.

  4. "Step 2-3: Create Data Connectors" is unnecessary, as these connectors were already created earlier in the notebook: there is no need to call create_s3_source_connector() again (the S3 source connector id=643599ad-2e56-4f00-b94b-e2f6bdbeaa3a already exists), nor to create another destination connector (a70289ba-e38e-4406-8ec2-87f501d36c45 already exists).

Once the above comments are addressed, we should be able to merge the PR.

@vannestn (Contributor, Author) commented Oct 9, 2025

R1 Feedback Resolution Summary

1. Removed Duplicate Imports

Centralized all Unstructured SDK and dependencies imports to the initial setup section. Removed duplicate imports from the Orchestrator Agent section.

2. Removed Unused Workflow Creation Sections

Deleted the following sections that described workflow creation but contained functions that were never called:

  • "Creating Your Document Processing Workflow" section and its functions (create_image_workflow_nodes(), create_single_workflow())
  • "Starting Your Document Processing Job" section and its function (run_workflow())
  • "Monitoring Your Document Processing Progress" section
  • "Pipeline Execution Summary" section

Kept functions still in use: poll_job_status(), print_pipeline_summary(), verify_pipeline_results()
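For context, a generic sketch of what a poll_job_status helper like the one kept above typically does; the status getter is a stub, since the exact API call is notebook-specific:

    import time

    def get_job_status(job_id: str) -> str:
        # Stub: the notebook's real helper would query the processing job's status here.
        raise NotImplementedError

    def poll_job_status(job_id: str, interval_seconds: int = 30, timeout_seconds: int = 1800) -> str:
        """Poll a processing job until it reaches a terminal state or times out."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            status = get_job_status(job_id)
            print(f"Job {job_id} status: {status}")
            if status in ("COMPLETED", "FAILED", "CANCELLED"):
                return status
            time.sleep(interval_seconds)
        raise TimeoutError(f"Job {job_id} did not finish within {timeout_seconds} seconds")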

3. Updated Orchestration Section Description

Changed "preprocessing, connector setup, workflow creation, execution, and results validation" to "preprocessing, workflow creation, execution, and results validation" since connector setup doesn't happen in the orchestration section.

4. Removed Duplicate Connector Creation

The orchestration section previously had a "Step 2-3: Create Data Connectors" that was creating new S3 source and MongoDB destination connectors even though these were already created earlier in the notebook. Removed this step entirely and updated the workflow creation step to use the existing source_id and destination_id variables from the earlier S3 and MongoDB connector sections.

Steps renumbered:

  • Step 1: MongoDB Preprocessing
  • Step 2: Create Processing Workflow (previously Step 4)
  • Step 3: Execute Workflow (previously Step 5)
  • Step 4: Pipeline Summary (previously Step 6)

Additional Enhancement

Added automatic chunked summarization for large documents (>20k tokens) to prevent API timeouts. Documents are split into 40k character chunks, summarized individually, then combined into a coherent final summary.
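A minimal sketch of that chunked-summarization approach; the 20k-token threshold and 40k-character chunk size mirror the description above, while the model name, prompts, and helper structure are illustrative rather than the notebook's exact code:

    import tiktoken
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is illustrative
    TOKEN_LIMIT = 20_000    # switch to chunked mode above this many tokens
    CHUNK_CHARS = 40_000    # character-based chunk size, per the description above
    encoding = tiktoken.get_encoding("cl100k_base")

    def summarize_document(text: str) -> str:
        # Small documents: one summarization call is enough.
        if len(encoding.encode(text)) <= TOKEN_LIMIT:
            return llm.invoke(f"Summarize this document:\n\n{text}").content
        # Large documents: summarize fixed-size chunks, then combine the partial summaries.
        chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
        partial = [
            llm.invoke(f"Summarize this part of a larger document:\n\n{chunk}").content
            for chunk in chunks
        ]
        return llm.invoke(
            "Combine these partial summaries into one coherent summary:\n\n" + "\n\n".join(partial)
        ).content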

Just a brief reminder about the formatting of the notebook outputs after running it end-to-end: #21 (comment)
