Conversation

@vannestn (Contributor) commented Oct 1, 2025

Some pending edits we can skip for this first review:

  • The example schema for "### Why MongoDB for Newsletter Pipeline" is incorrect and needs to be fixed.
  • Some bulleting can be reduced.

Note

Adds a Jupyter notebook that scrapes AI content, processes it via Unstructured into MongoDB, and generates summaries/executive briefs; also ignores .DS_Store.

  • Notebook: notebooks/Agentic-Weekly-AI-News-TLDR.ipynb
    • Data collection: Scrapes ArXiv PDFs and AI company blogs (Hugging Face, OpenAI, DeepLearning.AI, Anthropic) via Firecrawl; uploads to S3.
    • Processing pipeline: Creates Unstructured S3 source and MongoDB destination connectors; builds workflow (hi_res partition + chunk_by_page); runs and monitors jobs.
    • MongoDB integration: Prepares/clears target collection; stores structured elements/chunks with metadata.
    • Newsletter generation: Retrieves processed content; produces per-document summaries and a ~700-word executive brief using OpenAI (LangChain) with configurable prompts.
  • Config/Deps: Installs required packages; loads env vars (.env support) and validates required keys (see the sketch below).
  • Housekeeping: Updates .gitignore to ignore .DS_Store.

Written by Cursor Bugbot for commit 5a67621.
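For illustration, here is a minimal sketch of the configuration check and S3 upload steps summarized above. The environment variable names, bucket, and file paths are placeholders, not the notebook's actual values:

    import os

    import boto3
    from dotenv import load_dotenv

    # Load environment variables from a local .env file, if present.
    load_dotenv()

    # Fail fast if any key the pipeline depends on is missing.
    # The exact variable names here are illustrative, not the notebook's.
    REQUIRED_KEYS = ["UNSTRUCTURED_API_KEY", "FIRECRAWL_API_KEY", "OPENAI_API_KEY", "MONGODB_URI"]
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {missing}")

    # Upload a scraped document to S3 so the Unstructured workflow can pick it up.
    # Bucket and object key are placeholders.
    s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION", "us-east-1"))
    s3.upload_file(
        Filename="scraped/example_arxiv_paper.pdf",
        Bucket="my-ai-news-bucket",
        Key="weekly/example_arxiv_paper.pdf",
    )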


@vannestn vannestn requested a review from MKhalusova October 2, 2025 14:40
@vannestn (Contributor, Author) commented Oct 2, 2025

Images like these and the references to them will be removed in the next draft.

[Image: example document]

[Image: example output]

@vannestn (Contributor, Author) commented Oct 2, 2025

Will remove this redundant explanation:

Production Deployment: In a real implementation, you would schedule these scraping scripts to run daily (e.g., via cron job, AWS Lambda, or GitHub Actions). Each day's content would accumulate in S3, and at the end of the week, you'd run the processing and summarization pipeline to generate your newsletter.

For This Demo: We're scraping 7 days of content in one batch to simulate a week's worth of daily collection. This gives us enough diverse content to demonstrate the full pipeline without waiting a week.

@MKhalusova (Collaborator) left a comment

Hi Nick!
Thank you for working on this notebook. In addition to what you pointed out as pending edits, here are a few things that I would like to note.

Some cleanup is needed:

  • Images are missing from the markdown in two places: Example Output Data Structure and Example Document Content. Instead, a broken image placeholder is rendered.
  • def verify_customer_support_results(job_id=None) - the function name is a leftover from the previous notebook and doesn't match this new topic; please rename it to something generic (see the sketch after this list).
  • Please refactor to remove this legacy function:
def run_verification_with_images(job_id):
    """
    Legacy wrapper function - now just calls verify_customer_support_results with job_id.
    Use verify_customer_support_results(job_id) directly instead.
    """
    verify_customer_support_results(job_id)
  • Remove comments like this from the final version: AWS_REGION = os.getenv("AWS_REGION") # No default value as requested.
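One possible shape of the requested rename, sketched here with a hypothetical MongoDB-based check; the database and collection names are placeholders, not the notebook's actual code:

    import os

    from pymongo import MongoClient

    def verify_processing_results(job_id=None):
        """Generic replacement for verify_customer_support_results: report how many
        processed elements landed in the target MongoDB collection."""
        client = MongoClient(os.getenv("MONGODB_URI"))
        collection = client["newsletter"]["processed_elements"]  # placeholder names
        count = collection.count_documents({})
        print(f"Job {job_id}: {count} processed elements in MongoDB")
        return count

With a generic name like this, the legacy run_verification_with_images wrapper quoted above can simply be deleted.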

Broader issue:
The original requirement was to have an agentic setup. See the ticket description: https://app.asana.com/1/1205771759461250/project/1211227331310166/task/1211385625012202?focus=true
Here’s a quote:

When user asks the Agent to provide a weekly report:
1. the Agent should check if there are documents in the S3 bucket. If there are documents there, trigger a job for the pre-configured workflow via API into the permanent AstraDB collection. It should wait for the docs to be processed. Once the docs are processed, retrieve all the documents that were processed that week (using corresponding metadata - see the docs) to generate a report. Also, once docs are processed, empty the bucket.
2. This should be a two-agent setup. One agent is an orchestrator: it checks for documents and runs workflows; another agent creates summaries. Handle the case where all of the documents over the week won't fit into the context window.

While you do generate a newsletter with several LLM calls, I do not see the requested agentic setup.
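For reference, a minimal sketch of the S3-checking half of the quoted requirement. The bucket name and workflow ID are placeholders, and the workflow trigger is left as a stub, since the actual call depends on the pre-configured workflow setup:

    import boto3

    BUCKET = "weekly-ai-news-inbox"  # placeholder bucket name

    def trigger_workflow(workflow_id: str) -> None:
        # Stub: in the real setup this would run the pre-configured
        # processing workflow via its API and wait for the job to finish.
        raise NotImplementedError

    def orchestrate_weekly_report() -> None:
        s3 = boto3.client("s3")
        listing = s3.list_objects_v2(Bucket=BUCKET)
        objects = listing.get("Contents", [])
        if not objects:
            print("No documents in the bucket; nothing to process.")
            return
        trigger_workflow("pre-configured-workflow-id")  # placeholder ID
        # Once processing completes, empty the bucket as the requirement describes.
        s3.delete_objects(
            Bucket=BUCKET,
            Delete={"Objects": [{"Key": obj["Key"]} for obj in objects]},
        )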

@vannestn (Contributor, Author) commented Oct 3, 2025

Quoting the review above: "While you do generate a newsletter with several LLM calls, I do not see the requested agentic setup."

Thanks for the follow-up! I have more details about our rescoping call in the Asana ticket. Let's touch base there.


@vannestn (Contributor, Author) commented Oct 3, 2025

I've included the print statements from running the notebook. On GitHub, the block of response text in the Agentic Execution sections is very large and distracting, but in Google Colab we do not see large blocks of text. I am going to leave this as-is, given that Google Colab is our target publication destination.

https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/agentic-weekly-ai-news-tldr/notebooks/Agentic-Weekly-AI-News-TLDR.ipynb
[Screenshot, 2025-10-03]

@MKhalusova (Collaborator) left a comment

Hi @vannestn! Thank you for adding the agentic setup that was missing in the first iteration. There are a few polishing touches that the notebook still requires before we can merge it.

  1. There are duplicated imports throughout the notebook; please remove the duplicates. For example, from unstructured_client import UnstructuredClient appears twice, and there are others.

  2. The manual workflow creation section has duplicates as well.
    The section "Creating Your Document Processing Workflow" is not needed and can be removed: def create_image_workflow_nodes() configures the partitioner and chunker nodes and is used in create_single_workflow(), but def create_single_workflow() (a function to create a workflow) is never called anywhere. The following sections, "Starting Your Document Processing Job", "Monitoring Your Document Processing Progress", and "Pipeline Execution Summary", assume a workflow was created, but it wasn't. In fact, you actually create a workflow later on.

  3. Section "Orchestrating Your Complete Document Processing Pipeline": "We'll now execute the pipeline in distinct steps, allowing you to monitor progress at each stage: preprocessing, connector setup, workflow creation, execution, and results validation." - this is where the workflow setup actually happens, so the sections mentioned in #2 are not needed and can be cleaned up.

  4. "Step 2-3: Create Data Connectors" is unnecessary, as these connectors were already created earlier in the notebook: there is no need to call create_s3_source_connector() again (the S3 source connector id=643599ad-2e56-4f00-b94b-e2f6bdbeaa3a already exists), nor to create another destination connector (a70289ba-e38e-4406-8ec2-87f501d36c45 already exists).

Once the above comments are addressed, we should be able to merge the PR.

@vannestn (Contributor, Author) commented Oct 9, 2025

R1 Feedback Resolution Summary

1. Removed Duplicate Imports

Centralized all Unstructured SDK and dependencies imports to the initial setup section. Removed duplicate imports from the Orchestrator Agent section.

2. Removed Unused Workflow Creation Sections

Deleted the following sections that described workflow creation but contained functions that were never called:

  • "Creating Your Document Processing Workflow" section and its functions (create_image_workflow_nodes(), create_single_workflow())
  • "Starting Your Document Processing Job" section and its function (run_workflow())
  • "Monitoring Your Document Processing Progress" section
  • "Pipeline Execution Summary" section

Kept functions still in use: poll_job_status(), print_pipeline_summary(), verify_pipeline_results()
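For context, a generic sketch of what a poll_job_status helper like the one kept above typically does; the status getter is a stub, since the exact API call is notebook-specific:

    import time

    def get_job_status(job_id: str) -> str:
        # Stub: the notebook's real helper would query the processing job's status here.
        raise NotImplementedError

    def poll_job_status(job_id: str, interval_seconds: int = 30, timeout_seconds: int = 1800) -> str:
        """Poll a processing job until it reaches a terminal state or times out."""
        deadline = time.time() + timeout_seconds
        while time.time() < deadline:
            status = get_job_status(job_id)
            print(f"Job {job_id} status: {status}")
            if status in ("COMPLETED", "FAILED", "CANCELLED"):
                return status
            time.sleep(interval_seconds)
        raise TimeoutError(f"Job {job_id} did not finish within {timeout_seconds} seconds")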

3. Updated Orchestration Section Description

Changed "preprocessing, connector setup, workflow creation, execution, and results validation" to "preprocessing, workflow creation, execution, and results validation" since connector setup doesn't happen in the orchestration section.

4. Removed Duplicate Connector Creation

The orchestration section previously had a "Step 2-3: Create Data Connectors" that was creating new S3 source and MongoDB destination connectors even though these were already created earlier in the notebook. Removed this step entirely and updated the workflow creation step to use the existing source_id and destination_id variables from the earlier S3 and MongoDB connector sections.

Steps renumbered:

  • Step 1: MongoDB Preprocessing
  • Step 2: Create Processing Workflow (previously Step 4)
  • Step 3: Execute Workflow (previously Step 5)
  • Step 4: Pipeline Summary (previously Step 6)

Additional Enhancement

Added automatic chunked summarization for large documents (>20k tokens) to prevent API timeouts. Documents are split into 40k character chunks, summarized individually, then combined into a coherent final summary.
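A minimal sketch of that chunked-summarization approach; the 20k-token threshold and 40k-character chunk size mirror the description above, while the model name, prompts, and helper structure are illustrative rather than the notebook's exact code:

    import tiktoken
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is illustrative
    TOKEN_LIMIT = 20_000    # switch to chunked mode above this many tokens
    CHUNK_CHARS = 40_000    # character-based chunk size, per the description above
    encoding = tiktoken.get_encoding("cl100k_base")

    def summarize_document(text: str) -> str:
        # Small documents: one summarization call is enough.
        if len(encoding.encode(text)) <= TOKEN_LIMIT:
            return llm.invoke(f"Summarize this document:\n\n{text}").content
        # Large documents: summarize fixed-size chunks, then combine the partial summaries.
        chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
        partial = [
            llm.invoke(f"Summarize this part of a larger document:\n\n{chunk}").content
            for chunk in chunks
        ]
        return llm.invoke(
            "Combine these partial summaries into one coherent summary:\n\n" + "\n\n".join(partial)
        ).content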

Just a brief reminder about the formatting of the notebook outputs after running it end-to-end: #21 (comment)
