added agentic-weekly-ai-news-tldr #21
base: main
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. (Powered by ReviewNB)
Images like these and references to them will be removed in the next draft.

Will remove this redundant explanation.
Hi Nick!
Thank you for working on this notebook. In addition to what you pointed out as pending edits, here are a few things that I would like to note.
Some cleanup is needed:
- Images are missing from the markdown in two places, Example Output Data Structure and Example Document Content. Instead, this is rendered:
[[IMG:EXAMPLE_DOCUMENT_IMAGE]] # Image disabled - use --include-images to enable
- The function verify_customer_support_results(job_id=None) is named for the previous notebook's topic and doesn't match this one; please rename it to something generic.
- Please also remove this legacy function:
def run_verification_with_images(job_id):
"""
Legacy wrapper function - now just calls verify_customer_support_results with job_id.
Use verify_customer_support_results(job_id) directly instead.
"""
verify_customer_support_results(job_id)
- Remove comments like this from the final version:
AWS_REGION = os.getenv("AWS_REGION") # No default value as requested
Broader issue:
The original requirement was to have an agentic setup; see the ticket description: https://app.asana.com/1/1205771759461250/project/1211227331310166/task/1211385625012202?focus=true
Here’s a quote:
When the user asks the Agent to provide a weekly report:
1. The Agent should check if there are documents in the S3 bucket. If there are documents there, it should trigger a job for the pre-configured workflow via the API into the permanent AstraDB collection. It should wait for the docs to be processed. Once the docs are processed, it should retrieve all the documents that were processed that week (using the corresponding metadata - see the docs) to generate a report. Also, once the docs are processed, it should empty the bucket.
2. This should be a two-agent setup. One agent is an orchestrator that checks for documents and runs workflows; another agent creates summaries. Handle the case where all of the documents over the week won't fit into the context window.
While you do generate a newsletter with several LLM calls, I do not see the requested agentic setup.
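To make the expected shape concrete, here is a rough sketch of that two-agent flow. Every helper below is a hypothetical stand-in, not an existing notebook or SDK function, so treat it as pseudocode that happens to run:

```python
import time

# --- hypothetical stand-ins for the real integrations ---
def list_bucket_keys(bucket):       # would wrap boto3's list_objects_v2
    return ["ai-news-1.html", "ai-news-2.html"]

def trigger_workflow(workflow_id):  # would start the pre-configured workflow via API
    return "job-123"

def job_is_done(job_id):            # would poll the job-status endpoint
    return True

def fetch_week_docs():              # would query AstraDB by processed-date metadata
    return ["doc text one", "doc text two"]

def empty_bucket(bucket):           # would delete the processed objects from S3
    print(f"emptied {bucket}")

def summarizer_agent(docs):
    """Agent 2: turns the week's processed documents into a report."""
    return "Weekly report:\n" + "\n".join(f"- {d[:60]}" for d in docs)

def orchestrator_agent(bucket, workflow_id):
    """Agent 1: checks S3, runs the workflow, waits, cleans up, hands off."""
    if not list_bucket_keys(bucket):
        return "No documents in the bucket this week."
    job_id = trigger_workflow(workflow_id)
    while not job_is_done(job_id):
        time.sleep(30)              # wait for the docs to be processed
    docs = fetch_week_docs()
    empty_bucket(bucket)            # empty the bucket once processing finishes
    return summarizer_agent(docs)

print(orchestrator_agent("news-bucket", "workflow-abc"))
```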
Thanks for the follow-up! I have more details about our rescoping call in the Asana ticket. Let's touch base there.
I've included the print statements from running the notebook. In GitHub, the block of response text in the Agentic Execution sections is very large and distracting, but in Google Colab we do not see large blocks of text. I am going to leave this as-is, given Google Colab is our target publication destination: https://colab.research.google.com/github/Unstructured-IO/notebooks/blob/agentic-weekly-ai-news-tldr/notebooks/Agentic-Weekly-AI-News-TLDR.ipynb
Hi @vannestn! Thank you for adding the agentic setup that was missing in the first iteration. There are a few polishing touches the notebook still requires before we can merge it.
1. There are duplicated imports throughout the notebook; please remove the duplicates. For example, from unstructured_client import UnstructuredClient appears twice, and there are others. The manual workflow creation section has duplicates as well.
2. The section "Creating Your Document Processing Workflow" is not needed and can be removed. create_image_workflow_nodes() configures the partitioner and chunker nodes and is used in create_single_workflow; however, create_single_workflow (a function to create a workflow) is never called anywhere. The following sections - "Starting Your Document Processing Job", "Monitoring Your Document Processing Progress", and "Pipeline Execution Summary" - assume a workflow was created, but it wasn't. In fact, you actually create a workflow later on.
3. Section "Orchestrating Your Complete Document Processing Pipeline" says: "We'll now execute the pipeline in distinct steps, allowing you to monitor progress at each stage: preprocessing, connector setup, workflow creation, execution, and results validation." This is where the workflow setup actually happens, so the sections mentioned in #2 are not needed and can be cleaned up.
4. "Step 2-3: Create Data Connectors" is unnecessary, as you've already created these connectors earlier in the notebook. There's no need to call create_s3_source_connector(); you already created an S3 connector early on (id=643599ad-2e56-4f00-b94b-e2f6bdbeaa3a). There's also no need to create a destination connector again; you already created one early in the notebook (a70289ba-e38e-4406-8ec2-87f501d36c45). A sketch of the intended fix follows below.
Once the above comments are addressed, we should be able to merge the PR.
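For item 4, a minimal sketch of the intended fix: reuse the existing connector IDs when creating the workflow, instead of calling the create_* helpers again. The client and model names follow my reading of the unstructured-client SDK docs; treat the exact request shapes (and the WorkflowType value) as assumptions to verify:

```python
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import CreateWorkflowRequest
from unstructured_client.models.shared import CreateWorkflow, WorkflowType

client = UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"))

# Reuse the connectors created earlier in the notebook instead of
# creating new ones inside the orchestration section.
EXISTING_S3_SOURCE_ID = "643599ad-2e56-4f00-b94b-e2f6bdbeaa3a"
EXISTING_MONGODB_DESTINATION_ID = "a70289ba-e38e-4406-8ec2-87f501d36c45"

response = client.workflows.create_workflow(
    request=CreateWorkflowRequest(
        create_workflow=CreateWorkflow(
            name="weekly-ai-news-workflow",
            source_id=EXISTING_S3_SOURCE_ID,
            destination_id=EXISTING_MONGODB_DESTINATION_ID,
            # assumption: use whichever workflow type the notebook already configures
            workflow_type=WorkflowType.BASIC,
        )
    )
)
```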
R1 Feedback Resolution Summary

1. Removed duplicate imports. Centralized all Unstructured SDK and dependency imports in the initial setup section, and removed the duplicate imports from the Orchestrator Agent section.
2. Removed unused workflow creation sections. Deleted the sections that described workflow creation but contained functions that were never called; kept the functions still in use.
3. Updated the orchestration section description. Changed "preprocessing, connector setup, workflow creation, execution, and results validation" to "preprocessing, workflow creation, execution, and results validation", since connector setup doesn't happen in the orchestration section.
4. Removed duplicate connector creation. The orchestration section previously had a "Step 2-3: Create Data Connectors" step that created new S3 source and MongoDB destination connectors, even though these were already created earlier in the notebook. Removed this step entirely, updated the workflow creation step to use the existing connectors, and renumbered the remaining steps.

Additional enhancement: added automatic chunked summarization for large documents (>20k tokens) to prevent API timeouts. Documents are split into 40k-character chunks, summarized individually, then combined into a coherent final summary (sketched below).

Just a brief reminder about the formatting of the notebook outputs after running it end-to-end.
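The chunked summarization mentioned above could look roughly like this; summarize is a placeholder for whatever LLM call the notebook actually uses, and character length stands in for the notebook's token-based (>20k) trigger:

```python
def summarize_large_document(text, summarize, chunk_size=40_000):
    """Summarize text that may exceed the model's context window.

    `summarize` is any callable that maps a prompt string to a summary
    string (e.g., a thin wrapper around the notebook's LLM client).
    """
    # Small documents fit in one call; characters approximate the token check.
    if len(text) <= chunk_size:
        return summarize(text)

    # Split into fixed-size character chunks and summarize each separately.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = [summarize(chunk) for chunk in chunks]

    # Combine the partial summaries into one coherent final summary.
    return summarize(
        "Combine these partial summaries into one coherent summary:\n\n"
        + "\n\n".join(partials)
    )

# Example with a trivial stand-in summarizer:
print(summarize_large_document("x" * 100_000, lambda t: t[:20] + "..."))
```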
Some pending edits we can skip for this first review:
Note
Adds a Jupyter notebook that scrapes AI content, processes it via Unstructured into MongoDB, and generates summaries/executive briefs; also ignores .DS_Store.

- notebooks/Agentic-Weekly-AI-News-TLDR.ipynb: processes content via Unstructured (hi_res partition + chunk_by_page); runs and monitors jobs.
- .gitignore: updated to ignore .DS_Store.

Written by Cursor Bugbot for commit 5a67621. This will update automatically on new commits.