Empty JSON Output for prompt using LLM #893

Open
SumanthMeenan opened this issue Jan 15, 2025 · 12 comments
Labels
bug Something isn't working

Comments

@SumanthMeenan

I want to extract all the providers listed at this URL: https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en

I could use Selenium, BeautifulSoup, etc., but I came across this tool. I'm getting an empty array as output:
Found providers ['openai', 'azure_openai'] for model gpt-3.5-turbo-0125, using openai.
If it was not intended please specify the model provider in the graph configuration
--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en) ---
--- Executing ParseNode Node ---
Error occurred: not enough values to unpack (expected 2, got 0)


dosubot bot commented Jan 15, 2025

Hello @SumanthMeenan! I'm here to help you with any bugs, questions, or contributions you may have. Let's work together to solve this issue.

The error "not enough values to unpack (expected 2, got 0)" typically occurs when the code is trying to unpack a sequence (like a tuple or list) into variables, but the sequence doesn't contain enough elements. In the context of Scrapegraph-ai, this might happen during the parsing stage if the expected data structure isn't being returned from the Fetch Node.

Here are a few steps you can take to troubleshoot and resolve this issue:

  1. Check the Fetch Node Output: Ensure that the Fetch Node is correctly retrieving the HTML content. Since the URL involves dynamic content, make sure that the Fetch Node is configured to handle JavaScript-rendered content, possibly using a headless browser like ChromiumLoader.

  2. Verify the ParseNode Configuration: Ensure that the ParseNode is correctly configured to handle the structure of the HTML content. If the HTML structure doesn't match the expected format, it could lead to unpacking errors.

  3. Model Provider Configuration: The message suggests specifying the model provider in the graph configuration. Ensure that the correct model provider is set in your configuration to avoid any issues related to model selection (see the configuration sketch after these steps).

  4. Debugging: Add logging or print statements to check the intermediate outputs between nodes. This can help identify where the data structure might not be as expected.

If these steps don't resolve the issue, you might need to inspect the specific implementation details of the Fetch and Parse nodes to ensure they align with the webpage's content structure.
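As an illustration of steps 1 and 3, a configuration along these lines enables headless rendering and names the provider explicitly. This is a minimal sketch following ScrapeGraphAI's documented graph configuration; the API key is a placeholder:

    from scrapegraphai.graphs import SmartScraperGraph

    graph_config = {
        "llm": {
            "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
            # The "openai/" prefix names the provider explicitly, which avoids the
            # "Found providers ['openai', 'azure_openai']" ambiguity warning.
            "model": "openai/gpt-3.5-turbo-0125",
        },
        "headless": True,  # render JavaScript-heavy pages in a headless browser
        "verbose": True,   # log intermediate node outputs (step 4)
    }

    smart_scraper_graph = SmartScraperGraph(
        prompt="List all the providers shown on the page",
        source="https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en",
        config=graph_config,
    )
    print(smart_scraper_graph.run())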


dosubot added the bug (Something isn't working) label on Jan 15, 2025
@VinciGit00
Collaborator

Can you fetch the code yourself?

@clemlesne

This issue happens often, I would say 50% of the time. I'm using SearchGraph with 20 results from SearXNG. The whole pipeline crashes, which is completely impractical.

A simple condition could exclude the document, or a retry mechanism could be implemented for empty page_content; see the sketch below.
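Something along those lines, as a purely illustrative sketch (not actual ScrapeGraphAI code; docs, fetch, and url are assumed names):

    # Hypothetical guard: drop documents whose fetched page_content is empty
    # or whitespace-only before they reach the parsing/chunking step.
    docs = [doc for doc in docs if doc.page_content and not doc.page_content.isspace()]

    # Or retry the fetch a few times before excluding the document.
    for attempt in range(3):
        html = fetch(url)  # `fetch` stands in for the Fetch Node's loader
        if html and not html.isspace():
            break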

@Kilowhisky

Kilowhisky commented Jan 18, 2025

This happens often and at random for me as well. It should be a recoverable error.

@VinciGit00
Collaborator

Which model are you using?

@Kilowhisky

I'm on GPT-4o-mini with DuckDuckGo and scrape_do (but I've also seen it on local Chrome), and I've seen it with other combos as well.

Here's the output of the error:
"errorMessage": "Traceback (most recent call last):
  File \"/var/task/index.py\", line 128, in scrape
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/search_graph.py\", line 110, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 358, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 303, in _execute_standard
    raise e
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 276, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 200, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 77, in execute
    state = asyncio.run(self._async_execute(state, batchsize))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/runners.py\", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/runners.py\", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/base_events.py\", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 141, in _async_execute
    answers = await tqdm.gather(
              ^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/tqdm/asyncio.py\", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File \"/usr/lib/python3.12/asyncio/tasks.py\", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/tqdm/asyncio.py\", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 131, in _async_run
    return await asyncio.to_thread(graph.run)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/threads.py\", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/concurrent/futures/thread.py\", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/smart_scraper_graph.py\", line 292, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 358, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 303, in _execute_standard
    raise e
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 276, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 200, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/parse_node.py\", line 96, in execute
    chunks = split_text_into_chunks(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/utils/split_text_into_chunks.py\", line 30, in split_text_into_chunks
    chunks = chunk(
             ^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/semchunk/semchunk.py\", line 186, in chunk
    chunks, offsets = zip(*[(chunk, offset) for chunk, offset in zip(chunks, offsets) if chunk and not chunk.isspace()]) if chunks else ([], []) # NOTE `if chunks else ([], [])` ensures that we don't unpack an empty list if there's no chunks (i.e., if the provided text was empty).
    ^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 0)
"

@Jdh235

Jdh235 commented Jan 25, 2025

I am using SearchGraph and am running into the same error when using 'gpt-4o-mini'. I do not get the error when using 'gpt-4o'; however, gpt-4o is far too expensive for most people.

umarbutler added a commit to isaacus-dev/semchunk that referenced this issue Feb 13, 2025
@umarbutler

umarbutler commented Feb 13, 2025

Hey everyone, sincerest apologies about this. I've just fixed this in v3.0.4 of semchunk (please run pip install semchunk --upgrade to benefit from the new version; there are also some minor performance enhancements in v3.0.3 that you'll benefit from), and I've ensured there is now a unit test for this edge case.

The problem was that I was filtering out empty chunks, including chunks consisting entirely of whitespace, and then trying to unpack the filtered chunks, which doesn't work if there's nothing left to unpack.
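In miniature, the failure mode was this (a simplified reconstruction, not the actual semchunk source; the v3.0.4 fix may differ in detail):

    # All-whitespace chunks survive semchunk's outer `if chunks` guard, but the
    # filter then leaves zero pairs, so zip(*[]) yields nothing to unpack.
    chunks = ["   ", "\n"]
    offsets = [(0, 3), (3, 4)]

    filtered = [(c, o) for c, o in zip(chunks, offsets) if c and not c.isspace()]
    # chunks, offsets = zip(*filtered)  # ValueError: not enough values to unpack (expected 2, got 0)

    # Guarding on the filtered list instead avoids the crash:
    chunks, offsets = zip(*filtered) if filtered else ((), ())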

PS If you ever do get any other issues with semchunk or other libraries, it is often better to raise those issues on the source library to make their creators aware. Thank you @Jdh235 for doing exactly that!

@dschuld

dschuld commented Feb 15, 2025

I had the same error as mentioned. After the upgrade, the error does not appear anymore, but instead the returned content is hallucinated and has nothing to do with the web page I am trying to scrape.

@umarbutler

I had the same error as mentioned. After the upgrade, the error does not appear anymore, but instead the returned content is hallucinated and has nothing to do with the web page I am trying to scrape.

This sounds like a SearchGraph bug. Presumably, if no input is being provided to semchunk then you're going to get no output, and if you get no output then I guess it makes sense that the model might start to hallucinate.
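As a stopgap on the caller's side, one could check that the fetched page is non-empty before trusting the graph's answer. Purely illustrative, not part of ScrapeGraphAI; requests stands in here for the Fetch Node:

    import requests

    # Illustrative pre-flight check: an empty page produces empty chunks, which
    # in turn invites a hallucinated answer, so refuse to scrape nothing.
    url = "https://example.com/page-to-scrape"  # placeholder
    html = requests.get(url, timeout=30).text
    if not html.strip():
        raise RuntimeError(f"Fetched page is empty, refusing to scrape: {url}")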

@VinciGit00
Collaborator

@umarbutler what do you suggest as a solution?

@umarbutler

@umarbutler what do you suggest as a solution?

To be honest, I'm not familiar with ScrapeGraphAI myself, so I can't help out here.

This is a better question for the devs. @DPende @LorenzoPaleari
