
3 llm workflow #4

Merged: daavoo merged 27 commits into main from 3-llm-workflow on Jan 13, 2025
Conversation

daavoo (Contributor) commented Dec 27, 2024

Preprocessing

Uses pymupdf4llm to convert input_file to markdown.
Then uses langchain_text_splitters to split the markdown into sections based on the headers.
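
For reference, a minimal sketch of this preprocessing step; the file path, header names, and variable names are illustrative and not the PR's actual code:

```python
# Illustrative sketch of the preprocessing described above (not the PR's code).
import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter

input_file = "example_data/paper.pdf"  # hypothetical input file

# Convert the input PDF to markdown.
md_text = pymupdf4llm.to_markdown(input_file)

# Split the markdown into sections based on the headers.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(md_text)  # list of Documents, one per section
```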

Workflow

Uses a single model and 2 different prompts:

  • To find the appropriate section based on the question.
  • To answer the question using the information available in the section previously found.

Runs in a loop until the correct answer is found, an invalid section is queried, or there are no sections left.

The process can be followed in the debug logs.
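
For reference, a minimal sketch of what such a find/answer loop can look like; the prompt templates, the `model` callable, and the `MORE_INFO` sentinel are placeholders rather than the PR's actual code:

```python
# Illustrative sketch of the loop described above (not the PR's code).
MORE_INFO = "I need more info"  # placeholder sentinel meaning "answer not in this section"


def find_retrieve_answer(question, sections, model, find_prompt, answer_prompt):
    """sections: dict mapping section name -> section text."""
    remaining = dict(sections)
    while remaining:
        # 1) FIND: ask the model which section should contain the answer.
        section_name = model(
            find_prompt.format(question=question, sections=list(remaining))
        ).strip()
        if section_name not in remaining:
            return None  # invalid section queried -> stop
        section_text = remaining.pop(section_name)

        # 2) ANSWER: try to answer the question using only that section.
        answer = model(
            answer_prompt.format(question=question, section=section_text)
        ).strip()
        if answer != MORE_INFO:
            return answer  # answer found
        # Otherwise loop again and pick another section.
    return None  # no sections left
```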


  • Codespaces Setup
    bash .github/setup.sh
  • Demo
    python -m streamlit run demo/app.py
    (screenshot of the Streamlit demo)
  • CLI
    structured-qa --from_config example_data/config.yaml \
      --question "How many and what GPUs were used to train the model?"
    (screenshot of the CLI output)

daavoo linked an issue Dec 27, 2024 that may be closed by this pull request
daavoo self-assigned this Dec 27, 2024
daavoo marked this pull request as ready for review January 2, 2025 10:22
daavoo requested a review from a team January 2, 2025 10:22
stefanfrench (Contributor) commented Jan 2, 2025

@daavoo Setup and demo works on Codespaces with no issues.

I was doing some testing against the EU AI Act pdf doc, and it was having some difficulties getting answers. For example, for this question:
"What is the threshold, measured in floating point operations, that leads to a presumption that a general-purpose AI model has systemic risk?
A) 10^15, B) 10^20, C) 10^25"

Then I get:
Finding section
2025-01-02 15:32:39.361 | DEBUG | structured_qa.workflow:find_retrieve_answer:77 - Result: C) 10^25
2025-01-02 15:32:39.361 | INFO | structured_qa.workflow:find_retrieve_answer:81 - Retrieving section: C) 10^25
2025-01-02 15:32:39.361 | ERROR | structured_qa.workflow:find_retrieve_answer:88 - Unknown section: C) 10^25
2025-01-02 15:33:08.485 | DEBUG | structured_qa.workflow:find_retrieve_answer:52 - Current information available: None

I think this happens because the 'C)' in the question looks like a section, and it seems to be somehow finding that as a 'section result', then retrieving it and getting an error.

This happens for all questions that have a multiple-choice style. It seems like the LLM is not actually restricting itself to retrieving from within {SECTIONS}, and ends up searching for sections that don't exist.

Even if I remove the A, B, C, D options from the question, it still tries to retrieve sections that do not exist.

daavoo (Contributor, Author) commented Jan 2, 2025

> EU AI Act pdf doc

Is it the full doc? Can you send me the link?
I think what you describe might happen if there are a lot of sections, as the current model struggles with long input contexts.

stefanfrench (Contributor)

@daavoo - here's the full EU AI Act pdf. It is very long, so perhaps you're right in terms of input context. I will re-do some testing with smaller sections tomorrow.

daavoo (Contributor, Author) commented Jan 3, 2025

> @daavoo - here's the full EU AI Act pdf

thanks!

> It is very long, so perhaps you're right in terms of input context. I will re-do some testing with smaller sections tomorrow.

When I did the initial testing, I was using individual chapters I created by splitting the pdf.

stefanfrench (Contributor) commented Jan 3, 2025

@daavoo - New pre-processing seems to work well and quickly!

I'm still having difficulties getting correct answers, though. I tested against this paper.

Some examples:

example 1:

  • Question: How many large language models were evaluated?
  • Checks sections: ["5 symbolic reasoning","8 conclusions","c extended related work"]
  • Then tries to check section 'c', which doesn't exist, so it causes an error

example 2:

  • Question: How many benchmarks were used to evaluate arithmetic reasoning?
  • Checks: ["5 symbolic reasoning"]; logically, it should be checking section "3 arithmetic reasoning"
  • Then tries checking "conclusions", which doesn't exist, so it causes an error

I wonder if it's something we can improve with the quality of the prompts? Or can we write some logic so that if a section doesn't exist, the model comes up with a new section to look at, so that it doesn't break.

daavoo (Contributor, Author) commented Jan 3, 2025

> I wonder if it's something we can improve with the quality of the prompts

I think it might be more related to using a better instruct model.
Right now it is a 1.7B-parameter model; we probably need something in the 8B range.
I will do some tests with a bigger one.

> Or can we write some logic so that if a section doesn't exist, the model comes up with a new section to look at, so that it doesn't break.

This we can try to work around in the code, but I also think a better model should be able to follow the instruction of picking a name from the list.
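
For illustration, one possible shape of that code workaround (not what this PR implements): snap an unknown section name returned by the model onto the closest known section via fuzzy matching. The function name and the cutoff value are arbitrary.

```python
# Illustrative sketch of the workaround discussed above (not implemented in this PR).
import difflib


def resolve_section(model_output, section_names):
    """Map the model's raw output to a known section name, or None if nothing is close."""
    if model_output in section_names:
        return model_output
    # Fall back to the closest known section name, if any is similar enough.
    matches = difflib.get_close_matches(model_output, section_names, n=1, cutoff=0.6)
    return matches[0] if matches else None
```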

stefanfrench (Contributor) commented Jan 8, 2025

> I wonder if it's something we can improve with the quality of the prompts

> I think it might be more related to using a better instruct model. Right now it is a 1.7B-parameter model; we probably need something in the 8B range. I will do some tests with a bigger one.

> Or can we write some logic so that if a section doesn't exist, the model comes up with a new section to look at, so that it doesn't break.

> This we can try to work around in the code, but I also think a better model should be able to follow the instruction of picking a name from the list.

@daavoo I did some manual experimentation and testing against this paper for 7 different questions:

  • First I changed the model to bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. This made the results go from 0/7 correct to 2/7 correct
  • I iterated on the FIND_PROMPT a few times and managed to increase from 2/7 to 7/7 correct

FYI This is the find prompt that gave me that result:
find_prompt.txt

daavoo (Contributor, Author) commented Jan 10, 2025

> @daavoo I did some manual experimentation and testing against this paper for 7 different questions

@stefanfrench I think that could be a reasonable default to have. I was running out of memory on codespaces so I tested your prompt with Qwen/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-f16.gguf.

Do you think we can merge this (if you have confirmed that the logic works) so that I can move on to the benchmark code to test different models (so we can pick the "best" default)?

daavoo merged commit 930bf64 into main Jan 13, 2025
3 checks passed
daavoo deleted the 3-llm-workflow branch January 13, 2025 14:05
daavoo linked an issue Jan 13, 2025 that may be closed by this pull request
Successfully merging this pull request may close these issues: LLM workflow, Implement preprocessing module.