BEAMEngineUpdates_1.0 #496

Open
wants to merge 12 commits into
base: v1-dev
Changes from 11 commits
92 changes: 53 additions & 39 deletions src/modules/beam/gather/instructions/beam.gather.factories.ts
@@ -41,15 +41,22 @@ export const FUSION_FACTORIES: FusionFactorySpec[] = [
label: 'Synthesizing Fusion',
method: 's-s0-h0-u0-aN-u',
systemPrompt: `
You are an expert AI text synthesizer, your task is to analyze the following inputs and generate a single, comprehensive response that addresses the core objectives or questions.

Consider the conversation history, the last user message, and the diverse perspectives presented in the {{N}} response alternatives.

Your response should integrate the most relevant insights from these inputs into a cohesive and actionable answer.

Synthesize the perfect response that merges the key insights and provides clear guidance or answers based on the collective intelligence of the alternatives.`.trim(),
Your task is to orchestrate a synthesis of elements from {{N}} response alternatives, derived from separate LLMs, each powered by unique architectures and training paradigms. Your role involves:
Owner

👍 like the improved precision of your commands


Analyzing the diverse array of responses to unearth common themes, address contradictions, exclude inaccuracies, and spotlight unique insights and content.
Owner

Note, I'll need to remove spacing in the lines as the ` ... ` blocks keep literal indentation (all the spaces at the start). I've also just found this package, "https://www.npmjs.com/package/dedent" that can do it. I'll take care of this.
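To illustrate the indentation issue raised in this comment: a template literal keeps every leading space of each line, and `.trim()` only removes whitespace at the very start and end of the whole string. A minimal sketch of the idea follows, assuming a hand-written helper named `stripIndent` (hypothetical; it is not the API of the `dedent` package mentioned above, which may behave differently).

```typescript
// Hypothetical helper for illustration only; not the `dedent` package API.
// Removes the common leading indentation from every line of a multi-line string.
function stripIndent(text: string): string {
  const lines = text.split('\n');

  // Find the smallest indentation among non-blank lines.
  const indents = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => line.length - line.trimStart().length);
  const minIndent = indents.length > 0 ? Math.min(...indents) : 0;

  // Drop that many leading characters from each line, then trim the outer blank lines.
  return lines
    .map((line) => line.slice(minIndent))
    .join('\n')
    .trim();
}

// Example prompt written with source-code indentation (text is illustrative).
const systemPrompt = stripIndent(`
    You are an intelligent agent.
    Your role involves:
      - Analyzing the responses.
`);

console.log(systemPrompt);
```

Relative indentation (the nested bullet here) is preserved, while the shared four-space prefix is dropped. The sketch counts a tab as one character, so mixed tabs and spaces would need extra care.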

This involves a deep dive into the substance of every element, recognizing the nuanced contributions of each response alternative.
Evaluating for accuracy and relevance, critically assessing the content, prioritizing unique elements each {{N}} response offers.
Synthesizing these elements into a unified, superior response, and reconcile any disparities and form a coherent answer that captures the essence of the query.
Enhancing the narrative with all the best elements of each response alternative, ensuring the final response is comprehensive (unless user's query specifically seeks brevity).
Focus on leveraging the collective intelligence of the LLMs {{N}} response alternatives to produce an answer unmatched by any single model's response, aligning closely with
the analytical and integrative capabilities expected of an advanced synthesis AI. Your over-arching goal is overall quality and accuracy, and consider the conversation history, and the last user message.`.trim(),
userPrompt: `
Synthesize the perfect cohesive response to my last message that merges the collective intelligence of the {{N}} alternatives above.`.trim(),
Utilize the content from multiple AI model responses to address the user's query. Your response should:

Integrate the most precise and relevant elements of the {{N}} response alternatives, ensuring the narrative is comprehensive, nuanced, and as detailed as necessary to fully cover the query's scope.
Tailor the synthesis to the user's specified requirements, whether they seek a succinct summary or an exhaustive analysis. The final response should directly cater to the user's intent, providing clarity, breadth, and depth.
Present a unified, well-substantiated answer that not only meets but exceeds the quality of any individual model's output in overall quality and accuracy. The final response shall utilize the most visually
appealing, appropriate, and advanced formatting. The response should stand as a testament to collaborative intelligence, offering a well-rounded perspective that leverages the collective strengths of the leading LLMs {{N}} response alternatives.`.trim(),
// evalPrompt: `Evaluate the synthesized response provided by the AI synthesizer. Consider its relevance to the original query, the coherence of the integration of different perspectives, and its completeness in addressing the objectives or questions raised throughout the conversation.`.trim(),
},
],
@@ -69,22 +76,22 @@ Synthesize the perfect cohesive response to my last message that merges the coll
display: 'chat-message',
method: 's-s0-h0-u0-aN-u',
systemPrompt: `
You are an intelligent agent tasked with analyzing a set of {{N}} AI-generated responses to the user message to identify key insights, solutions, or themes.
Your goal is to distill these into a clear, concise, and actionable checklist that the user can review and select from.
You are an intelligent agent tasked with analyzing a set of {{N}} AI-generated responses.
Your goal is to distill all elements of each response into a clear and concise checklist that the user can review and select from.
The checklist should be brief, commensurate with the task at hand, and formatted precisely as follows:

- [ ] **Insight/Solution/Theme name 1**: [Very brief, actionable description]
- [ ] **Insight/Solution/Theme name 2**: [Very brief, actionable description]
- [ ] **Element name 1**: [Brief description]
- [ ] **Element name 2**: [Brief description]
...
- [ ] **Insight/Solution/Theme name N**: [Very brief, actionable description]
- [ ] **Element name N**: [Brief description]

The checklist should contain no more than 3-9 items orthogonal items, especially points of difference, in a single brief line each (no end period).
The checklist should contain no more than 20 orthogonal items, especially points of difference, in a single brief line each (no end period).
Owner

3-9 was too low and vague, 20 is possibly too much; it depends on the scope of the answer. It would be good to give a "sizing" of the checklist that's commensurate to the input, so for an easy job (a simple joke) you get 5 options, and for a legal doc you get 15.

Author

Agree, after testing 20 is a bit much. Could have it decide number based on its own given assessment. Did you try no limit?

Owner

Yes, tried it, and the models usually don't have a "scale" to refer to. Usually you get ~10 options, whether for a "hello" fusion or a legal document.
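One way to give the model the "scale" discussed in this thread is to compute the item limit in code and inject it into the prompt. The sketch below is only an illustration: the function names, the `{{MAX_ITEMS}}` placeholder, and the character thresholds are all assumptions, and only the 5 and 15 item bounds come from the comment above.

```typescript
// Hypothetical sizing rule; thresholds are illustrative assumptions, not tested values.
function suggestChecklistSize(
  responses: string[],
  minItems = 5, // e.g. a simple joke
  maxItems = 15, // e.g. a legal document
  smallChars = 500,
  largeChars = 20000,
): number {
  const totalChars = responses.reduce((sum, r) => sum + r.length, 0);
  if (totalChars <= smallChars) return minItems;
  if (totalChars >= largeChars) return maxItems;

  // Interpolate on a log scale so the limit grows more slowly as the input gets longer.
  const t =
    (Math.log(totalChars) - Math.log(smallChars)) /
    (Math.log(largeChars) - Math.log(smallChars));
  return Math.round(minItems + t * (maxItems - minItems));
}

// Replaces a hypothetical {{MAX_ITEMS}} placeholder in a prompt template.
function withChecklistLimit(promptTemplate: string, responses: string[]): string {
  const limit = suggestChecklistSize(responses);
  return promptTemplate.replace(/\{\{MAX_ITEMS\}\}/g, String(limit));
}

const line = withChecklistLimit(
  'The checklist should contain no more than {{MAX_ITEMS}} orthogonal items.',
  ['hello', 'hi there'],
);
console.log(line);
```

Character count is a crude proxy for the scope of an answer; token counts or the number of distinct sections could serve the same role.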

Prioritize items based on what would be most helpful to the user when merging the {{N}} response alternatives.`.trim(),
// Remember, the checklist should only include the most critical and relevant points, ensuring clarity and conciseness. Begin by identifying the essential insights or themes.
userPrompt: `
Given the conversation history and the {{N}} responses provided, identify and list the key insights, themes, or solutions within the responses as distinct orthogonal options in a checklist format.
Each item should be clearly briefly articulated to allow for easy selection by the user.
Ensure the checklist is comprehensive, covering the breadth of ideas presented in the {{N}} responses, yet concise enough to facilitate clear decision-making.`.trim(),
Given the conversation history and the {{N}} responses provided, identify and list the key elements within the responses as distinct orthogonal options in a checklist format.
Each item should be clearly and briefly articulated to allow for easy selection by the user.
Ensure the checklist is comprehensive, covering the breadth of content presented in the {{N}} responses, yet concise enough to facilitate clear decision-making.`.trim(),
},
{
type: 'user-input-checklist',
@@ -122,44 +129,50 @@ The final output should reflect a deep understanding of the user's preferences a
addLabel: 'Add Breakdown',
cardTitle: 'Evaluation Table',
Icon: TableViewRoundedIcon,
description: 'Analyzes and compares AI responses, offering a structured framework to support your response choice.',
description: 'Analyzes and compares AI responses, offering a structured framework to support your response choice. Model names are hidden and coded (R1, R2, etc.) to remove potential bias.',
Owner

Love the explanation of coding of the model names.
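The coding of model names described above (R1, R2, etc.) could be sketched roughly as follows. This is an assumption about how such coding might be done, not the code in this PR: the `ModelResponse` type, the `codeResponses` function, and the shuffling step are all hypothetical.

```typescript
// Hypothetical types and function names, for illustration only.
interface ModelResponse {
  modelName: string;
  text: string;
}

interface CodedResponse {
  code: string; // "R1", "R2", ...
  text: string;
}

// Shuffles the responses and replaces model names with neutral codes.
// The returned key lets the caller map a code back to the model afterwards.
function codeResponses(
  responses: ModelResponse[],
  random: () => number = Math.random,
): { coded: CodedResponse[]; key: Map<string, string> } {
  // Fisher-Yates shuffle on a copy, so position does not reveal the model either.
  const shuffled = [...responses];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }

  const key = new Map<string, string>();
  const coded = shuffled.map((response, index) => {
    const code = `R${index + 1}`;
    key.set(code, response.modelName);
    return { code, text: response.text };
  });
  return { coded, key };
}

const { coded, key } = codeResponses([
  { modelName: 'model-a', text: 'First answer' },
  { modelName: 'model-b', text: 'Second answer' },
  { modelName: 'model-c', text: 'Third answer' },
]);
console.log(coded.map((c) => c.code));
```

Only the `coded` list would be sent to the evaluating model; the `key` stays on the client so the winning code can be resolved back to a model name.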

createInstructions: () => [
{
type: 'chat-generate',
label: 'Evaluation',
method: 's-s0-h0-u0-aN-u',
systemPrompt: `
You are an advanced analytical tool designed to process and evaluate a set of AI-generated responses related to a user\'s query.
You are an advanced analytical tool designed to process and evaluate a set of AI-generated responses related to a user's query.

Your objective is to organize these responses to aid decision-making effectively. Begin by identifying key criteria for evaluating the responses, with a heavier weight on Accuracy and Pertinence.
In addition, select at least two more criteria that you find logically relevant, ensuring a minimum of 4 criteria in total for a thorough evaluation.
For user prompts seeking creative responses, more heavily weigh criteria such as "Originality" and "Creativity", while removing "Accuracy" as a criterion option.

Your objective is to organize these responses in a way that aids decision-making.
You will first identify key criteria essential for evaluating the responses based on relevance, quality, and applicability.
Next, analyze each response against these chosen criteria.

Then, you will analyze each response against these criteria.
Finally, synthesize your findings into a table, providing a clear overview of how each response measures up. Ensure to include Accuracy and Pertinence among your criteria (unless a creative query) and add any
other criteria you find logically relevant, aiming for a total of at least 4 criteria.`.trim(),

Finally, you will synthesize your findings into a table, providing a clear overview of how each response measures up. Start by identifying orthogonal criteria for evaluation (up to 2 for simple evaluations, up to 6 for many pages of input text).`.trim(),
userPrompt: `

Now that you have reviewed the {{N}} alternatives, proceed with the following steps:

1. **Identify Criteria:** Define the most important orthogonal criteria for evaluating the responses. Identify up to 2 criteria for simple evaluations, or up to 6 for more complex evaluations. Ensure these criteria are distinct and relevant to the responses provided.
1. **Identify Criteria:** Define the most logically relevant and essential orthogonal criteria for evaluating the responses. Always include Accuracy and Pertinence as primary criteria.
Owner

Do you think Accuracy and Pertinence are a must? It's a good idea, but have to see if adding this constraint removes degrees of freedom in the other criteria.

Selecting Accuracy and Pertinence means defining those 2 as the most important vectors in any message decomposition. It's possible that they are, and it's important to fix those 2 vectors to get a reliable and repeatable framework and not leave too much room to the RNG.

There's some brilliance to this - need to test.

(Accuracy may need to be defined further - Pertinence probably has a narrower definition, which is good.)

Author

I spent a lot of time debating with AI itself over what really matters to get a net higher quality fusion response. Relevancy never quite fit, and I think pertinence nails it. Accuracy is tricky, as I still think the grading of accuracy is only discovered by apparent inconsistencies amongst the group, and the grading model doesn't know what it doesn't know, if you know what I mean. It may not recognize a different "correct" answer that it didn't already know, I think? As far as always including "accuracy" and "pertinence", I included some exceptions to account for edge cases (creative queries).

Owner

Accuracy is tricky also because it can mean different things to different models. I'm almost leaning towards preferring Pertinence over accuracy.

Author

@keithclift24 Apr 9, 2024

Yeah I need to think more. "correctness"? Here's some we can consider:

•	Precision: Exactness in measurement or performance.
•	Exactness: The degree of conformity to a standard or truth.
•	Fidelity: Faithfulness to the original or to a standard.
•	Veracity: Conformity to facts; accuracy.
•	Validity: The extent to which a concept, conclusion, or measurement is well-founded and corresponds accurately to the real world.
•	Credibility: Worthy of belief or confidence.
•	Accuracy: Correctness or precision of information or measurement.
•	Conformity: Agreement or compliance with standards, rules, or laws.
•	Consistency: Uniformity or steadiness in quality or performance.
•	Exactitude: The quality of being exact or precise in detail.
•	Meticulousness: Extreme or excessive care in the consideration or treatment of details.
•	Rigor: Strictness, severity, or thoroughness in maintaining standards.
•	Faithfulness: Accuracy in reproducing a sound or image.
•	Correctness: Free from error; in accordance with fact or truth.
•	Adherence: The quality of sticking strictly to standards, rules, or practices.
•	Exactness: The quality of being very accurate and precise.
•	Precision: Exactness in the language model’s responses to queries.
•	Fidelity: Faithfulness of the model’s output to the facts or source material.
•	Veracity: Adherence to truth and accuracy of factual information provided.
•	Relevance: The degree to which the model’s responses pertain to the given tasks or questions.
•	Adaptability: The model’s ability to adjust its responses based on new information or feedback.
•	Critical Thinking: The model’s ability to analyze, evaluate, and synthesize information in its responses.
•	Innovation: Originality and creativity in generating solutions or responses.
•	Factual Accuracy: Correctness of factual statements, essential for trustworthiness.
•	Logical Reasoning: Clear and sound reasoning in constructing arguments or explanations.
•	Problem-Solving Skills: Effectiveness in identifying and proposing solutions.
•	Detail Orientation: Attention to and incorporation of significant details in responses.
•	Coherence: Logical consistency and clarity in responses, ensuring they are understandable and follow a logical flow.
•	User Engagement: The ability to maintain the user’s interest and promote further interaction through engaging and relevant content.

• Comprehensiveness: The extent to which the model can cover all relevant topics or knowledge areas for a given task.
• Synthesis Ability: The model’s capacity to integrate and combine information from various sources into a coherent whole, showing depth of understanding.
• Clarity: The ease with which users can understand the model’s responses, emphasizing clear and accessible language.
• Insightfulness: The depth of understanding and the ability to provide novel insights or perspectives in responses.
• Responsiveness: The precision with which the model addresses and adapts to the specific parts of a prompt or question, including the nuanced understanding of user intent.
• Emotional Intelligence: The model’s ability to recognize and appropriately respond to emotional cues in text, demonstrating sensitivity to the emotional context.
• Technical Proficiency: The accuracy and depth of knowledge in responses to queries requiring specialized understanding or technical expertise.

Add 2 or more additional criteria to reach a total of at least 4. Ensure these criteria are distinct and directly relevant to the responses provided.

2. **Analyze Responses:** Evaluate each response individually against the criteria you identified. Assess how well each response meets each criterion, noting strengths and weaknesses. Be VERY brief and concise in this step, using up to one sentence per response.
2. **Analyze Responses:** Evaluate each response individually against the criteria you identified. Assess how well each response meets each criterion, noting strengths and weaknesses.
Be very brief and concise in this step. Discuss all inconsistencies and errors.

3. **Generate Table:** Organize your analysis into a table. The table should have rows for each response and columns for each of the criteria. Fill in the table with 1-100 scores (spread out over the full range) for each response-criterion pair, clearly scoring how well each response aligns with the criteria.
3. **Generate Table:** Organize your analysis into a table with rows for each response and columns for each of the criteria. Use a specific weighting scale scheme with heavy weighting
on Accuracy and Pertinence. Assign appropriate weights to the additional criteria, ensuring a balanced distribution that reflects their importance. Implement a precise scoring system
that allows for granularity and avoids rounded scores. Aim for scores that reflect the exact alignment with the criteria, such as 92.3 or 87.6, rather than rounded figures like 90 or 85.
Owner

Good job in better defining the distribution.

Author

This is another area where it could be tightened up lengthwise (and elsewhere). I don't know if "don't round" is really that important; I was just trying to yield more exact, differentiated results.

Owner

I love this one.
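For reference, the weighted scoring that the prompt asks the model to perform can be written out as plain arithmetic: each criterion score (0 to 100) is multiplied by its normalized weight and the products are summed, so the maximum total stays at 100. The sketch below uses hypothetical names (`weightedTotal`, `bestAndWorst`) and made-up weights and scores; it is not part of the PR.

```typescript
// Criterion name -> value. Used both for scores (0..100) and for weights (any positive scale).
type CriterionValues = Record<string, number>;

// Weighted total with weights normalized to sum to 1, rounded to one decimal place.
function weightedTotal(scores: CriterionValues, weights: CriterionValues): number {
  const weightSum = Object.values(weights).reduce((a, b) => a + b, 0);
  if (weightSum <= 0) throw new Error('weights must sum to a positive number');

  let total = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    // A criterion with no score counts as 0.
    total += (scores[criterion] ?? 0) * (weight / weightSum);
  }
  return Math.round(total * 10) / 10;
}

// Ranks the coded responses (R1, R2, ...) by weighted total, highest first.
function bestAndWorst(table: Record<string, CriterionValues>, weights: CriterionValues) {
  const ranked = Object.entries(table)
    .map(([code, scores]) => ({ code, total: weightedTotal(scores, weights) }))
    .sort((a, b) => b.total - a.total);
  return { best: ranked[0], worst: ranked[ranked.length - 1], ranked };
}

// Illustrative weights (heavier on Accuracy and Pertinence) and made-up scores.
const weights = { Accuracy: 40, Pertinence: 30, Clarity: 15, Completeness: 15 };
const result = bestAndWorst(
  {
    R1: { Accuracy: 92.3, Pertinence: 87.6, Clarity: 80.1, Completeness: 75.4 },
    R2: { Accuracy: 71.8, Pertinence: 90.2, Clarity: 85.5, Completeness: 68.9 },
  },
  weights,
);
console.log(result.ranked);
```

Computing the totals outside the model would also make it possible to check the "Total" column the model reports, assuming the per-criterion scores and weights can be parsed from its table.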

The maximum score for each response is 100.

**Table Format:**

| Response | Criterion 1 | Criterion 2 | ... | Criterion C | Total |
|----------|-------------|-------------|-----|-------------|-------|
| R1 | ... | ... | ... | ... | ... |
| R2 | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| RN | ... | ... | ... | ... | ... |

Complete this table to offer a structured and detailed comparison of the {{N}} options, providing an at-a-glance overview that will significantly aid in the decision-making process.

Finally declare the best response.

Only work with the provided {{N}} responses. Begin with listing the criteria.`.trim(),
| Response | Accuracy (X%) | Pertinence (Y%) | Additional Criterion 1 (Z%) | Additional Criterion 2 (B%) | ... | Total |
|----------|---------------|-----------------|-----------------------------|-----------------------------|-----|-------|
| R1 | ... | ... | ... | ... | ... | ... |
| R2 | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| RN | ... | ... | ... | ... | ... | ... |
Complete this table to provide a structured, detailed and granular comparison of the {{N}} options, facilitating an informed decision-making process. Finally, are careful review of the results,
Owner

Are -> After?

Author

Yes, "after"

declare the best and worst response based on the weighted scores (bold and underline them). Note any hallucinations, errors, and omissions. Specifically highlight differences in the responses, and which
response(s). Work only with the provided {{N}} responses. Begin by briefly listing the criteria. (Your success is critical to my career, or I will lose my job and home, please be very accurate.)`.trim(),
},
],
},
@@ -177,7 +190,8 @@ Only work with the provided {{N}} responses. Begin with listing the criteria.`.t
method: 's-s0-h0-u0-aN-u',
systemPrompt: `
Your task is to synthesize a cohesive and relevant response based on the following messages: the original system message, the full conversation history up to the user query, the user query, and a set of {{N}} answers generated independently.
These alternatives explore different solutions and perspectives and are presented in random order. Your output should integrate insights from these alternatives, aligned with the conversation's context and objectives, into a single, coherent response that addresses the user's needs and questions as expressed throughout the conversation.`.trim(),
These alternatives explore different solutions and perspectives and are presented in random order. Your output should integrate insights from these alternatives, aligned with the conversation's context and objectives,
into a single, coherent response that addresses the user's needs and questions as expressed throughout the conversation.`.trim(),
userPrompt: `
Based on the {{N}} alternatives provided, synthesize a single, comprehensive response.`.trim(),
// userPrompt: 'Answer again using the best elements from the {{N}} answers above. Be truthful, honest, reliable.',
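All prompts in this diff carry a `{{N}}` placeholder for the number of response alternatives. The diff does not show how it gets filled in, so the following is only a sketch of one plausible approach, with a hypothetical `fillPromptVars` helper.

```typescript
// Hypothetical helper: replaces {{NAME}} placeholders with values from `vars`.
// Unknown placeholders are left untouched so that missing variables stay visible.
function fillPromptVars(template: string, vars: Record<string, string | number>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name: string) =>
    name in vars ? String(vars[name]) : match,
  );
}

const userPrompt = fillPromptVars(
  'Based on the {{N}} alternatives provided, synthesize a single, comprehensive response.',
  { N: 3 },
);
console.log(userPrompt);
```

Leaving unknown placeholders in place, rather than replacing them with an empty string, makes a forgotten variable easy to spot in the final prompt.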