Create llm prompt #2366

RobinL · 2024-08-30T19:31:24Z

This is a script that creates an LLM prompt that includes all the key parts of the docs.

That is, the most important parts of the docs are put in context, and then the user asks a question.

At the moment, the prompt is about 60,000 tokens, so a single prompt with Anthropic's Sonnet 3.5 costs about $0.20 (20 cents).
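
The arithmetic behind that figure can be sketched as follows. The per-million-token input prices are assumptions (illustrative list prices for the Sonnet and Haiku tiers, not values taken from this PR), and output tokens are ignored, so treat the result as a rough lower bound.

```python
# Rough input-cost estimate for a single prompt.
# ASSUMPTION: illustrative USD prices per 1M input tokens, not taken from this PR.
ASSUMED_INPUT_PRICE_PER_MILLION = {
    "sonnet-3.5": 3.00,
    "haiku": 0.25,
}


def input_cost_usd(n_tokens: int, model: str) -> float:
    """Return the input-token cost in USD for a prompt of n_tokens."""
    return n_tokens / 1_000_000 * ASSUMED_INPUT_PRICE_PER_MILLION[model]


prompt_tokens = 60_000  # approximate prompt size quoted in this PR
for model in ASSUMED_INPUT_PRICE_PER_MILLION:
    print(f"{model}: ${input_cost_usd(prompt_tokens, model):.3f}")
```

Under these assumed prices the input alone comes to roughly $0.18 for Sonnet; the output tokens of the answer account for the rest.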

It seems to work pretty well. Consider the following prompt:

Use the information above to write me a simple splink script that deduplicates a list of customers with fields first name, surname, phone number, dob, and product_category

And the output:

from splink import Linker, SettingsCreator, DuckDBAPI, block_on
import splink.comparison_library as cl
import pandas as pd

# Create a sample DataFrame
data = {
    'unique_id': range(1, 1001),
    'first_name': ['John', 'Jon', 'Jane', 'Janet'] * 250,
    'surname': ['Smith', 'Smyth', 'Doe', 'Dough'] * 250,
    'phone_number': ['1234567890', '1234567891', '9876543210', '9876543211'] * 250,
    'dob': ['1990-01-01', '1990-01-02', '1985-05-05', '1985-05-06'] * 250,
    'product_category': ['Electronics', 'Clothing', 'Books', 'Home'] * 250
}

df = pd.DataFrame(data)

# Define the settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("phone_number"),
        block_on("dob"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.ExactMatch("phone_number"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("product_category").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

# Create the linker
linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate parameters
linker.training.estimate_probability_two_random_records_match(
    [
        block_on("first_name", "surname", "dob"),
        block_on("phone_number"),
    ],
    recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("first_name", "surname"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("phone_number"))

# Make predictions
df_predictions = linker.inference.predict(threshold_match_probability=0.8)

# Cluster the results
df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.8
)

# View results
print(df_clusters.as_pandas_dataframe(limit=10))

# Optionally, visualize the results
linker.visualisations.match_weights_chart()

The same prompt with the Haiku model, costing $0.02, gives similarly good results.

Things to do:

Add some more key parts of the docs in .md format to the prompt, e.g. the part that discusses blocking rules and efficiency.
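
For illustration only, folding extra `.md` pages into the context could look something like the sketch below. This is not the script in this PR: the function names `build_context` and `build_prompt`, the `## File:` separator, and the directory layout are all hypothetical.

```python
from pathlib import Path


def build_context(doc_dir: str, pattern: str = "*.md") -> str:
    """Concatenate every matching file under doc_dir into one context string.

    HYPOTHETICAL helper: the real script in this PR may select and format
    the docs differently.
    """
    parts = []
    for path in sorted(Path(doc_dir).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Label each section with its relative path so pages stay distinguishable.
        parts.append(f"## File: {path.relative_to(doc_dir)}\n\n{text.strip()}\n")
    return "\n".join(parts)


def build_prompt(context: str, question: str) -> str:
    """Put the docs first and the user's question last."""
    return f"{context}\n\nUse the information above to answer:\n{question}\n"


# Example usage (the "docs" path is an assumption):
# prompt = build_prompt(build_context("docs"), "How do I choose blocking rules?")
```

Sorting the paths keeps the prompt stable between runs, which makes it easier to compare answers after the docs change.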

RobinL · 2024-08-30T19:48:29Z

Here's Haiku on:

What's the difference between blocking rules for prediction and blocking rules for em training?

The key difference between blocking rules for prediction and blocking rules for EM training is:

Blocking rules for prediction:

These are the blocking rules that are used to generate the pairwise record comparisons that will be scored by the linkage model during the prediction/inference stage.
The goal is to generate a set of candidate pairs that covers as many of the true matches as possible, while keeping the total number of comparisons manageable.
These blocking rules should be strict enough to avoid generating too many unnecessary comparisons, but not so strict that they miss true matches.

Blocking rules for EM training:

These are the blocking rules used during the Expectation Maximization (EM) algorithm to estimate the model parameters (m and u probabilities).
The goal is to generate a set of pairwise comparisons that will allow the EM algorithm to accurately estimate the m probabilities for each comparison level.
For each EM training session, you need to use a different blocking rule, so that the EM algorithm can estimate the m probabilities for all comparison levels across the different training sessions.
The blocking rules used for EM training do not need to be as strict as the ones used for prediction, as the goal is just to get a representative sample of comparisons, not to optimize runtime.

In summary, the blocking rules for prediction are focused on efficiency and coverage of true matches, while the blocking rules for EM training are focused on providing the EM algorithm with the data it needs to accurately estimate the model parameters.

RobinL · 2024-08-30T19:50:03Z

Here's Sonnet 3.5. Haiku is decent, Sonnet is noticeably better:

The key differences between blocking rules for prediction and blocking rules for EM training in Splink are:

Purpose:
- Blocking rules for prediction are used to generate candidate record pairs that will be scored by the model during the final prediction phase.
- Blocking rules for EM training are used to generate a subset of record pairs for estimating model parameters during the training phase.
Coverage:
- Prediction blocking rules aim to capture all potential matches while keeping the number of comparisons computationally feasible. They should be more comprehensive to avoid missing true matches.
- EM training blocking rules are typically more restrictive to focus on a smaller set of likely matches for parameter estimation.
Specification:
- Prediction blocking rules are specified in the main settings dictionary under "blocking_rules_to_generate_predictions".
- EM training blocking rules are provided as arguments to the estimate_parameters_using_expectation_maximisation() method.
Multiple rules:
- Prediction often uses multiple blocking rules to ensure good coverage.
- EM training often uses a single blocking rule per training session, with multiple sessions run using different rules to estimate parameters for all comparisons.
Impact on results:
- Prediction blocking rules directly affect which record pairs are considered for final linking/deduplication.
- EM training blocking rules influence the quality of parameter estimates but don't directly determine final predictions.

In summary, blocking rules for prediction are more comprehensive and determine which record pairs are scored in the final model, while blocking rules for EM training are more focused and are used to efficiently estimate model parameters during the training phase.
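
To make the distinction concrete without depending on Splink itself, here is a minimal pure-Python sketch of what a blocking rule does: it only emits record pairs that agree on the blocked columns. The helper `pairs_for_rule` and the toy records are hypothetical; Splink generates these pairs in SQL on the backend rather than with Python loops.

```python
from itertools import combinations

# Toy records, invented for illustration.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1990-01-01"},
    {"unique_id": 2, "first_name": "Jon", "surname": "Smith", "dob": "1990-01-01"},
    {"unique_id": 3, "first_name": "John", "surname": "Smith", "dob": "1985-05-05"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Doe", "dob": "1985-05-05"},
]


def pairs_for_rule(rows, *cols):
    """Return the set of id pairs whose values agree on every column in cols."""
    return {
        (a["unique_id"], b["unique_id"])
        for a, b in combinations(rows, 2)
        if all(a[c] == b[c] for c in cols)
    }


# Prediction: the union of several rules, so a pair missed by one rule
# can still be picked up by another.
prediction_rules = [("first_name", "surname"), ("dob",)]
prediction_pairs = set().union(
    *(pairs_for_rule(records, *rule) for rule in prediction_rules)
)

# EM training: a single rule per session.
em_session_pairs = pairs_for_rule(records, "first_name", "surname")

print(sorted(prediction_pairs))
print(sorted(em_session_pairs))
```

Every pair in `em_session_pairs` agrees on `first_name` and `surname` by construction, so a session blocked on those columns carries no information about them. That is why several EM sessions with different rules are run.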

RobinL · 2024-08-30T20:16:02Z

Probably want to fit the prompt into 128k tokens so that it can be run on GPT-4o mini for very little money.

GPT-4o mini

GPT-4o mini is our most cost-efficient small model that’s smarter and cheaper than GPT-3.5 Turbo, and has vision capabilities. The model has 128K context and an October 2023 knowledge cutoff.

| Model | Pricing | Pricing with Batch API* |
| --- | --- | --- |
| gpt-4o-mini | $0.150 / 1M input tokens; $0.600 / 1M output tokens | $0.075 / 1M input tokens; $0.300 / 1M output tokens |
| gpt-4o-mini-2024-07-18 | $0.150 / 1M input tokens; $0.600 / 1M output tokens | $0.075 / 1M input tokens; $0.300 / 1M output tokens |
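
A quick way to check whether a prompt fits that window, and what it would cost at the input price quoted above, might look like this. The 4-characters-per-token ratio and the output reserve are assumptions; a real tokenizer would give a more reliable count.

```python
CONTEXT_WINDOW_TOKENS = 128_000       # 128K context, from the model description above
INPUT_PRICE_PER_MILLION_USD = 0.150   # gpt-4o-mini input price quoted above
ASSUMED_CHARS_PER_TOKEN = 4           # ASSUMPTION: crude heuristic for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // ASSUMED_CHARS_PER_TOKEN + 1


def fits_context(text: str, reserved_for_output: int = 4_000) -> bool:
    """True if the estimated prompt plus an output reserve fits the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


def input_cost_usd(n_tokens: int) -> float:
    """Input-token cost in USD at the quoted non-batch price."""
    return n_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION_USD


print(f"60,000-token prompt: ${input_cost_usd(60_000):.4f}")
```

At the quoted price, the current 60,000-token prompt would cost a little under one cent of input per question.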

RobinL mentioned this pull request Sep 6, 2024

Add cookbook to examples #2388

Merged

RobinL added 10 commits September 6, 2024 14:25

llm prompt

f6f6215

new llm prompt

6e100a6

llm prompt

3d21e1e

llm prompt

17f000e

tutorial

6d1d6c8

better output

79cd84e

update short and long script

25161a3

Untrack llm_context_long.txt and llm_prompt_short.txt

f14c297

consistency

673a7c7

ruff

6835075

RobinL force-pushed the create_llm_prompt branch from 24b3e5d to 6835075 Compare September 6, 2024 13:43

RobinL merged commit 4b57cdf into master Sep 6, 2024
1 check passed

RobinL deleted the create_llm_prompt branch September 6, 2024 13:46
