Create llm prompt #2366

Merged: 10 commits merged into master from create_llm_prompt on Sep 6, 2024

RobinL (Member) commented Aug 30, 2024

This is a script that creates an LLM prompt that includes all the key parts of the docs.

That is, the most important parts of the docs are put into the context, and then the user asks a question.
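
As a rough illustration of the idea (not the script in this PR), the sketch below concatenates a few markdown files and appends the user's question. The function name `build_prompt`, the file names, and the section-heading format are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def build_prompt(doc_paths, question):
    """Concatenate documentation files, then append the user's question."""
    sections = []
    for path in doc_paths:
        text = Path(path).read_text(encoding="utf-8").strip()
        # Label each section with its file name so the model can tell them apart.
        sections.append(f"## {Path(path).name}\n\n{text}")
    docs = "\n\n".join(sections)
    return f"{docs}\n\nUse the information above to answer:\n{question}"


# Demo with two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    blocking = Path(tmp) / "blocking.md"
    training = Path(tmp) / "training.md"
    blocking.write_text("Blocking rules limit the pairs compared.", encoding="utf-8")
    training.write_text("EM training estimates m probabilities.", encoding="utf-8")
    prompt = build_prompt([blocking, training], "How do I dedupe customers?")

print(prompt)
```

Putting the question last keeps the documentation as a fixed prefix, so the same context can be reused for different questions.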

At the moment, the prompt is about 60,000 tokens, so a single prompt with Anthropic's Claude 3.5 Sonnet costs about $0.20 (20 cents).

It seems to work pretty well. Consider the following prompt:

Use the information above to write me a simple splink script that deduplicates a list of customers with fields first name, surname, phone number, dob, and product_category

And the output:

from splink import Linker, SettingsCreator, DuckDBAPI, block_on
import splink.comparison_library as cl
import pandas as pd

# Create a sample DataFrame
data = {
    'unique_id': range(1, 1001),
    'first_name': ['John', 'Jon', 'Jane', 'Janet'] * 250,
    'surname': ['Smith', 'Smyth', 'Doe', 'Dough'] * 250,
    'phone_number': ['1234567890', '1234567891', '9876543210', '9876543211'] * 250,
    'dob': ['1990-01-01', '1990-01-02', '1985-05-05', '1985-05-06'] * 250,
    'product_category': ['Electronics', 'Clothing', 'Books', 'Home'] * 250
}

df = pd.DataFrame(data)

# Define the settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("phone_number"),
        block_on("dob"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.ExactMatch("phone_number"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("product_category").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
)

# Create the linker
linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate parameters
linker.training.estimate_probability_two_random_records_match(
    [
        block_on("first_name", "surname", "dob"),
        block_on("phone_number"),
    ],
    recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("first_name", "surname"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("phone_number"))

# Make predictions
df_predictions = linker.inference.predict(threshold_match_probability=0.8)

# Cluster the results
df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.8
)

# View results
print(df_clusters.as_pandas_dataframe(limit=10))

# Optionally, visualize the results
linker.visualisations.match_weights_chart()

The same prompt with the Haiku model, costing about $0.02, gives similarly good results.

Things to do:

  • Add some more key parts of the docs in .md format to the prompt, e.g. the part that discusses blocking rules and efficiency.

RobinL (Member, Author) commented Aug 30, 2024

Here's Haiku's answer to:

What's the difference between blocking rules for prediction and blocking rules for em training?

The key difference between blocking rules for prediction and blocking rules for EM training is:

Blocking rules for prediction:

  • These are the blocking rules that are used to generate the pairwise record comparisons that will be scored by the linkage model during the prediction/inference stage.
  • The goal is to generate a set of candidate pairs that covers as many of the true matches as possible, while keeping the total number of comparisons manageable.
  • These blocking rules should be strict enough to avoid generating too many unnecessary comparisons, but not so strict that they miss true matches.

Blocking rules for EM training:

  • These are the blocking rules used during the Expectation Maximization (EM) algorithm to estimate the model parameters (m and u probabilities).
  • The goal is to generate a set of pairwise comparisons that will allow the EM algorithm to accurately estimate the m probabilities for each comparison level.
  • For each EM training session, you need to use a different blocking rule, so that the EM algorithm can estimate the m probabilities for all comparison levels across the different training sessions.
  • The blocking rules used for EM training do not need to be as strict as the ones used for prediction, as the goal is just to get a representative sample of comparisons, not to optimize runtime.

In summary, the blocking rules for prediction are focused on efficiency and coverage of true matches, while the blocking rules for EM training are focused on providing the EM algorithm with the data it needs to accurately estimate the model parameters.
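
The point about using a different blocking rule in each EM session can be checked mechanically. The sketch below is plain Python rather than Splink code, and it assumes that a session cannot estimate parameters for comparisons on the columns its blocking rule holds equal. The helper name `comparisons_estimated` is hypothetical; the columns and sessions mirror the generated script above.

```python
def comparisons_estimated(comparison_columns, blocking_columns):
    """Columns whose m probabilities one EM session can estimate.

    Assumption: comparisons on the blocked columns are skipped, because
    every pair in that session already agrees on them.
    """
    blocked = set(blocking_columns)
    return [c for c in comparison_columns if c not in blocked]


comparison_columns = ["first_name", "surname", "phone_number", "dob", "product_category"]

# One blocking rule per EM training session, as in the script above.
sessions = [["first_name", "surname"], ["phone_number"]]

covered = set()
for blocking_columns in sessions:
    covered.update(comparisons_estimated(comparison_columns, blocking_columns))

# Any column listed here would never have its m probabilities estimated.
missing = [c for c in comparison_columns if c not in covered]
print("missing:", missing)
```

With these two sessions nothing is missing: the first covers `phone_number`, `dob`, and `product_category`, and the second covers `first_name` and `surname`.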

RobinL (Member, Author) commented Aug 30, 2024

Here's Sonnet 3.5. Haiku is decent, but Sonnet is noticeably better:

The key differences between blocking rules for prediction and blocking rules for EM training in Splink are:
  1. Purpose:

    • Blocking rules for prediction are used to generate candidate record pairs that will be scored by the model during the final prediction phase.
    • Blocking rules for EM training are used to generate a subset of record pairs for estimating model parameters during the training phase.
  2. Coverage:

    • Prediction blocking rules aim to capture all potential matches while keeping the number of comparisons computationally feasible. They should be more comprehensive to avoid missing true matches.
    • EM training blocking rules are typically more restrictive to focus on a smaller set of likely matches for parameter estimation.
  3. Specification:

    • Prediction blocking rules are specified in the main settings dictionary under "blocking_rules_to_generate_predictions".
    • EM training blocking rules are provided as arguments to the estimate_parameters_using_expectation_maximisation() method.
  4. Multiple rules:

    • Prediction often uses multiple blocking rules to ensure good coverage.
    • EM training often uses a single blocking rule per training session, with multiple sessions run using different rules to estimate parameters for all comparisons.
  5. Impact on results:

    • Prediction blocking rules directly affect which record pairs are considered for final linking/deduplication.
    • EM training blocking rules influence the quality of parameter estimates but don't directly determine final predictions.

In summary, blocking rules for prediction are more comprehensive and determine which record pairs are scored in the final model, while blocking rules for EM training are more focused and are used to efficiently estimate model parameters during the training phase.
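
To make the coverage trade-off concrete, here is a toy sketch in plain Python (not Splink) of how several prediction blocking rules combine: each rule yields the pairs that agree exactly on its columns, and the candidate set is the union. The records and the helper `candidate_pairs` are invented for illustration, and the brute-force pair loop is only for clarity; a real engine would use joins.

```python
from itertools import combinations

# Made-up records for illustration only.
records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith", "dob": "1990-01-01"},
    {"unique_id": 2, "first_name": "Jon", "surname": "Smith", "dob": "1990-01-01"},
    {"unique_id": 3, "first_name": "Jane", "surname": "Doe", "dob": "1985-05-05"},
    {"unique_id": 4, "first_name": "Jane", "surname": "Dough", "dob": "1985-05-05"},
    {"unique_id": 5, "first_name": "John", "surname": "Smith", "dob": "1971-03-09"},
]


def candidate_pairs(records, columns):
    """Pairs of unique_ids that agree exactly on every column in `columns`."""
    pairs = set()
    for left, right in combinations(records, 2):
        if all(left[c] == right[c] for c in columns):
            pairs.add((left["unique_id"], right["unique_id"]))
    return pairs


# Two rules, loosely analogous to block_on("first_name", "surname") and block_on("dob").
rules = [["first_name", "surname"], ["dob"]]

all_pairs = set()
for columns in rules:
    all_pairs |= candidate_pairs(records, columns)

total_possible = len(records) * (len(records) - 1) // 2
print(f"{len(all_pairs)} candidate pairs out of {total_possible} possible")
```

The name-based rule alone misses the pair (1, 2) because of the "John"/"Jon" spelling difference; the `dob` rule recovers it, which is why prediction typically uses several rules.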

RobinL (Member, Author) commented Aug 30, 2024

We probably want to fit the prompt within 128k tokens so that it can be run on GPT-4o mini for very little money.
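
One way this could be done is to add doc sections in priority order until a token budget is reached. The sketch below is only an illustration: the 4-characters-per-token estimate is a crude assumption rather than a real tokenizer, and `estimate_tokens` and `fit_to_budget` are hypothetical helpers.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return len(text) // chars_per_token + 1


def fit_to_budget(sections, max_tokens):
    """Keep sections, in priority order, that fit within max_tokens."""
    kept, used = [], 0
    for section in sections:
        cost = estimate_tokens(section)
        if used + cost > max_tokens:
            # Skip a section that does not fit; a later, smaller one still might.
            continue
        kept.append(section)
        used += cost
    return kept, used


# Toy sections standing in for doc pages, most important first.
sections = ["a" * 400, "b" * 4000, "c" * 40]
kept, used = fit_to_budget(sections, max_tokens=200)
print(len(kept), "sections kept, about", used, "tokens")
```

In practice the budget would be set somewhat below the 128k context limit to leave room for the question and the model's answer.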

GPT-4o mini
GPT-4o mini is our most cost-efficient small model that’s smarter and cheaper than GPT-3.5 Turbo, and has vision capabilities. The model has 128K context and an October 2023 knowledge cutoff.

Model                    Pricing                      Pricing with Batch API*
gpt-4o-mini              $0.150 / 1M input tokens     $0.075 / 1M input tokens
                         $0.600 / 1M output tokens    $0.300 / 1M output tokens
gpt-4o-mini-2024-07-18   $0.150 / 1M input tokens     $0.075 / 1M input tokens
                         $0.600 / 1M output tokens    $0.300 / 1M output tokens
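
For a sense of scale, the sketch below applies the gpt-4o-mini input prices quoted above to the roughly 60,000-token prompt mentioned at the top of this PR. It counts input tokens only and ignores the cost of the answer; the dictionary and function names are made up for the example.

```python
# Input prices in USD per 1M tokens, taken from the pricing quoted above.
PRICE_PER_MILLION_INPUT_TOKENS = {
    "gpt-4o-mini": 0.150,
    "gpt-4o-mini (Batch API)": 0.075,
}


def input_cost(prompt_tokens, price_per_million):
    """Cost in USD of sending prompt_tokens input tokens."""
    return prompt_tokens * price_per_million / 1_000_000


prompt_tokens = 60_000  # approximate prompt size stated in this PR

costs = {
    name: input_cost(prompt_tokens, price)
    for name, price in PRICE_PER_MILLION_INPUT_TOKENS.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} per prompt")
```

At these prices the input side of a single prompt comes to just under one cent, or about half a cent through the Batch API.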

RobinL mentioned this pull request Sep 6, 2024
RobinL merged commit 4b57cdf into master Sep 6, 2024 (1 check passed)
RobinL deleted the create_llm_prompt branch September 6, 2024 13:46