Skip to content

Calculate the enrichment scores of CRISPR alleles and variants from direct target amplicon-sequencing data using Bayesian linear regression model "millipede".

License

Notifications You must be signed in to change notification settings

pinellolab/CRISPR-millipede-target

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyPI - Version CRISPR-Millipede User Documentation

CRISPR-Millipede logo

CRISPR-Millipede was developed by the Pinello Lab as an easy-to-use Python package for processing targeted amplicon-sequencing of tiled sequences from base-editing tiling screens to identify functional nucleotides. By providing amplicon-sequencing of installed alleles from multiple phenotypic populations, CRISPR-Millipede identifies the single-variants that contribute to differences in phenotype. See this preprint for more information on this method! It is expected that you are familiar with Python, command-line tools, and CRISPR screens to follow this guide.

Sections

Notes on Experimental Design and Expected Inputs

Skip this and scroll further down if interested in the tool usage

  • This tool is best used for pooled CRISPR saturation mutagenesis screens of a single focused region.
  • The length of the mutagenized region depends on the desired sequencing read length (i.e. paired-end 150bp sequencing has a max mutagenesis length of 300bp, however, it is desired that there is as much overlap of the paired-ends to maximize sequencing quality). You will perform targeted amplicon-sequencing of your intended mutagenized region. Ensure that no editing occurs at the primer binding sites, and ensure that the primers are tested and optimized beforehand (i.e. difficult-to-amplify or difficult-to-sequence regions may not be suitable for this method, therefore it is essential that this is tested prior to screening).
  • It is suggested that you also perform the standard sequencing of the guide RNA to calculate guide RNA enrichment scores in tandem. Therefore, you will split your genomic DNA into two different library preparation approaches: guide RNA sequencing and the aforementioned direct target sequencing.
  • The type of mutagenesis is best suited to single-nucleotide mutagenesis (i.e. base-editing and prime-editing). The method has not been extensively tested on in-del mutagenesis.
  • This model was developed and tested on FACS-sorted based screens rather than proliferation screens, however the model may still work for proliferation screens by comparing samples between two separate timepoints.
  • Ensure that you have sufficient cell coverage for sequencing, especially if doing both guide RNA and direct target sequencing. You should preferably have roughly 1000 cells * number of guide RNAs in your library for EACH guide RNA and direct sequencing approach (therefore 2000 cells * number of guide RNAs if doing both sequencing approaches) for EACH sample. The cell coverage depends on the editing efficiency and the expected effect sizes. Typically, the sorted population with the phenotypic change from the baseline after perturbation will have the lowest coverage, therefore you should ensure that you have sufficient cell counts in all populations prior to sequencing (by modifying your FACS gates while still maintaining separation between your negative and positive control gRNAs or by simply increasing input cell amount at the expense of longer sort time).
  • Ensure that you have sufficient biological replicates (at least 3 replicates).
  • It is not necessary to haploidize your region to have single-copy alleles, though this may reduce the noise of the phenotypic scores for each sequenced allele due to certain homozygosity of the sequenced allele.
  • While this method is robust to biases in different editing efficiencies among your guide RNAs since alleles are directly sequenced, ensuring high editing efficiency will increase the per-allele coverage in your samples thereby reducing the necessary cell coverage and increasing statistical power.

See Figure a below for a schematic of the experimental design:

CRISPR-CLEAR framework

Figure a: The workflow illustrates the key steps from guide RNA design to data analysis. First, cells stably expressing a base editor are transduced with a library of guide RNAs tiling the regulatory sequence. After editing, cells are FACS-sorted based on the expression of the target protein. Genomic DNA is extracted from sorted cells. Next-generation libraries are prepared to quantify sgRNA counts and to measure the distribution of edits at the endogenous sequence in the sorted population of cells. The left pathway shows the standard approach using sgRNA count-based readout and the CRISPR-SURF pipeline for deconvolution of functional regions. The right pathway depicts the CRISPR-CLEAR approach using direct allele-based readout and the CRISPR-Millipede pipeline, enabling precise genotype-to-phenotype linkage through per-allele and per-nucleotide analysis.

After performing the screen, you should have targetted amplicon-sequencing FASTQs for each of your phenotypic populations (i.e. different FACS gates along with the pre-sort sample) for multiple biological replicates. An overview of the pipeline is to 1) first quality-control using FASTQC to ensure sufficient read quality of all samples, 2) run all the samples through CRISPResso2 to characterize the introduced alleles in your samples, 3) encode the alleles in a numerical representation for Millipede modelling 4) and lastly perform the Millipede modelling to attain your results. See Figure b below for a schematic of the pipeline steps:

CRISPR-CLEAR framework

Figure b: Schematic of CRISPR-Millipede workflow.

Installation

CRISPR-Millipede can be easily installed from PyPi pip install crispr-millipede, which should only take a couple minutes. CRISPR-Millipede requires Python versions >=3.8,<3.12 which can be installed from the Python download page.

You will also need to run CRISPResso2, a Pinello Lab tool, to prepare the input for CRISPR-Millipede. See the CRISPResso2 repository for installation instructions.

Did you also directly sequence your guide RNAs? It is recommended you do so to compare against the CRISPR-Millipede results from target amplicon-sequencing. You could map your guide sequences using tools from the Pinello Lab such as CRISPR-Correct and analyze the resulting counts using CRISPR-SURF as done in the original paper!

System Requirements

CRISPR-Millipede can run on any operating system where Python versions >=3.8,<3.12 can be installed. To speed up model performance, CRISPR-Millipede can utilize both CPUs (for multi-threading) and GPUs (for model training) and is highly recommended to allow the pipeline to run in the span of a couple hours, though the tool can still work on single core non-GPU computers but may run in the span of a day for each run attempt.

Instructions

STEP 1: Run CRISPResso2 to generate allele tables

We need to take the raw amplicon-sequencing data and encode it into an input that CRISPR-Millipede accepts. It is suggested that your amplicon-sequencing data is quality-controlled using FASTQC to ensure sequencing quality.

CRISPR-Millipede's encoding step takes in as input the allele frequency tables produced from CRISPResso2, a Pinello Lab tool for processing amplicon-sequencing data from CRISPR experiments. Refer to the CRISPResso2 documentation for instructions on how to run CRISPResso2, which may depend on the type of CRISPR editing performed in your experiment.

Example command for a base-editing experiment:

CRISPRessoBatch
  -bs {FASTQ_FILENAME} -a {AMPLICON_SEQUENCE} -an {AMPLICON_NAME}
  -q {QUALITY}
  --exclude_bp_from_left {EX_LEFT} --exclude_bp_from_right {EX_RIGHT}
  --no_rerun -n {SCREEN_NAME}
  --min_frequency_alleles_around_cut_to_plot 0.001
  --max_rows_alleles_around_cut_to_plot 500
  -p 20 --plot_window_size 4 --base_editor_output -w 0
  -bo {OUTPUT_DIRECTORY}

Run CRISPResso2 for all samples and replicates. For each sample, CRISPResso2 will produce an allele frequency table named "Alleles_frequency_table.zip" which is used as input to the CRISPR-Millipede package. You will need these files in the next step. CRISPResso2 will also produce several other plots characterizing the editing patterns of your samples which will be useful for initial exploration of your data prior to modelling!

STEP 2: Encode the CRISPResso2 outputs into matrices

The CRISPResso2 output contains a table of alleles and their read counts for each sample. The alleles are represented as strings, though the strings must be encoded into a numerical representation for CRISPR-Millipede modelling.

Import and prepare the parameters of the encoding step by passing in the amplicon sequence (required), the acceptable variant types (optional), predicted editing sites (optional), population colummn suffixes for indexing (required), and encoding edge trimming for reducing sequencing background (optional) to the EncodingParameters class.

Below contains the class definition (and default values) of the EncodingParameters that you will need to instantiate:

@dataclass
class EncodingParameters:
    complete_amplicon_sequence: str # Amplicon sequence string
    population_baseline_suffix: Optional[str] = "_baseline" # Typically the population that unedited cells are primarily in. Suffix label
    population_target_suffix: Optional[str] = "_target" # The population used to calculate variant enrichment relative to the baseline population. Suffix label
    population_presort_suffix: Optional[str] = "_presort" # The un-sorted population used to calculate total editing efficiencies. Suffix label
    wt_suffix: Optional[str] = "_wt" # An unedited population to calculate the sequencing error background. Suffix label
    guide_edit_positions: List[int] = field(default_factory=list) # Position of expected editing sites. Positions are relative to the amplicon sequence. 0-based.
    guide_window_halfsize: int = 3 # Expected editing window size. Only edits in range(guide_edit_position-guide_window_halfsize,guide_edit_position+guide_window_halfsize+1) for all positions will be considered for modelling
    minimum_editing_frequency: float = 0 # Frequency of variants to consider for editing, may be useful for removing sequencing background.
    minimum_editing_frequency_population: List[str] = field(default_factory=list) # Population to consider for removal of variants by frequency, i.e. ["presort"]
    variant_types: List[Tuple[str, str]] = field(default_factory=list) #  List of variants to consider for modelling. Variants represented as two-value tuple where first index is REF and second index is ALT. i.e.  [("A", "G"), ("T", "C")] for adenine base-editing variants.
    trim_left: int = 0 # Filtering positions on left side of amplicon
    trim_right: int = 0 # Filtering positions on right side of amplicon
    remove_denoised: bool = False # Remove filtered features (from above criteria) from model input.  

Example of setting encoding parameters:

from crispr_millipede import encoding as cme

AMPLICON = "ACTGACTGACTGACTGACTGACTG" # Put your complete reference amplicon-sequence here
ABE_VARIANT_TYPES = [("A", "G"), ("T", "C")] # Optional: If using an adenine base-editor
CBE_VARIANT_TYPES = [("C", "T"), ("G", "A")] # Optional: If using a cytosine base-editor
encoding_parameters = cme.EncodingParameters(complete_amplicon_sequence=AMPLICON,
                            population_baseline_suffix="_baseline", 
                            population_target_suffix="_target", 
                            population_presort_suffix="_presort", 
                            wt_suffix="_wt", 
                            trim_left=20, 
                            trim_right=20, 
                            variant_types=ABE_VARIANT_TYPES, 
                            remove_denoised=True)

To load the CRISPResso2 allele frequency tables into CRISPR-Millipede from STEP 1, pass in the EncodingParameters object and the CRISPResso2 allele frequency table filenames from STEP 1 for each population. For each population, provide a list of filenames corresponding to each replicate:

encoding_dataframes = cme.EncodingDataFrames(encoding_parameters=encoding_parameters, #  From example above
                                                 reference_sequence=encoding_parameters.complete_amplicon_sequence,
                                                 population_baseline_filepaths=["CRISPResso_on_sample_baseline_1/Alleles_frequency_table.zip", 
                                                                                "CRISPResso_on_sample_baseline_2/Alleles_frequency_table.zip", 
                                                                                "CRISPResso_on_sample_baseline_3/Alleles_frequency_table.zip"],
                                                 population_target_filepaths=["CRISPResso_on_sample_target_1/Alleles_frequency_table.zip", 
                                                                              "CRISPResso_on_sample_target_2/Alleles_frequency_table.zip", 
                                                                              "CRISPResso_on_sample_target_3/Alleles_frequency_table.zip"],
                                                 population_presort_filepaths=["CRISPResso_on_sample_presort_1/Alleles_frequency_table.zip", 
                                                                               "CRISPResso_on_sample_presort_2/Alleles_frequency_table.zip", 
                                                                               "CRISPResso_on_sample_presort_3/Alleles_frequency_table.zip"],
                                                 wt_filepaths=[root_dir + "CRISPResso_on_sample_wt_1/Alleles_frequency_table.zip"])

Perform the encoding:

encoding_dataframes.read_crispresso_allele_tables() # This reads in the CRISPResso2 table
encoding_dataframes.encode_crispresso_allele_table(progress_bar=True, cores={CPUS}) # Performs the initial encoding. Replace {CPUs} with the number of CPUs for parallelization on your system. 
encoding_dataframes.postprocess_encoding() # Postprocesses the encoding with the filtering criteria from above.

Highly suggested to save the results of the encodings to your drive. Encouraged to include a prefix to version the results. These files will be used as input to the next modelling STEP 3.

prefix_label ="20240916_v1_example_"

cme.save_encodings(encoding_dataframes.encodings_collapsed_merged, sort_column="#Reads_presort", filename=prefix_label + "encoding_dataframes_editor_encodings_rep{}.tsv")
cme.save_encodings(encoding_dataframes.population_wt_encoding_processed, sort_column="#Reads_wt", filename=prefix_label + "encoding_dataframes_wt_encodings_rep{}.tsv")
cme.save_encodings_df(encoding_dataframes.population_baseline_encoding_processed, filename=prefix_label + "encoding_dataframes_baseline_editor_encodings_rep{}.pkl")
cme.save_encodings_df(encoding_dataframes.population_target_encoding_processed, filename=prefix_label + "encoding_dataframes_target_editor_encodings_rep{}.pkl")
cme.save_encodings_df(encoding_dataframes.population_presort_encoding_processed, filename=prefix_label + "encoding_dataframes_presort_editor_encodings_rep{}.pkl")
cme.save_encodings_df(encoding_dataframes.population_wt_encoding_processed, filename=prefix_label + "encoding_dataframes_wt_encodings_rep{}.pkl")

STEP 3: Perform modelling of the encoded dataset

Now that we have the encoded representation of the alleles, we will now perform Millipede modelling off of this representation. For documentation on the Millipede model sub-package, see here.

Set the model parameters: Below contains the class definition (and default values) of the MillipedeDesignMatrixProcessingSpecification that you will need to instantiate:

@dataclass
class MillipedeDesignMatrixProcessingSpecification:
    wt_normalization: bool = True # Normalize the read count base on the unedited allele counts
    total_normalization: bool = False # Normalize the read count based on the total sum of all allele counts
    sigma_scale_normalized: bool = False # If using the NormalLikelihoodVariableSelector, determine if the sigma_scale factor will be based on the normalized read count
    decay_sigma_scale: bool = True # Set the sigma_scale factor based on the decay function
    K_enriched: Union[float, List[float], List[List[float]]] = 5 # Set the K_enriched value of the decay function
    K_baseline: Union[float, List[float], List[List[float]]] = 5 # Set the K_baseline value of the decay function
    a_parameter: Union[float, List[float], List[List[float]]] = 300 # Set the a_parameter of the decay function

Additionally, you will need to specify the type of model as well. Below contains the class definition (and default values) of the MillipedeModelSpecification that you will need to instantiate:

@dataclass
class MillipedeModelSpecification:
    """
        Defines all specifications to produce Millipede model(s)
    """
    model_types: List[MillipedeModelType] 
    replicate_merge_strategy: MillipedeReplicateMergeStrategy
    experiment_merge_strategy: MillipedeExperimentMergeStrategy
    cutoff_specification: MillipedeCutoffSpecification
    design_matrix_processing_specification: MillipedeDesignMatrixProcessingSpecification
    shrinkage_input: Union[MillipedeShrinkageInput, None] = None
    S: float = 1.0 #S parameter
    tau: float = 0.01 #tau parameter
    tau_intercept: float = 1.0e-4

There are sub-classes you will need to instantiate. For instance, the MillipedeReplicateMergeStrategy specifies how multiple replicates are handled during modelling:

class MillipedeReplicateMergeStrategy(Enum):
    """
        Defines how separate replicates will be treated during modelling
    """
    SEPARATE = "SEPARATE" # Replicates are modelled separately; one model per replicate
    SUM = "SUM" # (Normalized) counts for all replicates are summed together; one model for all replicates
    COVARIATE = "COVARIATE" # Replicates are jointly modelled, though replicate ID is included in the model design matrix 

Recommended to run one version in MillipedeReplicateMergeStrategy.SEPARATE to assess individual replicate consistency, then if successful, run a final model in MillipedeReplicateMergeStrategy.COVARIATE

Likewise, the MillipedeExperimentMergeStrategy specifies how multiple experiments (i.e. screens with different editors) are handled during modelling.

class MillipedeExperimentMergeStrategy(Enum):
    """
        Defines how separate experiments will be treated during modelling
    """
    SEPARATE = "SEPARATE"
    SUM = "SUM"
    COVARIATE = "COVARIATE"

The MillipedeModelType specifies what likelihoood function to use for model fitting. See the Millipede documentation for more information.

class MillipedeModelType(Enum):
    """
        Defines the Millipede model likelihood function used
    """
    NORMAL = "NORMAL"
    NORMAL_SIGMA_SCALED = "NORMAL_SIGMA_SCALED"
    BINOMIAL = "BINOMIAL"
    NEGATIVE_BINOMIAL = "NEGATIVE_BINOMIAL"

We recommend using the NORMAL_SIGMA_SCALED model, you will need to define the K_enriched, K_baseline, a, and decay_sigma_scale paramters to specify how the sigma_scale_factor is calculated.

Here is an example of specifying the complete input parameters for modelling:

from crispr_millipede import encoding as cme
from crispr_millipede import modelling as cmm

design_matrix_spec = cmm.MillipedeDesignMatrixProcessingSpecification(
    wt_normalization=False,
    total_normalization=True,
    sigma_scale_normalized=True,
    decay_sigma_scale=True,
    K_enriched=5,
    K_baseline=5,
    a_parameter=0.0005
)

millipede_model_specification_set = {
    "model_specification_1" : cmm.MillipedeModelSpecification(
        model_types=[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED],
        replicate_merge_strategy=cmm.MillipedeReplicateMergeStrategy.COVARIATE,
        experiment_merge_strategy=cmm.MillipedeExperimentMergeStrategy.SEPARATE,
        S = 5,
        tau = 0.01,
        tau_intercept = 0.0001,
        cutoff_specification=cmm.MillipedeCutoffSpecification(
            per_replicate_each_condition_num_cutoff = 0, 
            per_replicate_all_condition_num_cutoff = 1, 
            all_replicate_num_cutoff = 0, 
            all_experiment_num_cutoff = 0,
            baseline_pop_all_condition_each_replicate_num_cutoff = 3,
            baseline_pop_all_condition_acceptable_rep_count = 2,
            enriched_pop_all_condition_each_replicate_num_cutoff = 3,
            enriched_pop_all_condition_acceptable_rep_count = 2,
            presort_pop_all_condition_each_replicate_num_cutoff = 3,
            presort_pop_all_condition_acceptable_rep_count = 2
        ),
        design_matrix_processing_specification=design_matrix_spec
    )
}

Load in the encoding data: Now that you have specified the model inputs, let's load the encoding data in, which should be straightforward:

prefix_label ="20240916_v1_example_"
encoding_filename = prefix_label + "encoding_dataframes_editor_encodings_rep{}.tsv"

# This will load in the data
model_input_data = cmm.MillipedeInputDataExperimentalGroup(
    data_directory="./", 
    enriched_pop_fn_experiment_list = [encoding_filename],
    enriched_pop_df_reads_colname = "#Reads_target",
    baseline_pop_fn_experiment_list = [encoding_filename],
    baseline_pop_df_reads_colname = "#Reads_baseline", 
    presort_pop_fn_experiment_list = [encoding_filename],
    presort_pop_df_reads_colname = '#Reads_presort',
    experiment_labels = ["editor"],
    reps = [0,1,2],
    millipede_model_specification_set = millipede_model_specification_set
   )

Run the model: Now that you have specified the inputs, we will now run the model. You have the option to use the CPU or GPU for modelling.

model_run = cmm.MillipedeModelExperimentalGroup(experiments_inputdata=model_input_data, device=cmm.MillipedeComputeDevice.GPU)

Explore the results: The model will provide posterior inclusion probabilities (PIP) and beta coefficient scores for each feature/variant that was included in the model and not filtered out during the encoding step:

beta_df = paired_end_experiments_models_denoised.millipede_model_specification_set_with_results['model_specification_1'].millipede_model_specification_result_input[0].millipede_model_specification_single_matrix_result[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED].beta
pip_df = paired_end_experiments_models_denoised.millipede_model_specification_set_with_results['model_specification_1'].millipede_model_specification_result_input[0].millipede_model_specification_single_matrix_result[cmm.MillipedeModelType.NORMAL_SIGMA_SCALED].pip

About

Calculate the enrichment scores of CRISPR alleles and variants from direct target amplicon-sequencing data using Bayesian linear regression model "millipede".

Resources

License

Stars

Watchers

Forks

Packages

No packages published