Patient bam processing #7

Open
pj-sullivan wants to merge 5 commits into master from pj-sullivan/bams

pj-sullivan (Collaborator) commented Sep 30, 2025

Closes #3

Control bams will be processed separately; this is just for the patient samples.
Once sbfs is configured following the README instructions, run:

cd analyses
bash 01-cram-to-bam.sh -i input/test-input.tsv -m input/manifest.tsv
md5sum -c results/bams/md5sum.txt

pj-sullivan requested a review from rjcorb September 30, 2025 15:46
pj-sullivan self-assigned this Sep 30, 2025
pj-sullivan marked this pull request as ready for review October 28, 2025 14:41

rjcorb commented Nov 3, 2025

As discussed in Slack, the shell script should take the input and manifest files as command-line arguments to make it more robust. The input files provided here could be included in the repo as test files.

Base automatically changed from pj-sullivan/docker to master November 11, 2025 17:20
pj-sullivan (Collaborator, Author) commented

The input files are now command-line arguments, and I added optional arguments for specifying which columns hold the information the script needs, so input files with a different layout can be used.

Also added an md5sum for the resulting files to confirm that the script is working as expected.
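
For readers following along, the command-line interface described above could be sketched roughly as below. This is not the script from this PR: `-i` and `-m` match the command shown in the PR description, while `-k`, `-c`, `-p`, `-l` and `-w` are hypothetical flag names chosen here for the column and window defaults.

```shell
# Sketch only: -i and -m come from the command in this PR; -k, -c, -p, -l
# and -w are hypothetical flag names for the column and window settings.
parse_args() {
    # Defaults copied from the top of the script under review
    input=""
    manifest=""
    kf_id_col=1
    chr_col=3
    pos_col=4
    label_col=11
    window=10000

    OPTIND=1  # reset so the function can be called more than once
    while getopts "i:m:k:c:p:l:w:" opt; do
        case "$opt" in
            i) input=$OPTARG ;;
            m) manifest=$OPTARG ;;
            k) kf_id_col=$OPTARG ;;
            c) chr_col=$OPTARG ;;
            p) pos_col=$OPTARG ;;
            l) label_col=$OPTARG ;;
            w) window=$OPTARG ;;
            *) echo "Usage: -i INPUT -m MANIFEST [-k N] [-c N] [-p N] [-l N] [-w N]" >&2
               return 1 ;;
        esac
    done

    # Both input files are required; everything else falls back to a default
    if [ -z "$input" ] || [ -z "$manifest" ]; then
        echo "Error: -i and -m are required" >&2
        return 1
    fi
}
```

Calling `parse_args "$@"` at the top of the script would leave the parsed values in the variables above, so the rest of the script can keep using the same names.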

rjcorb left a comment

This ran successfully for me and the md5sums match yours. I made a few suggestions that should make the script more robust to different input data.

#!/bin/bash

## Define default variables
kf_id_col=1 # KF patient ID column

I think we should be doing these queries by BS_ID rather than patient ID, because there are often multiple RNA-seq cram files from different sample collections from the same patient.
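
A minimal sketch of what a lookup by BS_ID could look like, assuming a tab-separated manifest with a header row. The function name and the default column positions are placeholders, not taken from the actual manifest.

```shell
# Hypothetical helper: print the cram file entries for one BS_ID.
# Assumes a tab-separated manifest with a header line; the default column
# numbers (1 for BS_ID, 2 for the file) are placeholders.
crams_for_bs_id() {
    bs_id=$1
    manifest=$2
    bs_col=${3:-1}
    file_col=${4:-2}
    # Skip the header, then print the file column of every matching row
    awk -F'\t' -v id="$bs_id" -v b="$bs_col" -v f="$file_col" \
        'NR > 1 && $b == id { print $f }' "$manifest"
}
```

Because each sample collection has its own BS_ID, this returns only the files for that sample even when the same patient appears on several rows.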

Comment on lines +5 to +8
chr_col=3 # Chromosome
pos_col=4 # Position
label_col=11 # Additional label to add to plot for identification, i.e. gene
window=10000 # Bases to plot either side of the position given

A few notes here:

  • To make this more robust, I think this should consider both the start and end positions and build the window around that interval. For variant cases only one position is relevant, but for splice events, where the exons might be far apart, we'd want to capture a window around the whole interval for plotting.
  • I think we might want to create a standardized file format for these input files to make sure the column indices are always correct. I think the input file should include: ID, chr, start_pos, end_pos, gene_name (anything else?). Then you can build in a check to make sure any input file that is supplied has these columns in this order.
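
Both notes could be sketched along these lines. The function names are hypothetical, the header is assumed to be a tab-separated first line with exactly the column names listed above, and the output is assumed to be a `chr:start-end` region string (the form samtools accepts).

```shell
# Proposed header from the note above; the exact names and the tab
# separator are assumptions.
expected_header=$(printf 'ID\tchr\tstart_pos\tend_pos\tgene_name')

# Return 0 only if the first line of the file matches the expected header.
check_header() {
    header=$(head -n 1 "$1" | tr -d '\r')
    if [ "$header" != "$expected_header" ]; then
        echo "Error: $1 must have columns: ID, chr, start_pos, end_pos, gene_name" >&2
        return 1
    fi
}

# Print "chr:start-end" padded by the window on both sides of the interval,
# clamping the start at 1. The 10000 default mirrors window=10000 in the script.
padded_region() {
    chr=$1
    start=$2
    end=$3
    window=${4:-10000}
    lo=$((start - window))
    if [ "$lo" -lt 1 ]; then
        lo=1
    fi
    echo "${chr}:${lo}-$((end + window))"
}
```

For a single-position variant, start_pos and end_pos would be the same value, so the result matches the current behavior of one window around one position.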


Development

Successfully merging this pull request may close these issues.

Create bam file from an ID and coordinates

3 participants