As discussed in Slack, the shell script should be updated to take the input and manifest files as command-line arguments to make this more robust. The input files provided here could be included as test files in the repo.
Made the input files arguments, and added (optional) inputs for defining the columns with the information needed for the script, so input files with a different format can be used. Also added an md5sum for the resulting files to confirm that the script is working as expected.
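For readers following along, here is a minimal sketch of how required file arguments and optional column overrides might be parsed with `getopts`. The option letters, the `parse_args` function name, and the `input`/`manifest` variable names are assumptions for illustration only and are not taken from the script in this PR; the defaults are the ones quoted in the review below.

```shell
#!/bin/bash
# Hypothetical sketch: option letters and function name are illustrative
# assumptions, not the flags used by the script in this PR.

# Defaults, as quoted in the review diff
kf_id_col=1      # KF patient ID column
chr_col=3        # Chromosome
pos_col=4        # Position
label_col=11     # Additional label for the plot, e.g. gene
window=10000     # Bases to plot either side of the position

parse_args() {
  local OPTIND=1 opt
  input=""
  manifest=""
  while getopts "i:m:k:c:p:l:w:" opt; do
    case "$opt" in
      i) input=$OPTARG ;;       # required: input file
      m) manifest=$OPTARG ;;    # required: manifest file
      k) kf_id_col=$OPTARG ;;   # optional column overrides
      c) chr_col=$OPTARG ;;
      p) pos_col=$OPTARG ;;
      l) label_col=$OPTARG ;;
      w) window=$OPTARG ;;
      *) echo "usage: -i INPUT -m MANIFEST [-k N] [-c N] [-p N] [-l N] [-w N]" >&2
         return 1 ;;
    esac
  done
  # Fail early if either required file argument is missing
  if [ -z "$input" ] || [ -z "$manifest" ]; then
    echo "error: both -i and -m are required" >&2
    return 1
  fi
}
```

Calling `parse_args "$@"` at the top of a script would leave any column not overridden at its default value.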
rjcorb left a comment
This ran successfully for me and the md5sums match yours. I made a few suggestions that I think will make this more robust against different input data.
#!/bin/bash

## Define default variables
kf_id_col=1 # KF patient ID column
I think we should be doing these queries by BS_ID rather than patient ID, because there are often multiple RNA-seq cram files from different sample collections from the same patient.
chr_col=3 # Chromosome
pos_col=4 # Position
label_col=11 # Additional label to add to plot for identification, i.e. gene
window=10000 # Bases to plot either side of the position given
A few notes here:
- To make this more robust, I think this should be updated to consider both the start and end positions and build the window around that interval. For variants only one position is relevant, but for splice events, where the exons might be far apart, we'd want the plotting window to cover the whole interval.
- I think we might want to create a standardized format for these input files so that the column indices are always correct. I think the input file should include, in this order: ID,chr,start_pos,end_pos,gene_name (anything else?). You can then build in a check that any supplied input file has these columns in this order.
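The two suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `check_header` and `region` are hypothetical, the header is assumed to be a comma-separated first line that matches exactly, and the `chr:start-end` output is simply one common region syntax (the one samtools accepts), not necessarily what the script uses.

```shell
#!/bin/bash
# Hypothetical sketch of the two review suggestions; names and the
# comma-delimited header are assumptions, not the script's actual format.

expected_header="ID,chr,start_pos,end_pos,gene_name"

# Return 0 only if the first line of the file matches the expected header.
check_header() {
  local file=$1 header=""
  IFS= read -r header < "$file"
  header=${header%$'\r'}   # tolerate Windows line endings
  if [ "$header" != "$expected_header" ]; then
    echo "error: expected header '$expected_header' in $file" >&2
    return 1
  fi
}

# Print a region spanning start-window to end+window, clamped at position 1.
# For a single-position variant, pass the same value as start and end.
region() {
  local chr=$1 start=$2 end=$3 window=${4:-10000}
  local from=$(( start - window ))
  local to=$(( end + window ))
  if (( from < 1 )); then
    from=1
  fi
  printf '%s:%d-%d\n' "$chr" "$from" "$to"
}
```

For example, `region chr1 500 600` prints `chr1:1-10600`, since the left edge is clamped to the first base.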
Closes #3
Control bams will be processed separately; this is just for the patient samples.
Once sbfs is configured following the README instructions, run: