removing sample(s) from the talon database #104

ew367 · 2022-05-04T10:01:16Z

Hi

How does TALON process datasets that have only partially been added to the input database? When running TALON again with the same config file and input database, will it continue from partway through the partial dataset, or skip the dataset because part of it is already present? My HPC job was interrupted during processing, and I want to know how I can tell whether the sample that was being added at the time was completed, or whether data is still missing after running TALON again.

Thanks

callumparr · 2022-05-04T10:11:27Z

Also interested to know. When this happened I just deleted the database and initialized a new one, as there isn't a --resume flag.

ew367 · 2022-05-04T10:15:23Z

> Also interested to know. When this happened I just deleted the database and initialized a new one, as there isn't a --resume flag.

Yes, that has been my approach in the past too, but my new dataset is HUGE and had already been running for nearly 2 weeks before the interruption so I really don't want to do that this time!

fairliereese · 2022-05-05T17:07:26Z

You can use this Python code to check whether the dataset has been added to your database:

import sqlite3
import pandas as pd

# Path to your TALON database
db = 'database_name.db'

# List every dataset name recorded in the database's dataset table
with sqlite3.connect(db) as conn:
    q = 'SELECT dataset_name FROM dataset'
    datasets = pd.read_sql_query(q, conn)

print(datasets.dataset_name.tolist())

TALON is pretty good about discarding or not pushing incomplete changes to the database, but this check is not surefire. What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup. I'm sorry there's not a better way to do this; it's something I learned from getting burned in the past as well.
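The backup step described above could be scripted along these lines. This is only a sketch: `backup_database` is a hypothetical helper, not part of TALON, and it relies on Python's standard `sqlite3.Connection.backup` (Python 3.7+) rather than anything TALON provides. The demo uses a throwaway toy database in a temporary directory, not a real TALON database.

```python
import os
import sqlite3
import tempfile


def backup_database(db_path, backup_path):
    """Copy the SQLite database at db_path to backup_path (hypothetical helper)."""
    # sqlite3.connect() silently creates a missing file, so fail early instead.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        # SQLite's online backup API copies every page of the source database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return backup_path


# Demo on a toy database (a stand-in, not the real TALON schema).
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE dataset (dataset_name TEXT)")
    conn.execute("INSERT INTO dataset VALUES ('demo')")
    conn.commit()
    conn.close()

    copy_path = backup_database(demo_db, demo_db + ".bak")
    check = sqlite3.connect(copy_path)
    print(check.execute("SELECT dataset_name FROM dataset").fetchall())
    check.close()
```

Run this before each TALON run that adds datasets; if the run is interrupted, the copy can be put back in place of the modified database. A plain file copy made while nothing is writing to the database should work just as well.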

callumparr · 2022-05-05T17:14:31Z

> What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup

That's simple and ingenious. Not sure why I did not think to do that. TY for replies as always.

ew367 · 2022-05-06T10:44:41Z

Thanks for your input everyone.

Does anyone know if there is a way to remove an individual dataset from a database? If so, I could just remove the dataset that was partway through processing and then re-add it...

The partially processed dataset definitely exists in the database, but I'm not convinced that it has been fully added. I ran talon_filter_transcripts on a new db that I created from just the 'suspect' dataset in question, then did the same on the global db with --datasets=suspectDataset additionally specified, and the results were not comparable. The output from the global db contains fewer rows than the output from the db created from the single suspect dataset.
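A more direct way to make the same comparison might be to count rows per dataset in both databases and compare the numbers. The sketch below is generic: `count_rows_by_value` is a hypothetical helper, and the `observed` table with a `dataset` column is an assumption used for the toy demo, not something confirmed in this thread. Check the actual table and column names in your own database first.

```python
import sqlite3


def count_rows_by_value(conn, table, column):
    """Return {value: row_count} for one column of one table (hypothetical helper)."""
    # Table and column names cannot be bound as SQL parameters, so restrict
    # them to plain identifiers before interpolating them into the query.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")
    query = f'SELECT "{column}", COUNT(*) FROM "{table}" GROUP BY "{column}"'
    return dict(conn.execute(query).fetchall())


# Toy in-memory stand-in; the table and column names are assumptions,
# NOT a confirmed description of the TALON schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observed (read_name TEXT, dataset TEXT)")
conn.executemany(
    "INSERT INTO observed VALUES (?, ?)",
    [("r1", "A"), ("r2", "A"), ("r3", "suspectDataset")],
)
counts = count_rows_by_value(conn, "observed", "dataset")
print(counts)
conn.close()
```

Running the same count against the global db and the single-dataset db would show whether the suspect dataset has fewer rows in the global db.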

rb520826 · 2022-09-30T13:10:11Z

Hi, thanks for the above suggestions. I am also interested in removing a sample from the database. Has there been any update on whether this is possible, please?

Thanks!

callumparr · 2022-12-07T08:24:16Z

> You can use this Python code to check whether the dataset has been added to your database:
>
> import sqlite3
> import pandas as pd
>
> db = 'database_name.db'
> with sqlite3.connect(db) as conn:
>     q = 'SELECT dataset_name FROM dataset'
>     datasets = pd.read_sql_query(q, conn)
>
> print(datasets.dataset_name.tolist())
>
> TALON is pretty good about discarding or not pushing incomplete changes to the database but this is not a surefire method. What I typically do is make a backup copy of my TALON database before trying to add new datasets to it. That way, if the run fails, I can simply restart using the backup. I'm sorry there's not a better way to do this but this is definitely something that I learned based on getting burned in the past as well.

Is it possible to extract the sample description, i.e. the second column in the config file?

I've been playing around with the sqlite3 module, trying to get the column headers of the dataset table in the database, but it's a bit beyond me.

fairliereese · 2022-12-13T22:45:34Z

> Is it possible to extract the sample description, i.e. the second column in the config file?
>
> I've been playing around with the sqlite3 module, trying to get the column headers of the dataset table in the database, but it's a bit beyond me.

You should be able to pull that info out using the following SQL query: SELECT DISTINCT sample FROM dataset
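For the column-header part of the question, a minimal sketch using only the standard `sqlite3` module is shown here. `list_columns` is a hypothetical helper built on SQLite's `PRAGMA table_info`, and the demo table is a toy stand-in that contains only the two columns named in this thread (`dataset_name` and `sample`); a real TALON database may have more.

```python
import sqlite3


def list_columns(conn, table):
    """Return the column names of a table (hypothetical helper)."""
    # PRAGMA arguments cannot be bound as parameters, so validate the name.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Each row is (cid, name, type, notnull, dflt_value, pk); keep the name.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in rows]


# Toy in-memory stand-in for the dataset table (not the full TALON schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (dataset_name TEXT, sample TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [("d1", "brain"), ("d2", "brain"), ("d3", "liver")],
)

columns = list_columns(conn, "dataset")
# The query suggested above, with ORDER BY added so the output is stable.
samples = [
    row[0]
    for row in conn.execute("SELECT DISTINCT sample FROM dataset ORDER BY sample")
]
print(columns)
print(samples)
conn.close()
```

To use it on a real database, replace the in-memory connection with `sqlite3.connect('database_name.db')`.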

As an aside, if you're interested in navigating the contents of the TALON database, I'd definitely recommend downloading a DB viewer such as this one. You can look through the tables and write and test queries on them, so you don't have to open up Python and sqlite3 every time you want to poke around.

As another aside, I am much more comfortable in pandas in Python than I am manipulating these tables through sqlite3. If that's more your speed, there are functions that will literally dump a table from a database into a pandas DataFrame (see here for example) to make it easy to work on.
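As a rough illustration of the pandas route, the sketch below loads a whole table into a DataFrame with `pandas.read_sql_query`, the same function used earlier in this thread. `table_to_dataframe` is a hypothetical helper, and the demo table is a toy stand-in rather than the real TALON schema.

```python
import sqlite3

import pandas as pd


def table_to_dataframe(conn, table):
    """Load every row of a table into a pandas DataFrame (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so validate the name.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)


# Toy in-memory stand-in for the dataset table (not the full TALON schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (dataset_name TEXT, sample TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [("d1", "brain"), ("d2", "brain"), ("d3", "liver")],
)
df = table_to_dataframe(conn, "dataset")
conn.close()

# Once loaded, ordinary pandas operations apply.
print(df.columns.tolist())
print(df["sample"].unique().tolist())
```

Loading a whole table is convenient for small tables like the dataset list; for very large tables it may be better to select only the columns or rows you need.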
