handle RefSeq data

Get sequence metadata for a RefSeq release catalogue using Entrez (including suppressed or replaced sequences), select the top 3 assemblies for each taxon, and download the corresponding FASTA files.

Extract nucleotide sequence IDs from the given RefSeq catalogue and write the sequence and taxonomy IDs into 3 batches (-> seqId_taxId.batch{1,2,3}.txt.gz):

$ python get_nuc_seqIds_taxIds.py RefSeq-release70.catalog.gz
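As a rough illustration of this step (not the actual contents of get_nuc_seqIds_taxIds.py), the sketch below assumes the catalogue is tab-separated with the taxonomy ID in the first column and the accession in the third, and that protein records can be recognised by the RefSeq protein accession prefixes. The function names and the round-robin batching are hypothetical choices.

```python
import gzip

# Assumption: RefSeq protein accessions start with one of these prefixes;
# every other accession is treated as a nucleotide sequence.
PROTEIN_PREFIXES = ("AP_", "NP_", "WP_", "XP_", "YP_")


def split_nucleotide_ids(lines, n_batches=3):
    """Return n_batches lists of (seq_id, tax_id) pairs, filled round-robin."""
    batches = [[] for _ in range(n_batches)]
    kept = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        tax_id, seq_id = fields[0], fields[2]
        if seq_id.startswith(PROTEIN_PREFIXES):
            continue  # protein record, not a nucleotide sequence
        batches[kept % n_batches].append((seq_id, tax_id))
        kept += 1
    return batches


def write_batches(catalog_path, prefix="seqId_taxId"):
    """Read a gzipped catalogue and write one gzipped file per batch."""
    with gzip.open(catalog_path, "rt") as handle:
        batches = split_nucleotide_ids(handle)
    for i, batch in enumerate(batches, start=1):
        with gzip.open(f"{prefix}.batch{i}.txt.gz", "wt") as out:
            for seq_id, tax_id in batch:
                out.write(f"{seq_id}\t{tax_id}\n")
```

Round-robin assignment keeps the three batches the same size to within one record, which helps when each batch is processed on a separate machine.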

Download metadata (seqId, seqLength, taxId, assemblyId) from Entrez for each sequence in the given seqId_taxId file (run once per batch; the batches can be distributed across different machines):

$ python download_seq_metadata.py seqId_taxId.txt.gz seq_metadata.txt.gz 28 [email protected] NCBI_API_key
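The repository's scripts rely on EntrezDownloader.py for this. Purely as an illustration of what a batched metadata request to NCBI E-utilities could look like, the sketch below builds ESummary POST requests for chunks of sequence IDs. The helper names and the chunk size of 200 are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_esummary_requests(seq_ids, email, api_key, chunk_size=200):
    """Return (url, post_body) pairs, one per chunk of sequence IDs."""
    requests = []
    for chunk in chunked(list(seq_ids), chunk_size):
        params = {
            "db": "nuccore",        # nucleotide database
            "id": ",".join(chunk),  # comma-separated accessions
            "retmode": "json",
            "email": email,
            "api_key": api_key,
        }
        requests.append((ESUMMARY_URL, urlencode(params).encode("ascii")))
    return requests
```

Each pair could then be passed to `urllib.request.urlopen(url, data=body)`, which sends the parameters as a POST request and avoids URL length limits for long ID lists.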

Load and check the metadata, determine the IDs of the 3 longest assemblies for each taxon, and retrieve the corresponding sequence IDs (-> selectd_seqIds.txt):

$ python filter_ids.top3.py seqId_taxId.txt.gz seq_metadata.txt.gz
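One possible reading of this selection step is sketched below. It assumes that the "length" of an assembly is the summed length of its sequences and that ties are broken by assembly ID; neither detail is confirmed by the script itself, and `select_top_assemblies` is a hypothetical name.

```python
from collections import defaultdict


def select_top_assemblies(records, top_n=3):
    """Return the sequence IDs of the top_n longest assemblies per taxon.

    `records` is an iterable of (seq_id, seq_length, tax_id, assembly_id).
    """
    assembly_length = defaultdict(int)
    assembly_seqs = defaultdict(list)
    for seq_id, seq_length, tax_id, assembly_id in records:
        key = (tax_id, assembly_id)
        assembly_length[key] += seq_length  # assumption: total length per assembly
        assembly_seqs[key].append(seq_id)

    # Group the assemblies of each taxon together with their total length.
    by_taxon = defaultdict(list)
    for (tax_id, assembly_id), total in assembly_length.items():
        by_taxon[tax_id].append((total, assembly_id))

    selected = []
    for tax_id, assemblies in by_taxon.items():
        # Longest first; assembly ID as a deterministic tie-breaker.
        assemblies.sort(key=lambda item: (-item[0], item[1]))
        for _, assembly_id in assemblies[:top_n]:
            selected.extend(assembly_seqs[(tax_id, assembly_id)])
    return selected
```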

Download FASTA files for the selected IDs:

$ python download_fastas.py selectd_seqIds.txt sequences.fasta 28 [email protected] NCBI_API_key
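For orientation only, a FASTA download through the E-utilities EFetch endpoint might look like the following sketch. It is not the code in download_fastas.py; the function names and the chunk size are assumptions, and the fetch function is passed in as a parameter so the chunking logic can be tried without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_fasta(seq_ids, email, api_key):
    """Fetch FASTA text for a list of accessions with one EFetch POST request."""
    params = {
        "db": "nuccore",
        "id": ",".join(seq_ids),
        "rettype": "fasta",
        "retmode": "text",
        "email": email,
        "api_key": api_key,
    }
    with urlopen(EFETCH_URL, data=urlencode(params).encode("ascii")) as response:
        return response.read().decode("utf-8")


def iter_fasta_chunks(seq_ids, fetch, chunk_size=100):
    """Yield FASTA text chunk by chunk; `fetch` maps a list of IDs to text."""
    seq_ids = list(seq_ids)
    for start in range(0, len(seq_ids), chunk_size):
        text = fetch(seq_ids[start:start + chunk_size])
        # Make sure consecutive chunks do not run together in the output file.
        yield text if text.endswith("\n") else text + "\n"
```

A caller could write the chunks to a file with `out.writelines(iter_fasta_chunks(ids, lambda chunk: fetch_fasta(chunk, email, api_key)))`, where `ids`, `email`, and `api_key` are supplied by the user.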

skrakau/handle_RefSeq_data

Folders and files

Latest commit

History

Repository files navigation

handle RefSeq data

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages