-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtest.cor
50 lines (50 loc) · 56.6 KB
/
test.cor
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
We present the relational database EDULISS (EDinburgh University Ligand Selection System), which stores structural, physicochemical and pharmacophoric properties of small molecules. The database comprises a collection of over 4 million commercially available compounds from 28 different suppliers. A user-friendly web-based interface for EDULISS (available at http://eduliss.bch.ed.ac.uk/ ) has been established providing a number of data-mining possibilities. For each compound a single 3D conformer is stored along with over 1600 calculated descriptor values (molecular properties). A very efficient method for unique compound recognition, especially for a large scale database, is demonstrated by making use of small subgroups of the descriptors. Many of the shape and distance descriptors are held as pre-calculated bit strings permitting fast and efficient similarity and pharmacophore searches which can be used to identify families of related compounds for biological testing. Two ligand searching applications are given to demonstrate how EDULISS can be used to extract families of molecules with selected structural and biophysical features.
We updated the tRNADB-CE by analyzing 939 complete and 1301 draft genomes of prokaryotes and eukaryotes, 171 complete virus genomes, 121 complete chloroplast genomes and approximately 230 million sequences obtained by metagenome analyses of 210 environmental samples. The 287 102 tRNA genes in total, and thus two times of the tRNA genes compiled previously, are compiled, in which sequence information, clover-leaf structure and results of sequence similarity and oligonucleotide-pattern search can be browsed. In order to pool collective knowledge with help from any experts in the tRNA research field, we included a column to which comments can be added on each tRNA gene. By compiling tRNAs of known prokaryotes with identical sequences, we found high phylogenetic preservation of tRNA sequences, especially at a phylum level. Furthermore, a large number of tRNAs obtained by metagenome analyses of environmental samples had sequences identical to those found in known prokaryotes. The identical sequence group, therefore, can be used as phylogenetic markers to clarify the microbial community structure of an ecosystem. The updated tRNADB-CE provided functions, with which users can obtain the phylotype-specific markers (e.g. genus-specific markers) by themselves and clarify microbial community structures of ecosystems in detail. tRNADB-CE can be accessed freely at http://trna.nagahama-i-bio.ac.jp .
The Protein Data Bank in Europe (PDBe; pdbe.org) is actively involved in managing the international archive of biomacromolecular structure data as one of the partners in the Worldwide Protein Data Bank (wwPDB; wwpdb.org). PDBe also develops new tools to make structural data more widely and more easily available to the biomedical community. PDBe has developed a browser to access and analyze the structural archive using classification systems that are familiar to chemists and biologists. The PDBe web pages that describe individual PDB entries have been enhanced through the introduction of plain-English summary pages and iconic representations of the contents of an entry (PDBprints). In addition, the information available for structures determined by means of NMR spectroscopy has been expanded. Finally, the entire web site has been redesigned to make it substantially easier to use for expert and novice users alike. PDBe works closely with other teams at the European Bioinformatics Institute (EBI) and in the international scientific community to develop new resources with value-added information. The SIFTS initiative is an example of such a collaboration—it provides extensive mapping data between proteins whose structures are available from the PDB and a host of other biomedical databases. SIFTS is widely used by major bioinformatics resources.
FlyFactorSurvey (http://pgfe.umassmed.edu/TFDBS/) is a database of DNA binding specificities for Drosophila transcription factors (TFs) primarily determined using the bacterial one-hybrid system. The database provides community access to over 400 recognition motifs and position weight matrices for over 200 TFs, including many unpublished motifs. Search tools and flat file downloads are provided to retrieve binding site information (as sequences, matrices and sequence logos) for individual TFs, groups of TFs or for all TFs with characterized binding specificities. Linked analysis tools allow users to identify motifs within our database that share similarity to a query matrix or to view the distribution of occurrences of an individual motif throughout the Drosophila genome. Together, this database and its associated tools provide computational and experimental biologists with resources to predict interactions between Drosophila TFs and target cis -regulatory sequences.
miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15 000 microRNA gene loci in over 140 species, and over 17 000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/ .
Phytohormone studies enlightened our knowledge of plant responses to various changes. To provide a systematic and comprehensive view of genes participating in plant hormonal regulation, an online accessible database Arabidopsis Hormone Database (AHD) has been developed, which is a collection of hormone related genes of the model organism Arabidopsis thaliana (AHRGs). Recently we updated our database from AHD to a new version AHD2.0 by adding several pronounced features: (i) updating our collection of AHRGs based on most recent publications as well as constructing elaborate schematic diagrams of each hormone biosynthesis and signaling pathways; (ii) adding orthologs of sequenced plants listed in OrthoMCL-DB to each AHRG in the updated database; (iii) providing predicted miRNA splicing site(s) for each AHRG; (iv) integrating genes that genetically interact with each AHRG according to literatures mining; (v) providing links to a powerful online analysis platform WebLab for the convenience of in-time bioinformatics analysis and (vi) providing links to widely used protein databases and integrating more expression profiling information that would facilitate users for a more systematic and integrative analysis related to phytohormone research.
The BRENDA (BRaunschweig ENzyme Database, http://www.brenda-enzymes.org ) enzyme information system is the main collection of enzyme functional and property data for the scientific community. The majority of the data are manually extracted from the primary literature. The content covers information on function, structure, occurrence, preparation and application of enzymes as well as properties of mutants and engineered variants. The number of manually annotated references increased by 30% to more than 100 000, the number of ligand structures by 45% to almost 100 000. New query, analysis and data management tools were implemented to improve data processing, data presentation, data input and data access. BRENDA now provides new viewing options such as the display of the statistics of functional parameters and the 3D view of protein sequence and structure features. Furthermore a ligand summary shows comprehensive information on the BRENDA ligands. The enzymes are linked to their respective pathways and can be viewed in pathway maps. The disease text mining part is strongly enhanced. It is possible to submit new, not yet classified enzymes to BRENDA, which then are reviewed and classified by the International Union of Biochemistry and Molecular Biology. A new SBML output format of BRENDA kinetic data allows the construction of organism-specific metabolic models.
Protein phosphorylation catalyzed by kinases plays crucial regulatory roles in intracellular signal transduction. With the increasing number of experimental phosphorylation sites that has been identified by mass spectrometry-based proteomics, the desire to explore the networks of protein kinases and substrates is motivated. Manning et al. have identified 518 human kinase genes, which provide a starting point for comprehensive analysis of protein phosphorylation networks. In this study, a knowledgebase is developed to integrate experimentally verified protein phosphorylation data and protein–protein interaction data for constructing the protein kinase–substrate phosphorylation networks in human. A total of 21 110 experimental verified phosphorylation sites within 5092 human proteins are collected. However, only 4138 phosphorylation sites (∼20%) have the annotation of catalytic kinases from public domain. In order to fully investigate how protein kinases regulate the intracellular processes, a published kinase-specific phosphorylation site prediction tool, named KinasePhos is incorporated for assigning the potential kinase. The web-based system, RegPhos, can let users input a group of human proteins; consequently, the phosphorylation network associated with the protein subcellular localization can be explored. Additionally, time-coursed microarray expression data is subsequently used to represent the degree of similarity in the expression profiles of network members. A case study demonstrates that the proposed scheme not only identify the correct network of insulin signaling but also detect a novel signaling pathway that may cross-talk with insulin signaling network. This effective system is now freely available at http://RegPhos.mbc.nctu.edu.tw .
LocDB is a manually curated database with experimental annotations for the subcellular localizations of proteins in Homo sapiens (HS, human) and Arabidopsis thaliana (AT, thale cress). Currently, it contains entries for 19 604 UniProt proteins (HS: 13 342; AT: 6262). Each database entry contains the experimentally derived localization in Gene Ontology (GO) terminology, the experimental annotation of localization, localization predictions by state-of-the-art methods and, where available, the type of experimental information. LocDB is searchable by keyword, protein name and subcellular compartment, as well as by identifiers from UniProt, Ensembl and TAIR resources. In comparison to other public databases, LocDB as a resource adds about 10 000 experimental localization annotations for HS proteins and ∼900 for AS proteins. Over 40% of the proteins in LocDB have multiple localization annotations providing a better platform for development of new multiple localization prediction methods with higher coverage and accuracy. Links to all referenced databases are provided. LocDB will be updated regularly by our group (available at: http://www.rostlab.org/services/locDB ).
Correlations of gene-to-gene co-expression and metabolite-to-metabolite co-accumulation calculated from large amounts of transcriptome and metabolome data are useful for uncovering unknown functions of genes, functional diversities of gene family members and regulatory mechanisms of metabolic pathway flows. Many databases and tools are available to interpret quantitative transcriptome and metabolome data, but there are only limited ones that connect correlation data to biological knowledge and can be utilized to find biological significance of it. We report here a new metabolic pathway database, KaPPA-View4 ( http://kpv.kazusa.or.jp/kpv4/ ), which is able to overlay gene-to-gene and/or metabolite-to-metabolite relationships as curves on a metabolic pathway map, or on a combination of up to four maps. This representation would help to discover, for example, novel functions of a transcription factor that regulates genes on a metabolic pathway. Pathway maps of the Kyoto Encyclopedia of Genes and Genomes (KEGG) and maps generated from their gene classifications are available at KaPPA-View4 KEGG version ( http://kpv.kazusa.or.jp/kpv4-kegg/ ). At present, gene co-expression data from the databases ATTED-II, COXPRESdb, CoP and MiBASE for human, mouse, rat, Arabidopsis , rice, tomato and other plants are available.
The Hymenoptera Genome Database (HGD) is a comprehensive model organism database that caters to the needs of scientists working on insect species of the order Hymenoptera. This system implements open-source software and relational databases providing access to curated data contributed by an extensive, active research community. HGD contains data from 9 different species across ∼200 million years in the phylogeny of Hymenoptera, allowing researchers to leverage genetic, genome sequence and gene expression data, as well as the biological knowledge of related model organisms. The availability of resources across an order greatly facilitates comparative genomics and enhances our understanding of the biology of agriculturally important Hymenoptera species through genomics. Curated data at HGD includes predicted and annotated gene sets supported with evidence tracks such as ESTs/cDNAs, small RNA sequences and GC composition domains. Data at HGD can be queried using genome browsers and/or BLAST/PSI-BLAST servers, and it may also be downloaded to perform local searches. We encourage the public to access and contribute data to HGD at: http://HymenopteraGenome.org .
The V oltage-gated K+C hannel D ata B ase (VKCDB) ( http://vkcdb.biology.ualberta.ca ) makes a comprehensive set of sequence data readily available for phylogenetic and comparative analysis. The current update contains 2063 entries for full-length or nearly full-length unique channel sequences from Bacteria (477), Archaea (18) and Eukaryotes (1568), an increase from 346 solely eukaryotic entries in the original release. In addition to protein sequences for channels, corresponding nucleotide sequences of the open reading frames corresponding to the amino acid sequences are now available and can be extracted in parallel with sets of protein sequences. Channels are categorized into subfamilies by phylogenetic analysis and by using hidden Markov model analyses. Although the raw database contains a number of fragmentary, duplicated, obsolete and non-channel sequences that were collected in early steps of data collection, the web interface will only return entries that have been validated as likely K + channels. The retrieval function of the web interface allows retrieval of entries that contain a substantial fraction of the core structural elements of VKCs, fragmentary entries, or both. The full database can be downloaded as either a MySQL dump or as an XML dump from the web site. We have now implemented automated updates at quarterly intervals.
Most proteins from higher organisms are known to be multi-domain proteins and contain substantial numbers of intrinsically disordered (ID) regions. To analyse such protein sequences, those from human for instance, we developed a special protein-structure-prediction pipeline and accumulated the products in the Structure Atlas of Human Genome (SAHG) database at http://bird.cbrc.jp/sahg . With the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and Smith–Waterman profile–profile alignment), global–local alignment methods (FORTE) and prediction tools for ID regions (POODLE-S) and homology modeling (MODELLER). Conformational changes of protein models upon ligand-binding were predicted by simultaneous modeling using templates of apo and holo forms. When there were no suitable templates for holo forms and the apo models were accurate, we prepared holo models using prediction methods for ligand-binding (eF-seek) and conformational change (the elastic network model and the linear response theory). Models are displayed as animated images. As of July 2010, SAHG contains 42 581 protein-domain models in approximately 24 900 unique human protein sequences from the RefSeq database. Annotation of models with functional information and links to other databases such as EzCatDB, InterPro or HPRD are also provided to facilitate understanding the protein structure-function relationships.
The Phospho.ELM resource ( http://phospho.elm.eu.org ) is a relational database designed to store in vivo and in vitro phosphorylation data extracted from the scientific literature and phosphoproteomic analyses. The resource has been actively developed for more than 7 years and currently comprises 42 574 serine, threonine and tyrosine non-redundant phosphorylation sites. Several new features have been implemented, such as structural disorder/order and accessibility information and a conservation score. Additionally, the conservation of the phosphosites can now be visualized directly on the multiple sequence alignment used for the score calculation. Finally, special emphasis has been put on linking to external resources such as interaction networks and other databases.
Pseudomonas is a metabolically-diverse genus of bacteria known for its flexibility and leading free living to pathogenic lifestyles in a wide range of hosts. The Pseudomonas Genome Database ( http://www.pseudomonas.com ) integrates completely-sequenced Pseudomonas genome sequences and their annotations with genome-scale, high-precision computational predictions and manually curated annotation updates. The latest release implements an ability to view sequence polymorphisms in P. aeruginosa PAO1 versus other reference strains, incomplete genomes and single gene sequences. This aids analysis of phenotypic variation between closely related isolates and strains, as well as wider population genomics and evolutionary studies. The wide range of tools for comparing Pseudomonas annotations and sequences now includes a strain-specific access point for viewing high precision computational predictions including updated, more accurate, protein subcellular localization and genomic island predictions. Views link to genome-scale experimental data as well as comparative genomics analyses that incorporate robust genera-geared methods for predicting and clustering orthologs. These analyses can be exploited for identifying putative essential and core Pseudomonas genes or identifying large-scale evolutionary events. The Pseudomonas Genome Database aims to provide a continually updated, high quality source of genome annotations, specifically tailored for Pseudomonas researchers, but using an approach that may be implemented for other genera-level research communities.
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp ) provides a nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis. The DDBJ collected and released 3 637 446 entries/2 272 231 889 bases between July 2009 and June 2010. A highlight of the released data was archive datasets from next-generation sequencing reads of Japanese rice cultivar, Koshihikari submitted by the National Institute of Agrobiological Sciences. In this period, we started a new archive for quantitative genomics data, the DDBJ Omics aRchive (DOR). The DOR stores quantitative data both from the microarray and high-throughput new sequencing platforms. Moreover, we improved the content of the DDBJ patent sequence, released a new submission tool of the DDBJ Sequence Read Archive (DRA) which archives massive raw sequencing reads, and enhanced a cloud computing-based analytical system from sequencing reads, the DDBJ Read Annotation Pipeline. In this article, we describe these new functions of the DDBJ databases and support tools.
The MIPS Fusarium graminearum Genome Database (FGDB) was established as a comprehensive genome database on one of the most devastating fungal plant pathogens of wheat, barley and maize. The current version of FGDB v3.1 provides information on the full manually revised gene set based on the Broad Institute assembly FG3 genome sequence. The results of gene prediction tools were integrated with the help of comparative data on related species to result in a set of 13.718 annotated protein coding genes. This rigorous approach involved adding or modifying gene models and represents a coding sequence gold standard for the genus Fusarium . The gene loci improvements results in 2461 genes which either are new or have different structures compared to the Broad Institute assembly 3 gene set. Moreover the database serves as a convenient entry point to explore expression data results and to obtain information on the Affymetrix GeneChip probe sets. The resource is accessible on http://mips.gsf.de/genre/proj/FGDB/ .
m icroRNA expression and sequence analysis database ( http://konulab.fen.bilkent.edu.tr/mirna/ ) (mESAdb) is a regularly updated database for the multivariate analysis of sequences and expression of microRNAs from multiple taxa. mESAdb is modular and has a user interface implemented in PHP and JavaScript and coupled with statistical analysis and visualization packages written for the R language. The database primarily comprises mature microRNA sequences and their target data, along with selected human, mouse and zebrafish expression data sets. mESAdb analysis modules allow (i) mining of microRNA expression data sets for subsets of microRNAs selected manually or by motif; (ii) pair-wise multivariate analysis of expression data sets within and between taxa; and (iii) association of microRNA subsets with annotation databases, HUGE Navigator, KEGG and GO. The use of existing and customized R packages facilitates future addition of data sets and analysis tools. Furthermore, the ability to upload and analyze user-specified data sets makes mESAdb an interactive and expandable analysis tool for microRNA sequence and expression data.
Stem cell biology has experienced explosive growth over the past decade as researchers attempt to generate therapeutically relevant cell types in the laboratory. Recapitulation of endogenous developmental trajectories is a dominant paradigm in the design of directed differentiation protocols, and attempts to guide stem cell differentiation are often based explicitly on knowledge of in vivo development. Therefore, when designing protocols, stem cell biologists rely heavily upon information including (i) cell type-specific gene expression profiles, (ii) anatomical and developmental relationships between cells and tissues and (iii) signals important for progression from progenitors to target cell types. Here, we present the Stem Cell Lineage Database (SCLD) ( http://scld.mcb.uconn.edu ) that aims to unify this information into a single resource where users can easily store and access information about cell type gene expression, cell lineage maps and stem cell differentiation protocols for both human and mouse stem cells and endogenous developmental lineages. By establishing the SCLD, we provide scientists with a centralized location to organize access and share data, dispute and resolve contentious relationships between cell types and within lineages, uncover discriminating cell type marker panels and design directed differentiation protocols.
By broad literature survey, we have developed a leaf senescence database (LSD, http://www.eplantsenescence.org/ ) that contains a total of 1145 senescence associated genes (SAGs) from 21 species. These SAGs were retrieved based on genetic, genomic, proteomic, physiological or other experimental evidence, and were classified into different categories according to their functions in leaf senescence or morphological phenotypes when mutated. We made extensive annotations for these SAGs by both manual and computational approaches, and users can either browse or search the database to obtain information including literatures, mutants, phenotypes, expression profiles, miRNA interactions, orthologs in other plants and cross links to other databases. We have also integrated a bioinformatics analysis platform WebLab into LSD, which allows users to perform extensive sequence analysis of their interested SAGs. The SAG sequences in LSD can also be downloaded readily for bulk analysis. We believe that the LSD contains the largest number of SAGs to date and represents the most comprehensive and informative plant senescence-related database, which would facilitate the systems biology research and comparative studies on plant aging.
miRGator is an integrated database of microRNA (miRNA)-associated gene expression, target prediction, disease association and genomic annotation, which aims to facilitate functional investigation of miRNAs. The recent version of miRGator v2.0 contains information about (i) human miRNA expression profiles under various experimental conditions, (ii) paired expression profiles of both mRNAs and miRNAs, (iii) gene expression profiles under miRNA-perturbation (e.g. miRNA knockout and overexpression), (iv) known/predicted miRNA targets and (v) miRNA-disease associations. In total, >8000 miRNA expression profiles, ∼300 miRNA-perturbed gene expression profiles and ∼2000 mRNA expression profiles are compiled with manually curated annotations on disease, tissue type and perturbation. By integrating these data sets, a series of novel associations (miRNA–miRNA, miRNA–disease and miRNA–target) is extracted via shared features. For example, differentially expressed genes (DEGs) after miRNA knockout were systematically compared against miRNA targets. Likewise, differentially expressed miRNAs (DEmiRs) were compared with disease-associated miRNAs. Additionally, miRNA expression and disease-phenotype profiles revealed miRNA pairs whose expression was regulated in parallel in various experimental and disease conditions. Complex associations are readily accessible using an interactive network visualization interface. The miRGator v2.0 serves as a reference database to investigate miRNA expression and function ( http://miRGator.kobic.re.kr ).
The National Institute of Agrobiological Sciences (NIAS) is implementing the NIAS Genebank Project for conservation and promotion of agrobiological genetic resources to contribute to the development and utilization of agriculture and agricultural products. The project’s databases (NIASGBdb; http://www.gene.affrc.go.jp/databases_en.php ) consist of a genetic resource database and a plant diseases database, linked by a web retrieval database. The genetic resources database has plant and microorganism search systems to provide information on research materials, including passport and evaluation data for genetic resources with the desired properties. To facilitate genetic diversity research, several NIAS Core Collections have been developed. The NIAS Rice ( Oryza sativa ) Core Collection of Japanese Landraces contains information on simple sequence repeat (SSR) polymorphisms. SSR marker information for azuki bean ( Vigna angularis ) and black gram ( V. mungo ) and DNA sequence data from some selected Japanese strains of the genus Fusarium are also available. A database of plant diseases in Japan has been developed based on the listing of common names of plant diseases compiled by the Phytopathological Society of Japan. Relevant plant and microorganism genetic resources are associated with the plant disease names by the web retrieval database and can be obtained from the NIAS Genebank for research or educational purposes.
Phospho3D is a database of three-dimensional (3D) structures of phosphorylation sites (P-sites) derived from the Phospho.ELM database, which also collects information on the residues surrounding the P-site in space (3D zones ). The database also provides the results of a large-scale structural comparison of the 3D zones versus a representative dataset of structures, thus associating to each P-site a number of structurally similar sites. The new version of Phospho3D presents an 11-fold increase in the number of 3D sites and incorporates several additional features, including new structural descriptors, the possibility of selecting non-redundant sets of 3D structures and the availability for download of non-redundant sets of structurally annotated P-sites. Moreover, it features P3Dscan, a new functionality that allows the user to submit a protein structure and scan it against the 3D zones collected in the Phospho3D database. Phospho3D version 2.0 is available at: http://www.phospho3d.org/ .
Entrez Gene ( http://www.ncbi.nlm.nih.gov/gene ) is National Center for Biotechnology Information (NCBI)’s database for gene-specific information. Entrez Gene maintains records from genomes which have been completely sequenced, which have an active research community to submit gene-specific information, or which are scheduled for intense sequence analysis. The content represents the integration of curation and automated processing from NCBI’s Reference Sequence project (RefSeq), collaborating model organism databases, consortia such as Gene Ontology and other databases within NCBI. Records in Entrez Gene are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, genomic location, gene products and their attributes, markers, phenotypes and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI’s Entrez system, via NCBI’s Entrez programming utilities (E-Utilities) and for bulk transfer by FTP.
The Protein–RNA Interface Database (PRIDB) is a comprehensive database of protein–RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein–RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein–RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein–RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein–RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB .
It is 12 years since the IMGT/HLA database was first released, providing the HLA community with a searchable repository of highly curated HLA sequences. The HLA complex is located within the 6p21.3 region of human chromosome 6 and contains more than 220 genes of diverse function. Many of the genes encode proteins of the immune system and are highly polymorphic. The naming of these HLA genes and alleles and their quality control is the responsibility of the WHO Nomenclature Committee for Factors of the HLA System. Through the work of the HLA Informatics Group and in collaboration with the European Bioinformatics Institute, we are able to provide public access to this data through the web site http://www.ebi.ac.uk/imgt/hla/ . Regular updates to the web site ensure that new and confirmatory sequences are dispersed to the HLA community, and the wider research and clinical communities.
This article introduces the second release of the Gypsy Database of Mobile Genetic Elements (GyDB 2.0): a research project devoted to the evolutionary dynamics of viruses and transposable elements based on their phylogenetic classification (per lineage and protein domain). The Gypsy Database (GyDB) is a long-term project that is continuously progressing, and that owing to the high molecular diversity of mobile elements requires to be completed in several stages. GyDB 2.0 has been powered with a wiki to allow other researchers participate in the project. The current database stage and scope are long terminal repeats (LTR) retroelements and relatives. GyDB 2.0 is an update based on the analysis of Ty3/Gypsy , Retroviridae , Ty1/Copia and Bel/Pao LTR retroelements and the Caulimoviridae pararetroviruses of plants. Among other features, in terms of the aforementioned topics, this update adds: (i) a variety of descriptions and reviews distributed in multiple web pages; (ii) protein-based phylogenies, where phylogenetic levels are assigned to distinct classified elements; (iii) a collection of multiple alignments, lineage-specific hidden Markov models and consensus sequences, called GyDB collection; (iv) updated RefSeq databases and BLAST and HMM servers to facilitate sequence characterization of new LTR retroelement and caulimovirus queries; and (v) a bibliographic server. GyDB 2.0 is available at http://gydb.org.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
MicroRNAs (miRNAs), one type of small RNAs (sRNAs) in plants, play an essential role in gene regulation. Several miRNA databases were established; however, successively generated new datasets need to be collected, organized and analyzed. To this end, we have constructed a plant miRNA knowledge base (PmiRKB) that provides four major functional modules. In the ‘SNP’ module, single nucleotide polymorphism (SNP) data of seven Arabidopsis ( Arabidopsis thaliana ) accessions and 21 rice ( Oryza sativa ) subspecies were collected to inspect the SNPs within pre-miRNAs (precursor microRNAs) and miRNA—target RNA duplexes. Depending on their locations, SNPs can affect the secondary structures of pre-miRNAs, or interactions between miRNAs and their targets. A second module, ‘Pri-miR’, can be used to investigate the tissue-specific, transcriptional contexts of pre- and pri-miRNAs (primary microRNAs), based on massively parallel signature sequencing data. The third module, ‘MiR–Tar’, was designed to validate thousands of miRNA—target pairs by using parallel analysis of RNA end (PARE) data. Correspondingly, the fourth module, ‘Self-reg’, also used PARE data to investigate the metabolism of miRNA precursors, including precursor processing and miRNA- or miRNA*-mediated self-regulation effects on their host precursors. PmiRKB can be freely accessed at http://bis.zju.edu.cn/pmirkb/ .
The primary mission of Universal Protein Resource (UniProt) is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org .
The information of protein targets and small molecule has been highly valued by biomedical and pharmaceutical research. Several protein target databases are available online for FDA-approved drugs as well as the promising precursors that have largely facilitated the mechanistic study and subsequent research for drug discovery. However, those related resources regarding to herbal active ingredients, although being unusually valued as a precious resource for new drug development, is rarely found. In this article, a comprehensive and fully curated database for Herb Ingredients’ Targets (HIT, http://lifecenter.sgst.cn/hit/) has been constructed to complement above resources. Those herbal ingredients with protein target information were carefully curated. The molecular target information involves those proteins being directly/indirectly activated/inhibited, protein binders and enzymes whose substrates or products are those compounds. Those up/down regulated genes are also included under the treatment of individual ingredients. In addition, the experimental condition, observed bioactivity and various references are provided as well for user's reference. Derived from more than 3250 literatures, it currently contains 5208 entries about 1301 known protein targets (221 of them are described as direct targets) affected by 586 herbal compounds from more than 1300 reputable Chinese herbs, overlapping with 280 therapeutic targets from Therapeutic Targets Database (TTD), and 445 protein targets from DrugBank corresponding to 1488 drug agents. The database can be queried via keyword search or similarity search. Crosslinks have been made to TTD, DrugBank, KEGG, PDB, Uniprot, Pfam, NCBI, TCM-ID and other databases.
The RCSB Protein Data Bank (RCSB PDB) web site ( http://www.pdb.org ) has been redesigned to increase usability and to cater to a larger and more diverse user base. This article describes key enhancements and new features that fall into the following categories: (i) query and analysis tools for chemical structure searching, query refinement, tabulation and export of query results; (ii) web site customization and new structure alerts; (iii) pair-wise and representative protein structure alignments; (iv) visualization of large assemblies; (v) integration of structural data with the open access literature and binding affinity data; and (vi) web services and web widgets to facilitate integration of PDB data and tools with other resources. These improvements enable a range of new possibilities to analyze and understand structure data. The next generation of the RCSB PDB web site, as described here, provides a rich resource for research and education.
The procedure of drug approval is time-consuming, costly and risky. Accidental findings regarding multi-specificity of approved drugs led to block-busters in new indication areas. Therefore, the interest in systematically elucidating new areas of application for known drugs is rising. Furthermore, the knowledge, understanding and prediction of so-called off-target effects allow a rational approach to the understanding of side-effects. With PROMISCUOUS we provide an exhaustive set of drugs (25 000), including withdrawn or experimental drugs, annotated with drug–protein and protein–protein relationships (21 500/104 000) compiled from public resources via text and data mining including manual curation. Measures of structural similarity for drugs as well as known side-effects can be easily connected to protein–protein interactions to establish and analyse networks responsible for multi-pharmacology. This network-based approach can provide a starting point for drug-repositioning. PROMISCUOUS is publicly available at http://bioinformatics.charite.de/promiscuous .
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov .
PREX ( http://www.csb.wfu.edu/prex/ ) is a database of currently 3516 peroxiredoxin (Prx or PRDX) protein sequences unambiguously classified into one of six distinct subfamilies. Peroxiredoxins are a diverse and ubiquitous family of highly expressed, cysteine-dependent peroxidases that are important for antioxidant defense and for the regulation of cell signaling pathways in eukaryotes. Subfamily members were identified using the Deacon Active Site Profiler (DASP) bioinformatics tool to focus in on functionally relevant sequence fragments surrounding key residues required for protein activity. Searches of this database can be conducted by protein annotation, accession number, PDB ID, organism name or protein sequence. Output includes the subfamily to which each classified Prx belongs, accession and GI numbers, genus and species and the functional site signature used for classification. The query sequence is also presented aligned with a select group of Prxs for manual evaluation and interpretation by the user. A synopsis of the characteristics of members of each subfamily is also provided along with pertinent references.
The REDfly database of Drosophila transcriptional cis -regulatory elements provides the broadest and most comprehensive available resource for experimentally validated cis -regulatory modules and transcription factor binding sites among the metazoa. The third major release of the database extends the utility of REDfly as a powerful tool for both computational and experimental studies of transcription regulation. REDfly v3.0 includes the introduction of new data classes to expand the types of regulatory elements annotated in the database along with a roughly 40% increase in the number of records. A completely redesigned interface improves access for casual and power users alike; among other features it now automatically provides graphical views of the genome, displays images of reporter gene expression and implements improved capabilities for database searching and results filtering. REDfly is freely accessible at http://redfly.ccr.buffalo.edu .
The growing availability of complete genomic sequences from diverse species has brought about the need to scale up phylogenomic analyses, including the reconstruction of large collections of phylogenetic trees. Here, we present the third version of PhylomeDB ( http://phylomeDB.org ), a public database for genome-wide collections of gene phylogenies (phylomes). Currently, PhylomeDB is the largest phylogenetic repository and hosts 17 phylomes, comprising 416 093 trees and 165 840 alignments. It is also a major source for phylogeny-based orthology and paralogy predictions, covering about 5 million proteins in 717 fully-sequenced genomes. For each protein-coding gene in a seed genome, the database provides original and processed alignments, phylogenetic trees derived from various methods and phylogeny-based predictions of orthology and paralogy relationships. The new version of phylomeDB has been extended with novel data access and visualization features, including the possibility of programmatic access. Available seed species include model organisms such as human, yeast, Escherichia coli or Arabidopsis thaliana , but also alternative model species such as the human pathogen Candida albicans , or the pea aphid Acyrtosiphon pisum . Finally, PhylomeDB is currently being used by several genome sequencing projects that couple the genome annotation process with the reconstruction of the corresponding phylome, a strategy that provides relevant evolutionary insights.
The Pancreatic Expression database (PED, http://www.pancreasexpression.org ) has established itself as the main repository for pancreatic-derived -omics data. For the past 3 years, its data content and access have increased substantially. Here we describe several of its new and improved features, such as data content, which now includes over 60 000 measurements derived from transcriptomics, proteomics, genomics and miRNA profiles from various pancreas-centred reports on a broad range of specimen and experimental types. We also illustrate the capabilities of its interface, which allows integrative queries that can combine PED data with a growing number of biological resources such as NCBI, Ensembl, UniProt and Reactome. Thus, PED is capable of retrieving and integrating different types of -omics, annotations and clinical data. We also focus on the importance of data sharing and interoperability in the cancer field, and the integration of PED into the International Cancer Genome Consortium (ICGC) data portal.
Now in its 10th year, the Gramene database ( http://www.gramene.org ) has grown from its primary focus on rice, the first fully-sequenced grass genome, to become a resource for major model and crop plants including Arabidopsis , Brachypodium , maize, sorghum, poplar and grape in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies and a fully-structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST and a public MySQL database. Twice a year, Gramene releases a major build of the database and makes interim releases to correct errors or to make important updates to software and/or data.
A vast number of sweet tasting molecules are known, encompassing small compounds, carbohydrates, d -amino acids and large proteins. Carbohydrates play a particularly big role in human diet. The replacement of sugars in food with artificial sweeteners is common and is a general approach to prevent cavities, obesity and associated diseases such as diabetes and hyperlipidemia. Knowledge about the molecular basis of taste may reveal new strategies to overcome diet-induced diseases. In this context, the design of safe, low-calorie sweeteners is particularly important. Here, we provide a comprehensive collection of carbohydrates, artificial sweeteners and other sweet tasting agents like proteins and peptides. Additionally, structural information and properties such as number of calories, therapeutic annotations and a sweetness-index are stored in SuperSweet. Currently, the database consists of more than 8000 sweet molecules. Moreover, the database provides a modeled 3D structure of the sweet taste receptor and binding poses of the small sweet molecules. These binding poses provide hints for the design of new sweeteners. A user-friendly graphical interface allows similarity searching, visualization of docked sweeteners into the receptor etc. A sweetener classification tree and browsing features allow quick requests to be made to the database. The database is freely available at: http://bioinformatics.charite.de/sweet/ .
The stability, localization and translation rate of mRNAs are regulated by a multitude of RNA-binding proteins (RBPs) that find their targets directly or with the help of guide RNAs. Among the experimental methods for mapping RBP binding sites, cross-linking and immunoprecipitation (CLIP) coupled with deep sequencing provides transcriptome-wide coverage as well as high resolution. However, partly due to their vast volume, the data that were so far generated in CLIP experiments have not been put in a form that enables fast and interactive exploration of binding sites. To address this need, we have developed the CLIPZ database and analysis environment. Binding site data for RBPs such as Argonaute 1-4, Insulin-like growth factor II mRNA-binding protein 1-3, TNRC6 proteins A-C, Pumilio 2, Quaking and Polypyrimidine tract binding protein can be visualized at the level of the genome and of individual transcripts. Individual users can upload their own sequence data sets while being able to limit the access to these data to specific users, and analyses of the public and private data sets can be performed interactively. CLIPZ, available at http://www.clipz.unibas.ch , aims to provide an open access repository of information for post-transcriptional regulatory elements.
COMBREX ( http://combrex.bu.edu ) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the interaction of environmental chemicals with gene products, and their effects on human health. Biocurators at CTD manually curate a triad of chemical–gene, chemical–disease and gene–disease relationships from the literature. These core data are then integrated to construct chemical–gene–disease networks and to predict many novel relationships using different types of associated data. Since 2009, we dramatically increased the content of CTD to 1.4 million chemical–gene–disease data points and added many features, statistical analyses and analytical tools, including GeneComps and ChemComps (to find comparable genes and chemicals that share toxicogenomic profiles), enriched Gene Ontology terms associated with chemicals, statistically ranked chemical–disease inferences, Venn diagram tools to discover overlapping and unique attributes of any set of chemicals, genes or disease, and enhanced gene pathway data content, among other features. Together, this wealth of expanded chemical–gene–disease data continues to help users generate testable hypotheses about the molecular mechanisms of environmental diseases. CTD is freely available at http://ctd.mdibl.org .
The molecular diversity of viruses complicates the interpretation of viral genomic and proteomic data. To make sense of viral gene functions, investigators must be familiar with the virus host range, replication cycle and virion structure. Our aim is to provide a comprehensive resource bridging together textbook knowledge with genomic and proteomic sequences. ViralZone web resource ( www.expasy.org/viralzone/ ) provides fact sheets on all known virus families/genera with easy access to sequence data. A selection of reference strains (RefStrain) provides annotated standards to circumvent the exponential increase of virus sequences. Moreover ViralZone offers a complete set of detailed and accurate virion pictures.
The Rfam database aims to catalogue non-coding RNAs through the use of sequence alignments and statistical profile models known as covariance models. In this contribution, we discuss the pros and cons of using the online encyclopedia, Wikipedia, as a source of community‐derived annotation. We discuss the addition of groupings of related RNA families into clans and new developments to the website. Rfam is available on the Web at http://rfam.sanger.ac.uk.
Domain Interaction MAp (DIMA, available at http://webclu.bio.wzw.tum.de/dima) is a database of predicted and known interactions between protein domains. It integrates 5807 structurally known interactions imported from the iPfam and 3did databases and 46 900 domain interactions predicted by four computational methods: domain phylogenetic profiling, domain pair exclusion algorithm correlated mutations and domain interaction prediction in a discriminative way. Additionally predictions are filtered to exclude those domain pairs that are reported as non-interacting by the Negatome database. The DIMA Web site allows to calculate domain interaction networks either for a domain of interest or for entire organisms, and to explore them interactively using the Flash-based Cytoscape Web software.
The Protein Data Bank (PDB) is the world-wide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds one entry, if possible, for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on the structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. The PDBFINDER holds easy to parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics, to cancer biology and protein design.
The ArrayExpress Archive ( http://www.ebi.ac.uk/arrayexpress ) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.
MicroRNAs (miRNAs) represent an important class of small non-coding RNAs (sRNAs) that regulate gene expression by targeting messenger RNAs. However, assigning miRNAs to their regulatory target genes remains technically challenging. Recently, high-throughput CLIP-Seq and degradome sequencing (Degradome-Seq) methods have been applied to identify the sites of Argonaute interaction and miRNA cleavage sites, respectively. In this study, we introduce a novel database, starBase ( s RNA tar get Base ), which we have developed to facilitate the comprehensive exploration of miRNA–target interaction maps from CLIP-Seq and Degradome-Seq data. The current version includes high-throughput sequencing data generated from 21 CLIP-Seq and 10 Degradome-Seq experiments from six organisms . By analyzing millions of mapped CLIP-Seq and Degradome-Seq reads, we identified ∼1 million Ago-binding clusters and ∼2 million cleaved target clusters in animals and plants, respectively. Analyses of these clusters, and of target sites predicted by 6 miRNA target prediction programs, resulted in our identification of approximately 400 000 and approximately 66 000 miRNA-target regulatory relationships from CLIP-Seq and Degradome-Seq data, respectively. Furthermore, two web servers were provided to discover novel miRNA target sites from CLIP-Seq and Degradome-Seq data. Our web implementation supports diverse query types and exploration of common targets, gene ontologies and pathways. The starBase is available at http://starbase.sysu.edu.cn/ .
CATH version 3.3 (class, architecture, topology, homology) contains 128 688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr ( http://lucene.apache.org/solr/ ). The CATH v3.4 webpages have been built using the Catalyst web framework.