
Database of known human proteins


Is there an up-to-date database of known human proteins that is easily accessible using Python libraries?


GenBank includes human proteins and provides APIs for constructing HTTP requests to make searches. Examples are given in Perl, but if you prefer another programming or scripting language you are expected to write your own:

Equivalent HTTP requests can be constructed in many modern programming languages; all that is required is the ability to create and post an HTTP request.

This approach has always served me well.
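As a minimal sketch of what such an HTTP request could look like in Python, the snippet below only builds a search URL. It assumes the NCBI E-utilities `esearch.fcgi` endpoint and the `db`, `term`, `retmax` and `retmode` parameters; the helper name `build_esearch_url` is hypothetical, and the details should be checked against the current NCBI documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="protein", retmax=20):
    """Build (but do not send) a search URL for the given Entrez query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Example: search the protein database for human entries.
url = build_esearch_url("Homo sapiens[Organism]")
print(url)
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`.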

UniProt seems similar in this respect.

Biopython includes a module for Swiss-Prot records (Bio.SwissProt), which may be of interest. I don't know if this is the library you are referring to; you don't say, and I don't use Python.


The human proteome can be retrieved from the UniProt Knowledgebase (http://www.uniprot.org/help/human_proteome).

All data on the UniProt website is accessible programmatically via a REST API. The documentation (http://www.uniprot.org/help/programmatic_access) also includes a few Python code examples (among other programming languages).
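A minimal Python sketch of this kind of access is given below. It assumes the endpoint `https://rest.uniprot.org/uniprotkb/search` and the query fields `organism_id` and `reviewed`; these are not taken from the answer above and should be verified against the UniProt documentation. The helper names are hypothetical. The code only builds the URL and parses FASTA text, so it runs without network access.

```python
from urllib.parse import urlencode

# Assumed REST endpoint; check the UniProt programmatic access help page.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_human_query_url(reviewed_only=True, fmt="fasta", size=25):
    """Build a search URL for human entries (NCBI taxonomy ID 9606)."""
    query = "organism_id:9606"
    if reviewed_only:
        query += " AND reviewed:true"
    params = {"query": query, "format": fmt, "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)


def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


url = build_human_query_url()
# Hypothetical FASTA text standing in for a downloaded response.
sample = ">sp|P1|A\nMKT\nLLV\n>sp|P2|B\nGG\n"
print(url)
print(parse_fasta(sample))
```

To download real data, the URL could be opened with `urllib.request.urlopen` and the decoded response passed to `parse_fasta`.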

Please don't hesitate to contact the UniProt helpdesk if you have any additional questions.




Rationale

Genome sequencing has allowed scientists to identify most of the genes encoded in each organism. The function of many, typically 50%, of translated proteins can be inferred from sequence comparison with previously characterized sequences. However, the assignment of function by homology gives only a partial understanding of a protein's role within a cell. A more complete understanding of protein function requires the identification of interacting partners: interacting subunits if the protein is a component of a molecular complex, and pathway members if the protein participates in a metabolic or signal transduction pathway [1]. Knowledge of these relationships, which we will call 'functional linkages', is a prerequisite for understanding physiology and pathology.

An enhanced understanding of the physical and functional relationships between proteins has recently become attainable through the use of non-homology-based methods [2, 3]. These methods infer functional linkage between proteins by identifying pairs of nonhomologous proteins that coevolve. Evolutionary pressure dictates that pairs of proteins that function in concert are often both present or both absent within genomes (phylogenetic profiles method), tend to be coded nearby in multiple genomes (gene neighbors method), might be fused into a single protein in some organisms (Rosetta Stone method) or are components of an operon (gene cluster method). In contrast, proteins not related by function need not appear together or exhibit spatial proximity in the genome. The complete sequencing of over 100 genomes provides a rich medium from which to infer protein linkages and function by analyzing pairwise properties using these methods. Protein functional links may also be inferred from automated text mining. Here we use a simple algorithm (TextLinks) to identify proteins that are often found together in scientific abstracts [4].
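The idea behind the phylogenetic profiles method can be illustrated with a small sketch. The scoring actually used by Prolinks is not described in this excerpt, so the simple agreement fraction below is an assumption chosen for illustration, and the profiles are invented.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two proteins are both present or both absent.

    Each profile is a sequence of 0/1 flags, one per genome, in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return matches / len(profile_a)


# Hypothetical presence/absence profiles across six genomes.
protein_x = [1, 1, 0, 1, 0, 0]
protein_y = [1, 1, 0, 1, 0, 1]  # mostly co-occurs with protein_x
protein_z = [0, 0, 1, 0, 1, 1]  # never co-occurs with protein_x

print(profile_agreement(protein_x, protein_y))
print(profile_agreement(protein_x, protein_z))
```

A high agreement suggests a possible functional linkage; a real implementation would also need to correct for how often such agreement arises by chance.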

In this paper we describe a new publicly available database - Prolinks - and the associated Proteome Navigator tool that combine pairwise associations generated from each of the inference methods mentioned above. This tool allows the user to explore interactively the protein links generated for 83 microbial organisms. Sequence, sequence homology, and public annotation, including the Kyoto Encyclopedia of Genes and Genomes (KEGG), Clusters of Orthologous Groups (COG) and National Center for Biotechnology Information (NCBI) descriptions, are available for each protein. The network of predicted associations is tunable, based on an adjustable confidence limit. The network has 'clickable' nodes that permit rapid navigation. Although this is not the first database that analyzes protein coevolution, it is in many respects distinct from existing tools [5, 6]. In the Discussion section we analyze these differences. We also show how the Proteome Navigator may be used to recover links between functionally related proteins and between proteins contained within protein complexes. In short, this database extends the value of existing tools for genome annotation.


Results and discussion

Prediction algorithm

In order to identify residues in a protein that are involved in a protein interaction, we devised a method that combines structural and experimental information. Using the iPfam [16] database of known interacting domains, we first select domain regions on all target proteins that have a homologous structure including interaction partners in the PDB [17] (see Materials and methods). We then select positions that form residue-to-residue contacts between distinct polypeptide chains in these structural templates and record the corresponding positions in the target proteins as potentially interacting residues.
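The selection of residue-to-residue contacts between distinct chains could be sketched as follows. The 4.0 Å distance cutoff, the input layout and the function name are assumptions for illustration; iPfam uses its own contact definition, which is not given in this excerpt.

```python
import math


def interchain_contacts(atoms, cutoff=4.0):
    """Return residue pairs from different chains with any atom pair within cutoff.

    atoms: list of (chain_id, residue_number, (x, y, z)) tuples.
    cutoff: distance in angstroms (assumed value, not from the source).
    """
    contacts = set()
    for i, (chain_a, res_a, xyz_a) in enumerate(atoms):
        for chain_b, res_b, xyz_b in atoms[i + 1:]:
            if chain_a == chain_b:
                continue  # only contacts between distinct polypeptide chains
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts.add(((chain_a, res_a), (chain_b, res_b)))
    return contacts


# Hypothetical coordinates: only residue 73 of chain A is near chain B.
atoms = [
    ("A", 73, (0.0, 0.0, 0.0)),
    ("A", 74, (10.0, 0.0, 0.0)),
    ("B", 12, (3.0, 0.0, 0.0)),
    ("B", 40, (20.0, 0.0, 0.0)),
]
print(interchain_contacts(atoms))
```

The positions found this way in the template would then be mapped onto the aligned target protein.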

We needed to choose a scoring function that discriminates between residues that are really involved and crucial for an interaction and those that are not. For this purpose, we tested the effect of two different variables on prediction accuracy.

Percent sequence identity with structural template

There is a well known correlation between sequence similarity and structural similarity [18], which also extends to interacting domains [19]. An interaction is more likely to be conserved and to display similar topology when sequence similarity is high. Although we find that percentage identity by itself is not a good predictor of the importance of a residue for an interaction, it can improve the prediction accuracy slightly when combined with another threshold (Figure 1).

Conservation difference between wild-type and mutated residues. Histogram of conservation of wild-type and mutated residues. Triangles denote the residue-conservation frequency of all residues in disease protein regions that map to an iPfam domain. Circles show the conservation of the pathogenic alleles (see Materials and methods). Trendlines are added to delineate normal distributions.

Conservation of mutated residues

For all identified interaction-related mutations, we calculated a conservation score (see Materials and methods). This score reflects the frequency with which an amino acid occurs at a given position in a protein family, relative to a universal background distribution. If we look at the frequency of conservation scores over all wild-type compared to all mutated alleles (Figure 1), we find that the scores for both wild-type and mutated alleles seem to follow a normal distribution. However, the latter exhibit markedly smaller average conservation scores (2.4 for wild-type versus -2.2 for mutated alleles; Figure 2). Thus, a residue that is found in the wild type of a protein will generally be more conserved than the residue found in the mutated version [20]. We therefore tested whether conservation could be used as an indicator of the functional importance of a residue, even for surface-exposed residues like the ones under investigation here.
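One common way to express such a score is a log-odds ratio of the observed column frequency against a background frequency. The exact formula is given in Materials and methods, which is not part of this excerpt, so the sketch below is an assumed stand-in; the background frequencies, the pseudocount and the alignment column are illustrative values.

```python
import math


def conservation_score(column, residue, background, pseudocount=1.0):
    """Log2 odds of a residue at an alignment column versus background.

    column: string of residues observed at one position in a protein family.
    background: dict of residue -> background frequency (assumed values).
    """
    count = column.count(residue)
    # The pseudocount keeps the score finite for residues never observed.
    freq = (count + pseudocount * background[residue]) / (len(column) + pseudocount)
    return math.log2(freq / background[residue])


# Hypothetical column dominated by tryptophan, with illustrative backgrounds.
column = "WWWWWWWG"
background = {"W": 0.013, "G": 0.07, "A": 0.08}

score_w = conservation_score(column, "W", background)
score_g = conservation_score(column, "G", background)
score_a = conservation_score(column, "A", background)
print(score_w, score_g, score_a)
```

Positive scores mark residues that occur more often than expected at that position; negative scores mark residues that are rarer than the background.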

ROC curves calculated on a set of alanine scanning experiments. The red line represents the performance of our algorithm when changing only the conservation threshold, applying no percentage identity cutoff. The green line shows the performance using only percentage identity as a threshold. The blue line reflects performance using conservation as threshold, but applying a 30% sequence identity filter. Confidence intervals were calculated using the Statistics::ROC Perl module [59].

Prediction accuracy

To estimate the accuracy of our prediction approach, we used the ASEdb database of alanine scanning energetics experiments in protein binding [21] as a 'gold-standard' test set (see Materials and methods). In such an alanine scan, residues in the binding interface of a protein are mutated to alanine by site-directed mutagenesis [22]. The difference in binding free energy (ΔΔG) between wild-type ($\Delta G_0$) and mutated ($\Delta G_{A,i}$) protein describes the contribution of a particular residue at position i to the total binding free energy:

$$\Delta\Delta G_i = \Delta G_0 - \Delta G_{A,i}$$

We assessed how well our method could predict residues with a large change in ΔG upon mutation. Randles et al. [23] showed that for two model proteins, ΔΔG was correlated with the severity of disease. They show that even changes <2 kcal/mol could cause disruption of protein binding. Here, we defined a residue as correctly identified (true positive) if ΔΔG > 2.5. This threshold is also used in another recent publication [24]. Residues below this threshold were considered neutral (false positive). This criterion might in itself cause some 'false-negatives', that is, some residues might be crucial for the function of the protein despite a measured ΔΔG < 2.5, but we considered a conservative threshold to be preferable.
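The labelling rule above can be written down in a few lines. This is only a sketch: the residue names and ΔΔG values are invented, and the unit is taken to be kcal/mol as in the preceding sentences.

```python
DDG_THRESHOLD = 2.5  # threshold from the text; unit assumed to be kcal/mol


def count_true_false_positives(predicted_ddg, threshold=DDG_THRESHOLD):
    """Count predicted residues above (true positive) and not above (false positive) the threshold.

    predicted_ddg: dict of residue label -> measured ddG for residues
    that the method predicted to be interaction-relevant.
    """
    true_pos = sum(1 for ddg in predicted_ddg.values() if ddg > threshold)
    false_pos = len(predicted_ddg) - true_pos
    return true_pos, false_pos


# Hypothetical alanine-scan measurements for four predicted residues.
predicted = {"R43": 3.1, "K15": 0.4, "W104": 2.6, "E58": 2.5}
print(count_true_false_positives(predicted))
```

Note that a value exactly at the threshold counts as a false positive here, because the criterion is a strict inequality.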

Figure 1 shows the receiver operator characteristic (ROC) curve [25], a plot of the frequency of true positive over the frequency of false positive predictions for a given algorithm. From left to right, points mark decreasing score thresholds, until no thresholds are applied any more and both true positive as well as false positive rates reach 100% in the upper right corner.
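The construction of such a curve can be sketched as below: predictions are sorted by decreasing score and each one moves the curve up (true positive) or right (false positive). The scores and labels are invented, ties between scores are not treated specially, and this is not the Statistics::ROC implementation used for the confidence intervals.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for decreasing thresholds."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Lowering the threshold admits one more prediction at each step.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Hypothetical conservation scores and gold-standard labels.
scores = [5.6, 3.3, 2.8, 1.2, -1.8, -2.2]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
print(points)
```

The final point is always (1.0, 1.0), corresponding to the upper right corner where no threshold is applied.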

The green and red lines represent the performance of our algorithm using either percentage sequence identity (green) or residue conservation (red) to score the predictions. With both scoring methods, our method retrieves more true positives than would be expected by chance. The conservation threshold, however, is far superior in distinguishing true from false positives. At a false positive rate of ≈20%, we can achieve a true positive rate of almost 60%. These benchmark results underline that we are able to identify interaction disruptive mutations with reasonable confidence. The real accuracy could be even higher than measured here, considering the conservative ΔΔG cutoff we chose to define a true positive residue.

We also tested a combination of the two measures, represented by a blue line in Figure 1. In this case, the residue conservation threshold was combined with a fixed 30% sequence identity cutoff. The performance improves slightly in the low false-positive region, yielding a true positive rate of 40% at a false positive rate of only 7%. In accordance with this benchmark, we decided on a residue conservation threshold of >2 in combination with a 30% sequence identity cutoff for all subsequent analyses. In order to make our algorithm generally applicable, two more filters were applied: target proteins had to have a homologous sequence (BLAST e-value of less than $10^{-6}$) in one of four major repositories for protein interaction information (IntAct [26], BioGRID [27], MPact [28] or HPRD [29]). Subsequently, target proteins were excluded if no homologous experimental interaction involved both interacting iPfam domains that were seen in the structural template.
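The combined filters can be summarized in a short sketch. The record layout and field names are hypothetical, the e-values are invented, and treating the 30% identity cutoff as inclusive is an assumption; the conservation and identity values are the ones quoted for the example mutations later in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    residue: str
    conservation: float           # conservation score of the wild-type residue
    percent_identity: float       # identity to the structural template
    best_evalue: float            # best BLAST e-value against interaction databases
    domain_pair_confirmed: bool   # homologous interaction involves both iPfam domains


def passes_filters(c, min_conservation=2.0, min_identity=30.0, max_evalue=1e-6):
    """Apply the four filters described in the text to one candidate residue."""
    return (
        c.conservation > min_conservation
        and c.percent_identity >= min_identity  # inclusive cutoff is an assumption
        and c.best_evalue < max_evalue
        and c.domain_pair_confirmed
    )


candidates = [
    Candidate("Trp73", 5.62, 46.8, 1e-40, True),
    Candidate("Ser128", 3.31, 81.0, 1e-60, True),
    Candidate("LowIdentity", 4.10, 22.0, 1e-20, True),   # hypothetical, fails identity
    Candidate("NoInteraction", 3.00, 55.0, 1e-30, False),  # hypothetical, fails last filter
]
kept = [c.residue for c in candidates if passes_filters(c)]
print(kept)
```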

Application to disease mutations

We applied the prediction algorithm as described above to all single-residue disease mutations extracted from OMIM and UniProt (see Materials and methods). In the case of disease mutations, the disruptive nature of a residue mutation is already known. It is unclear, however, whether an interaction is in fact taking place and is likely to be mediated by the domain in question. As described above, mutations were therefore reported only if the disease-associated protein has a close homolog that has been proven experimentally to interact with a protein that contains the same binding partner domain as seen in the PDB structure the interaction was modeled from (the 'structural template'). For example, [OMIM:+264900.0011] is a Ser576Arg mutation of the human coagulation factor XI (PTA). The residue is part of a trypsin domain and seen to interact with Ecotin. However, the interaction between PTA and Ecotin is not yet recorded in any interaction database; therefore, the mutation cannot be included in our predictions.

Using these criteria, 1,428 mutations from 264 proteins were predicted to be interaction-related (Figure 3). The full list is available in Additional data file 1. In total, we collected 25,322 mutations from OMIM and UniProt. This means that approximately 4% of all mutations could be linked to a protein interaction.

Data integration steps for interacting residue prediction. Schematic outline of data integration for the prediction of interacting residues. Mutations from OMIM and UniProt for which a residue in a homologous structure is involved in an interaction are selected. This set is restricted further by searching for homologous proteins with known interactions, taken from a range of protein interaction databases. We require that the homologous interacting proteins contain the same pair of Pfam domains that was observed in the structural template. This results in a set of 1,428 interaction-related mutations.

Amongst these mutations, 454 mapped to a structure that exhibits an interaction between different proteins (hetero-interaction), while 1,094 mutations mapped to a structure with an interaction between two identical proteins (homo-interaction). This means that 120 mutations are found in structures of both homo- and hetero-interactions. The large proportion of homo-interactions can be explained by the overrepresentation of homo-interactions in the structural templates set: 70% of all distinct protein pairs in iPfam are homo-interactions, which is in accordance with recent findings that homo-interactions are more common than hetero-interactions [30].
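The overlap of 120 follows from inclusion-exclusion over the two sets, which can be checked with the numbers quoted above:

```python
# Counts quoted in the text.
hetero = 454    # mutations mapped to a hetero-interaction structure
homo = 1094     # mutations mapped to a homo-interaction structure
total = 1428    # distinct interaction-related mutations

# |hetero AND homo| = |hetero| + |homo| - |hetero OR homo|
both = hetero + homo - total
print(both)
```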

Properties of mutations in interaction interfaces

Curated set of interaction-related mutations

In addition to the automatically derived data, we collected 119 mutations in 65 distinct diseases from the scientific literature for which there is evidence that they change the interactions of the protein they occur in (see Materials and methods). We call this the 'curated set' of interaction-related mutations (Additional data file 2). To our knowledge, it represents the biggest collection of high confidence interaction-related mutations to date.

Below, we explore differences between interaction-related mutations and non-interaction-related mutations. We focus on the mechanism of the mutation, the mode of inheritance and residue composition. For most of the 1,428 mutations from the automatically generated set, no information about their mode of inheritance or functional mechanism was instantly available. To allow a comparison with the manually curated set, we sampled 100 mutations randomly and conducted a manual search of the literature in order to annotate their properties.

Classification according to function

We suggest a classification that groups mutations according to their effects into loss of function (LOF) and gain of function (GOF). Below this broad distinction, the GOF mutations can be further divided into two groups: pathological aggregation and aberrant recognition. Similarly, LOF mutations can be split into one class that disrupts obligate interactions between protein subunits and another class that interferes with transient interactions.

From the curated set of interaction-related mutations, 95 mutations result in LOF, 17 in GOF, 4 mutations were reported to change the interaction preference of the protein and 3 could not be determined. The class of GOF mutations that result in protein aggregation contains 12 cases, comprising amyloid diseases like Alzheimer's or Creutzfeldt-Jakob, but also, for example, sickle cell anemia [OMIM:+141900.0243]. Five cases result in aberrant recognition; for example, a Gly233Val mutation in glycoprotein Ib leads to von Willebrand disease [OMIM:*606672.0003] by increasing the affinity for von Willebrand factor.

Amongst the LOF mutations, 61 affect transient interactions and 34 affect obligate interactions. The latter usually render proteins dysfunctional, for example, in the case of lipoamide dehydrogenase deficiency caused by impaired dimerization [31]. LOF mutations in transient interactions cause changes in localization or transmission of information, exemplified by a mutation in the BRCA2 gene that predisposes women to early onset breast cancer: a Tyr42Cys mutation in BRCA2 inhibits the interaction of BRCA2 with replication protein A, a protein essential for DNA repair, replication and recombination [32]. Lack of this interaction inhibits the recruitment of double stranded break repair proteins and eventually leads to an accumulation of carcinogenic DNA changes.

Mode of inheritance

We investigated the mode of inheritance for all mutations in the curated set, if information was available in the literature. All GOF mutations showed dominant inheritance (the two hemoglobin mutations exhibit incomplete dominance). Out of 61 LOF mutations for which inheritance information was available, 24 were autosomal dominant and 37 were recessive. Jimenez-Sanchez et al. [33] studied the mode of inheritance of human disease genes. According to them, mutations in enzymes are predominantly recessive, while mutations in receptors, transcription factors and structural proteins are often dominant. Overall, they find a ratio of 188:335 of dominant to recessive diseases. In our data set, the ratio of dominant to recessive mutations is 41:37 (31:29 in terms of diseases). This enrichment for dominant mutations is statistically significant, as determined by a two-sided test for equality of proportions (P-value < 0.014). The increase was seen across Gene Ontology functional categories, in enzymes as well as regulators and signaling proteins (data not shown). In the 100 randomly chosen mutations from the predicted set, we found a ratio of dominant to recessive mutations of 38:41, which is very similar to the ratio observed in the curated set (two-sided test for equality of proportions, P-value > 0.68; the hypothesis of a difference in proportions is rejected).
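A two-sided test for equality of two proportions can be sketched with the standard library alone. The text does not say which implementation or which counts were used, so the pooled z-test with continuity correction below is an assumption, applied here to the mutation counts 41:37, 188:335 and 38:41.

```python
import math


def two_proportion_p_value(x1, n1, x2, n2, continuity=True):
    """Two-sided p-value for H0: both samples share the same proportion.

    Pooled z-test; the optional continuity correction is an assumed choice.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(p1 - p2)
    if continuity:
        diff = max(0.0, diff - 0.5 * (1 / n1 + 1 / n2))
    z = diff / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(z / math.sqrt(2))


# Dominant counts out of dominant + recessive totals.
p_curated_vs_literature = two_proportion_p_value(41, 41 + 37, 188, 188 + 335)
p_curated_vs_predicted = two_proportion_p_value(41, 41 + 37, 38, 38 + 41)
print(p_curated_vs_literature, p_curated_vs_predicted)
```

Under these assumptions the first comparison falls below the quoted bound of 0.014 and the second lies above 0.68, in line with the statements in the text.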

In GOF mutations, dominant inheritance is not surprising, but the high proportion (39%) of dominant LOF mutations is noteworthy. Dominant inheritance in LOF mutations can be explained by either haploinsufficiency or dominant negative effects [34]. In yeast, dosage sensitivity of members of protein complexes has been shown [35]. According to what Papp et al. call the 'balance hypothesis', stoichiometric imbalances have negative effects on the function of protein complexes. Dominance would thus simply be the result of a lack of functional protein subunits.

Dominant negative effects as a result of interallelic complementation could be an alternative explanation for the observed enrichment of dominant mutations. For example, mutations of phenylalanine hydroxylase can lead to phenylketonuria [36] by inhibiting necessary conformational changes between monomers. In such cases where the protein function relies on the dynamic interactions between subunits, a mutation in one of the binding interfaces can actively inhibit the function of the other bound members of the complex. Detailed experimental analysis of dominant LOF mutations could reveal the relative importance of dominant negative effects compared to haploinsufficiency due to stoichiometric imbalances.

Residue frequency

The residue frequency of the predicted interaction-related mutations was compared to the frequencies of residues over all mutations in OMIM and UniProt [37]. We find that the frequency distribution of wild-type residues in interaction-related mutations is mostly similar to the overall mutational spectrum, with the exceptions of a significant enrichment in glycine and, to a lesser extent, a higher frequency of tryptophan and glutamine and a reduced frequency of alanine, serine and valine (figure in Additional data file 3). The enrichment in glycine cannot be readily explained by the composition of residues on the protein surface or in interaction interfaces [38, 39] but might be due to the disruptive nature of the residues glycine is most likely to mutate to, namely arginine, serine and aspartate [37].

Examples of putative interaction-related mutations

In the following section we describe three diseases identified by our method that appear likely to be related to changes in protein interaction.

Griscelli syndrome, type 2 [OMIM:#607624]

Griscelli syndrome is a disease that features abnormal skin and hair pigmentation as well as, in some cases, immunodeficiency due to a lack of gammaglobulin and insufficient lymphocyte stimulation. Without bone marrow transplantation, the disease is usually fatal within the first years of life [40]. The type 2 form of Griscelli syndrome usually maps to the Rab-27A gene [41]. The RAS domain of Rab-27A shares 46.8% sequence identity with the same domain in Ras-related protein Rab-3A from Rattus norvegicus. The crystal structure of Rab-3A interacting with Rabphilin-3A was solved by Ostermeier and Brunger [42] (PDB:1ZBD; Figure 4). We found that a Trp73Gly mutation in Rab-27A affects a residue that is both highly conserved (scores of 5.62 for tryptophan and -1.84 for glycine) and in the center of the interaction interface. There is strong evidence that Rab-27A interacts with Myophilin [43]. For these reasons the Trp73Gly mutation seems likely to affect vesicle transport by reducing the affinity of Rab-27A for Myophilin.

Structure of Rattus norvegicus Ras-related protein Rab-3A [PDB:1ZBD]. The small G protein Rab3A with bound GTP interacting with the effector domain of rabphilin-3A. The residue corresponding to the mutated Trp73 from human RAB27A is highlighted in red, while the two residues in contact with it are coloured green.

Adrenocorticotropin hormone deficiency [OMIM:#201400]

Adrenocorticotropin hormone deficiency is characterized by a marked decrease of the pituitary hormone adrenocorticotropin and other steroids. Its symptoms include, amongst others, weight loss, anorexia and low blood pressure. Lamolet et al. [44] identified a Ser128Phe mutation in the T-box transcription factor TBX19 that leads to a dominant LOF phenotype [UniProt:O60806, VAR_018387]. The crystal structure of the homologous T-Box domain from the Xenopus laevis Brachyury transcription factor [45] (81% sequence identity to the human TBX19 protein [PDB:1XBR]) shows that this particular residue is at the core of the dimerization interface (Figure 5). The mutation substitutes a small polar with a large aromatic side-chain. Accordingly, the residue features strong conservation, while phenylalanine is very rare at this position (scores of 3.31 and -1.78 for serine and phenylalanine, respectively). Pulichino et al. [46] report that the Ser128Phe mutation shows virtually no DNA binding affinity. We predict that this loss of affinity is due to a drop in binding free energy between monomer and DNA, as compared to the dimer.

Structure of X. laevis Brachyury protein [PDB:1XBR]. The crystal structure of a T-domain from X. laevis bound to DNA. The residues highlighted in red are the mutated Ser128, with green residues representing the contact residues in the partner protein. Blue dashed lines show residue contacts.

Baller-Gerold syndrome [OMIM:#218600]

Baller-Gerold syndrome is a rare congenital disease characterized by distinctive malformations of the skull and facial area as well as bones of the forearms and hands. The disease phenotypically overlaps with other disorders like Rothmund-Thomson syndrome or Saethre-Chotzen syndrome. Seto et al. [47] reported a case of Baller-Gerold syndrome that also included features of Saethre-Chotzen syndrome. They identified an isoleucine to valine substitution at position 156 of the H-Twist protein as the causative mutation. Experimental studies using yeast-two-hybrid assays have reported the loss of H-Twist/E12 dimerization ability as a possible cause of Saethre-Chotzen syndrome [48].

The basic helix-loop-helix domain of H-Twist shares 45% sequence identity with the c-Myc transcription factor that was crystallized by Nair et al. [49] (Figure 6). The structure shows a dimer of c-Myc and Max bound to DNA. The c-Myc/Max dimerization is essential for transcriptional regulation. The Ile156Val mutation is located at the core of the interaction interface. Although the Ile156Val mutation constitutes a biochemically similar substitution, reflected by the relatively high frequency of valine at this position in other helix-loop-helix proteins (conservation scores 2.76 for isoleucine and 1.23 for valine), the change in volume could slightly change the interaction propensity. Correspondingly, the Ile156Val mutation causes a mild form of Baller-Gerold syndrome.

Structure of the Myc/Max transcription factor complex binding DNA [PDB:1NKP]. Both c-Myc and Max form a basic helix-loop-helix motif. They dimerize mainly through their extended helix II regions. The residue that corresponds to Ile156 in H-Twist is Ile550, shown in red. The residue sits at a key position of the interface, forming bonds with seven residues in Max, shown in green.


PTMD: A Database of Human Disease-associated Post-translational Modifications

Various posttranslational modifications (PTMs) participate in nearly all aspects of biological processes by regulating protein functions, and aberrant states of PTMs are frequently implicated in human diseases. Therefore, an integral resource of PTM-disease associations (PDAs) would be a great help for both academic research and clinical use. In this work, we reported PTMD, a well-curated database containing PTMs that are associated with human diseases. We manually collected 1950 known PDAs in 749 proteins for 23 types of PTMs and 275 types of diseases from the literature. Database analyses show that phosphorylation has the largest number of disease associations, whereas neurologic diseases have the largest number of PTM associations. We classified all known PDAs into six classes according to the PTM status in diseases and demonstrated that the upregulation and presence of PTM events account for a predominant proportion of disease-associated PTM events. By reconstructing a disease-gene network, we observed that breast cancers have the largest number of associated PTMs and AKT1 has the largest number of PTMs connected to diseases. Finally, the PTMD database was developed with detailed annotations and can be a useful resource for further analyzing the relations between PTMs and human diseases. PTMD is freely accessible at http://ptmd.biocuckoo.org.

Keywords: AKT1; Disease–gene network; PTM–disease association; Phosphorylation; Posttranslational modification.


Download: Ligands

By entering chemical component IDs, SDF files with ligand coordinates can be downloaded.

  • Coordinates of first chemical component instance from each PDB entry
  • Coordinates of all chemical component instances from each PDB entry
  • Ideal coordinates from Chemical Component Dictionary

File Download Services

Searches and reports performed on this RCSB PDB website utilize data from the PDB archive. The PDB archive is maintained by the wwPDB at the main archive, ftp.wwpdb.org (data download details) and the versioned archive, ftp-versioned.wwpdb.org (Versioning details).

  • The directory pub/pdb is the entry directory for the PDB archive.
  • The directory pub/pdb/data/structures/divided contains the current PDB contents including PDB, mmCIF, and PDBML/XML formatted coordinate files, structure factors and NMR restraints

Annual snapshots of PDB Archive are available. 

Web Services

Programmatic access to individual structures and/or specific data items is provided through Web Service Application Program Interfaces (APIs).

Contact RCSB PDB with questions or suggestions for specific services.



HomoKinase: A Curated Database of Human Protein Kinases

The HomoKinase database is a comprehensive collection of curated human protein kinases and their relevant biological information. The entries in the database are curated by three criteria: HGNC approval, gene ontology-based biological process (protein phosphorylation), and molecular function (ATP binding and kinase activity). For a given query protein kinase name, the database provides its official symbol, full name, other known aliases, amino acid sequences, functional domain, gene ontology, pathway assignments, and drug compounds. In addition, as a search tool, it enables the retrieval of similar protein kinases with specific family, subfamily, group, and domain combinations and tabulates the information. The present version contains 498 curated human protein kinases and links to other popular databases.

1. Introduction

In the human genome, protein kinases form one of the largest recognized protein families, which regulates multiple biological processes by posttranslational phosphorylation of serine, threonine, and tyrosine residues [1]. The human genome contains 500 protein kinase genes that constitute about 2% of all genes [2]. Approximately 2000 protein kinases are encoded by the human genome. Protein kinases and phosphatases play an important role in regulating and coordinating aspects of metabolism, cell growth, cell motility, cell differentiation and cell division, and signaling pathways involved in normal development and disease [3]. In the human genome, 30% to 50% of proteins may undergo phosphorylation; therefore, improper functioning of kinases may lead to various human diseases [4]. Turning protein kinases and phosphatases on and off maintains the functions of cellular life in a systematic manner. Further, protein kinases are involved in the regulation of many processes, so they are linked to many diseases and act as targets for drug design. Protein kinases are the group of enzymes that share conserved catalytic domains involved in stimulating catalytic activity of enzymes and act as ATP binding sites. This creates the need for databases specific to protein kinases.

Many protein kinase databases exist, and several include information on human protein kinases as well [2, 5, 6]. For example, KinBase [2] contains manually curated kinomes based on the Hanks and Hunter classification for nine genomes including human. KinG [5] contains protein kinase entries for 40 genomes that have been classified by kinome-based sequence search methods. KinWeb [6] is a specific collection of protein kinases encoded in the human genome, and the classification is based on the same orthologous groups present in human and other similar lineages. However, none of the above databases offers high accuracy in the classification of human protein kinases, due to their underlying classification algorithms. Further, they do not offer options for the retrieval of protein kinases with specific family, subfamily, group, and domain combinations through an easy-to-use interface. In the present work, we developed a curated human protein kinase database called “HomoKinase.” First, each entry in the database was checked against HGNC to confirm whether it is approved or not. Each HGNC approved entry was further confirmed by gene ontology (GO) information based on the presence of three GO terms: (i) ATP binding, (ii) kinase activity, and (iii) protein phosphorylation. The easy-to-use web interface of HomoKinase is shown in Figure 1.


2. Materials and Methods

The creation of the HomoKinase database involves several steps. First, human genes with their known aliases were downloaded from Entrez Gene (http://www.ncbi.nlm.nih.gov/gene) using the query term “(Homo sapiens [Organism]) AND HGNC.” Next, the retrieved gene list was crosschecked against the HUGO Gene Nomenclature Committee (HGNC) database (http://www.genenames.org/) so that only genes with an HGNC approved gene name were used to build the database [7]. The other entries in the list, such as pseudogenes, noncoding RNAs, and phenotypes, which have no HGNC approved name, were eliminated.
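As an illustration of the first step, the Entrez Gene query quoted above can be expressed as a request to the NCBI E-utilities ESearch service. The sketch below only builds the request URL and sends nothing; the function name `build_gene_search_url` and the `retmax` value are assumptions made for this example, and the text does not say how the authors actually performed the download.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(term, retmax=20):
    """Return an ESearch URL for the Entrez Gene database (no request is sent)."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Query term as quoted in the text; retmax is an arbitrary illustrative value.
QUERY = "(Homo sapiens [Organism]) AND HGNC"
url = build_gene_search_url(QUERY)
print(url)
```

Fetching this URL with any HTTP client would return a list of Entrez Gene IDs; paging through the full result set would additionally need the `retstart` parameter.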

Finally, gene ontology based refinement was performed to identify the protein kinase genes in the HGNC approved list of human protein-coding genes. GO is organized into three ontologies: molecular function, cellular component, and biological process. A single gene product may be annotated to multiple GO terms, detailing a range of functional attributes, using both manual and electronic annotation methods [8, 9]. The conserved protein kinase core consists of two lobes: a smaller N-terminal lobe (N-lobe) with the ATP binding site and a larger C-terminal lobe (C-lobe) with the catalytic site responsible for kinase activity [3, 10]. These two features correspond to molecular function terms, and the corresponding biological process is protein phosphorylation. Together, these three GO terms provide precise information about the annotated gene and its products, which in turn gives researchers deeper insight into kinases. We therefore classified as true protein kinases the HGNC approved human genes annotated with all three GO terms: (i) ATP binding, (ii) kinase activity, and (iii) protein phosphorylation. The gene ontology search was performed using two web tools, QuickGO [11] and AmiGO [12], with automated PHP scripts. The HGNC approved human genes that satisfied all three GO criteria were classified as human protein kinases and used to build the database.
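The three-term criterion can be sketched as a simple set test. This is a minimal illustration in Python, not the authors' PHP scripts. The text names the three terms but not their identifiers, so the GO IDs below are an assumed mapping of those names, and the gene names and annotation sets are placeholders rather than real data.

```python
# Assumed GO identifiers for the three criteria named in the text.
REQUIRED_GO_TERMS = {
    "GO:0005524": "ATP binding",
    "GO:0016301": "kinase activity",
    "GO:0006468": "protein phosphorylation",
}


def is_protein_kinase(go_ids):
    """True if the annotation set contains all three required GO terms."""
    return set(REQUIRED_GO_TERMS) <= set(go_ids)


def missing_terms(go_ids):
    """Names of the required terms absent from the annotation set."""
    return [name for go_id, name in REQUIRED_GO_TERMS.items() if go_id not in go_ids]


# Hypothetical annotations for placeholder genes (not real data).
annotations = {
    "GENE_A": {"GO:0005524", "GO:0016301", "GO:0006468"},
    "GENE_B": {"GO:0016301", "GO:0006468"},                # lacks ATP binding
    "GENE_C": {"GO:0005524", "GO:0006468", "GO:0005515"},  # lacks kinase activity
}

kinases = sorted(g for g, ids in annotations.items() if is_protein_kinase(ids))
print(kinases)
```

Reporting the missing term for each rejected gene, as `missing_terms` does, makes it easy to see which of the three criteria caused a gene to be filtered out.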

The predicted list of protein kinases was further divided into groups, families, subfamilies, and domains. The group classifications were taken from the PhosphoSite database [13], whereas the superfamily, family, subfamily, and domain level classifications were retrieved from UniProt [14]. In addition, biological information such as official symbol, full name, biological IDs, other known aliases, amino acid sequences, functional domains, gene ontology, pathway assignments, and drug compounds was extracted from several biological databases: (i) NCBI, (ii) UniProt, (iii) AmiGO, (iv) KEGG, and (v) DrugBank. Figure 2 depicts a schematic summary of the HomoKinase data warehouse creation process.


The curated human protein kinase names and their related information retrieved from other databases were used to develop the HomoKinase database. The HomoKinase database is implemented as a client/server architecture with an easy-to-use web interface. The server uses a MySQL database, and the web client and the programs for human protein kinase retrieval, annotation, and the query interface were written in PHP.

3. Results

Entrez Gene stores information on 193,709 genes specific to Homo sapiens (as of October 2012). We retrieved 33,489 human genes/proteins matching our query term “(Homo sapiens [Organism]) AND HGNC.” On comparison with the HGNC database, only 19,026 genes had an official HGNC gene symbol; the remainder comprised 8399 pseudogenes, 4230 noncoding RNAs, 707 phenotype entries, and 1127 other genes.
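The counts in this paragraph can be checked with a few lines of arithmetic: subtracting the four excluded categories from the 33,489 retrieved entries should leave the number of HGNC approved genes. The variable names below are chosen for this sketch only.

```python
# Figures as stated in the paragraph above.
total_retrieved = 33_489
excluded = {
    "pseudogenes": 8_399,
    "noncoding RNAs": 4_230,
    "phenotype": 707,
    "other": 1_127,
}

# Entries remaining after removing every excluded category.
approved = total_retrieved - sum(excluded.values())
print(approved)  # 19026
```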

The 19,026 HGNC approved human genes were then screened for protein kinases by checking for the presence of three GO annotation terms: (i) ATP binding, (ii) kinase activity, and (iii) protein phosphorylation. The HGNC approved genes fulfilling all three GO criteria (e.g., CDK1, MARK1) were classified as protein kinases and included in the database. Genes missing any one of the above GO annotations were filtered out as nonprotein kinases. Examples of proteins with missing kinase information include (i) absence of ATP binding (e.g., PRKAG2, ADCK4), (ii) absence of kinase activity (e.g., ACTR2, EPHA8), and (iii) absence of protein phosphorylation (e.g., RIOK1, TRIB2). In addition, a few genes with lipid kinase activity (e.g., PIK3C2B) and nonprotein kinases (e.g., CKM) were also filtered out. The GO curation and filtration resulted in 498 human genes marked as validated human protein kinases, which were included in the final HomoKinase database.

The HomoKinase database was compared with KinBase [2] and KinWeb [6], the two currently available databases that include human protein kinases. KinBase consists of 506 entries, whereas KinWeb contains 508 entries. HomoKinase excludes genes that are not approved by HGNC (e.g., NIM1 and MST4 in KinBase; ZAK and SgK223 in KinWeb) and genes without proper GO kinase annotation (e.g., ADCK4 and TRRAP in KinBase; BRDT and SRM in KinWeb). As a result, the number of entries in HomoKinase is reduced to 498. In addition, other common mistakes identified in both databases include (i) a gene ID replaced with another gene ID (e.g., SPEG in KinBase; TAO2 and Trad in KinWeb), (ii) genes without an Entrez Gene ID (e.g., sgk424 in KinBase and KinWeb), and (iii) pseudogenes (e.g., PRKY in KinBase and KinWeb). In total, we identified 31 genes with incomplete information (such as errors in gene ID and gene name) in KinBase and 8 genes in KinWeb. Table 1 shows the overall comparison of the three databases.

4. Discussion

We have developed a curated database of human protein kinases. The salient feature of the HomoKinase database is that it provides individual protein name search as well as group search (e.g., family, subfamily, domain, etc.). An individual search can be carried out by giving the official symbol (provided by HGNC), Entrez Gene ID, HGNC ID, Ensembl ID, UniProt ID, or other aliases/designations. A group search can be carried out using the classification of protein kinases into different kinase groups, families, subfamilies, and domains. The group classification of protein kinases in HomoKinase is discussed below.

The 498 human protein kinase entries in the database were classified into 10 groups, 1 superfamily, 22 families, 66 subfamilies, and 115 domains. All 498 protein kinases fall into one of the 10 groups. However, only 482 proteins were classified into 22 families, and 14 proteins do not belong to any family. Further, 358 proteins belong to 66 subfamilies, whereas for 140 proteins the subfamily information is missing. In addition, each protein has one or more domains, and in total, 115 domains were found among 496 kinases. The database group search can be performed using any one of the above classes. The group search lists all protein kinases that belong to the chosen category in tabular form, from which an individual protein search can be carried out. The HomoKinase database classification and organization is shown in Figure 3.
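The group search described above amounts to filtering kinase records by one classification level. The sketch below illustrates the idea with an in-memory SQLite table in Python; HomoKinase itself uses MySQL and PHP, and its actual schema is not given in the text, so the table layout, column names, and records here are hypothetical placeholders.

```python
import sqlite3

# Hypothetical schema: one row per kinase with its classification levels.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kinase ("
    "symbol TEXT PRIMARY KEY, kinase_group TEXT, family TEXT, subfamily TEXT)"
)
# Placeholder records (not real data); None marks missing classification.
conn.executemany(
    "INSERT INTO kinase VALUES (?, ?, ?, ?)",
    [
        ("KIN_A", "GROUP_1", "FAMILY_X", "SUBFAMILY_P"),
        ("KIN_B", "GROUP_1", "FAMILY_X", None),
        ("KIN_C", "GROUP_2", None, None),
    ],
)

# Column names cannot be bound as SQL parameters, so validate them explicitly.
SEARCH_LEVELS = {"kinase_group", "family", "subfamily"}


def group_search(conn, level, value):
    """Return the symbols of all kinases whose `level` column equals `value`."""
    if level not in SEARCH_LEVELS:
        raise ValueError(f"unknown classification level: {level}")
    query = f"SELECT symbol FROM kinase WHERE {level} = ? ORDER BY symbol"
    return [row[0] for row in conn.execute(query, (value,))]


print(group_search(conn, "kinase_group", "GROUP_1"))
```

Storing a missing family or subfamily as NULL keeps entries without that classification out of the corresponding group search while leaving them reachable by individual search.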


5. Conclusion

In summary, HomoKinase is an easy-to-use interface to a curated database of human protein kinases. We plan to expand the database in the future to include a larger number of eukaryotic species for comparison. In addition, we plan to add protein secondary and tertiary structure and pathway information on kinases. Protein structure information is vital for understanding protein function and evolutionary relationships, and pathway information will help in understanding the various metabolic and signaling pathways in which the kinases are involved.

Availability

The database is hosted and available online at http://www.biomining-bu.in/homokinase/.

Acknowledgments

This work is supported by a grant from the Department of Information Technology (DIT), Government of India (no. DIT/R&D/BIO/15(22)/2008). Suresh Subramani and Raja Kalpana acknowledge the support received from the grant.

References

  1. G. Manning, G. D. Plowman, T. Hunter, and S. Sudarsanam, “Evolution of protein kinase signaling from yeast to man,” Trends in Biochemical Sciences, vol. 27, no. 10, pp. 514–520, 2002.
  2. G. Manning, D. B. Whyte, R. Martinez, T. Hunter, and S. Sudarsanam, “The protein kinase complement of the human genome,” Science, vol. 298, no. 5600, pp. 1912–1934, 2002.
  3. L. N. Johnson, M. E. M. Noble, and D. J. Owen, “Active and inactive protein kinases: structural basis for regulation,” Cell, vol. 85, no. 2, pp. 149–158, 1996.
  4. C. Y. Yang, C. H. Chang, Y. L. Yu et al., “PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database,” Bioinformatics, vol. 24, no. 16, pp. i14–i20, 2008.
  5. A. Krupa, K. R. Abhinandan, and N. Srinivasan, “KinG: a database of protein kinases in genomes,” Nucleic Acids Research, vol. 32, pp. D513–D515, 2004.
  6. L. Milanesi, M. Petrillo, L. Sepe et al., “Systematic analysis of human kinase genes: a large number of genes and alternative splicing events result in functional and structural diversity,” BMC Bioinformatics, vol. 6, no. 4, article S20, 2005.
  7. R. L. Seal, S. M. Gordon, M. J. Lush, M. W. Wright, and E. A. Bruford, “Genenames.org: the HGNC resources in 2011,” Nucleic Acids Research, vol. 39, no. 1, pp. D514–D519, 2011.
  8. D. Binns, E. Dimmer, R. Huntley, D. Barrell, C. O'Donovan, and R. Apweiler, “QuickGO: a web-based tool for Gene Ontology searching,” Bioinformatics, vol. 25, no. 22, pp. 3045–3046, 2009.
  9. M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
  10. S. S. Taylor, E. Radzio-Andzelm, Madhusudan, X. Cheng, L. Ten Eyck, and N. Narayana, “Catalytic subunit of cyclic AMP-dependent protein kinase: structure and dynamics of the active site cleft,” Pharmacology and Therapeutics, vol. 82, no. 2-3, pp. 133–141, 1999.
  11. QuickGO, 2013, http://www.ebi.ac.uk/QuickGO.
  12. AmiGO, 2013, http://amigo.geneontology.org.
  13. PhosphoSitePlus, 2013, http://www.phosphosite.org.
  14. UniProt, 2013, http://www.uniprot.org.

Copyright

Copyright © 2013 Suresh Subramani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract

Epigenetics refers to stable and long-term alterations of cellular traits that are not caused by changes in the DNA sequence per se. Rather, covalent modifications of DNA and histones affect gene expression and genome stability via proteins that recognize and act upon such modifications. Many enzymes that catalyse epigenetic modifications or are critical for enzymatic complexes have been discovered, and this is encouraging investigators to study the role of these proteins in diverse normal and pathological processes. Rapidly growing knowledge in the area has resulted in the need for a resource that compiles, organizes and presents curated information to researchers in an easily accessible and user-friendly form. Here we present EpiFactors, a manually curated database providing information about epigenetic regulators, their complexes, targets and products. EpiFactors contains information on 815 proteins, including 95 histones and protamines. For 789 of these genes, we include expression values across several samples, in particular a collection of 458 human primary cell samples (for approximately 200 cell types, in many cases from three individual donors), covering most mammalian cell steady states, 255 different cancer cell lines (representing approximately 150 cancer subtypes) and 134 human postmortem tissues. Expression values were obtained by the FANTOM5 consortium using the Cap Analysis of Gene Expression technique. EpiFactors also contains information on 69 protein complexes that are involved in epigenetic regulation. The resource is practical for a wide range of users, including biologists, pharmacologists and clinicians.


[For a complete background, please refer to Autophagy].

Autophagy is the process by which the cells in an organism destroy non-functional or unnecessary self-components. [3] Specifically, autophagy is a catabolic process involving the degradation of a cell's own components through the lysosomal machinery. [1] Autophagy is also crucial during starvation and for the removal of potentially dangerous cellular materials, indicating its necessity in maintaining life. [1] As seen in the associated figure Autophagy, cellular products are degraded by destructive cellular components, such as lysosomes, to produce new materials for the cell to use. Research into autophagy and its related processes has exploded over recent years. However, many of these processes are not completely understood, and homologs have not been found in different species for many of these proteins. [1] Its molecular mechanisms have not been fully elucidated, despite dramatic advances in the field as evidenced by the hundreds of autophagy-related genes and proteins reported. [1] As such, there was a demonstrated need for a database to characterize human autophagy proteins and components and/or their homologs, as well as orthologs in other species.

Autophagy database is a product of the National Institute of Genetics (NIG). [4] NIG was founded in June 1949 by the Ministry of Education, Science, Sports, and Culture, with Prof. Kan Oguma being elected the first director. [4] Over time, many departments have been added for various applications such as Genetics, Genomics, DNA Research, and, most notably for our purposes, the DNA Data Bank. [4] NIG is a division of the Japanese Research Organization of Information and Systems, and is currently under the supervision of its ninth director. [4] NIG aims to conduct top-level research in the pursuit of streamlining information, as well as the dissemination of information from research into societal application. [4] A tool created by this organization for this purpose is the Autophagy database.

The Autophagy database is a database of proteins involved in autophagy. The Autophagy database intends to collect all relevant information, organize it, and make it publicly available so that its users can easily get up-to-date knowledge. Specifically, the Autophagy database offers a "free-for-all" tool for those with interests, research and otherwise, in autophagy. [3] To better accomplish this aim, the available Autophagy database from NIG calls for users of the database to disseminate and share information, so that autophagy-related data can be available for free to all who need it. [1] For an interested research community, this model of research dissemination holds promise. As of April 2018, there were 582 reviewed proteins available in this database. [3] Including autophagic proteins available in HomoloGene, NCBI, there are over 52,000 total proteins. [3] Autophagy database offers comparison of homologous proteins between 41 different species to search new and old autophagy-related proteins, so that current autophagy research can be streamlined. [1] The database was made publicly available in March 2010 and currently includes 7,444 genes/proteins in 82 eukaryotes.

Human autophagy database is a product of the Luxembourg Institute of Health (LIH). [5] LIH has several branches throughout Luxembourg available for Biomonitoring, Infection and Immunity, Health administration, Oncology, Sports Medicine, and Biobank. Each of these departments aims to support the LIH mission statement, which is "to generate and translate research knowledge into clinical applications with an impact on the future challenges of health care and personalised medicine." [5] It offered tools. [5] The Laboratory of Experimental Cancer Research of LIH helped to establish one of these tools, that tool being the database known as Human autophagy database.

Human autophagy database (HADb) is another available autophagy resource. [2] Unlike Autophagy database, Human autophagy database only compares those proteins found in humans. HADb is the first human-only autophagy database, where researchers may find an updated listing of directly and indirectly related autophagic proteins, given no consistent database previously available to compensate for a huge expansion in autophagy research. [2] HADb does not only provide information on the gene of interest, but also aims to evolve into a database which can be used to analyze the gene of interest. [2] For this purpose, HADb was made as complete as possible in terms of autophagy-related proteins, though newly discovered proteins and genes may be submitted by different users to the Submission section. The information provided by Human autophagy database can be used further in bioinformatics applications.

Given that these databases are a large store of biological information, these can be used in bioinformatics applications to simplify information collection and analysis. Bioinformatics looks to pair biological discoveries with big data, to aid in improved scientific discoveries. Each database can be utilized to study an autophagic protein or gene of interest, where these databases are maintained by user submissions. Information for each gene can be used to access Entrez, Ensembl, and PubMed. FASTA sequence is also available for sequence analysis using sites such as BLAST. Specific uses available to Autophagy database and Human autophagy database are shown below.

Autophagy database has several available functions to search for autophagy-related proteins in different species.

A user may access Autophagy database at http://www.tanpaku.org/autophagy/index.html. The image given, "Options for ADb", showcases the variety of options available for this database. All unhighlighted tabs offer additional information and contact information unrelated to gene search. A user may refer to:

  • the Protein list, highlighted yellow, where the user may select an organism and search for Synonyms, Gene ID, and Protein accession, among other functions. These functions offer the user multiple ways to search for information on genes of interest. The options available on Autophagy database for Protein list can be seen in the example given to the right, "Protein list". Selecting various options, such as Synonyms, allows the user to search using specific queries.
  • matches of autophagy-related proteins amongst homologs and orthologs under the Homologs tab, highlighted in green. This can be used according to the image to the right, Homologs, where orthologs and homologs can be compared between different organisms and taxa by selecting the required boxes.
  • search for specific genes. This may be accomplished using Keyword search, highlighted in blue, and may also be used to match a known gene of interest to a certain species. This helps to determine potential orthologs.
  • homologs and orthologs to a gene of interest. Homology search, highlighted in orange, can be used to search a FASTA or unformatted text sequence of a known gene. This may aid in finding connected autophagy-related proteins, or in finding homologs or orthologs. Homologs and orthologs can be compared by the user within or amongst species, given different species and taxa options as seen in the associated figure.
  • analyze connections between one gene and autophagy genes available in the database. Original Analyses, highlighted in grey, may be used to find potential autophagy-related gene matches to a known gene. To best utilize these functions, the user should refer to the "Download" tab to download all gene files, so that function of Autophagy database can be fully utilized.
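The Homology search option above accepts either a FASTA record or an unformatted text sequence. A minimal sketch of how such input might be normalised into a header and a plain sequence string before submission follows; the function name `read_sequence` and the example sequences are made up for illustration and are not part of Autophagy database.

```python
def read_sequence(text):
    """Return (header, sequence) from a FASTA record or unformatted text.

    header is None when the input has no '>' line. Only the first record
    is read; whitespace is removed and letters are upper-cased.
    """
    header = None
    parts = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or parts:
                break  # a second record starts here; stop reading
            header = line[1:].strip()
        else:
            parts.append("".join(line.split()).upper())
    return header, "".join(parts)


# Made-up inputs for illustration (not real protein sequences).
fasta_input = ">example placeholder\nMKT AYI\nakq\n>second record\nGGG\n"
raw_input_text = "mktayi akq"

print(read_sequence(fasta_input))
print(read_sequence(raw_input_text))
```

Normalising both input styles to the same representation means the downstream search code only has to handle one case.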

Human autophagy database has available functions for Look for gene and Clustering. [2]

When accessing http://autophagy.lu/index.html, these options can be accessed. Interested parties may also Submit new human autophagy proteins to the database. A user may utilize the database according to the following options:

  • A user may search for a gene by name in the Look for gene category. A gene may be searched for using its gene symbol, Ensembl accession number, chromosome location, or a relevant keyword. Simple instructions for how to access a gene of interest using this method are given in the figure for "Look for gene: HADb". Briefly, the user would access the website, select the highlighted tab, and search for their gene of interest using the available tabs given in the associated image. Specifically, the user would select which option they would like to use in the associated table, and then fill in the information for the desired tab, whether it be Symbol or Synonym, Chromosome, Accession number, or Keyword. Once the tab is selected, information can be entered and searched to determine any linked autophagy-related proteins.
  • The user may also refer to Clustering where genes may be viewed in alphabetical order. A simplified map of how to conduct this search is shown in image "Clustering: HADb". The user would first refer to http://autophagy.lu/index.html, after which they would select the highlighted tab in the "Clustering" image to access their gene of interest. The user can then select their gene of interest alphabetically to gather further information. Though this database contains only human autophagy genes, the user need not download a database for use, and can find genes and proteins involved in the complex process of autophagy.

Each database offers its own strengths and weaknesses.

  • Autophagy database: Conceptually, Autophagy database offers the opportunity to easily access information on autophagy-related proteins in a variety of species. [3] However, there are some issues in using this database. The user may try the assortment of tab options shown in "Options for ADb", though these options cannot be utilized without downloading content from Autophagy database. Though the user can access the Download tab (seen in "Options for ADb" in white), this offers only text output when using U.S.-based wifi service. As such, the user cannot access the variety of options mentioned above in a GUI format, but rather must search text output. The user may download the Autophagy database, but these files may be difficult to access using Apple OS. This reduces Autophagy database's utility for the U.S.-based user and is a potential complication of an internationally available database.
  • Human autophagy database: Though also an internationally available database, ease of use for Human autophagy database is considerably improved. All available options, though limited, can be accessed using a U.S.-based wifi service. Human autophagy database is limited in the array of options available for data collection and analysis, as there are fewer options available than those offered by Autophagy database. The database also stores only human autophagy-related genes and proteins, [2] whereas Autophagy database has information on autophagy-related genes and proteins available for a variety of different species.

Though each database has its own strengths and weaknesses, they each help to fill a gap. [3] Further additions may help to improve these databases in the future. Though there may be databases available that appear more complete for general gene or protein searches, such as NCBI, HADb and Autophagy database offer the most complete information on autophagy-related genes and proteins. The GUI is not fully refined for each, and may be harder to access, but each of these databases maintains focus on autophagy, whereas NCBI does not use the same focused approach on autophagy. As such, HADb and Autophagy database may offer an interesting route for exploration of autophagy-related genes and proteins.


Below is a list of articles on human chromosomes, each of which contains an incomplete list of genes located on that chromosome.

The lists below constitute a complete list of all known human protein-coding genes.

Human protein-coding gene pages:
•Python code for maintaining the list
•List of human protein-coding genes page 1 covers genes A1BG–ENTPD6
•List of human protein-coding genes page 2 covers genes ENTPD7–MTIF2
•List of human protein-coding genes page 3 covers genes MTIF3–SLC22A5
•List of human protein-coding genes page 4 covers genes SLC22A6–ZZZ3
NB: Each list page contains 5000 human protein-coding genes, sorted alphanumerically by the HGNC-approved gene symbol.
Follow the Python code link for information about updates to the list of genes on these pages.
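The page layout described above (gene symbols sorted, then split into pages of 5000, each labelled by its first and last symbol) can be sketched as follows. This is an illustrative sketch, not the maintenance code the page links to; the function names are hypothetical, and Python's default string ordering is assumed to stand in for the alphanumeric sort, which may differ in detail.

```python
def paginate(symbols, page_size=5000):
    """Sort gene symbols and split them into consecutive pages."""
    ordered = sorted(symbols)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def page_label(page):
    """Label a page by its first and last symbol, e.g. 'A1BG–ENTPD6'."""
    return f"{page[0]}–{page[-1]}"


# Toy example with a page size of 2 instead of 5000.
symbols = ["ZZZ3", "A1BG", "MTIF2", "ENTPD6", "SLC22A5"]
pages = paginate(symbols, page_size=2)
labels = [page_label(p) for p in pages]
print(labels)
```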

This is a list of 1639 genes which encode proteins that are known or expected to function as human transcription factors.

