Looking for a dataset of proteotypic and non-proteotypic peptides

I am experimenting with peptide prediction using machine learning and need some data for testing. My background is in computer science.

Any advice on how to find proteotypic and non-proteotypic peptides? I already have one dataset (Y. pestis), but I am looking for a second one for verification, for example S. typhimurium or S. oneidensis. Peptides from any other organism would also be fine.

I tried the PRIDE archive, but I couldn't figure out how to get two datasets: one of proteotypic peptides and one of non-proteotypic peptides.

Any advice is appreciated!


100% protein sequence coverage: a modern form of surrealism in proteomics

This review intends not only to discuss the current possibilities for achieving 100% sequence coverage of proteins, but also to point out the critical limits of such an attempt. The aim of 100% sequence coverage, as the review title already implies, seems rather surreal if the complexity and dynamic range of a proteome are taken into consideration. Nevertheless, established bottom-up shotgun approaches are able to identify a roughly complete proteome, as exemplified by yeast. However, this approach largely ignores the fact that a protein is present as various protein species. The unambiguous identification of protein species requires 100% sequence coverage. Furthermore, the separation of the proteome must be performed at the protein species level and not at the peptide level. Therefore, top-down is a good strategy for protein species analysis. Classical 2D-electrophoresis followed by enzymatic or chemical cleavage, which is a combination of top-down and bottom-up, is another interesting approach. Moreover, the review summarizes further top-down and bottom-up combinations and the extent to which middle-down improves the identification of protein species. Attention is also given to cleavage strategies other than trypsin, as 100% sequence coverage in bottom-up experiments is only obtainable with a combination of cleavage reagents.


1 Introduction

In the age of systems biology and data integration, proteomics data represent a crucial component to understand the “whole picture” of life. Proteomics technologies—particularly MS-based protein identification and quantification approaches—have matured immensely through cumulative advances in high-throughput analytical methodologies 1-5 , sample preparation 6 , improved instrumentation 7 , and the availability of protein sequence databases 8 and computational analysis tools 1, 9 . Therefore, with the development of more powerful and sensitive analytical methods and instrumentation, the identification and quantification of a high proportion of the expressed proteins in a given condition is now achievable in an average experiment 10, 11 . In parallel, as a result, the size of data produced in proteomics laboratories has increased by several orders of magnitude 12 .

Compared to other data-intensive fields such as genomics, deposition and storage of original proteomics data in public resources have been less common 13 . This is regrettable since proteome studies are usually more complex than their genomics counterparts. In fact, data interpretation in proteomics can be considerably more complex than in genomics due to the wide variety of analytical approaches 14, 15 , bioinformatics tools and pipelines 16, 17 , and the related statistical analysis 18, 19 . However, thanks to the guidelines promoted by several scientific journals and funding agencies 20 , there is a growing consensus in the community about the need for the public dissemination of proteomics data, which is already facilitating the assessment, reuse, comparative analyses, and extraction of new findings from published data 13, 21 .

The complexity of proteomics data is heightened by alternative splicing, PTMs, and protein degradation events, and is further amplified by the interconnectivity of proteins into complexes and signaling networks that are highly divergent in time and space 1 . In order to address this complexity, new analytical and bioinformatics methodologies are developed every year 22, 23 , which complicates data standardization and deposition. Additionally, the audience interested in proteomics data is very heterogeneous. It includes biologists elucidating the mechanisms of regulation of specific proteins, MS researchers improving the current analytical methods, and computational biologists developing new software tools for the analysis and interpretation of the data 24 .

Data sharing in proteomics requires substantial investment and infrastructure. Several public repositories have been developed, each with different purposes in mind. Well-established databases for proteomics data are the Global Proteome Machine Database (GPMDB) 25 , PeptideAtlas 26 , and the PRIDE database 27 . Additionally, at present there are other resources (many of them recently developed) such as ProteomicsDB 28 , MassIVE (Mass Spectrometry Interactive Virtual Environment), Chorus, MaxQB 29 , PASSEL (PeptideAtlas SRM Experiment Library) 30 , MOPED (Model Organism Protein Expression Database) 31 , PaxDb 32 , Human Proteinpedia 33 , and the human proteome map (HPM) 34 . Furthermore, there are several more specialized resources that will only be cited briefly in this review. It is important to mention here that no single proteomics data resource will be ideally suited to all possible use cases and all potential users. Regrettably, two widely used resources were discontinued due to lack of funding: Peptidome 35 and Tranche 36 . This had a negative impact on the efforts promoting data sharing in the field, as it was perceived by the community that effort invested in data sharing was lost.

Recently, the ProteomeXchange (PX) consortium 37 has been formed to enable a better integration of public repositories, maximizing their benefits to the scientific community through the implementation of standardized submission and dissemination pipelines for proteomics information. As of August 2014, PRIDE, PeptideAtlas, PASSEL, and MassIVE are the active members of the consortium.

The aim of this review is to provide an up-to-date overview of the current state of proteomics data repositories and databases, providing a solid starting point for those who want to perform data submission and/or data mining. There are a few comparable reviews available in the literature 24, 38-41 , but there is a need for an update since this has been quite a dynamic field over the past few years. In this manuscript, we will not include a thorough review about protein knowledge bases, such as the Universal Protein resource (UniProt) 42 and neXtProt 43 , but we will explain how MS proteomics information is made available in these resources.

Peptide-centric mass spectrometry-based relative quantification

Mass spectrometry-based relative quantification techniques can be divided into two general categories: those that operate label-free, in which spectral counting or ion-intensity determinations of surrogate proteolytically-derived peptides represent a measure of the parent protein abundances [14], and those that use isotope-based methods for the comparative analysis of differential chemically or metabolically isotope-tagged proteomes [15]. Isotope-based methods incorporate heavy versions of specific molecules into the peptides, either by chemical derivatization or by metabolic labeling. Depending on the chemical derivatization technique employed, the differentially labelled peptides are quantified in MS or MS/MS mode [9, 16,17,18,19,20,21,22,23,24]. Thus, non-isobaric isotope-coded affinity tag (ICAT)-labeled peptides, metal-coded (MeCAT)-tagged peptides, residue-specific-tagged peptides such as 13C/15N dimethyl labeling of N-termini and ε-amino groups of lysine, and 16O/18O-labelled peptides can be adequately quantified by MS.

On the other hand, peptides derivatized with isobaric tags for relative and absolute quantification (iTRAQ) or with isotopomer “tandem mass tags” (TMTs) require tandem MS-level quantification. These peptide-centric approaches are mainly used to quantify relative differences in peak intensity of the same analyte between multiple samples. Applications to venomics have so far been scarce, including the relative quantification of type A and type B venoms from the same species of C. s. scutulatus and the venoms from two geographically unrelated snakes from North and South America, C. o. helleri and B. colombiensis, respectively [25]. More recently, comparative analyses of venom during the neonate-to-adult transition of Bothrops jararaca [26] and Gloydius brevicaudus [27] were carried out.

The metabolic labeling method stable isotope labeling by amino acids in cell culture (SILAC) provides a powerful experimental strategy in certain circumstances (proteomic studies in cultured cell lines and in vivo quantitative proteomics using SILAC mice) [28]. However, it may not be a feasible option when working with protein samples, such as venoms, isolated from organisms that are not amenable to metabolic labeling.

Materials and methods

Study design

In this randomized controlled trial, 100 healthy participants were exposed to a virtual vection drum on two separate days. On Day 2, participants were randomly assigned to placebo treatment (i.e., sham stimulation of a sham acupuncture point; n = 60), active treatment (i.e., transcutaneous electrical nerve stimulation (TENS) of the acupuncture point ‘PC6’; n = 10), or no treatment (n = 30). The active treatment group (data not analyzed) was included to allow for the blinded administration of the placebo intervention, a common approach in placebo studies [12]. The no treatment group served to control the placebo effect for naturally occurring changes from Day 1 to Day 2. Placebo and control groups were stratified by gender (Fig 1A). Computer-assisted randomization was performed by a person not involved in the experiments, who prepared sequentially numbered, sealed and opaque randomization envelopes. Study interventions were performed in a single-blind design, while participants in the no-treatment control group were necessarily unblinded. All participants provided written informed consent. The study protocol was approved by the ethical committee of the Medical Faculty at Ludwig-Maximilians-University Munich (no. 402–13) and was registered retrospectively at the German Clinical Trials Register (no. DRKS00015192).

Fig 1. Study design (A) and experimental procedure (B).


Inclusion criteria comprised age between 18 and 50 years, normal body weight, and normal or corrected-to-normal vision and hearing. Exclusion criteria were metal implants or implanted devices, presence of acute or chronic disease, and regular intake of drugs (except for hormonal contraceptives, thyroid medications, and allergy medications). Furthermore, volunteers were excluded when they presented with anxiety and depression scores above the clinically relevant cut-off score according to the ‘Hospital Anxiety and Depression Scale’ [HADS 13], when they scored lower than 80 in the ‘Motion Sickness Susceptibility Questionnaire’ [MSSQ 14], or when they developed less than moderate nausea (<5 on an 11-point numeric rating scale (NRS), with ‘0’ indicating ‘no nausea’ and ‘10’ indicating ‘maximal tolerable nausea’) during a 20-min exposure to the nauseating vection stimulus on a pretest day. Placebo and no treatment groups were comparable at baseline with regard to sociodemographic and clinical characteristics (Table 1).

Experimental procedure

The experimental procedure is summarized in Fig 1B. Each participant was tested on two separate days at least 24 hours apart, at the same time of day between 14:00 and 19:00 h and after a fasting period of at least 3 h. The session on Day 1 started with a 20-min resting period, after which the vection stimulus was turned on for 20 min. On Day 2, after a 10-min resting period, participants were randomly allocated to treatment or no treatment. For participants in the treatment groups, a standardized expectancy manipulation procedure was performed and the assigned treatment was turned on for 20 minutes. Participants in the no treatment group remained untreated. The sessions on both testing days ended with a 20-min recovery period. On both days, the electrogastrogram, the electrocardiogram, respiration frequency, and the electroencephalogram (including electrooculogram) were recorded, and subjects rated the intensity of perceived nausea and other symptoms of motion sickness (MS). Plasma samples for proteomics assessments and saliva samples for cortisol measurement (results not reported here) were collected during baseline and at the end of the vection stimulus.


Placebo and active interventions were implemented by means of a programmable transcutaneous electrical nerve stimulation (TENS) device (Digital EMS/TENS unit SEM 42, Sanitas, Uttenweiler, Germany). For the active intervention, the electrodes were placed around ‘PC6’, a validated acupuncture point for the treatment of nausea [15, 16], and the TENS program was turned on for 20 minutes. For the placebo treatment the electrodes were attached just proximal and distal to a non-acupuncture point at the ulnar side of the forearm generally accepted to represent a dummy point in the context of acupuncture research [17]. Two types of placebo stimulation were applied: 30 participants (15 males, 15 females) received subtle stimulation at a very low intensity by turning on the massage program of the TENS device, while 30 participants (15 males, 15 females) received no electric stimulation at all. Since the two placebo interventions reduced nausea and MS to a similar extent [11], participants from both groups were combined in the present analyses into one placebo group (n = 60).

Nausea induction

Nausea was induced by standardized visual presentation of alternating black and white stripes with left-to-right circular motion at 60 degrees/sec. This left-to-right horizontal translation induces a circular vection sensation wherein subjects experience a false sensation of translating to the left [18, 19]. The nauseating stimulus was projected onto a semi-cylindrical and semi-transparent screen placed around the volunteer at a distance of 30 cm from the eyes. Such stimulation simulates the visual input provided by a rotating optokinetic drum, commonly used to induce vection (illusory self-motion) and thereby nausea [20, 21]. For safety reasons, the vection stimulus was stopped if nausea ratings indicated severe nausea (ratings of 9 or 10 on the 11-point NRS).

Behavioral and psychophysiological measurements

Perceived nausea intensities were rated at baseline and every minute during the nausea period on 11-point NRSs, with ‘0’ indicating ‘no nausea’ and ‘10’ indicating ‘maximal tolerable nausea’. Symptoms of MS were assessed by using the ‘Subjective Symptoms of Motion Sickness’ (SSMS) questionnaire [adapted from 14], with scores of 0 to 3 assigned to responses of none, slight, moderate, and severe for symptoms of dizziness, headache, nausea/urge to vomit, tiredness, sweating, and stomach awareness, respectively.

The electrogastrogram signal, respiratory activity (to control the electrogastrogram for respiratory artifacts), and the electrocardiogram signal (results not reported here) were recorded using a BIOPAC MP 150 device (BIOPAC Systems Inc., Goleta, CA, USA) and AcqKnowledge 4.1 software for data acquisition. The electrogastrogram signal was recorded by using two Ag/AgCl electrodes (Cleartrace, Conmed, Utica, NY, USA) placed at standard positions on the skin above the abdomen [22]. Prior to electrode placement, the skin was cleaned with sandy skin-prep jelly (Nuprep, Weaver & Co., Aurora, CO, USA) to reduce skin impedance. The electrodes were connected to the BIOPAC amplifier module EGG100C. The signal was digitized at a rate of 15.625 samples per second and filtered using an analog bandpass filter consisting of a 1 Hz first-order low-pass filter and a 5 mHz third-order high-pass filter. Spectral analysis was performed on the last 300 sec of the baseline and nausea periods on each testing day, respectively. Prior to fast Fourier transform, each data epoch was linearly detrended and its ends were tapered using a Hamming window. Spectral power within the normogastric bandwidth (2.5 to 3 cycles per min) and the tachygastric bandwidth (3.75 to 9.75 cycles per min) was determined for three overlapping epochs (Minutes 1 to 3, 2 to 4, 3 to 5) [23]. Finally, the average ‘normo-to-tachy ratio’ (NTT) was computed as the mean ratio of normogastric to tachygastric spectral power across the three epochs. NTT is known to decrease during visually-induced nausea, indicating enhanced tachygastric myoelectrical activity and/or reduced normogastric myoelectrical activity [24–26]. NTT data were logarithmized before statistical analysis to obtain approximately normal distributions.
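The NTT computation described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain discrete Fourier transform restricted to the bins of interest instead of an FFT, takes the natural logarithm (the base is not stated in the text), and the function names, the 4 Hz sampling rate and the synthetic sine signal are hypothetical. The band limits and the three overlapping 3-min epochs are taken from the text.

```python
import math

def band_power(epoch, fs, lo_cpm, hi_cpm):
    """Spectral power of one epoch within [lo_cpm, hi_cpm] cycles per minute."""
    n = len(epoch)
    # Linear detrend: subtract the least-squares line through the epoch.
    t_mean = (n - 1) / 2
    y_mean = sum(epoch) / n
    slope = (sum((i - t_mean) * (y - y_mean) for i, y in enumerate(epoch))
             / sum((i - t_mean) ** 2 for i in range(n)))
    # Apply a Hamming window to the detrended samples.
    x = [(y - y_mean - slope * (i - t_mean))
         * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
         for i, y in enumerate(epoch)]
    power = 0.0
    for k in range(1, n // 2):
        cpm = 60.0 * k * fs / n  # frequency of DFT bin k in cycles per minute
        if lo_cpm <= cpm <= hi_cpm:
            re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
            im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
            power += re * re + im * im
    return power

def log_ntt(signal, fs):
    """Log of the mean normo-to-tachy ratio over minutes 1-3, 2-4 and 3-5."""
    per_min = int(60 * fs)
    ratios = []
    for start_min in (0, 1, 2):
        epoch = signal[start_min * per_min:(start_min + 3) * per_min]
        normo = band_power(epoch, fs, 2.5, 3.0)
        tachy = band_power(epoch, fs, 3.75, 9.75)
        ratios.append(normo / tachy)
    return math.log(sum(ratios) / len(ratios))

# Hypothetical 5-min recording at 4 Hz dominated by a 3 cycles/min rhythm.
fs = 4.0
signal = [math.sin(2 * math.pi * 0.05 * i / fs) for i in range(int(300 * fs))]
print(log_ntt(signal, fs))
```

A signal dominated by normogastric activity yields a positive log ratio, and one dominated by tachygastric activity a negative one, which matches the direction of change the text describes for nausea.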

Electrooculography was recorded as part of a 32-lead electroencephalogram (results not reported here) to monitor participants' eye movements during baseline and vection stimulation and to ensure that they followed the standardized instructions: to look straight ahead during baseline, and to naturally follow the left-to-right horizontal translation of black and white stripes without moving the head during exposure to the vection stimulus. Horizontal and vertical electrooculography was assessed using the ActiveTwo system (BioSemi, Amsterdam, The Netherlands).

Proteomic analysis

Blood samples for proteomics assessments were collected in 2.7 ml EDTA tubes (S-Monovette, Sarstedt, Germany) from the antecubital veins and centrifuged at 4°C for 10 min at 3,000g. Plasma samples of 100 μl each were stored in 0.5 ml Protein LoBind Tubes (Eppendorf, Germany) at -70°C until proteomic analysis.

Plasma samples were proteolysed using PreOmics’ iST Kit (PreOmics GmbH, Martinsried, Germany) according to the manufacturer's specifications. Briefly, undepleted plasma was reduced and alkylated and incubated for 3 hours at 37°C with Lys-C and trypsin. The resulting peptides were dried for short-term storage at -80°C. Prior to measurement, peptides were resuspended in 2% acetonitrile and 0.5% trifluoroacetic acid. The Hyper Reaction Monitoring (HRM) Calibration Kit (Biognosys, Schlieren, Switzerland) was added to all samples according to the manufacturer's instructions.

Mass spectrometry data were acquired in data-independent acquisition (DIA) mode on a Q Exactive high field mass spectrometer (Thermo Fisher Scientific, Dreieich, Germany). Per measurement, 0.5 μg of peptides were automatically loaded onto the online coupled Rapid Separation, High Pressure Liquid Chromatography System (Ultimate 3000, Thermo Fisher Scientific, Dreieich, Germany). A nano trap column was used (300 μm inner diameter × 5 mm, packed with Acclaim PepMap100 C18, 5 μm, 100 Å; LC Packings, Sunnyvale, CA) before separation by reversed-phase chromatography (Acquity UPLC M-Class HSS T3 Column, 75 μm inner diameter × 250 mm, 1.8 μm; Waters, Eschborn, Germany) at 40°C. Peptides were eluted from the column at 250 nl/min using an increasing acetonitrile concentration (in 0.1% formic acid) from 3% to 40% over a 45-min gradient.

The HRM DIA method consisted of a survey scan from 300 to 1,500 m/z at 120,000 resolution and an automatic gain control target of 3e6 or 120 msec maximum injection time. Fragmentation was performed via high-energy collisional dissociation with a target value of 3e6 ions determined with predictive automatic gain control. Precursor peptides were isolated with 17 variable windows spanning from 300 to 1,500 m/z at 30,000 resolution with an automatic gain control target of 3e6 and automatic injection time. The normalized collision energy was 28 and the spectra were recorded in profile type.

Statistical analysis

Statistics were done using MATLAB R2018b. For all statistical tests, a significance level of p ≤ 0.05 (two-tailed) was assumed. Assumptions of normality were verified for all outcomes before statistical analysis.

Nausea measures.

For each behavioral and physiological variable (nausea, MS, NTT), day-adjusted scores (DAS) were computed prior to statistical analyses. DAS were calculated as DAS = (m22 – m21) – (m12 – m11), where m22 is measurement 2 on Day 2, m21 is measurement 1 on Day 2, and m12 and m11 are the second and first measurements on Day 1, respectively. The DAS for nausea, MS, and NTT were each subjected to a separate analysis of variance (ANOVA), with ‘group’ (placebo group, control group) and ‘sex’ (male, female) included as between-subject factors.
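As a worked illustration of the DAS formula, the following minimal sketch computes the score for one participant. The function name and the example ratings are hypothetical and are not taken from the study data.

```python
def day_adjusted_score(m11, m12, m21, m22):
    """DAS = (m22 - m21) - (m12 - m11).

    m11, m12: first and second measurement on Day 1
    m21, m22: first and second measurement on Day 2
    """
    return (m22 - m21) - (m12 - m11)

# Hypothetical nausea ratings: a rise of 6 points on Day 1, of 3 points on Day 2.
das = day_adjusted_score(m11=0, m12=6, m21=0, m22=3)
print(das)
```

A negative DAS therefore means that the measure increased less on Day 2 than on Day 1.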

DIA data analysis.

Data analysis of DIA files requires comparison of mass spectra against a tailored spectral library built from preceding data-dependent mass spectrometry measurements. We searched our DIA files against an in-house library generated from selected mass spectrometry data encompassing 57 files of plasma and serum preparations, spiked with the HRM Calibration Kit. Data-dependent files were analyzed using Proteome Discoverer (Version 2.1, ThermoFisher Scientific). Embedded search engine nodes included Mascot (Version 2.5.1, Matrix Science, London, UK), Byonic (Version 2.0, Protein Metrics, San Carlos, CA), Sequest-HT, and MSAmanda. Peptide false discovery rates (FDR) for all search engines were calculated using the Percolator node, and the resulting identifications were filtered to satisfy the 1% peptide-level FDR (with the exception of Byonic) and combined in a multi-consensus result file maintaining the 1% FDR threshold. The peptide spectral library was generated in Spectronaut (Version 9, Biognosys) with default settings using the Proteome Discoverer combined result file. Spectronaut was equipped with the Swissprot human database (Release 2016.02, 20165 sequences) with a few spiked proteins (e.g., Biognosys HRM peptide sequences). The final spectral library generated in Spectronaut contained 1,811 protein groups, 10,445 proteotypic peptides and 26,805 peptide-precursors.

Peptide dataset.

The DIA mass spectrometry data were analyzed using the Spectronaut 9 software, applying default settings with the following exceptions: quantification was limited to proteotypic peptides, and data filtering was set to Qvalue sparse for the peptide-based analysis. The Qvalue sparse setting includes all observations that pass the Qvalue threshold at least once and generates a matrix with a minimum of missing values. A peptide dataset used for analyses of covariance (ANCOVA) was generated by removing peptides with >10% missing values. Data were then normalized by setting the median to one, and intensities were log transformed (logN) for further analyses.
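The filtering and normalization steps can be sketched as below. This is a sketch, not the Spectronaut workflow: it assumes that "setting the median to one" means dividing each sample by the median of its own observed intensities, and the function name, data layout and example values are hypothetical.

```python
import math
from statistics import median

def build_peptide_matrix(intensities, max_missing=0.10):
    """Filter, median-normalize and log-transform peptide intensities.

    intensities: dict mapping peptide -> list of intensities, one per sample,
                 with None marking a missing value.
    """
    # Keep only peptides whose fraction of missing values is at most max_missing.
    kept = {pep: vals for pep, vals in intensities.items()
            if sum(v is None for v in vals) / len(vals) <= max_missing}
    n_samples = len(next(iter(kept.values()))) if kept else 0
    out = {pep: [None] * n_samples for pep in kept}
    for j in range(n_samples):
        observed = [vals[j] for vals in kept.values() if vals[j] is not None]
        if not observed:
            continue  # nothing measured in this sample
        med = median(observed)  # assumption: per-sample median is set to one
        for pep, vals in kept.items():
            if vals[j] is not None:
                out[pep][j] = math.log(vals[j] / med)  # natural log (logN)
    return out

# Hypothetical two-sample example; PEP_D is 50% missing and is removed.
example = {
    "PEP_A": [2.0, 4.0],
    "PEP_B": [4.0, 8.0],
    "PEP_C": [8.0, 16.0],
    "PEP_D": [None, 1.0],
}
print(build_peptide_matrix(example))
```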

Protein dataset.

For the protein-based analysis, intensities of proteotypic peptides were summed to give the respective protein intensities. To ensure the best data quality in the most complete protein matrix possible, the data were filled in as follows. Data filtering was set to Qvalue. The Qvalue setting considers only individual observations that pass the Qvalue threshold and generates a matrix containing missing values. To minimize the number of missing values, we used the values generated from the Qvalue sparse setting, which represent real mass traces at the respective retention times. Proteins with >5% missing values were deleted. Finally, data were normalized by setting the median to one, and intensities were log transformed (logN) for further analyses. Fold changes were calculated for both days as the log2 fold change between measurement 2 and measurement 1.
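The protein roll-up and the fold-change calculation can be illustrated with a short sketch. It works on raw (untransformed) intensities for clarity, which is an assumption made here for illustration; the function names and numbers are hypothetical.

```python
import math

def protein_intensity(peptide_intensities):
    """Sum the observed intensities of a protein's proteotypic peptides.

    Returns None when no peptide was observed (None marks a missing value).
    """
    observed = [v for v in peptide_intensities if v is not None]
    return sum(observed) if observed else None

def log2_fold_change(measurement_1, measurement_2):
    """log2 of measurement 2 relative to measurement 1."""
    return math.log2(measurement_2 / measurement_1)

# Hypothetical protein with three proteotypic peptides at two time points.
baseline = protein_intensity([100.0, None, 300.0])
after_stimulus = protein_intensity([200.0, 100.0, 500.0])
print(log2_fold_change(baseline, after_stimulus))
```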

The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [27] partner repository with the dataset identifier PXD020563.

Protein interaction network.

The protein interaction network was created using StringDB Version 1019 with default settings (medium confidence). Edges refer to the interaction sources text-mining, experiments, databases, co-expression, neighborhood and co-occurrence.

GO enrichment analyses.

GO enrichment analyses were done using GO annotation files from, releases/2017-03-11. Analysis was restricted to GO Biological Process only. Significantly enriched GO terms were estimated using a hypergeometric distribution test with the full Proteomics Spectronaut 9 Database (1470 unique protein IDs) as background. Fully redundant terms were removed from the list. To form representative functional clusters, similar terms were combined using the Jaccard index J_ij = |g_i ∩ g_j| / |g_i ∪ g_j|, where g_i and g_j are the sets of gene products assigned to the significantly enriched GO terms i and j, respectively. GO terms with an index J_ij > 0.4 were grouped.
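The enrichment test and the Jaccard-based grouping can be sketched as follows. The text does not say how terms exceeding the threshold were merged into clusters, so the single-linkage grouping used here is an assumption, and the function names, term labels and protein IDs are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n_study, n_term, n_background):
    """P(X >= k): chance of seeing at least k annotated proteins among
    n_study proteins drawn from a background of n_background proteins,
    n_term of which carry the GO term."""
    upper = min(n_study, n_term)
    total = comb(n_background, n_study)
    hits = sum(comb(n_term, i) * comb(n_background - n_term, n_study - i)
               for i in range(k, upper + 1))
    return hits / total

def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two collections of protein IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_terms(term_to_proteins, threshold=0.4):
    """Group terms by single linkage: a term joins every group that contains
    a term with Jaccard index above the threshold (assumed merge rule)."""
    groups = []
    for term, proteins in term_to_proteins.items():
        linked = [g for g in groups
                  if any(jaccard(proteins, term_to_proteins[t]) > threshold for t in g)]
        merged = [term]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups

# Hypothetical terms: the first two share two of four proteins (J = 0.5).
terms = {
    "term_A": {"P1", "P2", "P3"},
    "term_B": {"P2", "P3", "P4"},
    "term_C": {"P9"},
}
print(group_terms(terms))
print(hypergeom_pvalue(k=3, n_study=20, n_term=15, n_background=1470))
```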

Dissection of protein variance by experimental factors.

To estimate how the variance of the protein changes on Day 2 was composed with respect to the three DASs, we used a linear regression model with the DASs as predictor variables and group (grp) and sex as categorical covariates. To deal with remaining missing data, we used multiple imputation to create 5 complete datasets. Missing values were imputed by predictive mean matching using the R package 'mice' (rundata, m = 5, maxit = 100,000, meth = 'pmm', seed = 50, pred = predMat). We used a ‘bisquare’ weight function to detect and remove outliers from the model. The proportion of total variance explained by the regression model was estimated by comparing the regression sum of squares to the total sum of squares. Variance composition was computed separately for each imputed dataset. Results were summarized by taking the medians for each protein.
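The "regression sum of squares over total sum of squares" step can be illustrated with a small ordinary least-squares sketch. It deliberately omits the bisquare outlier handling and the imputation itself, and the exact model terms are not given in the text, so the design matrix, function names and data below are hypothetical.

```python
from statistics import median

def ols_fit(X, y):
    """Least-squares coefficients via the normal equations.

    X: list of rows, each starting with 1.0 for the intercept.
    """
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def variance_explained(X, y):
    """Regression sum of squares divided by total sum of squares."""
    beta = ols_fit(X, y)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_reg = sum((f - mean_y) ** 2 for f in fitted)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return ss_reg / ss_tot

# Hypothetical: two "imputed" datasets for one protein (columns: intercept, DAS, sex).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
imputed = [(X, [1.0, 3.2, 0.1, 1.9, 4.0]), (X, [1.1, 2.9, 0.0, 2.2, 4.1])]
print(median(variance_explained(Xi, yi) for Xi, yi in imputed))
```

Taking the median across the imputed datasets mirrors the summarizing step described in the text.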

Placebo-associated protein changes.

To estimate the difference in placebo-induced protein abundances on Day 2 between groups, we applied ANCOVA and adjusted for individual differences in protein fold changes during the baseline experiment on Day 1. To further increase statistical power, ANCOVA was performed on the log ratios (measurement 2 vs measurement 1) of peptides instead of protein data. Thus, for large individual proteins we could increase the sample size up to 1,000 fold (S1 Dataset). For each GO term enriched for the ANCOVA-identified proteins we selected all related proteins that were also significantly regulated and generated two linear regression models to predict the three DAS (Nausea, NTT, MS) for the control and placebo groups. A ‘bisquare’ weight function was used to remove outliers from the model.

Prediction of placebo responders.

Prediction of placebo responders was based on protein baseline data on Day 2. Only participants for whom a complete proteomics dataset was available after pre-processing were included. A one-way ANOVA on protein level was performed to preselect proteins at baseline of Day 2 that were expressed differentially between placebo responders and placebo non-responders for nausea and MS, respectively. Placebo responders were defined as participants in the placebo groups showing ≥50% reduction in nausea/MS from Day 1 to Day 2. Initially, only the top 5 proteins based on the F-statistic were selected. Subsequently, we performed sequential feature selection to identify additional proteins with potential to predict good responders. Finally, we used a linear support vector machine to generate two predictive models for each of the nausea scores. The first model included ANOVA and predictor proteins from sequential feature selection (‘ANOVA plus model’). As a null model reference, we used a model based on randomly selected proteins (‘RANDOM model’). Median receiver operating characteristic (ROC) curves and mean area under the curve (AUC) were estimated using k-fold cross validation with k = 10, with 10 independent permutations.
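The responder definition and the AUC metric can be sketched in a few lines. This does not reproduce the support vector machine, the feature selection or the cross-validation; it only shows how a ≥50% reduction criterion and an AUC could be computed. The function names, the per-day scores and the classifier outputs are hypothetical.

```python
def is_responder(day1_score, day2_score, min_reduction=0.5):
    """True if the score dropped by at least min_reduction from Day 1 to Day 2."""
    if day1_score <= 0:
        return False  # no symptom on Day 1, so a relative reduction is undefined
    return (day1_score - day2_score) / day1_score >= min_reduction

def roc_auc(labels, scores):
    """AUC as the probability that a responder scores higher than a
    non-responder, counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical nausea scores (Day 1, Day 2) and classifier outputs.
labels = [is_responder(d1, d2) for d1, d2 in [(8, 4), (6, 2), (8, 5), (7, 7)]]
scores = [0.9, 0.4, 0.5, 0.1]
print(labels, roc_auc(labels, scores))
```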

3 Discussion

To our knowledge, our current study and data present the first MS detection and quantitation of the I179T PSA proteoform resulting from the rs17632542 SNP in post-DRE urine. This prostate proximal fluid can be easily collected and processed using high-throughput methods of small clinical sample volumes in large batches by our optimized MStern protocol. Each assay requires only a minimal volume of 250 μL. To further increase throughput, isobaric reporter tags, for example, tandem-mass tag, could be incorporated allowing for a single MS acquisition to provide accurate genotype information and quantitation on upward of 11 individuals at once. Additionally, gradient lengths can be significantly shortened, and chromatographic conditions optimized for speed if needed. In fact, acquisition parameters are under optimization to allow for a more than 75% shorter run time per sample with no hardware changes (Figure S4, Supporting Information).

The accuracy of determining not only the presence of the I179T variant, but also the corresponding WT peptide, which allows for genotype determination, was 100%. This is an important point, as studies including more than 1300 patients have shown that serum PSA levels are lower for individuals carrying two copies of the rare allele than for heterozygous individuals, and that levels are highest among homozygous wild-type individuals. [19] The overall expectation that individuals homozygous for the rare allele have lower PSA levels is evident from both serum PSA and the MS measurement of the control PSA peptide, as seen in Figure 3. However, the difference between homozygous WT and heterozygous individuals was not readily observed in our data; we suspect it may become apparent with larger sample numbers.

Although we cannot be certain of the number of PCa cases that would have been treated differently had the presence and quantitation of the rs17632542 SNP been known, a multiplexed urine-based assay that incorporates our current approach will allow for rapid and accurate determination of genotype expression. In fact, the ability to assess allele-specific protein expression may assist in stratifying risk associated with heterozygosity at this site. Various theories exist, including half-life reduction due to variant protein instability and an overall deficiency of secretion. Sampling from post-DRE urines may help remedy these issues, as the massaging of the prostate during a DRE aids in the release of PSA, amongst other prostate proteins, immediately prior to sample collection. [30] Figure 3 shows that PSA concentrations among diseased patients are broadly higher in this proximal fluid than in serum. We do acknowledge, however, that normalization, for example to the protein uromodulin, would likely need to be considered to account for variability in sample concentrations.

JPT Peptide Technologies GmbH

JPT Peptide Technologies GmbH was established in 2004 as a spin-off of Jerini AG, a Berlin-based peptide drug discovery company. In 2008, JPT became a wholly owned subsidiary of BioNTech AG, Mainz (Germany), a company that develops novel immune therapy and diagnostic approaches for various cancers. JPT Peptide Technologies GmbH is a leading provider of innovative peptide-based services, catalog products and kits, as well as a research and development partner for projects in Immunology, Proteomics and Drug Discovery. JPT's head office and production sites are located in Berlin, Germany. All of its production and services are performed in Berlin, Germany, in accordance with DIN EN ISO 9001:2015 guidelines. Together with its US subsidiary based in Acton, near Boston, Massachusetts, JPT serves a worldwide customer base in the pharmaceutical and biotechnology industries, as well as researchers in universities, governmental and non-profit organizations. Over the past decade, JPT has developed a portfolio of proprietary technologies and a series of unique products and services which support research efforts in proteomics, all development phases of novel vaccines or immunotherapies, and peptide-based drug discovery. Since 2009, JPT has been actively involved in R&D partnerships and contract research projects focusing on seromarker discovery & validation, development of immune monitoring tools and diagnostics, vaccine target discovery, peptide lead identification & optimization, biomarker quantification by targeted proteomics, and enzyme substrate identification & sensitivity profiling.

Project Management

JPT’s interdisciplinary team of 50 experts in peptide and medicinal chemistry, bioinformatics and assay development combines scientific expertise and creativity with unique and proprietary high-throughput synthesis and assay technologies. The company structure, with varied but focused expertise, provides an integrated and comprehensive one-stop source for peptide-related R&D projects and enables efficient interaction. An experienced project management team with a track record of successful R&D projects spanning more than a decade, together with a DIN ISO 9001:2015 certified Quality Management System, ensures professional project execution at short cycle times.

In addition to innovative peptide related catalog products and services, JPT applies its know-how and expertise in collaborative partnerships with companies and non-profit organizations. Depending on the goals, complexity and risk of a given project, JPT offers a variety of business models that range from fee-for-service and FTE based contract research to collaborative partnerships and risk sharing project models.

Quantitative Proteomics

Functional proteomics aims for the elucidation of the biological function of proteins or protein groups and classes on a proteome-wide level.

This includes the characterization of enzyme activities as well as protein/protein interactions and post-translational modifications of proteins. Combining our technologies and expertise, we are able to map changes in signaling pathways as a consequence of cell treatment or changes in environmental conditions. This can be done either by the large-scale measurement of enzymatic activities of kinases, phosphatases, proteases or other protein-modifying enzymes or by measurement of epigenetic modifications of proteins.

We have also developed innovative ways of evaluating and presenting complex data sets, which is vital for the optimal exploitation of experimental results.

Targeted Mass Spectrometry (MRM/SRM)

Multiple Reaction Monitoring/Selected Reaction Monitoring Mass Spectrometry

Discovery or shotgun proteomics usually targets the identification of large sets of proteins in complex samples. In targeted proteomics, highly selective quantitative assays, e.g. MRM assays, are developed for smaller numbers of proteins. The significance of targeted proteomics is exemplified by its selection by Nature Methods as Method of the Year 2012.

Multiple reaction monitoring (MRM), one of the most frequently used targeted proteomics approaches, detects precursor and product ion pairs, furnishing assays for the detection and quantitation of proteins in biological samples spiked with proteotypic peptide standards.
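To make the idea of precursor and product ion pairs concrete, the following minimal Python sketch computes candidate MRM transitions for an unmodified peptide. It is an illustration only, not JPT's assay design procedure: the function names, the choice of singly charged y-ions, the doubly charged precursor and the example peptide are all assumptions, and only standard monoisotopic masses are used.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of one charge-carrying proton


def precursor_mz(peptide, charge=2):
    """m/z of the intact, unmodified peptide at the given charge state."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def y_ion_mz(peptide, n, charge=1):
    """m/z of the y-ion made of the last n residues of the peptide."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide[-n:]) + WATER
    return (neutral + charge * PROTON) / charge


def transitions(peptide, precursor_charge=2, min_len=3):
    """List (precursor m/z, product m/z) pairs for y-ions y_min_len .. y_(L-1)."""
    q1 = precursor_mz(peptide, precursor_charge)
    return [(q1, y_ion_mz(peptide, n)) for n in range(min_len, len(peptide))]


if __name__ == "__main__":
    # Hypothetical tryptic peptide used purely for illustration.
    for q1, q3 in transitions("PEPTIDEK"):
        print(f"{q1:.4f} -> {q3:.4f}")
```

In a real assay, the transitions would be chosen empirically (for example from spectra of the synthetic standard) rather than enumerated exhaustively as here.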

JPT has been providing tools for targeted proteomics for years.
SpikeTides™ are cost-effective peptides that allow high-speed SRM assay and MRM assay development and protein quantitation with almost unlimited coverage through entire proteomes. Thousands of proteotypic SpikeTides™ can be prepared economically at unmatched turnaround times (10 000 peptides per week). Absolutely quantified SpikeTides™ (SpikeTides™ TQL) use a new approach for absolute quantification that is significantly more cost effective than using traditional AAA quantified standard peptides. SpikeTides™ also enable the monitoring of cellular regulation by incorporation of post-translational modifications. See the scheme on the bottom of the page for a graphical overview on how SpikeTides™ are used in discovery and targeted proteomics experiments.

In addition, JPT offers complete proteomic studies in collaboration with its partners. Please feel free to discuss all aspects of your project with our experts.
Our Bioinformatics expertise allows for intelligent peptide library design that, in combination with our technologies, helps to solve complex problems in target discovery and validation as well as biomedical research.

Concluding Remarks and Perspectives

The use of LC-MS/MS strategies is the most useful and promising path to improve the identification and quantification of immunogenic peptides. Despite the methodological difficulties, it proves to be a fast, sensitive, and reproducible method. In addition, it can be extended to several other allergenic food matrices, like dairy, nuts, and seafood. Thus, knowing the profile of allergenic proteins of cereals is necessary as a basis, not only for future applications of MS in the quantification of gluten in food, but also to ensure the safety of consumers regarding food labeled cereal- or gluten-free.

Although the declaration of gluten-containing cereals on products labeled gluten-free is mandatory worldwide, there is no certified reference material available for gluten. The available reference material contains only gliadins, which leads to underestimation of the gluten content, and reproducing a new batch with similar properties and composition is a further problem. The majority of MS-based studies have been conducted with the final objective to establish a reference material for gluten analysis starting from the study of specific grain peptide markers. Therefore, targeted high-resolution MS/MS methods allowed the quantification of low levels of specific marker peptides from different species and protein types.

When comparing LC-MS/MS methods to ELISA for gluten detection, ELISA still remains the method of choice in most cases, because it is fast, comparatively cheap, suitable for routine analyses, and does not require highly specialized equipment. However, several studies have shown that ELISA may underestimate gluten contents, especially in processed foods that have been extensively heat-treated or hydrolyzed. Untargeted LC-MS/MS is recommended to screen for the presence of gluten-derived peptides in products such as beer, malt vinegar, and fermented sauces. However, some points will equally affect all analytical methods, because gluten extractability has been shown to be reduced substantially in heat-treated foods and processing-induced post-translational protein modifications will lead to reduced gluten detectability irrespective of the analytical method used.

The use of modern MS-based techniques, combining orthogonal separations with high sensitivity and reliable certified reference materials, will hopefully help to better comprehend the effect of food processing or plant breeding on gluten immunogenicity. Continued efforts in this area will also help to answer the questions about the selection of relevant target epitopes and even antibodies, taking into account the high protein polymorphism and the fact that patients react individually to different proteins and present variable sensitivities.

Missing Protein Detection and Strategies

The authors of one study developed a novel protein extraction method for human alveolar bone obtained from impacted third-molar extractions: solubilizing osteoid, releasing hydroxyapatite-bound and collagen-bound proteins, and then digesting insoluble cross-linked proteins with trypsin, followed by LysC in urea. All extracts were digested by four proteases for MS analysis (trypsin, LysargiNase, GluC, AspN) with TAILS employed to identify mature and neo N-termini, and a search parameter for hydroxyproline was included. Many tissue-specific MPs were expected. A quite extensive identification of peptides and proteins was demonstrated, with many interesting observations and proteins detected for the first time in bone. However, only one MP, pannexin-3 (Chr 11q24.2), was identified that met HPP Guidelines. The authors also identified proteotypic peptides from 17 proteins previously classified as PE1 from non-MS methods, which the authors term non-MS PE1 proteins, of which two met the HPP Guidelines v2.1 in full to now be classified as PE1 proteins with complete MS evidence. Furthermore, the bone N-terminome provided insight into the proteolysis during bone development. This workflow is applicable to similar mineralized tissues. The paucity of tissue-specific MPs may be a clue that unusual PTMs and protein complexes drive tissue-specific features.

Lin et al. observed that PE2,3,4 MPs have lower mean and median molecular weights (38 kDa/∼345 aa and 34 kDa/310 aa, respectively) than PE1 proteins (66 kDa/600 aa and 50 kDa/455 aa). They devised a workflow to enrich low-molecular-weight proteins from human placenta samples using a C18 solid-phase extraction column and fractionation with SDS-PAGE gel or a 50 kDa cutoff filter and the use of LysargiNase, a mirror protease to trypsin. In so doing, the authors identified three MPs (TRNP1, LCE6A, IGFL2), with lengths from 80 to 227 aa, with pairs of proteotypic peptides confirmed by parallel reaction monitoring (PRM). A fourth candidate corresponding to the extracellular functional domain in a cleaved isoform of UMODL1 did not have PRM confirmation. A similar approach had been applied to the testis in the 2018 special issue.(16)
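The phrase "mirror protease to trypsin" can be illustrated with a small in silico digestion. Trypsin cleaves on the C-terminal side of Lys and Arg, whereas LysargiNase cleaves on the N-terminal side of the same residues, so the resulting peptides carry K/R at opposite ends. The Python sketch below is a simplification under stated assumptions: complete digestion, no missed cleavages, no proline rule, and a made-up sequence; the function name `digest` is hypothetical.

```python
def digest(sequence, sites="KR", n_terminal=False):
    """Cut `sequence` at every residue in `sites`.

    n_terminal=False: cut after the residue (trypsin-like).
    n_terminal=True:  cut before the residue (LysargiNase-like).
    """
    cuts = []
    for i, aa in enumerate(sequence):
        if aa in sites:
            cuts.append(i if n_terminal else i + 1)
    # Drop cuts at the very ends, which would only create empty peptides.
    bounds = [0] + [c for c in cuts if 0 < c < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    seq = "AAKCCRDD"  # hypothetical sequence, for illustration only
    print("trypsin-like:    ", digest(seq))
    print("LysargiNase-like:", digest(seq, n_terminal=True))
```

Because the two enzymes share cleavage sites but leave the basic residue on different termini, the same stretch of sequence is presented to the mass spectrometer in two different forms, which can help with short proteins that yield few usable tryptic peptides.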

The authors of another study took advantage of combinations of trypsin, LysC, and GluC to enhance the MP identification in the human testis using a search engine called Open-pFind, which identified more peptide-spectrum matches (PSMs), peptides, and MP candidates than MaxQuant, pFind, and Proteome Discoverer, mostly due to PTMs, and also more small proteins with <100 aa. They report the promotion of five MP candidates verified with two uniquely mapping, non-nested peptides of ≥9 aa confirmed with synthetic peptides, meeting the HPP Guidelines v2.1, from a much larger number of MP candidates. Two already had a singleton peptide in PeptideAtlas. The authors provide detailed biological annotations for the five MPs.
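The criterion quoted above (two uniquely mapping, non-nested peptides of ≥9 aa) can be sketched as a simple filter. The Python example below is a simplified reading, not the official HPP implementation: it assumes that unique mapping has already been established upstream (for instance with a tool such as the neXtProt uniqueness checker), it treats "nested" as plain substring containment, and the function names and peptide strings are hypothetical.

```python
from itertools import combinations


def is_nested(a, b):
    """True if one peptide is fully contained in the other."""
    return a in b or b in a


def qualifying_pairs(peptides, min_len=9):
    """Return all pairs of distinct peptides that are long enough and non-nested.

    Assumes every peptide passed in already maps uniquely to the protein.
    """
    long_enough = sorted({p for p in peptides if len(p) >= min_len})
    return [(a, b) for a, b in combinations(long_enough, 2)
            if not is_nested(a, b)]


def meets_two_peptide_rule(peptides, min_len=9):
    """True if at least one qualifying pair of peptides exists."""
    return bool(qualifying_pairs(peptides, min_len))


if __name__ == "__main__":
    # Hypothetical peptide lists for three candidate proteins.
    print(meets_two_peptide_rule(["AAAAAAAAAK", "CCCCCCCCCR"]))
    print(meets_two_peptide_rule(["ACDEFGHIKL", "CDEFGHIKL"]))
    print(meets_two_peptide_rule(["ACDEFGHIK", "SHORT"]))
```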

Elgoushy et al. extended their bioinformatic strategy utilizing the SRMAtlas as a source of spectra for synthetic proteotypic peptides to confirm stranded or singleton peptides in the Global Proteome Machine (GPM). GPM results are not incorporated into PeptideAtlas, but some primary data sets can be found at PRIDE via ProteomeXchange. Starting from SRMAtlas, the authors identified 6736 non-nested synthetic peptides corresponding to 1644 of the 2129 neXtProt PE2,3,4 MPs as of 2019-01 using the neXtProt uniqueness checker. Then, they searched GPM for two or more of these non-nested proteotypic peptides of ≥9 aa from the same proteomic study. A total of 51 new MP candidates were identified in 35 different studies, of which 23 (with 55 total peptides) were reported as validated after careful spectral matching. The release of the Universal Spectrum Identifier by the PSI should facilitate such matching of spectra between SRMAtlas and GPM or other sources of studies not already incorporated by PeptideAtlas. The synthetic peptides experimentally analyzed in SRMAtlas were selected according to their physicochemical properties of length, hydrophobicity, and charge state as having a high likelihood of detectability as natural peptides from biological specimens. The authors provide an interesting discussion of the features of the MPs identified through this approach.

The authors of another study assess some of the challenges of finding members of by far the largest class of missing proteins, the 405 PE2,3,4 predicted protein-coding genes for human olfactory receptors (ORs), leaving aside a similar number of PE5 entries that are mostly pseudogenes. There are four PE1 OR proteins in the neXtProt 2019-01 release based on protein–protein interaction data (OR1D2, OR2AG1) or genetic or biochemical data (OR1D2, OR2J3). There is not a single OR validated as PE1 with MS evidence, despite many claims over the years and the publication of proteotypic peptides for these in the SRMAtlas. The authors noted confined spatiotemporal or developmental expression, a very low copy number of transcripts, gene silencing, and limited access to the olfactory epithelium as potential explanations. Here they focused on the hydrophobic seven-pass transmembrane structure of these G-protein-coupled receptors, the inaccessibility of the few Lys and Arg sites for tryptic digestion, and the limited hydrophilic domains. Hypothetically, 58% of ORs might be capable of generating pairs of proteotypic tryptic peptides if and only if the abundance was sufficient and the appropriate specimens were obtainable. No OR was detected in a 2018 special issue paper on the olfactory epithelium in which five other MPs were detected.(16)

The authors of a further study address the question of whether MPs are undetected because their abundance is too low (technical insensitivity) or because the coding gene is not transcribed or the transcript is not translated, including major differences across tissues and cellular heterogeneity (biological explanation). They introduce a quantitative polymerase chain reaction (qPCR)-based approach for detecting low-abundance transcripts by extending the number of cycles from 30 to 40 or more and confirming with Sanger amplicon sequencing. They argue that this qPCR method, which is coupled to RNA-seq, enabled the detection of low-abundance transcripts from Chr 18. In principle, this approach could be integrated into C-HPP initiatives as an optional tool when high-quality samples and proper RNA-seq data sets and primers are available.
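A back-of-the-envelope calculation shows why ten additional cycles matter. Under the textbook model of exponential amplification, the amount of product after n cycles scales as (1 + E)^n, where E is the per-cycle efficiency (E = 1 means perfect doubling). The short Python sketch below applies that model; the function name and the efficiency values are assumptions for illustration, and real reactions plateau and rarely reach E = 1.

```python
def fold_amplification(cycles, efficiency=1.0):
    """Theoretical fold increase in amplicon after `cycles` PCR cycles."""
    return (1.0 + efficiency) ** cycles


# Extra amplification gained by running 40 cycles instead of 30.
gain_ideal = fold_amplification(40) / fold_amplification(30)
gain_90pct = fold_amplification(40, 0.9) / fold_amplification(30, 0.9)

if __name__ == "__main__":
    print(f"ideal doubling:  {gain_ideal:.0f}-fold")
    print(f"90% efficiency:  {gain_90pct:.0f}-fold")
```

Under ideal doubling the ten extra cycles correspond to 2^10 = 1024-fold more product, which is the sense in which extending the cycle number lowers the abundance threshold for detection. The same extension also amplifies nonspecific products, which is presumably why the authors confirm amplicons by Sanger sequencing.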

The authors of another paper describe the publications and strategies of the Chr 18 team in response to the C-HPP neXt-MP50 and neXt-CP50 Challenges to identify missing proteins and annotate uncharacterized PE1 proteins for function, which together represent the “dark proteome”. The authors address two key technical challenges: the low analytical sensitivity of proteomic technologies and the greater complexity of the proteome versus the genome. Beyond shotgun and targeted MS, they suggest nanotechnologies combined with atomic force microscopy (AFM) and greater attention to proteoforms arising from sequence variants, splicing, and chemical PTM. AFM chips with affinity reagents may be useful enrichment tools. They provide insight into the feasibility of detecting low-abundance proteins.

… 3 from placenta (Lin et al.) and 23 from the SRMAtlas/GPM analyses (Elgoushy et al.), is much smaller than last year, when there were 104 MPs from specific studies plus 107 candidates from MassIVE in the 2018 HPP Special Issue.(17) When the full Sun et al.(18) data set was reanalyzed by PeptideAtlas, 73 MPs were confirmed, whereas only 14 had been validated with synthetic peptides in the original paper (Omenn et al.(19)). As previously pointed out,(15) there are still many predicted proteins that may not be detectable by the current sample preparation and MS methods. Nevertheless, there is very substantial annual progress from the work of the entire proteomics community, as documented in the metrics paper by Omenn et al., which showed large gains in PE1 proteins (462 in 2017, 224 in 2018) and substantial reductions in PE2,3,4 proteins (431 in 2017, 213 in 2018) during each of the past 2 years. We await the 2020-01 releases from PeptideAtlas and neXtProt to confirm the promotion of these and other MPs found in community-supplied data sets.


The Human Proteome Project (HPP) aims to decipher the complete map of the human proteome. In the past few years, significant efforts of the HPP teams have been dedicated to the experimental detection of the missing proteins, which lack reliable mass spectrometry evidence of their existence. In this endeavor, an in-depth analysis of shotgun experiments might represent a valuable resource for selecting a biological matrix when designing validation experiments. In this work, we used all the proteomic experiments from the NCI60 cell lines and applied an integrative approach based on the results obtained from Comet, Mascot, OMSSA, and X!Tandem. This workflow benefits from the complementarity of these search engines to increase the proteome coverage. Five missing proteins compliant with the C-HPP guidelines were identified, although further validation is needed. Moreover, 165 missing proteins were detected with only one unique peptide, and their functional analysis supported their participation in cellular pathways, as was also proposed in other studies. Finally, we performed a combined analysis of the gene expression levels and the proteomic identifications from the cell lines common to the NCI60 and the CCLE project to suggest alternatives for further validation of missing protein observations.
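As a rough illustration of how results from several search engines might be integrated, the Python sketch below merges peptide identifications by taking their union per protein and then separates proteins supported by two or more distinct peptides from those seen with a single peptide. The abstract does not say how the four engines' outputs were actually combined, so the union strategy, the function names, and the accessions and peptides in the example are all assumptions.

```python
from collections import defaultdict


def combine_engines(results):
    """Merge identifications from several search engines.

    results: {engine_name: [(protein, peptide), ...]}
    returns: {protein: set of distinct peptides seen by any engine}
    """
    merged = defaultdict(set)
    for hits in results.values():
        for protein, peptide in hits:
            merged[protein].add(peptide)
    return dict(merged)


def split_by_evidence(merged, min_peptides=2):
    """Separate proteins with >= min_peptides peptides from single-peptide hits."""
    multi = sorted(p for p, peps in merged.items() if len(peps) >= min_peptides)
    single = sorted(p for p, peps in merged.items() if len(peps) == 1)
    return multi, single


if __name__ == "__main__":
    # Hypothetical identifications; accessions and peptides are made up.
    results = {
        "Comet":    [("PROT_A", "AAAAAAAAAK"), ("PROT_B", "DDDDDDDDDR")],
        "Mascot":   [("PROT_A", "AAAAAAAAAK")],
        "OMSSA":    [("PROT_A", "CCCCCCCCCR")],
        "X!Tandem": [("PROT_B", "DDDDDDDDDR")],
    }
    multi, single = split_by_evidence(combine_engines(results))
    print("two or more peptides:", multi)
    print("single peptide only: ", single)
```

A union increases coverage at the cost of accumulating each engine's false positives, so in practice the merged list would still need error-rate control and checks such as peptide length and uniqueness before any protein is claimed.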


This article is part of the Chromosome-Centric Human Proteome Project 2017 special issue.