C4. Common Motifs in Proteins - Biology

Super-Secondary Structure - Given the number of possible combinations of 1°, 2°, and 3° structures, one might guess that the 3D structure of each protein is quite distinctive. However, similar substructures recur in many proteins. For instance, common secondary structures are often grouped together to form motifs called super-secondary structure (SSS). Some examples follow:

  • helix-loop-helix: found in DNA-binding proteins and also in calcium-binding proteins, where this motif is often called the EF hand. The loop region in calcium-binding proteins is enriched in Asp, Glu, Ser, and Thr. Why? The EF hand shown below is from calmodulin.

Figure: helix-loop-helix (image made with VMD)

Figure: EF Hand

Jmol: helix-loop-helix of the lambda repressor

Jmol: helix-loop-helix (EF hand) from calmodulin

  • beta-hairpin or beta-beta: present in most antiparallel beta structures, both as an isolated ribbon and as part of beta sheets.

Figure: beta-hairpin, or beta-beta (image made with VMD)

Jmol: beta-hairpin from bovine pancreatic trypsin inhibitor

  • Greek Key motif: four adjacent antiparallel beta strands are often arranged in a pattern similar to the repeating unit of one of the ornamental patterns used in ancient Greece.

Figure: Greek Key Motif

Jmol: Greek Key

  • beta-alpha-beta: a common way to connect two parallel beta strands (a beta hairpin connects antiparallel beta strands).

Figure: beta-alpha-beta (image made with VMD with H atoms added by MolProbity)

Jmol: beta-helix-beta motif from triose phosphate isomerase

  • Beta Helices: These right-handed parallel helix structures consist of a contiguous polypeptide chain with three parallel beta strands separated by three turns, forming a single rung of a larger helical structure which in total might contain as many as nine rungs. The intrachain H-bonds are between parallel beta strands in separate rungs. These structures seem to be prevalent in proteins of pathogens (bacteria, viruses, toxins) that facilitate binding of the pathogen to a host cell.

    Figure: Beta Helices (image made with VMD)

    Table: Beta Helices
    Organism                     Disease
    Vibrio cholerae              cholera
    Helicobacter pylori          ulcers
    Plasmodium falciparum        malaria
    Chlamydia trachomatis        VD
    Chlamydophila pneumoniae     respiratory infection
    Trypanosoma brucei           sleeping sickness
    Borrelia burgdorferi         Lyme disease
    Bordetella parapertussis     whooping cough
    Bacillus anthracis           anthrax
    Neisseria meningitidis       meningitis
    Legionella pneumophila       Legionnaires' disease
  • Beta Topologies on the Web
  • of the Swiss Institute of Bioinformatics. (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE

Domains -

Domains are the fundamental unit of 3° structure. A domain can be considered a chain or part of a chain that can independently fold into a stable tertiary structure. Domains are units of structure but can also be units of function. Some proteins can be cleaved at a single peptide bond to form two fragments. Often, these can fold independently of each other, and sometimes each unit retains an activity that was present in the uncleaved protein. Sometimes binding sites on the proteins are found in the interface between the structural domains. Many proteins seem to share functional and structural domains, suggesting that the DNA of each shared domain might have arisen from duplication of a primordial gene with a particular structure and function.

Evolution has led towards increasing complexity, which has required proteins of new structure and function. Increased and different functionalities in proteins have been obtained by adding domains to a base protein. Chothia (2003) has defined a domain in an evolutionary and genetic sense as "an evolutionary unit whose coding sequence can be duplicated and/or undergo recombination". Proteins range from small with a single domain (typically from 100-250 amino acids) to large with many domains. From recent analyses of genomes, new protein functionalities appear to arise from addition or exchange of other domains which, according to Chothia, result from

  • "duplication of sequences that code for one or more domains
  • divergence of duplicated sequences by mutations, deletions, and insertions that produce modified structures that may have useful new properties to be selected
  • recombination of genes that result in novel arrangement of domains."

Structural analyses show that about half of all protein coding sequences in genomes are homologous to other known protein structures. There appear to be about 750 different families of domains (i.e., small proteins derived from a common ancestor) in vertebrates, each with about 50 homologous structures. About 430 of these domain families are found in all the genomes that have been solved.

Leucine zipper

A leucine zipper (or leucine scissors [1] ) is a common three-dimensional structural motif in proteins. They were first described by Landschulz and collaborators in 1988 [2] when they found that an enhancer binding protein had a very characteristic 30-amino acid segment and the display of these amino acid sequences on an idealized alpha helix revealed a periodic repetition of leucine residues at every seventh position over a distance covering eight helical turns. The polypeptide segments containing these periodic arrays of leucine residues were proposed to exist in an alpha-helical conformation and the leucine side chains from one alpha helix interdigitate with those from the alpha helix of a second polypeptide, facilitating dimerization.
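The heptad periodicity described above can be checked on a sequence with a few lines of code. The following is a minimal sketch, not the method used by Landschulz and collaborators: the function name `find_leucine_heptads` and the threshold `min_repeats=4` are assumptions chosen for illustration, and a hit only indicates a candidate region, not a confirmed zipper.

```python
def find_leucine_heptads(seq, min_repeats=4):
    """Return (start_index, count) for runs of Leu spaced exactly seven residues apart.

    min_repeats is a hypothetical cutoff, not a value taken from the text.
    Indices are 0-based.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        if seq[start] != "L":
            continue
        # Skip positions that only continue a run begun seven residues earlier.
        if start >= 7 and seq[start - 7] == "L":
            continue
        count = 0
        pos = start
        while pos < len(seq) and seq[pos] == "L":
            count += 1
            pos += 7
        if count >= min_repeats:
            hits.append((start, count))
    return hits


if __name__ == "__main__":
    # Toy sequence with Leu at positions 0, 7, 14, 21 and 28 (not a real protein).
    toy = "LAAAAAA" * 4 + "L"
    print(find_leucine_heptads(toy))
```

A real analysis would also ask whether the region is predicted to be alpha-helical, since the periodicity only matters when the leucines line up along one face of a helix.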

Leucine zippers are a dimerization motif of the bZIP (Basic-region leucine zipper) class of eukaryotic transcription factors. [3] The bZIP domain is 60 to 80 amino acids in length with a highly conserved DNA binding basic region and a more diversified leucine zipper dimerization region. [4] The positioning of the leucines is critical for binding of the proteins to DNA. Leucine zippers are present in both eukaryotic and prokaryotic regulatory proteins, but are mainly a feature of eukaryotes. They can also be annotated simply as ZIPs, and ZIP-like motifs have been found in proteins other than transcription factors and are thought to be one of the general protein modules for protein–protein interactions. [5]

Motifs and domains are completely different concepts, although they are sometimes connected.

A motif in biology is a mathematical model, typically of a sequence, that predicts which sequences belong to some defined group. For example, a DNA sequence motif can characterize the binding site of a transcription factor, i.e. which sequences tend to be bound by this factor. For proteins, sequence motifs can characterize which proteins (protein sequences) belong to a given protein family. A simple motif could be, for example, some pattern which is strictly shared by all members of the group, e.g. WTRXEKXXY (where X stands for any amino acid). There are also more complex motif models.
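A strict pattern such as WTRXEKXXY maps directly onto a regular expression. The sketch below assumes the 20 standard one-letter amino acid codes; the helper names `motif_to_regex` and `matches_family` and the two toy sequences are hypothetical and serve only to illustrate the idea.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def motif_to_regex(motif):
    """Compile a pattern like 'WTRXEKXXY' into a regex; X matches any standard residue."""
    parts = []
    for ch in motif.upper():
        parts.append("[" + AMINO_ACIDS + "]" if ch == "X" else re.escape(ch))
    return re.compile("".join(parts))


def matches_family(motif, sequences):
    """Map each sequence name to True if the motif occurs anywhere in the sequence."""
    pattern = motif_to_regex(motif)
    return {name: bool(pattern.search(seq.upper())) for name, seq in sequences.items()}


if __name__ == "__main__":
    toy = {
        "member": "MMWTRAEKGGYLL",      # contains W-T-R-A-E-K-G-G-Y
        "non_member": "MMWTRAEKGGFLL",  # final Y replaced by F
    }
    print(matches_family("WTRXEKXXY", toy))
```

More complex motif models, such as position weight matrices or profile hidden Markov models, replace the yes/no match with a score, but the principle of testing a sequence against a family model is the same.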

Protein domains, on the other hand, are a structural entity, usually meaning a part of the protein structure which folds and functions independently. So, proteins are often constructed from different combinations of these domains.

So how are motifs and domains related? When you think about protein families, it makes sense not only to look at the whole sequence but also to focus on individual domains. Since domains are elementary functional-structural units, it makes sense to find sequence motifs for individual domains. So, you often find that a protein contains multiple domains, each domain characterized by having a sequence that matches the motif of its family.

Characterization of a novel class of plant homeodomain proteins that bind to the C4 phosphoenolpyruvate carboxylase gene of Flaveria trinervia

We are interested in the regulatory mechanisms responsible for the mesophyll-specific expression of C4 phosphoenolpyruvate carboxylase (PEPCase). A one-hybrid screen resulted in the cloning of four different members of a novel class of plant homeodomain proteins, which are most likely involved in the mesophyll-specific expression of the C4 PEPCase gene in C4 species of the genus Flaveria. Inspection of the homeodomains of the four proteins reveals that they share many common features with homeodomains described so far, but there are also significant differences. Interestingly, this class of homeodomain proteins occurs also in Arabidopsis thaliana and other C3 plants. One-hybrid experiments as well as in vitro DNA binding studies confirmed that these novel homeodomain proteins specifically interact with the proximal region of the C4 PEPCase gene. The N-terminal domains of the homeodomain proteins contain highly conserved sequence motifs. Two-hybrid experiments show that these motifs are sufficient to confer homo- or heterodimer formation between the proteins. Mutagenesis of conserved cysteine residues within the dimerization domain indicates that these residues are essential for dimer formation. Therefore, we designate this novel class of homeobox proteins ZF-HD, for zinc finger homeodomain protein. Our data suggest that the ZF-HD class of homeodomain proteins may be involved in the establishment of the characteristic expression pattern of the C4 PEPCase gene.


Although tandem repeats of amino acids are easily recognized features of proteins and have been extensively studied, protein sequences show more widespread repetitive features. This is shown by the high proportion of proteins containing repetitive segments - approximately 50% as measured by SEG [30] and over 70% of the S. cerevisiae proteome as measured by SIMPLE [32]. In this study we have compared the frequencies of tandem repeats with those of C4 repeats (repetitive regions with a local overrepresentation of motifs of length four residues) using SIMPLE, which has the advantage that it identifies explicitly the overrepresented motif in a given region. We have carried out this comparison in a large set of proteins orthologous between four mammals and chicken, which is the most closely related non-mammalian species with a sequenced genome. This allows us to compare repeat frequencies both between types and between species.
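To make the notion of a tandem amino acid repeat concrete, the sketch below reports uninterrupted runs of a single residue. It is not the SIMPLE or SEG algorithm and does not detect cryptic C4 repeats; the function name `tandem_repeats` and the cutoff `min_len=5` are assumptions for illustration, since the text does not state the length threshold used in the study.

```python
import re


def tandem_repeats(seq, min_len=5):
    """Return (residue, start_index, run_length) for each homopolymer run.

    min_len is a hypothetical threshold, not one taken from the study.
    Indices are 0-based.
    """
    # A residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    # Toy sequence: a run of six Gln and a run of five Ser.
    print(tandem_repeats("MAQQQQQQKLSSSSSE"))
```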

After excluding C4 motifs that overlap tandem repeats, many of the C4 motifs detected in these genomes are clearly related to common tandemly repeated amino acids (six of the seven most common tandem amino acid types in Figure 1a are mirrored by the six most common homogeneous C4 repeat types in Figure 1b), suggesting that the underlying mechanisms that give rise to them are similar. This is also reflected in the high correlations seen between the frequencies of tandem repeats and their respective homogeneous C4 repeats. Tandem amino acid repeats (AARs) most likely evolve by replication slippage, as they evolve more rapidly if they are encoded by pure codon repeats than by interrupted codon repeats [6, 13]. Dieringer and Schlötterer [46] introduced a novel, slippage-related process they called indel slippage that acts in a non-repeat-length-dependent manner on repeated motifs as short as a single nucleotide. Such a mechanism could contribute to the evolution of C4 repeats and other cryptically repetitive sequences [47, 48] and could give rise to differences in the frequencies of tandem and cryptic repeats.

The biggest difference in frequency between tandem and cryptic repeats was seen for Leu, which is rare among C4 repeats. In addition, Q4 repeats are by far the most common class of C4 repeats while Gln is only the seventh most numerous class of tandem repeat in our sample. These large differences could reflect differences in underlying mechanisms (although this seems superficially unlikely as Q tandem repeats are known to undergo rapid evolution [6, 49]) but could also reflect differential selective forces (acting strongly against L4 repeats and Q tandem repeats but less so against their counterparts).

Repeat frequencies were highly similar between the mammals, but the chicken proteome showed a distinct frequency distribution in which most repeat frequencies were lower. A partial exception to this pattern was tandem S repeats, which, although rarer in chicken than in mammals, were the most common class in the chicken proteome. A trivial explanation for these differences could be the currently lower quality of the chicken genome sequence. However, this is unlikely to be the main explanation as the dataset we used contained only clearly identifiable orthologues. Another, and more interesting, possibility is that the lower frequency in chicken is the result of the general reduction of genome size in birds. The chicken genome is approximately one-third the size of the human genome [50] while bird genomes in general are approximately half the size of mammalian genomes [51]. Analysis of the evolution of bird genome size indicates that genome shrinkage took place in the saurischian lineage leading to the birds circa 200 to 300 million years ago and that this was accompanied by a reduction in the genome fraction of repetitive elements [51]. A global correlation of genome sequence repetition with genome size has also been described [52, 53]. The lower frequency of amino acid repeats in chicken proteins may therefore reflect a parallel process of loss of transposable elements and tandem and cryptic repeats in that evolutionary lineage. A possible explanation for the stronger conservation of S repeats between mammals and chicken than other repeat types is that they play a less dispensable role in protein function: serine-rich domains (RS domains) are intimately involved in alternative splicing [54] and it is possible that this role is sufficiently important to ensure their retention.

Previous analyses of the evolution of Gln repeats have suggested that in the early stages of their emergence, when encoded by pure codon repeats, they appear preferentially in regions of proteins that are subject to relatively low levels of purifying selection (that is, regions that evolve more quickly than the rest of the protein) [7, 21, 22]. In this study we have analyzed the evolution of regions flanking tandem and C4 AARs in human-rodent and human-chicken comparisons and show the same trend, confirming that the majority of tandem and C4 repeats in proteins emerge in rapidly evolving subregions. We also confirm earlier suggestions [21, 22] that conserved repeats lie in relatively more conserved protein subregions than non-conserved repeats and show that conserved AARs tend to lie in more conserved proteins than non-conserved AARs. In addition, we observe elevated sequence differences around conserved repeats of both types, although this elevation is less extreme than is observed for non-conserved repeats. The latter result differs from a previous study that did not find a difference between flanking regions and the remainder of the proteins for conserved AARs [21]. However, that study only considered a relatively small number of proteins and so most likely failed to detect this difference due to a lack of statistical power. Generally, the results are consistent with a model of repeat evolution whereby repeats tend to emerge in less-conserved regions of proteins and become frozen in length as they reach a length at which they are close to a threshold at which they may cause deleterious phenotypes [16, 21] but they also suggest that the regions in which repeats become fixed may continue to evolve relatively rapidly after repeat fixation.

Intrinsically unstructured regions (IURs) are regions of proteins that do not form stable tertiary structures under native conditions. Analyses of the extent of disorder in whole genomes suggest that in eukaryotes more than 40% of proteins are either completely disordered or contain significant regions of disorder [40, 41]. In this dataset we find 34% of proteins to contain IURs of length > 50 and 85% to contain an IUR of length > 10. These regions are thought to form flexible regions of proteins that might have a number of functions, including binding to other proteins and small molecules and providing flexibility in multidomain proteins. In an analysis of repeat content of a relatively small number of intrinsically unstructured protein regions, Tompa [25] identified an apparently strong role for AARs in IUR evolution. The definition of 'repeats' in his analysis is different from ours as it included longer, complex repeated motifs as well as simple sequence repeats, but some simple sequence repeats did appear in his results. This raises the question whether there is a real association of simple AARs with IURs, and whether an association of this type can account for the evolutionary dynamics of AARs. Here we have investigated this by considering the overlap between tandem and C4 repeats and, first, domains identifiable by searching the SUPERFAMILY and InterPro databases, and second, unstructured regions predicted by the RONN predictor. The majority of AARs, with the exception of L tandem repeats, lie within IURs predicted by RONN (Figure 6).

We obtained inconsistent predictions on the level of structure shown by A repeats. They were predicted to be predominantly unstructured by two methods, RONN and DISOPRED, but not by a third, IUPRED. This disagreement may reflect the different methodologies employed by the different algorithms as IUPRED takes account of the chemical characteristics of the sequence being analyzed whereas RONN and DISOPRED use structural analyses of proteins. The ambiguous position of A in these analyses is interesting in the light of its role as the second major cause of human repeat expansion disease, after Q. Gln repeats are notable in showing markedly higher proportions of disorder as tandem repeats than as C4 repeats, suggesting that expansion of Q repeats could have a destabilizing effect on proteins, as suggested previously [18].

Seven of the eight most common tandemly repeated amino acids in our dataset correspond to the seven disorder-promoting amino acids defined by Dunker et al. [55]. Lise and Jones [56] in their study of common amino acid patterns in unstructured regions also identified a number of patterns similar to the most common C4 repeats, notably E- and P-rich regions. A strong element of the purifying selection acting against the emergence of AARs within folded regions of proteins therefore appears to be selection against their propensity to lower the stability of these regions. Interestingly, as noted by Kreil and Kreil [57], N repeats are much rarer than Q repeats - indeed, in our analysis of human proteins we found only four tandem N repeats. This observation may reflect the propensity of Asn to promote order [55] and consequent purifying selection acting against the appearance of N repeats in unstructured regions. A similar argument may apply to D and E repeats - Glu, which is common in AARs, is disorder-promoting whereas Asp, which is rare in AARs, is not. In this context, it is noteworthy that although E repeats are the most common class in mammals and the most often predicted to be unstructured, they are also, after L repeats, the class most commonly found associated with SUPERFAMILY and InterPro domains. This raises the question whether the domains in which they are located tend to be close to the threshold of instability. Mean RONN scores of domains containing E repeats are 0.44 for SUPERFAMILY and 0.46 for InterPro domains. These compare to means for all domains containing repeats of 0.43 for SUPERFAMILY domains and 0.41 for InterPro domains. The mean for E repeats in SUPERFAMILY domains is typical of all repeat-containing domains, but that for InterPro domains is the highest amongst all repeat types. 
As most of the domains containing E repeats are InterPro and not SUPERFAMILY domains, this raises the possibility that some E repeat-containing InterPro domains are relatively unstable.

L tandem repeats form an interesting exception to the general association of AARs with unstructured regions as they are predicted to be 100% structured. The amino acids found in tandem repeats tend to be hydrophilic: all the most hydrophilic amino acids [58] are found in the class of common tandem AARs, and the only strongly hydrophobic amino acid in this class is Leu. Hydrophobic amino acids tend to occupy buried positions within proteins, so it is not surprising that Leu repeats show a high propensity to be structured. In earlier analyses, Leu repeats have been found to be concentrated close to the amino termini of proteins [15, 59], presumably forming part of the hydrophobic region of signal sequences, although Leu may also contribute to transmembrane segments of proteins and more generally to protein cores and stabilizing secondary and tertiary structure [59].


Chaperone-mediated autophagy (CMA) contributes to the lysosomal degradation of a selective subset of proteins. Selectivity lies in the chaperone heat shock cognate 71 kDa protein (HSC70) recognizing a pentapeptide motif (KFERQ-like motif) in the protein sequence essential for subsequent targeting and degradation of CMA substrates in lysosomes. Interest in CMA is growing due to its recently identified regulatory roles in metabolism, differentiation, cell cycle, and its malfunctioning in aging and conditions such as cancer, neurodegeneration, or diabetes. Identification of the subset of the proteome amenable to CMA degradation could further expand our understanding of the pathophysiological relevance of this form of autophagy. To that effect, we have performed an in silico screen for KFERQ-like motifs across proteomes of several species. We have found that KFERQ-like motifs are more frequently located in solvent-exposed regions of proteins, and that the position of acidic and hydrophobic residues in the motif plays the most important role in motif construction. Cross-species comparison of proteomes revealed higher motif conservation in CMA-proficient species. The tools developed in this work have also allowed us to analyze the enrichment of motif-containing proteins in biological processes on an unprecedented scale and discover a previously unknown association between the type and combination of KFERQ-like motifs in proteins and their participation in specific biological processes. To facilitate further analysis by the scientific community, we have developed a free web-based resource (KFERQ finder) for direct identification of KFERQ-like motifs in any protein sequence. This resource will contribute to accelerating understanding of the physiological relevance of CMA.
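A sliding-window scan gives a feel for what an in silico screen for KFERQ-like pentapeptides involves. The residue rule coded below is an assumption: it follows the commonly cited canonical description (a Gln at one end, plus one acidic residue D/E, one or two basic residues K/R, and one or two hydrophobic residues F/I/L/V), which this text does not spell out. It ignores motifs generated by phosphorylation or acetylation and is not the KFERQ finder implementation. The names `is_kferq_like` and `find_kferq_like` are hypothetical.

```python
def _core_matches(core):
    """Check the four non-Gln residues against the assumed canonical rule."""
    basic = sum(c in "KR" for c in core)
    hydrophobic = sum(c in "FILV" for c in core)
    acidic = sum(c in "DE" for c in core)
    # Every core residue must belong to one of the three classes.
    return (
        acidic == 1
        and basic in (1, 2)
        and hydrophobic in (1, 2)
        and basic + hydrophobic + acidic == 4
    )


def is_kferq_like(pentapeptide):
    """True if the pentapeptide has Gln at either end and a matching core."""
    p = pentapeptide.upper()
    if len(p) != 5:
        return False
    cores = []
    if p[0] == "Q":
        cores.append(p[1:])
    if p[-1] == "Q":
        cores.append(p[:-1])
    return any(_core_matches(core) for core in cores)


def find_kferq_like(seq):
    """Return (start_index, pentapeptide) for every matching window (0-based)."""
    seq = seq.upper()
    return [
        (i, seq[i:i + 5])
        for i in range(len(seq) - 4)
        if is_kferq_like(seq[i:i + 5])
    ]


if __name__ == "__main__":
    print(find_kferq_like("MAKFERQGG"))  # toy sequence, not a real protein
```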

Citation: Kirchner P, Bourdenx M, Madrigal-Matute J, Tiano S, Diaz A, Bartholdy BA, et al. (2019) Proteome-wide analysis of chaperone-mediated autophagy targeting motifs. PLoS Biol 17(5): e3000301.

Academic Editor: Anne Simonsen, Institute of Basic Medical Sciences, NORWAY

Received: September 25, 2018 Accepted: May 15, 2019 Published: May 31, 2019

Copyright: © 2019 Kirchner et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files. All raw data (individual numerical values that underlie the summary data displayed in main and supplementary figure panels) have been deposited in the publicly available repository GitHub and can be accessed through this link:

Funding: This work was supported by grants from the National Institutes of Health AG031782, AG021904, AG038072, DK098408 (to AMC) and the generous support of the JPB Foundation, Rainwaters Foundation, Leducq Foundation, and Robert and Renée Belfer (to AMC). PK was supported by a DFG KI 1992/1-1 postdoctoral fellowship; JM-M is supported by postdoctoral fellowship 17POST33650088 from the American Heart Association and is a Leducq fellow of the Transatlantic Network of Excellence (RA15CVD04 award). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist based on the content of the submitted manuscript.

Abbreviations: AKT1, RAC-alpha serine/threonine-protein kinase; BLAST, Basic Local Alignment Search Tool; Cath A, protective protein/cathepsin A; CHK1, checkpoint kinase 1; CMA, chaperone-mediated autophagy; eF1a, Elongation factor 1-alpha; eMI, endosomal microautophagy; GAPDH, glyceraldehyde-3-phosphate dehydrogenase; GFAP, Glial fibrillary acidic protein; GO, gene ontology; HIF1a, hypoxia-inducible factor 1-alpha; HSC70, heat shock cognate 71 kDa protein; HSP40, DnaJ homolog subfamily B member 1; HSP90, heat shock protein HSP 90; KD, knocked down; LAMP-2A, lysosome-associated membrane protein type 2A; LC3-II, microtubule-associated proteins 1A/1B light chain 3B; NFAT, nuclear factor of activated T cells; NRF-2, nuclear factor erythroid 2-related factor 2; PARK7, Parkinsonism associated deglycase; PAT, perilipin/ADRP/TIP47; PHLPP1, PH domain leucine-rich repeat-containing protein phosphatase 1; PLIN3, perilipin 3; PTM, posttranslational modification; Rab11, Ras-related protein Rab-11A; RAC1, Ras-related C3 botulinum toxin substrate 1; RARα, Retinoic acid receptor alpha; SLiM, small linear motif

Fluorescent Protein Biophysical and Biochemical Properties

Currently, there are dozens of FP options available when designing an experiment. Which one to select? The answer can change every few months as improved FPs are reported in the literature, though newer does not always equal better. Whichever FPs one is considering, there are some key features fundamental for any FP experiment. Spectral and biochemical properties are important for FPs and these are usually provided either in the original paper describing the FP or a company's data sheet (see Box 1). With few exceptions, investigators need the brightest, most photostable, least phototoxic, and fastest folding FP to achieve robust fluorescent signals. For FP-fusion proteins, our lab primarily uses the FPs listed in the first part of Table 1 . Superfolder GFP is showing great promise for fusion protein constructs that appear comparatively dim, most likely because the fusion proteins are interfering with FP folding (see ref [14], especially Figure 4). That is, just as FPs can affect the functionality of a fusion protein (see following sections), a protein of interest can disrupt FP folding and affect the FP fluorescent signal.

Table 1

Popular FPs and Successful FP-fusion Protein Examples

    Protein                  Reference(s)       Comments
    TagBFP                   [37]               New bright photostable blue FP
    Cerulean                 [38]               Improved form of the cyan FP (CFP)
    Monomeric EGFP           [1, 15, 20, 21]    The original optimized green FP and best characterized FP for FP-fusion protein design
    Venus or Citrine         [39, 40]           Improved forms of yellow FP (YFP)
    mCherry or mKate2        [3, 22]            Popular red FPs
    PA-GFP and PA-mCherry    [41, 42]           Photoactivatable FPs

    FP-Fusion Proteins
    GRP94-GFP                [43]               Luminal ER chaperone
    GFP-NMDAR1               [44]               Ion channel
    GFP-Tub1                 [45]               Yeast α-tubulin
    GFP-clathrin             [46]               Secretory pathway coat protein

Fluorescent Protein Codon Optimization and Sequence Suitability

The majority of FPs have been developed from jellyfish and coral proteins. One major difference between these animals and mammals is the choice of amino acid codons. In 1996 Brian Seed and colleagues [15] improved the expression and fluorescent signal of GFP in mammalian cells by 40-120 fold by heavily modifying the GFP codon sequence to reflect mammalian cell preferences. Most commercially available plasmids have been codon optimized for mammalian cells. However, some of the older FP plasmids lurking in lab freezers may not have been optimized and one GFP may not be equivalent to another. Mammalian codon optimized FPs are not necessarily optimized for other model organisms such as Drosophila or yeast. Investigators working in nonmammalian systems should consider synthesizing codon-optimized variants of FPs (currently about $350 for the average FP) for their model organism.

Even if an FP is codon optimized, it may not necessarily be suitable for some cellular environments. For example, TagRFP and mKO contain multiple cysteines and consensus N-glycosylation sites (N-X-S/T, X is any amino acid except proline), which could modify the folding, size, and oligomerization of these FPs if targeted to the secretory pathway of eukaryotic cells [16]. Even EGFP and its variants contain two cysteines, which can lead to disulfide-bonded oligomers in the endoplasmic reticulum (ER) [17]. Under more extreme conditions, EGFP cannot correctly fold or fluoresce in the highly oxidizing environment of the periplasmic space of gram-negative bacteria [18]. In contrast, the cysteine-less mCherry readily folds in the same environment [19]. Therefore, always carefully examine FP amino acid sequences for potentially environmentally sensitive sequences.
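Checking an FP sequence for the features named above is easy to automate. The sketch below reports cysteine positions and N-X-S/T sequons with X not proline, as defined in the text; the function name `scan_fp_sequence` and the toy sequence are hypothetical, and a reported sequon is only a consensus match, not evidence that the site is actually glycosylated.

```python
import re

# N-X-S/T where X is any residue except Pro. The lookahead keeps the match
# one residue wide so overlapping sequons are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def scan_fp_sequence(seq):
    """Return 1-based cysteine positions and N-glycosylation consensus sites."""
    seq = seq.upper()
    return {
        "cysteine_positions": [i + 1 for i, aa in enumerate(seq) if aa == "C"],
        "n_glyc_sequons": [
            (m.start() + 1, seq[m.start():m.start() + 3])
            for m in SEQUON.finditer(seq)
        ],
    }


if __name__ == "__main__":
    # Toy sequence (not a real FP): NGS and NKT match, NPT is excluded.
    print(scan_fp_sequence("MNGSCANPTNKTC"))
```

Running the same scan on each candidate FP before cloning a secretory-pathway or periplasmic construct flags the environmentally sensitive residues early.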

Fluorescent Protein Oligomeric State

Many FPs have a tendency to oligomerize either as part of the inherent structure (e.g. DsRed is an obligate tetramer) or when present in high concentrations on membranes or in oligomeric proteins (e.g. EGFP). Therefore, it is important to determine whether an FP is monomeric and whether this matters for your experiment. While FP oligomerization has become more commonly reported, the propensity of an FP to oligomerize is often unknown, as oligomerization assays are not always robust or quantitative. Many papers describe an FP as monomeric without directly demonstrating monomerization or without reporting a Kd value. This point is not merely academic. It can be difficult or expensive (up to $500 per plasmid) to obtain an FP plasmid and you will probably not be in a great mood if that $500 FP oligomerizes with your FRET biosensor or integral membrane fusion protein. Currently, there are no accepted standards for how monomeric an FP needs to be for cell applications. Some researchers fuse new FPs to tubulin or actin to determine whether cytoskeletal structures correctly form. However, such assays missed the effects of EGFP dimerization under other physiologic conditions (see below). Therefore, investigators must confirm that an FP-tagged protein behaves similarly to untagged proteins in assays and environments relevant to the protein of interest.

FP oligomerization matters because FPs considered monomeric have been revealed as dimers at sufficiently high concentrations in cells. For example, EGFP forms dimers when fused to integral membrane proteins or incorporated into oligomeric proteins [20]. As a consequence, fusion FPs can form inappropriate interactions leading to false positive FRET signals [20] or distortion of cellular organelles [21]. For fusion proteins, the FP must be truly monomeric. Fortunately, EGFP and variants (CFP and YFP) can be monomerized with a single point mutation (A206K) [20, 21].

Fluorescent Protein Applications in Cells

Free FPs to mark cells

FPs can be expressed as free proteins either constitutively or under the control of a promoter of interest. There are few restrictions on the choice of FP for these experiments other than identifying a sufficiently bright FP. Tandem dimer FPs, e.g. tdKatushka2 and tdTomato, are excellent choices because they have two copies of an FP, making them exceptionally bright [22]. If FPs are being used as reporters of promoter activity, then chromophore formation time may be a consideration. Fast folder FPs, such as mCherry and Venus, will rapidly report promoter activity. Note that such FP reporters offer little insight into message stability and generally reflect both cumulative promoter activity and stability of the fluorescent protein, as fluorescent proteins typically have 24h half-lives [23]. To enhance FP turnover, several groups have attached proteasome degrons to FPs and achieved protein half-lives of 2h [23]. Alternatively, another class of FPs, fluorescent “timers”, change color with age and provide relative measures of the ratio of recently synthesized to old FPs [24, 25].
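To see why reporter half-life matters for reading out promoter activity, here is a minimal sketch that assumes simple first-order (exponential) decay of the FP pool after a promoter shuts off. The function name and the 6 h time point are illustrative assumptions; real FP turnover in cells need not follow a single exponential.

```python
def fraction_remaining(hours_elapsed: float, half_life_hours: float) -> float:
    """Fraction of an FP pool still present, assuming first-order decay."""
    return 0.5 ** (hours_elapsed / half_life_hours)


# Compare a stable FP (24 h half-life) with a degron-tagged FP (2 h half-life).
for half_life in (24.0, 2.0):
    left = fraction_remaining(6.0, half_life)
    print(f"half-life {half_life:4.0f} h: {left:.1%} of the FP remains after 6 h")
```

Under this assumption, about 84% of a 24 h half-life reporter is still fluorescent 6 h after synthesis stops, whereas only 12.5% of a 2 h half-life reporter remains, so the destabilized reporter tracks changes in promoter activity far more closely.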

Fluorescent Protein Fusion Proteins

Visualizing a protein's distribution and dynamics in a subcellular compartment has opened new opportunities in cell biology [26-28]. Correct design and characterization of FP fusion proteins are essential for interpretation of FP fusion protein studies.

Some investigators take short cuts and “clone by phone.” While it is tempting to rely on others to create FP fusion proteins of interest, there are important reasons for making your own constructs. One must be skeptical of any constructs received from other labs or companies. Not all constructs are made with consideration of protein targeting domains (see below). Also, many FPs aren't always correctly labeled. For example, a DsRed construct could be the monomeric or tetrameric form. Another example happened to me. I often perform photobleaching experiments to study the protein dynamics of GFP-fusion proteins and a fundamental requirement for these experiments is that the FP photobleaches irreversibly. Once, GFP fusion proteins from a collaborator produced unexpectedly rapid protein mobilities in cells. Sequencing revealed the GFP contained the three EGFP mutations and two additional mutations reported to enhance brightness. Control experiments revealed that this GFP, unlike standard EGFP, underwent nearly 80% reversible photobleaching (also termed photoswitching) (our unpublished results and see studies by [16, 29, 30]). Not all “EGFPs” are equal! Whenever obtaining an FP construct from another lab, politely request a plasmid map and a sequence file. If a sequence file is not available, sequence the FP construct yourself before performing any experiments. Don't work with mystery reagents! This anecdote also illustrates the importance of collecting stable baseline values for time resolved fluorescence experiments to help identify phenomena such as photoswitching. Finally, unusual photophysical properties of FPs aren't always problematic. They can be exploited to develop new imaging techniques. For example, photoswitching plays an important role in the super resolution technique of PALM ([31] and see the article by Jennifer Lippincott-Schwartz in this issue).

Why GFP hasn't made antibodies obsolete

Whenever an epitope tag (ANY epitope tag, EGFP, myc, His, HA, etc.) is added to a protein, the tag may modify protein function either by sterically blocking protein interactions with substrates or disrupting targeting sequences (see next section). Knowledge of your protein and engineering the epitope tag to avoid disrupting protein function or targeting can circumvent such issues.

An antibody against your native protein is a key reagent for any epitope-tagging experiments. An antibody can confirm that one's tagged protein: 1) localizes correctly by immunofluorescence, 2) is the correct size and expressed at levels similar to the untagged protein in an immunoblot, and 3) interacts with known substrates in a co-immunoprecipitation. Simply tagging a protein with an FP to avoid having to make an antibody will not address all of these important points. Any FP-fusion localization or related information must be independently verified with an antibody to confirm the FP hasn't disrupted protein behavior or localization.

Besides an antibody, fusion protein studies require the availability or development of a functional assay. The importance of a functional assay cannot be overstated. Even if your tagged protein localizes correctly in a cell, it is critical to confirm your tagged protein behaves as the native protein. The point of adding an FP to a protein is to monitor the localization and dynamics of the protein of interest in cells. A nonfunctional FP-tagged protein will be uninformative at best and most likely misleading. Some examples of FP-tagged proteins with demonstrated functionality are listed in the second part of Table 1.

Targeting Sequences and Where to insert the Fluorescent Protein

After selecting a bright monomeric FP, establishing a functional assay for your protein of interest, and obtaining a good antibody against your native protein, you can decide where to place the FP. Significant knowledge of the protein of interest is essential to successful FP fusion design and care should be taken to ensure that the FP fusion does not block the normal localization and functionality of the protein of interest. A critical factor in FP placement involves knowledge of the different types of protein motifs for targeting, retrieval, and retention, as well as the contextual and positional requirements of the motifs.

Many cellular proteins reside within organelles or subcompartments. Protein localization critically depends on information encoded within the protein's primary sequence [32]. Protein targeting sequences frequently depend on the context and position of the sequence within the protein. Many protein-targeting sequences must be at the extreme NH2 or COOH terminus of the protein (see Table 2). For example, most secretory proteins will not enter the ER unless the signal sequence is positioned at the NH2 terminus of the protein. Similarly, a resident ER protein requires that the ER retrieval motif (-KDEL or -KKXX) be at the absolute COOH terminus of the protein to interact with the retrieval machinery. Thus, for example, placement of an FP before the signal sequence or after the ER retrieval motif will disrupt the correct localization of the FP-fusion protein. The positional requirements of a protein of interest's localization sequences will determine what sites are appropriate for fusing an FP.

Table 2

Eukaryotic Protein targeting domains with positional requirements

Sequence Position | Localization | Notes
NH2 terminal domains
    Signal Sequence- | ER | Usually posttranslationally cleaved
    Presequence- | Mitochondria | Amphipathic helix that is usually posttranslationally cleaved
    Myristoylation Sequence | Cytoplasmic face of cellular membranes | Initiating methionine is cleaved
COOH terminal domains
    -KDEL | ER retrieval motif for luminal proteins | Domain must be in the ER lumen
    -KKXX (X is any amino acid) | ER retrieval motif for integral membrane proteins | Domain must be exposed in the cytoplasm
    -SKL | Peroxisome lumen |
    -GPI Anchor Sequence | Binds luminal and extracellular leaflets of cellular membranes | Fragment of COOH terminus of protein is cleaved for fusion with GPI
    -Tail Anchor | ER or mitochondrial membrane |
    -CAAX (X is any amino acid) palmitoylation
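As a rough illustration of how the positional rules in Table 2 constrain FP placement, the sketch below flags which terminus of a protein should not receive an FP. The function name, the regular expressions, and the toy sequence are all hypothetical and only cover three of the COOH terminal motifs in the table. NH2 terminal signals have no fixed consensus, so the sketch takes that information as a flag supplied by the user (for example from a signal sequence prediction tool) rather than trying to detect it.

```python
import re

# Simplified COOH terminal patterns taken from Table 2 (illustrative only).
C_TERMINAL_MOTIFS = {
    "KDEL (ER retrieval, luminal)": re.compile(r"KDEL$"),
    "KKXX (ER retrieval, membrane)": re.compile(r"KK..$"),
    "SKL (peroxisome)": re.compile(r"SKL$"),
}


def fp_sites_to_avoid(protein: str, has_n_terminal_signal: bool = False) -> dict[str, list[str]]:
    """Report, per terminus, the reasons an FP should not be fused there."""
    seq = protein.upper()
    avoid = {"N-terminus": [], "C-terminus": []}
    if has_n_terminal_signal:
        avoid["N-terminus"].append("N-terminal targeting sequence")
    for name, pattern in C_TERMINAL_MOTIFS.items():
        if pattern.search(seq):
            avoid["C-terminus"].append(name)
    return avoid


# Hypothetical toy sequence ending in KDEL, with a signal sequence assumed present.
print(fp_sites_to_avoid("MKTLLLAVAVSGEEKDEL", has_n_terminal_signal=True))
```

When both termini are flagged, as for a luminal ER protein with a signal sequence and a -KDEL motif, neither simple NH2 nor COOH terminal fusion is appropriate and the FP has to be placed internally, after the signal sequence and before the retrieval motif. A match from a pattern like this is only a prompt to look more closely, since a motif that is buried or in the wrong compartment may not be functional.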

6,000 genes) of the human genome encodes secretory proteins [33]. Another 1500 proteins localize in mitochondria, up to 8400 are in the nucleus, and 60 are in peroxisomes. Cytoplasmic proteins also can contain positionally dependent posttranslational modifications, such as myristoylation and palmitoylation. Together, at least one third of the genes in the human genome encode proteins with positionally dependent information. Thus, the tagging of each protein with an FP (or ANY epitope tag) requires a specific evaluation of appropriate and inappropriate positions for the FP relative to the protein of interest. The large number of proteins with targeting information suggests all potential fusion proteins should be analyzed for targeting sequences.

It is curious that numerous publications, often in top journals, employ one-size-fits-all FP tagging strategies for following the localization and behaviors of large arrays of proteins in cells. While it is clearly attractive to develop high throughput approaches to describe the latest “-ome,” a careful reading of the FP-tagging strategy may reveal serious issues with the approach and the associated data. Many protein targeting sequences have stringent position requirements. Placing an FP before or after a targeting sequence could mask the targeting sequence and disrupt correct targeting of a protein, which makes indiscriminate GFP-tagging of proteins a dubious practice (Box 2). For example, some studies have engineered an FP before the start or at the terminus of all open reading frames. The former approach will prevent most secretory proteins from entering the ER, mitochondrial proteins from translocating into mitochondria, and the addition of myristoyl groups. The latter approach will prevent retention of proteins in the ER and entry of proteins into peroxisomes. Thus, whole classes of proteins will be incorrectly targeted and incorrectly processed. The resulting data are of questionable value. Despite such concerns, some companies now offer thousands of cDNAs fused to EGFP at either the NH2 or COOH terminus. Inspection of a sample of secretory protein constructs, such as the luminal ER chaperone calreticulin, revealed that open reading frames with a signal sequence and a –KDEL retrieval motif had both EGFP fusion options, neither of which would be physiologically functional. Hardly worth $800! If you are interested in obtaining a pre-constructed FP fusion protein plasmid, one excellent resource is Published FP fusion constructs are available in a searchable database, have been well annotated, and are available for a modest fee of $65 per plasmid.

I do not wish to give the impression that every protein is a “mine field” of critical targeting domains. Rather, most positionally dependent targeting domains are found predominantly at the NH2 and COOH ends of the protein. This simplifies analysis and makes generation of FP fusion proteins relatively easy. Bearing in mind the importance of FP position, numerous studies have successfully created FP-tagged proteins with the functionality of the wild type untagged protein (Table 1). While FP fusion protein design (Box 2) requires significant knowledge of the protein of interest, targeting sequences are not always apparent in the primary sequence of the protein. Note that many of the sequences in Table 2 are not defined as absolute consensus sequences. This is because many targeting sequences have biochemically-defined properties, but lack a common primary sequence. For example, every secretory protein in the human genome has its own unique signal sequence that ranges in size from 14-70 amino acids [34]. Web-based resources including GenBank, ExPASy, and SignalP 3.0 can assist in identifying signal sequences, for example. Given these complexities, FP-tagging is not a recommended approach for characterizing novel or poorly studied proteins.

As most FP plasmids are in the form of the Clontech N vector, there is an additional consideration for FP fusion design. The N construct contains a strong mammalian Kozak sequence and an initiating methionine for the FP. The design is great for expressing an FP by itself, but can be suboptimal for fusion proteins, as the FP potentially could be translated independent of the attached fusion protein sequence, possibly due to leaky ribosomal scanning (our unpublished data and [35]). To reduce the potential for such phenomena, PCR amplify the FP sequence without a methionine or Kozak sequence and fuse it in frame with the cDNA for the protein of interest. Once constructed, confirm the FP fusion protein sequence, functionality, localization relative to the untagged parent protein by immunofluorescence, and fusion protein size with an immunoblot. Now, you are ready to unlock the full potential of FP fusion proteins in living cells or even whole organisms.
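The advice above about removing the FP's own initiating methionine can be sketched at the sequence level as follows. This is a simplified illustration, assuming a COOH terminal FP fusion with no linker. The function names and the toy coding sequences are hypothetical, and real primer design also has to handle restriction sites, linkers, and the upstream Kozak context.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def strip_start_codon(fp_cds: str) -> str:
    """Drop the FP's initiating ATG so it cannot serve as an internal start site."""
    cds = fp_cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("FP coding sequence length is not a multiple of three")
    return cds[3:] if cds.startswith("ATG") else cds


def c_terminal_fp_fusion(protein_cds: str, fp_cds: str) -> str:
    """Join the protein of interest (stop codon removed) in frame to an ATG-less FP."""
    gene = protein_cds.upper()
    if len(gene) % 3 != 0:
        raise ValueError("protein coding sequence length is not a multiple of three")
    if gene[-3:] in STOP_CODONS:
        gene = gene[:-3]
    return gene + strip_start_codon(fp_cds)


# Toy sequences only: a 3-codon "protein" and the first 4 codons of an FP plus a stop.
fusion = c_terminal_fp_fusion("ATGGCTAAATAA", "ATGGTGAGCAAGTAA")
print(fusion, len(fusion) % 3 == 0)
```

Checking that both inputs are whole numbers of codons is a cheap guard against frame shifts, but it does not replace sequencing the finished construct.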


Specificity determinants of the LacI family

We have chosen the LacI family for our analysis because (1) it is one of the largest families of bacterial transcription factors, (2) the availability of complete bacterial genomes has allowed us to resolve orthology by positional analysis (see Methods), and (3) available experimental [31, 32, 33] and structural [34, 35] information can be used to verify our predictions.

Figure 1 presents the mutual information I_i, the expected mutual information I_exp, and the probability P(I) computed for the LacI family using Model1. Model2 produces very similar results (see Supplementary Information). This plot reveals several important features. First, it shows a high correlation, ρ = 0.97, between I_i and I_exp. This very good agreement demonstrates that the statistical model used to compute I_exp explains ρ² = 94% of the variation in mutual information and reproduces the naturally higher mutual information that results from the high intra-family similarity of orthologs. Second, the vast majority of amino acids in the LacI family exhibit only a weak association with specificity, as indicated by P(I) ≈ 1. Third, very few positions have both low P(I_i) and high I_i (shown by arrows in Fig. 1). Amino acids in these positions have a strong association with functional grouping (stronger than sequences on average), indicating the role of these positions in determining the different specificities of different groups of orthologs.
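For readers who want to see the quantity being plotted, the sketch below computes the mutual information between one alignment column and the specificity grouping of the sequences, using the standard definition I = Σ f(x,y) log[f(x,y) / (f(x) f(y))]. The base-2 logarithm, the function names, and the toy data are assumptions. The significance estimate here is a plain label-shuffling permutation test, swapped in for illustration. It is not the paper's Model1 or Model2, which are designed to account for the similarity among orthologs.

```python
import math
import random
from collections import Counter


def mutual_information(column, groups):
    """Mutual information (bits) between residues in a column and group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (c / n) * math.log2(c * n / (residue_counts[x] * group_counts[y]))
        for (x, y), c in joint.items()
    )


def shuffle_p_value(column, groups, trials=1000, seed=0):
    """Fraction of label shufflings giving mutual information >= the observed value."""
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    labels = list(groups)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)
        if mutual_information(column, labels) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


# Toy data: eight sequences in two specificity groups.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(shuffle_p_value("TTTTAAAA", groups))  # residue tracks the grouping
print(shuffle_p_value("GGGGGGGG", groups))  # fully conserved, uninformative
```

A column that is conserved within each group but differs between groups scores high, while a column conserved across the whole family scores zero, which is the behavior expected of a specificity determinant as opposed to a residue needed by every family member.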

Figure 1: Observed I (blue) and mean expected I_exp (thick red) mutual information in the DNA-binding (A) and ligand-binding (B) domains of the LacI family. Thin red lines show I_exp ± 2σ(I_exp). P(I) is the statistical significance of the mutual information. Filled circles indicate residues with I > 1.0. Positions with filled circles and low P(I) are predicted specificity determinants. The numbers along the sequence follow the 1wet PDB structure.

Table 1 presents the predicted specificity-determining amino acids. Importantly, although the methods used to estimate statistical significance are very different, the sets of residues they identify are very similar. The specificity determinants are 15, 16, 50 and 55 in the first domain and 98, 114, 122, 146, 147, 160, 221 and 249 in the second domain (here and below the numbering follows PurR, PDB code 1wet).

Table 2 of the Supplementary Information shows the pattern of conservation of the predicted specificity determinants. As expected, most of these residues are conserved within orthologous groups and differ between groups. Importantly, there are some exceptions to this rule at all specificity-determining positions (see Discussion).

To better understand the role of the specificity-determining residues, we map them onto the structures of the PurR-DNA and LacI-DNA complexes. Figure 2 presents the structure of the PurR-DNA complex with specificity-determining residues shown as space-filling atomic models with atoms of van der Waals radii. Clearly, these residues form two clusters in the structure: one around the DNA and another around the ligand. This result comes as no surprise, since proteins of the LacI family act as transcription repressors (or activators) in the presence or absence of particular small molecules (sugars, nucleotides, etc.). Hence, paralogous proteins differ in the specificity of both DNA and small molecule (ligand) recognition. The two identified spatial clusters presumably determine this specificity.

Figure 2: Structure of PurR bound to the DNA. The two chains of the dimer are shown semi-transparent in light green and pink. Predicted specificity determinants are shown as space-filling models, colored red in the pink chain and green in the light green chain. The ligand and the DNA are shown in blue. Notice the deep penetration of some specificity-determining residues into the DNA and the formation of the ligand-binding pocket by most of the others.

Examination of the structure brings us to the following conclusions. (1) The first four specificity-determining residues in PurR, THR15, THR16, VAL50 and LYS55 (TYR17, GLN18, VAL52 and ALA57 in LacI), are located in the DNA-binding domain. Three of them (15, 16 and 55 in PurR; 17, 18 and 57 in LacI) are deeply buried in the DNA grooves, forming a dense network of interactions with the bases (see Fig. 3C, 3D). VAL50 (VAL52 in LacI) forms a hydrophobic contact with its counterpart on the other chain. (2) Six of the remaining eight specificity-determining residues, MET122, ASP146, TRP147, ASP160, PHE221 and ILE249 (ASN125, ASP149, VAL150, PHE161, TRP220 and GLN248 in LacI), are located in the ligand-binding pocket. Five of them are within 8 Å of the ligand in PurR (MET122, ASP146, ASP160, PHE221, ILE249) and within 5 Å in LacI (ASN125, ASP149, PHE161, TRP220, GLN248) (see Fig. 3A, 3B). The observed clustering of the identified amino acids around the ligand is striking, since the structure of the protein was not used in our analysis.

Figure 3: Detailed picture of the ligand-binding pockets (A, B) and protein-DNA interface (C, D) in PurR (left) and LacI (right). Predicted specificity determinants are shown in space-fill.
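The distance criterion used above (residues within a given number of Å of the ligand) can be sketched as follows. The function name is hypothetical, and the coordinates are made up for illustration and are not taken from the 1wet structure. A residue counts as near the ligand if any of its atoms lies within the cutoff of any ligand atom, which is one common convention but an assumption here.

```python
import math


def residues_near_ligand(residue_atoms, ligand_atoms, cutoff):
    """Return residues with at least one atom within `cutoff` Å of any ligand atom."""
    near = []
    for name, atoms in residue_atoms.items():
        closest = min(math.dist(a, b) for a in atoms for b in ligand_atoms)
        if closest <= cutoff:
            near.append(name)
    return near


# Made-up coordinates (Å); a real analysis would read them from a PDB file.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residues = {
    "PHE221": [(3.0, 4.0, 0.0), (9.0, 9.0, 9.0)],
    "TRP98": [(20.0, 0.0, 0.0)],
}
print(residues_near_ligand(residues, ligand, cutoff=8.0))
```

The cutoff matters for interpretation: a tight cutoff such as 5 Å approximates direct contact, while 8 Å also captures residues that line the pocket without touching the ligand.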

This structural location indicates that the identified residues are indeed involved in specific recognition. While the DNA-binding residues determine the motifs recognized on the DNA, the residues located close to the ligand determine the ligand-binding specificity of the protein. Since different orthologs have different ligands, these residues change from sub-family to sub-family, but stay the same within most sub-families. PHE221 in PurR and the corresponding TRP220 in LacI are of special interest, as their aromatic rings directly interact with aromatic ligands. Two other residues (TRP98 and LYS114 in PurR; ARG101 and GLN117 in LacI) do not belong to either of the clusters, as they are located far from the DNA and the ligand. They are either "false positives" or have some special role in allosteric regulation [36]. Indeed, VAL50, TRP98 and LYS114 of one chain interact tightly with the other chain; specifically, VAL50 interacts with LYS114 of the other chain. These residues can be important for correct dimerization and hence exhibit the sought covariation with functional grouping. In summary, the structural location of the identified residues supports the view that they serve as specificity determinants in proteins of the LacI family. This includes both the specificity of DNA recognition and the ligand-binding specificity.

What is a Motif in Protein Structure

A motif is a super-secondary structure of a protein. Generally, the first 3D structure to form in a protein is the secondary structure, which can be either an alpha-helix or a beta-sheet. This secondary structure forms to neutralize the natural polarity of different amino acids in the primary structure, which is the sequence of amino acids. Typically, this neutralization occurs through the formation of hydrogen bonds. These secondary structures then combine with each other, through small loops, to form motifs.

Figure 1: Zinc Finger Motif

Furthermore, the motifs of a particular protein family sometimes perform a similar function. For example, the zinc finger motif performs a DNA-binding function. Some other examples of motifs in protein structure are the beta-hairpin motif, Greek key motif, omega loop motif, helix-loop-helix motif, helix-turn-helix motif, nest motif, niche motif, etc.

Embracing proteins: structural themes in aptamer–protein complexes

Traditional aptamers fold into structures with established nucleic acid motifs.

Modified nucleotides expand the repertoire of known nucleic acid structural motifs.

Hydrophobic modifications make essential inter-molecular and intramolecular interactions.

The flexible phosphodiester bond of aptamers allows for high shape complementarity.

In most cases, aptamer binding occurs without a conformational change of the target.

Understanding the structural rules that govern specific, high-affinity binding characteristic of aptamer–protein interactions is important in view of the increasing use of aptamers across many applications. From the modest number of 16 aptamer–protein structures currently available, trends are emerging. The flexible phosphodiester backbone allows folding into precise three-dimensional structures using known nucleic acid motifs as scaffolds that orient specific functional groups for target recognition. Still, completely novel motifs essential for structure and function are found in modified aptamers with diversity-enhancing side chains. Aptamers and antibodies, two classes of macromolecules used as affinity reagents with entirely different backbones and composition, recognize protein epitopes of similar size and with comparably high shape complementarity.

