I want to compare the amino acid sequences of enzymes at their catalytic sites for a project I'm working on. I went to the Catalytic Site Atlas (CSA) to get the catalytic site information, but since it doesn't offer an easy way to download the structure data programmatically, I downloaded the FASTA sequences from the RCSB PDB instead. When I checked the catalytic sites, they didn't match what the CSA was telling me, and that's when I realized that the two sources give different sequences. Take 3nos, for example. The CSA presents the following sequence:
While the PDB presents the following sequence:
Why aren't they the same sequence if it's the same protein?
Sorry if it's a noob question, I'm not a biologist, just a computer scientist who happens to like bioinformatics.
The CSA data comes from here while the PDB data comes from here
Crystallography results (PDB files) almost always contain a truncated sequence.
Both ends of a protein are often flexible (even in a crystal) and don't result in enough data for a good fit. The corresponding residues are removed from the model and the sequence, and you're left with only the residues that show a defined electron density.
One sequence is partly contained in the other (highlighted).
So the CSA sequence is (FASTA format, truncated):
>sp|P29474|NOS3_HUMAN Nitric oxide synthase, endothelial OS=Homo sapiens GN=NOS3 PE=1 SV=3
MGNLKSVAQEPGPPCGLGLGLGLGLCGKQGPATPAPEPSRAPASLLPPAPEHSPPSSPLT QPPEGPKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSPGPPAP EQLLSQARDFINQYYSSIKRSGSQAHEQRLQEVEAEVAATGTYQLRESELVFGAKQAWRN…
taken from http://www.uniprot.org/uniprot/P29474 for convenience.
While the PDB one is:
>3NOS:A|PDBID|CHAIN|SEQUENCE
PKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSPGPPAPEQLLSQARDFINQYYSSIKRSGSQA
HEQRLQEVEAEVAATGTYQLRESELVFGAKQAWRNAPRCVGRIQWGKLQVFDARDCRSAQEMFTYICNHIKYATNRGNLR
SAITVFPQRCPGRGDFRIWNSQLVRYAGYRQQDGSVRGDPANVEITELCIQHGWTPGNGRFDVLPLLLQAPDEPPELFLL…
The UniProt entry mentions 3 different isoforms due to alternative splicing, so perhaps that is what is going on here. Here is the output from a sequence alignment (using https://www.ebi.ac.uk/Tools/psa/emboss_matcher/):
#=======================================
#
# Aligned_sequences: 2
# 1: NOS3_HUMAN
# 2: SEQUENCE
# Matrix: EBLOSUM62
# Gap_penalty: 14
# Extend_penalty: 4
#
# Length: 240
# Identity:     240/240 (100.0%)
# Similarity:   240/240 (100.0%)
# Gaps:           0/240 ( 0.0%)
# Score: 1294
#
#
#=======================================

NOS3_HUMAN    66 PKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSP    115
                 ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE       1 PKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSP     50

NOS3_HUMAN   116 GPPAPEQLLSQARDFINQYYSSIKRSGSQAHEQRLQEVEAEVAATGTYQL    165
                 ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE      51 GPPAPEQLLSQARDFINQYYSSIKRSGSQAHEQRLQEVEAEVAATGTYQL    100

NOS3_HUMAN   166 RESELVFGAKQAWRNAPRCVGRIQWGKLQVFDARDCRSAQEMFTYICNHI    215
                 ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE     101 RESELVFGAKQAWRNAPRCVGRIQWGKLQVFDARDCRSAQEMFTYICNHI    150

NOS3_HUMAN   216 KYATNRGNLRSAITVFPQRCPGRGDFRIWNSQLVRYAGYRQQDGSVRGDP    265
                 ||||||||||||||||||||||||||||||||||||||||||||||||||
SEQUENCE     151 KYATNRGNLRSAITVFPQRCPGRGDFRIWNSQLVRYAGYRQQDGSVRGDP    200

NOS3_HUMAN   266 ANVEITELCIQHGWTPGNGRFDVLPLLLQAPDEPPELFLL    305
                 ||||||||||||||||||||||||||||||||||||||||
SEQUENCE     201 ANVEITELCIQHGWTPGNGRFDVLPLLLQAPDEPPELFLL    240
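Since the PDB sequence aligns to the UniProt one with 100% identity and no gaps, you can also find the offset with a plain substring search, without running an aligner. Here is a minimal Python sketch using only the fragments quoted above (both are truncated here, so this is an illustration rather than a full check). The function name find_offset is just something I made up for the example.

```python
# Fragment of the UniProt sequence P29474, copied from the FASTA quoted above
# (truncated there, so this is not the full protein).
uniprot_fragment = (
    "MGNLKSVAQEPGPPCGLGLGLGLGLCGKQGPATPAPEPSRAPASLLPPAPEHSPPSSPLT"
    "QPPEGPKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSPGPPAP"
    "EQLLSQARDFINQYYSSIKRSGSQAHEQRLQEVEAEVAATGTYQLRESELVFGAKQAWRN"
)

# First 50 residues of the 3NOS chain A sequence from the PDB FASTA.
pdb_start = "PKFPRVKNWEVGSITYDTLSAQAQQDGPCTPRRCLGSLVFPRKLQGRPSP"


def find_offset(full_seq, sub_seq):
    """Return the 1-based position in full_seq where sub_seq starts.

    Returns None if sub_seq is not an exact substring. This only works for
    gapless, mismatch-free cases; anything else needs a real alignment.
    """
    idx = full_seq.find(sub_seq)
    return None if idx < 0 else idx + 1


offset = find_offset(uniprot_fragment, pdb_start)
print(offset)  # 66, matching the start position in the alignment above
```

If the search returns None (mutations, expression tags, missing loops), fall back to a proper pairwise alignment like the one above.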
This answer is correct; I just wanted to add that the correct sequence numbering is preserved in the PDB file in the DBREF record (which you can see by opening the PDB file in a text editor):
DBREF 3NOS A 66 492 UNP P29474 NOS3_HUMAN 66 492
In plain English, the sequence presented in this file (chain A) corresponds to residues 66 to 492 of the associated UniProt (UNP) entry (accession: P29474).
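Since you want to do this programmatically, here is a rough Python sketch of reading that DBREF record and using it to convert a UniProt residue number into an index in the PDB FASTA sequence. A few assumptions to be aware of: the names DbRef, parse_dbref and uniprot_to_fasta_index are made up for this example; the record is split on whitespace, which works for the line as shown but is not strictly safe, because real PDB files are fixed-column and insertion codes can sit directly next to the numbers; and the index conversion assumes the FASTA sequence starts exactly at the first DBREF residue with no gaps or extra tags, which holds for 3NOS according to the alignment above but not for every entry.

```python
from typing import NamedTuple


class DbRef(NamedTuple):
    """Fields of a PDB DBREF record (insertion codes ignored)."""
    pdb_id: str
    chain: str
    seq_begin: int   # first residue number in the PDB chain
    seq_end: int     # last residue number in the PDB chain
    database: str    # e.g. "UNP" for UniProt
    accession: str
    db_id: str
    db_begin: int    # first matching residue in the database sequence
    db_end: int      # last matching residue in the database sequence


def parse_dbref(line):
    """Parse a DBREF line by whitespace (assumes no insertion codes)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "DBREF":
        raise ValueError("not a simple DBREF record: %r" % line)
    return DbRef(
        fields[1], fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], int(fields[8]), int(fields[9]),
    )


def uniprot_to_fasta_index(ref, uniprot_pos):
    """Map a 1-based UniProt position to a 0-based index in the chain FASTA.

    Assumes the FASTA sequence starts at ref.db_begin and has no gaps.
    """
    if not ref.db_begin <= uniprot_pos <= ref.db_end:
        raise ValueError("position %d is outside the crystallized range" % uniprot_pos)
    return uniprot_pos - ref.db_begin


ref = parse_dbref("DBREF 3NOS A 66 492 UNP P29474 NOS3_HUMAN 66 492")
print(ref.accession, ref.db_begin, ref.db_end)
print(uniprot_to_fasta_index(ref, 66))   # first residue of the PDB sequence
print(uniprot_to_fasta_index(ref, 305))  # last residue shown in the alignment
```

With this, a catalytic residue given in UniProt numbering can be looked up in the PDB FASTA string as seq[uniprot_to_fasta_index(ref, pos)], as long as the assumptions above hold for the entry you are working with.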