Regarding the NCBI FTP site


I am trying to extract SNP data from the NCBI FTP server. Could someone please explain how the directory is organised? There are a great many files and folders, and I can't figure out which contains what. Is there a single file that explains the organisation of the FTP site properly? The NCBI help manual was not much help beyond helping me find the species of interest.

Getting SNP data from the NCBI SNP FTP site:

It's actually simple to download the data from NCBI if you follow the method given in the FAQ (as pointed out by @WYSIWYG).

Step 1:

Go to the organism FTP directory:

Step 2:

Open the folder for your organism:

From here you can download any file you want. If you are studying the whole organism, download one of the complete folders (ASN.1, rs_fasta, or XML; XML is recommended).


The hard part is not the downloading; it is understanding the files that are provided.

File types:

  • XML files
  • ASN.1 (Abstract Syntax Notation One)
  • rs_fasta

All of them contain almost the same data, just in different formats. It is up to your script (preferably Python or Perl) to extract the important information.
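Since rs_fasta files are FASTA-formatted text, a script can start from a generic FASTA reader. The following is a minimal sketch rather than anything dbSNP-specific: the function name `read_fasta` is hypothetical, and it makes no assumption about what the header lines contain.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Any iterable of text lines works as input, so an open file handle can be passed directly (for a compressed file, one opened with `gzip.open(path, "rt")`).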

Another way to download from the NCBI FTP site:

You can always use FileZilla:

  • Set Host to:
  • Leave Username, Password, etc. empty
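The same anonymous login can be scripted with Python's standard `ftplib` module. This is a minimal sketch: the helper name `list_remote_dir` and its `ftp_factory` parameter are hypothetical, and the host and directory are left as arguments because the original answer does not preserve them.

```python
from ftplib import FTP


def list_remote_dir(host, path, ftp_factory=FTP):
    """Log in anonymously and return the names found under `path`."""
    ftp = ftp_factory(host)  # connects to the host on construction
    try:
        ftp.login()  # no arguments means an anonymous login
        return ftp.nlst(path)
    finally:
        ftp.quit()
```

The `ftp_factory` argument exists only so the function can be exercised without a network connection; in normal use it can be left at its default.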

Clearing Up Confusion with Human Gene Symbols & Names Using NCBI Gene Data

During the research and publishing process, scientists need to refer to their genes-of-interest. However, different labs sometimes use different gene symbols to refer to the same gene. As you can imagine, this leads to confusion.

To standardize the use of terms, the HUGO Gene Nomenclature Committee (HGNC) sets official gene symbols and names. The NCBI Gene resource reports these official gene symbols and names, as well as additional symbols and names that are included on related sequence records for the same gene or from submitted GeneRIFs.

RefSeq curators also store alternate symbols and names as they review the literature. These are reported in Gene as:

There is a tab-separated file on the NCBI FTP site that contains all of this information for human genes.

All of the information regarding gene symbols and names is stored on the NCBI FTP site in files called “gene_info_”, which are updated daily. You can download a summary file with data for all organisms in the Gene database, or a file with data for a particular organism, such as human, for example:

Selected Columns from the table:

Column Number   Description of data in the column
2               GeneID: the unique identifier for a gene
3               *NCBI Symbol: the default symbol for the gene at NCBI
11              Official Symbol for this gene designated by the nomenclature authority (HGNC)
5               Symbol Synonyms: bar-delimited set of unofficial symbols for the gene
9               *NCBI Named Description: the default name for this gene at NCBI
12              Official Name for this gene designated by the nomenclature authority (HGNC)
14              Other Names & Designations: pipe-delimited set of some alternate descriptions that have been assigned to a GeneID. ‘-‘ indicates none is being reported

*The NCBI default symbol and names displayed for humans are based on official HGNC designations. If there isn’t an officially-designated HGNC symbol or name, then our RefSeq curators create the NCBI designated defaults based on information in the scientific literature and metadata provided by submitters of sequence information.

Symbol Synonyms: BRCAI | BRCC1 | BROVCA1 | FANCS | IRIS | PNCA4 | PPP1R53 | PSCP | RNF53

NCBI Name: BRCA1, DNA repair associated

Official Name: BRCA1, DNA repair associated

Other Names: breast cancer type 1 susceptibility protein | BRCA1/BRCA2-containing complex, subunit 1 | Fanconi anemia, complementation group S | RING finger protein 53 | breast and ovarian cancer susceptibility protein 1 | breast cancer 1, early onset | early onset breast cancer 1 | protein phosphatase 1, regulatory subunit 53 | truncated breast cancer 1

More information about the contents and structure of this file is in the GENE_INFO README file. These files are gzip-compressed (.gz) and can be uncompressed with applications such as WinZip or gunzip. When uncompressed, they are tab-delimited tables. Organism-specific files, such as the human one mentioned above, can be imported into and managed by a spreadsheet application such as Excel.
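For scripted access, the columns listed in the table can be pulled out with Python's standard `gzip` and `csv` modules. The sketch below is illustrative only: the function and field names are hypothetical, the column positions are the 1-based numbers from the table above, and it assumes that comment or header lines begin with `#`.

```python
import csv
import gzip

# 1-based column numbers taken from the table above.
COLUMNS = {
    "gene_id": 2,
    "ncbi_symbol": 3,
    "synonyms": 5,
    "ncbi_name": 9,
    "official_symbol": 11,
    "official_name": 12,
    "other_names": 14,
}
LIST_FIELDS = ("synonyms", "other_names")  # pipe-delimited columns


def parse_gene_info(lines):
    """Yield one dict per data row of a tab-delimited gene_info table."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, '#' header lines (an assumption), and short rows.
        if not row or row[0].startswith("#") or len(row) < 14:
            continue
        record = {name: row[col - 1] for name, col in COLUMNS.items()}
        for name in LIST_FIELDS:
            value = record[name]
            # '-' means no value is reported for this column.
            record[name] = [] if value == "-" else [v.strip() for v in value.split("|")]
        yield record


def read_gene_info(path):
    """Parse a gzip-compressed gene_info file without unpacking it first."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_gene_info(handle)
```

Reading the compressed file directly avoids the unpacking step and scales better than a spreadsheet for the all-organism file.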

Structure viewer iCn3D version 3 featuring analysis of 3D structures!

The NCBI structure viewer iCn3D version 3 is now available on the NCBI web site and from GitHub.

Analysis of 3D Structures

You can use the current version with the icn3d package at npm to write scripts that call functions in iCn3D. For example, this script on GitHub can calculate the change in interactions due to a mutation. The results of this analysis for the structure (6M0J) of the SARS-CoV-2 spike protein bound to the ACE2 receptor are displayed in Figure 1. These show the predicted changes in interactions with other residues in the SARS-CoV-2 spike protein and in the ACE2 receptor when the asparagine (N) at position 501 of the spike protein is changed to a tyrosine (Y). You can also run these scripts from the command line to process a list of 3D structures to get and analyze annotations.

Figure 1. iCn3D viewer showing the predicted interactions with other residues in the spike protein and in the ACE2 target when the asparagine (N) at position 501 of the SARS-CoV-2 spike protein is substituted with tyrosine (Y), highlighted in yellow. Interactions were calculated using the script interactions2.js.

Qualitative Study

Qualitative research is a type of research that explores and provides deeper insights into real-world problems. Instead of collecting numerical data points or intervening or introducing treatments, as in quantitative research, qualitative research helps generate hypotheses and further investigate and understand quantitative data. Qualitative research gathers participants' experiences, perceptions, and behavior. It answers the hows and whys instead of how many or how much. It could be structured as a stand-alone study relying purely on qualitative data, or it could be part of mixed-methods research that combines qualitative and quantitative data. This review introduces the readers to some basic concepts, definitions, terminology, and applications of qualitative research.

At its core, qualitative research asks open-ended questions whose answers are not easily put into numbers, such as ‘how’ and ‘why’. Due to the open-ended nature of the research questions at hand, qualitative research design is often not linear in the same way quantitative design is. One of the strengths of qualitative research is its ability to explain processes and patterns of human behavior that can be difficult to quantify. Phenomena such as experiences, attitudes, and behaviors can be difficult to accurately capture quantitatively, whereas a qualitative approach allows participants themselves to explain how, why, or what they were thinking, feeling, and experiencing at a certain time or during an event of interest. Quantifying qualitative data certainly is possible, but qualitative analysis looks for themes and patterns that can be difficult to quantify, and it is important to ensure that the context and narrative of qualitative work are not lost by trying to quantify something that is not meant to be quantified.

However, while qualitative research is sometimes placed in opposition to quantitative research, as though the two approaches and their associated philosophical paradigms necessarily ‘compete’ against each other, qualitative and quantitative work are neither opposites nor incompatible. The two approaches are different, but they are certainly not mutually exclusive. For instance, qualitative research can help expand and deepen understanding of data or results obtained from quantitative analysis. For example, say a quantitative analysis has determined that there is a correlation between length of stay and level of patient satisfaction, but why does this correlation exist? A qualitative follow-up can explore that question. This dual-focus scenario shows one way in which qualitative and quantitative research could be integrated.

Examples of Qualitative Research Approaches

Ethnography as a research design has its origins in social and cultural anthropology, and involves the researcher being directly immersed in the participant’s environment. Through this immersion, the ethnographer can use a variety of data collection techniques with the aim of being able to produce a comprehensive account of the social phenomena that occurred during the research period. That is to say, the researcher’s aim with ethnography is to immerse themselves into the research population and come out of it with accounts of actions, behaviors, events, etc. through the eyes of someone involved in the population. Direct involvement of the researcher with the target population is one benefit of ethnographic research because it can then be possible to find data that is otherwise very difficult to extract and record.

Grounded Theory is the “generation of a theoretical model through the experience of observing a study population and developing a comparative analysis of their speech and behavior.” As opposed to quantitative research which is deductive and tests or verifies an existing theory, grounded theory research is inductive and therefore lends itself to research that is aiming to study social interactions or experiences. In essence, Grounded Theory’s goal is to explain for example how and why an event occurs or how and why people might behave a certain way. Through observing the population, a researcher using the Grounded Theory approach can then develop a theory to explain the phenomena of interest.

Phenomenology is defined as the “study of the meaning of phenomena or the study of the particular”. At first glance, it might seem that Grounded Theory and Phenomenology are quite similar, but upon careful examination, the differences can be seen. At its core, phenomenology looks to investigate experiences from the perspective of the individual. Phenomenology is essentially looking into the ‘lived experiences’ of the participants and aims to examine how and why participants behaved a certain way, from their perspective. Herein lies one of the main differences between Grounded Theory and Phenomenology. Grounded Theory aims to develop a theory for social phenomena through an examination of various data sources, whereas Phenomenology focuses on describing and explaining an event or phenomenon from the perspective of those who have experienced it.

One of qualitative research’s strengths lies in its ability to tell a story, often from the perspective of those directly involved in it. Reporting on qualitative research involves including details and descriptions of the setting involved and quotes from participants. This detail is called ‘thick’ or ‘rich’ description and is a strength of qualitative research. Narrative research is rife with the possibilities of ‘thick’ description as this approach weaves together a sequence of events, usually from just one or two individuals, in the hopes of creating a cohesive story, or narrative. While it might seem like a waste of time to focus on such a specific, individual level, understanding one or two people’s narratives for an event or phenomenon can help to inform researchers about the influences that helped shape that narrative. The tension or conflict of differing narratives can be “opportunities for innovation”.

Research paradigms are the assumptions, norms, and standards that underpin different approaches to research. Essentially, research paradigms are the ‘worldviews’ that inform research. It is valuable for researchers, both qualitative and quantitative, to understand what paradigm they are working within because understanding the theoretical basis of research paradigms allows researchers to understand the strengths and weaknesses of the approach being used and adjust accordingly. Different paradigms have different ontologies and epistemologies. Ontology is defined as the “assumptions about the nature of reality” whereas epistemology is defined as the “assumptions about the nature of knowledge” that inform the work researchers do. It is important to understand the ontological and epistemological foundations of the research paradigm researchers are working within to allow for a full understanding of the approach being used and the assumptions that underpin the approach as a whole. Further, it is crucial that researchers understand their own ontological and epistemological assumptions about the world in general because their assumptions about the world will necessarily impact how they interact with research. A discussion of the research paradigm is not complete without describing positivist, postpositivist, and constructivist philosophies.

Positivist vs Postpositivist

To further understand qualitative research, we need to discuss positivist and postpositivist frameworks. Positivism is a philosophy that the scientific method can and should be applied to social as well as natural sciences. Essentially, positivist thinking insists that the social sciences should use natural science methods in their research. This stems from the positivist ontology that there is an objective reality that exists fully independent of our perception of the world as individuals. Quantitative research is rooted in positivist philosophy, which can be seen in the value it places on concepts such as causality, generalizability, and replicability.

Conversely, postpositivists argue that social reality can never be one hundred percent explained but it could be approximated. Indeed, qualitative researchers have been insisting that there are “fundamental limits to the extent to which the methods and procedures of the natural sciences could be applied to the social world” and therefore postpositivist philosophy is often associated with qualitative research. An example of positivist versus postpositivist values in research might be that positivist philosophies value hypothesis-testing, whereas postpositivist philosophies value the ability to formulate a substantive theory.

Constructivism is a subcategory of postpositivism. Most researchers invested in postpositivist research are constructivist as well, meaning they think there is no objective external reality that exists but rather that reality is constructed. Constructivism is a theoretical lens that emphasizes the dynamic nature of our world. “Constructivism contends that individuals’ views are directly influenced by their experiences, and it is these individual experiences and views that shape their perspective of reality”. Essentially, constructivist thought focuses on how ‘reality’ is not a fixed certainty and how experiences, interactions, and backgrounds give people a unique view of the world. Constructivism contends, unlike positivist views, that there is not necessarily an ‘objective’ reality we all experience. This is the ‘relativist’ ontological view that reality and the world we live in are dynamic and socially constructed. Therefore, qualitative scientific knowledge can be inductive as well as deductive.

So why is it important to understand the differences in assumptions that different philosophies and approaches to research have? Fundamentally, the assumptions underpinning the research tools a researcher selects provide an overall base for the assumptions the rest of the research will have and can even change the role of the researcher themselves. For example, is the researcher an ‘objective’ observer such as in positivist quantitative work? Or is the researcher an active participant in the research itself, as in postpositivist qualitative work? Understanding the philosophical base of the research undertaken allows researchers to fully understand the implications of their work and their role within the research, as well as reflect on their own positionality and bias as it pertains to the research they are conducting.

The better the sample represents the intended study population, the more likely the researcher is to encompass the varying factors at play. The following are examples of participant sampling and selection:

Purposive sampling: selection based on the researcher’s rationale in terms of being the most informative.

Criterion sampling: selection based on pre-identified factors.

Convenience sampling: selection based on availability.

Snowball sampling: selection by referral from other participants or people who know potential participants.

Extreme case sampling: targeted selection of rare cases.

Typical case sampling: selection based on regular or average participants.

Data Collection and Analysis

Qualitative research uses several techniques including interviews, focus groups, and observation. [1] [2] [3] Interviews may be unstructured, with open-ended questions on a topic and an interviewer who adapts to the responses. Structured interviews have a predetermined number of questions that every participant is asked. An interview is usually one-on-one and is appropriate for sensitive topics or topics needing in-depth exploration. Focus groups are often held with 8-12 target participants and are used when group dynamics and collective views on a topic are desired. Researchers can be participant-observers who share the experiences of the subject, or non-participant (detached) observers.

While quantitative research design prescribes a controlled environment for data collection, qualitative data collection may be in a central location or in the environment of the participants, depending on the study goals and design. Qualitative research could amount to a large amount of data. Data is transcribed which may then be coded manually or with the use of Computer Assisted Qualitative Data Analysis Software or CAQDAS such as ATLAS.ti or NVivo.

After the coding process, qualitative research results could be in various formats. It could be a synthesis and interpretation presented with excerpts from the data. Results also could be in the form of themes and theory or model development.

To standardize and facilitate the dissemination of qualitative research outcomes, the healthcare team can use two reporting standards. The Consolidated Criteria for Reporting Qualitative Research or COREQ is a 32-item checklist for interviews and focus groups. The Standards for Reporting Qualitative Research (SRQR) is a checklist covering a wider range of qualitative research.

Examples of Application

Many times a research question will start with qualitative research. The qualitative research will help generate the research hypothesis, which can be tested with quantitative methods. After the data is collected and analyzed with quantitative methods, a set of qualitative methods can be used to dive deeper into the data for a better understanding of what the numbers truly mean and their implications. The qualitative methods can then help clarify the quantitative data and also help refine the hypothesis for future research. Furthermore, with qualitative research, researchers can explore subjects that are poorly studied with quantitative methods. These include opinions, individuals' actions, and social science research.

A good qualitative study design starts with a goal or objective. This should be clearly defined or stated. The target population needs to be specified. A method for obtaining information from the study population must be carefully detailed to ensure there are no omissions of part of the target population. A proper collection method should be selected which will help obtain the desired information without overly limiting the collected data because many times, the information sought is not well compartmentalized or obtained. Finally, the design should ensure adequate methods for analyzing the data. An example may help better clarify some of the various aspects of qualitative research.

A researcher wants to decrease the number of teenagers who smoke in their community. The researcher could begin by asking current teen smokers why they started smoking through structured or unstructured interviews (qualitative research). The researcher can also get together a group of current teenage smokers and conduct a focus group to help brainstorm factors that may have prevented them from starting to smoke (qualitative research).

In this example, the researcher has used qualitative research methods (interviews and focus groups) to generate a list of ideas of both why teens start to smoke as well as factors that may have prevented them from starting to smoke. Next, the researcher compiles this data. The research found that, hypothetically, peer pressure, health issues, cost, being considered “cool,” and rebellious behavior all might increase or decrease the likelihood of teens starting to smoke.

The researcher creates a survey asking teen participants to rank how important each of the above factors is in either starting smoking (for current smokers) or not smoking (for current non-smokers). This survey provides specific numbers (ranked importance of each factor) and is thus a quantitative research tool.

The researcher can use the results of the survey to focus efforts on the one or two highest-ranked factors. Let us say the researcher found that health was the major factor that keeps teens from starting to smoke, and peer pressure was the major factor that contributed to teens starting to smoke. The researcher can go back to qualitative research methods to dive deeper into each of these for more information. The researcher wants to focus on how to keep teens from starting to smoke, so they focus on the peer pressure aspect.

The researcher can conduct interviews and/or focus groups (qualitative research) about what types and forms of peer pressure are commonly encountered, where the peer pressure comes from, and where smoking first starts. The researcher hypothetically finds that peer pressure often occurs after school at the local teen hangouts, mostly the local park. The researcher also hypothetically finds that peer pressure comes from older, current smokers who provide the cigarettes.

The researcher could further explore this through observation at the local teen hangouts (qualitative research) and take notes regarding who is smoking, who is not, and what observable factors are at play in peer pressure to smoke. The researcher finds a local park where many local teenagers hang out and sees that a shady, overgrown area of the park is where the smokers tend to hang out. The researcher notes that the smoking teenagers buy their cigarettes from a local convenience store adjacent to the park, where the clerk does not check identification before selling cigarettes. These observations fall under qualitative research.

If the researcher returns to the park and counts how many individuals smoke in each region of the park, this numerical data would be quantitative research. Based on the researcher's efforts thus far, they conclude that local teen smoking and teenagers who start to smoke may decrease if there are fewer overgrown areas of the park and the local convenience store does not sell cigarettes to underage individuals.

The researcher could try to have the parks department reassess the shady areas to make them less conducive to smoking, or identify how to limit the sale of cigarettes to underage individuals by the convenience store. The researcher would then cycle back to qualitative methods, asking the at-risk population about their perceptions of the changes and what factors are still at play, as well as quantitative research that includes teen smoking rates in the community and the incidence of new teen smokers, among others.


NCBI Genome Downloading Scripts

Some scripts to download bacterial and fungal genomes from NCBI after they restructured their FTP a while ago.

Idea shamelessly stolen from Mick Watson's Kraken downloader scripts that can also be found in Mick's GitHub repo. However, Mick's scripts are written in Perl and are specific to actually building a Kraken database (as advertised).

So this is a set of scripts that focuses on the actual genome downloading.

Alternatively, clone this repository from GitHub, then run (in a python virtual environment)

If this fails on older versions of Python, try updating your pip tool first:

and then rerun the ncbi-genome-download install.

Alternatively, ncbi-genome-download is packaged in conda. Refer to the Anaconda/miniconda site to install a distribution (highly recommended). With that installed, one can do:

ncbi-genome-download is only developed and tested on Python releases still under active support by the Python project. At the moment, this means versions 3.5, 3.6, 3.7, and 3.8. Specifically, no attempt at testing under Python versions older than 3.5 is being made.

If your system is stuck on an older version of Python, consider using a tool like Homebrew to obtain a more up-to-date version.

ncbi-genome-download 0.2.12 was the last version to support Python 2.

To download all bacterial RefSeq genomes in GenBank format from NCBI, run the following:

Downloading multiple groups is also possible:

Note: To see all available groups, see ncbi-genome-download --help, or simply use all to check all groups. Naming a more specific group will reduce the download size and the time needed to find the sequences to download.

If you're on a reasonably fast connection, you might want to try running multiple downloads in parallel:

To download all fungal GenBank genomes from NCBI in GenBank format, run:

To download all viral RefSeq genomes in FASTA format, run:

It is possible to download multiple formats by supplying a list of formats or simply download all formats:

To download only completed bacterial RefSeq genomes in GenBank format, run:

It is possible to download multiple assembly levels at once by supplying a list:

To download only bacterial reference genomes from RefSeq in GenBank format, run:

To download bacterial RefSeq genomes of the genus Streptomyces, run:

Note: This is a simple string match on the organism name provided by NCBI only.

You can also use this with a slight trick to download genomes of a certain species as well:

Note: The quotes are important. Again, this is a simple string match on the organism name provided by the NCBI.

Multiple genera are also possible:

You can also put genus names into a file, one organism per line, e.g.:

Then, pass the path to that file (e.g. my_genera.txt) to the --genera option, like so:

Note: The above command will download all Streptomyces and Amycolatopsis genomes from RefSeq.
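The file-based variant can also be prepared from a script. The following is a minimal sketch: the helper names are hypothetical, it only assembles the argument list (it does not run the tool), and it assumes the taxonomic group is given as the final positional argument.

```python
from pathlib import Path


def write_genera_file(path, genera):
    """Write one organism name per line, as the --genera file expects."""
    text = "".join(name + "\n" for name in genera)
    Path(path).write_text(text, encoding="utf-8")


def genera_command(genera_file, group="bacteria"):
    """Build the argument list that passes the file to --genera."""
    return ["ncbi-genome-download", "--genera", str(genera_file), group]
```

The returned list can be handed to `subprocess.run` once ncbi-genome-download is installed; building it as a list avoids shell quoting problems with names that contain spaces.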

You can make the string match fuzzy using the --fuzzy-genus option. This can be handy if you need to match a value in the middle of the NCBI organism name, like so:

Note: The above command will download all bacterial genomes containing "coelicolor" anywhere in their organism name from RefSeq.

To download bacterial RefSeq genomes based on their NCBI species taxonomy ID, run:

Note: The above command will download all RefSeq genomes belonging to Escherichia coli.

To download a specific bacterial RefSeq genome based on its NCBI taxonomy ID, run:

Note: The above command will download the RefSeq genome belonging to Escherichia coli str. K-12 substr. MG1655.

It is also possible to download multiple species taxids or taxids by supplying the numbers in a comma-separated list:

Note: The above command will download the reference genomes for cat and human.

In addition, you can put multiple species taxids or taxids into a file, one per line and pass that filename to the --species-taxids or --taxids parameters, respectively.

Assuming you had a file my_taxids.txt with the following contents:

You could download the reference genomes for cat and human like this:
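Because a stray character in such a file is easy to miss, it can help to validate the IDs before use. This is a small sketch with hypothetical helper names; it reads one ID per line and builds the comma-separated form that --taxids and --species-taxids accept.

```python
def read_taxids(lines):
    """Return the taxids found in an iterable of lines, one ID per line."""
    taxids = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if not text.isdigit():
            raise ValueError("line {}: not a numeric taxid: {!r}".format(number, text))
        taxids.append(int(text))
    return taxids


def taxids_option(taxids, species=False):
    """Build the option and its comma-separated value."""
    flag = "--species-taxids" if species else "--taxids"
    return [flag, ",".join(str(taxid) for taxid in taxids)]
```

For example, a file holding the NCBI Taxonomy IDs 9685 (cat) and 9606 (human) would produce the option value `9685,9606`.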

It is possible to also create a human-readable directory structure in parallel to mirroring the layout used by NCBI:

This will use links to point to the appropriate files in the NCBI directory structure, so it saves file space. Note that links are not supported on some Windows file systems and some older versions of Windows.

It is also possible to re-run a previous download with the --human-readable option. In this case, ncbi-genome-download will not download any new genome files and will just create the human-readable directory structure. Note that if any files have been changed on the NCBI side, a file download will be triggered.

There is a "dry-run" option to show which accessions would be downloaded, given your filters:

If you want to filter for the "relation to type material" column of the assembly summary file, you can use the --type-materials option. Possible values are "any", "all", "type", "reference", "synonym", "proxytype", and/or "neotype". "any" will include assemblies with no relation to type material value defined, "all" will download only assemblies with a defined value. Multiple values can be given, separated by comma:
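When the option value is generated by a script, the accepted values listed above can be checked before the command is built. A minimal sketch with a hypothetical helper name, using only the values named in this section:

```python
# Values accepted by --type-materials, as listed in the text above.
TYPE_MATERIALS = ("any", "all", "type", "reference", "synonym", "proxytype", "neotype")


def type_materials_option(values):
    """Validate the requested values and join them with commas."""
    values = list(values)
    if not values:
        raise ValueError("at least one type material value is required")
    unknown = [value for value in values if value not in TYPE_MATERIALS]
    if unknown:
        raise ValueError("unknown type material value(s): " + ", ".join(unknown))
    return ["--type-materials", ",".join(values)]
```

The function returns the option and its value as two list items, ready to be spliced into a larger argument list.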

By default, ncbi-genome-download caches the assembly summary files for the respective taxonomic groups for one day. You can skip using the cache file by using the --no-cache option. The output of --help also shows the cache directory, should you want to remove any of the cached files.

To get an overview of all options, run

You can also use it as a method call. Pass the pythonised keyword arguments (_ instead of -) as described above or in the --help:

Note: To specify a taxonomic group, like bacteria, use the group keyword.
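The "pythonised" naming rule is mechanical, so command-line option names can be converted to keyword names with a small helper. This is a sketch only: the helper names are hypothetical, and the call into the package itself is not shown because the original example is not preserved here.

```python
def pythonise_option(option):
    """Turn a command-line option name into its keyword-argument form."""
    # Drop the leading dashes, then use '_' instead of '-'.
    return option.lstrip("-").replace("-", "_")


def pythonise_options(options):
    """Convert a mapping keyed by option names into keyword arguments."""
    return {pythonise_option(name): value for name, value in options.items()}
```

The resulting dictionary holds names such as `species_taxids` or `human_readable`, which correspond to the options described in the sections above.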

Contributed Scripts:

This script lets you find out what TaxIDs to pass to ngd , and will write a simple one-item-per-line file to pass in to it. It utilises the ete3 toolkit, so refer to their site to install the dependency if it's not already satisfied.

You can query the database using a particular TaxID, or a scientific name. The primary function of the script is to return all the child taxa of the specified parent taxa. The script has various options for what information is written in the output.

A basic invocation may look like:

On first use, a small sqlite database will be created in your home directory by default (change the location with the --database flag). You can update this database by using the --update flag. Note that if the database is not in your home directory, you must specify it with --database or a new database will be created in your home directory.

All code is available under the Apache License version 2, see the LICENSE file for details.

Announcing the RefSeq annotation of rat mRatBN7.2!

NCBI RefSeq has finished its initial annotation of the new rat reference assembly, mRatBN7.2, recently released by the Darwin Tree of Life Project at the Wellcome Sanger Institute. This is the first coordinate-changing update to the rat reference since the 2014 release of Rnor_6.0 from the Rat Genome Sequencing Consortium and brings the rat assembly into the modern age with a nearly 300x increase in contig N50 and 9x increase in scaffold N50 lengths. It’s a major improvement!


Genotype data hosted at dbGaP comprise individual-level genotypes and aggregated summaries, both of which are distributed exclusively through the dbGaP Authorized Access System. The types of data available include DNA variations, SNP assay, DNA methylation (epigenomics), copy number variation, as well as genomic/exomic sequencing. RNA data types such as expression array, RNA-seq, and eQTL results are also available. For details about the accepted format of submitted genotype files, please see the dbGaP submission guide. Genotype data files are compressed and archived into tar files for distribution. The files are explicitly named to indicate file content: raw data (CEL and IDAT), genotype calls (genotype), and locus annotations (marker info). Genotype calls are usually grouped according to file format and genotyping platform, including one sample per file (indfmt), multiple-sample matrix (matrixfmt), and variant call format (VCF), or other popular formats. They are accompanied by a sample-info file for subject lookup and consent status. The consent code and consent abbreviation are also embedded in the file name.

View BAM alignments in the NCBI genome browsers and sequence viewers sorted by haplotype tag

NCBI’s genome browsers and graphical sequence viewers now allow you to view BAM alignments sorted by haplotype tag. This option is useful for analyzing variants within a sequenced sample and can help you detect or validate structural variants.

Figure 1. Remote BAM alignment data sorted by haplotype tag in the Genome Data Viewer (GDV). The remote BAM file was added through the “User Data and Track Hubs” feature in GDV. The sorted display shows that haplotype 1 contains a significant deletion in this region relative to haplotype 2 and the reference genome assembly. Aligned reads not assigned a haplotype tag in the BAM file are grouped under the heading “haplotype not set” (not shown).
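The same grouping can be approximated outside the browser. The sketch below works on SAM text records (for example, the output of `samtools view` on a BAM file) and assumes the haplotype is stored in an `HP:i:` optional field, the convention used by common phasing tools; whether a given BAM file follows it is an assumption. The function names are hypothetical, and this is not how the NCBI viewers are implemented.

```python
from collections import defaultdict


def haplotype_of(sam_line):
    """Return the integer HP tag of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start after the 11 mandatory columns.
    for field in fields[11:]:
        if field.startswith("HP:i:"):
            return int(field[len("HP:i:"):])
    return None


def group_by_haplotype(sam_lines):
    """Group read names by haplotype tag, skipping header lines."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        tag = haplotype_of(line)
        key = f"haplotype {tag}" if tag is not None else "haplotype not set"
        read_name = line.split("\t", 1)[0]
        groups[key].append(read_name)
    return dict(groups)
```

Reads without the tag end up under "haplotype not set", mirroring the heading used in the sorted display.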


CD-ROM releases are distributed quarterly. The main contents are the nucleotide and protein sequence databases. Software for data query and retrieval is also provided. The program EMBL-Search for Macintosh and Windows allows data access by entry name, accession number, keyword, citation, author name, taxonomic classification, database cross-reference, free text and date. EMBL-Search also accesses the Prosite and Enzyme databases, and enables navigation between related entries via the cross-references built into the databases. It uses binary indices whose structure is documented and therefore available for other software systems. The sequence databases are also provided on a separate CD-ROM in FastA format for use with software such as FastA on Macintosh and PC systems.

Sequence search facilities

The EBI provides a number of services that allow external users to compare their own sequences against the most current data in the EMBL Nucleotide Sequence Database and SWISS-PROT. BLITZ is based on the MPsrch program of Collins and Sturrock (Edinburgh University), which uses the well-known Smith and Waterman (9) algorithm for sensitive searches of the protein sequence databases. It is implemented on a MasPar (massively parallel) computer at the EBI. Mail-FastA is based on Pearson and Lipman's FastA program (10). It performs sensitive comparisons of nucleotide or amino acid sequences against the database. Blast performs a fast heuristic search for local similarities against the database, and uses Karlin and Altschul ‘sum statistics’ (11) to evaluate the significance of multiple regions of similarity.

These search tools are available either interactively through the URLs listed below or through Email:

Further information can be obtained by sending an Email to the corresponding address with the word HELP in the body of the message.
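To give a feel for the Smith and Waterman algorithm that underlies BLITZ, here is a minimal sketch of its scoring recurrence for two short sequences. It uses a simple match/mismatch score and a linear gap penalty; these parameter values are illustrative assumptions, not the settings used by MPsrch or any EBI service, and real protein searches use substitution matrices and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Two toy sequences sharing the local region "ACGT".
print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Every cell of the table is filled in, which is what makes the method sensitive but costly; that cost is the reason a massively parallel machine was used for it.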

Database query/retrieval

The EBI provides a query/retrieval system based on the Sequence Retrieval System, SRS (12). Specific query forms are accessible at the URL:


The European Molecular Biology Network (EMBnet) was initiated in 1988 to link European laboratories using biocomputing and bioinformatics in molecular biology research, and to increase the availability and usefulness of the molecular biology databases within Europe. Remote copies of the nucleotide and protein sequence databases, updated daily, as well as other molecular biology resources, are held at nationally mandated nodes. As bioinformatics grows, EMBnet plays an important role in support, training, research and development for the European bioinformatics research community. Table 1 gives a full listing of sites maintaining daily updated copies of the EMBL Database.

How to contact the European Bioinformatics Institute

Postal address. EMBL Outstation - The EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel: +44 (1223) 494444 Fax: +44 (1223) 494468

