Bioinformatics glossary - S
Secondary structure (protein)
The organization of the peptide backbone of a protein that results from hydrogen bonding, e.g. the alpha helix and the beta pleated sheet.
Selectivity
Selectivity of bioinformatics similarity search algorithms is defined as the significance threshold for reporting database sequence matches. As an example, for BLAST searches, the parameter E is interpreted as the upper bound on the expected frequency of chance occurrence of a match within the context of the entire database search. E may be thought of as the number of matches one expects to observe by chance alone during the database search.
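As a rough illustration of how E depends on the score and on the size of the search, the sketch below uses the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), which this glossary does not itself state. The function names and the default values for K and lambda are assumptions chosen for illustration only, not parameters reported by any particular BLAST run.

```python
import math


def expected_chance_matches(score, query_len, db_len, k=0.13, lam=0.32):
    """Expected number of chance matches scoring at least `score`.

    Uses E = K * m * n * exp(-lambda * S). The defaults for k and lam
    are illustrative assumptions, not values from a specific search.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def is_reported(score, query_len, db_len, e_threshold=10.0):
    """A match is reported only if its E value is at or below the threshold."""
    return expected_chance_matches(score, query_len, db_len) <= e_threshold
```

Lowering `e_threshold` makes the search more selective: fewer chance matches are reported, at the cost of possibly dropping weak but real ones.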
Sense strand
The strand of double-stranded DNA whose sequence matches that of the RNA transcript (with T in place of U); the complementary antisense strand acts as the template for RNA synthesis. Typically only one gene product is produced per gene, transcribed from one strand only. (Some viruses have open reading frames in both the sense and the antisense strands.)
Sensitivity
Sensitivity of bioinformatics similarity search algorithms centers on two questions: first, how well the method detects biologically meaningful relationships between two related sequences in the presence of mutations and sequencing errors; second, how the heuristic nature of the algorithm affects the probability that a matching sequence will not be detected. At the user's discretion, the speed of most similarity search programs can be sacrificed in exchange for greater sensitivity, with an emphasis on detecting lower-scoring matches.
Sequence Tagged Site (STS)
A unique sequence from a known chromosomal location that can be amplified by PCR. STSs act as physical markers for genomic mapping and cloning.
Sexual PCR (Molecular Diversity)
Sexual PCR is a form of PCR in which similar, but not identical, DNA sequences are reassembled to obtain novel juxtapositions, simulating the result of genetic recombination. The result is an array of related genes, some of which may possess improved characteristics. By repeated rounds of recombination, selection and PCR-based amplification, vastly improved gene products, such as enzymes with greater activity, may be generated and selected.
Shotgun cloning
The cloning of an entire gene segment or genome by generating a random set of fragments using restriction endonucleases to create a gene library that can be subsequently mapped and sequenced to reconstruct the entire genome.
Similarity (homology) search
Given a newly sequenced gene, there are two main approaches to the prediction of structure and function from the amino acid sequence. Homology methods are the most powerful and are based on the detection of significant extended sequence similarity to a protein of known structure, or of a sequence pattern characteristic of a protein family. Statistical methods are less successful but more general and are based on the derivation of structural preference values for single residues, pairs of residues, short oligopeptides or short sequence patterns. The transfer of structure/function information to a potentially homologous protein is straightforward when the sequence similarity is high and extended in length, but the assessment of the structural significance of sequence similarity can be difficult when sequence similarity is weak or restricted to a short region.
Signal sequence (leader sequence)
A short sequence added to the amino-terminal end of a polypeptide chain that forms an amphipathic helix allowing the nascent polypeptide to migrate through membranes such as the endoplasmic reticulum or the cell membrane. It is cleaved from the polypeptide after the protein has crossed the membrane.
Single nucleotide polymorphisms (SNPs)
Variations of single base pairs scattered throughout the human genome that serve as measures of the genetic diversity in humans. Early estimates put the number of SNPs in the human genome at about 1 million, although many more have since been catalogued. SNPs are useful markers for gene mapping studies.
Single-pass sequencing
Rapid sequencing of large segments of the genome of an organism by isolating as many expressed (cDNA) sequences as possible and performing single sequencer runs on their 5' or 3' ends. Single-pass sequencing typically results in individual, error-prone sequencing reads of 400-700 bases, depending on the type of sequencer used. However, if many of these are generated from numerous clones from different tissues, they may be overlapped and assembled to remove the errors and generate a contiguous sequence for the entire expressed gene.
Site
Sites in sequences can be located either in DNA (e.g. binding sites, cleavage sites) or in proteins. In order to identify a site in DNA, ambiguity symbols are used to allow several different symbols at one position. Proteins, however, need a different mechanism (see Pattern). Restriction enzyme cleavage sites, for instance, have the following properties: limited length (typically, less than 20 base pairs); definition of the cleavage site and its appearance (3', 5' overhang or blunt); definition of the binding site.
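A minimal sketch of how ambiguity symbols allow several bases at one position: each IUPAC symbol is expanded into a regular-expression character class, and the resulting pattern is searched against a DNA string. The function names and the example site GTYRAC are chosen for illustration and are not part of this glossary.

```python
import re

# IUPAC nucleotide ambiguity symbols mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_to_regex(site):
    """Translate a site written with ambiguity symbols into a regex string."""
    return "".join(
        IUPAC[s] if len(IUPAC[s]) == 1 else "[" + IUPAC[s] + "]"
        for s in site.upper()
    )


def find_sites(site, dna):
    """Return 0-based start positions of every match, including overlapping ones."""
    # The lookahead lets matches that overlap each other all be reported.
    pattern = re.compile("(?=(" + site_to_regex(site) + "))")
    return [m.start() for m in pattern.finditer(dna.upper())]
```

For example, `find_sites("GTYRAC", "AAGTTAACGGGTCGACTT")` finds both GTTAAC and GTCGAC, because Y allows C or T and R allows A or G. Only one strand is searched here; a non-palindromic site would also need a search of the reverse complement.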
Southern blotting
A procedure for the identification of DNA in which fragments separated on an agarose gel are transferred to a nitrocellulose filter, where they can be hybridized with a complementary "probe" sequence.
Splice site
The sequence found at the 5' and 3' ends of exon/intron boundaries, usually defined by a consensus sequence:
            Intron
5' CAGGTAAGT---------TNCAGG 3'
   A    G            C T
N represents any nucleotide; the bottom line represents alternative nucleotides at the indicated positions.
Splice form
By using alternative splicing, a single message precursor transcribed from DNA can generate an entire family of mRNAs and proteins. This can be utilized to create specificity in cell-cell or cell-ligand interactions. A cell may produce a given protein, but as a different splice form from that produced by an adjacent cell. In this manner, the two cells have the potential to interact differently with other cells or molecules. Two areas where this has been extremely important are the production of cell-surface specificity proteins in the immune system and in the nervous system.
Splicing
The joining together of separate DNA or RNA component parts. For example, RNA splicing in eukaryotes involves the removal of introns and the stitching together of the exons from the pre-mRNA transcript before maturation.
Solvent accessibility
The surface area (typically measured in square angstroms) of a biological molecule, usually a protein, that is exposed to solvent in its native, folded form. Determining the solvent accessibility of a protein helps define which amino acids in its molecular sequence are on the exterior of the molecule, and thus available to participate in interactions with other molecules.
Structural gene
Gene which encodes a structural protein (cf. Regulatory gene).
Structure prediction
Algorithms that predict the secondary, tertiary and sometimes even quaternary structure of proteins from their sequences. Determining protein structure from sequence has been dubbed "the second half of the Genetic Code", since it is the folded tertiary structure of a protein that governs how it functions as a gene product. As yet, most structure prediction methods are only partially successful, and they typically work best for certain well-defined classes of proteins.
Substitution matrix
A scoring scheme derived from a model of protein evolution at the sequence level. Widely used families include the Dayhoff matrices, also called MDM (Mutation Data Matrix) or PAM (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutation) matrices, and the BLOSUM matrices. PAM matrices are derived from global alignments of closely related sequences, and matrices for greater evolutionary distances are extrapolated from those for lesser ones. BLOSUM matrices are instead derived directly from ungapped blocks of aligned sequences clustered at different levels of identity.
Subtraction library
A cDNA library that contains only cDNAs uniquely expressed in a given cell or tissue. For example, T cells and B cells express many common RNAs, as well as a very small percentage that are unique to T cells or to B cells. To make a T cell subtraction library, the cDNA from a T cell library is hybridized with a vast excess of B cell RNA. The commonly expressed genes form RNA-cDNA hybrids, which can be removed (or subtracted) to leave only T cell specific cDNAs.
Satellite DNA/simple sequence DNA
Highly repetitious DNA sequence; generally based on a short sequence (7-20 nucleotides) repeated up to a million times in the haploid genome. Usually found in heterochromatic regions, often associated with the centromere.
Sense strand
In a gene, the DNA strand that has the sequence found in the RNA molecule. Also called the coding, positive, or non-template strand.
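The relationship between the two strands and the transcript can be sketched as follows. The function names are hypothetical, and both strands are assumed to be written 5' to 3'.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe_from_sense(sense):
    """The RNA has the same sequence as the sense strand, with U for T."""
    return sense.upper().replace("T", "U")


def transcribe_from_template(template):
    """The RNA is the reverse complement of the template strand, with U for T."""
    return reverse_complement(template).replace("T", "U")
```

Starting from either strand gives the same RNA: `transcribe_from_sense("ATGGCCTAA")` and `transcribe_from_template("TTAGGCCAT")` both return `"AUGGCCUAA"`.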
Shotgun sequencing
A strategy for sequencing whole genomes, pioneered by the for-profit company Celera. Genomes are cut into very small pieces, which are cloned into plasmids, sequenced, and then assembled into whole chromosomes or genomes. This method is faster than hierarchical shotgun sequencing but more prone to assembly errors.
Simple repeat
A nucleotide repeat with one or a small number of bases, such as AAAAAAAAAAAA or CACACACACA.
SINE
Short Interspersed Nuclear Elements are a class of short repetitive DNA segments derived from reverse-transcribed RNA molecules and commonly found in eukaryotic genomes.
SNP
Single-nucleotide polymorphism; a difference in DNA sequence at a single base between two sequences.
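A minimal sketch of finding single-base differences between two sequences. It assumes the sequences are already aligned, of equal length and free of gaps; the function name is hypothetical.

```python
def single_base_differences(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) wherever two aligned sequences differ.

    Positions are 0-based. The sequences must already be aligned and
    have the same length; gaps are not handled in this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]
```

For example, `single_base_differences("ACGTAC", "ACGGAC")` returns `[(3, "T", "G")]`.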
Splicing
The process by which introns are removed and exons are joined to produce a mature, functional RNA from a primary transcript. Some RNAs are self-splicing, but most require a specific ribonucleoprotein complex to catalyze the reaction.
Splicing acceptor site
The boundary between an intron and the exon immediately downstream (i.e. on the 3' side of the intron).
Splicing donor site
The boundary between an intron and the exon immediately upstream (i.e. on the 5' side of the intron).
Splicing transesterification mechanism
A chemical reaction that joins the 5' phosphate of the first nucleotide at the 5' end of the downstream exon to the 3' hydroxyl group of the last nucleotide of the upstream exon, forming a phosphodiester bond.
Start codon
The first codon of a coding sequence. In eukaryotes this is almost always ATG, which codes for methionine.
Start site
The nucleotide at which transcription starts, usually denoted as position +1 in reference to the gene being transcribed.
Stop codon
A codon that specifies the termination of peptide synthesis. In the standard genetic code the stop codons are TAA, TAG and TGA; they are sometimes called "nonsense codons," since they do not specify any amino acid.
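Taken together, the start codon and stop codon entries delimit an open reading frame. The sketch below scans one strand for the first ATG that is followed in frame by a stop codon. It assumes the standard genetic code, and the function name is hypothetical.

```python
# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(dna, start_codon="ATG"):
    """Return the first reading frame from a start codon through a stop codon.

    Only the given strand is scanned. Returns None if no start codon
    is followed by an in-frame stop codon.
    """
    dna = dna.upper()
    start = dna.find(start_codon)
    while start != -1:
        # Step through the codons that follow the start codon, in frame.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        start = dna.find(start_codon, start + 1)
    return None
```

For example, `first_orf("CCATGGCCTTTTAAGG")` returns `"ATGGCCTTTTAA"`: the start codon, two sense codons and the stop codon TAA.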
STRs
Short tandem repeats. At many places in genomes, there are short sequences (~5-35 bp) of bases that are not transcribed and that are repeated several times in a row (a tandem array). Different individuals often have a different number of repeats, and populations usually have a wide range of copy numbers at a given site. The number of repeats can therefore serve as a convenient genetic marker for determining genetic relationships.
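A minimal sketch of counting the copy number at a repeat site, which is the quantity compared between individuals. The function name and the repeat unit used in the usage note are hypothetical.

```python
import re


def longest_tandem_run(dna, unit):
    """Return the largest number of back-to-back copies of `unit` in `dna`."""
    unit = unit.upper()
    # Each match is one uninterrupted run of whole copies of the unit.
    runs = re.findall("(?:" + re.escape(unit) + ")+", dna.upper())
    return max((len(run) // len(unit) for run in runs), default=0)
```

With a made-up unit `"TTCTA"`, a sequence carrying three copies in a row gives 3 and one carrying seven gives 7, so the two individuals can be told apart at that site.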
Subject
The sequence, typically retrieved from a database, to which the sequence of interest (the query) is being compared.
Synteny
The state of being on the same chromosome. A gene is also said to be syntenic to a particular chromosome if it is known to be located on that chromosome but is otherwise unmapped.
Sample and Data Relationship Format (SDRF)
ArrayExpress definition: The SDRF describes all the sample characteristics (e.g. cell type) or any treatment that the sample has been subjected to (e.g. growth in low oxygen conditions), and links each sample to its corresponding data file.
SCOP
The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences.
Single Linkage
Single linkage is a type of hierarchical clustering in which the distance between one cluster and another is considered to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
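The definition above translates directly into a minimum over all cross-cluster pairs. In this sketch the function name is hypothetical and the distance function is supplied by the caller.

```python
def single_linkage_distance(cluster_a, cluster_b, dist):
    """Shortest distance from any member of one cluster to any member of the other."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)
```

For example, with points on a line and absolute difference as the distance, `single_linkage_distance([1.0, 2.0], [5.0, 9.0], lambda a, b: abs(a - b))` returns `3.0`, the gap between 2.0 and 5.0.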
Single nucleotide polymorphism
A single base pair of DNA that is polymorphic (has alternate alleles) with respect to a population.
Structural genomics
The determination of the three dimensional structure of all proteins of a given organism in a high-throughput manner.
Subcellular
Located or occurring within a cell; used of cell components such as the mitochondrion, the nucleus or the plasma membrane.
Substitution matrix
A grid that provides scores for the substitution of every amino acid (or nucleotide) for every other. They are used to score the substitution of one residue for another in a protein alignment. They provide information on residues possessing similar properties or the frequency of residue exchange in known protein families.
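A minimal sketch of scoring an alignment with such a grid. The nucleotide scores below are made up for illustration (identity +2, transition -1, transversion -2) and are not taken from any published matrix; the alignment is assumed to be ungapped.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def toy_score(x, y):
    """Illustrative scores only: identity +2, transition -1, transversion -2."""
    if x == y:
        return 2
    same_class = (x in PURINES) == (y in PURINES)
    return -1 if same_class else -2


# The grid: one score for every ordered pair of residues.
MATRIX = {(x, y): toy_score(x, y) for x in BASES for y in BASES}


def alignment_score(seq_a, seq_b, matrix=MATRIX):
    """Sum the matrix scores over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a.upper(), seq_b.upper()))
```

With these toy scores, a perfect match of ACGT against itself scores 8, an A/G transition at one position lowers the total to 5, and an A/C transversion lowers it to 4. A protein matrix such as PAM or BLOSUM is used the same way, with a 20 by 20 grid.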
Systems Biology Markup Language
Systems Biology Markup Language (SBML) is a standard format based on XML for communicating and storing computational models of biological processes. It is a free and open standard with widespread usage and software support. SBML represents many different classes of biological phenomena, including metabolic networks, cell signaling pathways, regulatory networks and infectious diseases. www.sbml.org
Sanger sequencing
A widely used method of determining the order of bases in DNA.
See also: sequencing, shotgun sequencing
Satellite
A chromosomal segment that branches off from the rest of the chromosome but is still connected by a thin filament or stalk.
Scaffold
In genomic mapping, a series of contigs that are in the right order but not necessarily connected in one continuous stretch of sequence.
Segregation
The normal biological process whereby the two members of a chromosome pair are separated during meiosis and randomly distributed to the germ cells.