getorf |
Please help by correcting and extending the Wiki pages.
This program finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF may be defined as a region of a specified minimum size between two STOP codons, or between a START and a STOP codon. The ORFs can be output as the nucleotide sequence or as the protein translation. Optionally, the program will output the region around the START codon, the first STOP codon, or the final STOP codon of an ORF. The START and STOP codons are defined in a Genetic Code table; a suitable table can be selected for the organism you are investigating. The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).
% getorf -minsize 300 Find and extract open reading frames (ORFs) Input nucleotide sequence(s): tembl:v00294 protein output sequence(s) [v00294.orf]: |
Go to the input files for this example
Go to the output files for this example
Find and extract open reading frames (ORFs) Version: EMBOSS:6.6.0.0 Standard (Mandatory) qualifiers: [-sequence] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA) [-outseq] seqoutall [ |
Qualifier | Type | Description | Allowed values | Default | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Standard (Mandatory) qualifiers | ||||||||||||||||||||||||||||||||||||||||
[-sequence] (Parameter 1) |
seqall | Nucleotide sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required | ||||||||||||||||||||||||||||||||||||
[-outseq] (Parameter 2) |
seqoutall | Protein sequence set(s) filename and optional format (output USA) | Writeable sequence(s) | <*>.format | ||||||||||||||||||||||||||||||||||||
Additional (Optional) qualifiers | ||||||||||||||||||||||||||||||||||||||||
-table | list | Code to use |
|
0 | ||||||||||||||||||||||||||||||||||||
-minsize | integer | Minimum nucleotide size of ORF to report | Any integer value | 30 | ||||||||||||||||||||||||||||||||||||
-maxsize | integer | Maximum nucleotide size of ORF to report | Any integer value | 1000000 | ||||||||||||||||||||||||||||||||||||
-find | list | This is a small menu of possible output options. The first four options are to select either the protein translation or the original nucleic acid sequence of the open reading frame. There are two possible definitions of an open reading frame: it can either be a region that is free of STOP codons or a region that begins with a START codon and ends with a STOP codon. The last three options are probably only of interest to people who wish to investigate the statistical properties of the regions around potential START or STOP codons. The last option assumes that ORF lengths are calculated between two STOP codons. |
|
0 | ||||||||||||||||||||||||||||||||||||
Advanced (Unprompted) qualifiers | ||||||||||||||||||||||||||||||||||||||||
-[no]methionine | boolean | START codons at the beginning of protein products will usually code for Methionine, despite what the codon will code for when it is internal to a protein. This qualifier sets all such START codons to code for Methionine by default. | Boolean value Yes/No | Yes | ||||||||||||||||||||||||||||||||||||
-circular | boolean | Is the sequence circular | Boolean value Yes/No | No | ||||||||||||||||||||||||||||||||||||
-[no]reverse | boolean | Set this to be false if you do not wish to find ORFs in the reverse complement of the sequence. | Boolean value Yes/No | Yes | ||||||||||||||||||||||||||||||||||||
-flanking | integer | If you have chosen one of the options of the type of sequence to find that gives the flanking sequence around a STOP or START codon, this allows you to set the number of nucleotides either side of that codon to output. If the region of flanking nucleotides crosses the start or end of the sequence, no output is given for this codon. | Any integer value | 100 | ||||||||||||||||||||||||||||||||||||
Associated qualifiers | ||||||||||||||||||||||||||||||||||||||||
"-sequence" associated seqall qualifiers | ||||||||||||||||||||||||||||||||||||||||
-sbegin1 -sbegin_sequence |
integer | Start of each sequence to be used | Any integer value | 0 | ||||||||||||||||||||||||||||||||||||
-send1 -send_sequence |
integer | End of each sequence to be used | Any integer value | 0 | ||||||||||||||||||||||||||||||||||||
-sreverse1 -sreverse_sequence |
boolean | Reverse (if DNA) | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-sask1 -sask_sequence |
boolean | Ask for begin/end/reverse | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-snucleotide1 -snucleotide_sequence |
boolean | Sequence is nucleotide | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-sprotein1 -sprotein_sequence |
boolean | Sequence is protein | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-slower1 -slower_sequence |
boolean | Make lower case | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-supper1 -supper_sequence |
boolean | Make upper case | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-scircular1 -scircular_sequence |
boolean | Sequence is circular | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-squick1 -squick_sequence |
boolean | Read id and sequence only | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-sformat1 -sformat_sequence |
string | Input sequence format | Any string | |||||||||||||||||||||||||||||||||||||
-iquery1 -iquery_sequence |
string | Input query fields or ID list | Any string | |||||||||||||||||||||||||||||||||||||
-ioffset1 -ioffset_sequence |
integer | Input start position offset | Any integer value | 0 | ||||||||||||||||||||||||||||||||||||
-sdbname1 -sdbname_sequence |
string | Database name | Any string | |||||||||||||||||||||||||||||||||||||
-sid1 -sid_sequence |
string | Entryname | Any string | |||||||||||||||||||||||||||||||||||||
-ufo1 -ufo_sequence |
string | UFO features | Any string | |||||||||||||||||||||||||||||||||||||
-fformat1 -fformat_sequence |
string | Features format | Any string | |||||||||||||||||||||||||||||||||||||
-fopenfile1 -fopenfile_sequence |
string | Features file name | Any string | |||||||||||||||||||||||||||||||||||||
"-outseq" associated seqoutall qualifiers | ||||||||||||||||||||||||||||||||||||||||
-osformat2 -osformat_outseq |
string | Output seq format | Any string | |||||||||||||||||||||||||||||||||||||
-osextension2 -osextension_outseq |
string | File name extension | Any string | |||||||||||||||||||||||||||||||||||||
-osname2 -osname_outseq |
string | Base file name | Any string | |||||||||||||||||||||||||||||||||||||
-osdirectory2 -osdirectory_outseq |
string | Output directory | Any string | |||||||||||||||||||||||||||||||||||||
-osdbname2 -osdbname_outseq |
string | Database name to add | Any string | |||||||||||||||||||||||||||||||||||||
-ossingle2 -ossingle_outseq |
boolean | Separate file for each entry | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-oufo2 -oufo_outseq |
string | UFO features | Any string | |||||||||||||||||||||||||||||||||||||
-offormat2 -offormat_outseq |
string | Features format | Any string | |||||||||||||||||||||||||||||||||||||
-ofname2 -ofname_outseq |
string | Features file name | Any string | |||||||||||||||||||||||||||||||||||||
-ofdirectory2 -ofdirectory_outseq |
string | Output directory | Any string | |||||||||||||||||||||||||||||||||||||
General qualifiers | ||||||||||||||||||||||||||||||||||||||||
-auto | boolean | Turn off prompts | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y | ||||||||||||||||||||||||||||||||||||
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N | ||||||||||||||||||||||||||||||||||||
-warning | boolean | Report warnings | Boolean value Yes/No | Y | ||||||||||||||||||||||||||||||||||||
-error | boolean | Report errors | Boolean value Yes/No | Y | ||||||||||||||||||||||||||||||||||||
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y | ||||||||||||||||||||||||||||||||||||
-die | boolean | Report dying program messages | Boolean value Yes/No | Y | ||||||||||||||||||||||||||||||||||||
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
The input is a standard EMBOSS sequence query (also known as a 'USA').
Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl
Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.
The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.
See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.
ID V00294; SV 1; linear; genomic DNA; STD; PRO; 1113 BP. XX AC V00294; XX DT 09-JUN-1982 (Rel. 01, Created) DT 10-FEB-1999 (Rel. 58, Last updated, Version 2) XX DE E. coli laci gene (codes for the lac repressor). XX KW DNA binding protein; repressor. XX OS Escherichia coli OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; OC Enterobacteriaceae; Escherichia. XX RN [1] RP 1-1113 RX DOI; 10.1038/274765a0. RX PUBMED; 355891. RA Farabaugh P.J.; RT "Sequence of the lacI gene"; RL Nature 274(5673):765-769(1978). XX CC KST ECO.LACI XX FH Key Location/Qualifiers FH FT source 1..1113 FT /organism="Escherichia coli" FT /mol_type="genomic DNA" FT /db_xref="taxon:562" FT CDS 31..1113 FT /transl_table=11 FT /note="reading frame" FT /db_xref="GOA:P03023" FT /db_xref="InterPro:IPR000843" FT /db_xref="InterPro:IPR010982" FT /db_xref="PDB:1CJG" FT /db_xref="PDB:1EFA" FT /db_xref="PDB:1JWL" FT /db_xref="PDB:1JYE" FT /db_xref="PDB:1JYF" FT /db_xref="PDB:1L1M" FT /db_xref="PDB:1LBG" FT /db_xref="PDB:1LBH" FT /db_xref="PDB:1LBI" FT /db_xref="PDB:1LCC" FT /db_xref="PDB:1LCD" FT /db_xref="PDB:1LQC" FT /db_xref="PDB:1LTP" FT /db_xref="PDB:1OSL" FT /db_xref="PDB:1TLF" FT /db_xref="PDB:1Z04" FT /db_xref="PDB:2BJC" FT /db_xref="PDB:2KEI" FT /db_xref="PDB:2KEJ" FT /db_xref="PDB:2KEK" FT /db_xref="PDB:2P9H" FT /db_xref="PDB:2PAF" FT /db_xref="PDB:2PE5" FT /db_xref="PDB:3EDC" FT /db_xref="UniProtKB/Swiss-Prot:P03023" FT /protein_id="CAA23569.1" FT /translation="MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAEL FT NYIPNRVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGV FT EACKAAVHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSH FT EDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSA FT MSGFQQTMQMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSC FT YIPPSTTIKQDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPR FT ALADSLMQLARQVSRLESGQ" XX SQ Sequence 1113 BP; 249 A; 304 C; 322 G; 238 T; 0 other; ccggaagaga gtcaattcag ggtggtgaat gtgaaaccag taacgttata cgatgtcgca 60 gagtatgccg gtgtctctta tcagaccgtt tcccgcgtgg tgaaccaggc cagccacgtt 120 tctgcgaaaa cgcgggaaaa agtggaagcg gcgatggcgg agctgaatta cattcccaac 180 cgcgtggcac aacaactggc gggcaaacag tcgttgctga ttggcgttgc cacctccagt 240 ctggccctgc acgcgccgtc gcaaattgtc gcggcgatta aatctcgcgc cgatcaactg 300 ggtgccagcg tggtggtgtc gatggtagaa cgaagcggcg tcgaagcctg taaagcggcg 360 gtgcacaatc ttctcgcgca acgcgtcagt gggctgatca ttaactatcc gctggatgac 420 caggatgcca ttgctgtgga agctgcctgc actaatgttc cggcgttatt tcttgatgtc 480 tctgaccaga cacccatcaa cagtattatt ttctcccatg aagacggtac gcgactgggc 540 gtggagcatc tggtcgcatt gggtcaccag caaatcgcgc tgttagcggg cccattaagt 600 tctgtctcgg cgcgtctgcg tctggctggc tggcataaat atctcactcg caatcaaatt 660 cagccgatag cggaacggga aggcgactgg agtgccatgt ccggttttca acaaaccatg 720 caaatgctga atgagggcat cgttcccact gcgatgctgg ttgccaacga tcagatggcg 780 ctgggcgcaa tgcgcgccat taccgagtcc gggctgcgcg ttggtgcgga tatctcggta 840 gtgggatacg acgataccga agacagctca tgttatatcc cgccgtcaac caccatcaaa 900 caggattttc gcctgctggg gcaaaccagc gtggaccgct tgctgcaact ctctcagggc 960 caggcggtga agggcaatca gctgttgccc gtctcactgg tgaaaagaaa aaccaccctg 1020 gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca 1080 cgacaggttt cccgactgga aagcgggcag tga 1113 // |
The input is a standard EMBOSS sequence query (also known as a 'USA') with associated feature information.
Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl
Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.
The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: text, html, xml (uniprotxml), obo, embl (swissprot)
Where the sequence format has no feature information, a second file can be read to load the feature data. The file is specified with the qualifier -ufo xxx and the feature format is specified with the qualifier -fformat xxx
See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.
See: http://emboss.sf.net/docs/themes/FeatureFormats.html for further information on feature formats.
The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).
>V00294_1 [735 - 1112] E. coli laci gene (codes for the lac repressor). GHRSHCDAGCQRSDGAGRNARHYRVRAARWCGYLGSGIRRYRRQLMLYPAVNHHQTGFSP AGANQRGPLAATLSGPGGEGQSAVARLTGEKKNHPGAQYANRLSPRVGRFINAAGTTGFP TGKRAV >V00294_2 [1 - 1110] E. coli laci gene (codes for the lac repressor). PEESQFRVVNVKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN RVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAA VHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSHEDGTRLG VEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSAMSGFQQTM QMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSCYIPPSTTIK QDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA RQVSRLESGQ >V00294_3 [465 - 49] (REVERSE SENSE) E. coli laci gene (codes for the lac repressor). RRNISAGSFHSNGILVIQRIVNDQPTDALREKIVHRRFTGFDAASFYHRHHHAGTQLIGA RFNRRDNLRRRVQGQTGGGNANQQRLFARQLLCHAVGNVIQLRHRRFHFFPRFRRNVAGL VHHAGNGLIRDTGILCDIV |
The name of the ORF sequences is constructed from the name of the input sequence with an underscore character ('_') and a unique ordinal number of the ORF found appended. The description of the output ORF sequence is constructed from the description of the input sequence with the start and end positions of the ORF prepended.
The unique number appended to the name is simply used to create new unique sequence names, it does not imply any further information indicating any order, positioning or sense-strand of the ORFs.
If the ORF has been found in the reverse sense, then the start position will be smaller than the end position. The numbering uses the forward-sense positions, but read in the reverse sense. For example, >V00294_3 [465 - 49] in the output above is a reverse-sense ORF running from position 465 to 49. The description will also contain '(REVERSE SENSE)'.
If the sequence has been specified as a circular genome (using the command-line switch '-circular'), then ORFs can potentially continue past the 'end' of the input sequence (the breakpoint of the circular genome) and into the 'start' of the sequence again. This is dealt with by appending the sequence to itself three times and reporting long ORFs that are found in this extended sequence. Any ORF that is longer that three times the sequence length (i.e one that continues without hitting a STOP at any point in the genome) will be reported as being a maximum of three times the length of the input sequence. Note that the end position of an ORF in circular genomes can be apparently longer than the input sequence if the ORF crosses the breakpoint. If the ORF crosses the breakpoint, then the text '(ORF crosses the breakpoint)' will be added to the description of the output sequence.
The default file EGC.0 is the 'Standard Code' with the rarely used alternate START codons omitted, it only has the normal 'AUG' START codon. The 'Standard Code' with the rarely used alternate START codons included is Genetic Code file EGC.1.
It is expected that user will sometimes wish to customise a Genetic Code file. To do this, use the program embossdata.
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
The directories are searched in the following order:
The Genetic Code data files are based on the NCBI genetic code tables. Their names and descriptions are:
The format of these files is very simple.
It consists of several lines of optional comments, each starting with a '#' character.
These are followed the line: 'Genetic Code [n]', where 'n' is the number of the genetic code file.
This is followed by the description of the code and then by four lines giving the IUPAC one-letter code of the translated amino acid, the start codons (indicdated by an 'M') and the three bases of the codon, lined up one on top of the other.
For example:
------------------------------------------------------------------------------ # Genetic Code Table # # Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html # and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c # # Differs from Genetic Code [1] only in that the initiation sites have been # changed to only 'AUG' Genetic Code [0] Standard AAs = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG Starts = -----------------------------------M---------------------------- Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG ------------------------------------------------------------------------------
There are two common definitions of an open reading frame: it can either be a region that is free of STOP codons or a region that begins with a START codon and ends with a STOP codon.
Program name | Description |
---|---|
checktrans | Report STOP codons and ORF statistics of a protein |
marscan | Find matrix/scaffold recognition (MRS) signatures in DNA sequences |
plotorf | Plot potential open reading frames in a nucleotide sequence |
showorf | Display a nucleotide sequence and translation in pretty format |
sixpack | Display a DNA sequence with 6-frame translation and ORFs |
syco | Draw synonymous codon usage statistic plot for a nucleotide sequence |
tcode | Identify protein-coding regions using Fickett TESTCODE statistic |
wobble | Plot third base position variability in a nucleotide sequence |
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
November 2002 - added indication of reverse sense ORFs
November 2002 - added indication of ORFs that cross the breakpoint at position 1 in circular genomes.