<!-- TIGR XML DTD file -->
<!--
tigrxml.dtd
DTD for XML format presented by TIGR to release the genome annotation data
to the scientific community.
Brian Haas 02/13/2001
-->
<!--
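The root element of the XML is TIGR. TIGR contains any number of PSEUDOCHROMOSOME and ASSEMBLY elements, in any order.
A PSEUDOCHROMOSOME in turn contains a SCAFFOLD followed by a single ASSEMBLY.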
-->
<!ELEMENT TIGR (PSEUDOCHROMOSOME | ASSEMBLY)* >
<!ELEMENT PSEUDOCHROMOSOME (SCAFFOLD, ASSEMBLY) >
<!--
The ASSEMBLY element is the parent element referring to an individual nucleotide assembly.
Often, the nucleotide assembly represents a single BAC (bacterial artificial chromosome) sequence.
This element houses the annotation for the sequence unit.
The unique index to the TIGR annotation database is the ASMBL_ID.
CLONE_ID is for TIGR's tracking purposes only.
DATABASE references the TIGR annotation database name, e.g. ATH1 (Arabidopsis), OSA1 (rice).
CURRENT_DATE: the date the XML was created.
COORDSET: represents the coordinates for which information is provided for the assembly. If the
entire assembly is described, then the coordset will be from position 1 to the length of the assembly.
-->
<!ELEMENT ASSEMBLY ( ASMBL_ID, COORDSET, HEADER, TILING_PATH?, GENE_LIST, MISC_INFO, REPEAT_LIST, ASSEMBLY_SEQUENCE ) >
<!ATTLIST ASSEMBLY CLONE_ID NMTOKEN #REQUIRED >
<!ATTLIST ASSEMBLY DATABASE NMTOKEN #REQUIRED >
<!ATTLIST ASSEMBLY CHROMOSOME NMTOKEN #IMPLIED >
<!ATTLIST ASSEMBLY CURRENT_DATE CDATA #REQUIRED >
<!ELEMENT ASMBL_ID (#PCDATA) >
<!--
GENE_LIST contains all gene features broken down into two parent nodes: the protein coding
genes and the RNA genes.
-->
<!ELEMENT GENE_LIST (PROTEIN_CODING, RNA_GENES)>
<!--
Element RNA_GENES contains each of the non-protein coding genes that TIGR may provide annotation for.
These include tRNAs (see PRE-TRNA), small nuclear RNAs (see SNRNA), small nucleolar RNAs (see SNORNA),
and ribosomal RNAs (see RRNA).
-->
<!ELEMENT RNA_GENES (PRE-TRNA*, SNRNA*, SNORNA*, RRNA*) >
<!--
FEAT_NAME represents a temporary identifier assigned to each gene component. The only stable reference
to a gene is the LOCUS or PUB_LOCUS (see GENE_INFO).
-->
<!ELEMENT FEAT_NAME (#PCDATA) >
<!--
DATE represents the date on which a feature was created or modified. This element is useful for synchronizing
the annotation data with external databases.
-->
<!ELEMENT DATE (#PCDATA) >
<!--
PROTEIN_CODING genes are represented by at least four components: TU, MODEL, EXON, CDS.
The TU represents the transcriptional unit and is the highest-order component of the gene.
A TU contains multiple gene MODELs only in cases where alternative splicing exists.
A gene MODEL encapsulates all of the coding and non-coding structures of an individual splicing isoform.
Each gene MODEL contains one or more mRNA EXONs, which represent the spliced, intronless portions of the gene.
An mRNA EXON may only partially code for a protein; this is the case where upstream or downstream untranslated
regions exist. The protein-coding portion of an individual EXON is represented by the CDS element. The CDS element
also includes the stop codon. The gene components are not ordered by their coordinates.
Where untranslated regions exist, UTR(s) will be present. UTR(s) represent the non-protein-coding portions
of the mRNA EXON(s). UTRs are not currently a supported TIGR data type outside of this DTD; they exist here only
to facilitate external data analysis.
Each gene component has a coordinate set associated with it (see COORDSET). The following illustration should clarify
the role of each element and its coordinates:
TU        {=============================================================================}
          |                                                                             |
MODEL     |       {============================================================}        |
          |       |                                                            |        |
EXON(s)   {=============}      {========================}      {========================}
          |      ||     |      |                        |      |               ||       |
CDS(s)    |      |{=====}      {========================}      {===============}|       |
          |      |                                                              |       |
UTR(s)    {======}                                                              {=======}
-->
<!ELEMENT PROTEIN_CODING (TU*) >
<!ELEMENT TU (FEAT_NAME, DATE, GENE_INFO, COORDSET, MODEL+, TRANSCRIPT_SEQUENCE) >
<!ELEMENT MODEL (FEAT_NAME, DATE, COORDSET, EXON+, CDS_SEQUENCE?, PROTEIN_SEQUENCE?) >
<!ATTLIST MODEL COMMENT CDATA #IMPLIED >
<!ELEMENT EXON (FEAT_NAME, DATE, COORDSET, CDS?, UTRS?) >
<!ELEMENT CDS (FEAT_NAME, DATE, COORDSET) >
<!--
UTRS lists each UTR (untranslated region) of an exon.
There can be more than one if it is a single-exon gene, e.g.:
            5'                                             3'
EXON:       {===============================================}
CDS :       |   |{============================}|            |
LEFT_UTR:   {===}                              |            |
RIGHT_UTR:                                     {============}
-->
<!ELEMENT UTRS (LEFT_UTR | RIGHT_UTR)* >
<!ELEMENT LEFT_UTR (COORDSET) >
<!ELEMENT RIGHT_UTR (COORDSET) >
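<!--
A minimal illustrative instance of the single-exon case drawn above, on the positive strand. This is only a sketch:
the FEAT_NAME values, the DATE format, and all coordinates are hypothetical placeholders, not real TIGR data.
Here the LEFT_UTR covers 1001 to 1100, the CDS covers 1101 to 1850, and the RIGHT_UTR covers 1851 to 2000, so the
three pieces together tile the whole EXON.

<EXON>
  <FEAT_NAME>example.e1</FEAT_NAME>
  <DATE>02/13/2001</DATE>
  <COORDSET><END5>1001</END5><END3>2000</END3></COORDSET>
  <CDS>
    <FEAT_NAME>example.c1</FEAT_NAME>
    <DATE>02/13/2001</DATE>
    <COORDSET><END5>1101</END5><END3>1850</END3></COORDSET>
  </CDS>
  <UTRS>
    <LEFT_UTR><COORDSET><END5>1001</END5><END3>1100</END3></COORDSET></LEFT_UTR>
    <RIGHT_UTR><COORDSET><END5>1851</END5><END3>2000</END3></COORDSET></RIGHT_UTR>
  </UTRS>
</EXON>
-->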
<!--
Gene Sequences Described:
TRANSCRIPT_SEQUENCE: provides the unspliced genomic nucleotide sequence representing the entire transcribed
region of the gene.
CDS_SEQUENCE: The nucleotide sequence which encodes the protein sequence directly.
PROTEIN_SEQUENCE: the peptide sequence representing the translation of the CDS_SEQUENCE.
-->
<!ELEMENT TRANSCRIPT_SEQUENCE (#PCDATA) >
<!ELEMENT CDS_SEQUENCE (#PCDATA) >
<!ELEMENT PROTEIN_SEQUENCE (#PCDATA) >
<!--
COORDSET contains the child elements END5 and END3 and provides the sequence-based (see ASSEMBLY_SEQUENCE) coordinates for every element
that contains it. The sequence begins at position 1. END5 and END3 represent the exact coordinates of the feature within the
sequence provided, which is given in the positive orientation. If END5 < END3, the feature lies on the positive strand; conversely,
if END5 > END3, the feature lies on the negative strand.
-->
<!ELEMENT COORDSET (END5, END3) >
<!ELEMENT END5 (#PCDATA)>
<!ELEMENT END3 (#PCDATA)>
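<!--
Two illustrative COORDSET instances covering the same span of the assembly sequence. The coordinates are hypothetical
placeholders chosen only to show how strand is read from the order of END5 and END3.

Positive strand (END5 < END3):
<COORDSET><END5>1101</END5><END3>1850</END3></COORDSET>

Negative strand (END5 > END3):
<COORDSET><END5>1850</END5><END3>1101</END3></COORDSET>

Assuming both coordinates are inclusive, which the DTD does not state explicitly, each feature spans
|END3 - END5| + 1 = 750 positions.
-->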
<!--
GENE_INFO contains the gene name, locus, and functional category role assignment information. The LOCUS in many
instances represents the assembly-based (e.g. BAC-based) gene identifier. The PUB_LOCUS represents a publication-based
locus, possibly a chromosomal locus identifier. EC_NUM provides an Enzyme Commission number.
GENE_SYM provides the gene symbol conventionally given by experimentalists, e.g. ADH for alcohol dehydrogenase.
COM_NAME represents the gene name.
-->
<!ELEMENT GENE_INFO (LOCUS, PUB_LOCUS?, COM_NAME, PUB_COMMENT?, EC_NUM?, GENE_SYM?, DATE, ROLE_LIST?, EVIDENCE?) >
<!ELEMENT LOCUS (#PCDATA) >
<!ELEMENT PUB_LOCUS (#PCDATA) >
<!ELEMENT COM_NAME (#PCDATA) >
<!ELEMENT PUB_COMMENT (#PCDATA) >
<!ELEMENT EC_NUM (#PCDATA) >
<!ELEMENT GENE_SYM (#PCDATA) >
<!--
ROLE_LIST contains each of the functional role category assignments for the gene.
COMPARTMENT indicates the role assignment class being used; examples include microbial, plant, GO (gene ontology), etc.
The roles are classifications that become more specific via the SUBROLE_* elements.
-->
<!ELEMENT ROLE_LIST (ROLE_INFO+) >
<!ELEMENT ROLE_INFO (COMPARTMENT, DATE, MAIN_ROLE, SUBROLE_1?, SUBROLE_2?, SUBROLE_3?, SUBROLE_4?) >
<!ELEMENT COMPARTMENT (#PCDATA) >
<!ELEMENT MAIN_ROLE (#PCDATA) >
<!ELEMENT SUBROLE_1 (#PCDATA) >
<!ELEMENT SUBROLE_2 (#PCDATA) >
<!ELEMENT SUBROLE_3 (#PCDATA) >
<!ELEMENT SUBROLE_4 (#PCDATA) >
<!--
EVIDENCE simply provides data indicating the type of evidence that is available
that may support the existence of the corresponding gene.
The attributes are toggles set to 0 or 1 to indicate the presence of that
evidence type.
-->
<!ELEMENT EVIDENCE EMPTY >
<!ATTLIST EVIDENCE GENE_PREDICTIONS NMTOKEN #REQUIRED>
<!ATTLIST EVIDENCE PROTEIN_MATCHES NMTOKEN #REQUIRED >
<!ATTLIST EVIDENCE GENE_INDEX_MATCHES NMTOKEN #REQUIRED >
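<!--
An illustrative EVIDENCE instance. The toggle values are hypothetical: this one would say that gene predictions and
gene index matches support the gene, while no protein matches are available.

<EVIDENCE GENE_PREDICTIONS="1" PROTEIN_MATCHES="0" GENE_INDEX_MATCHES="1"/>

All three attributes are #REQUIRED, so an evidence type that is absent is written as "0" rather than omitted.
-->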
<!--
REPEAT_LIST contains REPEAT elements. A repeat is a repetitive nucleotide sequence, ranging from
simple repeats (AT-rich regions) to complex repeats (retroelements, rRNA sequences). Currently, rRNA
sequences are specified here. Eventually, they will be specified in the RRNA element (see RRNA).
-->
<!ELEMENT REPEAT_LIST (REPEAT*) >
<!ELEMENT REPEAT (FEAT_NAME, DATE, COORDSET, REPEAT_TYPE) >
<!ELEMENT REPEAT_TYPE (#PCDATA) >
<!--
RRNA encompasses ribosomal RNA genes.
-->
<!ELEMENT RRNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
<!--
SNRNA encompasses small nuclear RNA genes.
-->
<!ELEMENT SNRNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
<!--
SNORNA encompasses small nucleolar RNA genes.
-->
<!ELEMENT SNORNA (FEAT_NAME, DATE, COORDSET, COM_NAME) >
<!--
TRNA genes are represented by multiple components. The structure is analogous to that
provided for protein coding genes (see PROTEIN_CODING). The major difference is the lack
of a CDS, since no protein is encoded by tRNA genes.
The analogies are presented as follows:
PRE-TRNA ~ TU
TRNA ~ MODEL
RNA-EXON ~ EXON
-->
<!ELEMENT PRE-TRNA (FEAT_NAME, DATE, COORDSET, TRNA) >
<!ELEMENT TRNA (FEAT_NAME, DATE, COORDSET, COM_NAME, RNA-EXON+) >
<!ATTLIST TRNA ANTICODON NMTOKEN #REQUIRED >
<!ELEMENT RNA-EXON (FEAT_NAME, DATE, COORDSET)>
<!--
TILING_PATH provides all of the information required to position the current ASSEMBLY in the
context of its neighboring ASSEMBLY(s). Each element is described as follows:
ORIENTATION : [+|-] the strand orientation in the pseudo-chromosome.
LEFT_ASMBL : identifies the ASMBL_ID (see ASSEMBLY) of the preceding neighbor in the tiling path.
RIGHT_ASMBL : identifies the ASMBL_ID of the succeeding neighbor in the tiling path.
FROM_CONNECT : [1|0] toggle indicating whether there is a sequence join between the preceding sequence and the current sequence.
TO_CONNECT : [1|0] toggle analogous to FROM_CONNECT, except that it refers to the join between the current assembly and the succeeding one.
FROM_OVERLAP_SIZE : indicates the number of nucleotides by which the current BAC overlaps the preceding BAC.
FROM_OVERHANG_SIZE : indicates the length of the sequence at the start of the current sequence that does not overlap the preceding sequence.
TO_OVERHANG_SIZE : indicates the length of the sequence at the end of the current BAC that does not overlap the succeeding BAC.
Interpretation of this data: the data presented above is essentially a single node in a linked list. To build the
pseudo-chromosome, the first step is to identify the head ASMBL_ID, which should have FROM_OVERLAP_SIZE = 0. From that
element, you can identify the RIGHT_ASMBL that overlaps it.
Once an overlapping assembly is identified, you must first flip the assembly to the proper orientation (+).
Then you can align the assembly to the previous one via the overlap information.
Here's an illustration:
                         ASMBL_ID 23
             \__________________________________/
___________________
                   \
    ASMBL_ID 1
Properties of ASMBL_ID 1: (ORIENTATION = '+', FROM_CONNECT = 0, TO_CONNECT = 1, RIGHT_ASMBL = 23, TO_OVERHANG_SIZE = 50).
Properties of ASMBL_ID 23: (ORIENTATION = '+', FROM_CONNECT = 1, FROM_OVERHANG_SIZE = 120, FROM_OVERLAP_SIZE = 1000, TO_OVERHANG_SIZE = 140).
The FROM_OVERHANG_SIZE indicates the (\) portion of ASMBL_ID 23 in which non-overlapping sequence exists (e.g. untrimmed vector).
The FROM_OVERLAP_SIZE indicates that 1000 nt overlap ASMBL_ID 1. Putting both pieces of information together for ASMBL_ID 23: coordinates 1 to 120
do not overlap, and coordinates 121 to 1120 overlap ASMBL_ID 1.
If N = length(ASMBL_ID 1), then ASMBL_ID 23 overlaps ASMBL_ID 1 between (N-50-1000+1) and (N-50), taking into account
the 50 nt of non-overlapping sequence at the end of ASMBL_ID 1.
If either assembly were in the reverse orientation (ORIENTATION = '-'), the first step would be to reverse complement its sequence. The
remainder of the protocol is identical.
MOST OF THE TIME, the non-overlapping end sequences (the OVERHANG_SIZE values) will be 0, because the assemblies should be trimmed of vector prior
to entering either GenBank or TIGR's annotation database. There may be some exceptions, however, and this specification allows for them.
-->
<!ELEMENT TILING_PATH (LEFT_ASMBL, RIGHT_ASMBL, FROM_CONNECT, TO_CONNECT, ORIENTATION, FROM_OVERLAP_SIZE, FROM_OVERHANG_SIZE, TO_OVERHANG_SIZE, DATE) >
<!ELEMENT LEFT_ASMBL (#PCDATA) >
<!ELEMENT RIGHT_ASMBL (#PCDATA) >
<!ELEMENT FROM_CONNECT (#PCDATA)>
<!ELEMENT ORIENTATION (#PCDATA) >
<!ELEMENT TO_CONNECT (#PCDATA) >
<!ELEMENT FROM_OVERLAP_SIZE (#PCDATA) >
<!ELEMENT FROM_OVERHANG_SIZE (#PCDATA) >
<!ELEMENT TO_OVERHANG_SIZE (#PCDATA) >
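<!--
An illustrative TILING_PATH instance for ASMBL_ID 23 from the illustration above. This is a sketch under stated
assumptions: LEFT_ASMBL, FROM_CONNECT, ORIENTATION, FROM_OVERLAP_SIZE, FROM_OVERHANG_SIZE and TO_OVERHANG_SIZE are
taken from the illustration, while RIGHT_ASMBL (45), TO_CONNECT (1) and the DATE value are hypothetical
placeholders, since the illustration does not give them.

<TILING_PATH>
  <LEFT_ASMBL>1</LEFT_ASMBL>
  <RIGHT_ASMBL>45</RIGHT_ASMBL>
  <FROM_CONNECT>1</FROM_CONNECT>
  <TO_CONNECT>1</TO_CONNECT>
  <ORIENTATION>+</ORIENTATION>
  <FROM_OVERLAP_SIZE>1000</FROM_OVERLAP_SIZE>
  <FROM_OVERHANG_SIZE>120</FROM_OVERHANG_SIZE>
  <TO_OVERHANG_SIZE>140</TO_OVERHANG_SIZE>
  <DATE>02/13/2001</DATE>
</TILING_PATH>

Note that the child elements must appear in exactly the order given by the TILING_PATH declaration.
-->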
<!--
SCAFFOLD is composed of SCAFFOLD_COMPONENT(s). Each SCAFFOLD_COMPONENT indicates the portion of a given nucleotide
assembly (e.g. a BAC) from which a segment of the pseudo-chromosome was constructed. By joining each of the SCAFFOLD_COMPONENT(s),
the entire pseudo-chromosome nucleotide sequence can be constructed.
-->
<!ELEMENT SCAFFOLD (SCAFFOLD_COMPONENT+) >
<!ELEMENT SCAFFOLD_COMPONENT (ASMBL_ID, CHR_LEFT_COORD, CHR_RIGHT_COORD, ASMBL_LEFT_COORD, ASMBL_RIGHT_COORD, ORIENTATION, DATE) >
<!ELEMENT CHR_LEFT_COORD (#PCDATA) >
<!ELEMENT CHR_RIGHT_COORD (#PCDATA) >
<!ELEMENT ASMBL_LEFT_COORD (#PCDATA) >
<!ELEMENT ASMBL_RIGHT_COORD (#PCDATA) >
<!--
MISC_INFO is the component in which we can store any comments regarding the ASSEMBLY. The FEATURE_DESC element
contains the feature description text, and a COORDSET element identifies the position the comment is referring to.
-->
<!ELEMENT MISC_INFO ( MISC_FEATURE+ ) >
<!ELEMENT MISC_FEATURE (COORDSET, DATE, FEATURE_DESC) >
<!ELEMENT FEATURE_DESC (#PCDATA) >
<!--
The HEADER element contains some basic attributes of the nucleotide assembly, including the identity of the
organism from which it was derived, the lineage, the group that sequenced the assembly, and information that is
provided to GenBank within TIGR's annotation submissions.
-->
<!ELEMENT HEADER ( CLONE_NAME, GB_ACCESSION, ORGANISM, LINEAGE, SEQ_GROUP, KEYWORDS*, GB_DESCRIPTION*, GB_COMMENT*, AUTHOR_LIST ) >
<!ELEMENT CLONE_NAME (#PCDATA) >
<!ELEMENT GB_ACCESSION (#PCDATA) >
<!ELEMENT ORGANISM (#PCDATA) >
<!ELEMENT LINEAGE (#PCDATA) >
<!ELEMENT SEQ_GROUP (#PCDATA)>
<!ELEMENT KEYWORDS (#PCDATA) >
<!ELEMENT GB_DESCRIPTION (#PCDATA) >
<!ELEMENT GB_COMMENT (#PCDATA) >
<!ELEMENT AUTHOR_LIST ( AUTHOR*) >
<!ATTLIST AUTHOR_LIST CONTACT CDATA #IMPLIED >
<!ELEMENT AUTHOR EMPTY>
<!ATTLIST AUTHOR FNAME CDATA #IMPLIED >
<!ATTLIST AUTHOR LNAME CDATA #REQUIRED >
<!ATTLIST AUTHOR MNAME CDATA #IMPLIED >
<!ATTLIST AUTHOR SUFFIX CDATA #IMPLIED >
<!--
ASSEMBLY_SEQUENCE contains the entire nucleotide sequence of the ASSEMBLY. The sequence begins at position 1 in our coordinate space and is
assumed to exist in the positive strand orientation. No whitespace should interrupt the sequence; it should exist
as one loooooooong string.
-->
<!ELEMENT ASSEMBLY_SEQUENCE (#PCDATA) >