XML stands for Extensible Markup Language. It is a way of structuring data of any type as plain text, so that both people and programs can read it. XML is a simplified subset of SGML (Standard Generalized Markup Language), and HTML was originally defined as an application of SGML, so the two are related but not completely congruent with each other. XHTML is an attempt to join the two, and XHTML documents can be viewed with newer web browsers.
XML uses tags like HTML, except that the tags are all user-defined, with no standard or general meanings outside of the XML document. The formal specification of XML is maintained by the World Wide Web Consortium (W3C). XML was formalized in 1998, and the current accepted version is 1.0. Version 1.1 is still being reviewed and tested (as of April 2003).
The user-defined tags can be defined in 2 ways, using a DTD (document type definition) or using an XML Schema. Below I will describe the first approach. The DTD tag definitions can be placed within the XML file itself or in a separate file that is referenced in the XML file. However, first it is necessary to describe the XML document itself.
The most obvious feature of an XML file is the presence of tags, words enclosed by < and > characters.
Attributes are name-value pairs in the form: name="value". The value must always be quoted (single or double quotes). Attributes are found within the beginning tag and are separated from each other by spaces. Each attribute name can be used only once within a given tag. For example: <hr align="center" size="3" width="50%" /> (note the closing slash, which XML requires for a tag that has no separate ending tag).
Entities are groups of symbols that get interpreted as separate characters by the XML parser. A well-known HTML example is the non-breaking space entity, &nbsp;. (You might want to view the source code of this page to see how I got "&nbsp;" to print and not appear as a space.) XML allows the definition of special entities.
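To see entity replacement in action, here is a minimal Python sketch (Python's standard library rather than the Perl tools used later in these notes). The entity name "lab" and its replacement text are made up for the illustration; &lt;, &gt; and &amp; are entities that XML predefines.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "lab" is declared in an internal DTD,
# and its name and replacement text are invented for this example.
DOC = (
    '<!DOCTYPE note [ <!ENTITY lab "Genomics Lab"> ]>'
    '<note>&lab; &lt;room 12&gt; &amp; annex</note>'
)

def entity_text(xml_text):
    """Return the root element's text after the parser has expanded entities."""
    return ET.fromstring(xml_text).text

if __name__ == "__main__":
    print(entity_text(DOC))
```

The parser hands the program the replaced characters, so the program never sees the entity names themselves.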
XML files are organized in elements. An element is everything from the start of the beginning tag to the end of the ending tag, including the tags themselves, the attributes found in the beginning tag, and whatever text (content) lies between the tags.
XML documents should start with the line: <?xml version="1.0" ?>. This is the XML declaration, not an element, so it doesn't need an ending tag. If the document is to be checked against a DTD, a line pointing to the DTD comes next: the line <!DOCTYPE note SYSTEM "note.dtd"> points to an external DTD file "note.dtd", with the root element "note".
All tags must be properly nested: <p><b><i>Some text</i></b></p> is correctly nested. In a properly nested file you exit from tags in the reverse order in which you entered them. Improperly nested tags, such as <p><b><i>Some text</p></b></i> will not work (even though browsers usually tolerate this in HTML).
Each document has a root element, a unique tag that is the parent of all the other tags. The document's content begins with the beginning root tag and ends with the ending root tag. The <?xml ... and <!DOCTYPE ... declarations must precede the root tag; they form the prolog of the document and are not part of its element tree.
Tags are in parent-child relationships: each tag (except the root tag) is the child of some other tag. Child tags are nested within parent tags. Each individual element has exactly one parent, but the same tag name can be used as the child of several different parent tags.
Comments are written as in HTML: between <!-- and -->.
White space is preserved in XML: the parser passes it through to your program unchanged. In contrast, HTML browsers collapse runs of white space into a single space.
A "well-formed" XML document conforms to the W3C specification mentioned above. Mostly it means having all elements within matching tags, using nested tags starting with a single root tag, using each attribute only once within a given tag, and quoting the values of the attributes. A "valid" XML document is a well-formed document that also conforms to its DTD.
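The well-formedness rules above can be tried out with a small Python sketch. It only asks the standard-library parser whether it accepts a string; it does not check validity against a DTD, which Python's standard library does not do. The function name and the sample strings are mine, chosen to match the rules just listed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One sample per rule: only the first one is well-formed.
samples = {
    "properly nested": "<p><b><i>Some text</i></b></p>",
    "improperly nested": "<p><b><i>Some text</p></b></i>",
    "attribute used twice": '<hr size="3" size="4" />',
    "unquoted attribute value": "<hr size=3 />",
    "two root elements": "<a></a><b></b>",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label, "->", is_well_formed(text))
```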
In general it is possible to describe data equally well with child tags or with attributes. Child tags are probably the better approach, as they are easier to parse and easier to extend. After all, the purpose of XML is to produce a nice structure for the data; attributes rather defeat this purpose.
There are 2 ways to have a DTD, as a section internal to the XML document or as an external file. As an internal section, the DTD looks like: <!DOCTYPE root_element_name [ declarations ]>. Here is a small example:
<?xml version="1.0" ?>
<!DOCTYPE memo [
<!ELEMENT memo (date, from, to, re)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT re (#PCDATA)>
]>
<memo>
<date>April 10, 2003</date>
<from>Fred Hampton</from>
<to>Bobby Seale</to>
<re>Free breakfast</re>
</memo>
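As a quick check that the memo example is well-formed, here is a hedged Python sketch that reads it and lists the child elements of the root. Python's ElementTree reads past the internal DTD but does not validate against it, so this shows only the element structure.

```python
import xml.etree.ElementTree as ET

# The memo document from the example above, with its internal DTD.
MEMO = """<?xml version="1.0" ?>
<!DOCTYPE memo [
<!ELEMENT memo (date, from, to, re)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT re (#PCDATA)>
]>
<memo>
<date>April 10, 2003</date>
<from>Fred Hampton</from>
<to>Bobby Seale</to>
<re>Free breakfast</re>
</memo>"""

def memo_fields(xml_text):
    """Return (tag, text) for each child of the root element, in document order."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]

if __name__ == "__main__":
    for tag, text in memo_fields(MEMO):
        print(tag, ":", text)
```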
External DTDs need to be referenced in the second line of the XML file, using the keyword "SYSTEM":
<!DOCTYPE root_element_name SYSTEM "quoted file path and name">. You can also use an internet URL for the file name.
The element and attribute declarations can appear in any order in the file, but please remember that we humans prefer logical order.
There are 3 main elements in DTD declarations: comments, element declarations, and attribute declarations. Comments are created just as in HTML, starting with <!-- and ending with -->. Comments are used to make sense of the document. The declarations give its structure, but the comments make it into something humans can examine and use. Be kind and comment liberally!
Element declarations look like: <!ELEMENT element_name (contents or child element names)>. If an element contains data only, and no child elements, the keyword "#PCDATA" is used within the parentheses. PCDATA stands for "parsed character data", which means that entities will be converted into the proper symbols. "CDATA" (character data, without the "#") is taken as-is and not parsed, but it is not used in element declarations: it appears as an attribute type and in <![CDATA[ ... ]]> sections within a document. Most of the elements shown above with the internal DTD are data-only elements containing PCDATA.
If an element has child elements, they must be listed in the element declaration. Grandchildren, etc. do not need to be listed. Most commonly, the child elements are separated by commas. This implies that each one of these elements appears exactly once, in the listed order. For example: <!ELEMENT memo (date, from, to, re)>. Choices as to the number of times an element appears can be made using the same quantifiers as in regular expressions: ? (zero or one time), * (zero or more times), and + (one or more times), written directly after the element name.
An empty element, such as the HTML <br /> element, is declared using the keyword EMPTY: <!ELEMENT br EMPTY>. Such elements are rarely used.
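Because these quantifiers behave like their regular-expression counterparts, a flat content model can be turned into a regular expression and matched against the list of child names. The Python sketch below does this under strong assumptions: it handles only simple comma-separated sequences, not nested groups or "|" choices, and the function names are mine. A real validating parser does far more.

```python
import re

def model_to_regex(model):
    """Turn a flat content model like "(date, from, to*, re?)" into a regex.

    Assumption: the model is a simple comma-separated sequence; nested
    groups and "|" choices are outside the scope of this sketch.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        quantifier = item[-1] if item[-1] in "?*+" else ""
        name = item.rstrip("?*+")
        # Each child name is followed by one space in the sequence string.
        parts.append("(?:" + re.escape(name) + " )" + quantifier)
    return re.compile("".join(parts))

def children_match(model, child_names):
    """Check whether child_names (in document order) satisfy the content model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(sequence) is not None

if __name__ == "__main__":
    print(children_match("(date, from, to, re)", ["date", "from", "to", "re"]))
    print(children_match("(date, from, to, re)", ["from", "date", "to", "re"]))
```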
Attributes of particular elements, name="value" pairs within the starting tag of the element, are declared with the syntax: <!ATTLIST element_name attribute_name attribute_type default_value>.
Here are a few attribute types:
The default value of an attribute can be a literal quoted value like "0", or it can be a keyword, usually #REQUIRED, meaning that this attribute must be listed in the XML document, or #IMPLIED, meaning that this attribute is optional.
Here are a few attribute declarations:
Simple tutorial from W3schools
When first faced with an XML document, one naturally thinks of parsing it line-by-line, using regular expressions. Unfortunately, lines in XML documents don't necessarily correspond to useful boundaries in the data being described. Also, regular expressions are notoriously bad at dealing with nested data structures.
There are two rather different approaches to parsing XML documents. One is called "tree-based". Tree-based methods end up loading the entire XML document into memory as one very large nested data structure, a hash of hashes of hashes of hashes, etc. This approach doesn't work well on our limited computer resources with large objects such as chromosome descriptions: we run out of memory long before the final structure appears.
The other approach is "stream-based" or "event-driven", which reads through the file, reporting information about each tag as it appears in the document. We will use this approach, which is implemented in the XML::Parser module. So, you will need a "use XML::Parser;" statement at the top of your source code.
The other element of XML parsing that is necessary is a way to keep track of paired opening and closing tags. XML::Parser only reports each tag as it is encountered, and it does not pair them. To do the pairing we will use a "stack", a common data structure. Stacks are arrays with two main operations: push and pop. You add items to the end (top) of the stack with push, and you remove them with pop. A stack works like one of those spring-loaded stacks of cafeteria trays. If you push each opening tag onto the stack and pop one item off the stack at each closing tag, you will see that tags are automatically paired with each other. In addition, looking further down the stack will show you which tags you are nested inside. Stacks are a natural way to deal with any kind of tree or nested data structure.
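The push-and-pop pairing described above can be sketched in a few lines of Python, with no parser involved. The list of events is written out by hand here, standing in for what a stream parser would report; the function name and the event format are assumptions of this sketch.

```python
def pair_tags(events):
    """Pair start and end tags using a stack.

    events is a list of (kind, name) tuples, kind being "start" or "end".
    Returns (opened_name, closed_name, enclosing_tags) for each end tag.
    """
    stack = []
    pairs = []
    for kind, name in events:
        if kind == "start":
            stack.append(name)        # push the opening tag
        else:
            opened = stack.pop()      # pop: the most recently opened tag
            pairs.append((opened, name, list(stack)))
    return pairs

# Hand-written events for <memo><date></date><to></to></memo>
EVENTS = [
    ("start", "memo"), ("start", "date"), ("end", "date"),
    ("start", "to"), ("end", "to"), ("end", "memo"),
]

if __name__ == "__main__":
    for opened, closed, enclosing in pair_tags(EVENTS):
        print(opened, closed, enclosing)
```

In each result the popped name equals the closing tag's name, and whatever is left on the stack is the list of tags you are still nested inside.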
We want to develop a program that will take a large XML document describing the genes on a single BAC (bacterial artificial chromosome, used to sequence a small portion of an entire chromosome), and output a list of the genes along with some of their attributes. As a start to this, we will develop a program that lists the number of times each tag appears, with tags "fully-qualified" with a list of all the tags they are nested inside. The output of this program will allow us to decide which attributes of the genes we want to put in the final list.
XML::Parser is based on the "expat" library of James Clark. expat is a C library that has been installed on biolinx. XML::Parser is an object-oriented module, with documentation that can be found by typing "perldoc XML::Parser" on the command line, or online (for example on CPAN). It has several "styles", or modes of action. We will use the "Stream" style, which is selected when we create our parser object with the "new" constructor.
The Stream style goes through the document and calls specific subroutines when it finds different items. The main subroutines are: StartTag, Text, and EndTag. Each time a start tag is encountered, StartTag is called, etc. There are also StartDocument and EndDocument subroutines, which are only called once each, at the beginning and end of the document being parsed.
You need to define the StartTag, Text, and EndTag subroutines: what happens when these items are encountered. That is, your program has an area that starts with "sub StartTag {" and goes on to define this subroutine.
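The subroutines come with some built-in variables. In the Stream style, $_ holds the current tag (in StartTag and EndTag) or the accumulated text (in Text), %_ holds the attributes of a start tag, and the element name is passed as the second argument to StartTag and EndTag.
Define a stack array. Push each start tag's name onto it, and pop that tag name off at each end tag. Recall that array index -1 is the last element of the array, -2 is the next-to-last, etc. Knowledge of what you are nested inside is very important, because some tags are used in several different places.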
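For comparison, here is the same event-driven idea sketched in Python, whose standard library also wraps expat. This is not XML::Parser: the handler attributes (StartElementHandler, CharacterDataHandler, EndElementHandler) belong to Python's xml.parsers.expat module, and the names start_tag, text and end_tag are simply chosen to echo the Perl subroutines.

```python
import xml.parsers.expat

def list_events(xml_text):
    """Parse xml_text and return the events in the order the parser reports them."""
    events = []

    def start_tag(name, attributes):
        events.append(("start", name, dict(attributes)))

    def text(data):
        if data.strip():              # ignore whitespace-only text
            events.append(("text", data))

    def end_tag(name):
        events.append(("end", name))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True         # deliver each run of text in one piece
    parser.StartElementHandler = start_tag
    parser.CharacterDataHandler = text
    parser.EndElementHandler = end_tag
    parser.Parse(xml_text, True)      # True: this is the final chunk of input
    return events

if __name__ == "__main__":
    for event in list_events('<memo priority="high"><to>Bobby Seale</to></memo>'):
        print(event)
```

Note that the parser only reports each tag as it goes by; nothing here pairs a start tag with its end tag.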
To get the parsing to actually occur, you need to call $parser->parsefile("your_xml_file_name"); on your parser object. And of course, you need to print out your results. The file xml_tag_parser.pl, which is in /home/bios546, shows how this is put together.
Exercise: parse the F6N15.xml file to produce a list of transcription units (TU) with their 5' and 3' coordinates (in COORD_SET), LOCUS (locus name based on BAC position), PUB_LOCUS (standard Arabidopsis locus name), COM_NAME (common name), number of exons, positions of all exons, and the CDS sequence (sequence that is translated into protein). Look at the XML file itself, the DTD file, and the list of tags generated by xml_tag_parser.pl to see how all of these things are arranged.
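To show how the pieces fit together, here is a rough Python sketch of a fully-qualified tag counter. It is not the contents of xml_tag_parser.pl, which these notes do not reproduce; it uses Python's xml.parsers.expat in place of XML::Parser, joins nested tag names with "/" (my choice of separator), and runs on a tiny made-up sample in which the EXON tag name is a guess for illustration.

```python
import xml.parsers.expat
from collections import Counter

def count_qualified_tags(xml_text):
    """Count each tag, qualified by the tags it is nested inside."""
    stack = []
    counts = Counter()

    def start_tag(name, attributes):
        stack.append(name)                  # push on every start tag
        counts["/".join(stack)] += 1        # e.g. "TU/EXON/COORD_SET"

    def end_tag(name):
        stack.pop()                         # pop on every end tag

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_tag
    parser.EndElementHandler = end_tag
    parser.Parse(xml_text, True)
    return counts

# Made-up sample; TU and COORD_SET appear in the exercise, EXON is hypothetical.
SAMPLE = (
    "<TU><COORD_SET/>"
    "<EXON><COORD_SET/></EXON>"
    "<EXON><COORD_SET/></EXON>"
    "</TU>"
)

if __name__ == "__main__":
    for path, number in sorted(count_qualified_tags(SAMPLE).items()):
        print(number, path)
```

The sample shows why the qualification matters: COORD_SET appears both directly under TU and under EXON, and the two uses are counted separately. For a large file you would hand the parser an open file (Python's expat parser has a ParseFile method for this) instead of a string.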