XML Parsing In C++
This page is extracted from the full Doxygen documentation, which is why the links don't work. To see the full documentation, download and unpack the source package, ensure you have Doxygen and Graphviz installed, and type make doc.
The UPnP A/V standards use XML heavily; the Chorale implementation needs parsers for several, mostly small, XML schemas.

Initially these all used a SAX-style parsing scheme implemented in xml::SaxParser, requiring a (usually quite wordy) custom parser observer for each use. But the problem seemed to be calling out for a DSEL ("domain-specific embedded language"); that is, for some template-fu that made declaring individual little parsers easy.

The solution adopted in Chorale is to use table-based parsers, and to simplify declaring such table-based parsers by generating them as static data of classes built from nested templates. For instance, parsing the "src" attribute out of XML that looks like this:

<wpl><media src="/home/bob/music/Abba/Fernando.flac">bar</media></wpl>

can be done as follows:

class WPLReader
{
public:
    unsigned int Parse(StreamPtr stm);

    unsigned int OnMediaSrc(const std::string& s)
    {
        TRACE << "src=" << s << "\n";
        return 0;
    }
};

extern const char WPL[] = "wpl";
extern const char MEDIA[] = "media";
extern const char SRC[] = "src";

typedef xml::Parser<xml::Tag<WPL,
                             xml::Tag<MEDIA,
                                      xml::Attribute<SRC, WPLReader,
                                                     &WPLReader::OnMediaSrc>
> > > WPLParser;

unsigned int WPLReader::Parse(StreamPtr stm)
{
    WPLParser parser;
    return parser.Parse(stm, this);
}

...where the nesting of the "xml::" templates mirrors the nesting of the tags in the XML we're trying to parse. The typedef declares a class type WPLParser, with the single method Parse, which takes a StreamPtr (the usual abstraction for a byte stream in Chorale) and a pointer to the target object: the object which receives callbacks and data from the parser. The unsigned int returned from these functions is the usual Chorale way of indicating an error: 0 means successful completion, and any other value is an error code from <errno.h>.

(As is often the case with DSELs in C++, some technicalities leak out into the user-experience: in this case, string literals are not allowed as template parameters, and nor are objects with internal linkage, so we must declare extern objects corresponding to each of the strings we want to look for. Also, in the invocation of xml::Attribute, C++ isn't able to deduce the type "WPLReader" from the method pointer "&WPLReader::OnMediaSrc", or vice versa, so "WPLReader" must be specified twice.)
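The two technicalities above can be seen in isolation in a small sketch. It is not the Chorale implementation: AttributeSketch, ReaderSketch, SRC_NAME and DemoDeliver are hypothetical names invented for illustration, and the real xml::Attribute may be structured quite differently.

```cpp
#include <cassert>
#include <string>

// External linkage lets the array's address serve as a template argument.
extern const char SRC_NAME[] = "src";

// Hypothetical target class, standing in for WPLReader.
struct ReaderSketch
{
    std::string last_src;

    unsigned int OnSrc(const std::string& s)
    {
        last_src = s;
        return 0; // 0 means success, following the Chorale convention
    }
};

// Hypothetical stand-in for xml::Attribute. Target is a parameter of its
// own because it must be known before Callback's type can be written down.
template <const char* Name, class Target,
          unsigned int (Target::*Callback)(const std::string&)>
struct AttributeSketch
{
    static const char* GetName() { return Name; }

    static unsigned int Deliver(Target* target, const std::string& value)
    {
        return (target->*Callback)(value);
    }
};

typedef AttributeSketch<SRC_NAME, ReaderSketch, &ReaderSketch::OnSrc>
    SrcAttributeSketch;

// Delivers one attribute value and returns what the callback stored.
std::string DemoDeliver(const std::string& value)
{
    ReaderSketch reader;
    unsigned int rc = SrcAttributeSketch::Deliver(&reader, value);
    assert(rc == 0);
    (void)rc;
    return reader.last_src;
}
```

Because the member-function pointer's type mentions Target, the class has to be named once as a type parameter and once more inside the pointer expression, which is the repetition noted above.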

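The relevant templates, as used in the examples on this page, include xml::Parser, xml::Tag, xml::Attribute, xml::TagMember, xml::Structure and xml::List.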

Here is a more complex example demonstrating the use of xml::Structure and xml::List. It's based on the parser for UPnP device description documents in libupnp/description.cpp.

struct Service
{
    std::string type;
    std::string id;
    std::string control;
    std::string event;
    std::string scpd;
};

struct Device
{
    std::string type;
    std::string friendly_name;
    std::string udn;
    std::string presentation_url;
    std::list<Service> services;
};

struct Description
{
    std::string url_base;
    Device root_device;
};

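// ROOT, URLBASE, DEVICE and the other tag names are extern const char
// arrays, declared in the same way as WPL, MEDIA and SRC above.
typedef xml::Parser<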
    xml::Tag<ROOT,
             xml::TagMember<URLBASE, Description,
                            &Description::url_base>,
             xml::Structure<DEVICE, Device,
                            Description, &Description::root_device,
                            xml::TagMember<DEVICETYPE, Device,
                                           &Device::type>,
                            xml::TagMember<FRIENDLYNAME, Device,
                                           &Device::friendly_name>,
                            xml::TagMember<UDN, Device,
                                           &Device::udn>,
                            xml::TagMember<PRESENTATIONURL, Device,
                                           &Device::presentation_url>,
                            xml::Tag<SERVICELIST,
                                     xml::List<SERVICE, Service,
                                               Device, &Device::services,
                                               xml::TagMember<SERVICETYPE, Service,
                                                              &Service::type>,
                                               xml::TagMember<SERVICEID, Service,
                                                              &Service::id>,
                                               xml::TagMember<CONTROLURL, Service,
                                                              &Service::control>,
                                               xml::TagMember<EVENTSUBURL, Service,
                                                              &Service::event>,
                                               xml::TagMember<SCPDURL, Service,
                                                              &Service::scpd>
> > > > > DescriptionParser;

As always, the nested form of the parser corresponds to the nested form of the XML:

<root xmlns="urn:schemas-upnp-org:device-1-0">
  <URLBase>http://192.168.168.1:49152/</URLBase>
  <device>
    <deviceType>urn:schemas-upnp-org:device:MediaServer:1</deviceType>
    <friendlyName>/media/mp3audio on jalfrezi</friendlyName>
    <UDN>uuid:726f6863-2065-6c61-00de-df6268fff5a0</UDN>
    <presentationURL>http://192.168.168.1:12078/</presentationURL>
    <serviceList>
      <service>
        <serviceType>urn:schemas-upnp-org:service:ContentDirectory:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:ContentDirectory</serviceId>
        <SCPDURL>http://192.168.168.1:12078/upnp/ContentDirectory.xml</SCPDURL>
        <controlURL>/upnpcontrol0</controlURL>
        <eventSubURL>/upnpevent0</eventSubURL>
      </service>
      <service>
        <serviceType>urn:schemas-upnp-org:service:HornSwoggler:1</serviceType>
        <serviceId>urn:upnp-org:serviceId:HornSwoggler</serviceId>
        <SCPDURL>http://192.168.168.1:12078/upnp/HornSwoggler.xml</SCPDURL>
        <controlURL>/upnpcontrol1</controlURL>
        <eventSubURL>/upnpevent1</eventSubURL>
      </service>
    </serviceList>
  </device>
</root>

Calling DescriptionParser::Parse, passing the above XML and a Description object as the target, would end up filling in the url_base member, the root_device structure, and a two-element list in root_device.services.

In the above example, the order of the elements in the XML happens to correspond to the ordering in the parser declaration. In general, this won't be the case: xml::List preserves the ordering of its child elements, but the other templates don't. Parsers built using these classes can't be used where preserving the order of heterogeneous tags is required: for instance, in XHTML.

In each case, the target object of all sibling templates must be the same. All child templates of an xml::Structure must use the structure type as the type of their target object; all child templates of an xml::List must use the list-element type; and all other child templates must use their parent template's type. (The root template, xml::Parser, must use the type of the target object passed to xml::Parser::Parse.) These requirements are enforced with compile-time assertions -- although, as seems unavoidable with C++ DSELs, it's hard to see the wood for the trees in the resulting error messages if you get it wrong. Errors involving "AssertSame" or "AssertCorrectTargetType" usually mean that target types have been muddled somewhere.
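One way such a compile-time check can be written is sketched below. This is an assumption about the general technique, not a copy of Chorale's code: AssertSameSketch, MemberSketch, StructureSketch and the other names are hypothetical, and the real "AssertSame" may be implemented differently.

```cpp
#include <cassert>
#include <string>

// Only the <T, T> specialisation is defined, so naming ::value for two
// different types is a compile-time error (use of an incomplete type).
template <class A, class B> struct AssertSameSketch;
template <class T> struct AssertSameSketch<T, T> { enum { value = 1 }; };

struct DeviceSketch { std::string type; };
struct ServiceSketch { std::string type; };

// Hypothetical child template: records which target type it writes to.
template <class Target, std::string Target::*Member>
struct MemberSketch
{
    typedef Target TargetType;
};

// Hypothetical parent template: requires its child to share its target.
template <class Target, class Child>
struct StructureSketch
{
    typedef Target TargetType;
    enum {
        checked = AssertSameSketch<Target, typename Child::TargetType>::value
    };
};

typedef StructureSketch<DeviceSketch,
                        MemberSketch<DeviceSketch, &DeviceSketch::type> >
    GoodSketch;

// A muddled declaration such as
//   StructureSketch<DeviceSketch,
//                   MemberSketch<ServiceSketch, &ServiceSketch::type> >
// fails to compile once ::checked is used, and the error message names
// AssertSameSketch<DeviceSketch, ServiceSketch>.

// Forces instantiation of the check and returns its value.
int CheckedValue()
{
    int v = GoodSketch::checked;
    assert(v == 1);
    return v;
}
```

With a C++11 or later compiler, static_assert combined with std::is_same would give a more readable diagnostic; the form above is the kind of idiom available without it.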

Implementation

The generated XML parsers are extremely compact, both in static and dynamic memory usage. The actual parsing is done by the xml::SaxParser class, via an adaptor (xml::internals::TableDrivenParser) which follows tables (xml::internals::Data) telling it what to do on encountering the various tags.
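The idea of an adaptor that follows tables in response to SAX-style events can be sketched as follows. The Entry layout, the Walker class and its OnBegin/OnText/OnEnd methods are all assumptions made for illustration; they are not the actual xml::internals::Data or xml::internals::TableDrivenParser interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical table entry: one per tag the parser knows about.
struct Entry
{
    const char* name;                                   // tag name to match
    unsigned int (*on_text)(void*, const std::string&); // may be NULL
    std::size_t n_children;
    const Entry* children;
};

// Hypothetical adaptor: keeps a stack of table entries, one per open tag.
// A NULL on the stack marks a tag that the tables don't mention.
class Walker
{
    void* m_target;
    std::vector<const Entry*> m_stack;

public:
    Walker(const Entry* root, void* target) : m_target(target)
    {
        m_stack.push_back(root);
    }

    void OnBegin(const char* tag)
    {
        const Entry* top = m_stack.back();
        const Entry* found = NULL;
        for (std::size_t i = 0; top && i < top->n_children; ++i)
            if (std::strcmp(top->children[i].name, tag) == 0)
                found = &top->children[i];
        m_stack.push_back(found);
    }

    unsigned int OnText(const std::string& text)
    {
        const Entry* top = m_stack.back();
        return (top && top->on_text) ? top->on_text(m_target, text) : 0;
    }

    void OnEnd()
    {
        if (m_stack.size() > 1)
            m_stack.pop_back();
    }
};

struct DescriptionSketch { std::string url_base; };

static unsigned int SetUrlBase(void* target, const std::string& text)
{
    static_cast<DescriptionSketch*>(target)->url_base = text;
    return 0;
}

// Hand-written here; in the scheme described above, templates generate these.
static const Entry kRootChildren[] = { { "URLBase", &SetUrlBase, 0, NULL } };
static const Entry kTopLevel[] = { { "root", NULL, 1, kRootChildren } };
static const Entry kDocument = { "", NULL, 1, kTopLevel };

// Replays the events a SAX parser would emit for a tiny document.
std::string DemoWalk()
{
    DescriptionSketch d;
    Walker w(&kDocument, &d);
    w.OnBegin("root");
    w.OnBegin("unknown"); w.OnText("ignored"); w.OnEnd();
    w.OnBegin("URLBase"); w.OnText("http://192.168.168.1:49152/"); w.OnEnd();
    w.OnEnd();
    assert(!d.url_base.empty());
    return d.url_base;
}
```

In this sketch, tags that the tables don't mention are skipped silently. Whether Chorale does the same is not stated here.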

Each template invocation corresponds to one table entry, which on i686-linux is 20 bytes (plus the char* storage for the tag name). Tables are const, and so end up in the rodata segment. The size of an xml::Parser is also very small (1 byte), as all the other templates only have their static data referenced -- they aren't actually instantiated anywhere.
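The following sketch shows how nested templates can own static const table entries while the parser type itself stays empty. EntrySketch, LeafSketch, TagSketch and ParserSketch are hypothetical, and the entry layout is much simpler than the 20-byte one described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified table entry.
struct EntrySketch
{
    const char* name;
    std::size_t n_children;
    const EntrySketch* children;
};

extern const char WPL_NAME[] = "wpl";
extern const char MEDIA_NAME[] = "media";

// A tag with no children: one static const entry, no instance data.
template <const char* Name>
struct LeafSketch
{
    static const EntrySketch entry;
};
template <const char* Name>
const EntrySketch LeafSketch<Name>::entry = { Name, 0, NULL };

// A tag with one child: its entry points at the child's static entry.
template <const char* Name, class Child>
struct TagSketch
{
    static const EntrySketch entry;
};
template <const char* Name, class Child>
const EntrySketch TagSketch<Name, Child>::entry = { Name, 1, &Child::entry };

// The parser type only refers to the root entry; it has no data members.
template <class Root>
struct ParserSketch
{
    static const EntrySketch* Table() { return &Root::entry; }
};

typedef ParserSketch<TagSketch<WPL_NAME, LeafSketch<MEDIA_NAME> > > WplSketch;

// Follows the table from the root entry to its only child.
std::string DemoChildName()
{
    const EntrySketch* root = WplSketch::Table();
    assert(root->n_children == 1);
    return root->children->name;
}
```

No TagSketch or LeafSketch object is ever created; only their static entries are referenced. An empty class such as WplSketch typically has sizeof equal to 1 on common compilers, and entries initialised from address constants like these are usually placed in read-only data, though both details are up to the implementation.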

This diagram depicts some of the tables created for the device-description example parser above:

[Diagram not reproduced here: inline_dotgraph_2.dot]

The diagram is a slight simplification: to achieve type erasure, the tables don't actually contain the pointers-to-members shown above, but instead pointers to functions (such as TagMember::OnText) that cast the void* target back to its real type and reference the correct member.
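A minimal sketch of that type-erasure step follows. TagMemberSketch, TextFn and ServiceSketch are hypothetical names, and the signature of the real TagMember::OnText is an assumption.

```cpp
#include <cassert>
#include <string>

// Every generated function has this one signature, whatever the target
// type, so a non-template table can hold pointers to any of them.
typedef unsigned int (*TextFn)(void* target, const std::string& text);

// Hypothetical stand-in for xml::TagMember: the pointer-to-member is a
// template argument, so it is baked into OnText rather than stored.
template <class Target, std::string Target::*Member>
struct TagMemberSketch
{
    static unsigned int OnText(void* target, const std::string& text)
    {
        static_cast<Target*>(target)->*Member = text;
        return 0;
    }
};

struct ServiceSketch
{
    std::string type;
    std::string id;
};

// Two different members, one common function-pointer type.
static const TextFn kServiceFns[] = {
    &TagMemberSketch<ServiceSketch, &ServiceSketch::type>::OnText,
    &TagMemberSketch<ServiceSketch, &ServiceSketch::id>::OnText
};

// Fills both members through the type-erased table.
ServiceSketch DemoFill()
{
    ServiceSketch s;
    unsigned int rc = kServiceFns[0](&s, "urn:upnp-org:example:type");
    assert(rc == 0);
    rc = kServiceFns[1](&s, "urn:upnp-org:example:id");
    assert(rc == 0);
    (void)rc;
    return s;
}
```

The caller only ever sees void* and TextFn. Type safety rests on the table having been generated from templates that agree on the target type.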

Tables up to eight entries wide are supported; due to the lack of array or ellipsis ("...") support in template parameters, the tables are sized explicitly, using the dummy class NullSelector to signify absent entries, and then specialising the table type (xml::internals::Data) on the number of non-NullSelector entries present.
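The sizing trick can be sketched at a width of three instead of eight. NullSelectorSketch, IsPresentSketch, DataSketch, TableSketch and the selector classes are hypothetical. The sketch also assumes that absent entries always trail the present ones.

```cpp
#include <cassert>
#include <string>

// Dummy class marking an unused slot.
struct NullSelectorSketch {};

template <class T>
struct IsPresentSketch { static const unsigned int value = 1; };
template <>
struct IsPresentSketch<NullSelectorSketch> { static const unsigned int value = 0; };

typedef const char* (*NameFn)();

// Primary template left undefined; one specialisation per table width.
template <unsigned int N, class S0, class S1, class S2> struct DataSketch;

template <class S0, class S1, class S2>
struct DataSketch<1, S0, S1, S2>
{
    enum { size = 1 };
    static const NameFn* Entries()
    {
        static const NameFn entries[1] = { &S0::Name };
        return entries;
    }
};

template <class S0, class S1, class S2>
struct DataSketch<2, S0, S1, S2>
{
    enum { size = 2 };
    static const NameFn* Entries()
    {
        static const NameFn entries[2] = { &S0::Name, &S1::Name };
        return entries;
    }
};

template <class S0, class S1, class S2>
struct DataSketch<3, S0, S1, S2>
{
    enum { size = 3 };
    static const NameFn* Entries()
    {
        static const NameFn entries[3] = { &S0::Name, &S1::Name, &S2::Name };
        return entries;
    }
};

// Defaulted parameters stand in for variadic ones; the count of real
// entries selects which DataSketch specialisation is inherited.
template <class S0,
          class S1 = NullSelectorSketch,
          class S2 = NullSelectorSketch>
struct TableSketch
    : DataSketch<IsPresentSketch<S0>::value + IsPresentSketch<S1>::value
                     + IsPresentSketch<S2>::value,
                 S0, S1, S2>
{
};

struct DeviceTypeSel { static const char* Name() { return "deviceType"; } };
struct UdnSel { static const char* Name() { return "UDN"; } };

typedef TableSketch<DeviceTypeSel, UdnSel> TwoWide;
typedef TableSketch<DeviceTypeSel> OneWide;

// Returns the name held in the last real slot of the two-entry table.
std::string DemoLastName()
{
    assert(TwoWide::size == 2);
    return TwoWide::Entries()[TwoWide::size - 1]();
}
```

Each specialisation only mentions the selectors it needs, so NullSelectorSketch never has to provide a Name function.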

-- Peter Hartley, 2009-May-01 