The documents must be well-formed and may conform to different DTDs. The library supports the storage and management of these XML files in native and compressed form, operating directly at the file-system level. The main features of the library are: state-of-the-art algorithms and data structures for text indexing, compressed space occupancy, and novel succinct data structures for the management of the hierarchical structure of the XML document.
As a result, substring, regular expression, approximate and proximity searches on the textual content of the XML document, as well as on the attribute values, can be executed efficiently. Structural queries on (partially specified) tag paths can also be resolved efficiently, thanks to a novel implementation of the hierarchical structure of the XML document.
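To give a concrete feel for index-based substring search on textual content, here is a minimal sketch using a plain, uncompressed suffix array. It is an illustration only: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is deliberately naive, and nothing here reflects the actual data structures or API of the XCDE library, which relies on compressed indices.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of `text`, sorted
    lexicographically. Naive construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted positions at which `pattern` occurs in `text`,
    found by two binary searches over the suffix array."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [start, lo) begin with the pattern.
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    content = "abracadabra"  # stand-in for the text of an XML element
    sa = build_suffix_array(content)
    print(find_occurrences(content, sa, "abra"))
```

Once the index is built, each query costs a logarithmic number of string comparisons instead of a scan of the whole text; a compressed index aims for the same kind of query behaviour in far less space.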
Overall, the compressed XML document plus all of its indices occupy no more space than the original file. The XCDE library is intended only as the kernel of a more complex XML-query engine or XML-document engine. It may be used to implement most of the basic functionalities of XQuery, and it may support IR-like searches. Currently we are using the XCDE library to design an XML search engine for a collection of Italian literary texts marked up with TEI.
For details see http://sbrinz.di.unipi.it/~xcde/xcdelib.html.
(Joint work with Andrea Mastroianni.)