LT XML Software Introduction

TO INSTALL LT XML READ AND FOLLOW INSTRUCTIONS IN THE 00INSTALL FILE.

CONTACT INFORMATION IS IN THE 00COPYRIGHT FILE.

Version 1.2 (LTG September 2000)

Introduction

This release contains version 1.2 of the LT XML toolkit
and API, including everything required to process a very wide range of
well-formed XML documents, as well as normalised SGML documents
produced with the LT NSL toolkit.

The basic architecture is a pipeline: XML and normalised SGML (nSGML)
documents are piped through tools built with our API, each of which
augments the document, extracts from it, or otherwise transforms it.
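
As a rough illustration of this pipeline style (not the LT XML API
itself, which is documented separately), the following Python sketch
chains two stages using the standard xml.etree.ElementTree module.  The
stage names augment and extract, the "words" attribute and the sample
document are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def augment(xml_text):
    """Hypothetical augmentation stage: add a word count to each <p>."""
    root = ET.fromstring(xml_text)
    for p in root.iter("p"):
        p.set("words", str(len((p.text or "").split())))
    return ET.tostring(root, encoding="unicode")


def extract(xml_text, min_words):
    """Hypothetical extraction stage: text of <p>s with enough words."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p")
            if int(p.get("words", "0")) >= min_words]


SAMPLE = "<doc><p>one two three</p><p>four</p></doc>"

# Each stage consumes XML text, so stages compose much like a shell pipe.
result = extract(augment(SAMPLE), min_words=2)
print(result)
```

Because every stage reads and writes the same document format, stages
can be reordered or replaced without changing their neighbours.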

LT XML implements the XML 1.0 recommendation
(http://www.w3.org/TR/1998/REC-xml-19980210), including:

 * Full 16-bit character support, including the UTF-8, ISO-646,
   ISO-8859-1 to ISO-8859-9, UTF-16 and UCS-2 encodings.

 * WIN32 and UN*X platforms (may also work on Macintosh)

 * Validation
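
What support for several encodings means in practice can be sketched in
Python, using the standard library parser rather than LT XML: the same
document, serialised in three of the encodings listed above, should
yield the same character content.  The helper encode_document and the
sample word are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # ends in a non-ASCII character (e with acute accent)


def encode_document(encoding):
    """Serialise a one-element document with a matching XML declaration."""
    doc = '<?xml version="1.0" encoding="%s"?><word>%s</word>' % (
        encoding, TEXT)
    return doc.encode(encoding)


decoded = {}
for enc in ("UTF-8", "ISO-8859-1", "UTF-16"):
    data = encode_document(enc)
    # The parser chooses its decoder from the declaration (and the byte
    # order mark, for UTF-16), so the recovered text is the same each time.
    decoded[enc] = ET.fromstring(data).text
    print(enc, len(data), "bytes ->", ascii(decoded[enc]))
```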

Thread safety

Release 1.1 included revisions to remove obvious obstacles to
thread safety, but we have no multi-threaded applications with which to
test them.  We would appreciate reports from anyone who tries this.

Pre-built tools supplied include:

 * a search and extraction program, sggrep, which matches XML document
   components based on both markup and content;

 * a program to output all the text from an XML file, textonly;

 * an illustrative application, simple, whose source code (in
   src/nsg/simple.c) is intended as a starting point for application
   developers;

 * an alternative version of simple, called simpleq, which uses queries
   to access the data.  Both simple and simpleq are described in some
   detail in the documentation;

 * a sorting utility, sgsort;

 * two approaches to down-translation, sgmltrans and sgrpg;

 * xmlnorm, for trivial normalisation of XML files, useful for checking
   well-formedness and validity (with the -V flag);

 * a simple version (element structure and text only) of nsgmls called pesis;

 * a suite of tools developed for the MULTEXT project, including a
   tokeniser (sgmltoken), a toy segmenter (sgmlseg, actually a Perl
   program) and a sentence boundary finder (sgmlsb).
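
The kind of well-formedness check mentioned for xmlnorm above can be
pictured with a short, self-contained Python sketch.  It uses the
standard library parser rather than LT XML and does not attempt DTD
validation (the -V case); the function name is_well_formed is an
assumption made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b/></a>"))    # properly nested elements
print(is_well_formed("<a><b></a>"))     # <b> is never closed
```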

See XML/doc/00lt_xml.html for complete documentation of the tools and API.

A model Makefile for user applications is included as Makefile.usr.

Some sample data is provided in the data directory once you have run
"make test".  Here are some examples of the processing possible:

# Print out all <V> elements inside <CHAPTER>s whose <P> contains the
# string 'Comforter'.

	sggrep '.*/chapter/v' '.*/p' 'Comforter' < nt.xml

# Confirm that we have well-formed XML

	xmlnorm -sx < nt.xml

# Produce a nicely-formatted extract

	sgrpg -f comf-pat.xml < nt.xml

# Tokenise a text and extract the fifth token of each paragraph therein:

	sgmltoken < test01.xml | sggrep -r '.*/P/C[ID="T5$"]' ''
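
The last example can be mimicked in Python to show the idea of
tokenising and then selecting by an ID pattern.  This is only a sketch
under stated assumptions: the whitespace tokeniser, the ID scheme
"P<n>T<m>" and the sample text are invented for illustration and are not
taken from sgmltoken's actual output.

```python
import re
import xml.etree.ElementTree as ET


def tokenise(xml_text):
    """Wrap each whitespace-separated word of every <P> in a <C> element."""
    root = ET.fromstring(xml_text)
    for pnum, p in enumerate(list(root.iter("P")), start=1):
        words = (p.text or "").split()
        p.text = None
        for tnum, word in enumerate(words, start=1):
            # Assumed ID scheme: paragraph number, then token number.
            c = ET.SubElement(p, "C", ID="P%dT%d" % (pnum, tnum))
            c.text = word
    return root


def fifth_tokens(root):
    """Select <C> children of <P> whose ID matches the regex T5$."""
    pattern = re.compile(r"T5$")
    return [c.text for c in root.findall(".//P/C")
            if pattern.search(c.get("ID", ""))]


SAMPLE = ("<TEXT><P>a b c d e f</P><P>one two three</P>"
          "<P>v w x y z</P></TEXT>")

print(fifth_tokens(tokenise(SAMPLE)))
```

The second paragraph has only three tokens, so it contributes nothing to
the result.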
