SGML and XML catalog files

SGML and XML files are each considered to be an instance of some particular SGML or XML vocabulary. SGML vocabularies are defined by a Document Type Definition (DTD). XML vocabularies are defined by a DTD, or by an XML schema language such as XML Schema or RelaxNG.

So, for example, an HTML 4.0 file is an instance of the SGML vocabulary defined by one of the W3C HTML 4.0 SGML DTDs. XHTML is an XML vocabulary defined by one of the W3C XHTML DTDs.

An SGML parser MUST have access to the DTD that defines an SGML file's vocabulary even to parse the file into its constituent elements, attributes, and other components. If it cannot find the DTD, it cannot really provide any useful information about the SGML file, and is unable to determine whether or not the file is a valid instance of that DTD.

An XML parser can tell us much more about an XML file if it has access to the DTD or schema that defines the file's vocabulary. It does not need the definition to parse the file and give some very basic information (such as whether or not the file is "well-formed" XML), but it cannot tell us if the file is valid without the definition.
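The difference between a well-formedness check and validation can be illustrated with a short Python sketch. It is not part of JHOVE2; it assumes the non-validating parser in Python's standard library, and the function name `is_well_formed` and the sample documents are invented for the example. Note that the third document names a DTD that does not exist, yet it still parses, because a non-validating parser does not need the definition.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML.

    The standard-library parser is non-validating: it never checks the
    document against a DTD or schema, so True says nothing about validity.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents.
good = "<doc><title>Example</title></doc>"
bad = "<doc><title>Example</doc>"  # mismatched end tag
with_missing_dtd = '<!DOCTYPE doc SYSTEM "missing.dtd"><doc/>'

for sample in (good, bad, with_missing_dtd):
    print(is_well_formed(sample))
```

Deciding whether any of these documents is valid would require a validating parser with access to the DTD or schema.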

SGML files indicate which DTD they are based on via a DOCTYPE statement at the beginning of the file. For example, a web page based on the HTML 4.01 Transitional DTD, if it has a DOCTYPE statement, would begin like this:

     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The statement contains something called a public identifier:

"-//W3C//DTD HTML 4.01 Transitional//EN"

and something called a system identifier:

"http://www.w3.org/TR/html4/loose.dtd"

If there is no catalog file, the SGML parser will use the system identifier to get a copy of the DTD for the file. In this case, the system identifier corresponds to a location on the W3C website, where there is a copy of the DTD.
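To make the two identifiers concrete, here is a minimal Python sketch that pulls them out of a DOCTYPE statement with a regular expression. It is an illustration only, not how an SGML parser or JHOVE2 works internally: the names `DOCTYPE_RE` and `parse_doctype` are assumptions, and the pattern handles only the common PUBLIC form with double-quoted identifiers.

```python
import re

# Simplified pattern: <!DOCTYPE name PUBLIC "public id" "system id">
# The system identifier is optional, as it may be omitted in SGML.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<name>[^\s>]+)\s+PUBLIC\s+'
    r'"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Return (name, public_id, system_id), or None if no DOCTYPE is found."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return match.group("name"), match.group("public"), match.group("system")


html_doctype = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    '"http://www.w3.org/TR/html4/loose.dtd">'
)
name, public_id, system_id = parse_doctype(html_doctype)
print(public_id)
print(system_id)
```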

If you do not have a web connection when you are running the module, it will not be able to find the DTD. Even if you do have a web connection, the parser has to wait while the DTD is fetched from the (very busy) W3C site. If you are characterizing a lot of HTML files, the characterization can therefore take a very long time. If you set up a catalog file, configure the JHOVE2 SGML module to use that catalog file, AND keep a copy of the DTDs locally, the SGML parser will use the catalog to find the local copy of the HTML DTD, and will run much more quickly.

Some SGML files do NOT use a URL for the location of the DTD. The DOCTYPE statement for these files looks something like

	<!DOCTYPE myDocType PUBLIC "-//ME//DTD some DTD version 1.2.3//EN" "myDoctypeDtd/myDocType.dtd">

If you do not use a catalog, the SGML processor will try to use the system identifier ("myDoctypeDtd/myDocType.dtd") to find the file. If you happen not to have that file at that location, the SGML parser will not be able to give you any useful information about the SGML characteristics of your SGML file, or tell you whether or not it is valid.

If you do use a catalog, then the catalog file would have an entry like

	PUBLIC "-//ME//DTD some DTD version 1.2.3//EN" "catalogDTDs/myDocType.dtd"

If you then have a copy of the file "myDocType.dtd" in directory "catalogDTDs", and if directory "catalogDTDs" is located in the same directory as your catalog file, then the SGML parser will be able to find the DTD, and fully characterize and validate your SGML file.
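The lookup described above can be sketched in Python as follows. This is a simplified illustration, not the SGML parser's actual catalog handling: it reads only PUBLIC entries with double-quoted values, and the names `PUBLIC_ENTRY_RE` and `load_public_entries`, as well as the catalog directory `/opt/jhove2/config`, are hypothetical.

```python
import re
from pathlib import Path

# Matches catalog lines of the form: PUBLIC "public id" "location"
PUBLIC_ENTRY_RE = re.compile(r'PUBLIC\s+"([^"]*)"\s*"([^"]*)"')


def load_public_entries(catalog_text, catalog_dir):
    """Map each public identifier to a path.

    Relative locations are resolved against the directory that holds
    the catalog file, as described in the text above.
    """
    entries = {}
    for public_id, location in PUBLIC_ENTRY_RE.findall(catalog_text):
        entries[public_id] = Path(catalog_dir) / location
    return entries


catalog_text = (
    'PUBLIC "-//ME//DTD some DTD version 1.2.3//EN" '
    '"catalogDTDs/myDocType.dtd"\n'
)
# Hypothetical directory containing the catalog file.
entries = load_public_entries(catalog_text, "/opt/jhove2/config")
print(entries["-//ME//DTD some DTD version 1.2.3//EN"])
```

The public identifier from the DOCTYPE statement is the lookup key, so the system identifier in the SGML file no longer has to point at a real location.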

You can find out more about SGML catalog files here. The W3C illustrates a sample SGML catalog here.

XML files can also have a DOCTYPE declaration, if they are based on a DTD, or can use schemaLocation declarations, if they use XML namespaces that are based on XML Schema definitions. Again, these declarations have public (or namespace URI) and system identifiers which, absent a catalog, the parser attempts to resolve directly from the information in those declarations. If the system identifier, or location, is a URL, you will experience a delay while the URL is resolved. If the system identifier points to a local file, and you do not have that file, or do not have it in the same (relative) location that the system identifier or location specifies, then your XML file cannot be validated.

You can use an XML catalog file, and a local copy of the DTD or schema, to resolve these locations quickly, and to make validation possible.
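As a rough sketch of what such a lookup involves, the Python example below embeds a small catalog in the OASIS XML Catalogs format and resolves identifiers against it. It is deliberately simplified and is not JHOVE2's resolver: it handles only `public` and `system` entries, tries system identifiers first, and returns the `uri` attribute as written rather than resolving it against the catalog's location. The function name `resolve` and the local path `dtds/xhtml1-strict.dtd` are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Sample catalog; the local uri values are hypothetical.
SAMPLE_CATALOG = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="dtds/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="dtds/xhtml1-strict.dtd"/>
</catalog>
"""


def resolve(catalog_xml, public_id=None, system_id=None):
    """Return the catalog's uri for the given identifiers, or None."""
    root = ET.fromstring(catalog_xml)
    if system_id is not None:
        for entry in root.iter(f"{{{CATALOG_NS}}}system"):
            if entry.get("systemId") == system_id:
                return entry.get("uri")
    if public_id is not None:
        for entry in root.iter(f"{{{CATALOG_NS}}}public"):
            if entry.get("publicId") == public_id:
                return entry.get("uri")
    return None


print(resolve(SAMPLE_CATALOG, public_id="-//W3C//DTD XHTML 1.0 Strict//EN"))
```

When no entry matches, the function returns None, which corresponds to the parser falling back to the location given in the file itself.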

You can find out more about XML catalogs, and see an example for creating an XML catalog entry for the DocBook DTD here.

Please see the JHOVE2 Users Guide, and the SGML and XML module specifications, for more information about configuring JHOVE2 to use catalog files.