JHOVE2 Frequently Asked Questions (FAQ)


About JHOVE2

What is JHOVE2?

JHOVE2 is an open source, next-generation application and framework for format-aware characterization. JHOVE2 is the successor to JHOVE, the original characterization system developed by Harvard University and the JSTOR Electronic Archiving Initiative (now known as Portico). The JHOVE2 project aims to build on the success of JHOVE and to offer significant new features.

For more information see the project objectives and scope.

What is characterization?

Characterization can be thought of in two ways. First, it is information about a digital object that describes that object's character or significant nature and that can function as a surrogate for the object itself for purposes of much preservation analysis and decision making. Second, characterization is the process of deriving this information. This process has four important aspects: identification, validation, feature extraction, and assessment.

  • Identification is the process of determining the presumptive format of a digital object on the basis of suggestive extrinsic hints (e.g. file extension) and intrinsic signatures (e.g. magic number).
  • Validation is the process of determining the level of conformance of a digital object to the normative syntactic and semantic rules defined by the authoritative specification of the object's format.
  • Feature extraction is the process of reporting the intrinsic properties of a digital object significant to preservation planning and action.
  • Assessment is the process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally-defined policy rules.

In general, these capabilities can be thought of as answering four questions about a digital object:

  • What is it?
  • What is it, really?
  • What about it?
  • So what?

For more information see the Glossary of JHOVE2 terms and concepts.

What are the features of JHOVE2?

The JHOVE2 application uses a modular plug-in architecture. The JHOVE2 project will develop modules for format identification, validation, feature extraction, and assessment. JHOVE2 will be able to process hierarchically-organized digital objects, both at the macro level of files within directory structures, and at a micro level of nested bit streams within container files. The JHOVE2 application will support extensive customization through local configuration.
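The two levels of hierarchy described above (files within directories, and nested streams within container files) can be illustrated with a minimal sketch. This is not JHOVE2 code: the class name `HierarchySketch` and the method `describe` are hypothetical, the sketch assumes a modern JDK (8 or later) rather than Java SE 6, and it recognizes Zip containers by file extension only to stay short.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Illustrative only: hypothetical names, not part of JHOVE2. */
class HierarchySketch {

    /**
     * Lists every regular file under root (macro level). For files ending in
     * ".zip", also lists the contained entries, indented (micro level).
     */
    static List<String> describe(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.add(root.relativize(file).toString());
            // Simplification: a real tool would identify containers by signature.
            if (file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                try (ZipFile zip = new ZipFile(file.toFile())) {
                    Enumeration<? extends ZipEntry> entries = zip.entries();
                    while (entries.hasMoreElements()) {
                        lines.add("  " + entries.nextElement().getName());
                    }
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : describe(root)) {
            System.out.println(line);
        }
    }
}
```

Running the class with a directory argument prints one line per file, with the entries of any Zip file indented beneath it.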

The JHOVE2 identification operation is quite different from that offered by JHOVE. JHOVE identified an object by iteratively trying to validate it against all of the formats it knew about. This has the benefit that a successful identification has a high level of assurance. Unfortunately, it also has the drawback that any trivial validation error prevents JHOVE from identifying an object as its true format. JHOVE2 uses signature-based identification: it looks for well-known byte sequences (often called "magic numbers") that are indicative of a format. Since JHOVE2 doesn't have to fully validate an object to identify its format, identification can be performed much more quickly. Also, since magic numbers can be stored compactly, JHOVE2 will be able to identify a much larger number of formats than it has full validation modules for.
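The idea behind signature-based identification can be sketched in a few lines. The example below is an illustration only, not the JHOVE2 identifier: the class `SignatureSketch` and its small signature table are hypothetical, it matches only at offset 0 (real signatures may sit at other offsets), and it assumes a modern JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not part of JHOVE2. */
class SignatureSketch {

    // Hypothetical signature table: format name -> bytes expected at offset 0.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("PDF", new byte[] {'%', 'P', 'D', 'F', '-'});
        SIGNATURES.put("Zip", new byte[] {'P', 'K', 3, 4});
        SIGNATURES.put("TIFF (little-endian)", new byte[] {'I', 'I', 42, 0});
        SIGNATURES.put("TIFF (big-endian)", new byte[] {'M', 'M', 0, 42});
    }

    /** Returns the names of all formats whose signature matches the start of the data. */
    static List<String> identify(byte[] head) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identify(sample)); // prints [PDF]
    }
}
```

Because only the first few bytes are inspected, the object is never parsed in full. This is why a file with a minor validation error can still be identified, and why adding a format costs one table entry rather than a complete validation module.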

For more information see the JHOVE2 Functional requirements.

What formats will be supported by JHOVE2?

JHOVE2 will support the validation and feature extraction of the following format families and specific format subtypes:

  • ICC color profile
  • JPEG 2000
    • JP2 (ISO/IEC 15444-1) and JPX (ISO/IEC 15444-2) profiles
  • PDF
    • PDF 1.0 - 1.7, ISO 32000-1, PDF/X-1 (ISO 15930-1), PDF/X-1a (ISO 15930-4), PDF/X-2 (ISO 15930-5), PDF/X-3 (ISO 15930-6), PDF/A-1 (ISO 19005-1)
  • SGML
  • Shapefile
  • TIFF
    • TIFF 4 - 6, Class B, F, G, P, R, and Y, TIFF/EP (ISO 12234-2), TIFF-FX, TIFF/IT (ISO 12639), Exif (JEITA CP-3451), GeoTIFF, Digital Negative (DNG), RFC 1314
  • UTF-8 encoded text
    • ASCII (ANSI X3.4)
  • WAVE audio
    • Broadcast Wave Format (EBU N22-1997)
  • XML
  • Zip

ICC, SGML, Shapefile, and Zip are newly supported in JHOVE2. Unfortunately, due to project funding constraints, several formats formerly available for analysis by the original JHOVE system will not be supported by JHOVE2, including AIFF, GIF, HTML, and JPEG.

Since JHOVE2 uses signature-based identification, it will be able to identify many more formats than those for which it has full validation modules.

What are the technical requirements for installing/running JHOVE2?

JHOVE2 is written in Java Standard Edition (SE) 6. JHOVE2 should run on any computing platform that has a Java SE 6 Java Runtime Environment (JRE).
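A quick way to see which Java version a given runtime reports is to read the standard `java.specification.version` system property. The sketch below is a hypothetical helper, not something shipped with JHOVE2; treating any version of 6 or later as acceptable is an assumption, since the text above only names Java SE 6.

```java
/** Illustrative only: hypothetical helper, not part of JHOVE2. */
class JavaVersionCheck {

    /** Maps a specification version such as "1.6" or "17" to its major number. */
    static int majorVersion(String spec) {
        // Versions before Java 9 are reported as "1.x"; later ones as "x".
        String s = spec.startsWith("1.") ? spec.substring(2) : spec;
        int dot = s.indexOf('.');
        if (dot >= 0) {
            s = s.substring(0, dot);
        }
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        // Assumption: any runtime at or above Java SE 6 is treated as sufficient.
        System.out.println(major >= 6
                ? "Java " + major + " meets the Java SE 6 baseline"
                : "Java " + major + " is older than Java SE 6");
    }
}
```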

For more information, please see the User's guide v2.0.0 (PDF).

Where can I learn more about JHOVE2?

This public wiki provides additional information about all aspects of the JHOVE2 project and application.

A good overview of JHOVE2 is provided by the paper presented by project team members at the 2008 iPRES Conference at the British Library in October 2008. This paper is available in PDF form; the PowerPoint presentation slides are also available. A number of other presentations about JHOVE2 are also available.

General information about the JHOVE2 project is distributed via the "JHOVE2-Announcement-L" mailing list. Technical discussion about JHOVE2 takes place via the "JHOVE2-Techtalk-L" list. Subscription information for these lists is available here.

Where do I go to report bugs or ask questions?

JHOVE2, as an open source software project, has a community of users to rely on for help and advice. The jhove2-techtalk-l listserv is the main venue for asking for assistance and for reporting bugs. Also, the Issue tracker contains all reported bugs and feature enhancement requests.

JHOVE2 Advanced Questions

How do I run JHOVE2 and/or unit tests in Eclipse?

To run JHOVE2CommandLine or JUnit tests from Eclipse run configurations, you need to modify the classpath to include the config and config/droid directories.

In the Run Configurations dialog box, select the Classpath tab.

In the Classpath: window, click User Entries.

  1. Click Advanced to bring up the Advanced Options dialog box.
  2. Select Add Folders and click OK.
  3. Choose your project folder and expand the directory.
  4. Select the config directory and click OK.

Repeat steps 1-3, then select the config/droid directory and click OK.

Note: JHOVE2 JUnit test cases use the JUnit 4 test runner. This is configured in the Test tab.
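To confirm that the run configuration picked up both folders, one option is to inspect the standard `java.class.path` system property from inside the launched JVM. The sketch below is a hypothetical helper (the class `ClasspathCheck` is not part of JHOVE2) that reports which of the required directory names do not appear as classpath entries.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of JHOVE2. */
class ClasspathCheck {

    /** Returns the required directory names that no classpath entry ends with. */
    static List<String> missing(String classpath, String... requiredDirs) {
        List<String> missing = new ArrayList<>();
        for (String dir : requiredDirs) {
            boolean found = false;
            for (String entry : classpath.split(File.pathSeparator)) {
                // Normalize separators and trailing slashes before comparing.
                String norm = entry.replace('\\', '/').replaceAll("/+$", "");
                if (norm.equals(dir) || norm.endsWith("/" + dir)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        System.out.println("Missing: " + missing(classpath, "config", "config/droid"));
    }
}
```

Launching this class with the same run configuration should print an empty list once both the config and config/droid folders have been added under User Entries.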

How do I increase the heap size in the JUnit JVM in Eclipse?

The JUnit JVM is separate from the JVM used for normal development within Eclipse.

To set the heap size for the JUnit JVM:

  1. Open the JUnit Run Configuration
  2. Select the Arguments tab
  3. In the VM Arguments box, add the following: -Xms128m -Xmx1024m
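To check that the VM arguments took effect, the maximum heap can be read from inside the launched JVM with the standard `Runtime.maxMemory()` call. The class `HeapCheck` below is a hypothetical helper for illustration; the same call could equally be placed in a test. The reported figure may be somewhat lower than the -Xmx value, depending on the JVM and garbage collector.

```java
/** Illustrative only: hypothetical helper, not part of JHOVE2. */
class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most memory the JVM will attempt to use.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MiB): " + toMiB(maxBytes));
    }
}
```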

Why is it taking so long for my files to validate?

If you're validating SGML, HTML, or XHTML files, or Zip files that contain them, validation shells out to a separate SGML process, which can add to processing time. Also, when processing HTML files, the catalog file may not reference a local copy of the DTD; the code then resolves the SystemID against the W3C website, which adds latency. A workaround for the latter issue is to update the SGML catalog file to point to the directory where local DTDs are stored for the appropriate HTML version (e.g. HTML 4.01). This catalog file is referred to in the config\spring\module\format\sgml\jhove2-sgml-config.xml file.

DTD and sample catalog files which you can copy to your local area are located in the src\test\resources\examples\sgml directory.

Note: you must use the full path name in the catalog file, for example: CATALOG "/cygdrive/c/jhove2_hg_smm/src/test/resources/examples/sgml/dtds/html32/"
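A relative path in a CATALOG entry is easy to miss. The sketch below scans catalog text for entries of the form shown in the note above and reports any whose path is not absolute. It is a hypothetical helper, not part of JHOVE2; it handles only the quoted CATALOG "..." form, and its notion of "absolute" (a leading slash or a Windows drive prefix) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helper, not part of JHOVE2. */
class CatalogPathCheck {

    // Matches lines of the form: CATALOG "some/path"
    private static final Pattern CATALOG_ENTRY =
            Pattern.compile("^\\s*CATALOG\\s+\"([^\"]*)\"", Pattern.MULTILINE);

    /** Returns the CATALOG paths in the text that are not absolute. */
    static List<String> relativePaths(String catalogText) {
        List<String> relative = new ArrayList<>();
        Matcher matcher = CATALOG_ENTRY.matcher(catalogText);
        while (matcher.find()) {
            String path = matcher.group(1);
            if (!isAbsolute(path)) {
                relative.add(path);
            }
        }
        return relative;
    }

    /** Assumption: absolute means a leading "/" (POSIX, Cygwin) or a drive prefix like "C:\". */
    static boolean isAbsolute(String path) {
        return path.startsWith("/") || path.matches("^[A-Za-z]:[\\\\/].*");
    }

    public static void main(String[] args) {
        String sample = "CATALOG \"/cygdrive/c/example/dtds/html32/\"\n"
                + "CATALOG \"dtds/html40/\"\n";
        System.out.println(relativePaths(sample)); // prints [dtds/html40/]
    }
}
```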

More info on this topic can be found in #123 and on this page.

About the JHOVE2 Project

How is the JHOVE2 project organized?

JHOVE2 is a collaborative project of the California Digital Library (CDL), Portico, and Stanford University. A distinguished advisory board, with members drawn from international memory institutions, programs, and vendors, provides guidance.

The JHOVE2 project is funded for two years by the Library of Congress under its National Digital Information Infrastructure and Preservation Program (NDIIPP).

JHOVE2 is an open source project. All software products newly developed as part of the project will be made freely available under the terms of the BSD license. However, some pre-existing software products incorporated into JHOVE2 may require separate licensing.

What is the project schedule?

The two-year project is divided into three main phases. The first phase, which began in May 2008, was a six-month period of stakeholder engagement, needs assessment, and design. This phase led to the development of a set of appropriate use cases and Functional requirements.

The second phase, which began in December 2008, is a six-month period of rapid prototyping and refactoring of the core JHOVE2 API and architectural framework.

The third phase, which will begin in June 2009 and last through the end of the project, will be a period of module development.

  • June - November 2008: Stakeholder engagement, needs assessment, preliminary design
  • December 2008 - May 2009: Prototyping and development of core API and framework
  • June 2009 - February 2011: Module development
  • March 2011: Testing
  • April 2011: Release of JHOVE2