Introduction to Digital Preservation: Validation

Subjects: Digital Library

What is validation?

File format validation performs a number of functions that help to confirm that a file is well-formed and valid. Validation will:

  • confirm that a file conforms to the specific file format specification, which is a set of documentation that lists the standards a specific file format must follow. This includes specific file signature information and embedded metadata.
  • notify you if the file does not conform to the specification
  • help ensure that files can be read by future readers. A valid file is much easier to manage over time than one that does not conform. A file that is not valid may create issues over time, especially when trying to change the file format type (known as migration). Access issues will also be harder to diagnose if the file does not conform and can no longer be opened. Future software may also have issues rendering the file correctly if it does not conform to the specification.
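
As a small illustration of the file signature part of a specification, the sketch below compares the first bytes of a file against a few widely documented signatures. The function names and the choice of formats are assumptions made for this example, and a real validator checks far more than the signature.

```python
# Leading bytes ("magic numbers") for a few formats. This table is only an
# illustrative subset chosen for the example, not a complete registry.
SIGNATURES = {
    "TIFF": (b"II*\x00", b"MM\x00*"),          # little-endian and big-endian
    "JPEG": (b"\xff\xd8\xff",),
    "PDF": (b"%PDF-",),
    "GIF": (b"GIF87a", b"GIF89a"),
    "JP2": (b"\x00\x00\x00\x0cjP  \r\n\x87\n",),
}


def identify_by_signature(header: bytes):
    """Return the first format whose signature starts the header, else None."""
    for name, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return name
    return None


def signature_matches(path: str, expected: str) -> bool:
    """Read the first 16 bytes of a file and compare them with the expected format."""
    with open(path, "rb") as handle:
        return identify_by_signature(handle.read(16)) == expected
```

A matching signature only shows that a file starts the way the specification says it should. It says nothing about whether the rest of the file is well-formed or valid.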

For these reasons, file format validation is important. It is an especially useful tool for digitization workflows as it will ensure that digital objects are being created correctly. When you are in control of creating a digital object, validation is an important step. However, it is important to know that file format validation has the following limitations:

  • Validation tools improve over time, so it is recommended that validation software is regularly run over digital objects to catch new non-conformances
  • Not every file format type has validation software and it is worth knowing what is available
  • Validation software will also not find issues with files that do not relate to the specific rules set out in the file format specification. For example, a TIFF file may have some visual corruption in it that the rules in the file format validation software will not look for. The software will therefore call the file well-formed and valid, though upon inspection, something is clearly wrong with the file.

This is why fixity is equally important in digital preservation. It can help detect corruption in such files, provided that what is known as a checksum is generated early in the file's life. The section on fixity goes into greater detail on creating and confirming checksums and their uses in digital preservation.
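
To make the idea of a checksum concrete, here is a minimal sketch using Python's standard `hashlib` module. The choice of SHA-256 and the function names are assumptions for this example; the fixity section may recommend a different algorithm or tool.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: str, recorded: str) -> bool:
    """Return True if the file still matches a previously recorded checksum."""
    return sha256_of(path) == recorded.lower()
```

The checksum is recorded once, as early as possible, and recomputed later. Any difference means the bytes of the file have changed, whether or not a validator would notice.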

JHOVE

The most common validation tool is JHOVE, maintained by the Open Preservation Foundation. It is an open source validation tool that can validate the following file formats:

  • TIFF
  • JPEG
  • PDF
  • AIFF
  • ASCII
  • GIF
  • JPEG2000
  • HTML
  • BYTESTREAM
  • UTF8
  • WAVE
  • XML

JHOVE stands for JSTOR/Harvard Object Validation Environment. It was a joint project between JSTOR and Harvard University to create a tool to validate files and extract metadata. In 2015, the maintenance of the software was transferred to the Open Preservation Foundation.
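
The following is a rough sketch of how JHOVE might be called from a script. It assumes that a `jhove` launcher is installed and on the PATH, that the default text output is used, and that the report contains a line beginning with `Status:`. The helper names are made up for this example, and the module name passed with `-m` (for instance `TIFF-hul`) should be checked against the modules in your own installation.

```python
import subprocess


def build_jhove_command(path, module=None):
    """Build the argument list for a JHOVE run (the launcher name is an assumption)."""
    command = ["jhove"]
    if module:
        # -m selects a specific format module instead of letting JHOVE choose.
        command += ["-m", module]
    command.append(path)
    return command


def parse_status(report: str):
    """Return the text after the first 'Status:' label in a text report, else None."""
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.startswith("Status:"):
            return stripped[len("Status:"):].strip()
    return None


def validate_with_jhove(path, module=None):
    """Run JHOVE on one file and return its reported status string, if any."""
    completed = subprocess.run(
        build_jhove_command(path, module),
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_status(completed.stdout)
```

In a digitization workflow, the returned status could be logged next to each file so that anything not reported as well-formed and valid is reviewed before ingest.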

MediaConch

MediaConch is an implementation checker, policy checker, reporter, and fixer that targets preservation-level audiovisual files (specifically Matroska, Linear Pulse Code Modulation (LPCM) and FF Video Codec 1 (FFV1)) for use in memory institutions, providing detailed and batch-level conformance checking. It can be used from the command line, through a graphical user interface, or through a web interface. While it validates several audiovisual file types, it does not validate every file format type.

The policy checker part of the tool is useful, but it is complex and requires a certain level of knowledge about the different file formats.

Jpylyzer

Jpylyzer is a validation tool for JPEG2000 (JP2) images. It also reports on the image's technical characteristics or technical metadata (known as feature extraction). It is an open source tool maintained by the Open Preservation Foundation. The creation of this validation tool was made possible by partial funding from the EU FP7 project known as SCAPE. It is a richer validation tool for JPEG2000 images than JHOVE and is therefore preferred for validating this file type. It is commonly used in digitization workflows where TIFF files are migrated to JPEG2000 for storage and access reasons.

Unlike JHOVE, Jpylyzer validates only one file format, but its validation rule set for JPEG2000 is richer than JHOVE's.

EpubCheck

EpubCheck validates EPUB files and will extract technical and other embedded metadata. It checks things such as: 

  • OCF container structure
  • OPF and OPS mark-up
  • Internal reference consistency
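
To give a feel for what a container structure check involves, the sketch below tests two basic OCF requirements: the first entry in the ZIP archive is an uncompressed `mimetype` file containing `application/epub+zip`, and `META-INF/container.xml` is present. The function name and messages are made up for this example, and it covers only a tiny part of what EpubCheck verifies.

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def ocf_problems(source):
    """Return a list of basic OCF container problems (empty if none were found).

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(source) as archive:
        entries = archive.infolist()
        if not entries or entries[0].filename != "mimetype":
            problems.append("first entry is not 'mimetype'")
        else:
            first = entries[0]
            if first.compress_type != zipfile.ZIP_STORED:
                problems.append("'mimetype' entry is compressed")
            if archive.read(first) != EPUB_MIMETYPE:
                problems.append("'mimetype' content is not application/epub+zip")
        if "META-INF/container.xml" not in archive.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems
```

An empty result only means these few container rules are met. The mark-up and internal references still need a full validator.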

It was largely developed by Adobe Systems and was later supported by the International Digital Publishing Forum (IDPF), which merged into the World Wide Web Consortium (W3C) in 2017.

An online version of EpubCheck is available at: http://validator.idpf.org/ 

veraPDF

veraPDF validates all PDF/A parts and conformance levels. PDF/A is a version of PDF intended for long term preservation and archiving of electronic documents. PDF/A is meant to prohibit features that are not suitable for long term preservation, including font linking (instead it will embed the font file in the document), encryption and annotations. However, it does not work for every document and creating a valid PDF/A can be labour intensive. Conformance levels include A (Accessible), B (Basic) and U (Unicode). U was created to deal with specialized fonts and characters such as Greek, Arabic, Chinese and so on. On top of conformance levels, there are also three versions of PDF/A, which means a PDF/A document has a version number and conformance level associated with it.

veraPDF will help to validate the various versions and conformance levels of PDF/A, but will not be able to validate any other version of PDF; JHOVE will be required for that. It is good practice to validate a PDF/A file using both veraPDF and JHOVE, as the two tools validate different aspects of the PDF file.
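
A PDF/A file declares its version and conformance level in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that claim with a simple pattern match. It assumes the XMP packet is stored uncompressed and in a single-byte-compatible encoding such as UTF-8, and the function name is made up for this example. It reports only what the file claims to be; confirming the claim is the job of a validator such as veraPDF.

```python
import re

# Match both XMP serialisations: pdfaid:part="2" and <pdfaid:part>2</pdfaid:part>.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_flavour(data: bytes):
    """Return the declared flavour, for example 'PDF/A-2b', or None if no claim is found."""
    part = _PART.search(data)
    if part is None:
        return None
    level = _CONFORMANCE.search(data)
    suffix = level.group(1).decode("ascii").lower() if level else ""
    return "PDF/A-" + part.group(1).decode("ascii") + suffix
```

A file with no such declaration is treated here as making no PDF/A claim at all, which is the situation where the text above recommends JHOVE instead.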

Other validators

There are several other file format validation tools available. These include, but are not limited to:

  • Warctools - for validation of WARC files created for web archiving purposes
  • BadPeggy - for validation of JPEG, GIF, BMP and PNG files. It will provide a technical validation and also a visual validation that will detect any visual corruption in a file
  • BWF MetaEdit - used for extraction, validation, editing as well as embedding of metadata in Broadcast WAVE Format (BWF) files. It can also embed MD5 checksums in the file.

The COPTR registry of digital preservation tools has a list of further file format validation tools.

Characterizing files