Introduction to Digital Preservation: Glossary

Subjects: Digital Library

Glossary

Background: This glossary was created for the GLAM Digital Preservation Project. The application of digital preservation terminology varies greatly across organizations and information professions, so there is scope for misunderstanding and ambiguity when these terms are used. The purpose of the glossary is to ensure that GLAM uses a common language when defining system functionalities and writing policy and standards documentation, so that the organizations can better collaborate and exchange knowledge.

 

The glossary is divided into three sections:

Section 1: Storage, copies, and backups

Section 2: Digital Preservation terms and concepts

Section 3: Abbreviations

 

Acknowledgements. The glossary was amalgamated and adapted from: NDSA glossary, OCLC Trusted Digital Repositories Report, APARSEN glossary, Digital Preservation at Oxford and Cambridge Project report: Bodleian Libraries’ digitized image assets (2016), DPC Digital Preservation Handbook (2nd edition), and Adrian Brown (2013) Practical Digital Preservation: A How-to Guide for Organizations of Any Size  

___________________________________________________________________________________

Section 1: Storage, copies and backups  

Archival copy - A copy of digital material made at a particular point in time, that can be used as a reference if the original disappears or is temporarily unavailable. Usually stored on long-term storage and could be considered a primary copy of the data.

 

Backup - A copy of digital material saved to a storage device for the purpose of preventing loss of data in the event of equipment failure or destruction of the original material. Backups could be considered a secondary convenience copy of the digital material. A backup may be kept for only a limited period (for example 30 days); it is not retained indefinitely and will be overwritten with a newer backup periodically. Alternatively, a backup may run only when files are altered: the altered files are backed up again, while all other files remain untouched in the backup. A backup may cover just the data, the entire file system, or the entire computer system

 

Clone - a copy of a data structure such as a file or disk image, a duplicate of the original data

 

Cloning - the process of copying the contents of data structures (such as files)

 

Deep Copy - a copy of a data structure which includes all associated data, including deleted files

 

Geographically separated - Identical copies of the data, even if held on different storage media, are not stored in the same physical location. The copies are held far enough apart that they are not subject to the same environmental risks (such as fire and flood), the same infrastructure risks (such as power or internet failure), or the same human risks (such as arson, bombing or other malicious events). The physical distance required will therefore vary between regions and may be constrained by legal requirements

 

Logical copy - A copy of a data structure which includes all associated active data such as multimedia. The copy will retain the hierarchical organisation (folder/directory structure) and the full path of file names. Deleted files are not included

Shallow copy - A copy of a data structure which contains references to the data (e.g. to a variable, file, folder, or other object) rather than the data itself. In contrast to a "deep copy", which is an actual duplicate of the data, a shallow copy is not meant for direct usage, and copying a shallow copy does not move or duplicate the original contents. Examples of shallow copies are Windows shortcuts, symbolic links, and programming pointers, i.e. objects that contain the address of a data structure and simply point to it, but do not contain the data themselves
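
The difference between a reference and the data it points to can be sketched with a symbolic link. This is a minimal illustration only: it assumes a platform where symbolic links can be created without special privileges (e.g. Linux or macOS), and the file names and the function name are hypothetical.

```python
import os
import tempfile


def demo_symlink_reference():
    """Show that a symbolic link holds a reference, not the data itself."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")
        link = os.path.join(workdir, "link.txt")

        with open(original, "w", encoding="utf-8") as handle:
            handle.write("archival content")

        # The link stores only the path of the original file.
        os.symlink(original, link)
        with open(link, encoding="utf-8") as handle:
            readable_via_link = handle.read() == "archival content"

        # Removing the original leaves the link in place but with nothing behind it.
        os.remove(original)
        link_still_listed = os.path.islink(link)
        data_still_reachable = os.path.exists(link)  # follows the link

    return readable_via_link, link_still_listed, data_still_reachable


if __name__ == "__main__":
    print(demo_symlink_reference())
```

Once the original file is removed, the link remains listed in the directory but no longer leads to any data, which is why a shallow copy cannot stand in for an archival copy or a backup.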

Snapshot - A copy of a data structure (e.g. computer hard drive, virtual machine) at a specific moment in time. Snapshots are useful for backing up data at different intervals, which allows information to be recovered from different periods of time

Spinning disk - Refers to a hard drive with physical spinning disk platter(s). However, the term is often used loosely to mean online (instantly accessible) storage, even though many drives now use solid state technology rather than mechanical platters

Storage - Archival - Storage for digital material which is rarely accessed, often used for storing an archival copy of digital material. Usually kept off site and distanced from the original copy. Tape is often seen as a good medium for archival storage

Storage - Nearline - A storage system where access to the data is not immediately available, but the data being stored can be brought online quickly without human intervention. Tape libraries that can automatically load and access tapes are considered nearline storage, but if a tape must be manually loaded then it is considered offline storage. *see also "Storage - Online" and "Storage - Offline"

Storage - Offline - A storage system where access to data is not immediately available and requires human intervention to become online. *see also "Storage - Online"

Storage - Online - Online storage supports frequent, rapid access to data by being immediately available to users all the time. It often involves a series of either spinning disks, flash storage or a combination of both

Synchronized replication - Data is written to primary storage and replicated to additional storage or backups simultaneously. This keeps multiple copies of the data up-to-date in real time. *see also "Replication"

Replication - The process of copying data from one location to another so there are multiple, identical copies in different locations. This helps to keep copies of data up-to-date and mitigates the risk of data loss or corruption

Resilient storage - A term which can have several meanings. It often refers to storage which uses a redundant array of independent disks (RAID), meaning that one or more disks (depending on the configuration) can fail without affecting any of the data stored on the storage system

___________________________________________________________________________________

Section 2: Digital Preservation terms and concepts

 

3D scanning - The act of collecting data about a physical object, in order to reconstruct it as a digital three-dimensional model  

 

Accessioning - The process of bringing digital objects under the physical and intellectual control of an organisation

 

Administrative metadata - This term is sometimes used to refer to subtypes of digital preservation metadata. However, because administrative metadata is also used by ISAD(G) for archival description aids, it is ambiguous and is not a preferred term across GLAM. See instead Preservation Metadata.


Asset register (digital) - A record of an organization’s digital information assets/digital materials, which quantifies the value and risk of loss in each case

 

Authenticity - A digital object is authentic if it “is what it purports to be”. In the case of digital materials, it refers to the fact that whatever is being cited is the same as it was when it was first created, unless the accompanying metadata indicates any changes. Confidence in the authenticity of digital materials over time is particularly crucial owing to the ease with which alterations can be made *see also “provenance metadata”

 

Bag - A package of digital material that conforms to the BagIt Specification (specification available at http://www.digitalpreservation.gov/documents/bagitspec.pdf). Under the specification, a bag consists of a base directory containing a small amount of machine-readable text to help automate the material's receipt, storage and retrieval and a subdirectory that holds the files
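
The layout described above can be sketched with the Python standard library. This is an illustration rather than a conformant implementation: the function name, the choice of SHA-256, and the version string "1.0" are assumptions, and the specification should be consulted for the exact tag-file and manifest requirements. For production use, a maintained BagIt tool is preferable to hand-written code.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_files, bag_dir, version="1.0"):
    """Copy files into <bag_dir>/data and write the two basic tag files."""
    bag = Path(bag_dir)
    payload = bag / "data"  # subdirectory that holds the files
    payload.mkdir(parents=True)

    manifest_lines = []
    for source in map(Path, source_files):
        destination = payload / source.name
        shutil.copy2(source, destination)
        digest = hashlib.sha256(destination.read_bytes()).hexdigest()
        # One manifest line per payload file: checksum, then path relative to the bag.
        manifest_lines.append(f"{digest}  data/{source.name}")

    # Bag declaration: the small amount of machine-readable text in the base directory.
    (bag / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag
```

Because the manifest records a checksum for every payload file, a receiving system can recompute the checksums on arrival and confirm that the bag is complete and unaltered.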

 

Bit level preservation – A term used to denote a very basic level of preservation of a digital object as it was submitted (literally preservation of the bits forming a digital object). It may include maintaining onsite and offsite backup copies, virus checking, fixity-checking, and periodic refreshment to new storage media. Bit preservation is not digital preservation but it does provide one building block for the more complete set of digital preservation practices and processes that ensure the survival of digital material and also its usability, display, context and interpretation over time

 

Bit stream - A stream of data in binary form. A bit stream may be a digital file or a component of a digital file. The term bit stream is particularly important in fields such as audiovisual archiving *see also “digital file”, “digital object”, and “digital material”

Born digital - Digital materials which are not intended to have an analogue equivalent. This differentiates born digital material from digitized material, as it has not been created from an analogue source

 

Canonical metadata – A metadata record which is to be regarded as the most up to date and correct source of information. Knowing which metadata record is the “canonical” source of information for a digital object is important, as metadata may become out of sync between (for example) catalogues, delivery websites, and metadata embedded in digital files  

Canonical file – * see “master file”

Capture standard (digitization) - The settings, formats, and quality levels used for digitization of physical objects

 

Chain of custody - A process used to maintain and document the chronological history of the handling, including the transfer of ownership, of any arbitrary digital file from its creation to a final state version. * See also "provenance"

Characterisation - The identification and description of what a file is and of its defining technical characteristics. Characterisation may include the identification of file formats and technical attributes such as creating software and hardware, file size, bit depth etc. Characterisation is often captured as technical metadata *see also “technical metadata”

 

Checksum – An algorithmically-computed numeric value for a bitstream, file or a set of files. Checksums are used to monitor files in order to detect accidental errors that may have been introduced during their transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and comparing it with the stored one. If the checksums match, the data was almost certainly not altered *See also "Fixity Check"
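
As a minimal sketch of computing a checksum, the following uses SHA-256 from the Python standard library. The choice of algorithm, the function name, and the chunk size are assumptions for illustration; the glossary does not prescribe a particular algorithm.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the returned value alongside the file, and recomputing it later with the same algorithm, gives the comparison described above.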

 

Derivative image file – A version of a file which has been derived from a master image file, often for access purposes *see also “Master file”

Descriptive metadata – Metadata created for discovery and identification. Examples of descriptive metadata include: shelfmark, date, and creator

Digital file -  Binary information that is available to a computer program

 

Digital material - A generic term which can refer to either a Digital File or to a Digital Object *see also “Digital file” and “Digital object”

 

Digital object - A conceptual term that describes an aggregated unit of digital content comprised of one or more related digital files. These related files might include metadata, master files and/or a wrapper to bind the pieces together *see also “digital file”, “digital material” and “bitstream”

Digital preservation – A series of activities, processes and policies used for ensuring continued access, usability and reliability of digital information

Digital signature - A method to authenticate digital materials that consists of an encrypted digest of the file being signed. The digest is an algorithmically-computed numeric value based on the contents of the file. It is then encrypted with the private part of a public/private key pair. To prove that the file was not tampered with, the recipient uses the public key to decrypt the signature back into the original digest, recomputes a new digest from the transmitted file and compares the two to see if they match. If they do, the file has not been altered in transit by an attacker

 

Digitization - The act of creating a binary representation of an analogue source object

 

Digitized - Digital file(s) generated from an analogue equivalent. *see also “born digital”

 

Disk image - A disk image is a copy of the entire contents of a storage device, such as a hard drive, DVD, or CD. The disk image represents the content exactly as it is on the original storage device, including both data and structure information

 

Emulation - A means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on contemporary generations of computers

 

File format conversion - *see “File format Migration”

File format migration - A means of overcoming software obsolescence, by converting files into formats which the hosting institution is able to support and render. File format migration is also referred to as “file format conversion” by some groups within the University of Oxford and can be used interchangeably

 

File store – A system for delineating pieces of information, and controlling how digital material is stored and retrieved

Fixity check - The process of ensuring that digital files have not been changed without prior authorization. Changes to files may occur due to human error or transmission errors *see also “Checksum”
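
A fixity check can be sketched as comparing previously recorded checksums against the current state of a directory. The function names, the use of SHA-256, and the shape of the report are assumptions for illustration, not a description of any particular repository system.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 checksum of one file (reads the whole file; fine for a sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_fixity(directory):
    """Map each file's relative path to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_fixity(directory, recorded):
    """Compare the current files against the recorded checksums."""
    current = record_fixity(directory)
    return {
        "changed": sorted(
            name for name, digest in recorded.items()
            if name in current and current[name] != digest
        ),
        "missing": sorted(name for name in recorded if name not in current),
        "unexpected": sorted(name for name in current if name not in recorded),
    }
```

An empty report indicates that every recorded file is present and unchanged. Any entry under "changed" or "missing" needs investigation, and usually restoration from another copy.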

 

Handle – *see “persistent identifier”

Hash – *see “checksum”

Image hashing - Image hashing is a way of creating a fingerprint of an image based on its visual appearance using a mathematical algorithm. The outcome is the creation of a pixel hash. Visually similar image files will have similar pixel hashes *see also “pixel hash”
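
One simple image hashing technique is the "average hash", sketched below. The sketch assumes the image has already been decoded, converted to greyscale, and scaled down to a small grid of brightness values (0 to 255); those steps, and the function names, are not part of the glossary and are left out for brevity.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if brighter than the image's mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two hashes of equal length differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


if __name__ == "__main__":
    image = [
        [10, 20, 200, 210],
        [15, 25, 205, 215],
        [10, 20, 200, 210],
        [15, 25, 205, 215],
    ]
    # A uniformly brighter version looks the same, so its hash is identical.
    brighter = [[value + 5 for value in row] for row in image]
    print(hamming_distance(average_hash(image), average_hash(brighter)))
```

A small Hamming distance between two pixel hashes suggests visually similar images, even when the underlying files (and therefore their checksums) differ.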

 

Ingest - The process through which digital objects are added into a managed environment *see also “repository system”

 

Intellectual entity – A set of content which is considered a single intellectual unity for the purposes of management and description. An intellectual entity may have several representations. *see also “Representation”

 

Long-term - a period long enough to raise concern about the effect of changing technologies, including support for new media and data formats, and of changing user needs

 

Long-term preservation - the act of maintaining correct and independently understandable information over the long term *see also “long term”

 

Lossless compression – A compression method which allows reconstruction of the original data without any quality loss

Lossy compression – A compression method which discards some data. The original data cannot be restored without some quality loss
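
The difference between the two kinds of compression can be sketched as follows. The lossless case uses zlib from the Python standard library. The lossy case is represented by a deliberately crude quantisation step, chosen only for illustration and much simpler than what real lossy formats such as JPEG do. The function names are hypothetical.

```python
import zlib


def lossless_round_trip(data):
    """Compress and decompress bytes; the result equals the input exactly."""
    compressed = zlib.compress(data)
    return compressed, zlib.decompress(compressed)


def quantise(samples, step=16):
    """Lossy step: round each sample down to a multiple of `step`."""
    return [(sample // step) * step for sample in samples]


if __name__ == "__main__":
    original = b"digital preservation " * 50
    compressed, restored = lossless_round_trip(original)
    print(len(original), len(compressed), restored == original)

    samples = [3, 18, 47, 200, 255]
    # The discarded detail cannot be recovered from the quantised values.
    print(quantise(samples))
```

After quantisation, several different original values map to the same stored value, so the original data cannot be reconstructed exactly.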

Master file - A master file is a source file from which subsequent file versions can be created. An example of a master file could be a high quality TIFF file used for deriving JPEG access copies from. For this reason preservation effort is generally targeted at master files, rather than derivative files which can be regenerated from the source

 

Media carrier (handheld) -  A type of storage which is not networked or part of a “managed storage system”. Examples of handheld media carriers are: tape, cassettes, CDs, DVDs, and USB sticks   

 

Metadata - The set of information required to enable content to be discovered, managed and used by both humans and automated systems. *see also: “descriptive metadata”, “technical metadata”, “rights metadata”, “preservation metadata”, “provenance metadata”, “tracking metadata”, and “structural metadata”

Normalization (file formats) – The process of migrating digital files to new file formats at the point of ingest into a managed preservation environment. The purpose of normalization is to minimize the number of formats managed by an organization. *see also “ingest” and “file format migration”

 

Object management – Management of digital objects, their relationships and intellectual integrity *see also “Digital object”

Persistent identifier - A long-lasting/persistent set of characters used to uniquely identify a digital file or a digital object *see also “UUID”
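
The idea that an identifier persists while the storage location changes can be sketched with a UUID and a small lookup table. The class name, method names, and example locations are hypothetical; a real identifier service would also need durable storage and governance, which are out of scope here.

```python
import uuid


class IdentifierRegistry:
    """Maps persistent identifiers to the current location of each object."""

    def __init__(self):
        self._locations = {}

    def register(self, location):
        # A random (version 4) UUID serves as the identifier.
        identifier = str(uuid.uuid4())
        self._locations[identifier] = location
        return identifier

    def move(self, identifier, new_location):
        if identifier not in self._locations:
            raise KeyError(identifier)
        # Only the location changes; the identifier itself never does.
        self._locations[identifier] = new_location

    def resolve(self, identifier):
        return self._locations[identifier]


if __name__ == "__main__":
    registry = IdentifierRegistry()
    object_id = registry.register("/store-a/manuscripts/ms-001.tif")
    registry.move(object_id, "/store-b/2024/ms-001.tif")
    print(object_id, registry.resolve(object_id))
```

Citations and catalogue records can then refer to the identifier rather than to a storage path that may change during a storage migration.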

Pixel hash -  A hash value created by image hashing, using the visual content of an image file rather than the bit stream. *see also “image hashing” and “hash”

Preservation metadata – preservation metadata is information which supports and records digital preservation processes. In the context of the GLAM digital preservation project, preservation metadata is an umbrella term which refers to four subsets of metadata: provenance metadata, rights metadata, technical metadata, and structural metadata  

*see also the entries for “Provenance metadata”, “Rights metadata”, “Technical metadata”, and “Structural metadata”

Process documentation - Step-by-step documentation about how an action or activity is undertaken. Process documentation is often updated to keep it relevant as technologies and systems change

Provenance metadata - Information about the origin of a digital object and about any changes to it that have occurred while under management of the digital repository. Provenance metadata includes (but is not limited to) information about file format migration, date of creation, the generation of checksums, and file format validation *see also “versioning” and “preservation metadata”

Refreshing - Copying information content from one storage medium to another storage medium of the same or a different type

Representation - A representation is a distinct manifestation of an Intellectual Entity. An Intellectual Entity could be a student thesis which has two representations (Representation 1: a PDF document. Representation 2: a DOCX file version of the same document). *see also “Intellectual Entity”

Repository system - A system in which digital objects are stored for possible subsequent access, retrieval and management.

Rights metadata - In the context of digital preservation, rights metadata records information about the intellectual rights to a digital object and system rights to access, view, and edit content  

Schema - A formal description of a data structure. For example, an XML schema is a common way of defining the structure, elements, and attributes that are available for use in an XML document that complies with the schema *see also “XML”

Storage migration - The process of copying content from one generation or configuration of digital data storage onto an updated generation or configuration

Structural metadata – Metadata used to describe relationships between the digital files or other digital material which comprise a complex digital object. A simple example of structural metadata is the mapping of page numbers within a digitized manuscript to the corresponding image files

Technical metadata - The term technical metadata is contested and has many different definitions (sometimes being used synonymously with provenance metadata and “digital preservation metadata”). However, it has been given a narrower definition in the context of GLAM to make metadata modelling simpler. When referring to technical metadata in GLAM, we refer exclusively to information which can be automatically extracted from a digital file, i.e. metadata which has been embedded into a digital file (such as EXIF metadata for 2D image files), as well as attributes such as the length and size of a digital file. As such, it is closely related to and complements the provenance metadata collected by the digital repository. *see also “Preservation metadata” and “characterisation”

 

Tracking metadata – Administrative information for tracking and managing physical material, digital material, as well as processes during digitization projects. Tracking metadata may include information about the current location of physical collection items and the progress of current workflow steps

Transformation metadata - *see “provenance metadata”

Validation – The process of ensuring that data is correct and useful when checked against a set of data validation rules. These might include rules for package or file structure or specific file format profiles

Versioning (file system) - A versioning file system allows a computer file to exist in several versions at the same time, by keeping old copies of a file which has been edited. Versioning supports a repository’s ability to trace the provenance of a digital file *see also “provenance” and “file store”

Workflow - A defined sequence of tasks performed by either humans or software agents  

 

Wrapper - A data structure or software that encapsulates (“wraps around”) other data or software objects, or appends code or other software to them, for the purposes of improving user convenience, hardware or software compatibility, or enhancing data security, transmission or storage

_____________________________________________________________________________

Section 3: Abbreviations

 

DCC - Digital Curation Centre

COPTR - Community Owned digital Preservation Tool Registry

DOI - Digital Object Identifier

DPC - The Digital Preservation Coalition - http://www.dpconline.org/

DROID - Digital Record Object Identification

EXIF – Exchangeable Image File Format

ISAD(G) - International Standard Archival Description (General)

JHOVE - JSTOR/Harvard Object Validation Environment (http://jhove.openpreservation.org/)

METS – Metadata Encoding and Transmission Standard

MIX - NISO Metadata for Images in XML (an XML schema for the Technical Metadata for Digital Still Images standard)

OPF - Open Preservation Foundation

PREMIS - Preservation Metadata Implementation Strategies

PRONOM -  Technical registry service created by The National Archives (UK)

UUID – Universally Unique Identifier

XML - eXtensible Markup Language

XMP – Extensible Metadata Platform (ISO 16684-1:2012)