Friday, February 2, 2018

Conservation Treatment Tiers: An Aid to Prioritization

Staff members often need to know how much time a repair might take in order to prioritize work or to give an estimate to a donor who would like to sponsor a project. In 2017, the staff in the Bentley Conservation Lab devised a more comprehensible method of estimating repair time. A three-tier system didn’t seem detailed enough, so we started with four and tweaked it over the next couple of months until we settled on our five-tier system.
Our Tier One category (repairs of less than one hour) responds to requests for a quick fix; examples are below. Tier Five designates projects that are very involved and will take more than ten hours. There is a lot of ground between “less than one hour” and “more than ten,” so we broke it down into three more tiers that fit our most common types of projects.
The legend (below right) hangs in our lab for easy reference. The bar graph is useful in reporting to our administration (through the Business Intelligence Committee) about the types of projects we handle and how long they take. It doesn’t report ongoing work, just the projects that have been completed each month.
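The tier boundaries can also be sketched as a simple lookup. This is purely illustrative (the lab tracks tiers on a printed legend, not in code), and the handling of boundary values such as exactly three or five hours is our assumption:

```python
def tier_for(estimated_hours):
    """Map an estimated repair time (in hours) to a treatment tier (1-5)."""
    if estimated_hours < 1:
        return 1   # quick fixes: small tears, wrinkles, removing staples
    elif estimated_hours <= 3:
        return 2   # portfolio boxes, encapsulating scrapbook leaves
    elif estimated_hours <= 5:
        return 3   # mending maps and drawings
    elif estimated_hours <= 10:
        return 4   # involved projects such as taped scrolls or rebinding
    else:
        return 5   # major projects, e.g., rehousing an entire scrapbook
```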

Graph for Business Intelligence Report and legend for Conservation Lab

Tier 1: < 1 Hr.

A Tier One item might be popped in between longer projects or at the end of a day when starting a larger project doesn’t seem efficient. A Tier One repair is often done immediately because the item is needed by the digitization lab, which makes it high priority. Another example is when a researcher in our reading room requests an item and Reference staff find it in such need of repair that it might be damaged in handling. Some examples are mending small tears, ironing wrinkles, and removing sewing or staples.

Ironing on a quick mend

Tier 2: 1 - 3 Hrs.

Tier Two covers slightly more time-consuming repairs such as making portfolio-style boxes or encapsulating scrapbook leaves when they are too fragile to be rebound but must be protected.

Scrapbook pages encapsulated in polyester film

Inside view of a portfolio style box

A finished portfolio style box

Tier 3: 3 - 5 Hrs.

Examples of Tier Three jobs are mending maps or drawings, depending on the extent of the tears and number of items. The photos show tears in a map and previous tape repair that needs to be removed from fragile tracing paper.

Damaged drawings on tracing paper

Multiple types of tape on tracing paper drawing

Map torn and separated at the fold

Tier 4: 5 - 10 Hrs.

This Ann Arbor Film Festival document, a hand-made scroll with many types of tapes and adhesives, was definitely Tier Four, as were the founding documents of the University of Michigan Philosophical Society. The latter, a book in pieces, was so important to the university’s history that it was given a ¾ leather binding.

12-foot Ann Arbor Film Festival collaged scroll

2 images of the scroll, detailing tape, adhesive and loose items

University of Michigan Philosophical Society founding documents, before treatment

Inside detail

Detail of rusty staples and worn signature folds

Finishing the ¾ leather binding

Tier 5: > 10 Hrs. 

Tier Five projects are those that take over ten hours, and we try to estimate just how many hours that might be. In this case we had a scrapbook of extremely acidic and crumbling paper with newspaper articles that were fragile, wrinkled and torn. We photographed each page before removing the items, then used those photos for proper placement on the new pages. The new scrapbook was larger so the articles could be displayed without overlapping.

Scrapbook, before treatment

Scrapbook pages were numbered and photographed for identification

The photos were used to match fragments of articles for proper placement

Reconstructed articles and polyester film pockets on new scrapbook leaves

Original Cover

Finished scrapbook, at long last!

Our treatment tiers are serving their purpose and mesh well with the Bentley's system of prioritization. (Hint: it involves a COLORFUL spreadsheet!) More about that in our next rip-roaring installment.

Wednesday, September 20, 2017

Archivematica Implementation: A Retrospective

First, some exciting news: It's official! We've fully implemented Archivematica here at the University of Michigan Bentley Historical Library and, as of August 31, 2017, we've used it as part of the end-to-end digital archives workflow we developed during the Mellon-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project to deposit our first "production" AIPs into DeepBlue, our repository for digital preservation and access.

Go ahead, check them out! (And these too!) And there was much rejoicing (yaaaaaaaay)...

In this post, we'll reflect a bit on what's happened since our last status report and look forward to a brave new digital archives world (at least here at the Bentley).

Major Milestones

Archivematica is a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.

The Mellon grant officially concluded nearly a year ago on October 31st, 2016. At the time, we announced that we had achieved each of the three major development objectives for the project:
  • the creation of a new Appraisal and Arrangement tab in Archivematica that will permit archivists to characterize, review, arrange, and describe digital archives;
  • the integration of Archivematica and ArchivesSpace; and
  • the integration of Archivematica and DSpace.

Archivematica 1.6 was officially released on March 16, 2017. Dubbed "the Nancy Deromedi release" in memory of Nancy Deromedi, former Associate Director for Curation here at the Bentley, whose vision helped shape defining features of the release, this release contained the features listed above whose development we sponsored as part of this work, as well as some work by MLibrary's own Aaron Elkiss to "drastically cut down" the number of files that need to be indexed by removing empty BulkExtractor logs. (Up until this point, indexing in Archivematica had been a huge problem for us, particularly for transfers with lots of files).

Even with this release, however, we still weren't quite ready to fully adopt Archivematica and go "live" with the ArchivesSpace-Archivematica-DSpace workflow (even though we had been making extensive use of Archivematica's Backlog feature and the `automation-tools`).

Fix One Bug, Two More Shall Take Its Place

Even before Archivematica 1.6 was officially released, however, we had identified a number of additional bug fixes (and new features) that were blocking our full adoption and implementation of Archivematica.

Issues Addressed by Artefactual

We opened another contract with Artefactual (the lead developers of Archivematica) to address a number of these issues, some of which are listed below:
  • Handles were not being written back to the File Version field of ArchivesSpace's Digital Object module. Ultimately, this meant that links out to digital content were not making it back to our public finding aids.
  • We were unable to drag-and-drop all files from the Backlog pane. This was essential to being able to associate digital content with its description.
  • It was difficult to identify the location of files in the Backlog pane when they had been singled out in the Examine Contents and File List panes. Archivists thus had a hard time locating files (e.g., after they had tagged them, say, as having sensitive data) in their original order.
  • Files whose formats were not able to be identified were being included in facets for other file formats in the Analysis pane, making file format characterization a bit unwieldy.
  • Required (at least for us) metadata fields were not being written to the DSpace Item (although they were being written to the METS file inside the AIP). This had implications for searching and browsing in DeepBlue, and it was particularly problematic for researchers who arrive from search engines that take them directly to digital content in DeepBlue (rather than through our finding aids), since those researchers need context to understand that digital content.
  • Scrolling down the File List pane made all the File List buttons disappear, which led to poor usability of the functionality enabled by the buttons (e.g., creating a new component of description, finalizing an arrangement, etc.).
  • We wanted the option to package AIPs in the .ZIP archive format (in addition to .7Z). We prefer the .ZIP format because it's more familiar than .7Z to the majority of our researchers.
  • The date facets in the File List pane were not functional and, in any case, last modified dates weren't showing up.

All of these issues (except the last one, but more on that later) were incorporated in the 1.6.1 release of Archivematica, which came out on August 1, 2017. This release also included some work by our own Dallas Pillen to fix a bug that occurred when trying to run a SIP through Archivematica's Ingest microservices when that SIP (coming from the ArchivesSpace pane in the Appraisal tab) had a date, but no title. (This is a fairly common practice in our description, permitted in ArchivesSpace as well as content standards like DACS.)
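The date-but-no-title case that Dallas's fix addressed boils down to a fallback when building a display name for the SIP. A hypothetical sketch of that logic (not the actual patch):

```python
# Hypothetical sketch; the real fix lives in Archivematica's Ingest code.
def sip_display_name(title, date):
    """Use the title when present; otherwise fall back to the date
    (a common pattern in our description, permitted by DACS)."""
    name = title or date
    if not name:
        raise ValueError("component needs at least a title or a date")
    return name
```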

Issues Addressed Locally

Due to the local, idiosyncratic nature of some additional issues we identified, we also made a number of fixes to our forks of Archivematica and the Archivematica Storage Service:
  • Archivematica
    • We got rid of a nested "digital_object_component_" in the AIP directory structure, a relic of a time before we decided to simplify the way we model digital objects in ArchivesSpace. Now all digital content is packaged inside a single "objects" folder and hopefully this makes things a bit more straightforward for researchers.
    • We added a "" prefix to the Handle written back to the File Version field of ArchivesSpace's Digital Object module so that links to digital content in the finding aids actually work. We toyed with hard-coding this in Archivematica, but Dallas ended up creating an ArchivesSpace plug-in that verifies all URLs with Handles coming to ArchivesSpace (whether or not they're coming from Archivematica).
    • We increased one of the timeouts from two minutes to an hour so that the Archivematica Storage Service could move around larger packages (e.g., at initial transfer, at final deposit, etc.) without timing out.
    • We disabled BulkExtractor scanners except the ones we need to identify the most common forms of sensitive data we encounter, since this application is extremely time and resource intensive. At the time, this application was not configurable in the Format Policy Registry.
    • We updated the default Copyright statement going from Archivematica to DSpace to point researchers to access and use restrictions recorded at the collection-level.
  • Archivematica Storage Service
    • We added a feature to deposit a License Bundle with every AIP going to DSpace. This is one of our internal requirements for all deposits to DSpace.
Of course, if you have questions about any of these, please don't hesitate to get in contact with us!
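One of the local fixes above, prefixing the Handle written back to ArchivesSpace, amounts to normalizing bare Handle identifiers into resolvable URLs. A hypothetical sketch (the actual prefix was elided in the post, so the resolver below and the already-a-URL check are our assumptions, not the plug-in's real code):

```python
# Assumed Handle resolver prefix; the prefix we actually use was elided above.
HANDLE_RESOLVER = "http://hdl.handle.net/"

def resolvable_handle(handle):
    """Return a resolvable URL for a Handle, leaving full URLs untouched."""
    if handle.startswith(("http://", "https://")):
        return handle
    return HANDLE_RESOLVER + handle
```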

On Deck for Archivematica 1.7

Looking ahead to Archivematica 1.7, you can expect a couple of additional features related to the ArchivesSpace-Archivematica-DSpace Workflow Integration project, most notably the inclusion of an additional feature that will permit archivists to characterize and review content based on its last modified date.

The new "Last modified" column in the File List pane of the Appraisal tab.

While last-modified dates and times are notoriously unreliable (especially as they change hands or operating systems, e.g., on their way from donor to archive), they can help to give an archivist additional context for a set of files or prepare them for additional preservation steps that might be required for older content, e.g., exploring additional file format migration pathways if the content is of sufficient value.

This release will also contain some work I did to fix a bug that was introduced when the .ZIP functionality was added. The bug occurred when Archivematica tried to update permissions on the "metadata" bitstream when the AIP was packaged using the .ZIP archive file format.

Mission Accomplished (for These Archivists who are at This Institution on Their Mission)

So here we are--we've reached another milestone. As I mentioned at the beginning of the post, as of August 31, 2017, we are officially live with Archivematica and the new features and workflow we developed during the Mellon-funded ArchivesSpace-Archivematica-DSpace Workflow Integration project. In fact, our latest cohort of Project Archivists just started at the beginning of September and they were all trained to use these new tools and workflows--it's all so exciting!

While it's important to say that we've accomplished something--and that we're proud of what we've accomplished!--it's also important to qualify that a bit. What we've got works for us (we think!), at least for now, at least for most of what we're working with. We hope you can take at least some of what we've done (and we tried hard to make sure you could) and make it work for you, too. It's been exciting, for example, to hear about other people's experiences with the Appraisal tab (like this post on "Appraising Appraisal and picking the right tool for the job" by Chris Grygiel).

This has been an amazing journey, and along the way we've learned a lot, not just about Archivematica, but also about software development, project management, working with open source tools and communities, etc. We've said before that the end is just a new beginning--and that remains true today. With that in mind, we know our mission is never "accomplished" as such--we fully expect (and are equally excited for!) all the new challenges and adventures we'll face in Archivematica Land as we move forward.

Until next time!

Thursday, April 6, 2017

An Overview of Archivematica Storage Service Use at the Bentley

When I first encountered Archivematica, I understood it as a pipeline, a chain of microservices, a "sausage-maker." With a little more experience, I realized that this initial impression left out a hugely important part of the Archivematica package: the Storage Service.

As you might have guessed, the Storage Service has to do with storage. Specifically, it allows users to configure the storage spaces (e.g., transfer source locations, AIP and DIP locations, etc.) with which Archivematica interacts.

In short, the Archivematica Storage Service is the heart of Archivematica.

Blood Flow in the Heart

Information Flow in the Archivematica Storage Service
As you can see from this [anatomically correct] diagram, the heart (or Archivematica Storage Service) is made up of chambers (we'll call them Internally and Currently Processing locations, or, more simply, the Archivematica Pipeline). Blood (SIPs) enters the heart (from Transfer Sources) and flows through these chambers; oxygenated blood (AIPs) exits the heart (to AIP and/or DIP Storage).

You can learn more about the heart here.

You can learn more about the Archivematica Storage Service here in this post. (And here. And here.)

Storage Service Structure and Current Use

The Storage Service is made up of a number of different entities: Pipelines, Spaces, Locations and Packages. A Pipeline has Spaces; a Space has Locations; and a Location has Packages:
While it's not obvious from this diagram, the Storage Service can actually be used to configure Spaces and Locations across multiple Pipelines.
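The Pipeline > Space > Location > Package hierarchy can be modeled as a toy example. The field names here are illustrative only, not the Storage Service's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Toy model of the Storage Service entities; field names are our own.
@dataclass
class Package:
    uuid: str
    package_type: str  # "transfer", "SIP", "AIP" or "DIP"

@dataclass
class Location:
    purpose: str       # e.g., "Transfer Source", "AIP Storage"
    packages: List[Package] = field(default_factory=list)

@dataclass
class Space:
    access_protocol: str  # e.g., "Local Filesystem", "DSpace via SWORD2 API"
    locations: List[Location] = field(default_factory=list)

@dataclass
class Pipeline:
    uuid: str
    spaces: List[Space] = field(default_factory=list)
```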


Pipelines are essentially Archivematica installations registered by the Storage Service. Although institutions may have many pipelines, we currently use just one for born-digital processing. That being said, we've imagined scenarios where we'd consider adding more pipelines, if another one of the libraries or archives at the University of Michigan wanted to use Archivematica, for example, or if we ever wanted to use Archivematica for more than this one, fairly well-defined workflow and material type.


Pipelines have one or more spaces. Spaces allow Archivematica to connect to physical storage (e.g., a local filesystem or an NFS, or even DSpace/Fedora via SWORD v. 2, LOCKSS, DuraCloud or Arkivum), and users input all the necessary information (e.g., remote hostname and location of the export) for Archivematica to do so.

We make use of a number of local filesystem spaces that point to:
  • a "dropbox" that donors and field archivists use to transfer material;
  • a "legacy" space (really, two spaces) containing our old, pre-Archivematica backlog, where we have the automation-tools pointed; and
  • an "archivematica" space that Archivematica uses for ingest processes.
We also have a "DSpace via SWORD2 API" space, which we use to integrate Archivematica and DSpace. The configuration here looks a bit different than in the other local filesystem spaces, and notable differences include:
  • Archivists must enter a DSpace username and password--these are used to authenticate with DSpace.
  • Archivists must also enter a policy for restricted metadata, in JSON, to override any defaults in DSpace. When AIPs are "repackaged" into "objects" and "metadata" packages, the metadata package will get this policy. In our case, this points to a DSpace "group" that includes a handful of curation and reference archivists here, restricting access to only those archivists.
  • Finally, archivists must select an "Archive format" option. Since we're depositing "packages" of digital objects to DSpace (and DSpace only accepts single objects), they must be packaged into a 7Z or ZIP file. We use the latter, our thinking being that the ".zip" extension is fairly ubiquitous, and that as such there's a greater chance that researchers will recognize it (and know what to do with it).
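For reference, the restricted-metadata policy mentioned above is entered as a JSON list of resource policies. The exact shape depends on the DSpace version; the following is a hypothetical example with a placeholder group ID, not our production configuration:

```python
import json

# Hypothetical restricted-metadata policy: restricts the "metadata" package
# to one DSpace group. The field names follow the DSpace 5/6 REST
# resource-policy shape and may differ in other versions.
metadata_policy = [
    {
        "action": "READ",
        "groupId": "5",          # placeholder: curation/reference archivists group
        "rpType": "TYPE_CUSTOM",
    }
]

print(json.dumps(metadata_policy))
```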


Spaces have one or more locations, and locations are where you get into the nitty-gritty of associating an individual location on physical storage with particular "purposes" in Archivematica (e.g., transfer source locations, AIP and DIP locations, etc.). This next part was a bit confusing to me the first time I read it, so I'll quote directly from the documentation: "Each Location is associated with at least one pipeline; with the exception of Backlog and Currently Processing locations, for which there must be exactly one per pipeline, a pipeline can have multiple instances of any location, and a location can be associated with any number of pipelines."

We make use of the following locations (organized by purpose):
  • AIP Storage
    • We have a number of these locations that correspond to DSpace collections. Each is configured by pointing the Storage Service (and the DSpace space) to the DSpace REST API endpoint for that collection and giving it a name that you'll see in a dropdown when you get to the Store AIP microservice in Archivematica:

    • We also have one location on a local filesystem for content with restrictions. These AIPs end up going through a more specialized workflow that matches PREMIS Rights Statements we record in Archivematica with the appropriate "group" or access profile in DSpace, functionality that is not included with the standard DSpace integration.
  • Currently Processing: This is the location used by the Archivematica pipeline as it runs transfers through its various microservices. We've learned the hard way that this space takes a lot of management! We frequently run into 500 errors with the automation-tools that end up being caused by this space being full. Part of the reason it fills up quickly is that Archivematica is very conservative, holding onto copies on copies on copies of transfers in various subdirectories for various reasons, e.g., "rejected" (used when transfers are rejected in the dashboard), "failed" (used when transfers fail for some reason, usually because they're too big, which just exacerbates the "being full" problem) and "tmp" directories. These can be emptied through the "Administration" --> "Processing storage usage" tab of the dashboard, but we ended up just making a daily cronjob to empty these out.
  • Storage Service Internal Processing: This location is required for the Storage Service to run; it must be locally available to the Storage Service and must not be associated with any pipelines.
  • Transfer Backlog: This is where SIPs go when you select the "Send to backlog" option for "Create SIP(s)" in the "Administration --> Processing configuration" tab of the dashboard. This is an optional workflow step, but we make heavy use of it. For us, there can be some time lag between an initial accession of material and its subsequent processing and deposit to DeepBlue. This backlog location is safe and secure and serves as a temporary, "minimally viable" preservation environment for the original digital objects and the logs and METS file generated by Archivematica's initial transfer process. With Archivematica 1.6, thanks to some transfer backlog management development work by Simon Fraser University Archives, you can use a new "Backlog" tab in the dashboard to search and view backlogged transfers, download entire transfers or items from backlog and even perform transfer deletion requests.
  • Transfer Source: Archivematica looks to these locations when creating a new transfer. As mentioned earlier, we use a couple of these, a "dropbox" that donors and field archivists use to transfer material and a "legacy" space containing our old, pre-Archivematica backlog. Material in here is accessed (sometimes slowly if there's a lot in there!) when creating a transfer through the dashboard:
Selecting transfer source directories
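The daily cronjob that empties out the Currently Processing space (mentioned above) might look roughly like this in Python. The shared-directory path and subdirectory names are assumptions based on a default Archivematica install, so adjust them before use:

```python
import shutil
from pathlib import Path

# Assumed default shared-directory path; adjust for your install.
SHARED = Path("/var/archivematica/sharedDirectory")
SUBDIRS = ["rejected", "failed", "tmp"]

def empty_processing_dirs(shared=SHARED, subdirs=SUBDIRS):
    """Delete the contents of the given processing subdirectories,
    keeping the directories themselves in place."""
    removed = []
    for name in subdirs:
        directory = shared / name
        if not directory.is_dir():
            continue
        for child in directory.iterdir():
            if child.is_dir():
                shutil.rmtree(child)
            else:
                child.unlink()
            removed.append(child)
    return removed
```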


Packages are Transfers, SIPs, AIPs and DIPs uploaded to a location managed by the storage service. The Storage Service is also the place where requests to delete packages are fulfilled by an administrator.

Future Ideas for Storage Service Usage

You may have seen our recent post to the Archivematica Tech list about an API endpoint for posting locations to a space. We're interested in this to try to reuse metadata and further automate our own workflows, for example, in this Resource-to-Collection command-line utility we're working on that:
  • creates or updates a DSpace Collection from an ArchivesSpace Resource (using the DSpace API);
  • creates an Archivematica Storage Service Location for the DSpace Collection (in lieu of the endpoint, we're currently using Selenium with Python for this part);
  • creates and links an ArchivesSpace Digital Object for the DSpace Collection to the ArchivesSpace Resource (using the ArchivesSpace API); and
  • notifies the processor (using their Archivematica username) via a message on Slack (using the Slack API).
Deposit away, Dallas!
Who knows, maybe this or something like it could be a button in ArchivesSpace one day.
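The four steps of the Resource-to-Collection utility above could be sketched as payload builders like these. All field names, the "AS" purpose code, the resolver prefix, and the Slack message format are our assumptions; the real utility talks to live DSpace, Storage Service, ArchivesSpace and Slack APIs:

```python
# Hypothetical payload builders for the four Resource-to-Collection steps;
# the actual request bodies our utility sends may differ.

def dspace_collection_payload(resource_title):
    """Step 1: body for creating a DSpace Collection named after the Resource."""
    return {"name": resource_title}

def storage_location_payload(space_uuid, collection_handle):
    """Step 2: body for registering an AIP Storage location for the collection."""
    return {
        "space": space_uuid,
        "purpose": "AS",  # assumed Storage Service code for AIP Storage
        "relative_path": collection_handle,
    }

def aspace_digital_object_payload(collection_handle,
                                  resolver="http://hdl.handle.net/"):
    """Step 3: a minimal ArchivesSpace Digital Object pointing at the collection."""
    return {
        "title": collection_handle,
        "file_versions": [{"file_uri": resolver + collection_handle}],
    }

def slack_notification(processor, collection_handle):
    """Step 4: the message posted to the processor on Slack."""
    return f"@{processor}: collection {collection_handle} is ready for deposit!"
```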

Well, that's enough from us! How do you use the Storage Service? As always, please feel free to leave a comment or drop us an email: bhl-mellon-grant [at] umich [dot] edu. Thanks!

Tuesday, January 17, 2017

Appraisal and Arrangement Tab 101

With the forthcoming release of Archivematica version 1.6, folks are going to get a chance to roll out the new Appraisal and Arrangement Tab.  We're excited to implement Archivematica—and the Appraisal Tab—in a production environment and will continue to blog about our experiences (as well as the additional enhancements we've contracted with Artefactual Systems to complete).

In the meantime, we wanted to share two screencasts to help folks get up and running with the Appraisal Tab and also get a better idea of the digital archives workflow we're implementing here at the Bentley Historical Library.  Without further ado, I give you:

Part 1: Configuring ArchivesSpace and DSpace Integration within Archivematica

This screencast provides a step-by-step guide to adding instances of ArchivesSpace and DSpace for use with the Appraisal and Arrangement Tab. 

Part 2: Appraisal, Arrangement to ArchivesSpace and Deposit to DSpace

This screencast demonstrates the functionality of Archivematica's Appraisal and Arrangement Tab, including the appraisal of digital content within the Appraisal and Arrangement Tab, the arrangement of content to corresponding ArchivesSpace Resource records, and the deposit of content to a DSpace collection.

Some features in the above may change over time and with subsequent releases of Archivematica—and our local practice is sure to evolve as we get more experience under our belt—but we hope you find these videos helpful.  As always, please feel free to leave a comment or drop us a line at bhl-mellon-grant[at]

Monday, December 12, 2016

October 13-14, 2016 Mid-Michigan Digital Practitioners Organization Meeting

It is known by many names. OAISers know it as the "Monitor Technology" and "Monitor Designated Community" functions of the Preservation Planning Functional Entity of the OAIS Reference Model, and PLANETS folks know it as "Preservation Watch." You may even know it as reconnoitering. Whatever you call it, it's the process of gathering information to stay current on trends, best practices and cutting-edge implementations in the digital preservation world so that you can bring them back to your institution.

Here in Michigan, we have an NDSA Innovation Award winning (that's right, NDSA Innovation Award winning!) organization that does just that, recognized at Digital Preservation 2016 for its "highly original and successful organizational model in fostering innovation sharing and knowledge exchange": the Mid-Michigan Digital Practitioners.

This is such a wonderful group of people. Founded in 2013, the group's mission is to provide an open and local (and free!) forum for conversation, collaboration and networking for professionals working with digital collections in Michigan. Their biannual meetings, which rotate between institutions in the region, are designed democratically via pre- and post-conference surveys, typically draw between 40 and 70 participants, and have attracted student groups, practicing professionals, vendors and even the general public.

October 13-14, 2016 Meeting

Mike, Dallas and I attended the most recent meeting on October 13-14, 2016 at the MSU Libraries in East Lansing.

The first day consisted of two workshops:
  • Policies and Strategies for Managing Digital Assets, in which attendees had the opportunity to:
    • fine-tune and define the scope of digital assets for their own institutions;
    • analyze needs and solutions for their digital asset management; and 
    • practice drafting policy or updating existing policy for digital assets.
  •  XML/XSLT, which covered:
    • basic syntax of XML/XSLT;
    • tools for working with XML/XSLT;
    • using MarcEdit to transform MARC data to MODS, Dublin Core, EAD, etc.;
    • editing XSLT stylesheets; and
    • real world examples of XSLT in action. 

As you can see, this was a nice mix of policy and technology.

The second day was full of excellent presentations.
  • Digital Detroit: Getting Started with Video, although billed as a "getting started" presentation on video digitization, was a deep dive into file formats, codecs, containers and compression, as well as capture devices, setup and more.
  • Merging Traffic: Accessing Archival Collections and Museum Artifacts Through a Common Interface discussed the Henry Ford's new Digital Collections portal (and all the behind the scenes magic that went into it).
  • APIs in the Library: Selected Projects that Expand the Local Information Environment was a super informative introduction to Application Programming Interfaces, which I highly recommend to anyone who's ever wanted to get started with APIs (e.g., the ArchivesSpace API!).
  • Cataloging Archival Collections: Grouping Collections to Aid Retrieval gave an overview of one strategy for collocating collections in a discovery interface based on collection/research guides that the archivists have created to aid researchers.
  • In Access Update for GVSU, librarians there gave an update on where they were, where they are and where they're going with regard to providing access to digital collections at GVSU.
  • Using Open Refine for Data-driven Decision Making covered the way that one archivist used OpenRefine, a free, powerful, open-source tool for working with messy data, to cut processing time by approximately 99.8%.
  • Turning Your Smartphone Into a Scanner for Cumbersome Projects: Did you know that you can use your smartphone to digitize hard-to-capture paper and photographic items and quickly get them out to researchers? (Hint: You can.)
  • Digital Scanning Course Materials discussed the excellent, progressive work of the MSU Course Materials Program, a one-stop shop at the MSU Library for faculty seeking to use third-party copyrighted content in their courses.
  • Legacy Computer Challenge gave an overview of the challenges encountered when working with two legacy computers, a Mac Classic II (1991) and a Power Mac 6500 (1997), which was a catalyst for discussing a MMDP Best Practices and Tech Exchange.

Finally, during an open mic session we gave an update on our ArchivesSpace-Archivematica-DSpace Workflow Integration project, which wrapped up shortly after that meeting.

All in all, a busy and informative day!

Want to Learn More?

You can learn more about this "regional collective of librarians, archivists, museum curators, conservators, historians, scholars and more engaged in creating and curating digital collections in Mid-Michigan and the surrounding region" on the Mid Michigan Digital Practitioners website or catch up with them on Twitter. You should also feel free to join the listserv.

If you're in the area, the next meeting is March 23-24, 2017 in Detroit... we'll see you there!

Friday, November 4, 2016

The End is Just a New Beginning!

Greetings, all; as hard as it is to believe, the Bentley Historical Library's ArchivesSpace-Archivematica-DSpace Workflow Integration project has come to a close.

October 31 marked the end of two and a half years of intense planning, development, and testing but it also signaled the beginning of a new phase as we here at the University of Michigan begin to implement the project outcomes in a full production environment.

While this development work won't be available until version 1.6 of Archivematica (release date forthcoming), we wanted to take this opportunity to glance backward and also look ahead...

Project Outcomes

In our inaugural blog post back on April 8, 2015, we identified three major development objectives for the project:
1. Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new "Appraisal and Arrangement" tab in the system dashboard.
2. Load (and create) ASpace archival object records in the Archivematica "Appraisal and Arrangement" tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as 'digital objects' in ASpace and deposited as discrete 'items' in DSpace.  This work will build upon the SIP Arrangement panel developed for Simon Fraser University and the Rockefeller Archives Center's Archivematica-Archivists' Toolkit integration (as demonstrated around the 12 minute point of the first video here).
3. Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/'href' values for <dao> elements in exported EADs.

I am extremely pleased to announce that we have achieved each of these outcomes in the development work that concluded on October 31. More specifically, the project has resulted in:

1. The creation of a new Appraisal and Arrangement tab in Archivematica that will permit users to characterize, review, arrange, and describe digital archives with such features as:

  • Browsing the folder hierarchies of transfers in the "Backlog" pane.

  • Identifying file format distributions (in both tables and pie charts) and sensitive personal information (Social Security and Credit Card numbers) in the "Analysis" pane.

  • Displaying items in a "File List" pane, with contents updated based upon selections in the Backlog and Analysis panes.

  • Previewing content (using available web browser plugins) within the Analysis pane (with the ability to download and locally render other file types).

  • Tagging content (to aid in archival description, the identification of sensitive information, deaccession decisions, etc.) with the added ability to facet by tags in both the Backlog and File List panes.

Archivematica Appraisal and Arrangement Tab
What's in your transfer?
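
The sensitive-data scan relies on bulk_extractor under the hood (as we understand it); the toy sketch below just illustrates the pattern-matching idea, with a Luhn checksum to cut down on false credit card hits:

```python
import re

# Hypothetical sketch only -- Archivematica delegates this work to bulk_extractor.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d{13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to weed out random digit runs flagged as card numbers."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pii(text: str) -> dict:
    """Report Social Security and (Luhn-valid) credit card number candidates."""
    return {
        "ssn": SSN_RE.findall(text),
        "card": [c for c in CARD_RE.findall(text) if luhn_valid(c)],
    }
```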

2. The integration of Archivematica and ArchivesSpace, so that users can:

  • Review, create, and edit archival description from ArchivesSpace directly within Archivematica (with information being written back to ASpace via its API) without having to switch between applications/browser windows.

  • Drag and drop content from the Backlog pane onto archival description in the ASpace pane, thereby associating data with metadata (and also establishing a Submission Information Package ready to undergo Archivematica's Ingest procedures).
  • Elect to use the ArchivesSpace functionality (or simply arrange content into Submission Information Packages without employing ASpace) whether or not they have a DSpace repository.
Associating digital content with archival description
Hey, you got your digital content in my archival description!
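
The write-back happens over ArchivesSpace's REST API. As a rough sketch of what such a call involves (the endpoint path and session header follow the public ASpace API; the host, repository id, and record values here are hypothetical):

```python
import json
import urllib.request

ASPACE = "http://aspace.example.edu:8089"   # hypothetical backend host

def make_archival_object(title, level, resource_uri):
    """Build a minimal JSON body for POST /repositories/:repo_id/archival_objects."""
    return {
        "jsonmodel_type": "archival_object",
        "title": title,
        "level": level,                      # e.g. "file" or "item"
        "resource": {"ref": resource_uri},   # the parent resource record
    }

def post_archival_object(session_token, repo_id, record):
    """POST the record back to ASpace (token comes from POST /users/:user/login)."""
    req = urllib.request.Request(
        f"{ASPACE}/repositories/{repo_id}/archival_objects",
        data=json.dumps(record).encode("utf-8"),
        headers={"X-ArchivesSpace-Session": session_token,
                 "Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)       # the API answers with JSON
```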

3. The integration of Archivematica and DSpace so that:

  • Users select a DSpace collection from available Storage Service locations during the 'Store AIP' microservice.

  • The Archivematica Storage Service splits the AIP into two archive files (one for the digital content, which will be publicly accessible by default, and the other for administrative metadata and log files, which will be restricted from public access by default) and automatically deposits them as a new item to the selected DSpace collection.

  • Upon successful deposit, a new digital object record (with the unique DSpace handle URL for the item) will be created in ASpace and associated with the appropriate archival object.
Content in DSpace
My repository has a first name, it's D-S-P-A-C-E...
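
The Storage Service performs that split internally, but the logic is easy to picture: walk the AIP, routing objects/ (the content) into one archive and everything else (METS, logs, and other administrative metadata) into the other. A simplified sketch, assuming the AIP's usual directory layout:

```python
import os
import zipfile

# Illustration only -- the Archivematica Storage Service does this itself.
def split_aip(aip_dir, open_zip, restricted_zip):
    """Write digital content to one archive and metadata/logs to another."""
    with zipfile.ZipFile(open_zip, "w") as open_zf, \
         zipfile.ZipFile(restricted_zip, "w") as restricted_zf:
        for root, _, files in os.walk(aip_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, aip_dir)
                # objects/ holds the publicly accessible content by default
                target = open_zf if rel.startswith("objects" + os.sep) else restricted_zf
                target.write(path, rel)
```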

4. Documentation related to the use of the Appraisal tab.

Max Eckard has produced a manual for use by the Bentley's archivists and student employees and he's looking to contribute to the documentation of the Appraisal tab in the Archivematica version 1.6 user manual.

Bentley Historical Library digital processing manual
It's just a jump to the left, and then a step to the right...

Because a (moving) image is worth a thousand words, I invite you to feast your eyes (and ears) on this rather brisk demo performed by Max:

Next Steps

As mentioned above, with the grant's conclusion we're moving forward with getting these new features implemented in a production environment.  This work is going to proceed on several fronts:

1. Configuring Archivematica to work with our local, highly customized version of DSpace ("Deep Blue").  

The developers at Artefactual Systems worked with an out-of-the-box copy of DSpace (version 5.5); UM's instance (around since 2006) has had a fair number of bells and whistles added to it over the years.  As a result, Max is spending a lot of quality time on Slack with colleagues in Michigan's Library Information Technology group.

2. Customizing metadata fields to be used in Deep Blue to accommodate our qualified Dublin Core (e.g., dc.contributor.author as opposed to the unqualified dc.creator).

3. Establishing workflows to streamline the deposit of restricted content to the repository.

While our default workflow involves content that will be publicly accessible via DSpace, the Bentley also encounters a decent amount of material that must be restricted from the general public (due to sensitive information, regulations such as FERPA or HIPAA, donor requests, etc.) or that can only be accessed in our reading room (due to copyright issues).  We've established some semi-automated strategies for dealing with these materials and will look at trying to streamline this process.

4. Identifying and addressing bugs in advance of going live (and of the Archivematica 1.6 release).

As we've been testing the Appraisal tab, we've reported a number of bugs to Artefactual Systems and also identified some enhancements related to local practice that we will contract with Artefactual Systems to address independently of our grant project.

5.  Training additional staff so that all of our processing archivists and graduate students are using Archivematica to arrange and describe digital archives in addition to their work with physical and analog materials.

Thank you!!!

Finally, we'd like to thank all the following organizations and individuals for their steadfast support and ample contributions to this project!
  • The Andrew W. Mellon Foundation
    • Donald J. Waters, Senior Program Officer
    • Kristen C. Ratanatharathorn, Senior Program Associate
  • The Bentley Historical Library
    • Terrence J. McDonald, Director
    • Nancy Bartlett, Associate Director
    • Angela Clark, Business Administrator
    • Kellie Carpenter, Administrative Assistant
  • The University of Michigan Library
    • John Weise, Associate Director of Library IT and Head, Digital Library Platform & Services
    • Aaron Elkiss, Systems Programmer/Analyst Senior
    • Jose Blanco, Applications Programmer/Analyst Senior
  • Everyone at Artefactual Systems (especially Evelyn, Justin, Sarah, Nick, Holly, Dan, and Radda as well as Misty and Courtney)
  • The readers of this blog and everyone who reached out to us through comments, emails, tweets, and professional meetings. Thank you!!!!  Your questions, comments, and overall interest in the project were profoundly valuable!

We plan to continue blogging about our engagement with innovative archival practice and technology—so please continue to stop by to see what's new here on Beal Avenue.  Until next time, keep on keeping on!

Tuesday, October 18, 2016

Customizing Archivematica's Format Migration Strategies with the Format Policy Registry (FPR)

Over the past couple weeks we've been exploring the ways in which our current normalization strategies (to read them for yourself, see our Format Conversion Strategies for Long-Term Preservation) compare to those in Archivematica. Below you'll find a brief introduction to Archivematica's Format Policy Registry (FPR), an overview of the process we went through to compare our format policies to Archivematica's and a couple of approaches we've taken to reconcile the differences between the two.

Hope you enjoy it!

Format Policy Registry (FPR)

Located in the "Preservation planning" tab, the Archivematica Format Policy Registry (FPR) is a database which allows users to define format policies for handling file formats.


In the FPR, "a 'format' is a record representing one or more related format versions, which are records representing a specific file format" (Format Policy Registry [FPR] documentation). As you can see from the example above, the "Graphics Interchange Format" format is made up of 3 specific versions, "1987a," "1989a," and "Generic gif."

Formats themselves are described this way:
  • Description: Text describing the format, like a name.
  • Version: The version number for that specific format.
  • PRONOM ID: The specific format version’s unique identifier in PRONOM, the UK National Archives’s format registry.
  • Access format? and Preservation format?: This is where you indicate whether something is suitable for access purposes, preservation purposes, both, or neither.
Formats also have UUIDs, are enabled or disabled, and have a number of associated actions (which we'll talk about later). They also have a group, "a convenient grouping of related file formats which share common properties" (ibid.), e.g., "Video." All of this is customizable.
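
To make the record structure concrete, here's the GIF example sketched as data. Field names mirror the FPR descriptions above; the PRONOM PUIDs and flag values are our best recollection, so verify them against the registry:

```python
# A sketch of how an FPR format record and its versions fit together;
# values are illustrative, not authoritative.
gif_format = {
    "description": "Graphics Interchange Format",
    "group": "Image (Raster)",
    "versions": [
        {"description": "1987a", "version": "1987a", "pronom_id": "fmt/3",
         "access_format": False, "preservation_format": False},
        {"description": "1989a", "version": "1989a", "pronom_id": "fmt/4",
         "access_format": False, "preservation_format": False},
        {"description": "Generic gif", "version": None, "pronom_id": None,
         "access_format": False, "preservation_format": False},
    ],
}
```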

Format Policies

In Archivematica, format policies act on formats. Format policies are made up of:
  • Tools: Tools are things like 7Zip, ImageMagick's convert command, ffmpeg, FFprobe, FITS, Ghostscript, Tesseract, MediaInfo, etc., which come packaged with Archivematica.
  • Commands: These are actions that you can take with a tool, e.g., "Transcoding to jpg with convert" or "Transcoding to mp3 with ffmpeg." Commands can be used in one of the following ways:
    • Identification: The process of trying to identify and report the specific file format and version of a digital file.
    • Characterization: The process of collecting information (especially technical information) about a digital file.
    • Normalization: Migrating/transcoding a digital file from an original format to a new file format (for access or preservation purposes).
    • Extraction: The process of extracting digital files from a package format such as ZIP files or disk images.
    • Transcription: The process of performing Optical Character Recognition (OCR) on images of textual material.
    • Verification: The process of validating a digital file produced by another command. Right now these are pretty simple, e.g., check that it isn't 0 bytes.
  • Rules: This is where you put it all together and apply a specific command to a specific format, saying something like: "Use the command 'transcoding to tif with convert' on the 'Windows Bitmap' format for 'Preservation' purposes." When browsing the FPR you can actually see how well these policies are working out. In our case, this particular policy has been successful for 2 out of 2 digital files we attempted it on.
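
The Windows Bitmap example above can be pictured as a small record tying format, purpose, and command together, along with the success statistics the FPR reports (a simplified, hypothetical view of the data model):

```python
# Hypothetical, flattened view of one FPR rule and its track record.
rule = {
    "purpose": "preservation",
    "format": "Windows Bitmap",
    "command": "Transcoding to tif with convert",  # wraps ImageMagick's convert
    "attempts": 2,
    "successes": 2,
}

def success_rate(rule):
    """The FPR shows how well a rule is working out -- here, 2 of 2 attempts."""
    return rule["successes"] / rule["attempts"] if rule["attempts"] else None
```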

The first time a new Archivematica installation is set up, it will register the Archivematica install with the FPR server [1], and pull down the current set of format policies. FPR rules can be updated at any time from within the Preservation Planning tab in Archivematica (and these changes will persist through future upgrades). You also have the option of refreshing your version with the centralized Archivematica FPR server version, if you so choose.

Customizing Archivematica's Format Migration Strategies

What follows is our initial foray into customizing Archivematica's format migration strategies. For a more detailed look at this as well as customizing other aspects of the FPR, you should definitely check out the documentation.

What We Do Now

For some context, we've been normalizing files for quite some time. Because we must contend with thousands of potential file formats, a number of years ago we adopted a three-tier approach to facilitate the preservation and conversion of digital content:
  • Tier 1: Materials produced in sustainable formats will be maintained in their original version.
  • Tier 2: Common “at-risk” formats will be converted to preservation-quality file types to retain important features and functionalities.
  • Tier 3: All other content will receive basic bit-level preservation.
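
In pseudo-code terms, the tier logic is a simple dispatch on format; the format lists below are abbreviated, hypothetical stand-ins for our full conversion strategies document:

```python
# Abbreviated, illustrative format lists -- not our actual policy tables.
SUSTAINABLE = {"txt", "csv", "xml", "wav", "png"}                      # Tier 1
AT_RISK = {"doc": "docx", "xls": "xlsx", "ppt": "pptx", "bmp": "tif"}  # Tier 2

def preservation_action(ext):
    """Map a file extension to one of the three tiers' actions."""
    if ext in SUSTAINABLE:
        return ("tier 1", "retain original")
    if ext in AT_RISK:
        return ("tier 2", f"normalize to {AT_RISK[ext]}")
    return ("tier 3", "bit-level preservation only")
```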

These, by the way, are being incorporated into a more comprehensive Digital Preservation Policy which we hope to share with others in the near future...

Comparing Our Format Migration Strategies to Archivematica's

We decided to make some customizations to Archivematica's FPR because some of our existing policies didn't quite match up with Archivematica's. We discovered this by doing an initial comparison of the FPR with our existing Format Conversion Strategies for Long-Term Preservation.

For a detailed list of all of our findings, please see this spreadsheet. Basically, however, here's how things broke down for the 62 formats in Tiers 1 and 2 that I examined in depth:
  • Formats we recognized as preservation formats, and are an Archivematica preservation format.

Examples: Microsoft Office Open XML formats, OpenDocument formats, TXT, CSV and XML files, WAV files, PNG and JPEG2000 files, etc.

  • Formats we recognized as preservation formats, but aren't an Archivematica preservation format. These have a normalization pathway.

Examples: AIFF and MP3 files, and also lots of video: AVI, MOV, MP4.

  • Formats we recognized as preservation formats, but aren't an Archivematica preservation format. These have no normalization pathway.

These were the most varied, including files belonging to the PDF, Word Processing, Text, Audio, Video, Image, Email and Database groups. Examples: PDF/A files, RTF and TSV files, FLAC and OGG files, TIFF files and SIARD files.

  • Formats we didn't recognize as preservation formats, but are an Archivematica preservation format.

These were mostly older Microsoft Office formats. Examples: DOC, PPT and XLS files.

These, by the way, are our most common Tier 2 formats based on an analysis of our already processed digital archives I did for Code4Lib Midwest this year:

As you can see, all but one of the top five Tier 2 formats are older Microsoft Office formats. What can I say? We get a lot of this kind of record!

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica's normalization pathway is the same as ours.

Lots of raster images here. Examples: BMP, PCT and TGA files.

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica's normalization pathway is not the same as ours.

These all stemmed from a difference in preferred preservation target for normalized video formats. We typically converted these to MP4 files with H.264 encoding, while Archivematica prefers the MKV format. Examples: SWF, FLV and WMV files.
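
To make the difference concrete, here is roughly how the two pathways' ffmpeg invocations compare. The flags are illustrative rather than exact production commands, and Archivematica's MKV pathway uses lossless FFV1 video as we understand it:

```python
def our_mp4_args(src, dst):
    # Our preferred target: MP4 container, H.264 video, AAC audio (illustrative)
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-c:a", "aac", dst]

def archivematica_mkv_args(src, dst):
    # Archivematica's default target (as we understand it): Matroska container,
    # lossless FFV1 video, uncompressed PCM audio
    return ["ffmpeg", "-i", src, "-c:v", "ffv1", "-c:a", "pcm_s16le", dst]
```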

  • Formats we didn't recognize as preservation formats, and aren't an Archivematica preservation format. For these, Archivematica does not even have a normalization pathway.

Essentially, these were files that we had a normalization pathway for, but Archivematica doesn't. Examples: Real Audio files, FlashPix Bitmap and Kodak Photo CD Image files, and PostScript and Encapsulated PostScript files.

  • Finally, formats we didn't recognize as preservation formats and that Archivematica doesn't recognize at all.

Examples: EML files and other plain text email formats.


To be honest, I was a bit surprised by just how different our local practice was from Archivematica's, considering we both look to the same authorities on this type of thing! This diversity led to a number of different approaches to customizing Archivematica's Format Migration Strategies, which I'll briefly detail here.

Do Nothing

For those formats that we agree on, i.e., we both agreed they were preservation formats, or we both agreed they were not preservation formats and shared the same normalization pathway, we didn't do anything! Easy peasy lemon squeezy.

Disable a Normalization Rule, Replace a Format

This we did for formats we recognized as preservation formats, that aren't an Archivematica preservation format but that do have a normalization pathway in Archivematica. Basically, we disagreed with the out-of-the-box FPR and we weren't interested in having Archivematica doing any normalization on these. After we went to check the Library of Congress Sustainability of Digital Formats site to ensure that we weren't totally off...

...we went to the FPR and disabled the normalization rule...

...and verified that we'd done it correctly...

...then searched for the format itself...

...clicked "Replace"...

...and set the format to a Preservation format.

You can also easily verify that Archivematica got the message...

Replace a Format

A somewhat simpler approach, this we did when there were formats we recognized as preservation formats, but that aren't an Archivematica preservation format and have no normalization pathway. Since Archivematica didn't really have a better alternative, we stuck to our existing policies.

This was as simple as finding the appropriate format, clicking "Replace"...

...and setting it as a Preservation format.

Create a Command, Edit a Normalization Rule 

This started to get a bit more complicated. We did this for formats we didn't recognize as preservation formats, and neither did Archivematica, but Archivematica's normalization pathway is not the same as ours. Again, these all stemmed from a difference in preferred preservation target for normalized video formats.

For these, Archivematica didn't have an existing command that worked for our purposes (it did have a tool, ffmpeg, that would). We had to write a little something up (which was inspired by other Archivematica commands) [2]...

...create a new normalization command...

...add in the information Archivematica needs for  the new command...

...then go in and replace the rule for the appropriate format(s) with the appropriate command (our new one!)...

...and finally verify that it had been changed.
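
For a sense of what that "little something" looks like: an FPR normalization command is a short script containing placeholder variables that Archivematica substitutes at run time. The variable names below reflect our reading of the FPR documentation, and the ffmpeg flags are illustrative:

```python
# A bashScript-style normalization command with FPR-style substitution
# variables (variable names per our reading of the FPR docs).
COMMAND = ('ffmpeg -i "%fileFullName%" -c:v libx264 -pix_fmt yuv420p -c:a aac '
           '"%outputDirectory%%prefix%%fileName%%postfix%.mp4"')

def render(command, variables):
    """Mimic the substitution Archivematica performs before running a command."""
    for name, value in variables.items():
        command = command.replace(f"%{name}%", value)
    return command
```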

Create a Normalization Rule

This we did for formats we didn't recognize as preservation formats, and neither did Archivematica, but for which Archivematica does not even have a normalization pathway (and we did). For these, we wanted to have Archivematica use our existing normalization pathway.

To create a new rule, we selected the "Create New Rule" option...

...and entered the new information (purpose, original format, and command to use) for the file format for which we wanted to create a new policy.

Manual Normalization and Other Thoughts...

That leaves us with a couple of outstanding issues, namely, legacy Microsoft Office documents and EML and other email formats (which Archivematica doesn't recognize at all--because the tools Archivematica uses for file format identification don't recognize them or they aren't registered in PRONOM).

The "ubiquity" argument aside, we'd really love to do something about older Microsoft Office documents, especially since currently these are the most common formats that we normalize. At the moment we use LibreOffice's Document Converter to handle conversion to a more sustainable format, i.e., Microsoft Office Open XML. However, Archivematica has looked into LibreOffice with the following results:
  • LibreOffice normalization led to significant losses in formatting information.
  • LibreOffice sometimes hangs, causing any future LibreOffice jobs to fail until an administrator manually kills the service.
  • LibreOffice sometimes reports that it succeeded despite not actually succeeding, making it difficult to determine whether or not the job really succeeded.
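
Those failure modes can be guarded against outside LibreOffice itself, for instance with a wrapper that enforces a timeout (so a hang kills only that job) and checks that the expected output file actually exists (so a false "success" is caught). The soffice invocation in the comment is illustrative:

```python
import os
import subprocess

def convert_doc(cmd, expected_output, timeout=300):
    """Run a conversion command, guarding against hangs and false successes."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # the hung process is killed; later jobs are unaffected
    # LibreOffice can exit 0 without producing output, so verify the file exists
    return result.returncode == 0 and os.path.exists(expected_output)

# e.g. convert_doc(["soffice", "--headless", "--convert-to", "docx",
#                   "--outdir", "/tmp/out", "report.doc"], "/tmp/out/report.docx")
```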

There may also be options here for converting to PDF, at least for documents. In the interim, we're still examining our options. At the very least we can change the FPR so that these formats are not recognized as preservation formats; we'll be looking into alternative approaches and will plan to report back when appropriate.

As for the email formats, we currently use a tool called aid4mail to convert these to MBOX files. This is a proprietary program, though, and only works in Windows, so we're looking into ways that we might manually normalize these files outside Archivematica (and associate different versions of files with one another inside Archivematica). This can be done, but we're looking into ways of doing it efficiently in batch; again, we plan to report back when we've got something figured out.
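
For the curious, converting a folder of plain .eml messages into a single MBOX can be done with Python's standard library alone. This is a minimal sketch; it doesn't handle the richer source formats (like PST mail stores) that aid4mail covers:

```python
import email
import glob
import mailbox
import os

def emls_to_mbox(eml_dir, mbox_path):
    """Append every .eml file in eml_dir to a single MBOX file."""
    box = mailbox.mbox(mbox_path)
    try:
        for path in sorted(glob.glob(os.path.join(eml_dir, "*.eml"))):
            with open(path, "rb") as f:
                box.add(email.message_from_binary_file(f))
    finally:
        box.flush()   # write the accumulated messages out
        box.close()
    return mbox_path
```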

To the FPR and Beyond!

Alright! That's about it for customizing the FPR; I think we've covered (in at least a basic way) all the different angles (with the exception, perhaps, of introducing a new tool to Archivematica!).

By the way, one of the most exciting things about the FPR is that since ours (and yours!) is actually registered with the Archivematica server, one day we all might be able to share this information in a more efficient fashion!

Have you customized the FPR? Are you too excited about the possibility of sharing FPR format policies via linked data? Let us know in the comments!

[1] Format policies are maintained by Artefactual, Inc., who provide a freely-available FPR server. This server stores structured information about normalization format policies for preservation and access.
[2] This could also have been written in Python.