Thursday, October 6, 2011

Adventures in web archiving

As a web curator, I do quality control of archived websites as part of my everyday work. With the help of our able student assistants, we review each archived site to determine whether we have captured it successfully. Basic quality control includes making sure that the archived site resembles the live site, that internal links work, that files (for example, PDFs of reports) can be downloaded, and so on. More advanced quality control, which happens when a site is not captured successfully with our default settings, involves evaluating crawl reports and adjusting (and re-adjusting) the settings for that site. Most of the time, crawl issues can be tackled and resolved by the combined brilliance of the curators, library programmers, and the unflappable Archive-It support staff. But occasionally, there are sites that seem designed to test your skills and your patience. Just when you think you’ve solved the problem and the road to a perfect capture seems clear, the site is redesigned, or changes URLs, or disappears only to reappear months later with a new host of issues.
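For the curious, the basic checks described above (internal links work, files such as PDFs can be downloaded) could be roughed out in a few lines of Python. This is only a minimal sketch, not the tooling we or Archive-It actually use: the names `internal_links`, `qc_report`, and `fetch_status` are hypothetical, and `fetch_status` is assumed to be a function you supply that returns the HTTP status code for a URL in the archive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute same-host URLs found in html, in page order, no duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def qc_report(page_url, html, fetch_status):
    """Summarize internal links for one page.

    fetch_status is a hypothetical callable (an assumption of this sketch):
    it takes a URL and returns the HTTP status code seen in the archive.
    """
    links = internal_links(page_url, html)
    # Anything other than 200 is flagged for a human to look at.
    broken = [url for url in links if fetch_status(url) != 200]
    # PDFs are listed separately so a reviewer can try downloading them.
    pdfs = [url for url in links if urlparse(url).path.lower().endswith(".pdf")]
    return {"checked": len(links), "broken": broken, "pdfs": pdfs}
```

A report like this only narrows down where to look. Whether the archived page actually resembles the live one still takes a person and a pair of browser windows.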
Web Curator Nightmare: an incredibly valuable site disappears before it can be captured, then reappears months later at a completely unrelated URL, redesigned completely in Flash and Java and a programming language that didn’t exist five minutes ago, with content hosted on eight different servers inexplicably returning HTTP 404 codes all day except for three hours during the vernal equinox.
The Trials of Web Archiving: A Saga in Six Screenshots...
Victory! Almost!
Formatting is *slightly* off.

This screen appears when something was not captured or is missing from the web archive.

More defeat.
We originally captured the Arabic content, and then the URL changed...
Crushing defeat.
This shows the URL of a PDF that was not captured.
Victory at last! Again!
The new website, captured and functional.
There should be content here...


  1. I tried to put in a jump/break, but Blogger was not having it.

  2. This is great! I love the drama of victory and defeat. As an MLIS student, I'm very interested in web curation. But I can't seem to find out who you are or where you work...?
