Day of Digital Archives: Adventures in web archiving

Thursday, October 6, 2011


As a web curator, quality control of archived websites is part of my everyday work. With the help of our able student assistants, we review each archived site to determine whether we have captured it successfully. Basic quality control includes making sure that the archived site resembles the live site, that internal site links work, that we can download files (for example, PDFs of reports), and so on. Advanced quality control comes into play when a site is not captured successfully with our default settings; it involves evaluating crawl reports and adjusting (and re-adjusting) the settings for that particular site. Most of the time, crawl issues can be tackled and resolved by the combined brilliance of the curators, library programmers, and the unflappable Archive-It support staff. But occasionally, there are sites that seem designed to test your skills and your patience. Just when you think you’ve solved the problem and the road to a perfect capture seems clear, the site is redesigned, or changes URLs, or disappears only to reappear months later with a new host of issues.
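For the curious, the basic checks described above can be sketched in a few lines of Python. This is only an illustration, assuming a crawl report has been exported as rows of URL, HTTP status code, and content type; the `CrawledUrl` record, the `review_crawl` function, and the sample URLs are hypothetical and do not reflect Archive-It's actual report format or tools.

```python
from collections import Counter
from typing import NamedTuple


class CrawledUrl(NamedTuple):
    """One row of a hypothetical crawl report (not a real Archive-It format)."""
    url: str
    status: int     # HTTP status code recorded by the crawler
    mime_type: str  # content type recorded by the crawler


def review_crawl(rows):
    """Split report rows into failed captures and documents to spot-check."""
    # Anything in the 4xx or 5xx range was not captured successfully.
    failed = [r for r in rows if r.status >= 400]
    # Captured PDFs still need a human to confirm they open in the archive.
    documents = [
        r for r in rows
        if r.status == 200 and r.mime_type == "application/pdf"
    ]
    # A tally of status codes gives a quick overall picture of the crawl.
    status_counts = Counter(r.status for r in rows)
    return failed, documents, status_counts


# Made-up sample data standing in for an exported report.
sample = [
    CrawledUrl("http://example.org/", 200, "text/html"),
    CrawledUrl("http://example.org/reports/annual.pdf", 200, "application/pdf"),
    CrawledUrl("http://example.org/about", 404, "text/html"),
]

failed, documents, status_counts = review_crawl(sample)
print("Failed captures:", [r.url for r in failed])
print("PDFs to open by hand:", [r.url for r in documents])
print("Status codes seen:", dict(status_counts))
```

A script like this can only narrow down where to look. Deciding whether the archived site actually resembles the live one still takes a person clicking through the pages.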

Web Curator Nightmare: an incredibly valuable site disappears before it can be captured, then reappears months later at a completely unrelated URL, redesigned completely in Flash and Java and a programming language that didn’t exist five minutes ago, with content hosted on eight different servers inexplicably returning HTTP 404 codes all day except for three hours during the vernal equinox.

The Trials of Web Archiving: A Saga in Six Screenshots...

Victory! Almost!