Thursday, October 6, 2011

13 Million Documents and Counting

I manage the Legacy Tobacco Documents Library (LTDL, http://legacy.library.ucsf.edu), a digital archive of over 13 million documents, and I scanned every one of them myself! (NOT) Most of the documents were spidered from websites that the major US tobacco companies are required by court order to host. In 1998 the companies agreed to this as part of a settlement of lawsuits brought by 46 states trying to recover money they’d spent treating people with tobacco-related diseases. Then, in 2006, a federal judge found the companies guilty under the Racketeer Influenced and Corrupt Organizations Act of creating an illegal enterprise for the purpose of defrauding the public, and she ordered them to continue making documents available until 2021.

Above: Typical document http://legacy.library.ucsf.edu/tid/dup28d00

As an archivist I’d had plenty of experience keeping track of things (photographs, financial records and textiles, for example), but I’d never worked with digital documents until I started as a project archivist with LTDL in 2004. It wasn’t until I started managing LTDL in 2006, though, that I learned that having a great computer programmer is the key to a successful digital archive. While some people evaluate success by the number of visitors to the website (we get an average of 15,000 visits each month), I measure our competence by our ability to correctly spider, parse and ingest hundreds of thousands of pages of documents, each with constantly changing metadata fields and values, from at least three websites each month.

The other key component of our success is a strong relationship with some of our users, who include academic researchers, journalists, lawyers, public health advocates, government agencies and students. My office is in the same building as the Center for Tobacco Control Research and Education, and I frequently consult with faculty and post-docs there about their research. I go to presentations about their work with the documents and am able to make decisions about LTDL with their needs and desires in mind. In the past month, LTDL had visitors from 132 countries, and I am able to communicate with some of them via GlobaLink, the international tobacco control listserv. While we get many visitors who look at a document and leave, many of our users spend countless hours researching the tobacco industry’s advertising, manufacturing, marketing, sales, lobbying and scientific research activities. In fact, close to 600 peer-reviewed articles using the documents have been published.

My biggest headaches come from the donors of our material: the tobacco companies that do not want the world to see how they’ve manipulated science, lied to the public and covertly marketed cigarettes to youth, who are at greater risk of becoming addicted than adults. Whether from a lack of attention to detail, incompetence, or outright resistance, the companies continually fail to produce documents that should be uploaded to their sites, post downloadable metadata files that are missing fields or contain values that differ from those shown on the website, clutter up the corpus with duplicate documents and give incorrect document dates in the indexing information. There is no easy ingest of new documents: each download contains numerous subsets of documents, each requiring specialized handling. There’s always a tradeoff between just ingesting the whole batch as is, with duplicates, incorrect metadata, etc., and spending the time trying to fix the most egregious anomalies.
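To illustrate that tradeoff, a first triage pass over a downloaded batch might look something like the minimal Python sketch below. This is not LTDL's actual ingest code: the function name `triage_batch` and the field names `id`, `title` and `date` are assumptions made up for the example, and a real batch would need far more specialized handling.

```python
# Hypothetical required fields; the real metadata schema is not described in this post.
REQUIRED_FIELDS = ("id", "title", "date")


def triage_batch(records):
    """Split a batch of metadata records (dicts) into clean and flagged lists.

    A record is flagged if a required field is missing or empty, or if its
    id has already appeared earlier in the same batch.
    """
    seen_ids = set()
    clean, flagged = [], []
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            flagged.append((record, "missing: " + ", ".join(missing)))
        elif record["id"] in seen_ids:
            flagged.append((record, "duplicate id"))
        else:
            seen_ids.add(record["id"])
            clean.append(record)
    return clean, flagged
```

The clean records could be ingested right away, while the flagged ones are set aside with a reason attached, so someone can decide later whether they are worth the time to fix.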

But thank goodness the Court ordered the cigarette companies to make documents and metadata available, warts and all. I also manage the Drug Industry Document Archive (DIDA, http://dida.library.ucsf.edu), a digital archive of pharmaceutical industry documents from lawsuits and government hearings. No court has ordered a drug company, not even Merck, which continued to sell Vioxx after it knew of the drug’s increased risk of deadly heart attacks, to make its documents available digitally with indexing information. So we’ve had to create our own metadata for the DIDA documents, something which is extraordinarily expensive when you’re talking about thousands of documents. With no original order, series or file folders, these documents require item-level description, and as a result we’ve been unable to add many documents to DIDA. Crowdsourcing the creation of metadata seems like the only possible solution to this problem. But, meanwhile, it’s back to the task of spidering thousands of new tobacco company documents. It’s all in a day’s work!
