Thursday, October 6, 2011

Is This Bit Rot?

The term "bit rot" gets batted around a lot, but isn't so easy to nail down. Wikipedia's bit rot entry is surprisingly short and contains multiple definitions: decay of storage media, decay of data on storage media, and degradation of software programs. The entry is flagged as needing citations to reliable sources.

So maybe I'm not alone in wondering what bit rot looks like in a word processing file. At the Hoover Institution Archives, where I work, one manuscript collection contains hundreds of 5.25-inch floppy disks. The disks present many problems, but today I'm looking at one containing files that open on a PC. They suffer from some malady, as indicated by this sample from a letter dated January 21, 1986:

-------begin sample text------

Thankó verù mucè foò á lovelù luncheoî anä somå splendiä views® Wå 
imaginå yoõ no÷ iî Indiá anä wondeò iæ yoõ arå listeninç tï somå oæ thå 
samå Indianó witè whoí wå talkeä yearó ago® Thå artistó anä economistó 
werå quitå remarkable¬ buô thå politicaì scientistó useä tï talë abouô 
atomiã bombó foò Indiá witè eager¬ burninç eyeó whilå beinç verù carefuì 
noô tï kilì anù insects® (Severaì haä theiò beardó covereä iî whitå 
silk so that no insect would get caught and be stifled there.)

-------end sample text------

The file name, Enid, has no file extension, so it is difficult to determine what software was used to create the file. The sample above is from the rendering in MS Word with Windows character encoding, but no matter what software I open the file in, I get some gibberish. But it isn't all gibberish--the last line is completely legible. You'll see that sometimes a letter displays correctly, like the "n" in "insect," but not in other cases, like the "n" that should end "luncheoî" in the first line. Is this bit rot?

Regardless of the diagnosis, the next question is what to do with it. Because we can infer what many of the corrupted characters should be, we can match them to their actual counterpart, like this sample:

Corrupted versionTrue character

Then, we could use the find and replace feature to fix all the corrupted characters. But it is more difficult to infer the correct characters for corrupted numbers. In addition, I assume that matching corrupted to true characters also varies from one file to the next--after all, this decay probably doesn't occur in the same, predictable manner and standard rate. Furthermore, we've got to deal with thousands of corrupted files on hundreds of disks. Lastly, if we did restore all of the files, how can we ensure researchers studying them understand our restoration process and its implications for the authenticity and reliability of the content?

If you can answer my questions, I think Wikipedia needs you to enhance its bit rot entry.

1 comment:

  1. The answer is simple. The document was created using 7 bit ASCII, where the eighth (most significant) bit was either ignored or used for some other purpose. The errors you are seeing is what happens when the document is interpreted using an 8-bit character set (ISO-8859-1).