[e2e] Achieving Scalability in Digital Preservation (yes, this is an e2e topic)

Mon Jul 16 10:46:16 PDT 2012

Submitted for your consideration:

Process A sends data over a channel which is prone to bit corruption. Process B receives the data, but there is no possibility for B to communicate with A to request retransmission. B must work with what it has received.

Examples of such one-way communication scenarios include highly asynchronous communication (e.g., Delay Tolerant Networking), or multicast situations where there are too many receivers for the sender to handle retransmission requests from all of them.

The scenario that I am dealing with has some similarities to these examples: it is Digital Preservation. In this scenario A "sends" data by writing it to a storage archive and B "receives" it long after (100 years is a good target), when no communication with A is possible. The "channel" is the storage archive, which is not a simple disk but in fact a process that involves a number of different storage media deployed over time and mechanisms for "forwarding" (migrating) between them.

One end-to-end approach is to use forward error correction, but that can be inefficient, and in any case it will always have limits to the error rate it can overcome. Let us assume that the receiver will in some cases still have to deal with unrecoverable information loss.
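To make the inefficiency and the error-rate limit concrete, here is a minimal sketch using a threefold repetition code with bitwise majority voting. This is only an illustrative assumption; practical forward error correction uses far more efficient codes, and the function names (`encode`, `decode`, `flip_bit`) are hypothetical.

```python
def encode(data: bytes, copies: int = 3) -> list[bytes]:
    """Repetition code: store the same data `copies` times (3x overhead by default)."""
    return [bytes(data) for _ in range(copies)]


def decode(copies: list[bytes]) -> bytes:
    """Recover the data by bitwise majority vote across an odd number of copies."""
    n = len(copies)
    out = bytearray(len(copies[0]))
    for i in range(len(out)):
        byte = 0
        for bit in range(8):
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if ones * 2 > n:  # a majority of copies have this bit set
                byte |= 1 << bit
        out[i] = byte
    return bytes(out)


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate channel corruption by flipping a single bit."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


original = b"archive"
stored = encode(original)

# One damaged copy is outvoted by the two intact ones.
stored[0] = flip_bit(stored[0], 2, 5)
recovered_once = decode(stored)

# The same bit damaged in a second copy exceeds what the code can correct.
stored[1] = flip_bit(stored[1], 2, 5)
recovered_twice = decode(stored)
```

In this sketch the data is stored three times over, yet two coinciding bit flips are enough to make the decoder return wrong data without any indication of failure.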

Another solution is to use hop-by-hop error correction in the "channel" (archive), and that is in fact the approach taken by conventional Digital Preservation systems. Constant checksum calculation and application of error correction algorithms are used as "anti-entropy" measures. The issue with this is scalability: we have huge amounts of data to store over long periods of time. Furthermore, the cost and complexity of the solution are a major issue, since we need to maintain even data whose value we are unsure of through unpredictable periods of austerity or of hostility to the particular content being preserved. Think for example about NASA's need to archive all of the data coming from Earth-observing satellites essentially forever in order to be able to study climate change over time. Now consider how one would fund such preservation at some future time when rapacious oil companies control the government's purse strings - use your imagination!
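The "anti-entropy" pass described above can be sketched roughly as follows. This is a simplified assumption of how such a scrub might look (replicas held in memory, a SHA-256 checksum assumed to be stored reliably elsewhere), not a description of any particular preservation system; `checksum` and `scrub` are hypothetical names.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Fixity value for a stored object (SHA-256 chosen here as an assumption)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: list[bytearray], expected: str) -> int:
    """Verify every replica against the expected checksum and overwrite
    damaged replicas from an intact one. Returns the number repaired."""
    good = next((r for r in replicas if checksum(bytes(r)) == expected), None)
    if good is None:
        raise ValueError("no intact replica left to repair from")
    repaired = 0
    for r in replicas:
        if checksum(bytes(r)) != expected:
            r[:] = good  # repair in place from the intact replica
            repaired += 1
    return repaired


record = b"climate record"
expected = checksum(record)
replicas = [bytearray(record) for _ in range(3)]

replicas[1][0] ^= 0x01  # simulate bit rot in one replica
repaired_count = scrub(replicas, expected)
```

Every pass reads and hashes every byte of every replica, so the work grows with the size of the archive multiplied by the number of passes over its lifetime, which is where the scalability concern comes from.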

One interpretation of end-to-end tells us that in order to improve the scalability of our solution, we should do less in the channel, let corruption go uncorrected, and move the work of overcoming faults closer to the endpoint. In the case of video streaming without retransmission, this means using the structure of the video stream: detect corrupt frames and apply interpolation to repair (but not fully correct) the damage. The adequacy of this approach is application-dependent, and it definitely has its limits, but it may be necessary in order to achieve scale.
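The video example can be sketched as follows, under strong simplifying assumptions: each frame is a short list of 8-bit pixel values, a CRC-32 per frame is available to the receiver for detection, and repair is plain linear interpolation between the nearest intact neighbours. The names `frame_crc` and `conceal` are hypothetical.

```python
import zlib


def frame_crc(frame: list[int]) -> int:
    """CRC-32 over a frame of 8-bit pixel values (each value must be 0..255)."""
    return zlib.crc32(bytes(frame))


def conceal(frames: list[list[int]], crcs: list[int]):
    """Detect corrupt frames by CRC and replace each with an estimate built
    from the nearest intact frames. Returns (repaired_frames, intact_flags)."""
    ok = [frame_crc(f) == c for f, c in zip(frames, crcs)]
    repaired = [list(f) for f in frames]
    for i, intact in enumerate(ok):
        if intact:
            continue
        prev = next((j for j in range(i - 1, -1, -1) if ok[j]), None)
        nxt = next((j for j in range(i + 1, len(frames)) if ok[j]), None)
        if prev is not None and nxt is not None:
            # Average the neighbours: plausible, but not the original data.
            repaired[i] = [(a + b) // 2 for a, b in zip(frames[prev], frames[nxt])]
        elif prev is not None:
            repaired[i] = list(frames[prev])
        elif nxt is not None:
            repaired[i] = list(frames[nxt])
        # If no intact frame exists at all, the damaged frame is left as received.
    return repaired, ok


sent = [[10, 10], [22, 18], [30, 30]]
crcs = [frame_crc(f) for f in sent]

received = [list(f) for f in sent]
received[1][0] = 255  # corruption in the middle frame

repaired, ok = conceal(received, crcs)
```

Here the middle frame is flagged as corrupt and replaced by `[20, 20]`, which is close to but not equal to the `[22, 18]` that was sent: the damage is repaired, not corrected.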

Applying this last approach to Digital Preservation tells us that if we need to preserve data at scale, we should let the bits in the archive rot, perhaps focusing on media and mechanisms with "good" failure modes rather than applying complex mechanisms to overcome "bad" failure modes. Then we would need to focus on end-to-end protocols (between the writer A and the ultimate reader B) that would operate at a higher layer and be resilient to such bit rot by using the structure of the application and the data.

Not to put too fine a point on it, this analysis has so far been vigorously rejected by the academic Digital Preservation community. The idea of allowing bit rot and then working to overcome it (an approach I call Loss Tolerant Digital Preservation) is anathema to them.

I am posting the idea to the e2e list to find out whether it seems, to what's left of the end-to-end community, like a valid application of end-to-end analysis to the problem of communication over time. If my argument is flawed, perhaps someone who understands end-to-end could explain why. The feedback I have so far received from the Digital Preservation community has not been very useful.

Micah Beck
University of Tennessee EECS