[e2e] Achieving Scalability in Digital Preservation (yes, this is an e2e topic)

Joe Touch touch at isi.edu
Mon Jul 16 15:23:15 PDT 2012


Hi Micah,

Long time no chat ;-)

On 7/16/2012 10:46 AM, Micah Beck wrote:
> Submitted for your consideration:

Picturing Rod Serling now ...

> Process A sends data over a channel which is prone to bit
> corruption. Process B receives the data, but there is no possibility
> for B to communicate with A to request retransmission. B must work
> with what it has received.

FEC is the only solution.
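
By way of illustration, here's a toy FEC in Python - a 3x repetition
code with a per-bit majority vote. A real archive would use something
stronger (Reed-Solomon, fountain codes), but the principle is the same:
B recovers using only what it received.

    # Toy FEC sketch: 3x repetition code, per-bit majority vote.

    def fec_encode(bits):
        # repeat each bit three times
        return [b for b in bits for _ in range(3)]

    def fec_decode(coded):
        # majority vote over each group of three received bits
        out = []
        for i in range(0, len(coded), 3):
            group = coded[i:i + 3]
            out.append(1 if sum(group) >= 2 else 0)
        return out

    sent = fec_encode([1, 0, 1, 1])
    sent[4] ^= 1                      # one bit flipped in the "channel"
    assert fec_decode(sent) == [1, 0, 1, 1]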

> Examples of such one-way communication scenarios include highly
> asynchronous communication (e.g., Delay Tolerant Networking), or
> multicast situations where there are too many receivers for the
> sender to handle retransmission requests from all of them.
>
> The scenario that I am dealing with has some similarities to these
> examples: it is Digital Preservation. In this scenario A "sends" data
> by writing it to a storage archive and B "receives" it long after
> (100 years is a good target), when no communication with A is
> possible. The "channel" is the storage archive, which is not a simple
> disk but in fact a process that involves a number of different
> storage media deployed over time and mechanisms for "forwarding"
> (migrating) between them.
>
> One end-to-end approach is to use forward error correction, but that
> can be inefficient, and in any case it will always have limits to
> the error rate it can overcome. Let us assume that the receiver will
> in some cases still have to deal with unrecoverable information
> loss.

FEC does have limits. FWIW, FEC includes the use of digital media 
itself, since you can do "level restoration" repeatedly (as long as the 
'errors' stay under half a bit of analog value).
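
To make the level-restoration point concrete, here's a toy sketch (the
drift bound is an assumption for illustration): as long as the analog
error accumulated per copy stays under half the spacing between levels,
re-quantizing snaps every value back to its original bit, generation
after generation.

    import random

    def restore(levels):
        # snap each analog value back to the nearest digital level (0 or 1)
        return [1.0 if v >= 0.5 else 0.0 for v in levels]

    def copy_with_drift(levels, max_drift=0.3):
        # each copy adds bounded analog noise (< half the level spacing)
        return [v + random.uniform(-max_drift, max_drift) for v in levels]

    data = [0.0, 1.0, 1.0, 0.0, 1.0]
    gen = data
    for _ in range(1000):             # 1000 generations of copying
        gen = restore(copy_with_drift(gen))
    assert gen == data                # no accumulated error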

That's how all archives that have survived over the millennia persist - 
there is redundancy, either in the "digitization" (encoded as letters, 
where loss of a small fraction of a letter can be recovered) or in 
multi-level syntax and semantics (e.g., fixing missing letters of a word 
or missing words of a sentence).

> Another solution is to use hop-by-hop error correction in the
> "channel" (archive), and that is in fact the approach taken by
> conventional Digital Preservation systems.

That corrects errors within the hops, but not across multiple hops. 
E.g., you can fix the errors within a copy, but when you copy from one 
medium to another you can easily introduce an erroneous encoding that 
destroys the info.

Some people who do backups see this - the disk reports errors, 
intra-disk encoding fixes errors, but a failure of the overall RAID 
system can render the entire copy invalid.
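
A small sketch of why the per-hop checks don't compose into an
end-to-end check (the buggy migration step and the helper names here
are hypothetical): each medium's checksum is computed after the data
lands on it, so a corruption introduced *during* migration is
faithfully protected from then on. Only a digest carried from the
original writer A to the ultimate reader B catches it.

    import hashlib

    def digest(data):
        return hashlib.sha256(data).hexdigest()

    def store(data):
        # hop-by-hop: checksum computed when the data lands on this medium
        return {"data": data, "local_sum": digest(data)}

    def verify_hop(medium):
        return digest(medium["data"]) == medium["local_sum"]

    def migrate_badly(medium):
        # a flawed format conversion silently alters a byte in transit
        corrupted = b"X" + medium["data"][1:]
        return store(corrupted)

    original = b"climate record, 1988-07-16, ..."
    e2e_sum = digest(original)        # end-to-end check, held by A and B

    hop1 = store(original)
    hop2 = migrate_badly(hop1)

    print(verify_hop(hop1), verify_hop(hop2))   # True True - each hop looks fine
    print(digest(hop2["data"]) == e2e_sum)      # False - end-to-end check fails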

> Constant checksum
> calculation and application of error correction algorithms are used as
> "anti-entropy" measures. The issue with this is scalability: we have
> huge amounts of data to store over long periods of time. Furthermore,
> the cost and complexity of the solution is a major issue, since we
> need to maintain even data whose value we are unsure of through
> unpredictable periods of austerity or of hostility to the particular
> content being preserved. Think for example about NASA's need to
> archive all of the data coming from Earth-observing satellites
> essentially forever in order to be able to study climate change over
> time. Now consider how one would fund such preservation at some
> future time when rapacious oil companies control the government's
> pursestrings - use your imagination!
>
> One interpretation of end-to-end tells us that in order to improve
> the scalability of our solution, we should do less in the channel,
> let corruption go uncorrected, and move the work of overcoming faults
> closer to the endpoint.

Scalability depends on your metric - are you concerned with archive 
size, ongoing restoration maintenance (repeated checking and correcting 
detected errors), or something else?
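
For the second of those metrics, here is roughly what the quoted
"anti-entropy" maintenance amounts to, as a sketch (the repair logic is
illustrative, not any particular system's): a scrub pass re-reads every
object, recomputes its checksum, and repairs from a surviving replica.
The ongoing cost is on the order of (bytes stored)/(scrub interval) of
read bandwidth, forever - which is exactly where the scaling question
bites.

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).hexdigest()

    def scrub(archive, replicas):
        # archive maps name -> (stored bytes, checksum recorded at write time)
        repaired = 0
        for name, (data, expected) in archive.items():
            if checksum(data) != expected:
                # repair from a replica whose checksum still matches
                good = next(d for d in replicas[name] if checksum(d) == expected)
                archive[name] = (good, expected)
                repaired += 1
        return repaired

    blob = b"sensor data"
    archive = {"obj1": (b"sensor dAta", checksum(blob))}   # bit-rotted copy
    replicas = {"obj1": [blob]}
    print(scrub(archive, replicas))   # 1 object repaired in this pass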

> In the case of video streaming without
> retransmission, this means using the structure of the video stream:
> detect corrupt frames and apply interpolation to repair (but not
> fully correct) the damage. The adequacy of this approach is
> application-dependent, and it definitely has its limits, but it may
> be necessary in order to achieve scale.

That works only because of the nature of video - that there's very 
little information in a single bit of a single frame if there are 
adjacent frames that are valid. It notably does NOT work for digital 
books - you can't figure out page 83 of Paradise Lost by looking at 
pages 82 and 84.
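
A toy version of that concealment, just to show how little it relies
on (1-D "frames" standing in for images):

    def conceal(frames, lost_index):
        # approximate a lost frame by averaging its two neighbours
        prev, nxt = frames[lost_index - 1], frames[lost_index + 1]
        return [(a + b) // 2 for a, b in zip(prev, nxt)]

    frames = [[10, 10, 10], [12, 12, 12], [14, 14, 14]]
    frames[1] = None                  # frame corrupted in the channel
    frames[1] = conceal(frames, 1)
    print(frames[1])                  # [12, 12, 12] - plausible, not guaranteed exact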

> Applying this last approach to Digital Preservation tells us that if
> we need to preserve data at scale we should let the bits in
> the archive rot,

The only reason to let things rot is to reduce the overhead of repeated 
restoration. However, if you let disorder creep in and don't correct it 
as it accumulates, you can easily end up with an irrecoverable error.
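
A quick simulation of that point (the failure rates are assumed purely
for illustration): with periodic scrubbing back to full redundancy, a
loss requires every copy to die within a single interval; without it,
the copies simply wear away until nothing is left to recover from.

    import random

    def survives(years=100, p_fail=0.05, replicas=3, scrub=True, trials=10000):
        # fraction of trials in which at least one copy is left after `years`
        ok = 0
        for _ in range(trials):
            alive = replicas
            for _ in range(years):
                alive -= sum(random.random() < p_fail for _ in range(alive))
                if alive == 0:
                    break
                if scrub:
                    alive = replicas  # repair back to full redundancy
            ok += alive > 0
        return ok / trials

    print(survives(scrub=True))       # ~0.99
    print(survives(scrub=False))      # ~0.02 - almost always irrecoverable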

> perhaps focusing on media and mechanisms with
> "good" failure modes rather than applying complex mechanisms to
> overcoming "bad" failure modes. Then we would need to focus on
> end-to-end protocols (between the writer A and the ultimate reader B)
> that would operate at a higher layer and be resilient to such bit rot
> by using the structure of the application and the data.

That's just multilevel encoding, and still amounts to FEC. How much 
structure you look at, at what timescale, using what semantics, 
determines the higher level structure and how efficient your FEC is. 
However, once compressed, such structure is irrelevant anyway since 
you'll lose it at the bit level on the recorded media.
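
Compression makes that concrete. Here's a small sketch (Paradise Lost
standing in for any text): flip one bit of a zlib stream on the medium
and you don't lose one letter - the decompressor typically can't give
you anything it can vouch for.

    import zlib

    text = (b"Of Man's first disobedience, and the fruit "
            b"of that forbidden tree... ") * 50

    packed = bytearray(zlib.compress(text))
    packed[len(packed) // 2] ^= 0x01  # one bit of rot on the medium

    try:
        zlib.decompress(bytes(packed))
    except zlib.error as e:
        print("unrecoverable:", e)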

> Not to put too fine a point on it, this analysis has so far been
> vigorously rejected by the academic Digital Preservation community.
> The idea of allowing bit rot and then working to overcome it (an
> approach I call Loss Tolerant Digital Preservation) is anathema to
> them.

Then why do they do it all the time? They don't replace books every day; 
they let them decay *until* they're near the time when they are 
irrecoverable, then they transfer them to new media. This is no 
different - it's just a trade-off between maintenance and FEC overhead.
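
The trade-off can be put as back-of-envelope arithmetic (the 5%-per-year
loss rate and the 1% risk budget are assumed numbers): more copies - or
equivalently more FEC overhead - buys a longer interval between
migrations, and vice versa.

    def p_all_lost(k, p, t):
        # probability that all k independent copies are gone after t years,
        # given a per-copy, per-year loss probability p
        return (1 - (1 - p) ** t) ** k

    for k in (2, 3, 5):
        # longest interval (years) keeping the risk of total loss under 1%
        t = max(t for t in range(1, 200) if p_all_lost(k, 0.05, t) < 0.01)
        print(k, "copies -> migrate about every", t, "years")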

> I post the idea on the e2e list to find out if, to what's left of
> the end-to-end community, it seems like a valid application of
> end-to-end analysis to the problem of communication over time. If my
> argument is flawed, I thought perhaps someone who understands
> end-to-end could explain why. The feedback I have so far received
> from the Digital Preservation community has not been very useful.

I think this is a good example of E2E (for error recovery) - and why HBH 
is still useful, but cannot replace E2E, as noted above...

Joe

>
> Micah Beck
> University of Tennessee EECS
>


