[e2e] Achieving Scalability in Digital Preservation (yes, this is an e2e topic)

Micah Beck mbeck at eecs.utk.edu
Tue Jul 17 10:53:09 PDT 2012


On Jul 16, 2012, at 6:23 PM, Joe Touch wrote:

>> In the case of video streaming without
>> retransmission, this means using the structure of the video stream:
>> detect corrupt frames and apply interpolation to repair (but not
>> fully correct) the damage. The adequacy of this approach is
>> application-dependent, and it definitely has its limits, but it may
>> be necessary in order to achieve scale.
> 
> That works only because of the nature of video - that there's very little information in a single bit of a single frame if there are adjacent frames that are valid. It notably does NOT work for digital books - you can't figure out page 83 of Paradise Lost by looking at pages 82 and 84.

You are correct that nearly undetectable recovery in the face of reasonable levels of actual information loss is fairly rare in application protocols and file formats. However, that may partly be because it is not commonly required. Video is a good fit for this idea because 1) there is a high degree of redundancy between adjacent frames, even after compression is applied, and 2) some important streaming applications do not allow for retransmission. That doesn't mean that the concept is not applicable elsewhere.
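
To make the video case concrete, here is a toy sketch (in Python, purely illustrative; the function and data names are mine, and a real codec would work per-block in the compressed domain rather than on whole decoded frames): when a frame fails its integrity check and retransmission is not an option, the player estimates it from its valid neighbors instead of stalling.

    # Illustrative only: conceal a corrupt video frame by interpolating
    # between its valid neighbors instead of requesting retransmission.
    # Frames are modeled here as flat lists of pixel values.

    def conceal_corrupt_frame(prev_frame, next_frame):
        """Estimate a lost frame as the per-pixel average of its neighbors."""
        return [(a + b) // 2 for a, b in zip(prev_frame, next_frame)]

    frames = [[10, 20, 30], None, [14, 24, 34]]   # None marks a frame that failed its check
    for i, f in enumerate(frames):
        if f is None and 0 < i < len(frames) - 1:
            frames[i] = conceal_corrupt_frame(frames[i - 1], frames[i + 1])

    print(frames)   # the damaged frame is repaired approximately, not exactly

The repair is not faithful to the original bits, which is exactly the point: the application's structure makes "good enough" recovery possible without any retransmission.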

One can see the importance of recovery at the application level when considering the converse: applications that fail when confronted with even trivial levels of information loss. Take, for example, large data file formats that are extremely vulnerable to any corruption in headers or other embedded metadata. Are there cases in which a PDF interpreter will fail due to the corruption of a small number of bits? If I had a PDF of Paradise Lost with bit corruption on page 83, would I prefer that someone had built a PDF reader that was highly robust and could at least properly display the remaining pages in the face of such corruption? Or had designed the file format so that complete loss of whole pages was less likely, spreading likely patterns of information loss out over multiple pages? Could a variety of applications be made much more resilient to bit errors if attention were paid to these issues?
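
The "spread the loss over multiple pages" idea is just interleaving, applied at the file-format level. A rough sketch (Python, names mine, not how any real PDF is laid out): if the bytes of several pages are round-robined before being written to the medium, a localized burst of corruption damages a little of every page instead of destroying one page completely.

    # Illustrative only: interleave the bytes of several equal-length pages
    # so a burst error on the medium is spread thinly across all of them.

    def interleave(pages):
        """Round-robin the bytes of equal-length pages into one stream."""
        return bytes(b for group in zip(*pages) for b in group)

    def deinterleave(stream, n_pages):
        return [stream[i::n_pages] for i in range(n_pages)]

    pages = [b"page82..", b"page83..", b"page84.."]
    stored = bytearray(interleave(pages))
    stored[6:12] = b"\x00" * 6            # a 6-byte burst error on the medium
    recovered = deinterleave(bytes(stored), len(pages))
    print(recovered)                      # each page has ~2 damaged bytes; none is wholly lost

A reader that tolerates a few damaged bytes per page can then still present the whole book, which is a very different outcome from losing page 83 outright.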

>> Applying this last approach to Digital Preservation, this tells us
>> that if we need to preserve data at scale we should let the bits in
>> the archive rot,
> 
> The only reason to let things rot is to reduce the overhead of repeated restoration. However, if you let disorder creep in and don't correct it as it accumulates, you can easily end up with an irrecoverable error.

Overhead is not the only reason to avoid repeated restoration. The process of "scrubbing" a disk, reading each file and checking for checksum errors, is itself a source of danger to the data. With medium capacity increasing faster than access bandwidth, scrubbing may become a constant activity. Moving a read/write head over the surface of a disk increases wear on the drive motor and arm actuators. Then of course there is the problem that "correcting" errors on the disk runs the risk of introducing new errors due to the processing required, just as routers can introduce errors while forwarding (I seem to remember that this was a fundamental argument for E2E in IP networking). If information loss is to be avoided, it is probably better to increase redundancy by writing a new block somewhere else rather than trying to "correct" an error by updating a block in place. Any write to a sector creates a danger to neighboring sectors due to possible head alignment errors.

Sometimes doing less accomplishes more.
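
Here is the shape of what I mean, as a toy sketch (Python; the store, manifest and function names are mine and stand in for blocks on separate media, not for any real archival system): when a replica fails its checksum, the repair writes a fresh copy from a good replica to a new location, and never touches the blocks that still hold good data.

    # Illustrative only: grow redundancy instead of correcting in place.

    import hashlib

    def digest(data):
        return hashlib.sha256(data).hexdigest()

    store = {}      # location -> bytes (stand-in for blocks on separate media)
    manifest = {}   # location -> expected digest

    def put(location, data):
        store[location] = data
        manifest[location] = digest(data)

    def scrub_and_replicate(next_location):
        """Check all replicas; if any is bad, add a new replica rather than overwriting."""
        good = [loc for loc, data in store.items() if digest(data) == manifest[loc]]
        bad = [loc for loc in store if loc not in good]
        if bad and good:
            put(next_location, store[good[0]])   # new copy elsewhere; bad blocks left alone
        return good, bad

    put("disk-a/blk0", b"Paradise Lost, page 83")
    put("disk-b/blk0", b"Paradise Lost, page 83")
    store["disk-a/blk0"] = b"Paradise Lost, p@ge 83"   # simulate bit rot
    print(scrub_and_replicate("disk-c/blk0"))

The design choice is that repair only ever adds copies; the damaged block is simply retired, so the repair path cannot endanger the data it is meant to protect.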

>> perhaps focusing on media and mechanisms with
>> "good" failure modes rather than applying complex mechanisms to
>> overcoming "bad" failure modes. Then we would need to focus on
>> end-to-end protocols (between the writer A and the ultimate reader B)
>> that would operate at a higher layer and be resilient to such bit rot
>> by using the structure of the application and the data.
> 
> That's just multilevel encoding, and still amounts to FEC. How much structure you look at, at what timescale, using what semantics, determines the higher level structure and how efficient your FEC is. However, once compressed, such structure is irrelevant anyway since you'll lose it at the bit level on the recorded media.

OK, I agree that conceptually what I'm talking about is FEC at a higher layer. However, it's not generally seen this way in the Digital Preservation community. If I stored a Rembrandt poorly and then reinterpreted the damaged parts by drawing on the canvas with a crayon, that would be considered a cultural crime, not an application of FEC at a high semantic level.

Some kinds of high-level redundancy may not be lost to bit-level compression. And there's the question of whether compression is a good idea at all in the context of Digital Preservation.
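
For what "FEC at a higher layer" can look like in its simplest form, consider a toy example (Python, names mine): one XOR parity block over a set of data chunks lets the application itself rebuild any single chunk it detects as lost, without the storage or network layers knowing anything about the file's structure.

    # Illustrative only: the simplest possible application-level erasure code.

    def parity(blocks):
        p = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, b in enumerate(blk):
                p[i] ^= b
        return bytes(p)

    data = [b"chunk-1 ", b"chunk-2 ", b"chunk-3 "]
    p = parity(data)

    # Suppose the reader detects that chunk 2 is unrecoverable:
    survivors = [data[0], data[2], p]
    rebuilt = parity(survivors)
    print(rebuilt)   # b'chunk-2 '

Real systems would use stronger codes than a single parity block, but the point is that the redundancy lives in the application's own data layout, end to end, rather than in any particular medium.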

>> Not to put too fine a point on it, this analysis has so far been
>> vigorously rejected by the academic Digital Preservation community.
>> The idea of allowing bit rot and then working to overcome it (an
>> approach I call Loss Tolerant Digital Preservation) is anathema to
>> them.
> 
> Then why do they do it all the time? They don't replace books every day; they let them decay *until* they're near the time when they are irrecoverable, then they transfer them to new media. This is no different - it's just a trade-off between maintenance and FEC overhead.

The question is not why they do it, but why they object to it in the digital case. I am probably the wrong person to ask since I am a newcomer to that community, but I can tell you a bit of what I've heard, been told and surmised.

An important difference between digital and non-digital objects in the context of preservation is that digital objects are interpreted by a program, and that program has invariants (or assumptions about its inputs). The interpreters used for day-to-day work with digital objects generally make strong assumptions regarding non-corruption of their inputs. Modern networks and file systems are considered reliable enough to make this acceptable. Sure, an ebook reader might be considered higher quality if it can tolerate a bit of corruption in its input file, but hardly anyone eschews the use of PDF because readers are so brittle. Admitting the possibility of corruption means weakening the assumptions made in that software, which makes the software harder to write and means that commercial off-the-shelf readers may not suffice. That's difficult enough and expensive enough to possibly put it out of reach of the Digital Preservation community.

Many in the Digital Preservation community come from a non-technical background (libraries with books in them), and there is in fact a great lack of understanding of what's actually going on. There is, for example, a common confusion between the fact that one cannot necessarily control the lifetime of data on the Internet at the application layer and the idea that it therefore "lives forever." Bits are thought to be much more resilient than non-digital information, when their implementation is in fact typically quite ephemeral, and only an active process of continual checking, correcting and copying results in resilience. I overheard someone at a Library of Congress meeting referring to maps that had been "digitized and therefore preserved."

Making reference to Dr. Reed's point about the ultimate goal of preservation: I believe it is to preserve knowledge. Libraries have "reduced" (in the formal sense) the problem to that of preserving objects which bear knowledge. However, it's not clear that the objects in which society has chosen to encode its knowledge (e.g. digital objects) are a good choice for purposes of long-term preservation. This has been a problem for libraries in the past: paperback books fall apart; acetate degrades. Some important knowledge has been lost to the choice of technology in the past (many films in archives will decompose before they can be copied). Facing this fact may result in some painful conclusions, such as that more work has to be done to make digital objects resilient to bit rot, and that we may have to choose between preserving at scale (in terms of byte count) and preserving with minimal bit errors.





More information about the end2end-interest mailing list