[e2e] Achieving Scalability in Digital Preservation (yes, this is an e2e topic)

Micah Beck mbeck at eecs.utk.edu
Wed Jul 18 12:15:23 PDT 2012

Hey Joe,

On Jul 18, 2012, at 11:20 AM, Joe Touch wrote:

> Archive solutions don't necessarily benefit from widescale adoption. If my archive uses one solution for data maintenance, and yours uses another, there's no inherent benefit besides reuse of the solution, unless merging archives is an issue.

The first thing to see is that Digital Preservation requires interoperability between the present and future. The data that is stored today must be readable and interpretable a long time from now.

Readability is more than a matter of survival of the storage medium. As those who work in forensics point out, the mechanical and electronic mechanisms that connect the storage medium to an operational computer system must also be maintained. If you have a file system stored in some old tape format, can you find an operational drive? Does it connect to an I/O bus on any system that can still execute instructions? Do you have a version of the OS with compatible drivers that understands the format in which the files were written out? Is there an application that can interpret the files? If one gets around these issues by assuming that there will always be timely copying of data to new media and formats, then it is necessary to accept a higher risk of losing the data altogether should those assumptions be violated.

Since the writer does not know the identity of the reader, the set of potential readers must be taken into account. These are spread throughout time, and could amount to a large community.

It is also important to take into account that data stored in an archive does not necessarily stay in the same archive over its entire lifespan. One important aspect of Digital Preservation is what is called "federation" of archives, meaning that the data in one archive may be accessed through another archive, or that data may be moved from one archive to another. These scenarios are particularly important when an institution's funding, or its will to maintain the data, is diminished. In such cases, interoperability between archives can be essential.

There's also the issue of vendor lock-in, which is again a matter of interoperability over time. If an archive uses interfaces that are non-standard, then when it comes time to replace its systems it may be locked in to a single vendor. And if that vendor stops supporting the products the archive uses, it may become necessary to migrate the entire collection to some other vendor's interfaces and representations. That can impose a high cost, or even spell the end of a particular archive's operations.

All of these issues become more important as the scale of the data being preserved increases, and as the duration of preservation increases. The costs and limitations of non-interoperability grow with the scale of the data, and the probability of circumstances that place especially stringent limitations on, or outright obliterate, the resources available at any one archive grows with the duration of preservation.

> For many of your other concerns, this may not be the best list, e.g., archive duration, durability, etc. For some of your concerns, even archivists aren't always thinking in those terms - some approach that of Danny Hillis (see http://longnow.org/).
>> Today, we can deploy IP on a cell phone in the middle of the Sahara desert and interoperate with servers attached to the North American backbone. Today my telephone (Android) and my laptop (OS X) run operating systems whose kernel interfaces are descended from the one that Ken Thompson designed, and which still have a certain interoperable core. Those are designs that *have* scaled. Call it what you will, that kind of design success is my goal when designing hardware or software infrastructure.
> IP is designed as a common interoperation layer. OS interfaces similarly define interoperation. What is the interoperation goal here?

Communities of interoperability include:

1) storers of data, since they need to have a choice of where they put things;
2) maintainers of data, since they need to have a choice of tools and resources to use over time;
3) readers of data, since they need to be able to easily extract and use data from a number of different archives.

These parallel the communities of interoperability that are important for wide area networking. To quote Danny Hillis (albeit selectively) from his 1982 paper "Why Computer Science Is No Good": "… memory locations … are just wires turned sideways in time."
