[e2e] Applications with UDP checksum disabled

Lynne Jolitz muse at alum.calberkeley.org
Mon Mar 11 14:26:35 PST 2002


As I recall, during the "sunbox" project (nee SPARC Cluster 1, a scalable NFS server cluster), early units were used in the Sun MIS in PAL1.
To show off scalability (e.g. smallest pcpu per IOPS), they turned off the UDP checksums.
Unfortunately, the RAMs on the ethernet SBUS cards were marginal, so AFTER the packet was received with a good CRC, the WRONG data was recorded in the RAM, OCCASIONALLY corrupting the files.
They didn't pick up the poor quality RAM problem immediately (they'd been shipping a while) because the protocols were retransmitting during quality tests.
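
A minimal sketch of why an end-to-end checksum catches this kind of fault, assuming a simplified 16-bit one's-complement sum in the style of RFC 1071. The real UDP checksum also covers a pseudo-header, which is omitted here. The payload string and the helper names internet_checksum and flip_bit are illustrative assumptions, not taken from any NFS implementation:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a bad RAM cell."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical payload; the sender computes the checksum before transmission.
payload = b"NFS write: block 42 of /export/db/datafile"
sent_checksum = internet_checksum(payload)

# The link-level CRC has already been verified at this point; the damage
# happens afterwards, in the card's buffer RAM.
corrupted = flip_bit(payload, 13)

# The receiver recomputes the checksum over what it actually holds in memory.
detected = internet_checksum(corrupted) != sent_checksum
print("corruption detected:", detected)
```

With the checksum disabled, the receiver has nothing to compare against, so the corrupted buffer is written to disk as if it were good data.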

I'd like to think that turning off UDP checksums was all in the past. But at a dot-com where I'd worked, they did the exact same thing with a NetApp used as the storage for an Oracle database, resulting in corruption of the database. The VP of eng didn't believe in UDP checksums - he thought NFS did higher level checking, which of course it didn't.
In sum, a lot of Internet datacenters and corporate MIS departments believe UDP checksumming is optional, and that the performance boost gained by turning it off is worth the risk of "unlikely" database integrity problems.

But this is all kind of silly. Just as faulty hardware (RAM) impeded the function, hardware can also be integrated into the server to take the protocol overhead off the processor, so there is no need for risky performance/integrity trade-offs. Why dumb down protocols when we should be improving them with hardware? Until people want a new hardware architecture with no processor performance hit (e.g. wire-speed / ballistic; yes, it's possible), awful things like that datacenter database corruption will continue to occur.

In any case, the disabled UDP checksums made for good case examples in my book on datacenter operations and engineering. :-)

Regards,
Lynne Jolitz.

> 
> In message <200203092308.g29N8OTr025496 at calcite.rhyolite.com>, Vernon Schryver writes:
> 
> >I do not think that fairly represents the history of NFS, although it
> >is repeated by many people who I doubt were there.  I think the people
> >responsible knew perfectly well about bus errors and other hazards
> >and did not "discover" anything of the sort.
> 
> I was at the original NFS announcement in Boston by Rusty (last name forgotten)
> of SUN -- the assertion was that Ethernet and hardware were so reliable
> that it was worth turning off the checksum.
> 
> I can attest that the BBN subsidiary selling multiprocessor computers (name
> now forgotten) did indeed discover around 1989 that they needed the UDP
> checksum to protect their NFS filesystems from corruption due to bus
> voltage problems in their early computers.
> 
> I also remember that, around 1988 or so, it became common knowledge amongst
> maintainers of large software systems that compiles done over NFS without
> checksums often caused bad binaries.  (I personally hit the problem, compiling,
> as I recall, the MH email system).
> 
> Craig
> 


