[e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything else?

Mon Jun 11 18:51:50 PDT 2001

> From: Jonathan Stone <jonathan at DSG.Stanford.EDU>

> ...
> >I think I side with Vernon Schryver (as I understand his 
> >point).  Checksumming is going to be done in outboard boxes even if it is 
> >cheap, 
>
> Yes, it will be done there. Yes, it is a cheap hardware speedup.

That's 80% of my point. 

> But outboard checksumming is a source of additional uncaught errors:
> errors which would be caught by a software implementation of
> a checksum but not by an outboard implementation of the same checksum,
> because the actual cause of the error was between the hardware checksum
> and the in-memory buffer.
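
The failure mode described in the quoted text can be sketched in a few lines. This is only an illustration under stated assumptions: the checksum is the RFC 1071 style 16-bit one's-complement sum over the payload alone (no TCP pseudo-header), and the "DMA fault" is a hypothetical single-bit flip introduced after the adapter has already verified the data. The names `flip_bit`, `on_adapter`, and `in_host_memory` are invented for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical fault)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


payload = b"end-to-end arguments in system design"
sent_checksum = internet_checksum(payload)

# Outboard path: the adapter verifies the copy it received from the wire...
on_adapter = payload
outboard_ok = internet_checksum(on_adapter) == sent_checksum

# ...and only afterwards is the data damaged on its way to the host
# (assumed fault, for illustration only).
in_host_memory = flip_bit(on_adapter, 13)

# Software path: the host verifies the copy it actually hands upward.
software_ok = internet_checksum(in_host_memory) == sent_checksum

print("outboard check passed:", outboard_ok)
print("software check passed:", software_ok)
```

The outboard check passes because it ran before the damage occurred; the software check fails because it runs over the damaged copy. The sketch says nothing about how often this happens, which is the question the rest of this message argues about.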

The remainder of my point was that talk about "the in-memory buffer"
is meaningless except on trivial computers.  I'll admit that there
are many trivial computers connected to the Internet today, but I
suspect many of the hosts the trivial ones are talking to are not.
In saying the phrase is meaningless, I'm referring to 3 and 4 level
caches, inter-CPU packet buses or old style snoopy (or snooping) buses,
and the other stuff that make it impossible to pin a single bit to a
single capacitor, flip-flop, ferrite core, or time slot in a delay line.

> You and Vern seem to be assuming that is a rare case. The data I have
> indicates that, amongst the end-hosts we could directly monitor and
> caught sending errored packets, it is a common case.

I didn't say anything about the rarity of the bugs I've fought and
heard about.  They tend to be rare in the sense that those that happen
on 10% of packets are quickly noticed and fixed.  Those that happen
rarely are often not diagnosed as anything more than normal network
bit rot for months, years, and even longer.  The first such design
bug I encountered was finally found (not by me!) in about 1974 in the
high speed paging DMA machinery of the SDS-940 or XDS-940.  By then
the machine was elderly but still popular, having been the product of
Berkeley's Project Genie 10 years before.  My point is that such bugs
have always been around, and that they corrupt occasional bits for
very long times before they're nailed.

I'm trying to say that when you have lots of bits from zillions of computers,
you will have lots of events that could be caught by host computed
checksums, except that many hosts are simply not going to compute them
and except that as even PC's become multi-processors, the notion of
"host computed" becomes bogus.

In even the trivial 2-16 processor 80*86 based systems (trivial compared
to real MP systems), users demand that different processors compute TCP
checksums than run the application.  These demands are as non-negotiable
as they are generally technically wrong (because of the costs of multiple
I and D caches and locking/mutexes/P&V's/etc.).  They also wreck the notion
of "the in-memory buffer," at least for anyone who has fought hardware and
software bugs in locking and caches.

> For the end-hosts we could directly observe, the conditional
> probability of an error damaging a packet between the end-host's
> computation of IP checksum and CRC computation, given that an error
> happens at all, was significant.

How significant did you find it to be?  My intuition says it should be
less than 0.01% and more than 0.0001%.

> ...
> That's a minimax argument: assume there's some error process, with some
> distribution of actual errors; but the error process is not smart
> enough to damage bits and recompute a valid checksum.  Under those
> conditions, how do we minimize the worst-case or expected rate of
> undetected errors?

Why exclude the current too-smart-by-half boxes in the middle that
do just that?

My obsession is with $%$#@! interception proxies, but there are other
cases.  I wouldn't be surprised if people are now finally shipping what
people at PEI and SGI in the 1980's called TCP agglutination and
segmentation.  That's where a box at the end sends and receives big TCP
segments but boxes in the middle do TCP segmentation and assembly.  Such
middle boxes will have bugs because everything else does and so they will
"damage bits and recompute a valid checksum" because they have no choice.
There are strong reasons to consider such boxes at speed, although in all
of the cases I've been involved in (including those where I may have
started the obsession at SGI/PEI) I've finally felt they were bad ideas.
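
A toy model of that last point may help. It is a sketch under loud assumptions, not a description of any real product: `buggy_middlebox` is a hypothetical box that verifies an incoming segment, splits it into smaller ones, and recomputes a checksum for each piece; the "bug" is an invented one-bit corruption during its internal copy; and the checksum covers the payload only, without the TCP pseudo-header or header fields.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def make_segment(payload: bytes) -> tuple[bytes, int]:
    """Pair a payload with its checksum (simplified stand-in for a segment)."""
    return payload, internet_checksum(payload)


def buggy_middlebox(segment: tuple[bytes, int], mss: int) -> list[tuple[bytes, int]]:
    """Hypothetical resegmenting box with an invented copy bug."""
    payload, checksum = segment
    if internet_checksum(payload) != checksum:
        raise ValueError("bad checksum on incoming segment")
    buf = bytearray(payload)
    buf[0] ^= 0x01                           # the bug: damage during the copy
    pieces = [bytes(buf[i:i + mss]) for i in range(0, len(buf), mss)]
    # The box has no choice but to checksum what it now holds.
    return [make_segment(piece) for piece in pieces]


original = b"big TCP segment from the end box"
small_segments = buggy_middlebox(make_segment(original), mss=8)

# The receiver verifies every small segment and reassembles the stream.
all_valid = all(internet_checksum(p) == c for p, c in small_segments)
reassembled = b"".join(p for p, _ in small_segments)

print("every checksum valid:", all_valid)
print("data intact:", reassembled == original)
```

Every segment the receiver sees carries a valid checksum, yet the reassembled data differs from what the sender sent, because the only checksum that covered the original bytes was consumed and replaced in the middle.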

Vernon Schryver    vjs at rhyolite.com