[e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything else?

Mon Jun 11 14:47:41 PDT 2001

In message <200106112048.f5BKmpF07926 at aland.bbn.com>, Craig Partridge writes:
>
>In message <5.1.0.14.2.20010611143202.0462bec0 at mail.reed.com>, "David P. Reed"
>writes:

>I think you've missed the point.  In a prior note, you suggested a line of
>thinking of assume an adversary.  Implicitly, that's an error model.
>
>So what if traffic doesn't match that error model -- that is to say, errors
>are not ones an adversary would pick -- then the checksum chosen is the
>wrong one.

Craig,

To be fair, there are several points kicking around here, and being
made to (or intended for) different audiences.  The tsvwg folks are
under time pressure to (re-)decide on a check function.  The e2e
folks can take a longer, or at least a middle-vision, view.

The way I'd put it to the tsvwg folks choosing a checksum is this:
if you start appealing to catching all but 1 in 2^32 errors (or
more accurately, 1 in 65521^2), then you have fallen into a fallacy.

You have just conflated a purely combinatoric result, about the ratio
of sizes of the domain and range of the error-check function, with a
probabilistic statement about how likely *in practice* you are to
catch errors.  What should give you a serious wake-up call is to hear
that even the constant function -- some constant 32-bit integer -- will
catch the same fraction of all errors.  There are no grounds for
labelling *any* function (in the mathematical sense) as stronger than
another, unless we know something about the distribution of the
errors, or how well some particular function does against some
particular distribution.  CRCs are not any stronger than
checksums, *unless* we happen to know that the distribution of
actual errors tends to favour low Hamming-weight errors.
(The data Craig and I have says that it doesn't.)
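
To make the fallacy concrete, here is a minimal sketch that enumerates
every case for a toy code.  The sizes (8 data bits, 4 check bits) and
the two check functions are assumptions chosen only so that exhaustive
counting is feasible; they are not any real protocol's checksum.

```python
from fractions import Fraction

DATA_BITS = 8    # toy sizes (assumption): small enough to enumerate every case
CHECK_BITS = 4
CHECK_MASK = (1 << CHECK_BITS) - 1

def constant_check(data):
    """The constant function: ignores the data entirely."""
    return 0b1010

def sum_check(data):
    """Sum of the two 4-bit halves of the data, modulo 16."""
    return ((data >> 4) + (data & 0xF)) & CHECK_MASK

def accepts(check, word):
    """True if the received word's check field matches its data field."""
    return check(word >> CHECK_BITS) == (word & CHECK_MASK)

def missed_uniform(check):
    """Fraction missed when every wrong received word is equally likely."""
    valid = [accepts(check, w) for w in range(1 << (DATA_BITS + CHECK_BITS))]
    missed = total = 0
    for data in range(1 << DATA_BITS):
        sent = (data << CHECK_BITS) | check(data)
        for received, ok in enumerate(valid):
            if received != sent:
                total += 1
                missed += ok
    return Fraction(missed, total)

def missed_single_bit(check):
    """Fraction missed when exactly one bit of the sent word is flipped."""
    missed = total = 0
    for data in range(1 << DATA_BITS):
        sent = (data << CHECK_BITS) | check(data)
        for bit in range(DATA_BITS + CHECK_BITS):
            total += 1
            missed += accepts(check, sent ^ (1 << bit))
    return Fraction(missed, total)

if __name__ == "__main__":
    for name, fn in (("constant", constant_check), ("sum", sum_check)):
        print(name, missed_uniform(fn), missed_single_bit(fn))
```

Under the uniform count, both functions miss exactly the same fraction
(255 of every 4095 possible corruptions, about 1 in 16).  Only when the
errors are restricted to single-bit flips do the two separate: the
constant function misses every flip in the data bits, the sum misses
none.  The difference comes entirely from the error distribution.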

The point to Dave Reed is that the combinatoric argument is very
general and applies to any function, whether the constant function, or
a cryptographic hash, or a shared-secret key, provided we define
"error check bits" to properly measure the fraction of bit
combinations which are accepted, versus those which are
rejected.

One further thing I've mentioned in email is that I recently
re-analyzed the captured error datasets which Craig, Vern and I
gathered, and I did find one pattern which could be exploited here.

The pattern is that the errors we found can be broadly characterized
into two classes: either single-bit or short, low-Hamming-weight
errors; or errors where some prefix of the packet is intact, the
packet is then hit by an error, and the damage continues all the way
to the end of the packet.  The ratio of errored bits within that
damaged `tail' of the packet is very close to 0.5.

That suggests an error model where we model packet-level errors as due
to either single-bit errors; memory-readout errors which affect a
single word or cache line; or `stateful' errors in the
hardware/software finite-state engines which move packets between
packet-buffer memory and the hardware which implements some specific
media layer.  (Think of errors due to an under-run in a hardware FIFO,
or a bad bit in a DMA pointer register.)
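
A rough sketch of that error model as a generator of damaged packets
follows.  Everything here is illustrative: the function names, the
`p_tail` mix between the two classes, and the choice to flip each tail
bit independently with probability 0.5 are assumptions, not values
taken from the datasets.  The word/cache-line class is left out for
brevity.

```python
import random

def single_bit_error(packet, rng):
    """Flip exactly one randomly chosen bit of the packet."""
    damaged = bytearray(packet)
    bit = rng.randrange(len(packet) * 8)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)

def tail_error(packet, rng, start=None):
    """Leave a prefix intact, then trash everything up to the end.

    XOR with a uniformly random byte flips each bit with probability
    0.5, giving an errored-bit ratio near 0.5 in the damaged tail.
    (A very short tail can, by chance, come out unchanged.)
    """
    if start is None:
        start = rng.randrange(len(packet))
    damaged = bytearray(packet)
    for i in range(start, len(packet)):
        damaged[i] ^= rng.getrandbits(8)
    return bytes(damaged)

def corrupt(packet, rng, p_tail=0.5):
    """Pick an error class; p_tail is a made-up mix, not a measured one."""
    if rng.random() < p_tail:
        return tail_error(packet, rng)
    return single_bit_error(packet, rng)

def flipped_bits(a, b):
    """Number of bit positions in which two equal-length packets differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

if __name__ == "__main__":
    rng = random.Random(2001)
    original = bytes(1500)
    damaged = tail_error(original, rng, start=500)
    print(flipped_bits(original, damaged) / (8 * 1000))
```

With a long tail, far more bits are damaged than any 16- or 32-bit
check field holds, so no check function can guarantee detection of
such errors; it can only catch them with some probability.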

There are two things to take away from that.  The first is that the
errors we've actually observed, in the only study of in-the-wild
packet-level errors I know of, are so heavy that, on
average, they affect more than R bits, for any R that's a plausible
error-check size.  That says we're only going to catch errors stochastically.
The second is that, since the errors seem to be stateful, putting the
error-check information at the end of the packet rather than in a
fixed header field doesn't hurt, and (for the reasons we analyzed to
death in our 98 ToN paper) will actually help, for the kinds of
nonuniform data we find in filesystems.

If there's anything I can recommend to the tsvwg, it's to pick even a
32-bit extension of the TCP checksum, rather than Adler32; and to
think seriously about moving the error-check bits to the end.  Not to
help hardware, but to make whatever error-check you use more resilient
against errors in packet-processing engines which (once an error does
hit) trash the remainder of the packet.
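
For illustration only, here is one possible reading of that
recommendation.  "32-bit extension of the TCP checksum" is interpreted
here as a ones'-complement sum over big-endian 32-bit words with
end-around carry; that interpretation, the zero padding, and the
function names are all assumptions, not a specified format.  The check
is carried as a 4-byte trailer rather than in a header field.

```python
import struct

def ones_complement_sum32(data):
    """Ones'-complement sum of big-endian 32-bit words (zero-padded)."""
    if len(data) % 4:
        data += b"\x00" * (4 - len(data) % 4)
    total = 0
    for (word,) in struct.iter_unpack("!I", data):
        total += word
        total = (total & 0xFFFFFFFF) + (total >> 32)  # end-around carry
    return total

def add_trailer(payload):
    """Append the complemented sum as a 4-byte trailer."""
    check = ~ones_complement_sum32(payload) & 0xFFFFFFFF
    return payload + struct.pack("!I", check)

def verify_trailer(packet):
    """Recompute the sum over the payload and compare with the trailer."""
    payload, trailer = packet[:-4], packet[-4:]
    (check,) = struct.unpack("!I", trailer)
    return check == (~ones_complement_sum32(payload) & 0xFFFFFFFF)

if __name__ == "__main__":
    packet = add_trailer(b"some payload bytes")
    print(verify_trailer(packet))
    damaged = bytearray(packet)
    damaged[3] ^= 0x10          # flip one payload bit
    print(verify_trailer(bytes(damaged)))
```

With the check at the end, an error that trashes the rest of the packet
also trashes the check field itself, so the damaged data has to match a
damaged check rather than an intact one.  Adler32, by contrast, packs
two sums modulo 65521 into its 32 bits, which is where the 65521^2
figure above comes from.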