That is an interesting point of view. You are implicitly assuming that
applications using the transport are bounded (like the backup) and that a
"total checksum" can be applied to them.

What about those that don't fit this model?

What about continuously running applications?

Should your multinational bank run its applications with checksums on every
transaction?  How will they react to failures?
Reinvent a new recovery mechanism for every application? (In the ips WG we
went through exactly this, and it was/is a painful experience.)

I think that the implications of an "imperfect" transport (in the error
detection sense) are more extensive than applying another checksum, and the
cost to transport users (complexity, footprint, performance) is excessive.

For most data processing applications I think that transport users
would welcome a more resilient error detection mechanism built into the
transport.


What is at stake here is whether we want a safety belt or an alarm bell.
The current TCP checksum is the latter. It will detect "most" errors
with a good probability, which is enough to detect a faulty board. It
will not provide sufficient protection for guaranteeing a complete
absence of glitches in the 5 terabyte backup of a multinational
commercial bank. Note that, if you really follow the e2e argument, this
is OK: the only way to be certain that the back-up went well is by
computing a strong checksum over the whole volume, not by trusting TCP.
In fact, it is all a matter of probabilities and arbitrations. The
transmission system should be good enough that the backup is OK most of
the time, so that the e2e checksum (volume) only fails rarely, so that
the cost of correction is acceptable...

-- Christian Huitema
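
To make the "strong checksum over the whole volume" idea concrete, here is a minimal sketch of an end-to-end check computed independently of TCP. SHA-256 and the helper name `volume_digest` are assumptions chosen for illustration; the message above does not name a particular algorithm.

```python
import hashlib
import io

def volume_digest(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of everything readable from `stream`.

    Hypothetical helper: reads in chunks so a multi-terabyte volume never
    has to fit in memory.
    """
    h = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

# Stand-in for the volume on the sending side.
original = b"backup data " * 1000

# Stand-in for a received copy with a single flipped bit.
damaged = bytearray(original)
damaged[5000] ^= 0x01
damaged = bytes(damaged)

sent_digest = volume_digest(io.BytesIO(original))
good_digest = volume_digest(io.BytesIO(original))
bad_digest = volume_digest(io.BytesIO(damaged))

# The receiver compares its digest with the sender's, obtained separately.
backup_ok = (good_digest == sent_digest)
backup_corrupt = (bad_digest != sent_digest)
```

This only works for bounded transfers: the digest is known when the volume ends, which is exactly the assumption questioned at the top of this message.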

> Jonathan,
> You can't be suggesting a simple summation is worth using in the face
> of router memory errors.  You have detected these in the wild and noted
> positions within a 32-bit word.  You have statistics that indicate there
> are more bits in error than others, suggesting there may be a weak bus
> being seen.  From the simple tests that I have run, a simple summation
> is extremely weak in this area.  A CRC still does well even when the
> packet is corrupted including the CRC itself.  There is no need for
> to be affected to improve the performance of the algorithm.  I can say
> that Fletcher-16 2^n should be avoided altogether due to this extremely
> poor memory bus performance.  It is nowhere near 2^32 in performance.  It
> is closer to 2^6.  The only reason for placing a mandatory chunk at the
> end of the packet would be to ensure against truncation, which a good
> check would catch.  If there is only one checksum type allowed, then
> placing this at the end of the packet has the advantage of minimizing
> the potential passes this packet needs in preparation.  Adler-32 suffers
> the Fletcher problem if the packet is small or mostly zero.
> Doug
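
Two structural properties behind the comparison above can be checked directly. The sketch below does not reproduce the memory-bus tests Doug describes. It only shows that a 16-bit ones' complement sum (written here in the style of RFC 1071, as an assumed stand-in for "simple summation") is blind to reordered words while a CRC-32 is not, and that Adler-32 uses only a small part of its 32 bits on a short packet. The function name `internet_checksum` is illustrative.

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum of 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

packet = bytes(range(64))

# Swap the first two 16-bit words, as a misordered memory read might.
swapped = bytearray(packet)
swapped[0:2], swapped[2:4] = packet[2:4], packet[0:2]
swapped = bytes(swapped)

sum_misses_swap = internet_checksum(packet) == internet_checksum(swapped)
crc_catches_swap = zlib.crc32(packet) != zlib.crc32(swapped)

# Adler-32 on a short packet: the low half is 1 + (sum of bytes), so for
# 16 bytes it can never exceed 1 + 16 * 255 = 4081, well below 2^16.
adler_zero = zlib.adler32(bytes(16))             # 16 zero bytes
adler_max_low = zlib.adler32(b"\xff" * 16) & 0xFFFF
```

Addition is commutative, so any permutation of 16-bit words leaves the sum unchanged. The swap here changes only 32 contiguous bits, and a 32-bit CRC detects every error burst of that length.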
> >
> >
> > One further thing I've mentioned in email is that I recently
> > re-analyzed the captured error datasets which I and Craig and Vern
> > gathered, and I did find one pattern which could be exploited here.
> >
> > The pattern is that the errors we found can be broadly characterized
> > into two classes: either single-bit or short, low-Hamming-weight
> > errors; or errors where some prefix of the packet is intact, the
> > packet is then subjected to an error, and the error continues all the
> > way to the end of the packet. The ratio of errored bits within that
> > damaged `tail' of the packet is very close to 0.5.
> >
> > That suggests an error model where we model packet-level errors as
> > due to either single-bit errors, memory-readout errors which affect a
> > single word or cache line; or due to `stateful' errors in the
> > hardware/software finite-state engines which move packets between
> > packet-buffer memory and the hardware which implements some
> > media layer. (think of errors due to an under-run in a hardware
> > or a bad bit in a DMA pointer register.)
> >
> > There are two things to take away from that.  The first is that the
> > errors we've actually observed, in the only study of in-the-wild
> > packet-level errors I know of, are so heavy that, on
> > average, they affect more than R bits, for any R that's a plausible
> > error-check. That says we're only going to catch errors
> > The second is that, since the errors seem to be stateful, putting
> > error-check information at the end of the packet rather than in a
> > fixed header field doesn't hurt, and (for the reasons we analyzed to
> > death in our 98 ToN paper) will actually help, for the kinds of
> > nonuniform data we find in filesystems.
> >
> > If there's anything I can recommend to the tsvwg, it's to pick even a
> > 32-bit extension of the TCP checksum, rather than Adler32; and to
> > think seriously about moving the error-check bits to the end.  Not
> > to help hardware, but to make whatever error-check you use more
> > robust against errors in packet-processing engines which (once an
> > error has hit) trash the remainder of the packet.
> >
> >
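
The `stateful' error class in the quoted message can be sketched as a small simulation: an intact prefix followed by a tail of effectively random bits, with the error check carried as a trailer. Everything here is an assumption made for illustration, including the helper names, the 1000-byte zero payload, the corruption offset, and the use of CRC-32 as the trailer check. The message itself recommends a 32-bit extension of the TCP checksum, which is not defined precisely enough in the thread to reproduce.

```python
import random
import zlib

def corrupt_tail(packet: bytes, start: int, rng: random.Random) -> bytes:
    """Keep packet[:start] intact and replace the rest with random bytes."""
    tail = bytes(rng.getrandbits(8) for _ in range(len(packet) - start))
    return packet[:start] + tail

def bit_error_ratio(a: bytes, b: bytes) -> float:
    """Fraction of differing bits between two equal-length byte strings."""
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return diff / (8 * len(a))

def add_trailer(payload: bytes) -> bytes:
    """Append a 4-byte check (CRC-32 here) at the end of the packet."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def trailer_ok(packet: bytes) -> bool:
    """Recompute the check over the payload and compare with the trailer."""
    payload, trailer = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

rng = random.Random(1)                 # fixed seed so runs are repeatable
packet = add_trailer(bytes(1000))      # mostly-zero payload plus trailer
damaged = corrupt_tail(packet, 600, rng)

prefix_intact = damaged[:600] == packet[:600]
tail_ratio = bit_error_ratio(packet[600:], damaged[600:])
clean_passes = trailer_ok(packet)
damage_detected = not trailer_ok(damaged)
```

With a random tail, about half of the bits after the corruption point differ from the original, which matches the 0.5 ratio reported above. The trailer itself sits inside the damaged region in this model.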
