[e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything else?

Jonathan Stone jonathan at DSG.Stanford.EDU
Mon Jun 11 10:16:02 PDT 2001


In message <200106111335.f5BDZNF05472 at aland.bbn.com>, Craig Partridge writes:
>

>Going back to Dennis' question (which is one I've pondered nights over the
>past several weeks).  I think there's a broad error model that we might try
>to use.  The transport and network layer protocols deal with an environment
>where (we expect) link layer errors have been filtered out.
>
>So the checksum doesn't have to worry about bit-serial types of errors.
>
>The checksum must deal with the kinds of errors that a mix of hardware and
>software (exact mix to reflect your preference for hardware or software)
>tend to suffer in receiving, buffering, and transmitting packets.  Thought
>about this way, you get a few distinct classes of errors:
>
>    * buffers get mixed
>    * data gets transposed
>    * buffers get truncated
>    * byte or word erasures
>    * bit errors (which we could probably stomp out if we forced folks to use
>      parity on their memories... but we don't get to do that)

There's a different approach one can take.


I've been poring over the datasets which Craig and I reported in our
SIGCOMM 2000 paper, but analyzing them using a burst-error model (an
XOR of the damaged data and the retransmission), rather than the
`minimum-edit-distance' model I originally developed to understand how
the commutative TCP checksum would fare against in-the-wild errors.
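
To make the burst-error model concrete: for each damaged packet we
also have its retransmission, so the error pattern is just the XOR of
the two copies, and its Hamming weight is the number of damaged bits.
A minimal sketch in Python (not the analysis code from the paper; the
damaged/retransmitted pair here is made up):

    # Recover the error pattern from a damaged packet and its
    # retransmission, then count the damaged bits.
    def error_pattern(damaged: bytes, retransmitted: bytes) -> bytes:
        """XOR of the damaged copy against the (presumed good) copy."""
        assert len(damaged) == len(retransmitted)
        return bytes(a ^ b for a, b in zip(damaged, retransmitted))

    def hamming_weight(pattern: bytes) -> int:
        """Number of bit positions at which the two copies differ."""
        return sum(bin(b).count("1") for b in pattern)

    # Made-up example: damage starting mid-packet and running to the end.
    good = bytes(range(32))
    bad = bytearray(good)
    bad[10:] = bytes(len(bad) - 10)
    print(hamming_weight(error_pattern(bytes(bad), good)))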

That data bluntly rejects the traditional link-level assumption
Dennis referred to.  In our data, low-Hamming-weight errors do not
dominate.  Single-bit errors do occur -- they are the modal Hamming
weight, the most common individual error -- but after the initial
spike at one-bit errors, the Hamming-weight distribution stays
relatively flat up to about 6,000 bits (roughly half the bits in an
Ethernet packet).  The mean and median weights are thousands of bits.
For our datasets, the traditional link-level preoccupation with
low-Hamming-weight errors is not the right approach to error
detection.
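
If you want to run the same tabulation over your own traces, the
histogram is straightforward; a sketch (the list of error patterns
below is a made-up stand-in for real damaged/retransmitted pairs):

    from collections import Counter

    def weight_histogram(patterns):
        """Map Hamming weight -> how many error patterns had that weight."""
        return Counter(sum(bin(b).count("1") for b in p) for p in patterns)

    patterns = [bytes([0x01]), bytes([0x80]), bytes([0xFF] * 750)]
    print(weight_histogram(patterns))   # Counter({1: 2, 6000: 1})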

It's hard to describe what the data *does* suggest without flipping
through lots of graphs and plots.  One really striking fact is that,
after excluding single-bit and single-word errors, the overwhelming
majority of the errors we caught span all the way to the end of the
packet.  Once a packet gets damaged, the damage extends all the way
to the end of the packet.

What that says is that, excluding the single-bit errors and very
short bursts, the errors we see are mostly packet-stateful: once an
error first hits a particular packet, the rest of the packet from
that first error point onward is damaged.  That implies the error
processes are packet-stateful, too.
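
A rough way to test that property on an XOR error pattern (again a
sketch, not our actual analysis code):

    def first_error_offset(pattern: bytes) -> int:
        """Byte offset of the first damaged byte, or -1 if clean."""
        for i, b in enumerate(pattern):
            if b:
                return i
        return -1

    def damage_reaches_end(pattern: bytes) -> bool:
        """True if there is damage and the packet's final byte is damaged."""
        return first_error_offset(pattern) >= 0 and pattern[-1] != 0

    def damaged_fraction_after_first(pattern: bytes) -> float:
        """Fraction of bytes damaged from the first error onward."""
        start = first_error_offset(pattern)
        if start < 0:
            return 0.0
        tail = pattern[start:]
        return sum(1 for b in tail if b) / len(tail)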

That all suggests the errors aren't `on the wire' errors; instead,
they arise in the hardware (or software) which transfers packets
between the TCP/IP stack's buffers and the link-level hardware with
its FEC and hardware CRCs, with a few single-bit or single-word
errors thrown in for good measure.


What I took from this is a different approach from David Reed's:
instead of assuming an intelligent adversary process, I suggest a
blind adversary process, one which doesn't know the details of the
error-check algorithm.  If we are given R bits of error-check, are
told the Hamming weight of the errors is far more than R bits, and
are given no further information about the adversary, what should we
do?

Game theory says that for a blind adversary (not an intelligent
adversary or cracker, as in David's model) our best approach is a
maximin strategy: minimize the worst-case damage the blind adversary
can do.  To achieve that, we should choose error-check functions so
as to maximize the Shannon information of the available R check bits.
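
As a sketch of that informational metric: estimate the Shannon
entropy of the check value over random packets and compare it to the
R bits available.  The toy 8-bit check below is only there so the
plug-in estimate converges with a modest number of samples; the same
measurement applies to 16- or 32-bit checks given far more samples:

    import math
    import random
    from collections import Counter

    def entropy_bits(values) -> float:
        """Plug-in Shannon entropy estimate, in bits."""
        counts = Counter(values)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    def byte_sum_check(data: bytes) -> int:
        """Toy 8-bit check: sum of the bytes, mod 256."""
        return sum(data) & 0xFF

    random.seed(0)
    packets = [bytes(random.getrandbits(8) for _ in range(40))
               for _ in range(50000)]
    est = entropy_bits(byte_sum_check(p) for p in packets)
    print(f"~{est:.2f} of 8 possible bits")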

If one grinds through the math, it turns out that for short packets
(such as SS7 data) the Adler32 checksum is a poor checksum by that
metric, for the reasons Craig and I have posted to tsvwg two or three
times: the problem is the 8-bit input, the 16-bit accumulators, and
the Central Limit Theorem.
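
The effect is easy to see for yourself.  A sketch using Python's
zlib.adler32 (the packet length and sample count are arbitrary
choices): on short packets the 16-bit s1 accumulator can never get
near its modulus of 65521, and s2 is bell-shaped by the Central Limit
Theorem rather than uniform, so both halves carry well under 16 bits
of information:

    import math
    import random
    import zlib
    from collections import Counter

    def entropy_bits(values) -> float:
        counts = Counter(values)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    random.seed(0)
    sums = [zlib.adler32(bytes(random.getrandbits(8) for _ in range(8)))
            for _ in range(200000)]
    lo = [s & 0xFFFF for s in sums]    # s1: 1 + sum of the 8 bytes
    hi = [s >> 16 for s in sums]       # s2: running sum of the s1 values
    print(f"s1 spans {min(lo)}..{max(lo)}, "
          f"entropy ~{entropy_bits(lo):.1f} of 16 bits")
    print(f"s2 spans {min(hi)}..{max(hi)}, "
          f"entropy ~{entropy_bits(hi):.1f} of 16 bits")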


I'm with Craig on using crypto: it'd work, it does well on the
informational metric, but it is changing the ground-rules. 

I see absolutely no need to appeal to hmac-md5 to detect errors; md5
would do just fine.  Unless you postulate an `error process' which can
damage bits and then, in an adversarial fashion, recompute the md5 hash
over the damaged bits.  But is that a genuine error process, or a middle-box?
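
To be concrete about the distinction (a sketch; the key is a made-up
placeholder): an unkeyed digest catches a blind error process just as
well as a keyed one does; the key only buys you something against a
party that can recompute the check over the damaged bits:

    import hashlib
    import hmac

    packet = b"some packet payload"

    unkeyed = hashlib.md5(packet).digest()    # enough for blind errors
    keyed = hmac.new(b"shared-secret", packet, hashlib.md5).digest()

    damaged = b"some Packet payload"
    print(hashlib.md5(damaged).digest() == unkeyed)   # False: caught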


