[e2e] UDP checksum field?

Tue Apr 5 09:31:01 PDT 2005

Or, as Steve Ballmer, Prince of OS/2 LanManager, King of Faulty Releases, would
glare:  "WAD, so stifle".  (WAD = works as designed)

:]

Alex

Lynne Jolitz wrote:
> 
> (With no apologies to Microsoft...) - If the Oracle tech guy had gone to the Microsoft Research school of obfuscation, he would have said "The probability of this event occurring such that the reliability of the underlying link layer is impaired by an improbably low memory bit error at ten to the minus 12 excluding thermal radiative factors and charge displacement is so low as to be impossible, hence the question is irrelevant". :-)
> Lynne Jolitz
> 
> ----
> We use SpamQuiz.
> If your ISP didn't make the grade try http://lynne.telemuse.net
> 
> > -----Original Message-----
> > From: end2end-interest-bounces at postel.org
> > [mailto:end2end-interest-bounces at postel.org]On Behalf Of Cannara
> > Sent: Monday, April 04, 2005 10:03 AM
> > To: end2end-interest at postel.org
> > Subject: Re: [e2e] UDP checksum field?
> >
> >
> > I'll add a funny (if you're not using Oracle TNS gateways) SQL transport
> > example that still exists today, despite being pointed out to Oracle about
> > a decade ago.  When Network General was adding more SQL decodes to the
> > Sniffer(r), in the '90s, we had a presentation on the Oracle transport
> > (TNS) underlying SQL Net traffic.  TNS rode on Netware SPP, or TCP, etc.
> > The fellow went into packet fields in detail and explained how Oracle also
> > made gateway software available for Sun boxes to go from an Oracle system
> > to an IBM SNA db system.  The gateway received SQL on TNS on TCP on IP on
> > Ethernet (for instance) and spit out SQL on TNS or whatever IBM wanted.
> >
> > As he expounded on TNS pkt fields, a few hands went up -- "What's the
> > checksum field for if it's always 0?" asked a few experienced network
> > folks.  The presenter turned back to the slide show and said: "It's
> > unimplemented for now".  Without malice, another question was posed:
> > "Well if it's unused and your gateway has bad memory, how do you know the
> > data going into the db on the other side will be good?"  The presenter, a
> > highly lauded Oracle techy, looked at the screen for a bit, looked back at
> > the audience, shuffled his feet, looked again at the screen, and finally
> > said words like:  "I don't know".
> >
> > After the presentation, a letter was written to Oracle, copied to Ellison,
> > explaining exactly the problem and urging the TNS checksum be implemented.
> > No response ever came back, and, if you look at a TNS packet today, the
> > checksum is still zero.  I guess no one has used the gateway software who
> > cares about their data.  :]
> >
> > Alex
> >
> > PS Note that "gateway" here is used in the proper sense, not for "router".
> >
> > Lynne Jolitz wrote:
> > >
> > > > Yes, Lloyd is exactly right here. It is often the case that people
> > > > turn off UDP checksums to "buy" more performance by relying on the CRC
> > > > of the ethernet packet. It's not a stupid question - it's a very smart
> > > > question, and a lot of smart people get fooled by this.
> > >
> > > > For example, the Sun datacenter back in the early 1990s had an NFS
> > > > cluster project called Sunbox - an array of workstation CPUs that did
> > > > divide and conquer to build a massive file server. It used an ethernet
> > > > multiplexer to dynamically split the load. To buy back performance,
> > > > they turned off the UDP checksum. It worked fine until they had a bad
> > > > lot of ethernet boards with substandard memories - this wasn't picked
> > > > up in tests because the test units were doing resends of the
> > > > occasionally corrupted packets (UDP checksums were usually turned on),
> > > > and in TCP a failed checksum would trigger resends as well. It was
> > > > also a fairly rare problem, and the test periods were too short to
> > > > pick up on the nature of this problem easily.
> > >
> > > > But when UDP checksums were turned off in normal use, the resulting
> > > > NFS requests were corrupting the filesystem (which in this case held
> > > > database files), forcing rebuilds and manual repairs of database
> > > > tables.
> > >
> > > > As they were about to announce and release it, they suddenly
> > > > discovered this problem - they noticed the corruption and in order to
> > > > determine whether it was in the high level (stack or above) or lower
> > > > levels, they turned on checksums and it worked immediately.
> > >
> > > > They then examined the packets that failed the checksum to trace back
> > > > through the lower levels of the stack, down through the link layer, to
> > > > discover where the corruption occurred. With logic analyzers, they
> > > > were able to observe that the contents going into memory from the NIC
> > > > on reception were different from the contents going out of the memory
> > > > and traveling across the bus to the processor.
> > >
> > > > This is a surprisingly common problem in datacenters - sometimes the
> > > > problem would be a switch, sometimes a configuration error, sometimes
> > > > a programming error in the application, and so forth. I most recently
> > > > experienced this problem with an overheated ethernet switch passing
> > > > video on an internal network.
> > >
> > > > I also ran into this at an Internet portal company where I was a
> > > > manager. We were using NetApp file servers to mirror the daily
> > > > information - NetApp at the time encouraged staff to turn off
> > > > checksums to increase performance. The DBAs noticed problems and ended
> > > > up doing frequent rebuilds, but couldn't figure out why. It took me a
> > > > lot of time to convince my staff to turn on the checksums because they
> > > > were told "they don't have to" by NetApp. Most datacenter staff work
> > > > by cookbook, and this wasn't in the cookbook. When they finally tried
> > > > it, it worked. This little problem cost us a lot of time and
> > > > aggravation for very little (if any) performance gain.
> > >
> > > > The performance gain from turning off checksums can now be obviated
> > > > through the use of intelligent NIC technologies like SiliconTCP
> > > > (http://jolitz.telemuse.net/pubs/pt2001_01/item) and TOE that
> > > > calculate the checksum as the packet is being received. But we don't
> > > > have this in commodity switches yet, so check that switch if you're
> > > > having problems.
> > >
> > > > Higher level checksums are worth it every time. Don't leave the server
> > > > without them. :-)
> > >
> > > Lynne Jolitz.
> > >
> > > ----
> > > We use SpamQuiz.
> > > If your ISP didn't make the grade try http://lynne.telemuse.net
> > >
> > > > -----Original Message-----
> > > > From: end2end-interest-bounces at postel.org
> > > > [mailto:end2end-interest-bounces at postel.org]On Behalf Of Lloyd Wood
> > > > Sent: Monday, April 04, 2005 2:48 AM
> > > > To: Faisal Aslam
> > > > Cc: end2end-interest at postel.org
> > > > Subject: Re: [e2e] UDP checksum field?
> > > >
> > > >
> > > > On Sun, 3 Apr 2005, Faisal Aslam wrote:
> > > >
> > > > > > Why do we have a checksum field in the UDP header, as UDP does not
> > > > > > provide data retransmission etc.? I think it is used only to
> > > > > > silently discard a packet with a wrong checksum (that's it?).
> > > >
> > > > yes - you need an end-to-end check against a corrupted packet. UDP
> > > > could have the checksum turned off, which proved disastrous for a
> > > > number of applications, subtly corrupted filing systems which didn't
> > > > have higher-level end2end checks etc.
> > > >
> > > > > > Is there any other application of the checksum field?
> > > >
> > > > For other applications
> > > > http://www.faqs.org/rfcs/rfc3828.html
> > > >
> > > > UDP Lite originally sprang out of the observation that UDP has
> > > > redundant length information, and that this information could be
> > > > combined with the checksum (as in TCP/UDP) to give partial coverage.
> > > >
> > > > L.
> > > >
> > > > >
> > > > > Sorry if the question is too naive.
> > > > >
> > > > > Thanks
> > > > > Faisal
> >
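
For anyone following the thread who wants to see the end-to-end check
concretely, here is a minimal sketch of the 16-bit Internet checksum
(RFC 1071) as UDP applies it over the IPv4 pseudo-header, UDP header and
payload. The helper names, addresses, ports and payload are all made up
for illustration; this is not taken from any stack mentioned above. It
only shows that a single flipped bit, of the kind bad NIC memory might
introduce after the ethernet CRC has already been checked, changes the
checksum the receiver computes.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, then complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Checksum over IPv4 pseudo-header + UDP header + payload (hypothetical helper)."""
    length = 8 + len(payload)  # UDP header is 8 bytes
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))  # 17 = UDP protocol number
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudo + header + payload)
    # A checksum field of 0 on the wire means "no checksum", so a computed
    # value of 0 is transmitted as 0xFFFF instead.
    return csum or 0xFFFF


# Illustrative values only.
payload = b"NFS write: block 42"
good = udp_checksum("10.0.0.1", "10.0.0.2", 1023, 2049, payload)

# Simulate a single-bit memory error in the first payload byte.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
bad = udp_checksum("10.0.0.1", "10.0.0.2", 1023, 2049, corrupted)

print(good != bad)  # True: the receiver would see a mismatch and drop the packet
```

With the checksum field set to 0 (checksums "turned off"), the receiver
skips this comparison entirely, which is how the corrupted payload in the
stories above could reach the filesystem unnoticed.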