[e2e] Reacting to corruption based loss

Sam Manthorpe sm at mirapoint.com
Wed Jun 29 02:03:23 PDT 2005



On Tue, 28 Jun 2005, Cannara wrote:

> On the error rates issue, mobile is an extreme case, always subject to
> difficult conditions in the physical space, so symbol definitions & error
> correction are paramount.  However, most corporate traffic isn't over mobile
> links, but dedicated lines between routers, or radio/optical bridges, etc.
> Here, the reality of hardware failures raises its head and we see long-lasting
> error rates that are quite small and even content dependent.  This is where
> TCP's ignorance of what's going on and its machete approach to slowdown are
> inappropriate and costly to the enterprise.
>
> As an example of the latter, a major telecom company, whose services many of
> us are using this instant, called a few years back, asking for help

How many years ago?

> determining why just some of its offices were getting extremely poor
> performance downloading files, like customer site maps, from company servers,
> while other sites had great performance.  The maps were a few MB and loaded
> via SMB/Samba over TCP/IP to staff PCs.  The head network engineer was so
> desperate, he even put a PC in his car and drove all over Florida checking
> sites.  This was actually good.  But, best of all, he had access to the
> company's Distributed Sniffers(r) at many offices and HQ.  A few traces told
> the story:  a) some routed paths from some offices were losing 0.4% of pkts,
> while others lost none; b) the lossy paths experienced 20-30% longer
> file-download times.  By simple triangulation, we decided that he should check
> the T3 interface on Cisco box X for errors.  Sure enough, about 0.4% error
> rates were being tallied.  The phone-line folks fixed the problem and voila,
> all sites crossing that path were back up to speed!
>
> Now, if you were a network manager for a major corporation, would you rush to
> fix a physical problem that generated less than 1% errors, if your boss &
> users were complaining about mysterious slowdowns many times larger?
> 0.4% wasn't even enough to trigger an alert on their management consoles.  You'd
> certainly be looking for bigger fish.  Well, TCP's algorithms create a bigger
> fish -- talk about Henny Penny.  :]

I can't help but wonder - if TCP/IP were generally so sensitive to a loss
rate of 0.4%, then why does the Internet work?  I spent a long time simulating
the BSD stack a while back and it held up extremely well under random
loss until you hit 10%, at which point things go non-linear.  I've also
never experienced what you describe, either as a user or in my
capacity as an engineer debugging customer network problems.
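
Just to put a rough number on that (this is not the BSD simulation, just the
well-known Mathis et al. steady-state approximation, which ignores timeouts
and small windows), a few lines of Python with assumed numbers -- a 1460-byte
MSS and the ~70mS RTT quoted further down -- look like this:

    # Back-of-the-envelope TCP throughput under random loss, using the
    # Mathis/Semke/Mahdavi/Ott approximation: BW ~= (MSS/RTT) * C/sqrt(p).
    # MSS and RTT below are assumptions, not measurements from the traces.
    from math import sqrt

    MSS = 1460      # bytes per segment (assumed)
    RTT = 0.070     # seconds (the ~70mS path quoted below)
    C = 1.22        # model constant for periodic loss

    def mathis_bw(p):
        """Approximate steady-state throughput (bytes/sec) at loss rate p."""
        return (MSS / RTT) * C / sqrt(p)

    for p in (0.004, 0.01, 0.05, 0.10):
        print("loss %.1f%% -> ~%.0f KB/s" % (p * 100, mathis_bw(p) / 1024))

Even at 0.4% the model predicts a few hundred KB/s on a 70mS path, so random
loss by itself shouldn't produce a collapse; the real damage in the story
below comes from elsewhere.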

And what's with that "major corporation" and "boss" stuff?  I'm guessing
they'd prefer the "replace the hardware" solution to the "replace the
whole infrastructure with something that's incompatible with everything else
on the planet" one.

>
> The files were transferred in many 34kB SMB blocks, which required something
> like 23 server pkts per block.  The NT servers had a send window of about 6 pkts
> (uSoft later increased that to about 12).  All interfaces were 100Mb/s, except
> the T3 and a couple of T1s, depending on path.  RTT was about 70mS for all
> paths.

So the NT servers were either misconfigured, or your example is rather
dated, right?
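
To put numbers on that: with roughly a 6-pkt send window and a 70mS RTT, the
window alone caps throughput no matter how clean the path is.  A quick sketch
(1460-byte segments assumed):

    # Window-limited TCP throughput: at most one send window per RTT.
    # Segment size is assumed; window sizes and RTT come from the
    # scenario quoted above.
    MSS = 1460        # bytes (assumed)
    RTT = 0.070       # seconds

    for window_pkts in (6, 12, 23):
        bw = window_pkts * MSS / RTT          # bytes/sec ceiling
        print("%2d-pkt window -> ~%.0f KB/s (~%.1f Mbit/s)"
              % (window_pkts, bw / 1024, bw * 8 / 1e6))

A 6-pkt window pins even a loss-free transfer at roughly 1 Mbit/s -- nowhere
near a T3 -- which is a tuning problem, not a congestion-control one.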

> Thankfully, the Sniffer traces also showed exactly what the TCPs at both ends
> were doing, despite Fast Retransmit, SACK, etc.:

I don't know a lot about NT's history, but having a 9K window *and* SACK sounds
historically schizo.

> a) the typical, default
> timeouts were knocking the heck out of throughput; b) the fact that transfers
> required many blocks of odd numbers of pkts meant that the Ack Timer at the
> receiver was expiring on every block, waiting (~100mS) for the magical
> even-numbered last pkt in the block, which never came.  These defaults could
> have been changed to gain some performance back, but not much.  The basic idea
> that TCP should assume congestion = loss was the Achilles' heel.  Even the
> silly "ack alternate pkts" concept could have been largely automatically
> eliminated, if the receiver TCP actually learned that it would always get an
> odd number.

The issue you describe was fixed a long time ago in most stacks, as far as
I'm aware.  I fixed it in IRIX around 6 years ago.
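
The delayed-ACK arithmetic is easy to bound, too.  Assuming a ~3MB file (the
post says "a few MB"), 34kB blocks and a ~100mS ACK timer firing once per
block -- the file size is an assumption, the block size and timer come from
the description above:

    # Rough cost of the delayed-ACK timer expiring once per SMB block.
    # File size is assumed; block size and timer value are from the
    # scenario described above.
    FILE_BYTES = 3 * 1024 * 1024          # ~3 MB (assumed)
    BLOCK_BYTES = 34 * 1024               # 34kB SMB blocks
    ACK_TIMER = 0.100                     # ~100mS stall per block

    blocks = -(-FILE_BYTES // BLOCK_BYTES)   # ceiling division
    print("%d blocks -> ~%.1f s spent waiting on the ACK timer"
          % (blocks, blocks * ACK_TIMER))

Call it nine seconds of dead air per file, on top of the window-limited
transfer time -- consistent with the slowdowns described, and it disappears
once the receiver stops waiting for a segment that will never arrive.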

For fun, I tried an experiment.  I transferred a largish file to my sluggish
corporate ftp server.  It took 77 seconds (over the Internet, from San
Francisco to Sunnyvale).  I then did the same thing, but this time I unplugged
my Ethernet cable 6 times, each time for 4 seconds.  The transfer took 131
seconds.  Not bad, I think.  At least not bad enough to warrant a
rearchitecture.
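
If anyone wants to repeat that (admittedly unscientific) test, a few lines of
Python are enough -- the host, login and file name below are placeholders,
not my actual setup:

    # Time an FTP upload, roughly the unplug-the-cable test above.
    # Host, credentials and file path are placeholders (assumptions).
    import time
    from ftplib import FTP

    def timed_upload(host, user, password, path):
        start = time.time()
        ftp = FTP(host)
        ftp.login(user, password)
        with open(path, "rb") as f:
            ftp.storbinary("STOR " + path.split("/")[-1], f)
        ftp.quit()
        return time.time() - start

    print("took %.1f seconds" % timed_upload(
        "ftp.example.com", "user", "secret", "/tmp/largish-file.bin"))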

Cheers,
-- Sam


