[e2e] Open the floodgate

Cannara cannara at attglobal.net
Thu Apr 22 01:05:01 PDT 2004


I'm reducing the extraneous copies here because we should all be on the same
list.  

So, David, yes the hard problems are very important.  And, when you go out and
work with 1000 or so real companies using TCP/IP on The Internet and their own
corporate nets, you get to see many interesting, even problematic things. 
Now, whether or not they meet your definition of being caused by "hard
problems" in data transport may be irrelevant to these folks just trying to use
TCP/IP as a tool.  When a tool breaks on a simple task it may be hard to
understand and even to fix, but the problem may not be a "hard" one for
theorists.  To the tool's users, however, the problem is indeed important.

The problem may, in fact, be one that network theorists consider too
simple.  Let's take a reliably simple one:  TCP's inability to
distinguish a congestion loss from an error loss (a bad CRC...).  When a 30kB
transfer is needed every few seconds for a $50/hr employee to do his/her work,
minutes count, especially if the employee interacts directly with customers. 
So if the phone company supplying a WAN link (T1, cell...) has a bit error
rate of zero, the employee works well for his/her pay.  When the phone company
has a small error problem, or employs technology that naturally incurs small
losses, the end company expects relatively small impairment of the workers'
time to task completion.  This is very real world.  It's also quite a
reasonable expectation, especially when the company has been told by
installation techs that "TCP is reliable" in the face of packet losses.
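
To make the arithmetic concrete, here's a minimal sketch (Python, with an
assumed bit error rate and packet size) of how a "small" carrier error rate
turns into the packet loss rate TCP actually sees:

```python
# How a "small" carrier bit-error problem becomes ~1% packet loss:
# a packet is lost if any one of its bits is corrupted, so
#   p_loss = 1 - (1 - BER)**bits_per_packet
# The BER and packet size here are illustrative assumptions.
BER = 1e-6                # one bad bit per million -- "small" to a carrier
PACKET_BITS = 1500 * 8    # a full Ethernet-size frame

p_loss = 1 - (1 - BER) ** PACKET_BITS
print(f"packet loss ~ {p_loss:.2%}")   # -> packet loss ~ 1.19%
```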

Well, will the company be disappointed.  Because, as anyone who has seen
TCP react to a single packet loss knows, there's a very nonlinear and stiff
penalty for any loss (largely stemming from the ancient fear of Internet
meltdown).  In the case of 1% loss, the company estimates, from what it's
been told about TCP, that its employees will be slowed by about 1%.  So, when
employees are complaining because they're suddenly 30% slower, the company
techs don't even suspect a "small" source of loss.  They start looking for
biggies -- extreme loads, busted cables, broken routers, etcetera.  Now, a bad
thing has happened:  misdirection of troubleshooting.
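
For a feel of just how nonlinear, the steady-state model of Mathis et al.
(1997) bounds TCP throughput at roughly (MSS/RTT) * C/sqrt(p).  A
back-of-the-envelope sketch, with assumed MSS and RTT:

```python
import math

# Mathis et al. (1997): steady-state TCP throughput <= (MSS/RTT) * C/sqrt(p),
# with C ~ sqrt(3/2).  The MSS and RTT below are assumed, not measured.
MSS = 1460      # bytes
RTT = 0.050     # seconds, a plausible WAN round trip
C = math.sqrt(1.5)

def mathis_bound(p):
    """Upper bound on TCP throughput in bytes/sec at packet loss rate p."""
    return (MSS / RTT) * (C / math.sqrt(p))

for p in (0.0001, 0.001, 0.01):
    print(f"loss {p:.2%}: <= {mathis_bound(p) * 8 / 1e6:.2f} Mbit/s")
```

(For a short 30kB transfer, slow start and timeouts dominate rather than this
steady-state bound, and at low loss the bound exceeds a T1's line rate anyway;
the point is the 1/sqrt(p) shape of the penalty, not the absolute numbers.)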

They find their network paths are lightly loaded, the cables and interfaces
are all ok, the routers are happy, and so on.  What the %$&$# is going on? 
Yes, TCP is reliable, because the employees eventually get their tasks done,
but some offices are slow and some are not, and there's no apparent difference
in the light traffic anywhere.

So some smartypants gets a Sniffer(r) and starts looking at the slow folks.
TCP retransmissions!  But only a few.  But look at those backoff and restart
delays!  Hey, weren't duplicate-ACK fast retransmit and SACK supposed to help
that?  Well, they do, sort of.  So now the Sniffer guy can compute the loss
rate -- about 1%. 
But, the time to deliver 30kB is on average 30% more than for users not seeing
the 1% losses.  Well, well, TCP may be reliable, but it certainly takes its
time about it, even when the network load is light.  Now where could we be
losing 1% of our packets?
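
Those backoff delays compound quickly.  A sketch of the dead air added by
timeout-driven recovery, assuming an illustrative initial RTO of one second:

```python
# Why "only a few" retransmissions hurt: a timeout-driven recovery stalls
# the transfer for a full RTO, and consecutive losses double the wait each
# time (exponential backoff).  The initial RTO here is an assumed value.
def stall(consecutive_losses, rto=1.0):
    """Total idle seconds added by n back-to-back timeout recoveries."""
    return sum(rto * 2 ** i for i in range(consecutive_losses))

for n in (1, 2, 3):
    print(f"{n} consecutive timeout(s): +{stall(n):.1f} s of dead air")
```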

That's the right question, finally.  By seeing how sensitive TCP is to loss,
we come to the problem needing solution, whether one thinks it's a hard or
easy one.  Well, if it's easy, it'll be in the stats, because now that we
understand TCP's weakness, some interface(s) must be losing due to errors. 
Solution:  check all interfaces common to the paths slowed employees use. 
Oops, here's one to the phone company that's showing about 1% CRC errors
incoming.  Call 'em up to fix it.
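
On a modern Linux host that check can even be scripted; a minimal sketch that
reads the kernel's per-interface statistics files (the 0.1% threshold is an
arbitrary assumption):

```python
from pathlib import Path

# Minimal Linux sketch of "check all the interfaces": read the kernel's
# per-interface counters and flag a suspicious CRC-error rate.
THRESHOLD = 0.001  # flag anything above 0.1% errored frames (assumed cutoff)

def stat(iface: str, name: str) -> int:
    return int((Path("/sys/class/net") / iface / "statistics" / name).read_text())

for dev in Path("/sys/class/net").iterdir():
    try:
        errs = stat(dev.name, "rx_crc_errors")
        pkts = stat(dev.name, "rx_packets")
    except (OSError, ValueError):
        continue
    if pkts and errs / pkts > THRESHOLD:
        print(f"{dev.name}: {errs}/{pkts} ({errs/pkts:.2%}) CRC errors"
              " -- call the carrier")
```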

Now the above is not uncommon in the real world.  Is it "hard" to figure out
how packets can be transported efficiently under errored rather than congested
conditions?  We certainly know that several hundred $50/hr employees slowed by
30% is a hard problem to their employer, because it's very expensive.  But,
for some reason, giving TCP a fighting chance to distinguish causes of loss
has been very "uninteresting" for many years (ECN is apparently the ISDN of
this millennium :).  Yet, there are things that can be done (archives will tell
some), even without much damage to "The Foundation".  :]
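
To be clear about what a "fighting chance" might mean, here's one purely
illustrative heuristic -- not any standardized TCP behavior -- that reacts
less drastically to a loss when ECN has been negotiated and the path hasn't
been marking packets:

```python
# Purely illustrative heuristic, NOT standardized TCP: if ECN was negotiated
# and the path has been setting CE marks, a loss is probably congestion; if
# the path is unmarked and lightly loaded, suspect a bit error and back off
# less drastically.  Function name and numbers are assumptions.
def cwnd_after_loss(cwnd: int, mss: int, recent_ce_marks: int) -> int:
    if recent_ce_marks > 0:
        return max(cwnd // 2, 2 * mss)   # classic halving: treat as congestion
    return max(cwnd - mss, 2 * mss)      # gentler guess: likely an error loss

print(cwnd_after_loss(64_000, 1460, recent_ce_marks=3))  # -> 32000
print(cwnd_after_loss(64_000, 1460, recent_ce_marks=0))  # -> 62540
```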

This is just one of a number of TCP improvements that have been
suggested.  Many of these could even be implemented in backward-compatible
ways.  Of course, if the Internet protocols were open source, and source- and
release-controlled...oops, what did I just say!?  {:o]

Alex

"David P. Reed" wrote:
> 
> I'd like to offer the following comment, which is intended as constructive
> technical criticism.
> 
> The Internet's value is focused on solving a subtle problem - allowing many
> different sets of technical requirements to coexist in a single common
> networking infrastructure resource.   As such, the measure of "optimality"
> for the Internet is its ability to be maximally adaptable to as wide a set
> of uses as possible.   There is no clear evidence that delivering massive
> files between two points is either the best and highest economic use of the
> Internet or a representative sample of the future Internet.
> 
> It is well known how to maximize the speed of an unlimited dragster (a
> specialized automobile) on Bonneville salt flats.   Such techniques
> contribute to human knowledge.   They do not, however, address many of the
> key requirements of engineering a universal transportation system; and in
> particular, focusing on such speed does not optimize such things as
> maneuverability, low lifetime maintenance cost, minimizing greenhouse gas
> emissions, etc.
> 
> Now getting to TCP flow control and congestion control.   I am as concerned
> as anyone that the current TCP algorithms are not evolving to new
> situations.   However, the situations that I believe we must take into
> account are the newly emerging classes of applications, whatever they may
> be - not just the applications that benchmark well.   Though super-computer
> file transfers are one such case, there are many other, far more diverse
> operating points that it is desirable for the network to concurrently
> support.   Such operating points include very high burst rates where the
> flow lifetime is too short to provide flow-based congestion control, and
> operating points with very high rates of reconfiguration and mobility,
> during the lifetime of a "connection".  But even those are quite simple.
> 
> At the same time, the potential to congest the network is not going
> away.   The solution to congestion is coordination algorithms that make
> reasonable and fair decisions in at least two independent dimensions:  how
> to obtain additional capacity where needed, and which traffic to restrain
> and how to restrain it.   (We almost always tend to neglect the former,
> since we are all poor engineers who live on limited budgets, and are not
> used to having to make investment decisions to deploy new network capacity,
> except in our own homes, where it is cheaper to assume that when we need a
> gigabit LAN, it will come down to the price of today's 100 mbit LAN.  But
> in the arena of transport systems, that turns out to be true as well.   We
> are nowhere near the physical limits of our ability to get bits between two
> points on the earth, and we are deploying capacity at an exponential
> rate).  We almost certainly need such coordination algorithms to be
> completely decentralized and "future proof" in the sense that they can be
> adapted to innovative new uses of the network.
> 
> The problem of congestion is not the simple academic theory problem that
> you can solve either by benchmarking Internet2 drag-races or by doing
> papers about session level flow control as if that is all that matters
> (because the only connections that matter are FTPs or WWW transfers - using
> today's applications as if they represent the mid- or long-term future is a
> major research strategy error).
> 
> The real research problems around congestion and control are much less
> obvious and much more important.
> 
> So improving the startup time of an individual TCP connection (or a few) is
> nice and useful, and worth doing.  But if we let it get in the way of
> seeing through to the really hard problems of managing interactions in an
> ever more complex Internet, then a whole community of researchers is
> wasting its time.   That's what advanced development is about, not systems
> research.

