[e2e] Open the floodgate - back to 1st principles

Guy T Almes almes at internet2.edu
Sun Apr 25 12:03:06 PDT 2004


Jon,
  Thanks for opening this thread.
  Although I have thoughts/opinions about the answers, let me also add my 
two cents on refining the question.
  Take the comments below as complementary to your points.
        -- Guy

--On Sunday, April 25, 2004 14:11:32 +0100 Jon Crowcroft 
<Jon.Crowcroft at cl.cam.ac.uk> wrote:

> so going back to the original posting  about Yet Another Faster TCP
> (and i recommend people go look at
> http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/bicfaq.htm
> for the technical background),
> it would be nice to think through some of the economics here
>
> the motive in almost all the Go Faster Stripe TCPs
> is often cited as the time it takes TCP AIMD
> to get up to near line rate on today's ultra-fast networks
>
> but why are we considering this as very important? how many times does a
> TCP session actually witness a link with the characteristics cited (and
> tested against) in the real world, e.g. 100 ms, 1-10 Gbps, without ANY
> OTHER sessions present? in the test cases, we often see people spending
> some time at places like CERN, SLAC, and so forth, waiting for a new
> optical link to be commissioned, before they get to be able to run their
> experiment - how long does one have to wait before the link routinely has
> 100s or 1000s of flows on it? at which point why are we trying to get
> 100% of the capacity in less than the normal time

Let me try to restate this.  It's not just that the Reno/AIMD family of TCP
algorithms takes time to ramp up; it also exhibits great fragility with
respect to loss once it has ramped up.  On a 10,000 km path with available
capacity in the 1-10 Gb/s range, the occasional loss-induced stumble will
take many minutes to recover from.  During those minutes, the link is badly
underutilized *and* the file transfer is getting very poor performance.
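
To make "many minutes" concrete, here is a rough back-of-the-envelope
sketch (plain Python; the numbers are assumed round figures, not
measurements from any particular path: a 100 ms RTT, 10 Gb/s of available
capacity, and 1500-byte packets):

  # Rough AIMD recovery-time estimate after a single loss.
  # All numbers are assumptions chosen for illustration.
  RTT = 0.100                 # seconds (roughly a 10,000 km path)
  CAPACITY = 10e9             # bits per second
  PACKET = 1500 * 8           # bits per packet (1500-byte packets)

  # Window needed to fill the pipe, in packets (the bandwidth-delay product).
  W = CAPACITY * RTT / PACKET          # about 83,000 packets

  # Reno/AIMD halves its window on loss and then adds about one packet per
  # RTT, so climbing back to full rate takes roughly W/2 round trips.
  recovery_seconds = (W / 2) * RTT

  print(f"window at line rate: {W:,.0f} packets")
  print(f"recovery time      : {recovery_seconds / 60:.0f} minutes")
  # prints roughly 69 minutes for these numbers

Even at 1 Gb/s of available capacity, the same arithmetic gives several
minutes of recovery per loss.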

Not only that, but just before the loss-induced stumble, Reno/AIMD builds
up a massive queue at the head of the bottleneck link.  Router designers
are told to provide buffers big enough to absorb these queues.  One result
of all this is that the bottleneck link swings between periods of
congestion (with huge queueing delay) and periods of underutilization.
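
For a sense of scale, the conventional sizing rule is roughly one
bandwidth-delay product of buffering at the bottleneck; continuing the
sketch above with the same assumed numbers:

  # The classic "one bandwidth-delay product" buffer rule and the queueing
  # delay it implies once Reno/AIMD has filled that buffer.
  # Same assumed numbers as above; not a description of any specific router.
  RTT = 0.100                 # seconds
  CAPACITY = 10e9             # bits per second

  buffer_bits = CAPACITY * RTT                  # C * RTT of buffering
  buffer_megabytes = buffer_bits / 8 / 1e6      # 125 MB of packet memory

  # A full buffer adds one extra buffer-draining time to every packet.
  added_delay_ms = (buffer_bits / CAPACITY) * 1000    # another 100 ms

  print(f"bottleneck buffer              : {buffer_megabytes:.0f} MB")
  print(f"added queueing delay when full : {added_delay_ms:.0f} ms")

So the delay seen by every flow sharing the bottleneck swings by roughly a
full RTT as that queue fills and drains.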

During the last few years, it's become common to use multiple parallel 
Reno/AIMD TCP flows to move a single large file.  The idea is that, while 
one TCP stumbles, the others can proceed and make good use of the otherwise 
underutilized capacity.  This does not work as well as one would like, 
largely because these multiple flows tend to synchronize with each other, 
resulting in really massive congestion/queueing periods and then
(unless you use several dozen parallel flows) surprising periods of 
underutilization.
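
The idealized arithmetic behind the parallel-flow approach looks roughly
like this (a sketch using the rough Mathis/Semke/Mahdavi steady-state
relation, with an assumed loss rate and the same assumed path numbers; it
treats the flows' losses as independent, which is exactly the assumption
that synchronization breaks):

  # Idealized aggregate throughput of n parallel Reno/AIMD flows, using the
  # rough steady-state relation: rate ~ (MSS/RTT) * sqrt(3 / (2 * p)).
  # Loss rate and path numbers are assumptions chosen for illustration.
  from math import sqrt

  RTT = 0.100            # seconds
  MSS = 1500 * 8         # bits per packet
  P_LOSS = 1e-6          # assumed per-packet loss probability

  def aggregate_rate(n_flows):
      per_flow = (MSS / RTT) * sqrt(3.0 / (2.0 * P_LOSS))
      return n_flows * per_flow          # bits per second, if independent

  for n in (1, 4, 16):
      print(f"{n:>2} flows -> {aggregate_rate(n) / 1e9:.2f} Gb/s if independent")

Which is why people reach for a dozen or more flows, and why the
synchronization described above hurts so much.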

The new generation of TCP congestion control algorithms, including the
cited work at North Carolina State and also interesting work at Caltech,
Rice, Cambridge, and other places, is a good attempt to give us a much
better alternative to the multiple-stream Reno/AIMD approach.  This work is
likely to yield two benefits:
<> provide better TCP stacks that are much more robust in very high
delay-bandwidth-product environments, and
<> provide better insight into the boundary between the operating ranges
where TCP/IP-style networks and dedicated-circuit-style networks are
respectively optimal.
Both of these benefits will be important.

And the folks who currently need this technology are not limited to a
handful of CERN-to-SLAC high-energy physics researchers.  The economics and
quality of gigE interfaces, the increasing prevalence of 2.5-to-10 Gb/s
wide-area circuits, and the need for high-speed file transfer (and other
applications that look pretty much like file transfer) are all combining.
And the really hard problems arise not when the high-speed link is
dedicated (the problem then, of course, is technically easier), but when
the big file transfers are combined with a moderate, but dynamic, number of
other users, perhaps including several 1-gigE flows at a time over the wide
area.


>
> another side to this motivational chasm seems to me to be: if we have a
> really really large file to transfer, does it matter if we have to wait
> 100s of RTTs before we get to near line rate? frankly, if it's a matter of
> a GRID FTP to move a bunch of astro, or HEP or genome data, then there's
> going to be hours if not days of CPU time at the far end anyhow, so  a
> few 10s of seconds to get up to line rate is really neither here nor
> there (and there are of course more than one HEP physicist going to be
> waiting for LHC data, and more than one geneticist looking at genome
> data, so again, what is the SHARE of the link we are targeting to get
> to?)
But consider that there will be a large dynamic range of file sizes and of 
distances (RTTs).  I'd caution against stereotyping this user community. 
The requirement space is more textured than you might imagine, and we
shouldn't relegate this entire space to the dedicated-lambda school.


>
> so of course then there's the argument that with even fiber-optic loss
> rates, TCP on its own, on a truly super duper fast link with sufficient
> RTT, will never even get to line rate, coz the time to get from half rate
> to full (i.e. 1 packet per RTT, so W/2 RTTs where W = capacity x RTT in
> packets) is long enough to always see a random loss which fools the TCP -
> this last point is fixed simply by running SACK properly, and admitting
> there might be merely TWO tcps and random losses, although bursty at the
> bit level, are hardly likely to correlate at the packet level, especially
> not across the large MTUs we might use on these links.
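
A quick back-of-the-envelope check of that ramp-time point, with assumed
numbers (100 ms RTT, 10 Gb/s, 1500-byte packets, and an assumed random
per-packet loss rate of about what a 1e-12 bit error rate would give):

  # Expected number of random (non-congestion) losses during one AIMD climb
  # from W/2 back to W.  All numbers are assumptions for illustration.
  RTT = 0.100                 # seconds
  CAPACITY = 10e9             # bits per second
  PACKET = 1500 * 8           # bits per packet
  P_RANDOM = 1e-8             # assumed; ~ a 1e-12 BER on 1500-byte packets

  W = CAPACITY * RTT / PACKET          # full window, about 83,000 packets

  # The climb takes W/2 RTTs at an average window of about 3W/4, so it
  # carries roughly 3*W*W/8 packets.
  packets_in_climb = 3 * W * W / 8
  expected_losses = packets_in_climb * P_RANDOM

  print(f"packets in one climb   : {packets_in_climb:.2e}")
  print(f"expected random losses : {expected_losses:.0f}")
  # about 26 for these numbers, so a plain AIMD climb essentially never
  # completes without being interrupted
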
>
> [note also none of the Big Science TCP users with these types of
> datarates pretend to have humans at the end points -while there are
> people connecting CAVEs and other VR systems to physics simulation
> systems, the data rates are things we managed to do rather a long time
> back....usually - often one can move a lot of the computer model to the
> right end of the net to achieve the right mix of throughput/latency even
> there, too, so I am doubtful people need more than the roughly 250 Mbps
> that HDTV types ask for]
Many of the large file transfers *do* have humans at the ends, at least in 
the sense of someone running a computation that needs the file transfer as 
input and should run within a minute (say) if things are efficient.

>
> So, back to economics - in general, we have a network which is speeding
> up in all places - in the core it is speeding up for 2 reasons
> 1/ number of users - primary reason I believe
> 2/ access link speed up - secondary (but I could be wrong)
>
> access links speed up in 2 general ways
> i) we replace site 10baseT with 100baseT with GigE with 10GigE etc - this
> is really corporate or server side stuff.
> ii) we (on a logistical long time scale) replace individual user lines or
> SMEs' lines with something a bit better (modem -> DSL, dialup to cable
> modem, etc)
>
> I guess someone will have the historical data on this but taking the UK -
> we were doubling the number of dialup users each year, but it took 10
> years to go from 0 to 2M DSL lines - so the contribution from raw browser
> demand cannot be nearly as significant as the mere contribution of weight
> of numbers.
>
> Hence, going back to the TCP argument above, we might expect the number
> of TCP sessions on large fat pipes to always be high -
>
> so while there is an increase in the rate TCP sessions would like to run
> at, I believe it is much slower overall than we are anticipating - it's
> probably worth being smarter about random losses, but what I am arguing
> is that we should change the concentration of work on
> highspeed/scalable/fast/bic, to look at the behaviour of large numbers of
> medium speed flows

There are (at least) two issues:
[1] for a given mix of flows (few large or many small or a mix), how can 
TCP (or any other end-to-end transport layered over a statistically 
multiplexed datagram service such as IP) best be engineered?

[2] when is it best for the high-end users to grab dedicated circuits
versus using shared IP networks?

*Both* of these questions are important, and I'd say that the
highspeed/scalable/fast/bic folks are shedding light on both.
>
>
> back to the physics example - after processing, the total cross section
> of data from the LHC at CERN is 300 Mbps. that is NOT hard. in the genome
> database, a typical search can result in around 300Gbyte of intermediate
> data (sometimes) - however this is usually input to something that takes
> a few days to process (some protein expression model or whatever) - no
> problem.
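
For scale, a quick transfer-time calculation on those figures (the 300 GB
volume is from the paragraph above; the link rates are assumed for
illustration, and no protocol effects are modeled):

  # Transfer times for a 300 GByte result at a few assumed link rates.
  VOLUME_BITS = 300e9 * 8          # 300 GByte expressed in bits

  for rate_gbps in (0.3, 1.0, 10.0):
      seconds = VOLUME_BITS / (rate_gbps * 1e9)
      print(f"300 GB at {rate_gbps:4.1f} Gb/s : {seconds / 60:6.1f} minutes")
  # roughly 133, 40, and 4 minutes; small next to days of processing,
  # which is the point being made above.
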
>
> I'd love to see a paper where a 10Gbps link has say 1000 flows of varying
> duration on it...
>
> cheers
>
>    jon
>
>



