[e2e] Short Fat Networks, tcp and Policers

Mon May 17 01:52:36 PDT 2004

Follow the [PG],

A) Minimum throughput
=======================

When you browse through tcp literature, you will usually come across
the Bandwidth x Delay = Window formula. It has a very straightforward
meaning: if you want full utilization of the available bandwidth, you
had better have a window that is large enough to fill the RTT with
packets.
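
As a quick illustration of the formula, here is a minimal sketch. The
100 Mb/s rate and 50 msec RTT are made-up values chosen only to show
the arithmetic, and the function name is hypothetical.

```python
def bdp_window_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Window, in bytes, needed to keep a path of bandwidth_bps full over rtt_s."""
    return bandwidth_bps * rtt_s / 8.0


# Illustrative values only: a 100 Mb/s path with a 50 msec RTT.
window = bdp_window_bytes(100e6, 0.050)
print(f"window needed: {window:.0f} bytes")
```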

Even a simpleton like myself knows this. But let us ask the question in
reverse: How *low* a bandwidth can we pass through a line with a given,
fixed, delay? Some simple thinking will show you that it is (almost)
impossible to deliver less than a few MSS each RTT. Why? Because even
during slow start, tcp sends a new packet once an ack is received.
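
To put rough numbers on this floor, a small sketch follows. It assumes
a 1460-byte MSS (1500-byte Ethernet MTU minus 40 bytes of headers) and
takes "a few MSS" to mean two segments per RTT; both are assumptions
made for illustration, not values given in this message.

```python
def min_rate_bps(mss_bytes: int, rtt_s: float, segments_per_rtt: int = 1) -> float:
    """Rate of a flow that sends segments_per_rtt full-sized segments every RTT."""
    return segments_per_rtt * mss_bytes * 8 / rtt_s


MSS = 1460  # assumed: 1500-byte MTU minus 40 bytes of IP and tcp headers

# RTTs are illustrative; 0.7 msec is the back-to-back delay quoted later on.
for rtt_ms in (0.7, 10, 100):
    floor = min_rate_bps(MSS, rtt_ms / 1000.0, segments_per_rtt=2)
    print(f"RTT {rtt_ms:>5} msec -> floor of about {floor / 1e6:.2f} Mb/s")
```

The shorter the RTT, the higher the floor, which is why short paths are
the awkward case for a policer.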

>> excellent point, I would appreciate it if this list
for one clarified the "MSS". Today we support jumbo
frames etc. Does this imply we increase the MSS
in proportion to the MTU?

What is the implication of the above? That in order to
cause a tcp flow to use *less* than the
physically available link rate, one must *increase* the
RTT. Of course, this is exactly what shapers
do, by buffering the data. Alternatively one could
buffer the acks (and I think that "ack holding"
schemes have been proposed in the literature), but
that requires L3-L4 knowledge and treatment of
piggybACKs.

>> Shapers also provide something else. They let you
differentiate services. Assume that one wants to
really run a "converged network" which carries voice,
video and data; one also needs the flexibility to have
CBR. Shapers can do that.
Either way, shapers or no shapers, if everything was
"policers" or say, slotted/TDM access based, there
would still always be a point in the network where one
would have to "buffer", and that is the point of
ingress into the "policed" core. This specifically
holds true for cases where we have say a 100Mbps LAN
and a X<100Mbps WAN slotted access bandwidth in front.

----> Agreed, but one way to use shapers to
differentiate is the diffserv scheme. In diffserv,
you don't allocate a shaper per *service* but per
*class of service*. Or in other words, you aggregate
all services with the same CoS into the same shaper.
This is done inside the network. At the edges, you can
either shape, or police. If you police, you will run
into the problems that I have described.

[PG]:

1.Say I want more than 8 classes of service, basically
as many as I can offer to my customers.
2.Say I want this class mapping to be universal not
PHB. Now?

B) tcp timeout Phase
========================

So what happens when you try to limit bandwidth below
the MSS/RTT limit? A policer achieves this
by discarding all packets that are non-conforming, and
this will cause the session to run in a
burst-timeout-burst phase. Typical implementations set
timeouts to as much as several hundred msec.
In this phase (that is, when the required bandwidth
is less than Const x MSS/RTT), the bandwidth
can range from zero to [Policer Burst Size]/[Tcp
Timeout]. This phase is not efficient, and causes huge
bursts in the network. For timeouts around 500 msec, a
4 Mb/s session (which is not unreasonable on LANs
and WANs) needs a policer burst of 512KB. Since bursts
tend to get translated into buffer sizes, we see
that a single service eats up quite a large buffer
size. The irony being that in the timeout phase, the
burst is not buffered, because if the burst was
buffered, the RTT would increase, and we wouldn't have
been in the timeout phase in the first place.
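
The [Policer Burst Size]/[Tcp Timeout] relation can be turned around to
size the burst. The sketch below assumes exactly one burst gets through
per timeout; the function name is hypothetical. Note that the plain
product for 4 Mb/s and 500 msec comes to roughly 250 kB, so the 512KB
quoted above presumably includes some margin or assumes a longer
effective timeout; the message does not say which.

```python
def policer_burst_bytes(rate_bps: float, timeout_s: float) -> float:
    """Burst size such that one burst per timeout averages out to rate_bps."""
    return rate_bps * timeout_s / 8.0


# Figures from the text: a 4 Mb/s session and a 500 msec timeout.
burst = policer_burst_bytes(4e6, 0.5)
print(f"burst per timeout: {burst / 1000:.0f} kB")
```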

>> agreed.

C) Partial window phase
========================

A reasonable way to generate sub-physical line rates
(without adding to the RTT) is to cause
the tcp to work at a window that is less than
[physical Rate]x[RTT]. The frame pattern would be
something like M frames every RTT, with MxMSS/RTT ~
[policed rate]. This is a much better behaviour
than the timeout phase, and a good policer design
should strive to reach this phase. As pointed out
before, this phase cannot exist when the required rate
is too low, or the RTT is too short.
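
A small sketch of the M x MSS/RTT ~ [policed rate] relation, and of the
boundary between the two phases. The 1460-byte MSS and the threshold of
two frames per RTT (standing in for the "Const" of section B) are
assumptions made for illustration.

```python
def frames_per_rtt(policed_rate_bps: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    """M such that M x MSS / RTT equals the policed rate."""
    return policed_rate_bps * rtt_s / (mss_bytes * 8)


def phase(policed_rate_bps: float, rtt_s: float,
          mss_bytes: int = 1460, min_frames: float = 2.0) -> str:
    """Return 'partial window' if at least min_frames fit in one RTT, else 'timeout'."""
    m = frames_per_rtt(policed_rate_bps, rtt_s, mss_bytes)
    return "partial window" if m >= min_frames else "timeout"


# 1 Mb/s over a 0.7 msec path: far less than one frame per RTT.
print(phase(1e6, 0.0007))
# 4 Mb/s over a 50 msec path: about 17 frames per RTT.
print(phase(4e6, 0.050))
```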

>> Which is why TCP slow starts, I presume (I believe
the more "literate" on the list can correct me if I am
wrong).
----> I don't think that slow start was "designed" with
policers in mind.

PG:
Fine, as I said I was not sure. But it serves a
purpose here. By that logic doing the reverse (sending
in as much and then pulling it down to a lower amount
gradually) would achieve similar results if the
bandwidth was sufficient.

D) Stability of the partial window phase
========================================

Eventually, because of slow start or congestion
avoidance, the number of frames in the window M
slowly creeps up, until the policer "realizes" that
the policed rate has been passed, and the policer
will discard a series of frames. If this is done
delicately enough (suppose using a RED-like algorithm),
fast retransmission will take place and the session
shall be able to slow start its way back to the target
of M frames per RTT.
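
For readers who have not built one, a minimal token-bucket policer
sketch follows. The token bucket is one common way to implement a
policer and is an assumption here, as are the class name and all the
numbers; this message does not prescribe a particular design.

```python
class TokenBucketPolicer:
    """Single-rate policer: forward a frame if enough tokens remain, else drop it."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last_time = 0.0

    def conforms(self, now_s: float, frame_bytes: int) -> bool:
        # Refill for the elapsed time, capped at the configured burst size.
        elapsed = now_s - self.last_time
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now_s
        if self.tokens >= frame_bytes:
            self.tokens -= frame_bytes
            return True   # forwarded with no added delay
        return False      # discarded; a policer queues nothing


# Hypothetical setup: 1 Mb/s policed rate, 6000-byte burst, 1500-byte frames.
policer = TokenBucketPolicer(rate_bps=1e6, burst_bytes=6000)

# A window of 8 frames arrives back to back at t = 0.
results = [policer.conforms(0.0, 1500) for _ in range(8)]
print(results)
```

Once the window outgrows the bucket, the tail of the window is dropped
as a block. A RED-like variant would instead begin dropping with some
probability before the bucket is empty; that is not shown here.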

If the policing is too drastic, either an entire
window of M packets will be discarded, or the
fast-retransmit frame itself will be lost, and a
timeout will occur.

E) tcp defence lines and policers
==================================

tcp has three defence lines against congestion
* self clocking
* congestion window + slow start
* retransmission timeout
Not only do they protect the network, they also
control the bandwidth that the application receives.
Policers completely neutralize the first line of
defence, since they have no effect on the RTT (or on the
more subtle inter-packet gap). The only way that
policers can indicate rate to the tcp layer is by
packet discard. Tcp responds to packet discard by
retransmission timeouts or fast retransmission; both
are considered inefficient, but compared to the huge
problems caused by timeouts, the slight inefficiency
caused by fast-retransmission-induced slow start is
minor. The estimation of timeouts based on averaged RTT
statistics is totally irrelevant when the actual
network round trip is below 100msec, but the recommended
tcp tick is 500msec.
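
To make the last point concrete, here is a simplified model of a
smoothed RTO estimate followed by rounding up to a coarse clock tick.
The gains 1/8 and 1/4 are the usual Jacobson values; the rounding rule
and the function name are assumptions for illustration, and real
stacks differ in the details.

```python
import math


def rto_estimate(rtt_samples_s, tick_s=0.5, alpha=1 / 8, beta=1 / 4):
    """Return (raw_rto, quantized_rto) in seconds for a list of RTT samples."""
    srtt = rtt_samples_s[0]
    rttvar = srtt / 2
    for sample in rtt_samples_s[1:]:
        # Exponential smoothing of the variation, then of the RTT itself.
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
        srtt = (1 - alpha) * srtt + alpha * sample
    raw = srtt + 4 * rttvar
    # A timer driven by a coarse clock cannot fire earlier than one tick.
    ticks = max(1, math.ceil(raw / tick_s))
    return raw, ticks * tick_s


# Two very different paths: 0.7 msec and 50 msec RTT, ten samples each.
for rtt in (0.0007, 0.050):
    raw, quantized = rto_estimate([rtt] * 10)
    print(f"RTT {rtt * 1000:.1f} msec: raw RTO {raw * 1000:.2f} msec, "
          f"timer fires after {quantized * 1000:.0f} msec")
```

In this model both paths end up with the same timer value, so all the
careful smoothing makes no difference below one tick.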

>>Also consider the case of "how soon can that
feedback" be given to the sliding window mechanism so
as to detect that there was a "congestion". We are
talking about a system where the propagation delay or
duration of "feedback" is considerable compared to the
rate at which the input comes/packets are transmitted.

F) What is the point?!
======================

The points are
1- policers are much easier to build than shapers, so
we should start to understand them.

>> Considering all the points you have made, it seems
shapers should be the obvious choice. They need not
be the "exact buffered" shapers as you mention them,
but they could be made such that:
a. minimal buffering is needed at the end points.
b. There is a way to differentiate CBR streams from
non CBR streams

---> I suppose that giving a shaper per service is a
possibility, but the complexity, compared to a shaper
per CoS, is prohibitive. Policers, on the other hand,
are relatively simple to implement, and scaling is not
a big issue. Shapers per CoS are enough to support
differentiation. I'd imagine that policed CBR streams
would go into a single high priority buffer, instead
of having to select among many eligible CBR shapers.

2- acceptance tests are usually done with very short
RTT times, so that the timeout phase is
quite relevant.
3- All these exponential smoothings and estimations of
RTT RMS are useless if the smallest timeout
is 300msec, and typical LANs are less than 50msec.

PG:
.... and how do you fit "CBR" into this?
A simple question: if one does slotted access, are time
slots per class evenly allocated? Or are they even
close to that? If they are not close to the even
allocation, and the source is a CBR stream, what
happens in that case? Do you not have to buffer the
stream till your next slot comes in? So whatever be
the way the packet is marked, the principle of
shaping, or getting "as close to" shaping as
possible, makes sense. It is all a matter of how well
you allocate the slots in the slotted access
mechanism, right?
As far as acceptance tests go :) if everyone is happy
and has "accepted" whatever is made, why are we doing
anything on the same infrastructure any more?

>>Why would you say that?

---> my point is that afaik, the clock used to measure
rtt and rttvar has a granularity of a few hundred msec
( except in very recent tcp implementations, where the
cpu clock is used ). 

PG:
But TCP has always been a "packet" service. Clock for
RTT var or not, there will always be a "window" which
can be sent out as a "burst" in 1 shot.

4- a lot of TCP improvements have been based around
"LFN"s, but it may turn out that a lot of the broadband
networks are really "SFN"s (Short Fat Networks).

>>Let me put it this way, performance is always
"relative". If you sit near an end point then it is
an SFN for you, else it is an LFN. Would that not be true?
Either way, the idea is to get "more" out of an
existing infrastructure, not to say "this
infrastructure was not made for this".

---> I don't quite follow you here. Consider the
following "practical" situation:

Your customer wants to benchmark your equipment. She
hooks up 2 PCs back to back through your traffic
management box, and starts pinging/iperfing while the
traffic management is disabled. She measures a
beautiful 700usec delay and a 100 Mb/s throughput
(say that it's a 100bT FE interface). Now she tries to
set the box to say 1Mb/s. There are two possibilities:

1) you are using a shaper, so that the delay jumps to
about 12msec, and the bandwidth is fine.

PG:
All I am saying is: allocate your "slots" on the link
itself in such a way that it comes as close to
shaping as possible. Certainly that is not a terribly
huge overhead?
It is fair to say in this case that one can always
define a service for "best effort" and some for "CBR"
and some for only "guaranteed bandwidth" which is not
shaped. Moreover one also needs to get this out of
the intermediate equipment so that it is "end to
end", right? Also the limits of "bandwidth" allocated
for CBR, that allocated for guaranteed, or the
reservations for the same have to be done across end
points.

2) you are using a policer. Every packet that gets
transmitted has a 700usec delay, but there are a lot of
timeouts. Bandwidth is fine, because you have
configured a large burst.

Both results need some explaining to the client. This
is a fairly common situation.
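
The 12msec figure in case 1 is consistent with the time one full-sized
frame takes to drain from a 1 Mb/s shaper. The sketch below assumes a
1500-byte frame and assumes that this is where the number comes from;
the function name is hypothetical.

```python
def serialization_delay_s(frame_bytes: int, rate_bps: float) -> float:
    """Time to clock one frame out at the given rate."""
    return frame_bytes * 8 / rate_bps


FRAME = 1500  # assumed: one full-sized Ethernet payload

shaped = serialization_delay_s(FRAME, 1e6)      # shaper set to 1 Mb/s
line_rate = serialization_delay_s(FRAME, 100e6)  # plain 100 Mb/s interface
print(f"shaped at 1 Mb/s: {shaped * 1000:.2f} msec per frame")
print(f"at 100 Mb/s:      {line_rate * 1000:.2f} msec per frame")
```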

PG:
:) hope that someday customers will be happy :) 

G) The End
