[e2e] Some thoughts on WLAN etc., was: Re: RES: Why Buffering?

Mon Jul 6 06:55:58 PDT 2009

I think you are confusing link troubles with "network" troubles.  (And 
WLANs are just multi-user links, pretty much.)

Part of the architecture of some link layers is a "feature" that is 
designed to declare the link "down" by some kind of measure.  Now this 
is clearly a compromise - the link in most cases is only temporarily 
"down", depending on the tolerance of end-to-end apps for delay.

In an 802.11* LAN (using standard WiFi MAC protocols), there is one of 
these "declared down" features, whether you are using APs or Virtual 
LANs or AdHoc mode (so called) or even 802.11s mesh.  Since 802.11 
doesn't take input from the IETF, it has no notion of meeting the needs 
of end-to-end protocols for a *useful* declaration.  Instead, by 
studying their navels, the 802.11 guys wait a really long time (and 
therefore a relatively useless one in terms of semantics) before 
declaring a WLAN "link" down.  Of course that is a "win" if your goal is 
just managing the link layer.

What would be useful to the end-to-end protocol is a meaningful 
assessment of the likelihood that a packet will be deliverable over that 
link as a function of time as it goes into the future.   This would let 
the end-to-end protocol decide whether to tear down the TCP circuit and 
inform the app, or just wait, if the app is not delay sensitive in the 
time frame of interest.

Unfortunately, TCP's default timeout is typically 30 seconds long - far 
too long for a typical interactive app.  And in some ways that's right: 
an app can implement a shorter-term "is the link alive" check by merely 
using an app layer handshake at a relevant rate, and declaring the e2e 
circuit down if too many handshakes are not delivered.  If you think 
about it, this is probably optimal, because otherwise the end-to-end app 
would need a language to express its desire to every possible link 
along the way, and also to the "rerouting" algorithms that might 
preserve end-to-end connectivity by "routing around" the slow or 
intermittent link.
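
To make the app layer handshake idea concrete, here is a minimal sketch 
in Python.  The class name, the one-second interval and the threshold of 
three missed handshakes are all my own illustrative assumptions, not 
anything TCP or a real library defines; the app would pick values that 
match its own delay tolerance.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks app-layer handshakes and decides when the e2e circuit is "down".

    interval_s and max_missed are hypothetical, app-chosen parameters.
    """

    interval_s: float = 1.0   # how often the app expects a handshake reply
    max_missed: int = 3       # missed replies tolerated before declaring "down"
    last_ack_s: float = 0.0   # time the last handshake reply arrived

    def on_ack(self, now_s: float) -> None:
        """Record that a handshake reply arrived at time now_s."""
        self.last_ack_s = now_s

    def missed(self, now_s: float) -> int:
        """Number of whole handshake intervals elapsed since the last reply."""
        return max(0, int((now_s - self.last_ack_s) // self.interval_s))

    def is_down(self, now_s: float) -> bool:
        """True once too many consecutive handshakes have gone unanswered."""
        return self.missed(now_s) >= self.max_missed


monitor = HeartbeatMonitor(interval_s=1.0, max_missed=3)
monitor.on_ack(10.0)
print(monitor.is_down(12.5))  # two intervals missed, still considered up
print(monitor.is_down(13.0))  # three intervals missed, declared down
```

Note that nothing here asks the link layer for an opinion: the decision 
is made entirely at the endpoints, on the app's own time scale.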

Recognize the "end to end argument" in that last paragraph?   It says: 
we can't put the function of determining "app layer circuit down" into 
the different kinds of elements that make up the Internet links.  
Therefore we need to do an end-to-end link down determination.  And in 
fact, if we have that, we don't need the link layer to tell the ends 
when they are down.  So the function of "app layer circuit down" should 
NOT be required of the network elements.

What should we put in the network?  Well, we can definitely improve 
matters for lots of protocols by "routing around" crappy wireless 
connectivity quickly.  So the routing algorithms should probably avoid 
buffering lots of data to be sent over a degraded wireless link, and 
start routing that traffic over an alternative path, if there is one.  
This increases the chance that higher levels will experience no 
interruption.  And if buffers are not allowed to build up in the process 
(which means signalling to the endpoints to slow down via head drops or 
ECN) one can avoid congestion in the network by reflecting it out to the 
endpoints quickly enough as the overall capacity degrades.
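
As a rough illustration of "head drops" with a small buffer, here is a 
sketch in Python.  It is not how any particular router or driver does 
it; the class name and the capacity are hypothetical.  The point is only 
that when the queue is full, the oldest packet is discarded rather than 
the newest, so the gap in the stream reaches the receiver (and, via its 
ACKs, the sender) as early as possible.

```python
from collections import deque


class HeadDropQueue:
    """Bounded FIFO that discards the oldest packet when it is full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0  # count of packets discarded from the head

    def enqueue(self, packet):
        """Queue a packet; return the head packet that was dropped, if any."""
        victim = None
        if len(self.packets) >= self.capacity:
            victim = self.packets.popleft()  # drop from the head, not the tail
            self.dropped += 1
        self.packets.append(packet)
        return victim

    def dequeue(self):
        """Return the next packet to transmit, or None if the queue is empty."""
        return self.packets.popleft() if self.packets else None


queue = HeadDropQueue(capacity=3)
for seq in range(5):
    queue.enqueue(seq)
print(list(queue.packets), queue.dropped)
```

The queue never holds more than `capacity` packets, so the delay it can 
add is bounded no matter how badly the link behind it degrades.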

Thus the network shouldn't spend its time holding onto packets in 
buffers.  Instead it should push the problem to the endpoints as quickly 
as possible.  Unfortunately, the link layer designers, whether of DOCSIS 
modems or 802.11 stacks, have it in their heads that reliable delivery 
is more important than the cost to endpoints of deep buffering.  DOCSIS 
2 modems have multiple seconds of buffer, and many WLANs will retransmit 
a packet up to 255 times before giving up!  These are not a useful 
operational platform for TCP.  It's not TCP that's broken, but the 
attempt to maximize link capacity, rather than letting routers and 
endpoints work to fix the problem at a higher level.
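
To put rough numbers on what deep buffering and persistent retries cost 
the endpoints, here is a back-of-the-envelope sketch in Python.  The 
buffer size, link rate and per-attempt time below are hypothetical 
values picked only to show the arithmetic; they are not measurements of 
any real modem or WLAN.

```python
def drain_time_s(buffer_bytes: int, link_rate_bps: float) -> float:
    """Time for a full buffer to drain, i.e. the queueing delay it can add."""
    if link_rate_bps <= 0:
        raise ValueError("link rate must be positive")
    return buffer_bytes * 8 / link_rate_bps


def retry_hold_time_s(max_attempts: int, per_attempt_s: float) -> float:
    """Worst-case time one packet is held while the link layer keeps retrying."""
    return max_attempts * per_attempt_s


# Hypothetical: a 250,000-byte buffer in front of a 1 Mbit/s uplink.
print(drain_time_s(250_000, 1_000_000))

# Hypothetical: 255 attempts at 2 ms each before the link layer gives up.
print(retry_hold_time_s(255, 0.002))
```

Either delay is far larger than the round trip time most TCP 
connections would otherwise see, which is why the endpoints cannot react 
in time.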

On 07/05/2009 08:44 AM, Detlef Bosau wrote:
> Lachlan Andrew wrote:
>> Greetings Detlef,
>>
>> 2009/7/5 Detlef Bosau <detlef.bosau at web.de>:
>>> Lachlan Andrew wrote:
>>>> "a period of time over which all packets are lost, which extends for
>>>> more than a few average RTTs plus a few hundred milliseconds".
>>> I totally agree with you here in the area of fixed networks, 
>>> actually we use
>>> hello packets and the like in protocols like OSPF. But what about 
>>> outliers
>>> in the RTT on wireless networks, like my 80 ms example?
>>
>> That is why I said "plus a few hundred milliseconds".
>
> Now, how large is "a few"?
>
> Not to be misunderstood: There are indeed networks where a link state 
> can be determined.
> E.g.:
> - Ethernet, Normal Link Pulse,
> - ISDN, ATM, where we have a continuous bit flow,
> - HSDPA, where we have a continuous symbol flow on the pilot channel 
> in downlink direction and responses from the mobile stations in uplink 
> direction.
>
> In all these networks, we have continuous or short time periodic 
> traffic on the link and this traffic is reflected by responses in a 
> quite well known period of time. In addition, the behaviour of 
> hello-response does not seem to depend on any specific traffic. In 
> Ethernet or ATM, a link, or a link outage respectively, is detected 
> even when no traffic from upper layers exists.
>
> In some sense, this even holds true for HSDPA, when we define a HSDPA 
> link to be "down", when the base station does not receive CQI 
> indications any longer.
>
> I'm not quite sure (to be honest: I don't really know) whether similar 
> mechanisms are available e.g. for Ad Hoc Networks.
> Particularly as we well know of hidden terminal / hidden station 
> problems, where stations in a wireless network do not even see each 
> other.
>
>
>>   You're right
>> that outliers are common in wireless, which is why protocols to run
>> over wireless need to be able to handle such things.
>>
> Exactly.
>
> So, we come to an important turn in the discussion. It's not only the 
> question whether we can detect a link outage.
> The question is: How do we deal with a link outage?
>
> In wireline networks, link outages are supposed to be quite rare. 
> (Nevertheless, the consequences may be painful.)
> In contrast to that, link outages are extremely common in MANETs. 
> Actually, we have to ask what the term "link" and the term "link 
> outage" or "disconnection" shall mean in MANETs.
>
> For example, think of TCP. How does TCP deal with a link outage?
>
>
> Now, if this were a German mailing list and I came from Cologne, I 
> would write: "Es is wie es is und et kütt wie et kütt."
> More internationally spoken: "Don't worry, be happy."
>
> If the path is finally broken, the TCP flow is broken as well.
>
> If there is an alternative path and the routing is adjusted by some 
> mechanism, the TCP flow will continue.
>
> Of course, there may be packet loss. So, TCP will do packet 
> retransmissions.
> Of course, the path capacity may change. So, TCP will reassess the 
> path capacity. Either by slow start or by one or several 3 D ACK / 
> fast retransmit, fast recovery cycles.
> Of course, the throughput may change. That's the least problem of all, 
> because it's automatically fixed by the ACK clocking mechanism.
> Of course, the RTT may change. So, the timers have to converge to a 
> new expectation.
>
> There will be some rumbling, more or less, but afterwards, TCP will 
> keep on going.
>
> Either way, there is no smart guy to tell TCP "there is a short time 
> disconnection." Hence, there is no explicit mechanism in TCP to deal 
> with short time disconnections. Because the TCP mechanisms as they are 
> work fine - even when short time disconnections and path changes 
> occur. There is no need for some "short time disconnection handling".
>
> Of course, this will raise the question whether TCP as is can be 
> suitable for MANETs, because we can well put in question whether e.g. 
> the RTO estimation and the CWND assessment algorithms in TCP will hold 
> in the presence of volatile paths with volatile characteristics.
>
> TCP is supposed to work with a connectionless packet transport 
> mechanism with "reasonably quasistationary characteristics" and a 
> packet loss ratio we can reasonably live with.
>
> Or for the people in Cologne: "Es is wie es is und et kütt wie et kütt."
>
>>> Was there a "short time disconnection" then?
>>> Certainly not, because the system was busy to deliver the packet all 
>>> the
>>> time.
>>
>> From the higher layer's point of view, it doesn't matter much whether
>> the underlying system was working hard or not... 
>
> Correct. From the higher layer's point of view, the questions are:
> - is the packet acknowledged at all?
> - is the round trip time "quasistationary" (=> Edge's paper).
> - is the packet order maintained or should we adapt the dupackthreshold?
> - more TCP specific: Is the MSS size appropriate or should it be changed?
>>  If the outlier were
>> more extreme, then I'd happily call it a short term disconnection, and
>> say that the higher layers need to be able to handle it.
>>
>
> Question: Should we _actively_ _handle_ it (e.g. Freeze TCP?) or 
> should we build protocols sufficiently robust, so that protocols can 
> implicitly cope with short time disconnections?
>>> So the problem is not a "short time disconnection", the problem is that
>>> timeouts don't work
>>
>> Timeouts are part of the problem.  Another problem is reestablishing
>> the ACK clock after the disconnection.
>>
>
> Hm. Where is the problem with the ACK clock?
>
> If anything, the problem could be (and I'm not quite sure about WLAN 
> here) that a TCP downlink may use more than one path in parallel. 
> Hence, there 
> may be three packets delivered along three different paths - and a 
> sender in the wireline network sees three ACKs and hence sends three 
> packets....
>
> However, in the normal "single path scenario", I don't see a severe 
> problem. Or do I miss something?
>
>>> Actually, e.g. in TCP, we don't deal with "short time disconnections"
>>
>> There may not be an explicit mechanism to deal with them.  I think
>> that the earlier comment that they are more important than random
>> losses is saying that we *should* perhaps deal with them (somehow), or
>> at least include them in our models.
>
> I'm actually not convinced that short time disconnections are more 
> important than random losses.
>
> If this was the attitude of the reviewers who rejected my papers, I 
> would suppose they would try to tease me.
>
> Of course, I could redefine any random loss to be a short time 
> disconnection - hence there wouldn't be any random loss at all.
>
> However, this would be some nasty kind of hair splitting.
>
> I think, the perhaps most important lesson from my experience from 
> last week is that we must not suppose
> one wireless problem to be more important than others.
>
> Of course this puts in question mainly the opportunistic scheduling 
> work which assumes that there is only Rayleigh Fading
> and despite the useful, well behaved, periodic and predictable 
> Rayleigh Fading for evenly moving mobiles, there is no other 
> disturbance on the wireless channel.
>
> Of course, many students earn their "hats" that way, but the more I 
> think about it, the less I believe that this really reflects reality.
>
> Detlef
>>> So, the basic strategy of "upper layers" to deal with short time
>>> disconnections, or latencies more than average, is simply not to 
>>> deal with
>>> them - but to ignore them.
>>>
>>> What about a path change? Do we talk about a "short time 
>>> disconnection" in
>>> TCP, when a link on the path fails and the flow is redirected then? We
>>> typically don't worry.
>>
>> Those delays are typically short enough that TCP handles them OK.  If
>> we were looking at deploying TCP in an environment with common slow
>> redirections, then we should certainly check that it handles those
>> short time disconnections.
>>
>>> To me, the problem is not the existence  - or non existence - of 
>>> short time
>>> disconnections at all but the question why we should _explicitly_ 
>>> deal with
>>> a phenomenon that no one worries about?
>>
>> The protocol needn't necessarily deal with them explicitly, but we
>> should explicitly make sure that it handles them OK.
>>
>>> Isn't it sufficient to describe the corruption probability?
>>
>> No, because that ignores the temporal correlation.  You say that the
>> Gilbert-Elliott model isn't good enough, but an IID model is orders of
>> magnitude worse.
>>
>> Cheers,
>> Lachlan
>>
>
>
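
On Lachlan's point just above about temporal correlation: a small 
simulation shows why an IID loss model and a Gilbert-Elliott style model 
can share the same average loss rate and still look nothing alike to a 
transport protocol.  This is a sketch under simplifying assumptions of 
mine: every packet is lost in the "bad" state and none in the "good" 
state, and the transition probabilities are hypothetical values chosen 
so that both models lose about 10% of packets.

```python
import random


def iid_losses(n, loss_prob, rng):
    """Each packet is lost independently with probability loss_prob."""
    return [rng.random() < loss_prob for _ in range(n)]


def gilbert_elliott_losses(n, p_good_to_bad, p_bad_to_good, rng):
    """Two-state Markov chain; simplified so that 'bad' means 'lost'."""
    losses = []
    bad = False
    for _ in range(n):
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        elif rng.random() < p_good_to_bad:
            bad = True
        losses.append(bad)
    return losses


def mean_burst_length(losses):
    """Average length of runs of consecutive lost packets (0.0 if none)."""
    bursts = []
    run = 0
    for lost in losses:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0


rng = random.Random(1)
iid = iid_losses(100_000, 0.10, rng)
ge = gilbert_elliott_losses(100_000, 0.01, 0.09, rng)
print("loss rate  iid / correlated:", sum(iid) / len(iid), sum(ge) / len(ge))
print("mean burst iid / correlated:", mean_burst_length(iid), mean_burst_length(ge))
```

With these parameters the long-run loss rate of the chain is 
0.01 / (0.01 + 0.09) = 0.1, the same as the IID case, but the expected 
burst length is about 1 / 0.09, roughly 11 packets, against about 1.1 
for IID.  A run of 11 lost packets is what an endpoint experiences as a 
short time disconnection.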