[e2e] tcp connection timeout

Vadim Antonov avg at kotovnik.com
Wed Mar 1 19:18:46 PST 2006


On Wed, 1 Mar 2006, David P. Reed wrote:

> Actually, there is no reason why a TCP connection should EVER time out 
> merely because no one sends a packet over it.

The knowledge that connectivity is lost (i.e. that there is no *ability* to
send information) is valuable before the need to send arises.  A preemptive
action can then be taken to either alert the user or to switch to an
alternative.  Just an example (with a somewhat militarized slant): it makes
a lot of difference whether you know that you don't know your enemy's
position, or you falsely believe you know where they are while they have
moved and you simply didn't hear about it because some wire was cut.

There's also the issue of dead end-point detection and of releasing the
resources allocated to such a dead end-point (which may never come back).
There is no way to discriminate between a dead end-point and an end-point
which merely keeps quiet, other than by detecting the loss of the
connection.

So, in practice, all useful applications end up with some kind of timeouts
(and keepalives!) - embedded in a zillion protocols, mostly designed
improperly, or left to the user's manual intervention.  That makes
absolutely no sense - in a good design, shared functions belong at the
level below, so functionality is not replicated.

What is needed is an API which lets applications specify the maximal
duration of connectivity loss acceptable to them.  This part is
broken-as-designed in all BSD-ish stacks, so few people use it.
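
For illustration, here is roughly what the per-socket knobs look like on a
Linux stack (a sketch only; TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are
Linux-specific, and portable code gets nothing beyond SO_KEEPALIVE and the
system-wide two-hour default):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Declare the peer dead after roughly idle + interval * count
     * seconds of silence, e.g. set_keepalive(fd, 30, 5, 3) gives
     * detection in about 45 seconds instead of two hours. */
    static int set_keepalive(int fd, int idle, int interval, int count)
    {
        int on = 1;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                       &idle, sizeof idle) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                       &interval, sizeof interval) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                       &count, sizeof count) < 0)
            return -1;
        return 0;
    }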

Keepalives are arguably the crudest method for detecting loss of
connectivity: they load the network with extra traffic and do not provide
fast detection of the loss.  But, because of their crudity, they always
work.

A good idea would be to fix the standard socket API and demand that all TCP
stacks allow useful minimal keepalive times (down to seconds), rather than
leave application-level protocol designers to implement workarounds at the
application level.

And, yes, provide the TCP stack with a way to probe the application to
check if it is alive and not deadlocked (that being another reason to do
app-level keepalives).
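
For comparison, here is the kind of workaround every application-level
protocol ends up reinventing - a minimal heartbeat (a sketch; the one-byte
PING/PONG framing is made up, and a real protocol would have to keep it
from colliding with ordinary data):

    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Send a one-byte probe and wait up to 'seconds' for the one-byte
     * reply.  A peer that answers is not merely reachable - its
     * application is scheduled and responding, which raw TCP keepalives
     * cannot tell you.  Note SO_RCVTIMEO sticks to the socket; real
     * code would save and restore it. */
    static int app_ping(int fd, int seconds)
    {
        char ping = 'P', pong;
        struct timeval tv = { seconds, 0 };

        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0)
            return -1;
        if (write(fd, &ping, 1) != 1)
            return -1;              /* connection already gone */
        if (read(fd, &pong, 1) != 1)
            return -1;              /* timed out or closed */
        return 0;                   /* peer application is alive */
    }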

A tap into the routing system, which in turn obtains link status from the
underlying hardware with its quick detection of loss of carrier, would be
best, but it is also complicated.  A limited form of it (i.e. shutting
down TCP sessions when a directly attached interface carrying them stays
down for longer than their timeouts) could be useful.
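
On Linux a crude form of that tap already exists as rtnetlink link
messages; a monitor could watch carrier transitions along these lines (a
sketch, not a complete netlink parser - and mapping interfaces back to the
TCP sessions riding on them is the hard part left out):

    #include <linux/rtnetlink.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Subscribe to link-state broadcasts and report carrier changes. */
    int main(void)
    {
        struct sockaddr_nl sa;
        char buf[8192];
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (fd < 0)
            return 1;
        memset(&sa, 0, sizeof sa);
        sa.nl_family = AF_NETLINK;
        sa.nl_groups = RTMGRP_LINK;     /* link up/down notifications */
        if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
            return 1;

        for (;;) {
            ssize_t len = recv(fd, buf, sizeof buf, 0);
            struct nlmsghdr *nh;

            if (len < 0)
                break;
            for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
                 nh = NLMSG_NEXT(nh, len)) {
                if (nh->nlmsg_type != RTM_NEWLINK)
                    continue;
                struct ifinfomsg *ifi = NLMSG_DATA(nh);
                printf("ifindex %d carrier %s\n", ifi->ifi_index,
                       (ifi->ifi_flags & IFF_RUNNING) ? "up" : "down");
            }
        }
        return 0;
    }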

The same kind of mentality (i.e. if a lower level is broken, just dump the
problem into application developers' laps) gave us the IPv4->IPv6
application portability issues (by not hiding domain name to transport
address conversions) and many other bogosities, so any sizeable software
project nowadays includes mandatory work on an "OS abstraction" layer which
attempts to hide the ugliness - and you still have to produce new binaries
every time something changes underneath.
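
The repair in that particular case did eventually arrive as getaddrinfo()
(RFC 3493), which hides the address family behind the name lookup; a
connect routine written against it (sketched below) runs unchanged over
v4 and v6:

    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Protocol-independent connect: the application never touches
     * sockaddr_in vs sockaddr_in6 - it just walks whatever address
     * families the name resolves to. */
    static int connect_by_name(const char *host, const char *service)
    {
        struct addrinfo hints, *res, *ai;
        int fd = -1;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;      /* v4 or v6, whichever works */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(host, service, &hints, &res) != 0)
            return -1;
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd < 0)
                continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;                    /* connected */
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }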

There's a rule of thumb: you cannot get rid of complexity, but you can move
it around.  If it stays in one place, it is manageable.  If you move it
around so that it is replicated in many places, it always comes back to
bite you.

Keep keepalives in one place, puh-leeease.

--vadim


