[e2e] Retransmission Timouts revisited

Thu Aug 25 12:13:38 PDT 2005

I intendedly do not write "TCP" because this matter is not restricted to
TCP but retransmission timeouts are required in _any_ protocol
which has to cope with packet losses. It doesn´t matter whether a
timeout is detected on the sender or on the receiver.
As long as there is no other mechanism to detect packet loss, we must
rely on timouts. 

I just compared the rto algorithms given in Edge´s paper from 1984 and
the congavoid paper (obviously in some newer version where rto is
set to mean+4*variation instead of mean+2*variation, but this does not
matter for the discussion).

For simplity, let´s ignore Karn´s algorithm and focus on the math work
for rto. We further assume the preconditions for Edge´s work
may hold. (In fact, we must have a very close look to this, but in this
discussion we assume they will hold.)

Then the difference is the choice of the variance estimator.
Is this correct?

Both, Edge and Jacobson/Karels as well, estimate the RTT mean wusing an
EWMA filter.

Edge estimates the variance using an EWMA filter as well, other than
Jacobson who uses an estimator which gives an estimation for
the variance and is easier to calculate.

What makes me curious about that is, that the rto given by edge
_essentialy_ relys on (an one tailed version of) Chebyshev´s inequality.

That´s why ranted the last days when it came to spurious timeouts. The
is much written about the multiplicative factor k
in rto=mean + k*variance. However, isntead of a qualitative "guess",
Edge´s paper derives an RTO which respects a _prescribed_ upper limit
for an "unwanted retransmission probality" AKA spurious timeout
probability.

When we consider Edge´s formula

RTO = mean + e [\sigma*(1-Y)/Y]^1/2

and set e to 1 (which is appropriate because "e" does not appear in the
derivation of the formula), than Y is the spurious timeout probability.
(Refer to Edge´s paper, formulae 20 ff. for details.)

Practically, this means that if we choose 

rto=man+2*var = mean + (4*\sigma^2)^1/2 

this means

4=(1-Y)/Y

hence 

Y = 1/5.

Hence, the spurious timeout probability has an upper limit of 1/5,
_independent_ from the actual RTT distribution.

As I said above, a newer version of the congavoid paper deals with k=4,
then the spurious timeout probability is 1/17.

This matter was even discussed in a paper by Leung, Klein, Mooney and
Haner who proposed to further increase the RTO to avoid
spurious timeouts. Of course, one can have a religious discussion here.
However, if we take Edge´s formula, we prescribe an
upper limit for the spurious timeout propability, e.g Y should be 1/63,
and then we have:

k=sqrt((1-y)/y) = sqrt((62/63) * 63) = sqrt(62) = ca. 7.87

Hence, we can define an upper limit for the spurious timeout probability
and derive the necessary k.

NB: We did not discuss delayed retransmissions here which may result
from a too large rto.

All this holds true in Edge´s paper, at least asymptotically.

However, the congavoid paper choses a different variance estimator,
which is easier to calculate.

Q1.: Is there evidence that the rto equations by Edge hold true?
Particularly: What is the precise relationship between the "mdev" used
in 
the congavoid paper and \sigma?

There are some vague remarks on this one. However, this is a central
issue as it directly affects the applicability of Edge´s formula.
The very strength of Edge´s formula is that we have a _generic_
estimation for the spurious timeout probability, which is especially 
indendendent of the actual RTT distribution.

Q2.: As we have more powerful computing machinery now than 1988, did
anybody think of using Edge´s orginal formulae again?

If we recall Edge´s formula, that´s why I talked from an "urban legend"
here recently as far as spurious timeouts are concerned:

There _are_ spurious timeouts. Anywhere, anytime. And assumed, Edge´s
formula where applicable, we can define an upper limit for the
spurious timeout probability and hence, spurious timeouts will not occur
unduly often. It doesn´t matter whether we run
TCP over Ethernet, GPRS or even with flying pigs.

Whether the assumptions for Edge´s formulae will hold on an arbitrary
network, is a different story.

Some problems are:

1. A few years ago, I read a paper that packet order distortions cannot
be neglected in the Internet. Thus, the Internet does no longer
performe "Sequencing Positive Acknowledgement Retnramission (SPAR)". I
did not yet udnerstand all the details, but Edge explicitely maks
use of the SPAR assumption several times in his paper. Thus, I´m not yet
convinced that his rationale will hold, if this assumption
is violated.

2. The EWMA estimators used by Edge require the observation variables to
be independent. In fact, RTT observations are gained from
ACK packets and these are send by the receiver as TCP packets arrive.
(Let´s ignore delayed ACK here.) Hence, the latency experienced
by a packet directly affects the sampling times of following RTT
samples. In ohter terms: The random variables used for rtt observation
directly affect each other and I´m not sure whether they are really
indpendent.

3. In addition to 2, we must review the weakly stationary assumption.
Basically, in Edge´s paper this assumption results
in convergence statemtens:
i) The mean estimator is in fact an asymptotic unbiased estimator for
mean.
ii) The variance estimator converges to var (t(n+1)-T(n)) where t(n+1)
is the n+1 th rtt observation and T(n) the n-th estimate for mean.
Both convergences hold for n->inf., i.e. "in the long run".

4. "in the long run": As we know from pratical statistics, in the
Internet short term flows are by far more often as long term flows.
In other terms: When the estimators "start to converge", the flow of
interest may be history.

In any case, the formulae given by Edge make it easier to do proper
analysis, as they explicitely state assumptions and preconditions.
At the moment, I do not quite see, under wich circumstances the rto
formula in the congavoid paper will hold.
Particularly unduly often spurious timeouts rise the question, whether
the problem is in fact the network technology in use,
or wether the problem is the violation of the requirements for the rto
estimator to work properly.

>From the aforementioned problem list, I conclude that there are a number
of vulnerabilities in the actual rto estimation scheme.
In addition, this is not only a problem for TCP, but for any protocol
which requires timeouts and must rely upon estimators for
mean and variance to obtain them.

Detlef

-- 
Detlef Bosau
Galileistrasse 30
70565 Stuttgart
Mail: detlef.bosau at web.de
Web: http://www.detlef-bosau.de
Mobile: +49 172 681 9937