[e2e] [tcpm] RTTM + timestamps

Mon Jan 17 05:48:22 PST 2011

Hi group,

Having received some feedback off-list, I would like to summarize what I have learned so far, and I invite everyone to discuss these points. There are benefits for the research community, both empirical and theoretical, as well as for mobile and high-speed network operators and implementers.

o) one-way delay based (and delay variation based) congestion controls would benefit from knowing the clock resolution on both sides. Some research in that area is done by Mirja Kuehlewind and Bob Briscoe (http://bobbriscoe.net/pubs.html#chirp_impl), as well as David Hayes (http://caia.swin.edu.au/reports/100219A/CAIA-TR-100219A.pdf)

o) RTT variance during loss episodes is not deeply researched. Current heuristics (RFC1122, RFC1323, Karn's algorithm, RFC2988) explicitly exclude (and prevent) the use of RTT samples when loss occurs. However, solving the retransmission ambiguity problem, and the related reliable ACK delivery problem, may allow further refinement of these algorithms, as well as enabling new research to distinguish between corruption loss (without RTT / one-way delay impact) and congestion loss (with RTT / one-way delay impact). This appears to be a rather neglected field, however, especially when it comes to large-scale investigations on the public internet. By their very nature, passive investigations without signals contained within the headers are of only limited use in empirical research here.

o) A side-effect of Van Jacobson's algorithm is that RTO spikes when the path RTT suddenly drops. With the decrease of the path RTT, the variance grows. As the variance has a large effect on the calculated RTO, this leads to potentially very lengthy timeouts even though the RTT is much shorter. This particular problem has been addressed in some stacks, and the lessons learned from the deployment there could be used to update the RTO calculation specs. 

o) The enabling of spurious RTO detection (Eifel, D-SACK, F-RTO) in recent years also makes it possible to dynamically identify instances where the RTT estimation or RTO calculation was misled, allowing a more conservative algorithm to be used for certain paths / times.

o) Retransmission ambiguity detection during loss recovery would allow an additional level of loss recovery control without reverting to timer-based methods. As with the deployment of SACK, the separation of "what" to send from "when" to send it could be taken one step further. In particular, less conservative loss recovery schemes that do not trade principles of packet conservation against timeliness require prompt, reliable and best-possible feedback from the receiver about every delivered segment and its ordering. SACK alone goes quite a long way, but using Timestamp information in addition could remove any ambiguity. However, the current specs in RFC1323 make that use impossible, so modified signaling (receiver behavior) is a necessity.
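
The RTO spike described in the third point above can be reproduced with a few lines. This is only a minimal sketch of the RFC2988 estimator, not taken from any particular stack: the clock granularity G, the sample values and the function name rto_trace are assumptions chosen for illustration, and the 1 second minimum RTO of RFC2988 is deliberately left out so that the effect is visible.

```python
# Sketch of the RFC 2988 estimator; sample values and G are assumptions.
ALPHA = 1 / 8   # SRTT gain (RFC 2988)
BETA = 1 / 4    # RTTVAR gain (RFC 2988)
G = 0.001       # assumed clock granularity in seconds


def rto_trace(samples):
    """Return the RTO (seconds) computed after each RTT sample."""
    srtt = rttvar = None
    trace = []
    for r in samples:
        if srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2
            srtt, rttvar = r, r / 2
        else:
            # RTTVAR is updated with the old SRTT, then SRTT itself
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        # The 1 s lower bound of RFC 2988 is omitted on purpose
        trace.append(srtt + max(G, 4 * rttvar))
    return trace


# 50 samples at 200 ms, then the path RTT drops to 20 ms
trace = rto_trace([0.200] * 50 + [0.020] * 5)
rto_before, rto_after = trace[49], trace[50]
print(f"RTO before drop: {rto_before:.3f} s, after first short sample: {rto_after:.3f} s")
```

With these inputs the RTO computed after the first short sample is larger than the one before the drop, because the variance term grows faster than the smoothed RTT shrinks.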

The first central aspect of the points above is to resolve the retransmission ambiguity; the second is to give each end host a better understanding of its partner's behavior.

I strongly believe that much of the retransmission ambiguity can be resolved by exploiting synergistic signaling between the TCP SACK option and the Timestamp option. In particular, the receiver-side state required by RFC1323 to choose which Timestamp to reflect when a non-contiguous segment is received could be reduced when the TCP session is also using SACK. (The presence of a SACK field indicates some out-of-sequence delivery. The current stipulations were made, afaik, to ensure that no unduly small RTT sample is entered into the RTO calculation when ACK loss occurs. But again, SACK disambiguates between an in-sequence ACK and a duplicate ACK.)

For the second aspect, the single most important item to convey would be the TCP clock resolution. No signaling exists for this yet. However, there appears to be a backward-compatible method to extend the Timestamp option in the SYN to include this signal.

According to RFC1323, TSecr is unused in the SYN segment, and most stacks appear to ignore that value. An enhanced signaling could set this initial TSecr field. To convey some feedback from the receiver to the sender in the SYN|ACK, while maintaining compatibility with stacks that do not evaluate the contents of this SYN TSecr, a simple XOR between the TSopt field of the SYN and the client's respective settings could be employed.

The sender would compare the TSecr value in the SYN|ACK with its initially sent TSval and, if they are identical, conclude that it is dealing with an unenhanced client (no SACK+TS synergy, no one-way delay/variance CC). If the TSecr in the SYN|ACK differs, the negotiated heuristics can be enabled for that TCP session, decoded by an XOR with the original TSval.
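
The comparison described above could look roughly like the following. This is a hedged sketch only: the function names synack_tsecr and decode_synack and the idea of a single 32-bit "settings" word are assumptions made for illustration, not part of RFC1323 or of any existing stack.

```python
MASK32 = 0xFFFFFFFF


def synack_tsecr(syn_tsval, agreed_settings=None):
    """TSecr that an (assumed) responder places in the SYN|ACK.

    A legacy stack simply echoes the TSval of the SYN; an enhanced one
    XORs the agreed 32-bit settings word into it.
    """
    if agreed_settings is None:        # legacy responder
        return syn_tsval & MASK32
    return (syn_tsval ^ agreed_settings) & MASK32


def decode_synack(syn_tsval, tsecr):
    """Return the agreed settings, or None for an unenhanced peer."""
    settings = (syn_tsval ^ tsecr) & MASK32
    return settings or None


tsval = 0x1234ABCD                     # arbitrary example value
print(decode_synack(tsval, synack_tsecr(tsval)))
print(hex(decode_synack(tsval, synack_tsecr(tsval, 0x90003C00))))
```

Note that an all-zero settings word would be indistinguishable from a plain echo, which is one reason to have at least one bit that is always set.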

SYN:
+-------+-------+---------------------+---------------------+
|Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
+-------+-------+---------------------+---------------------+
    1       1              4                     4
TSecr could contain, e.g., a 16-bit flag field plus a 16-bit IEEE float indicating the TCP clock resolution (duration or frequency), or an 8-bit flag field plus two 12-bit floats for negotiation of an acceptable clock range. Also, perhaps a CIDR-like prefix bitstring may be of interest for future extensions, e.g.:

1 1 x x x - future
1 0 1 x x - use 2x12 bit floats
1 0 0 1 x - use 1x16 bit float

A few potentially meaningful flag bits discussed so far could be

TS-extension flag (always set; this limits the sender's initial TS selection...)
  Reserved
    TS-negotiate range (w/ 2x 12bit floats)
    TS-fixed rate (w/ 16bit float)
      Modified exponent offset (-21 instead of -15, for very high speed networks)
        SACK+TS synergy (always reflect TS of last received segment)
          High-Precision (use HW late in data path for TS insertion, exclude much jitter)
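
Read together with the prefix table, the first four flag bits appear to select the layout of the remaining bits. A possible decoder is sketched below. It assumes that the flags occupy the most significant bits of the 32-bit TSecr word in the order listed; the bit positions, the function name ts_layout and the returned labels are all hypothetical.

```python
def ts_layout(word):
    """Classify a 32-bit extended TSecr word by its leading flag bits."""
    top = (word >> 28) & 0xF          # first four bits, assumed MSB-first
    if not top & 0b1000:
        return "no TS extension"      # TS-extension flag not set
    if top & 0b0100:
        return "future"               # prefix 1 1 x x x
    if top & 0b0010:
        return "2x12-bit floats"      # prefix 1 0 1 x x
    if top & 0b0001:
        return "1x16-bit float"       # prefix 1 0 0 1 x
    return "unspecified"              # prefix 1 0 0 0 x is not in the table


print(ts_layout(0b1001 << 28))
```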

Note that binary16 allows a dynamic range (with some lost precision) from 2^15 down to 2^-24, while a 12-bit representation, omitting the sign bit, the most significant exponent bit and the two least significant fraction bits, would still allow a range from 2^0 down to 2^-22.

Binary16:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|Si|   Exponent   |          Fraction           |
|gn|offset shifted|                             |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

12-bit subsample:
      +--+--+--+--+--+--+--+--+--+--+--+--+
      |  exponent |        fraction       |
      +--+--+--+--+--+--+--+--+--+--+--+--+
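
To make the two encodings concrete, here is a small sketch that derives the 12-bit subsample from a binary16 value by dropping the sign bit, the most significant exponent bit and the two least significant fraction bits. It assumes the value is a duration in seconds and that the dropped fraction bits are truncated rather than rounded; the helper names are made up for this example.

```python
import struct


def to_binary16(seconds):
    """Bit pattern of `seconds` as an IEEE 754 binary16 value."""
    return struct.unpack(">H", struct.pack(">e", seconds))[0]


def from_binary16(bits):
    """Value of a 16-bit IEEE 754 binary16 bit pattern."""
    return struct.unpack(">e", struct.pack(">H", bits))[0]


def to_12bit(seconds):
    """Drop sign, top exponent bit and the two lowest fraction bits."""
    h = to_binary16(seconds)
    if h & 0xC000:
        raise ValueError("needs the sign or top exponent bit; not representable")
    return h >> 2                     # truncates, does not round


def from_12bit(bits):
    """Expand a 12-bit subsample back into its binary16 value."""
    return from_binary16((bits & 0xFFF) << 2)


for resolution in (1.0, 0.001, 2.0 ** -22):
    code = to_12bit(resolution)
    print(f"{resolution!r} -> 0x{code:03x} -> {from_12bit(code)!r}")
```

Values of 2 or more, and negative values, need one of the omitted bits and are rejected in this sketch.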

However, even this dynamic range may not be enough to allow a TCP clock to run at the sending rate of minimal IP frames at very high speeds (100 Gbit/s, 1 Tbit/s). Using a different exponent offset shift, and omitting one additional fraction bit instead of one exponent bit in the 12-bit value, should be discussed.
   +--+--+--+--+--+--+--+--+--+--+--+--+
   |   exponent   |        fraction    |
   +--+--+--+--+--+--+--+--+--+--+--+--+
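
For a rough feeling of the magnitudes involved, the per-frame interval at these speeds can be estimated as follows. The sketch assumes a minimum-size Ethernet frame occupying 84 bytes on the wire (64-byte frame, 8-byte preamble, 12-byte inter-frame gap); other link layers will give different numbers.

```python
import math

# Assumption: a minimum-size Ethernet frame occupies 84 bytes on the wire
# (64-byte frame + 8-byte preamble + 12-byte inter-frame gap).
WIRE_BITS = 84 * 8


def min_frame_time(bits_per_second):
    """Seconds between back-to-back minimum-size frames at line rate."""
    return WIRE_BITS / bits_per_second


for name, rate in (("100 Gbit/s", 100e9), ("1 Tbit/s", 1e12)):
    t = min_frame_time(rate)
    print(f"{name}: {t * 1e9:.2f} ns per frame, about 2^{math.log2(t):.1f} s")
```

Under these assumptions both intervals come out shorter than 2^-24 seconds, the smallest value given above for binary16 with the standard exponent offset.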

From: Scheffenegger, Richard 
Sent: Monday, 10 January 2011 19:25
To: tcpm at ietf.org; iccrg at cs.ucl.ac.uk
Subject: [tcpm] RTTM + timestamps

Hi group,
in order not to spam the whole group, I would like to learn who is interested in the heuristics and signaling around round-trip measurement, timestamps and possible synergies between TS and other options, as well as some empirical measurements around currently deployed TCP stacks that diverge from the IETF specs in these aspects.
In the light of some recent developments (e.g. LEDBAT, path-chirp/chirp CC) and more efficient (end-of-flow / non-congestion) loss recovery, improvements in that space may be of interest to a larger group of implementers too...
With your feedback, I would like to learn the extent of possible interest, before going further with my work in that area!
Best regards,
Richard Scheffenegger