[e2e] TCP implementations in various OS's
perfgeek at mac.com
Wed May 12 19:45:13 PDT 2010
On May 12, 2010, at 11:55 AM, Detlef Bosau wrote:
> rick jones wrote:
>> On May 12, 2010, at 2:30 AM, Detlef Bosau wrote:
>> I'm arriving late to the discussion - perhaps data centers and LANs
>> were not included in your set of terrestrial TCP sessions and I'm
>> but providing fodder for the "TCP as the one true protocol is bad"
>> school of thought, but it has been my experience thus far that over
>> a 10 Gbit/s Ethernet LAN, TCP needs 128KB or more of window to
>> achieve reasonable throughput.
> Is this due to the link lengths or due to huge interface buffers?
I believe it is the result of the basic bandwidth-delay product. 10
billion bits per second does not leave room for much delay; 40 or 100
billion bits per second will leave even less. To get 9 Gbit/s with a
65535-byte window, the RTT on the LAN must be less than 0.0000583
seconds - so anything more than about 58 microseconds and 65535 won't
cut it. And that RTT includes getting through the stack on the
transmitter, DMA into the NIC, getting through the NIC, toggling the
bits onto the fibre, getting through the switch(es) and any bit
toggling that entails on the inter-switch links, then bit toggling
across the last hop, through that NIC and up through that stack.
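The arithmetic above can be sketched in a few lines (the figures are the ones from the discussion, not new measurements):

```python
# Bandwidth-delay arithmetic: the longest RTT at which a given window
# can sustain a given rate is window_bits / rate_bps, and conversely
# the window needed for a rate and RTT is rate * RTT.

def rtt_budget(window_bytes, rate_bps):
    """Longest RTT (seconds) at which `window_bytes` sustains `rate_bps`."""
    return (window_bytes * 8) / rate_bps

def window_needed(rate_bps, rtt_s):
    """Window (bytes) needed to fill `rate_bps` over an RTT of `rtt_s`."""
    return rate_bps * rtt_s / 8

# The classic 65535-byte window at 9 Gbit/s:
print(f"{rtt_budget(65535, 9e9) * 1e6:.1f} microseconds")  # ~58.3 us

# Window needed for 10 Gbit/s at 1 ms of RTT:
print(f"{window_needed(10e9, 0.001) / 1024:.0f} KiB")
```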
>> Get much more than 1 ms of delay in the LAN or data center and
>> even that is insufficient.
> I left out the consideration, that we have to take into account the
> number of active flows.
> Using VJCC, any flow has a minimum window of 1 MSS. Actually, even 1
> MSS may not fit on a small link. Hence, we have to provide a certain
> minimum of queueing memory to make the system work with the actual
> number of flows being active.
> May this be the reason for the delays you mentioned?
In the tests I run, the only queueing is consumed by the data of the
connection itself. I'm talking about a single TCP flow, not even when
there are multiple flows attempting to go through a common path.
> Actually, I don't mind reasonable window scaling when there are
> sound reasons for it. Perhaps, the general term "misbehaved" is too
> strict and we should better encourage a reasonable usage of window
> scaling. Unfortunately, I read several discussions on this matter
> where window scaling was used or encouraged quite carelessly.
To be sure, I've seen some crazy things - like 10GbE NIC vendors
suggesting people set their TCP windows to 16 MB, but I do not see
that as condemning TCP window scaling in general.
I actually use a 1MB socket buffer in my 10GbE netperf tests.
The multiple results are generally when I am shifting the CPU affinity
of netperf/netserver around relative to the core taking interrupts
from the NIC. You will notice, if you scroll way over to the right,
how far Linux autotuning will take the TCP window/socket buffers -
e.g. Column I rows 6-8 for the local send socket buffer final and
Column R for the remote receive socket buffer final (4194304 was the
configured limit - the default in the kernel I was using). (Rows 6-8
were using autotuning; rows 12-14 were with explicit setsockopt()
calls.)
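The explicit-setsockopt alternative to autotuning looks roughly like the sketch below. Note an assumption about Linux behaviour worth flagging: once SO_SNDBUF/SO_RCVBUF is set explicitly, autotuning is disabled for that socket, and getsockopt() reports roughly double the requested value (the kernel's bookkeeping overhead).

```python
# Sketch: explicitly sizing TCP socket buffers, as in the setsockopt()
# netperf runs described above, versus leaving them to Linux autotuning.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Untouched, the kernel autotunes up to the tcp_rmem/tcp_wmem ceiling
# (4194304 bytes was the default limit in the kernel discussed above).
default_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Request 1 MB explicitly, as in the netperf tests; on Linux this pins
# the buffer and turns off autotuning for this socket.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)

print("default rcvbuf:", default_rcv)
print("sndbuf now:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()
```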
>> rick jones
>> Wisdom teeth are impacted, people are affected by the effects of
>> events.
> Lucky me, I've only two wisdom teeth left ;-)
I'm not sure if that leaves you one up or one down on me - I have just
the one :)
More information about the end2end-interest mailing list