[e2e] TCP implementations in various OS's

rick jones perfgeek at mac.com
Wed May 12 19:45:13 PDT 2010

On May 12, 2010, at 11:55 AM, Detlef Bosau wrote:

> rick jones wrote:
>> On May 12, 2010, at 2:30 AM, Detlef Bosau wrote:
>> I'm arriving late to the discussion - perhaps data centers and LANs
>> were not included in your set of terrestrial TCP sessions and I'm
>> merely providing fodder for the "TCP as the one true protocol is bad"
>> school of thought, but it has been my experience thus far that over
>> a 10 Gbit/s Ethernet LAN, TCP needs 128KB or more of window to
>> achieve reasonable throughput.
> Is this due to the link lengths or due to huge interface buffers?

I believe it is the result of the basic bandwidth-delay product.  10
billion bits per second does not leave room for much delay, and 40 or
100 billion bits per second will leave even less.  To get 9 Gbit/s with
a 65535-byte window, the RTT on the LAN must be less than about
0.0000583 seconds - so anything more than roughly 58 microseconds and
65535 won't cut it.  And that RTT includes getting through the stack on
the transmitter, DMA into the NIC, getting through the NIC, toggling
the bits onto the fibre, getting through the switch(es) and any bit
toggling that entails on the inter-switch links, then bit toggling
across the last hop, through the receiving NIC and up through that
stack.
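The bandwidth-delay arithmetic above can be sketched in a few lines of
Python (the numbers are just the illustrative ones from this thread):

```python
# Bandwidth-delay-product arithmetic: a window can sustain a given
# throughput only if it covers the pipe, i.e. window >= rate * RTT.

def max_rtt_for_throughput(window_bytes: int, throughput_bps: float) -> float:
    """Largest RTT (seconds) at which window_bytes sustains throughput_bps."""
    return window_bytes * 8 / throughput_bps

def window_for_bdp(throughput_bps: float, rtt_s: float) -> int:
    """Window (bytes) needed to keep the pipe full: the bandwidth-delay product."""
    return int(throughput_bps * rtt_s / 8)

# 9 Gbit/s with the unscaled 65535-byte maximum TCP window:
rtt = max_rtt_for_throughput(65535, 9e9)
print(f"{rtt * 1e6:.1f} us")        # ~58.3 microseconds

# Conversely, 1 ms of RTT at 10 Gbit/s needs:
print(window_for_bdp(10e9, 1e-3))   # 1250000 bytes, i.e. ~1.25 MB of window
```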

>>  Get much more than 1 ms of delay in the LAN or data center and  
>> even that is insufficient.
> I left out the consideration, that we have to take into account the  
> number of active flows.
> Using VJCC, any flow has a minimum window of 1 MSS. Actually, even 1  
> MSS may not fit on a small link. Hence, we have to provide a certain  
> minimum of queueing memory to make the system work with the actual  
> number of flows being active.

> May this be the reason for the delays you mentioned?

In the tests I run, the only queueing is that consumed by the data of
the connection itself.  I'm talking about a single TCP flow, not
multiple flows attempting to share a common path.

> Actually, I don't mind reasonable window scaling when there are  
> sound reasons for it. Perhaps, the general term "misbehaved" is too  
> strict and we should better encourage a reasonable usage of window  
> scaling. Unfortunately, I read several discussions on this matter  
> where window scaling was used or encouraged quite carelessly.

To be sure, I've seen some crazy things - like 10GbE NIC vendors  
suggesting people set their TCP windows to 16 MB, but I do not see  
that as condemning TCP window scaling in general.
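For concreteness, RFC 1323 window scaling works by left-shifting the
16-bit advertised window by a shift count negotiated at connection
setup; a small sketch of the smallest shift needed for a given window
(the 16 MB figure is the vendor suggestion above, taken as 16 MiB for
illustration):

```python
# RFC 1323: advertised window = 16-bit field << scale; scale is capped at 14.
def min_window_scale(window_bytes: int) -> int:
    """Smallest window-scale shift able to advertise window_bytes."""
    shift = 0
    while (65535 << shift) < window_bytes and shift < 14:
        shift += 1
    return shift

print(min_window_scale(65535))             # 0 - no scaling needed
print(min_window_scale(16 * 1024 * 1024))  # 9 - 65535 << 8 falls 256 bytes short
```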

I actually use a 1MB socket buffer in my 10GbE netperf tests - for  
example, in:


The multiple results are generally from shifting the CPU affinity of
netperf/netserver around relative to the core taking interrupts from
the NIC.  You will notice, if you scroll way over to the right, how far
Linux autotuning will take the TCP window/socket buffers - e.g., Column
I, rows 6-8, for the final local send socket buffer, and Column R for
the final remote receive socket buffer.  (4194304 was the configured
limit - the default in the kernel I was using; rows 6-8 used
autotuning, while rows 12-14 used explicit setsockopt() calls on the
socket buffers.)
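For readers who haven't done it, the explicit setsockopt() approach
mentioned above looks roughly like this (a sketch, not netperf's actual
code; the 1 MB value mirrors the tests above, not a general
recommendation):

```python
import socket

# SO_SNDBUF/SO_RCVBUF should be set before connect()/listen() so the
# size can influence TCP's window negotiation.  Note that an explicit
# setsockopt() on Linux also disables receive-buffer autotuning for
# that socket.
BUF_SIZE = 1 << 20  # 1 MB, as in the netperf runs above

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

# Linux reports back double the requested value (it accounts for
# bookkeeping overhead), so read the effective size rather than
# assuming it equals what was requested.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()
```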

>> rick jones
>> Wisdom teeth are impacted, people are affected by the effects of  
>> events
> Lucky me, I've only two wisdom teeth left ;-)

I'm not sure if that leaves you one up or one down on me - I have just  
the one :)
