[e2e] Re: cleaning up TIME_WAIT states

Mon Jun 11 14:43:24 PDT 2001

First, end2end-interest has moved to postel.org;
there is no list at the other address.

That said...

Zhan Philip wrote:
> 
> I am sorry that my last message had the wrong title.
> 
> I have two almost identical systems (Solaris 2.6 with
> the same HW and the same settings) and running an
> identical application (SunLink SNA gateway).  The
> first one has the same amount of traffic as the second
> one but much more TCPs in TIME_WAIT state (1500 vs
> 50).
> The first one was not stable (very slow or hang) in
> the past two weeks but the second one has been very
> stable.
> 
> After I changed the tcp_fin_wait_2_flush_interval
> value from 675000ms to 30000ms and
> tcp_keepalive_interval value from 2hrs to 60000ms, the
> first one is running OK now. My questions are:
> 
> (1)   Why is there such a big difference in the number of
> TCPs in TIME_WAIT state (1500 vs 50) between the two
> identical systems with the same traffic?
> (2)   Do those TCPs in TIME_WAIT state contribute to
> the file descriptor of the server (the applications)?

Some of this is described in the following paper:

"The TIME-WAIT state in TCP and Its Effect on Busy Servers,"
    Theodore Faber, Joe Touch, and Wei Yue, Proc. Infocom '99,
    pp. 1573-1583.
    http://www.isi.edu/touch/pubs/infocomm99/

Regarding your specific questions:

1) The changes in the timer values are more than sufficient
   to create the changes you describe.

2) I'm not sure what you're asking.

The stability of the system can be related to the
number of TCPs in TIME_WAIT when kernel memory
is limited. A large number of TIME_WAIT entries can
also slow connection processing, depending on the
implementation. If the kernel is memory limited, new
connections can be rejected until memory is freed by
connections leaving TIME_WAIT, which in turn can
result in the stalls seen. All this is covered in the
above paper and its references.
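
To compare the two machines, it may help to tally
connection states from saved netstat output on each. The
following is a minimal Python sketch, not a tested tool.
It assumes the state name is the last whitespace-separated
field on each TCP line, which should be checked against
your system's netstat layout. The function name and the
sample lines are made up for illustration.

```python
from collections import Counter

# State names as commonly printed by netstat (assumption: the state is
# the last whitespace-separated field on each connection line).
TCP_STATES = {
    "ESTABLISHED", "SYN_SENT", "SYN_RCVD", "SYN_RECEIVED",
    "FIN_WAIT_1", "FIN_WAIT_2", "TIME_WAIT", "CLOSE_WAIT",
    "CLOSING", "LAST_ACK", "LISTEN", "CLOSED", "IDLE", "BOUND",
}


def count_tcp_states(netstat_text):
    """Tally TCP connection states found in netstat-style text."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and blank lines: only count known state names.
        if fields and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts


# Made-up sample lines, not real output from either system.
SAMPLE = """\
10.0.0.1.1025   10.0.0.2.23   8760  0  8760  0 ESTABLISHED
10.0.0.1.1026   10.0.0.2.23   8760  0  8760  0 TIME_WAIT
10.0.0.1.1027   10.0.0.2.23   8760  0  8760  0 TIME_WAIT
"""

if __name__ == "__main__":
    for state, n in sorted(count_tcp_states(SAMPLE).items()):
        print(f"{state:12s} {n}")
```

Running the same tally on both systems at intervals would
show whether the TIME_WAIT count tracks the slow periods.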

I'm not clear on what tcp_fin_wait_2_flush_interval
does on Solaris. TCP's timeouts are based on 2MSL,
which is 120 seconds in the current Internet. I am
suspicious of changing this parameter from 675 seconds
down to 30 seconds (below 2MSL, in particular). If
it does what its name says, it should be some
multiple of the 2MSL value.

As to the tcp_keepalive_interval, that too should be
at least some multiple of the 2MSL. 60 seconds
seems too short.
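
To make the comparison concrete, here is a small Python
sketch that expresses each setting from the original
message as a multiple of 2MSL. It takes 2MSL to be the
120 seconds stated above; the helper name is hypothetical,
and the "old" keepalive value is simply 2 hours converted
to milliseconds.

```python
TWO_MSL_MS = 120_000  # 2MSL as stated in this message (120 seconds)


def ratio_to_2msl(value_ms, two_msl_ms=TWO_MSL_MS):
    """Return a timer value expressed as a multiple of 2MSL."""
    return value_ms / two_msl_ms


# Values quoted in the original message, in milliseconds.
SETTINGS_MS = {
    "tcp_fin_wait_2_flush_interval (old)": 675_000,
    "tcp_fin_wait_2_flush_interval (new)": 30_000,
    "tcp_keepalive_interval (old)": 2 * 60 * 60 * 1000,
    "tcp_keepalive_interval (new)": 60_000,
}

if __name__ == "__main__":
    for name, ms in SETTINGS_MS.items():
        ratio = ratio_to_2msl(ms)
        note = "below 2MSL" if ratio < 1 else "at or above 2MSL"
        print(f"{name}: {ratio:g} x 2MSL ({note})")
```

Under that assumption, both new values come out below
2MSL, which is the source of the concern above.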

--

Mostly, I'd encourage you to find out what "almost" means,
as in "almost identical". Kernel memory is clearly limited
in the first case; is it because of load, RAM limits, 
kernel configuration, or differences in the settings of
these variables to start with?
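
One way to check the last possibility is to dump the TCP
tunables from both machines and diff them. The Python
sketch below assumes you have already collected the values
into two dictionaries (for example by hand from ndd
output); the function name and the sample values are
hypothetical.

```python
def diff_settings(first, second):
    """Return {name: (first_value, second_value)} for settings that differ.

    A setting missing on one side is reported with None for that side.
    """
    diffs = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs


# Hypothetical values for illustration only, in milliseconds.
SYSTEM_1 = {
    "tcp_fin_wait_2_flush_interval": 675_000,
    "tcp_keepalive_interval": 7_200_000,
}
SYSTEM_2 = {
    "tcp_fin_wait_2_flush_interval": 675_000,
    "tcp_keepalive_interval": 60_000,
}

if __name__ == "__main__":
    for name, (a, b) in diff_settings(SYSTEM_1, SYSTEM_2).items():
        print(f"{name}: system 1 = {a}, system 2 = {b}")
```

An empty result would point away from the settings and
toward load, RAM, or kernel configuration instead.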

Joe