[e2e] TCP in outer space

J. Noel Chiappa jnc at ginger.lcs.mit.edu
Sat Apr 14 13:18:46 PDT 2001


    > From: Ted Faber <faber at ISI.EDU>

    >> The early Internet *did* try and *explicitly* signal drops due to
    >> congestion ... that's what the ICMP "Source Quench" message
 
    > I was thinking of more network specific ways to signal the event.

Well, that wouldn't have been workable if the congestion point was on a
different network from the source of the traffic, no? (And I hope we're not
falling into terminology confusion over what you mean by "network specific" -
I'm assuming it means "specific to a particular link technology").


    > My point was that the brain power seems to have made some a priori
    > decisions about how to wade into the design space. One of those was to
    > favor robustness to efficiency. ... But I only know what I read in the
    > papers; you were there, and if you tell me that I'm misunderstanding
    > I'm probably misunderstanding.

No, you're quite right, that was a stated tradeoff.

But there's a common failure mode in systems built with that philosophy, one
I personally nickname the "Multics failing disk drive" problem - and the
Internet displayed it in spades! Briefly, the problem is that when you build
robust systems, sometimes the robustness prevents you from seeing real
problems until they become really serious. (Some people will no doubt claim
this is a feature, not a bug! :-)

In the case above, the Multics disk driver code tried really hard to
read/write blocks, with retry strategies, and that robustness hid a failing
disk drive until the overall system performance went into the proverbial
handbasket. However, the problem is that robustness mechanisms can hide not
only failing equipment, but also *design* failures - and that's a much more
serious problem.

Thus, in a number of places in the Internet stack (e.g. Sorcerer's Apprentice
Syndrome, as well as the TCP congestive collapse of the early 80's), the
robustness hid design problems until they became so massive they severely
impacted performance.

The moral of the stories is clear: robustness is fine, but you need counters
to see how often you're triggering the robustness mechanisms, because if
you're triggering them too often, that's a problem (for all the obvious
reasons - including that the recovery path is generally less efficient than
the normal one, so you shouldn't be relying on it too much).


    > That implies that once congestion systems were sufficient to keep the
    > net running, that prioritization urged brain power to be spent on
    > robustness and swallowing new network technologies rather than tuning
    > the congestion control for optimal efficiency.

Well, actually, it was more like "since the original TCP congestion control
worked OK to begin with, we moved our limited resources toward other things
like getting applications up, and dealing with some of the initial scaling
problems, such as needing to deploy A/B/C addresses, subnets, replacing GGP
with EGP, getting rid of HOSTS.TXT in favor of DNS, etc, etc, etc, etc"!

    > Unless you give ICMP packets preferential treatment with respect to
    > drops, I think it's tough to characterize the relative reliability of
    > the signals. ... Even in the absence of other shortcomings, I don't
    > think I'd believe an assertion about the relative reliabilities without
    > a study or three.

Excellent point. In large complex systems, theory has to take a back seat to
empirical data. It's almost a given that a large system is going to behave in
ways that all of Mensa couldn't predict.

	Noel


