TCP Chorusing in the Windows 9x TCP/IP Stack
Abstract
Microsoft Windows 95 and 98 clients can bind multiple TCP/IP stacks to the
same MAC address, simply by having the protocol added more than once in the
Network control panel. This is actually quite useful, except that these
stacks can run concurrently on the same IP address, even if they receive
that address through BOOTP/DHCP. The effect of the bug is that every
ACKnowledgement packet is sent once per loaded and bound TCP/IP stack,
creating excessive and significant network noise and collisions. At least
one Samba 2.0.0beta1 server on an affected subnet can become completely
inaccessible when one of these machines starts misbehaving.
Redundant ACKing can be referred to as TCP Chorusing, due to the minor time
delays introduced between multiple copies of identical data. The problem is
undetectable using the Ping command built into Windows 95 or 98, which is a
significant bug in and of itself. Linux’s ping is not similarly crippled.
NT’s Ping command does not detect TCP Chorusing either.
Introduction: Discovery
A word of warning: This is the first edition of this document and there are
bound to be errors. My ego isn’t so fragile as to be bothered if I made a
misstatement of fact when writing this. Just tell me.
My university possesses a generally excellent network, but on occasion
certain dorms would grind to a halt for no apparent reason. Seeking
answers, I used a Windows-based pinger to see if there were correlations
between network downtimes and the presence of specific IPs on a specific
subnet. We use essentially static IPs distributed from a DHCP server: a
cookie seems to be assigned to a given MAC address on its first request for
an IP, and every later request from that MAC is given the same IP. Nothing
out of the ordinary was found using the Windows pingers, so I decided I’d
automate the testing process over time using an excellent Linux tool
called fping. (In another environment, I might have merely put up a
sniffer, but the secure hubs and my lack of permission to modify them in
any way prevented that possibility.)
Very quickly, I noticed some very strange entries in the fping logs (IPs
changed):
    10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
    10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
    10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
    10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms
I thought there might be a bug in fping, so I pinged the offending machines
from Windows 98:
    C:\WINDOWS>ping -f 10.0.9.42

    Pinging 10.0.9.42 with 32 bytes of data:

    Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=3ms TTL=126

    Ping statistics for 10.0.9.42:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 3ms, Maximum = 4ms, Average = 3ms
Confusingly, everything seemed normal from here. Then I tried the Linux
ping command.
    effugas@doxpara:~> ping 10.0.9.42
    PING 10.0.9.42 (10.0.9.42): 56 data bytes
    64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
    64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
    64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
    64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
    64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)

    --- 10.0.9.42 ping statistics ---
    4 packets transmitted, 4 packets received, +4 duplicates, 0% packet loss
    round-trip min/avg/max = 3.3/8.6/15.4 ms
This was disturbing, especially since there was a very high correlation
between subnets experiencing high collisions and slow networks and the
number of TCP-Chorusing machines on that subnet. What was causing this?
The first step was to hunt down the machines exhibiting the bug and do a
little exploratory surgery. It didn’t take much deduction once I got access
to a few of the affected machines to realize that there were the same number
of extra TCP/IP stacks bound to the main adapter as there were extra pings.
Still, just because there were machines with extra stacks didn’t mean it
wasn’t the NIC’s fault: a bug in the NIC installer could have created the
extra TCP/IP entries. And what about wiring? All of these machines were
exhibiting these reactions on a rather non-standard “secure hub”. Perhaps
that was the cause of the stacks reacting so strangely?
Further investigation did shed some light. The automated installation
routines are suspect, since they’re the routines that most commonly add the
stacks. Any card, though, from generic Linksys boards to an Intel 8255x
10/100 board to the entire bevy of PCI and PCMCIA cards that 3Com offers,
will show this behavior once additional TCP/IP stacks are added to it.
While I have only seen 3Com cards suffering from unintentional TCP
Chorusing, this is probably because of the 90%+ market share 3Com enjoys on
campus and not because of a flaw in their drivers. It’s quite likely that,
since students and not staff install network drivers on campus, this is
more of a wetware problem: the student does whatever he or she can to “just
make it work like the directions say”, and if adding TCP/IP multiple times
happens to “Just Work”, so be it.
Before I could be sure that this was the problem, though, I needed to
isolate a computer from the University network first. I used my dorm room
100baseT internal network to do so. The following tcpdump is from a single
character typed from the chorusing machine into the telnet port of the Linux
machine:
    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
        [Initial Keypress]
    11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
        [Pressed key is echoed from the Linux machine to be displayed on the
        Windows box.]
    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
        [Windows machine acknowledges receipt of data signifying what
        character it should display.]
    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
        [Windows machine again acknowledges receipt. This is the "chorus".]
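Spotting the repeated ACK by eye is easy in a four-line trace but tedious in
a long one. The following is a minimal sketch, not a definitive tool, that
scans tcpdump text in the format shown above and reports pure ACKs that
appear more than once. The function name and the regular expression are
illustrative assumptions. It also assumes that chorused ACKs share a
timestamp, as they do in this trace; on a capture with a finer clock the
timestamp would have to be dropped from the key. Ordinary TCP can emit
duplicate ACKs of its own after packet loss, so a hit is a hint rather than
proof.

```python
import re
from collections import Counter

# Matches old-style tcpdump text lines such as:
# "11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)"
LINE_RE = re.compile(
    r"^(?P<ts>\d\d:\d\d:\d\d\.\d+)\s+"
    r"(?P<src>\S+) > (?P<dst>\S+): "
    r"(?P<flags>\S+) (?P<rest>.*)$"
)


def find_chorused_acks(lines):
    """Return {(ts, src, dst, rest): copies} for pure ACKs seen more than once."""
    seen = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Skip annotations, blank lines, and segments carrying data or flags;
        # tcpdump prints "." in the flags column for a bare ACK.
        if not m or m.group("flags") != ".":
            continue
        key = (m.group("ts"), m.group("src"), m.group("dst"), m.group("rest"))
        seen[key] += 1
    return {key: copies for key, copies in seen.items() if copies > 1}
```

Fed the four packet lines of the trace above, it returns a single entry for
the ACK from 10.0.6.195.1043 with a copy count of 2.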
There’s most probably no limit to the number of extra ACKs: if I had ten
TCP/IP stacks, I’d have nine duplicate packets, as far as I can tell.
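If each bound stack answers once, the number of replies to a single probe
gives a rough estimate of how many stacks a machine has loaded. The sketch
below assumes Linux ping output in the format shown earlier, where each
reply line carries an icmp_seq field; the function names are illustrative.
It takes the maximum over all probes rather than the average, since a probe
may occasionally draw fewer replies than there are stacks.

```python
import re
from collections import Counter

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def replies_per_probe(ping_output):
    """Count how many reply lines were printed for each icmp_seq value."""
    counts = Counter()
    for line in ping_output.splitlines():
        m = SEQ_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


def estimated_stack_count(ping_output):
    """Largest number of replies seen for any one probe (0 if none)."""
    counts = replies_per_probe(ping_output)
    return max(counts.values(), default=0)
```

For the Linux ping session shown earlier, every probe drew two replies, so
the estimate is two stacks: the original plus one extra.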
A final note: I have thus far been able to locate the bug in Windows 98 and
Windows 95 OSR2. The original version of Windows 95 was simply unavailable
for testing, but I would appreciate an email confirming whether the bug
goes back that far.
Impact of Bug
The impact of TCP Chorusing, barring significant installer bugs (possible),
is directly related to the extent to which non-technical users have
installed network hardware and software. With millions of computers hooked
up to college dormitories, it would be quite myopic to dismiss this
population as minimal. Still, the relatively small percentage of affected
machines on campus here (maybe 1%) suggests the problem isn’t too
prevalent.
It should be noted, though, that TCP Chorusing can wreak havoc. It only
takes one, possibly two TCP Chorusers on the same subnet as my Samba
2.0.0beta1 server to render it inaccessible to any machine on the affected
subnet, not just the choruser. Logs appear to show that the Windows
machines time out attempting to connect to the server, but the problem is
quite difficult to debug because it occurs only moderately often: the
server only occasionally becomes disabled by the Chorusers. Attempts to
connect to other 9x machines on the affected subnet still remain
successful, however. There were no NT machines available to test with, but
they’d probably remain intact as long as the overall network remained
usable.
However, four TCP Chorusers on a moderately active network will render it
unusable. My theory, completely unsubstantiated by observation (I do not
possess local sniff capability), is that since additional packets are being
sent out by the interface (one per stack per packet to acknowledge), and
since these additional packets are sent out nearly simultaneously, they’re
much more likely to cause collisions than randomly distributed packets.
Since the ACK packets are generated simultaneously and pretty much must be
delivered (lest a new packet come to replace them), they stream themselves
over the line as soon as an incoming packet arrives, possibly colliding
with packets from other nodes, possibly even colliding with other incoming
packets from the same host. Empirical evidence shows that a network becomes
nearly unusable with four TCP Chorusers on it. Assuming 1.5 duplicate
packets per node, that’s 4 x (1 + 1.5) = 10 ACKs for each packet being sent
down the line if all four of them are simultaneously attempting to
acknowledge received data. And, yes, these ACKs can and do collide with
multiple ACKs from other hosts, leading to a reverberating feedback effect.
That significantly overloads the backoff system, and the ether becomes
unusable. That’s my theory, for now.
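The ACK count in that estimate can be worked out for other mixes of
machines. This is only a sketch of the arithmetic: it assumes each choruser
sends its one legitimate ACK plus some average number of duplicates, and
the function name is illustrative.

```python
def acks_per_packet(chorusers, avg_duplicates_per_node):
    """ACKs on the wire per acknowledged packet when every choruser responds.

    Each node sends one legitimate ACK plus its duplicates.
    """
    return chorusers * (1 + avg_duplicates_per_node)


# Four chorusers averaging 1.5 duplicates each, as in the estimate above.
print(acks_per_packet(4, 1.5))  # 10.0
```

By the same arithmetic, a single machine with ten bound stacks (nine
duplicates) would match those four chorusers on its own.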
It’s unknown at this time whether or not this bug affects modem users.
If it does, it’s an extremely significant bug, considering the reduced
bandwidth of SLIP/PPP.
Extra TCP/IP stacks listed in the registry but not active under Network
Neighborhood do not appear to generate duplicate packets. The probability
for a machine sending out duplicate packets to be suffering from multiple
active TCP/IP stacks is 100% within the observed sample set at Santa Clara
University.
One final note: when scanning for affected machines, be sure to deliver at
least two or three pings to each client. The bug is somewhat intermittent.
Solutions
Administrators should download and run fping to scan their networks for
machines that respond multiple times to pings. It doesn’t take much more
than removing the excess stacks and rebooting to make a chorusing machine
normal again. From the router/hub side, there’s not much that can be done,
since the right MAC is asking for the right IP: multiple times, yes, but
still validly. Some extra software on the router end might be able to help.
But to really fix this, MS needs to change the behavior of TCP/IP stacks.
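Sorting through a long fping log by hand gets old quickly. The following
sketch pulls the suspect hosts out of a saved log. It assumes the log lines
look like the "duplicate for" entries quoted earlier in this document; the
function name is illustrative, and other fping versions may word these
lines differently.

```python
import re
from collections import Counter

# Matches log lines such as:
# "10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms"
DUP_RE = re.compile(r"^(\S+)\s*: duplicate for \[(\d+)\]")


def chorusing_hosts(log_lines):
    """Return a Counter mapping each host to its number of duplicate replies."""
    hosts = Counter()
    for line in log_lines:
        m = DUP_RE.match(line.strip())
        if m:
            hosts[m.group(1)] += 1
    return hosts
```

Passing it the lines of a saved log (for example, an open file object)
yields one entry per suspect machine, which is the list of boxes to visit
and strip of extra stacks.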
Microsoft should not impose a one-TCP/IP-stack-per-device limit, since this
would reduce much of the effectiveness of Windows as a TCP/IP client.
Rather, a simple check should be instituted so that no two TCP/IP stacks
will attempt to provide services to the same IP.
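To make the proposed check concrete, here is a purely conceptual sketch. It
does not reflect how the Windows 9x networking code is organized; every
name in it is hypothetical. It models a shared table of claimed addresses,
where a stack must claim an IP before serving it and a second claim on the
same IP is refused.

```python
class DuplicateAddressError(Exception):
    """Raised when a second stack tries to serve an IP that is already claimed."""


class StackRegistry:
    """Hypothetical table of which stack currently serves which IP."""

    def __init__(self):
        self._owner_by_ip = {}

    def claim(self, stack_id, ip):
        """Record that stack_id serves ip, refusing a conflicting claim."""
        owner = self._owner_by_ip.get(ip)
        if owner is not None and owner != stack_id:
            raise DuplicateAddressError(
                f"{ip} is already served by stack {owner!r}"
            )
        self._owner_by_ip[ip] = stack_id

    def release(self, stack_id, ip):
        """Free ip, but only if stack_id is the stack that holds it."""
        if self._owner_by_ip.get(ip) == stack_id:
            del self._owner_by_ip[ip]
```

Under this scheme several stacks can still be bound to one adapter, each
with its own address; only a second claim on an address already in use is
turned away.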