User:NThomas:CastroTimeouts
Investigating failures on talos-r3-fed-012
Contents
Build history
- May 06 21:35 thru 21:42 - successful - details
- May 06 22:00 thru 22:08 - exception - details
- May 06 22:15 thru 22:20 - exception - details
Logs
- slave clock is 2m 10s fast
- talos-master clock is right
slave 'console' log
From gnometerminal (via VNC):
notice: Starting catalog run notice: Finished catalog run in 3.22 seconds Following twistd.log until startup finished.. 2010/05/06 21:47 PDT [-] Log opened. 2010/05/06 21:47 PDT [-] twistd 2.4.0 (/home/cltbld/bin/python2.5 2.5.4) starting up 2010/05/06 21:47 PDT [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'> 2010/05/06 21:47 PDT [-] Loading buildbot.tac... 2010/05/06 21:47 PDT [-] Creating BuildSlave -- buildbot.version: 0.7.9 2010/05/06 21:47 PDT [-] Loaded. 2010/05/06 21:47 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c> 2010/05/06 21:47 PDT [Broker,client] message from master: attached The buildslave appears to have (re)started correctly.
So puppet probably only ran once, and the slave did not reboot after first purple build.
slave twistd.log
The interesting parts starting at the failure of the first purple build:
2010/05/06 22:05 PDT [Broker,client] ShellCommand._startCommand 2010/05/06 22:05 PDT [Broker,client] python mochitest/runtests.py --appname=firefox/firefox-bin --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=symbols --total-chunks=5 --this-chunk=1 --chunk-by-dir=4 2010/05/06 22:05 PDT [Broker,client] in dir /home/cltbld/talos-slave/mozilla-central-fedora-debug-u-mochitests-1/build (timeout 1200 secs) 2010/05/06 22:05 PDT [Broker,client] watching logfiles {} 2010/05/06 22:05 PDT [Broker,client] argv: ['python', <blah blah blah> 2010/05/06 22:05 PDT [Broker,client] closing stdin 2010/05/06 22:05 PDT [Broker,client] using PTY: True 2010/05/06 22:10 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate <repeats 14 times> 2010/05/06 22:10 PDT [Broker,client] lost remote <repeats many times at same timestamp> 2010/05/06 22:10 PDT [Broker,client] lost remote step 2010/05/06 22:10 PDT [Broker,client] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x9da528c> 2010/05/06 22:10 PDT [Broker,client] command interrupted, killing pid 1974 2010/05/06 22:10 PDT [Broker,client] trying os.kill(-pid, 9) 2010/05/06 22:10 PDT [Broker,client] signal 9 sent successfully 2010/05/06 22:10 PDT [Broker,client] <twisted.internet.tcp.Connector instance at 0x9514c8c> will retry in 2 seconds 2010/05/06 22:10 PDT [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c> 2010/05/06 22:10 PDT [-] command finished with signal 9, exit code None, elapsedTime: 297.229275 2010/05/06 22:10 PDT [-] SlaveBuilder.commandComplete None 2010/05/06 22:10 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c> 2010/05/06 22:10 PDT [Broker,client] message from master: attached 2010/05/06 22:17 PDT [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 mozilla-central talos nochrome): message from master: ping 2010/05/06 22:17 PDT [Broker,client] SlaveBuilder.remote_ping(<SlaveBuilder 'Rev3 Fedora 12 mozilla-central talos nochrome' at 156405580>) 2010/05/06 22:17 PDT [Broker,client] <SlaveBuilder 'Rev3 Fedora 12 mozilla-central talos nochrome' at 156405580>.startBuild … 2010/05/06 22:17 PDT [Broker,client] ShellCommand._startCommand 2010/05/06 22:17 PDT [Broker,client] python run_tests.py --noisy 20100506_2217_config.yml 2010/05/06 22:17 PDT [Broker,client] in dir /home/cltbld/talos-slave/mozilla-central-fedora-nochrome/../talos-data/talos/ (timeout 21600 secs) 2010/05/06 22:17 PDT [Broker,client] watching logfiles {} 2010/05/06 22:17 PDT [Broker,client] argv: ['python', 'run_tests.py', <blah blah blah> 2010/05/06 22:17 PDT [Broker,client] closing stdin 2010/05/06 22:17 PDT [Broker,client] using PTY: True 2010/05/06 22:22 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2010/05/06 22:22 PDT [Broker,client] lost remote <repeated many times at same timestamp> 2010/05/06 22:22 PDT [Broker,client] lost remote step 2010/05/06 22:22 PDT [Broker,client] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x959636c> 2010/05/06 22:22 PDT [Broker,client] command interrupted, killing pid 2078 2010/05/06 22:22 PDT [Broker,client] trying os.kill(-pid, 9) 2010/05/06 22:22 PDT [Broker,client] signal 9 sent successfully 2010/05/06 22:22 PDT [Broker,client] <twisted.internet.tcp.Connector instance at 0x9514c8c> will retry in 2 seconds 2010/05/06 22:22 PDT [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c> 2010/05/06 22:22 PDT [-] command finished with signal 9, exit code None, elapsedTime: 294.922019 2010/05/06 22:22 PDT [-] SlaveBuilder.commandComplete None 2010/05/06 22:22 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c> 2010/05/06 22:23 PDT [Broker,client] message from master: attached
So at 22:10 and 22:17 (less 2min 10s) the slave failed to send an update to the master, and gets disconnected. It then reattaches. Also got looks like it got kicked off at 2010/05/07 00:34 while not doing a build.
master log
Excerpts of master twistd.log* (copied to talos-master:/tmp/nthomas/)
twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] duplicate slave talos-r3-fed-012 replacing old one twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] old slave was connected from IPv4Address(TCP, '10.250.49.71', 55109) twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] new slave is from IPv4Address(TCP, '10.250.49.71', 46179) twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] disconnecting old slave talos-r3-fed-012 now twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] waiting for slave to finish disconnecting twistd.log.17:2010-05-06 22:08:34-0700 [Broker,261,10.250.49.71] BuildSlave.detached(talos-r3-fed-012) twistd.log.17:2010-05-06 22:08:42-0700 [Broker,316,10.250.49.71] Got slaveinfo from 'talos-r3-fed-012' twistd.log.17:2010-05-06 22:08:42-0700 [Broker,316,10.250.49.71] bot attached --- twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] duplicate slave talos-r3-fed-012 replacing old one twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] old slave was connected from IPv4Address(TCP, '10.250.49.71', 55284) twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] new slave is from IPv4Address(TCP, '10.250.49.71', 55437) twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] disconnecting old slave talos-r3-fed-012 now twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] waiting for slave to finish disconnecting twistd.log.4:2010-05-07 00:32:17-0700 [Broker,378,10.250.49.71] BuildSlave.detached(talos-r3-fed-012) twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] Got slaveinfo from 'talos-r3-fed-012' twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] bot attached
Eureka! Also, WTF.
tcpdump
dmoore ran tcpdump on the master
- 10.2.73.184 is talos-master; 9012 the port is listens on
- 10.250.49.71 is talos-r3-fed-012; port 55109 was the existing connection on the slave, 46179 the new one
22:07:59.064645 IP 10.2.73.184.9012 > 10.250.49.71.46179: P 13:53(40) ack 53 win 46 <nop,nop,timestamp 498143731 1183030> 22:07:59.070738 IP 10.250.49.71.46179 > 10.2.73.184.9012: P 53:110(57) ack 53 win 92 <nop,nop,timestamp 1183043 498143731> 22:07:59.111501 IP 10.2.73.184.9012 > 10.250.49.71.46179: . ack 110 win 46 <nop,nop,timestamp 498143778 1183043> 22:08:01.216136 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498145883 1178482> 22:08:08.127118 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498152795 1178482> 22:08:21.949067 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498166619 1178482> 22:08:34.626296 IP 10.2.73.184.9012 > 10.250.49.71.55109: F 288661:288661(0) ack 7616605 win 1055 <nop,nop,timestamp 498179298 1178482> 22:08:41.054353 IP 10.2.73.184.9012 > 10.250.49.71.46179: P 53:88(35) ack 110 win 46 <nop,nop,timestamp 498185727 1183043> 22:08:41.063158 IP 10.250.49.71.46179 > 10.2.73.184.9012: P 110:120(10) ack 88 win 92 <nop,nop,timestamp 1225037 498185727> 22:08:41.063158 IP 10.2.73.184.9012 > 10.250.49.71.46179: . ack 120 win 46 <nop,nop,timestamp 498185736 1225037>
Relate IRC chatter
<dmoore> It's interesting... the server is sending a FIN, like it wants to close the connection but the TCP connection never actually drops At no point is the connection actually closed it just keeps on truckin <dmoore> wait, I found the actual disconnect... minor clock difference <dmoore> that FIN does close the connection, it's just that the client has two connections going look at the port numbers
So the master thinks another slave (from the same IP and using same slave name) is connecting to it, kills off the existing slave, and we get an exception build.
More data
- the 'duplicate slave' messages appear in talos-r3 master logs as far back as 2010-05-02 04:39 (as much as we keep)
- the 'duplicate slave' messages also appear in pm01 master logs, going back to 2010-05-01 03:30
- we don't keep slaveside logs all that long on talos, so can't find historical data there
- do keep logs on build slaves without limit, and I found a similar slave log signature on mv-moz2-linux-ix-slave03:
2010/04/15 09:56 PDT [Broker,client] ShellCommand._startCommand 2010/04/15 09:56 PDT [Broker,client] make upload 2010/04/15 09:56 PDT [Broker,client] in dir /builds/slave/mozilla-1.9.2-linux/build/obj-firefox (timeout 2400 secs) 2010/04/15 09:56 PDT [Broker,client] watching logfiles {} 2010/04/15 09:56 PDT [Broker,client] argv: ['make', 'upload'] 2010/04/15 09:56 PDT [Broker,client] environment: {'SSH_ASKPASS': <blah blah blah> 2010/04/15 09:56 PDT [Broker,client] using PTY: True 2010/04/15 10:02 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2010/04/15 10:02 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
And another on 2010/04/01. I'm being careful to identify situations where we go from a command running directly to ackFailed, and ignoring
2010/04/06 16:04 PDT [Broker,client] python tools/buildfarm/maintenance/count_and_reboot.py -f ../reboot_count.txt -n 1 -z 2010/04/06 16:04 PDT [Broker,client] in dir /builds/slave/mozilla-1.9.2-linux-opt-unittest-crashtest/. (timeout 1200 secs) 2010/04/06 16:04 PDT [Broker,client] watching logfiles {} 2010/04/06 16:04 PDT [Broker,client] argv: ['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z'] 2010/04/06 16:04 PDT [Broker,client] environment: {'SSH_ASKPASS': <blah blah blah> 2010/04/06 16:04 PDT [Broker,client] closing stdin 2010/04/06 16:04 PDT [Broker,client] using PTY: True 2010/04/06 16:04 PDT [-] Received SIGTERM, shutting down. 2010/04/06 16:04 PDT [-] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x85d77cc> 2010/04/06 16:04 PDT [-] command interrupted, killing pid 3752 2010/04/06 16:04 PDT [-] trying os.kill(-pid, 9) 2010/04/06 16:04 PDT [-] signal 9 sent successfully 2010/04/06 16:04 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
Moar data
A list of disconnected slaves over night:
22:07:59 mw32-ix-slave13 22:20:24 mw32-ix-slave13 22:20:25 mv-moz2-linux-ix-slave10 23:33:34 mw32-ix-slave04 01:02:34 mw32-ix-slave03 03:21:26 mw32-ix-slave16 03:36:05 mw32-ix-slave05 04:59:46 talos-r3-leopard-035 05:13:47 talos-r3-snow-016 05:16:20 talos-r3-snow-017 05:16:22 talos-r3-snow-009 06:03:42 talos-r3-fed64-016 06:03:44 talos-r3-fed64-017 06:17:17 talos-r3-fed64-005
Discussion
We need to find out why the master & slave are behaving this way, and if it's a fault in network land or in buildbot. We got a lot more failures immediately after the downtime, during which
- the VPN hardware providing the Castro to MPT connection was swapped out
- talos-master was restarted
Given Windows Bo/Bd/N failed 5 times on pm01 after the downtime (and that master wasn't touched) we can probably point at the network.