User:NThomas:CastroTimeouts

From MozillaWiki
Jump to: navigation, search

Investigating failures on talos-r3-fed-012

Build history

  • May 06 21:35 thru 21:42 - successful - details
  • May 06 22:00 thru 22:08 - exception - details
  • May 06 22:15 thru 22:20 - exception - details

Logs

  • slave clock is 2m 10s fast
  • talos-master clock is right

slave 'console' log

From gnometerminal (via VNC):

notice: Starting catalog run
notice: Finished catalog run in 3.22 seconds
Following twistd.log until startup finished..
2010/05/06 21:47 PDT [-] Log opened.
2010/05/06 21:47 PDT [-] twistd 2.4.0 (/home/cltbld/bin/python2.5 2.5.4) starting up
2010/05/06 21:47 PDT [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2010/05/06 21:47 PDT [-] Loading buildbot.tac...
2010/05/06 21:47 PDT [-] Creating BuildSlave -- buildbot.version: 0.7.9
2010/05/06 21:47 PDT [-] Loaded.
2010/05/06 21:47 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c>
2010/05/06 21:47 PDT [Broker,client] message from master: attached
The buildslave appears to have (re)started correctly.

So puppet probably only ran once, and the slave did not reboot after first purple build.

slave twistd.log

The interesting parts starting at the failure of the first purple build:

2010/05/06 22:05 PDT [Broker,client] ShellCommand._startCommand
2010/05/06 22:05 PDT [Broker,client]  python mochitest/runtests.py --appname=firefox/firefox-bin --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=symbols --total-chunks=5 --this-chunk=1 --chunk-by-dir=4
2010/05/06 22:05 PDT [Broker,client]   in dir /home/cltbld/talos-slave/mozilla-central-fedora-debug-u-mochitests-1/build (timeout 1200 secs)
2010/05/06 22:05 PDT [Broker,client]   watching logfiles {}
2010/05/06 22:05 PDT [Broker,client]   argv: ['python', <blah blah blah>
2010/05/06 22:05 PDT [Broker,client]   closing stdin
2010/05/06 22:05 PDT [Broker,client]   using PTY: True
2010/05/06 22:10 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
<repeats 14 times>
2010/05/06 22:10 PDT [Broker,client] lost remote
<repeats many times at same timestamp>
2010/05/06 22:10 PDT [Broker,client] lost remote step
2010/05/06 22:10 PDT [Broker,client] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x9da528c>
2010/05/06 22:10 PDT [Broker,client] command interrupted, killing pid 1974
2010/05/06 22:10 PDT [Broker,client] trying os.kill(-pid, 9)
2010/05/06 22:10 PDT [Broker,client]  signal 9 sent successfully
2010/05/06 22:10 PDT [Broker,client] <twisted.internet.tcp.Connector instance at 0x9514c8c> will retry in 2 seconds
2010/05/06 22:10 PDT [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c>
2010/05/06 22:10 PDT [-] command finished with signal 9, exit code None, elapsedTime: 297.229275
2010/05/06 22:10 PDT [-] SlaveBuilder.commandComplete None
2010/05/06 22:10 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c>
2010/05/06 22:10 PDT [Broker,client] message from master: attached
2010/05/06 22:17 PDT [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 mozilla-central talos nochrome): message from master: ping
2010/05/06 22:17 PDT [Broker,client] SlaveBuilder.remote_ping(<SlaveBuilder 'Rev3 Fedora 12 mozilla-central talos nochrome' at 156405580>)
2010/05/06 22:17 PDT [Broker,client] <SlaveBuilder 'Rev3 Fedora 12 mozilla-central talos nochrome' at 156405580>.startBuild
…
2010/05/06 22:17 PDT [Broker,client] ShellCommand._startCommand
2010/05/06 22:17 PDT [Broker,client]  python run_tests.py --noisy 20100506_2217_config.yml
2010/05/06 22:17 PDT [Broker,client]   in dir /home/cltbld/talos-slave/mozilla-central-fedora-nochrome/../talos-data/talos/ (timeout 21600 secs)
2010/05/06 22:17 PDT [Broker,client]   watching logfiles {}
2010/05/06 22:17 PDT [Broker,client]   argv: ['python', 'run_tests.py', <blah blah blah>
2010/05/06 22:17 PDT [Broker,client]   closing stdin
2010/05/06 22:17 PDT [Broker,client]   using PTY: True
2010/05/06 22:22 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2010/05/06 22:22 PDT [Broker,client] lost remote
<repeated many times at same timestamp>
2010/05/06 22:22 PDT [Broker,client] lost remote step
2010/05/06 22:22 PDT [Broker,client] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x959636c>
2010/05/06 22:22 PDT [Broker,client] command interrupted, killing pid 2078
2010/05/06 22:22 PDT [Broker,client] trying os.kill(-pid, 9)
2010/05/06 22:22 PDT [Broker,client]  signal 9 sent successfully
2010/05/06 22:22 PDT [Broker,client] <twisted.internet.tcp.Connector instance at 0x9514c8c> will retry in 2 seconds
2010/05/06 22:22 PDT [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c>
2010/05/06 22:22 PDT [-] command finished with signal 9, exit code None, elapsedTime: 294.922019
2010/05/06 22:22 PDT [-] SlaveBuilder.commandComplete None
2010/05/06 22:22 PDT [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x94bc96c>
2010/05/06 22:23 PDT [Broker,client] message from master: attached

So at 22:10 and 22:17 (less 2min 10s) the slave failed to send an update to the master, and gets disconnected. It then reattaches. Also got looks like it got kicked off at 2010/05/07 00:34 while not doing a build.

master log

Excerpts of master twistd.log* (copied to talos-master:/tmp/nthomas/)

twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] duplicate slave talos-r3-fed-012 replacing old one
twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] old slave was connected from IPv4Address(TCP, '10.250.49.71', 55109)
twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] new slave is from IPv4Address(TCP, '10.250.49.71', 46179)
twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] disconnecting old slave talos-r3-fed-012 now
twistd.log.17:2010-05-06 22:07:59-0700 [Broker,316,10.250.49.71] waiting for slave to finish disconnecting
twistd.log.17:2010-05-06 22:08:34-0700 [Broker,261,10.250.49.71] BuildSlave.detached(talos-r3-fed-012)
twistd.log.17:2010-05-06 22:08:42-0700 [Broker,316,10.250.49.71] Got slaveinfo from 'talos-r3-fed-012'
twistd.log.17:2010-05-06 22:08:42-0700 [Broker,316,10.250.49.71] bot attached
---
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] duplicate slave talos-r3-fed-012 replacing old one
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] old slave was connected from IPv4Address(TCP, '10.250.49.71', 55284)
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] new slave is from IPv4Address(TCP, '10.250.49.71', 55437)
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] disconnecting old slave talos-r3-fed-012 now
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] waiting for slave to finish disconnecting
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,378,10.250.49.71] BuildSlave.detached(talos-r3-fed-012)
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] Got slaveinfo from 'talos-r3-fed-012'
twistd.log.4:2010-05-07 00:32:17-0700 [Broker,624,10.250.49.71] bot attached

Eureka! Also, WTF.

tcpdump

dmoore ran tcpdump on the master

  • 10.2.73.184 is talos-master; 9012 the port is listens on
  • 10.250.49.71 is talos-r3-fed-012; port 55109 was the existing connection on the slave, 46179 the new one
22:07:59.064645 IP 10.2.73.184.9012 > 10.250.49.71.46179: P 13:53(40) ack 53 win 46 <nop,nop,timestamp 498143731 1183030>
22:07:59.070738 IP 10.250.49.71.46179 > 10.2.73.184.9012: P 53:110(57) ack 53 win 92 <nop,nop,timestamp 1183043 498143731>
22:07:59.111501 IP 10.2.73.184.9012 > 10.250.49.71.46179: . ack 110 win 46 <nop,nop,timestamp 498143778 1183043>
22:08:01.216136 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498145883 1178482>
22:08:08.127118 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498152795 1178482>
22:08:21.949067 IP 10.2.73.184.9012 > 10.250.49.71.55109: P 288651:288661(10) ack 7616605 win 1055 <nop,nop,timestamp 498166619 1178482>
22:08:34.626296 IP 10.2.73.184.9012 > 10.250.49.71.55109: F 288661:288661(0) ack 7616605 win 1055 <nop,nop,timestamp 498179298 1178482>
22:08:41.054353 IP 10.2.73.184.9012 > 10.250.49.71.46179: P 53:88(35) ack 110 win 46 <nop,nop,timestamp 498185727 1183043>
22:08:41.063158 IP 10.250.49.71.46179 > 10.2.73.184.9012: P 110:120(10) ack 88 win 92 <nop,nop,timestamp 1225037 498185727>
22:08:41.063158 IP 10.2.73.184.9012 > 10.250.49.71.46179: . ack 120 win 46 <nop,nop,timestamp 498185736 1225037>

Relate IRC chatter

  <dmoore>	It's interesting... the server is sending a FIN, like it wants to close the connection
		but the TCP connection never actually drops
		At no point is the connection actually closed
		it just keeps on truckin
  <dmoore>	wait, I found the actual disconnect... minor clock difference
  <dmoore>	that FIN does close the connection, it's just that the client has two connections going
		look at the port numbers

So the master thinks another slave (from the same IP and using same slave name) is connecting to it, kills off the existing slave, and we get an exception build.

More data

  • the 'duplicate slave' messages appear in talos-r3 master logs as far back as 2010-05-02 04:39 (as much as we keep)
  • the 'duplicate slave' messages also appear in pm01 master logs, going back to 2010-05-01 03:30
  • we don't keep slaveside logs all that long on talos, so can't find historical data there
  • do keep logs on build slaves without limit, and I found a similar slave log signature on mv-moz2-linux-ix-slave03:
2010/04/15 09:56 PDT [Broker,client] ShellCommand._startCommand
2010/04/15 09:56 PDT [Broker,client]  make upload
2010/04/15 09:56 PDT [Broker,client]   in dir /builds/slave/mozilla-1.9.2-linux/build/obj-firefox (timeout 2400 secs)
2010/04/15 09:56 PDT [Broker,client]   watching logfiles {}
2010/04/15 09:56 PDT [Broker,client]   argv: ['make', 'upload']
2010/04/15 09:56 PDT [Broker,client]  environment: {'SSH_ASKPASS': <blah blah blah>
2010/04/15 09:56 PDT [Broker,client]   using PTY: True
2010/04/15 10:02 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2010/04/15 10:02 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate

And another on 2010/04/01. I'm being careful to identify situations where we go from a command running directly to ackFailed, and ignoring

2010/04/06 16:04 PDT [Broker,client]  python tools/buildfarm/maintenance/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
2010/04/06 16:04 PDT [Broker,client]   in dir /builds/slave/mozilla-1.9.2-linux-opt-unittest-crashtest/. (timeout 1200 secs)
2010/04/06 16:04 PDT [Broker,client]   watching logfiles {}
2010/04/06 16:04 PDT [Broker,client]   argv: ['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
2010/04/06 16:04 PDT [Broker,client]  environment: {'SSH_ASKPASS': <blah blah blah>
2010/04/06 16:04 PDT [Broker,client]   closing stdin
2010/04/06 16:04 PDT [Broker,client]   using PTY: True
2010/04/06 16:04 PDT [-] Received SIGTERM, shutting down.
2010/04/06 16:04 PDT [-] stopCommand: halting current command <buildbot.slave.commands.SlaveShellCommand instance at 0x85d77cc>
2010/04/06 16:04 PDT [-] command interrupted, killing pid 3752
2010/04/06 16:04 PDT [-] trying os.kill(-pid, 9)
2010/04/06 16:04 PDT [-]  signal 9 sent successfully
2010/04/06 16:04 PDT [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate

Moar data

A list of disconnected slaves over night:

22:07:59    mw32-ix-slave13
22:20:24    mw32-ix-slave13
22:20:25    mv-moz2-linux-ix-slave10
23:33:34    mw32-ix-slave04
01:02:34    mw32-ix-slave03
03:21:26    mw32-ix-slave16
03:36:05    mw32-ix-slave05
04:59:46    talos-r3-leopard-035
05:13:47    talos-r3-snow-016
05:16:20    talos-r3-snow-017
05:16:22    talos-r3-snow-009
06:03:42    talos-r3-fed64-016
06:03:44    talos-r3-fed64-017
06:17:17    talos-r3-fed64-005

Discussion

We need to find out why the master & slave are behaving this way, and if it's a fault in network land or in buildbot. We got a lot more failures immediately after the downtime, during which

  • the VPN hardware providing the Castro to MPT connection was swapped out
  • talos-master was restarted

Given Windows Bo/Bd/N failed 5 times on pm01 after the downtime (and that master wasn't touched) we can probably point at the network.