CIDuty/How To/Deprecated / Archived/Slave Management: Difference between revisions

Revision as of 15:44, 24 November 2017

1,200 bytes removed, 24 November 2017

Various minor updates

Aselagea

@@ Line 1: / Line 1: @@
 In general, slave management involves:
 * keeping as many slaves up as possible, including
-** proactively checking for hung/broken slaves - see [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html slave health dashboard].
+** proactively checking for hung/broken slaves - see [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth] dashboard.
 ** returning re-imaged slaves to production
 * handling [[ReleaseEngineering:Buildduty:Nagios|nagios alerts]] for slaves
@@ Line 7: / Line 7: @@
 = Known failure modes =
-* talos-r4-snow, talos-mtnlion-r5
+* Hardware machines becoming unreachable; a reboot is generally needed:
-** all r4 and r5 slaves are connected to [[#PDU|PDUs for power-cycling]]
+** all Mac OS machines (bld-lion-r5* and t-yosemite-r7*) are connected to [https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI PDU].
-* AWS slaves
+** Windows and Linux machines use [https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI IPMI]
-** a common failure is running out of disk space.  They have default disk allocations of 150GB versus our which have 250GB.  Catlee is working on changing that.
-*** To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all See {{bug|829186}} for an example.
-** Rail wrote a tool [[ReleaseEngineering/How_To/Manage_AWS_slaves | to manage aws slaves]] - enable or disable automatic reboot and automatic shutdown.
 = Automated =
 There are currently no automated mechanisms for recovering individual slaves.
+* AWS instances will automatically terminate when idle.
-AWS instances will automatically terminate when idle.
 = Manual =
 == Rebooting slaves ==
-Find the slave page on slave health. There's a button to reboot the machine.
+Find the slave page on [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth]. There's a button to reboot the machine.
 == Filing bugs for IT ==
-* File a bug using the link in the slave health page for the slave - it will "do the right thing" to set up a new bug if needed.
+* File a bug using the link in the SlaveHealth page for the slave - it will "do the right thing" to set up a new bug if needed.
 * File a [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations%3A%20DCOps&short_desc=HOST%20is%20unreachable "slave is unreachable bug"] for IT.
-* Create dependent bugs for any IT actions. (As of 2014, we should file per-slave bugs for reboots instead of grouping together machines in the same DC into one bug.)
+** '''Note''': SlaveAPI does this automatically when it fails to reboot the machine.
-** should block both the datacenter bug & the per host bug (for record keeping)
+* Create dependent bugs for any IT actions. (As of 2017, we should file per-slave bugs for reboots instead of grouping together machines in the same DC into one bug.)
+** should block the per host bug (for record keeping)
 ** consider whether the slave should be disabled in slavealloc, and note that in bug (no slave without a detailed bug should be disabled)
-** dcops assumes if there is no separate bug, they only need to reboot and see the machine come online.
+** DCOps assumes if there is no separate bug, they only need to reboot and see the machine come online.
-** Examples: https://bugzilla.mozilla.org/show_bug.cgi?id=966954, https://bugzilla.mozilla.org/show_bug.cgi?id=828602
+** e.g. bug [https://bugzilla.mozilla.org/show_bug.cgi?id=1420132 1420132].
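The "slave is unreachable" link above is a Bugzilla bug-entry URL with the product, component, and summary pre-filled. As a minimal sketch of how such a link could be built for a specific host: the function name `unreachable_bug_url` and the example hostname are hypothetical, while the query parameters are taken from the link in the list above.

```python
from urllib.parse import quote, urlencode

# Base URL and parameter values are copied from the wiki link above.
BUGZILLA_ENTER_BUG = "https://bugzilla.mozilla.org/enter_bug.cgi"


def unreachable_bug_url(host: str) -> str:
    """Return a pre-filled bug-entry URL for an unreachable host (hypothetical helper)."""
    params = {
        "product": "mozilla.org",
        "component": "Server Operations: DCOps",
        "short_desc": f"{host} is unreachable",
    }
    # quote_via=quote encodes spaces as %20 (not '+'), matching the wiki link.
    return f"{BUGZILLA_ENTER_BUG}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    # "t-yosemite-r7-0001" is an illustrative hostname, not taken from the page.
    print(unreachable_bug_url("t-yosemite-r7-0001"))
```

This only constructs the link; filing the bug still happens in the browser, and SlaveHealth or SlaveAPI may already have filed one.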
 == Slave Tracking ==
@@ Line 78: / Line 75: @@
 <pre>
 nickname,fqdn,http_port,pb_port,datacenter,pool
-bm89-tests1-panda,buildbot-master89.srv.releng.scl3.mozilla.com,8201,9201,scl3,tests-panda
+bm141-tests1-linux32,buildbot-master141.bb.releng.use1.mozilla.com,8201,9201,scl3,tests-use1-linux32
+bm142-tests1-linux32,buildbot-master142.bb.releng.usw2.mozilla.com,8201,9201,scl3,tests-usw2-linux32
 </pre>
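Before importing a masters file in the format shown above, it can help to check that every row has the six expected columns and numeric ports. The following is a rough sketch under that assumption; `parse_masters` is a hypothetical helper and not part of slavealloc.

```python
import csv
import io

# Column names are taken from the header line of the example above.
EXPECTED_FIELDS = ["nickname", "fqdn", "http_port", "pb_port", "datacenter", "pool"]


def parse_masters(text):
    """Parse masters CSV text into a list of dicts, raising ValueError on bad rows."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    masters = []
    for row in reader:
        # DictReader stores surplus cells under the None key and fills
        # missing cells with None, so either case means a malformed row.
        if None in row or None in row.values():
            raise ValueError(f"wrong number of fields on line {reader.line_num}")
        row["http_port"] = int(row["http_port"])  # raises ValueError if not numeric
        row["pb_port"] = int(row["pb_port"])
        masters.append(row)
    return masters
```

A caller would read the file, pass its contents to `parse_masters`, and only proceed with the import if no exception is raised.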
@@ Line 108: / Line 106: @@
 <pre>
   SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
-  DELETE name FROM slaves WHERE notes LIKE '%bumblebumble%';
+  DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
 </pre>
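The SELECT-then-DELETE pattern above can also be scripted so that the delete only runs when the SELECT returns exactly the slaves you expect. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for the real slavealloc database, and `delete_slaves_by_note` is a hypothetical helper.

```python
import sqlite3


def delete_slaves_by_note(conn, pattern, expected_names):
    """Delete rows whose notes match `pattern`, but only if the matching
    names are exactly `expected_names`. Returns the deleted names."""
    cur = conn.cursor()
    # Step 1: preview what the DELETE would remove.
    cur.execute("SELECT name FROM slaves WHERE notes LIKE ?", (pattern,))
    found = sorted(row[0] for row in cur.fetchall())
    if found != sorted(expected_names):
        raise RuntimeError(f"refusing to delete: matched {found}")
    # Step 2: run the DELETE with the same WHERE clause, then commit.
    cur.execute("DELETE FROM slaves WHERE notes LIKE ?", (pattern,))
    conn.commit()
    return found
```

Passing the expected names explicitly forces a review of the match list before anything is removed, which is the purpose of running the SELECT first.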
 == Returning a re-imaged slave to production ==
+* see [[ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave]]
-(aka. post-imaging)
-See [[ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave]]
 == How to decommission a slave ==
-* disable the slave in slavealloc, also setting its environment to "decomm"
+* https://wiki.mozilla.org/ReleaseEngineering/How_To/Decommission_Slave
-* if the hardware has failed:
-** file a bug against Server Ops:Releng to decomm the slave. They should (at the very least) make sure the nagios alerts are updated, DNS updated, and the hardware recovered from the dc.
-* if the hardware is still viable and can be used by another pool (e.g. r3 mini)
-** file a bug against Server Ops:Releng to have the slave re-imaged to another OS with bad wait times (usually Windows)
-** add the new slave to the buildbot configs, and make sure nagios monitoring is setup for the new slave (may require a new bug against relops)
-* Open a bug for the release engineering changes to decommission a machine which include
-** remove the slave from the buildbot configs example {{bug|798460}}
-** remove the slave from puppet or opsi configs, if it exists in one (This doesn't apply to puppet if you unless you are removing an entire pool of machines, not single one)
-** remove the slave from slavealloc see https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Removing_slaves
-** In the bug you can attach patches for the buildbot configs and mysql changes for slavealloc for review by another relenger
 = Windows =