| Line 1: | Line 1: | ||
| In general, slave management involves: | In general, slave management involves: | ||
| * keeping as many slaves up as possible, including | * keeping as many slaves up as possible, including | ||
| Line 7: | Line 6: | ||
| * interacting with IT regarding slave maintenance | * interacting with IT regarding slave maintenance | ||
| = Known failure modes = | |||
| * talos-r4-snow, talos-mtnlion-r5 | * talos-r4-snow, talos-mtnlion-r5 | ||
| ** <strike>These slaves will sometimes fail to puppetize correctly. The [https://hg.mozilla.org/build/braindump/file/120bdff523a3/mac-related/remote_scutil_cmds.bash remote_scutil_cmds.bash] script can help with this. </strike> | ** <strike>These slaves will sometimes fail to puppetize correctly. The [https://hg.mozilla.org/build/braindump/file/120bdff523a3/mac-related/remote_scutil_cmds.bash remote_scutil_cmds.bash] script can help with this. </strike> | ||
| Line 23: | Line 18: | ||
| ** <strike>Mozilla DNS servers don't resolve AWS hostnames, thus [[ReleaseEngineering/How_To/Resolve_AWS_names | this document describes how to resolve them]]</strike> | ** <strike>Mozilla DNS servers don't resolve AWS hostnames, thus [[ReleaseEngineering/How_To/Resolve_AWS_names | this document describes how to resolve them]]</strike> | ||
| = Automated = | |||
| == Slave Rebooter == | |||
| Slave rebooter is a script that analyzes recent slave activity and attempts to reboot slaves that it thinks are stuck. It is a [[ReleaseEngineering/Applications/SlaveAPI | SlaveAPI]]-based replacement for Kittenherder. It [https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/reboot-idle-slaves.py lives in the build/tools repository], [https://hg.mozilla.org/build/puppet/file/default/modules/slaverebooter gets deployed by Puppet], and currently runs on buildbot-master65. | Slave rebooter is a script that analyzes recent slave activity and attempts to reboot slaves that it thinks are stuck. It is a [[ReleaseEngineering/Applications/SlaveAPI | SlaveAPI]]-based replacement for Kittenherder. It [https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/reboot-idle-slaves.py lives in the build/tools repository], [https://hg.mozilla.org/build/puppet/file/default/modules/slaverebooter gets deployed by Puppet], and currently runs on buildbot-master65. | ||
| At the time of writing, it works for all hardware machines except Tegras and Pandas. Cloud machines are explicitly ignored because they don't suffer from the same types of transient failures. | At the time of writing, it works for all hardware machines except Tegras and Pandas. Cloud machines are explicitly ignored because they don't suffer from the same types of transient failures. | ||
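The eligibility rule described above can be sketched in Python. This is only an illustration, not the logic of reboot-idle-slaves.py: the <code>Slave</code> record, the <code>is_cloud</code> flag, the name prefixes, and the six-hour idle threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical name prefixes for hardware the rebooter does not handle.
UNSUPPORTED_PREFIXES = ("tegra-", "panda-")
# Hypothetical idle threshold; the real script's value may differ.
IDLE_THRESHOLD = timedelta(hours=6)


@dataclass
class Slave:
    name: str
    is_cloud: bool
    last_activity: datetime


def should_reboot(slave, now):
    """Return True if the slave looks stuck and is eligible for a reboot."""
    if slave.is_cloud:
        return False  # cloud machines are explicitly ignored
    if slave.name.startswith(UNSUPPORTED_PREFIXES):
        return False  # Tegras and Pandas are not supported
    return now - slave.last_activity > IDLE_THRESHOLD
```

A caller would pass the current time explicitly, which keeps the check easy to test against recorded activity data.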
| = Manual = | |||
| == Rebooting slaves == | |||
| Find the slave's page on slave health; it has a button to reboot the machine. | Find the slave's page on slave health; it has a button to reboot the machine. | ||
| == Filing bugs for IT == | |||
| * File a bug using the link in the slave health page for the slave - it will "do the right thing" to set up a new bug if needed. | * File a bug using the link in the slave health page for the slave - it will "do the right thing" to set up a new bug if needed. | ||
| * File a [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations%3A%20DCOps&short_desc=HOST%20is%20unreachable "slave is unreachable bug"] for IT. | * File a [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations%3A%20DCOps&short_desc=HOST%20is%20unreachable "slave is unreachable bug"] for IT. | ||
| Line 42: | Line 37: | ||
| ** Examples: https://bugzilla.mozilla.org/show_bug.cgi?id=966954, https://bugzilla.mozilla.org/show_bug.cgi?id=828602 | ** Examples: https://bugzilla.mozilla.org/show_bug.cgi?id=966954, https://bugzilla.mozilla.org/show_bug.cgi?id=828602 | ||
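The "slave is unreachable" link above is an ordinary Bugzilla <code>enter_bug.cgi</code> URL with the product, component, and summary pre-filled. A minimal sketch of building the same URL for a specific host follows; the function name <code>unreachable_bug_url</code> is hypothetical, and the parameter values are copied from the link above.

```python
from urllib.parse import quote, urlencode

BUGZILLA_ENTER_BUG = "https://bugzilla.mozilla.org/enter_bug.cgi"


def unreachable_bug_url(hostname):
    """Build a pre-filled 'HOST is unreachable' bug URL for the given host."""
    params = {
        "product": "mozilla.org",
        "component": "Server Operations: DCOps",
        "short_desc": f"{hostname} is unreachable",
    }
    # quote_via=quote encodes spaces as %20, matching the link above.
    return f"{BUGZILLA_ENTER_BUG}?{urlencode(params, quote_via=quote)}"
```

Calling it with the literal string <code>HOST</code> reproduces the template link, so substituting a real hostname only changes the summary field.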
| == Slave Tracking == | |||
| * Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc. | * Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc. | ||
| '''NOTE:''' you no longer need to add the slave-specific bug number to the Notes field. Clicking on the http://slavealloc.build.mozilla.org/ui/icons/help.png icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead using the format: 'bug #######.' | '''NOTE:''' you no longer need to add the slave-specific bug number to the Notes field. Clicking on the http://slavealloc.build.mozilla.org/ui/icons/help.png icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead using the format: 'bug #######.' | ||
| === Slavealloc === | |||
| ==== Connecting ==== | |||
| Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. | Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. | ||
| Line 56: | Line 51: | ||
| ssh <your user>@relengwebadm.private.scl3.mozilla.com   | ssh <your user>@relengwebadm.private.scl3.mozilla.com   | ||
| </pre> | </pre> | ||
| ==== Staging vs production ==== | |||
| The DB URLs for staging and production are stored in a PGP-encrypted file shared within the Release Engineering team. Ask someone on the team if you do not have this file. | The DB URLs for staging and production are stored in a PGP-encrypted file shared within the Release Engineering team. Ask someone on the team if you do not have this file. | ||
| ==== Adding a slave ==== | |||
| Once you connect to relengwebadm (see above), to see the help for the slavealloc dbimport command, run: | Once you connect to relengwebadm (see above), to see the help for the slavealloc dbimport command, run: | ||
| <pre> | <pre> | ||
| Line 79: | Line 74: | ||
| </pre> | </pre> | ||
| ==== Adding a master ==== | |||
| Adding masters is similar to adding a slave: | Adding masters is similar to adding a slave: | ||
| <pre> | <pre> | ||
| Line 108: | Line 103: | ||
| The slavealloc dbimport mechanism converts each line of the CSV file into an SQL INSERT statement. Unspecified fields are essentially set to NULL. To see how the fields are mapped and normalized, see: https://hg.mozilla.org/build/tools/file/5439f10a7127/lib/python/slavealloc/scripts/dbimport.py#l111 (lines 111-137). | The slavealloc dbimport mechanism converts each line of the CSV file into an SQL INSERT statement. Unspecified fields are essentially set to NULL. To see how the fields are mapped and normalized, see: https://hg.mozilla.org/build/tools/file/5439f10a7127/lib/python/slavealloc/scripts/dbimport.py#l111 (lines 111-137). | ||
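The CSV-to-INSERT behaviour can be illustrated with a small standalone sketch. It is not the dbimport code itself: the column list, the <code>import_slaves</code> helper, and the use of an in-memory SQLite database in place of the real MySQL DB are assumptions made so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical column subset; the real slaves table has more fields.
COLUMNS = ("name", "distro", "bitlength", "notes")


def import_slaves(conn, csv_text):
    """Insert one row per CSV line; missing or empty fields become NULL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO slaves ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    count = 0
    for row in reader:
        # row.get() returns None for columns absent from the CSV header.
        values = [row.get(col) or None for col in COLUMNS]
        conn.execute(sql, values)
        count += 1
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slaves (name TEXT, distro TEXT, bitlength INTEGER, notes TEXT)"
)
imported = import_slaves(conn, "name,distro\nexample-slave-001,linux\n")
```

Here the CSV only supplies <code>name</code> and <code>distro</code>, so <code>bitlength</code> and <code>notes</code> end up NULL in the inserted row.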
| ==== Moving slaves ==== | |||
| Connect to relengwebadm and then connect to the MySQL DB. | Connect to relengwebadm and then connect to the MySQL DB. | ||
| Line 114: | Line 109: | ||
|   UPDATE slaves SET poolid=43, trustid=4 WHERE notes LIKE 'bug 917923 - to be converted into try hosts'; |   UPDATE slaves SET poolid=43, trustid=4 WHERE notes LIKE 'bug 917923 - to be converted into try hosts'; | ||
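Before running a bulk UPDATE like the one above, it can help to list the rows it will touch. The sketch below shows that pattern with parameterized queries; the <code>move_slaves</code> helper, the reduced table schema, and the in-memory SQLite database are assumptions for illustration, and it matches notes with <code>=</code> because the pattern above contains no wildcards.

```python
import sqlite3


def move_slaves(conn, notes, poolid, trustid):
    """Set poolid/trustid on slaves whose notes match; return their names."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM slaves WHERE notes = ? ORDER BY name", (notes,)
        )
    ]
    with conn:  # commits on success, rolls back if the UPDATE raises
        conn.execute(
            "UPDATE slaves SET poolid = ?, trustid = ? WHERE notes = ?",
            (poolid, trustid, notes),
        )
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slaves (name TEXT, poolid INTEGER, trustid INTEGER, notes TEXT)"
)
conn.execute(
    "INSERT INTO slaves VALUES (?, ?, ?, ?)",
    ("example-slave-001", 1, 1, "bug 917923 - to be converted into try hosts"),
)
moved = move_slaves(
    conn, "bug 917923 - to be converted into try hosts", poolid=43, trustid=4
)
```

Returning the affected names gives a simple record of what was moved, which can be pasted into the tracking bug.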
| ==== Removing slaves ==== | |||
| Connect to relengwebadm and then connect to the MySQL DB. | Connect to relengwebadm and then connect to the MySQL DB. | ||
| <pre> | <pre> | ||
| Line 121: | Line 116: | ||
| </pre> | </pre> | ||
| == Returning a re-imaged slave to production == | |||
| (a.k.a. post-imaging) | (a.k.a. post-imaging) | ||
| Line 127: | Line 122: | ||
| See [[ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave]] | See [[ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave]] | ||
| == How to decommission a slave == | |||
| * disable the slave in slavealloc, also setting its environment to "decomm" | * disable the slave in slavealloc, also setting its environment to "decomm" | ||
| * if the hardware has failed: | * if the hardware has failed: | ||
| Line 137: | Line 132: | ||
| * remove the slave from puppet or opsi configs, if it exists in one | * remove the slave from puppet or opsi configs, if it exists in one | ||
| = Windows = | |||
| I'm hoping to add enough info to demystify Windows and allow anyone to debug a Windows machine. | |||