ReleaseEngineering/IT Resource Job Description
This is a general description of activities on which we'd like to collaborate with someone from IT.
- remote power control/remote console solutions
- eliminates *most* of the need for NOC monkeys if we can handle remote reboots ourselves
- reboot hung/stuck slaves for anything that is not configured for remote reboot (currently all minis, nokias, tegras)
- reimage corrupted slaves (IX, VMs, minis, nokias, tegras)
- nagios improvements
- munin improvements
- switch to ganglia?
- puppet support/maintenance
- need to run our configs (e.g. Build VPN) to know how/when things are broken
- configuration sanity script development and maintenance
- based on a knowledge of how the releng systems work and what they need to connect to, develop a suite of tools that can verify that IT changes won't affect builds *before* builds actually fail
- Asset tracking: we have low confidence that https://build.inventory.mozilla.org/build/ matches reality. The inventory needs to be fixed, and then maintained going forward.
- Either do VPN/DNS configs, or interface with IT for this.
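As an illustration of the configuration sanity script idea above, here is a minimal sketch in Python, assuming the checks boil down to "can a build machine still resolve and reach the services it depends on". The endpoint names, hosts, and ports are hypothetical placeholders, not the real releng topology, and a real suite would likely need protocol-level checks as well.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Endpoint:
    name: str
    host: str
    port: int


@dataclass(frozen=True)
class Result:
    endpoint: Endpoint
    ok: bool
    detail: str


# Hypothetical examples only; replace with the services builds really need.
REQUIRED_ENDPOINTS = [
    Endpoint("puppet master", "puppet.build.example.com", 8140),
    Endpoint("source control", "hg.example.com", 443),
    Endpoint("monitoring", "nagios.build.example.com", 5666),
]


def check_endpoint(
    ep: Endpoint,
    timeout: float = 5.0,
    connect: Callable = socket.create_connection,
) -> Result:
    """Try a TCP connection; DNS failures and timeouts both surface as OSError."""
    try:
        conn = connect((ep.host, ep.port), timeout)
    except OSError as exc:
        return Result(ep, False, f"{type(exc).__name__}: {exc}")
    conn.close()
    return Result(ep, True, "connected")


def run_checks(endpoints: Iterable[Endpoint], **kwargs) -> List[Result]:
    """Check every endpoint, never stopping at the first failure."""
    return [check_endpoint(ep, **kwargs) for ep in endpoints]


def summarize(results: Iterable[Result]) -> int:
    """Print one line per endpoint and return a nagios-style exit code (0 or 2)."""
    exit_code = 0
    for r in results:
        status = "OK" if r.ok else "CRITICAL"
        print(f"{status}: {r.endpoint.name} "
              f"({r.endpoint.host}:{r.endpoint.port}) {r.detail}")
        if not r.ok:
            exit_code = 2
    return exit_code
```

Running `summarize(run_checks(REQUIRED_ENDPOINTS))` from a build machine (or from a host on the Build VPN) before and after an IT change would give a quick signal of whether anything builds depend on has become unreachable. The `connect` parameter exists only so the check can be exercised without touching the network.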
Comments from the IT side (Zandr):
- Search and destroy mission on single points of failure
- It's way too easy to close the tree right now
- Datacenter Engineering
- We're having power issues in scl1 already, and we need to decorrelate failure modes better.