ReleaseEngineering/IT Resource Job Description
This is a general description of activities on which we'd like to collaborate with someone from IT.
- remote power control/remote console solutions
  - eliminates *most* of the need for NOC monkeys if we can handle remote reboots ourselves
- nagios improvements
- munin improvements
  - switch to ganglia?
- puppet support/maintenance
  - need to run our configs (e.g. Build VPN) to know how/when things are broken
- configuration sanity script development and maintenance
  - based on a knowledge of how the releng systems work and what they need to connect to, develop a suite of tools that can verify that IT changes won't affect builds *before* builds actually fail
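As an illustration of what one such sanity check might look like, here is a minimal Python sketch that tests whether a list of services is still reachable over TCP. It is only an assumption about the shape of the tool, not an existing releng script: the function names `is_reachable` and `find_unreachable` are hypothetical, and the real list of hosts and ports would come from knowledge of what the build systems connect to.

```python
import socket
from typing import Iterable, List, Tuple

# (hostname, TCP port) pairs that builds depend on; the actual entries
# are an assumption and would have to be supplied by releng.
Endpoint = Tuple[str, int]


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def find_unreachable(endpoints: Iterable[Endpoint],
                     timeout: float = 3.0) -> List[Endpoint]:
    """Return the endpoints that could not be reached, in input order."""
    return [(host, port) for host, port in endpoints
            if not is_reachable(host, port, timeout)]
```

Run from a build machine right after an IT change, a non-empty result from something like `find_unreachable([("hg.example.org", 443)])` (a placeholder hostname) would flag the problem before a build fails on it. A TCP connect only shows that the port answers, so a fuller suite would presumably add protocol-level checks on top.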
Comments from the IT side: (Zandr)
- Search and destroy mission on single points of failure
  - It's way too easy to close the tree right now
- Datacenter Engineering
  - We're having power issues in scl1 already, and we need to decorrelate failure modes better.