This is a general description of activities on which we'd like to collaborate with someone from IT.

  • remote power control/remote console solutions
    • eliminates *most* of the need for NOC monkeys if we can handle remote reboots ourselves
    • reboot hung/stuck slaves for anything that is not configured for remote reboot (currently all minis, nokias, tegras)
  • reimage corrupted slaves (IX, VMs, minis, nokias, tegras)
  • nagios improvements
  • munin improvements
    • switch to ganglia?
  • puppet support/maintenance
  • need to run our own configs (e.g. Build VPN) so we know how and when things break
  • configuration sanity script development and maintenance
    • based on a knowledge of how the releng systems work and what they need to connect to, develop a suite of tools that can verify that IT changes won't affect builds *before* builds actually fail
  • Asset tracking: low confidence that it matches reality. This needs to be fixed, and then maintained going forward.
  • Either do VPN/DNS/ configs, or interface with IT for this.
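
As a rough illustration of the configuration sanity script idea above, here is a minimal sketch in Python. It assumes the checks boil down to "can this machine resolve and open a TCP connection to each service the builds depend on". The Dependency and CheckResult types and the function names are hypothetical, and the real list of hosts and ports would have to come from someone who knows the releng systems.

```python
import socket
from typing import List, NamedTuple


class Dependency(NamedTuple):
    """One service a build machine must be able to reach (hypothetical type)."""
    name: str
    host: str
    port: int


class CheckResult(NamedTuple):
    dependency: Dependency
    ok: bool
    detail: str


def check_dependency(dep: Dependency, timeout: float = 3.0) -> CheckResult:
    """Resolve the host, then try a TCP connect; report which step failed."""
    try:
        addr = socket.gethostbyname(dep.host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return CheckResult(dep, False, f"DNS lookup failed: {exc}")
    try:
        with socket.create_connection((addr, dep.port), timeout=timeout):
            return CheckResult(dep, True, f"connected to {addr}:{dep.port}")
    except OSError as exc:
        return CheckResult(dep, False, f"TCP connect to {addr}:{dep.port} failed: {exc}")


def run_checks(deps: List[Dependency], timeout: float = 3.0) -> List[CheckResult]:
    """Run every check, even after a failure, so one report covers everything."""
    return [check_dependency(dep, timeout) for dep in deps]


def failures(results: List[CheckResult]) -> List[CheckResult]:
    """Keep only the checks that did not pass."""
    return [result for result in results if not result.ok]
```

Run from a build slave before and after an IT change, a non-empty failures() list would flag the breakage before a build hits it. A real suite would probably also need protocol-level checks (e.g. an actual HTTP or SSH handshake), which this sketch does not attempt.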

Comments from the IT side (Zandr):

  • Search and destroy mission on single points of failure
    • It's way too easy to close the tree right now
  • Datacenter Engineering
    • We're having power issues in scl1 already, and we need to decorrelate failure modes better.