ReleaseEngineering/IT Resource Job Description

From MozillaWiki

Revision as of 20:11, 30 November 2010

This is a general description of activities on which we'd like to collaborate with someone from IT.

  • remote power control/remote console solutions
    • eliminates *most* of the need for NOC monkeys if we can handle remote reboots ourselves
  • nagios improvements
  • munin improvements
    • switch to ganglia?
  • puppet support/maintenance
  • the IT person needs to run our configs (e.g. Build VPN) to know how and when things are broken
  • configuration sanity script development and maintenance
    • based on knowledge of how the releng systems work and what they need to connect to, develop a suite of tools that can verify that IT changes won't affect builds *before* builds actually fail
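
As a rough illustration of what one such sanity script might look like, here is a minimal Python sketch that probes a list of endpoints over TCP and reports the ones that cannot be reached. The hostnames and the names `REQUIRED_ENDPOINTS`, `can_connect`, and `failed_endpoints` are all hypothetical and not taken from any existing releng tool; a real suite would presumably check much more than raw connectivity.

```python
import socket

# Hypothetical endpoints a build machine must reach (assumption, not real hosts).
REQUIRED_ENDPOINTS = [
    ("puppet.example.invalid", 8140),
    ("hg.example.invalid", 443),
]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def failed_endpoints(endpoints, probe=can_connect):
    """Return the endpoints the probe could not reach, in input order."""
    return [(host, port) for host, port in endpoints if not probe(host, port)]
```

Running `failed_endpoints(REQUIRED_ENDPOINTS)` before and after an IT change, from a machine on the build network, would give a quick signal that something builds depend on has become unreachable. The `probe` argument is there so that other checks could be swapped in without changing the reporting logic.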


Comments from the IT side: (Zandr)

  • Search and destroy mission on single points of failure
    • It's way too easy to close the tree right now
  • Datacenter Engineering
    • We're having power issues in scl1 already, and we need to decorrelate failure modes better.