- Still seeing intermittent sync failures (bug 1038678); tweaked the ssh timeout on zeus, but mirror-pull could still stand to be more resilient (see the retry sketch after this list)
- Add'l hgweb nodes? (bug 1049519) Added two spares, but how many should we have?
- Deployed user repository fixes
- Deployed the serverlog extension on the cluster and debugged it
- Built Python debug-symbol packages and installed them on hgweb1
- Diagnosed some traffic issues and reported the statistics verbally (on IRC)
- Added two build trees to DXR! Staging is working again, all cron and config bits are now in the build repo, and the build script has been refactored
- Two new hgweb nodes provisioned (9 & 10); added new webhead docs
- Configured local2 syslog logging for the new pash_wrapper and gps' extensions (the syslog routing is sketched after this list)
- On call last week; only one late page. Clarified how unimportant the current nagios alert is (it's a leading indicator with roughly an 80% false-positive rate)
- Started releng intern Mihai Tabara on digging through logs from around the start of each event to find the root cause of the issues
- Installed pash_wrapper for ssh
- More headcount justifications
- Got approval from bmoss for the extra blades
- Chased hgweb spins around. With bkero, got debug symbols installed on hgweb1 to pull Python tracebacks from running processes (the gdb recipe is sketched below). What actually seems to be the case (n=2) is that the spins are happening while at apr_poll()→poll() in mod_wsgi, which is weird. I'd like to get a few more backtraces out of a spinning webhead to be sure that wasn't a fluke. Did a bunch of theorizing around mod_wsgi spin causes, directives we could frob, etc.
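
For the record, the gdb recipe used on hgweb1 is roughly the following, once the Python debug symbols (and gdb's libpython helpers) are in place; the pid is a placeholder:

    # Attach to a spinning httpd/mod_wsgi worker (pid is hypothetical)
    $ gdb -p 12345
    (gdb) bt       # native C stack -- this is where apr_poll()->poll() showed up
    (gdb) py-bt    # Python-level stack; this is the part that needs the debug packages
    (gdb) detach
    (gdb) quit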
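
The local2 logging mentioned above is ordinary syslog routing; a minimal sketch of the kind of rsyslog rule involved (the drop-in path and log file are assumptions, not our exact config):

    # /etc/rsyslog.d/hg.conf -- route local2 (pash_wrapper, hg extensions) to its own file
    local2.*    /var/log/hg/hg.log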
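
On the mirror-pull resilience point, the cheapest hardening is probably retry-with-backoff around the pull itself. A minimal Python sketch, with a hypothetical repo path and illustrative delays (not the actual mirror-pull code):

    import subprocess
    import time

    def pull_with_retry(repo_path, attempts=3, base_delay=5):
        # Retry `hg pull` with exponential backoff instead of giving up
        # on the first ssh timeout; all values here are illustrative.
        for attempt in range(attempts):
            if subprocess.call(["hg", "-R", repo_path, "pull"]) == 0:
                return True
            time.sleep(base_delay * 2 ** attempt)  # wait 5s, then 10s, then 20s
        return False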
Planned for this week
- Update mod_wsgi and deploy it
- Tune mod_wsgi settings to see whether that alleviates the unavailability (example directives are sketched after this list)
- Parse logs for patterns/errors/statistics (a parsing sketch is after this list)
- Push for hg update
- ReviewBoard webheads/admin node
- hg firefighting
- More build repos in DXR
- Add more monitoring and correlation
- Get help from srich with the ReviewBoard deployment
- Status board bug
- Other than that: what's the most helpful thing I can do?
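
For the mod_wsgi tuning item, the knobs most relevant to wedged or spinning daemon processes live on WSGIDaemonProcess; an illustrative example of the sort of change under discussion (every value is a placeholder, not a recommendation):

    # Illustrative WSGIDaemonProcess settings -- values are placeholders
    WSGIDaemonProcess hgweb processes=8 threads=1 \
        maximum-requests=10000 \
        inactivity-timeout=300 \
        deadlock-timeout=60

Roughly: maximum-requests recycles workers before slow leaks pile up, while the two timeouts bound how long a stuck process can sit before mod_wsgi restarts it.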
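
For the log parsing, a minimal sketch of the kind of first pass I have in mind over the Apache error log (the path and line format are assumptions):

    import collections
    import re

    # Hypothetical log path; the pattern assumes classic httpd error_log lines
    LOG = "/var/log/httpd/error_log"
    line_re = re.compile(r"^\[(?P<ts>[^\]]+)\] \[error\] (?P<msg>.*)")

    counts = collections.Counter()
    with open(LOG) as fh:
        for line in fh:
            m = line_re.match(line)
            if m:
                # Bucket by message prefix so near-identical errors group together
                counts[m.group("msg")[:60]] += 1

    for msg, n in counts.most_common(20):
        print(n, msg)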