ReleaseEngineering/How To/Fix a hung scheduler master

From MozillaWiki

bhearsum needs to clean this up.

NOTE: The old schedulers you are looking for may already be gone. A weekly process removes old schedulers from the DB.

When a scheduler master hangs, it is almost always because schedulers that used to exist have been added again. When this happens, their very old state in the DB makes them look through months of old changes, which can take a very long time. To fix:

  • Stop the master
  • Remove the offending schedulers from the DB (e.g. delete from schedulers where name='xxx';)
  • Start the master
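
The removal step can also be scripted. The following is a minimal sketch, not the production tooling: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real scheduler DB, and it assumes only the "name" and "state" columns shown on this page. The function names, the demo schema and the second demo row are hypothetical. Run anything like this only while the master is stopped.

```python
import sqlite3


def remove_schedulers(conn, names):
    """Delete the named schedulers; return the number of rows removed."""
    cur = conn.cursor()
    removed = 0
    for name in names:
        # Parameterized query. sqlite3 uses "?" placeholders; other
        # DB-API drivers may use a different placeholder style.
        cur.execute("DELETE FROM schedulers WHERE name = ?", (name,))
        removed += cur.rowcount
    conn.commit()
    return removed


def make_demo_db():
    """Build a throwaway in-memory table (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schedulers (name TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO schedulers (name, state) VALUES (?, ?)",
        [
            ("tests-mozilla-central-mountainlion-opt-unittest",
             '{"last_processed": 1635317}'),
            # Made-up row, present only so the demo has something to keep.
            ("some-other-scheduler", '{"last_processed": 1700000}'),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    count = remove_schedulers(
        demo, ["tests-mozilla-central-mountainlion-opt-unittest"]
    )
    print("removed", count, "scheduler row(s)")
```

Passing the names as query parameters rather than pasting them into the SQL string avoids quoting mistakes when a scheduler name contains unusual characters.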

Figuring out which schedulers you need to remove can be tricky. Many of the schedulers in the database are old and unused. The best way to start is to look over the changes that were just merged to production and reason about which schedulers they may have added. For example, if OS X 10.8 (Mountain Lion) tests were just enabled, you can use the following query to look at the state of those schedulers:

select name, state from schedulers where name like '%mountainlion%';

That query will return rows like this:

| name                                                      | state                       |
| tests-mozilla-central-mountainlion-opt-unittest           | {"last_processed": 1635317} | 

The "last_processed" integer is a reference to a changeid, and you can get details about that change by selecting from the changes table with a query like:

select branch, revision, when_timestamp from changes where changeid=1635317;

Which will give you something like:

| branch                             | revision     | when_timestamp |
| mozilla-inbound-win64-opt-unittest | 19a91d0fd50b |     1346332788 | 
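
The two lookups above can be chained in a small script. This is a sketch under assumptions: it uses an in-memory sqlite3 database in place of the real scheduler DB, the demo schema contains only the columns named on this page, and the helper names are hypothetical. The demo rows reuse the sample values shown above.

```python
import json
import sqlite3


def last_processed_change(conn, scheduler_name):
    """Return (branch, revision, when_timestamp) for the change that the
    scheduler's "last_processed" state points at, or None if unavailable."""
    row = conn.execute(
        "SELECT state FROM schedulers WHERE name = ?", (scheduler_name,)
    ).fetchone()
    if row is None:
        return None  # no such scheduler
    # The state column holds a JSON object such as {"last_processed": 1635317}.
    changeid = json.loads(row[0]).get("last_processed")
    if changeid is None:
        return None  # state has no last_processed entry
    return conn.execute(
        "SELECT branch, revision, when_timestamp FROM changes "
        "WHERE changeid = ?",
        (changeid,),
    ).fetchone()


def make_demo_db():
    """Build a throwaway in-memory DB (hypothetical, reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schedulers (name TEXT, state TEXT)")
    conn.execute(
        "CREATE TABLE changes (changeid INTEGER, branch TEXT, "
        "revision TEXT, when_timestamp INTEGER)"
    )
    conn.execute(
        "INSERT INTO schedulers VALUES (?, ?)",
        ("tests-mozilla-central-mountainlion-opt-unittest",
         '{"last_processed": 1635317}'),
    )
    conn.execute(
        "INSERT INTO changes VALUES (?, ?, ?, ?)",
        (1635317, "mozilla-inbound-win64-opt-unittest",
         "19a91d0fd50b", 1346332788),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(last_processed_change(
        demo, "tests-mozilla-central-mountainlion-opt-unittest"))
```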

With that, you can compare the revision and timestamp to the current ones. If the scheduler is more than one revision or more than 30 minutes behind, it should probably be removed.
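
The rule of thumb can be written down as a small check. This is only an illustration of the heuristic, assuming that when_timestamp is in Unix epoch seconds and that you have already worked out how many revisions behind the scheduler is. The function and parameter names are hypothetical.

```python
THIRTY_MINUTES = 30 * 60  # seconds


def should_remove(last_processed_ts, current_ts, revisions_behind,
                  max_age_seconds=THIRTY_MINUTES):
    """Apply the rule of thumb: a scheduler is a removal candidate if it is
    more than one revision behind, or more than max_age_seconds behind.

    last_processed_ts: when_timestamp of the scheduler's last_processed change
    current_ts:        when_timestamp of the most recent change
    revisions_behind:  number of revisions between the two (counted by hand
                       or by a query of your choosing)
    """
    too_many_revisions = revisions_behind > 1
    too_old = (current_ts - last_processed_ts) > max_age_seconds
    return too_many_revisions or too_old


if __name__ == "__main__":
    # Sample timestamp from this page, compared against a change made
    # two hours later (the two-hour gap is made up for the demo).
    print(should_remove(1346332788, 1346332788 + 2 * 60 * 60, 0))
```

Treat the result as a hint rather than a verdict: a scheduler on a quiet branch can legitimately be more than 30 minutes behind.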