User:Djmitche/Slave Allocator Proposal


Slave-side Operation

Currently implemented for Puppet-administered Linux and Mac slaves, but not for mobile, Windows, or miscellaneous slaves.

Using runslave.py, each slave requests a fresh buildbot.tac before each start of the buildslave, by issuing an HTTP GET to a simple URL containing the slave's hostname.

If the GET fails and a buildbot.tac already exists, the existing file is used. If a file named DO_NOT_START exists in the basedir, then no .tac file is requested and the slave is not started.
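
For illustration, here is a condensed Python sketch of that slave-side logic; the allocator URL is a hypothetical placeholder, not the real endpoint runslave.py uses:

  import os
  import socket
  import urllib.request

  ALLOCATOR_URL = "http://slavealloc.example.com/gettac/%s"  # hypothetical

  def get_tac(basedir):
      """Fetch a fresh buildbot.tac, honoring DO_NOT_START and falling
      back to an existing .tac if the GET fails."""
      if os.path.exists(os.path.join(basedir, "DO_NOT_START")):
          return None  # do not request a .tac; do not start the slave
      tacfile = os.path.join(basedir, "buildbot.tac")
      try:
          url = ALLOCATOR_URL % socket.gethostname()
          with urllib.request.urlopen(url) as resp:
              tac = resp.read()
          with open(tacfile, "wb") as f:
              f.write(tac)
      except OSError:
          if not os.path.exists(tacfile):
              raise  # GET failed and there is no fallback .tac
      return tacfile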

Runslave currently guesses the buildslave directory based on the hostname. We should have a more reliable way of doing this - bug 622980 - but this will do for now.

Allocator Design

Partially implemented locally; nothing committed.

Intention

When a slave starts up, it should be allocated to the master that most needs it at that moment. This means that the appropriate master is calculated for each HTTP request, rather than being statically calculated and stored (e.g., in a database).

This should be a minimally disruptive service. That is, we should not often need to poke and prod the slave allocator. It is not a slave management console by any stretch (that's another project).

Data

The allocator uses lists of slaves and masters - slaves are currently statically calculated from inventory; masters are based on Masters and Machine Bookings, but could potentially be based on catlee's masters JSON file.

Silos

Slaves are assigned to silos based on their innate characteristics. Those characteristics are:

  • environment
  • purpose
  • distro
  • bitlength
  • datacenter
  • trustlevel (corresponds to commit levels)

Each 6-tuple of such values constitutes a silo. Current silos, with counts, are (roughly - I know there are some problems with this list):

environment purpose   distro        bits dc  trustlevel count
preprod     build     centos5       32   mpt core       1    
preprod     build     centos5       64   mpt core       1    
preprod     build     darwin10      32   mpt core       1    
preprod     build     darwin9       32   mpt core       1    
preprod     test      darwin10      32   scl core       1    
preprod     test      darwin9       32   scl core       1    
preprod     test      fedora12      32   scl core       1    
preprod     test      fedora12      64   scl core       1    
production  build     centos5       32   mpt core       42   
production  build     centos5       32   mpt tryuser    33   
production  build     centos5       32   mtv core       25   
production  build     centos5       32   mtv tryuser    10   
production  build     centos5       32   scl core       27   
production  build     centos5       64   mpt core       10   
production  build     centos5       64   mpt tryuser    10   
production  build     darwin10      32   mpt core       24   
production  build     darwin10      32   mtv core       11   
production  build     darwin9       32   mpt core       56   
production  build     darwin9       32   mpt tryuser    36   
production  build     darwin9       32   mtv core       9    
production  build     darwin9       32   mtv tryuser    13   
production  build     darwin9       64   mpt tryuser    26   
production  build     win2k3sp2     32   mpt core       56   
production  build     win2k3sp2     32   mpt tryuser    35   
production  build     win2k3sp2     32   mtv core       47   
production  build     win2k3sp2     32   scl core       17   
production  build     win2k3sp2     64   mtv core       1    
production  geriatric darwin8       32   mpt core       18   
production  geriatric darwin8       32   mpt tryuser    2    
production  test      android       32   mtv core       14   
production  test      android-n900  32   mtv core       40   
production  test      android-tegra 32   mtv core       93   
production  test      darwin10      32   scl core       52   
production  test      darwin9       32   scl core       50   
production  test      fedora12      32   scl core       50   
production  test      fedora12      64   scl core       51   
production  test      win7          32   scl core       48   
production  test      win7          64   scl core       50   
production  test      winxp         32   scl core       46   
staging     build     centos5       32   mpt core       4    
staging     build     centos5       32   mtv core       3    
staging     build     centos5       64   mpt core       1    
staging     build     darwin10      32   mpt core       4    
staging     build     darwin10      32   mtv core       2    
staging     build     darwin9       32   mpt core       3    
staging     build     win2k3sp2     32   mpt core       4    
staging     build     win2k3sp2     32   mtv core       3    
staging     test      darwin10      32   scl core       2    
staging     test      darwin9       32   scl core       2    
staging     test      fedora12      32   scl core       2    
staging     test      fedora12      64   scl core       1    
staging     test      win7          32   scl core       3    
staging     test      winxp         32   scl core       6    
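
To make the structure concrete, a silo can be modeled as a simple value type keyed on those six characteristics. This is only a sketch; the field names follow the list above:

  from collections import namedtuple

  Silo = namedtuple("Silo",
                    "environment purpose distro bits datacenter trustlevel")

  # the silo used in the balancing example later in this document:
  silo = Silo("production", "test", "darwin9", 32, "scl", "core")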

Pools

Each master and slave is assigned to a pool. My rough sketch of the available pools is:

pool           nslaves masters                          
bm-scl         44      bm01 bm02                        
geriatric      20      geriatric                        
pm-mpt         285     pm pmX pm01 pm02 pm03            
pmm-mtv        147     pmm01 pmm02                      
preprod        8       pp                               
sched          0       scheduler_master tests_scheduler 
staging        24      sm01 sm02 sm03 sm04 rail bhearsum
staging-mobile 0       smm smm2                         
staging-tests  16      mozilla-tests1 mozilla-tests2    
tm-mpt         0       talos-master                     
tm-mtv         0       talos-master02 tm01 tm02         
tm-scl         347     tm03 tm04 tm05 tm06              
try            159     try_trunk_master                 

Are the pm (production master) and bm (builder master) masters doing roughly the same thing? Should I name them 'bm-scl' and 'bm-mpt'? --Dmitchell@mozilla.com 02:23, 5 January 2011 (PST)
Aki sez: yes. also, smm* and pmm* can probably be taken out unless you're dealing with n900s
Great, so pm == bm; I want to leave the mobile masters in there for now, even if they aren't being allocated, so that they're ready to roll if/when we decide to start hooking phones or foopies up to them. --Dmitchell@mozilla.com 16:07, 5 January 2011 (PST)

Runtime Data

The allocator needs some estimate of which slaves are attached to which masters. A reasonable approximation is simply to remember allocations when they are made: if slave S is assigned to master M, then assume S is attached to M until it is reassigned.
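
As a sketch, that approximation is just a mapping updated on every allocation (the names here are illustrative):

  attachments = {}  # slave name -> master it was last allocated to

  def record_allocation(slave, master):
      """Assume `slave` is attached to `master` until it is reassigned."""
      attachments[slave] = master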

The allocator also needs an up-to-date list of active masters. This doesn't change so often, so a static list (e.g., a human-edited DB table) is a reasonable approximation.

A more accurate determination of both can be made by polling the buildmasters: request a buildmaster's /about page before allocating a slave to it (to check up-ness), and periodically request and scrape the /buildslaves?no_builders=1 page to determine which slaves are actually attached to the master.
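
A rough sketch of that polling, using only the Python standard library; the link-scraping regex is an assumption about the buildmaster page's markup, not a tested scraper:

  import re
  import urllib.request

  def master_is_up(base_url, timeout=10):
      """Check up-ness by requesting the master's /about page."""
      try:
          with urllib.request.urlopen(base_url + "/about", timeout=timeout):
              return True
      except OSError:
          return False

  def attached_slaves(base_url, timeout=10):
      """Scrape the buildslaves page for the names of attached slaves."""
      url = base_url + "/buildslaves?no_builders=1"
      with urllib.request.urlopen(url, timeout=timeout) as page:
          html = page.read().decode("utf-8", "replace")
      # assumes each slave is linked as buildslaves/<name>
      return set(re.findall(r'href="buildslaves/([^"/]+)"', html))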

Balancing

Within each pool, the allocator attempts to balance the slaves in each silo across the available masters. Let's take an example from the data above. Specifically, let's look at the tm-scl pool and the <production, test, darwin9, 32, scl, core> silo. There are 50 slaves in this silo, all of which are assigned to this pool, and 4 masters in the pool. The ideal allocation, then, will attach 12 or 13 slaves to each of tm03..tm06.

The balancing algorithm proceeds as follows (see the sketch after this list):

  • determine the pool for the slave
  • determine the silo for the slave
  • for each active master in the pool, count the number of attached slaves from the silo
  • attach the new slave to the master with the lowest count, sorting by master name where counts are equal
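
A sketch of that selection step, assuming we have the silo's slave names and the remembered attachments from the runtime data above:

  def choose_master(pool_masters, silo_slaves, attachments):
      """Pick the active master with the fewest attached slaves from this
      silo, breaking ties by master name."""
      def silo_count(master):
          return sum(1 for s in silo_slaves if attachments.get(s) == master)
      return min(pool_masters, key=lambda m: (silo_count(m), m))

  # e.g., with two silo slaves remembered on tm03 and none elsewhere,
  # the next slave in the silo goes to tm04:
  attachments = {"slave1": "tm03", "slave2": "tm03"}
  print(choose_master(["tm03", "tm04", "tm05", "tm06"],
                      {"slave1", "slave2", "slave3"}, attachments))  # tm04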

Implementation

The slave allocator needs to both serve HTTP requests for .tac files and run background operations such as polling masters. None of the "cool" Python web frameworks that I know of (Pylons, Django, Plone) support this without lots of gymnastics, but trusty old Twisted supports it quite nicely.

I'm building the slave allocator to run as a Twisted daemon. It has a command-line utility for all other management tasks, so there is no web-based management UI. The command-line utility currently speaks directly to the database, but that can certainly be changed later.
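
A minimal sketch of that shape, as a .tac file run under twistd; the port, polling interval, and resource body are placeholders:

  from twisted.application import internet, service
  from twisted.web import resource, server

  class TacRoot(resource.Resource):
      """Serves a buildbot.tac in response to a slave's GET."""
      isLeaf = True

      def render_GET(self, request):
          # look up the slave by hostname, run the balancing algorithm,
          # and return the rendered buildbot.tac (stubbed out here)
          return b"# buildbot.tac contents would go here\n"

  def poll_masters():
      # periodically scrape each master's buildslaves page (see above)
      pass

  application = service.Application("slavealloc")
  internet.TCPServer(8010, server.Site(TacRoot())).setServiceParent(application)
  internet.TimerService(300, poll_masters).setServiceParent(application)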

Not Another Standalone System!

Yep! I couldn't find any other system into which this would fit nicely.

Running a Downtime

When it's time for a rolling downtime, we select at most one master at a time from each pool, configure the allocator to allocate slaves away from it, and wait until it has no more slaves attached before shutting it down (no need to be graceful, but it won't hurt!). This requires a little coordination between engineers to ensure that we don't shut down too many masters in a pool, but assuming each pool has a sufficient number of masters to run with one master down, we can perform a rolling downtime with no loss in capacity.
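
For a single master, the drain step might look like the sketch below; disable_allocation stands in for however the command-line utility marks a master as drained, and attached_slaves is the scraper sketched under Runtime Data:

  import time

  def drain_master(name, base_url, disable_allocation):
      disable_allocation(name)  # new requests now allocate elsewhere
      while attached_slaves(base_url):
          time.sleep(60)  # attached slaves drift away as they restart
      print("%s has no attached slaves; safe to shut down" % name)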

Relationship to Pods

Pods are a great way to conceptualize ensuring sufficient redundancy for release engineering infrastructure, but they inform the decision of where to put particular types of slaves and masters, not how those machines are connected day to day. Once that placement decision is made, the slave allocator's job is very simple: connect slaves to appropriate masters, as described above. So the allocator is not particularly concerned with pods in and of themselves, although pod-related considerations guide the decisions that create the data in the allocator's masters and slaves tables.