User:Catlee/LargeWoodenRabbit
Architecture
Events are triggered by external notifications (e.g. a push to mozilla-central, or a RelEng request for particular maintenance to be performed on a machine) or by internal timers. An event has a name and a payload as its primary attributes. It probably also has things like an author and a timestamp, and maybe a signature.
There is a configurable set of schedulers, each listening to one or more event types. A scheduler is a special type of job whose primary task is to create more jobs. A scheduler has a name and a job template. When an event arrives that a scheduler is subscribed to, a job is created according to the job template. The event details are attached to the job. The job is then scheduled like any regular job.
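The event-to-job flow described above could be sketched roughly as follows. This is only an illustration: the class names, the `handle` and `dispatch` functions, and the dictionary shape of the job template are all assumptions made for the example, not a settled LWR design.

```python
from dataclasses import dataclass, field
import time


@dataclass
class Event:
    name: str                 # event type, e.g. "push"
    payload: dict             # event details
    author: str = ""          # optional attribute (assumed)
    timestamp: float = field(default_factory=time.time)


@dataclass
class Job:
    name: str
    state: str                # "runnable" or "pending"
    event: Event              # the event details attached to the job
    params: dict


@dataclass
class Scheduler:
    name: str
    event_names: set          # event types this scheduler is subscribed to
    job_template: dict        # hypothetical shape: {"name": ..., "params": {...}}

    def handle(self, event):
        """Create jobs from the template if this scheduler is subscribed."""
        if event.name not in self.event_names:
            return []
        return [Job(
            name=self.job_template["name"],
            state="runnable",
            event=event,
            params=dict(self.job_template.get("params", {})),
        )]


def dispatch(event, schedulers):
    """Offer an event to every scheduler and collect the jobs they create."""
    jobs = []
    for scheduler in schedulers:
        jobs.extend(scheduler.handle(event))
    return jobs


sched = Scheduler("m-c builds", {"push"},
                  {"name": "build", "params": {"branch": "mozilla-central"}})
push = Event("push", {"rev": "abc123"})
new_jobs = dispatch(push, [sched])
```

Here `new_jobs` holds a single "build" job carrying the push event; an event the scheduler is not subscribed to produces no jobs.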
Schedulers create jobs and possibly a jobset. A jobset contains a job graph that describes inter-job dependencies.
Jobs are created as "runnable" or "pending" initially. A job moves from "pending" to "runnable" once the dependencies specified for it in the job graph have completed.
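A minimal sketch of a jobset and the pending-to-runnable transition might look like this. The `JobSet` class, its `complete` method, and the "complete" state name are hypothetical; the text above only names the "pending" and "runnable" states.

```python
class JobSet:
    """Holds a job graph: each job name maps to the jobs it depends on."""

    def __init__(self, dependencies):
        self.deps = {job: set(d) for job, d in dependencies.items()}
        # Jobs named only as dependencies are treated as having none themselves.
        for d in set().union(*self.deps.values()):
            self.deps.setdefault(d, set())
        # Jobs without dependencies start out runnable, the rest pending.
        self.state = {job: ("pending" if d else "runnable")
                      for job, d in self.deps.items()}

    def complete(self, job):
        """Mark a job complete and promote any pending jobs it unblocks."""
        self.state[job] = "complete"
        promoted = []
        for other, deps in self.deps.items():
            if (self.state[other] == "pending"
                    and all(self.state[d] == "complete" for d in deps)):
                self.state[other] = "runnable"
                promoted.append(other)
        return promoted


jobset = JobSet({"build": [], "test": ["build"], "package": ["build", "test"]})
initial_state = dict(jobset.state)
after_build = jobset.complete("build")   # unblocks "test" only
after_test = jobset.complete("test")     # unblocks "package"
```

This sketch does not check that the graph is acyclic; a real implementation would need to reject cycles when the jobset is created.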
Goals / Antigoals
LWR will:
- allow you to simply rebuild failed tasks, or hierarchies of tasks. These rebuilt tasks can satisfy previous dependency graphs.
- allow you to specify DAGs for job dependencies
- be able to change scheduling at run time via a web interface or API
- operate at scale. This means:
  - support multiple distributed clusters of slaves and 'masters'
  - support 10^5 slaves
  - support 10^5 pending jobs
LWR will not:
- gain you access to fortified french castles
Open questions
Assigning jobs to slaves
I'd like to be able to tag slaves with one or more tags, e.g. 'linux' or 'fast', as well as a tag for the slave's hostname. Jobs can have slave tags to indicate which type of slave they need to run on. I'd like a job to be able to specify multiple slave tags, which means that the job only runs on slaves that carry all of the specified tags.
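The "all specified tags" rule amounts to a subset test: a slave is eligible when the job's tags are a subset of the slave's tags. A small sketch, where the function name and the example slave names are made up for illustration:

```python
def eligible_slaves(job_tags, slaves):
    """Return the names of slaves that carry every tag the job asks for.

    slaves maps a slave name to its collection of tags.
    """
    required = set(job_tags)
    return [name for name, tags in slaves.items() if required <= set(tags)]


# Hypothetical slaves; each one also carries its own hostname as a tag.
slaves = {
    "linux-01": {"linux", "fast", "linux-01"},
    "linux-02": {"linux", "linux-02"},
    "win-01": {"windows", "fast", "win-01"},
}

fast_linux = eligible_slaves({"linux", "fast"}, slaves)
any_linux = eligible_slaves({"linux"}, slaves)
pinned = eligible_slaves({"win-01"}, slaves)
```

Because the hostname is itself a tag, pinning a job to one machine needs no special case.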
The problem is how to efficiently assign jobs to slaves when we have different overall job priorities and different preferences of slaves per job. Doing the naive thing is O(R*S), where R is the number of runnable jobs and S is the number of slaves.
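For reference, one reading of "the naive thing" is: walk the runnable jobs in priority order and, for each one, scan the idle slaves for the first that matches. The sketch below assumes that a larger number means higher priority and ignores per-job slave preferences beyond tag matching; the function name and data shapes are hypothetical.

```python
def assign_naive(runnable_jobs, idle_slaves):
    """Assign jobs to slaves by brute force.

    runnable_jobs: list of (priority, job_name, required_tags)
    idle_slaves:   dict mapping slave name to its tags
    Returns a dict mapping job name to the slave chosen for it.
    """
    free = {name: set(tags) for name, tags in idle_slaves.items()}
    assignments = {}
    # Highest priority first (assumption: larger number = more urgent).
    for _priority, job, tags in sorted(runnable_jobs, key=lambda j: -j[0]):
        required = set(tags)
        # Inner scan over the remaining slaves: this is the S in O(R*S).
        chosen = next((s for s, t in free.items() if required <= t), None)
        if chosen is not None:
            assignments[job] = chosen
            del free[chosen]
    return assignments


jobs = [
    (1, "docs", {"linux"}),
    (5, "release", {"linux", "fast"}),
    (3, "mac-test", {"mac"}),
]
idle = {"linux-01": {"linux"}, "linux-02": {"linux", "fast"}}
result = assign_naive(jobs, idle)
```

In this example "release" takes the fast slave, "docs" takes the remaining one, and "mac-test" stays unassigned because no idle slave matches. Each job may scan every remaining slave, which is where the R*S cost comes from.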