Taskcluster/Release Integration Meetings/2015-07-10

From MozillaWiki

Tracker Bug for Taskcluster Migration work
https://bugzil.la/1141248
Purpose: Bringing Up Blockers And Celebrating

To organize and track work toward moving to TaskCluster from Buildbot.

Attempted trello board:
https://trello.com/b/asIJ2pGC/taskcluster-migration

Announcements
TaskCluster Meeting recording


TODO:

  • Get a talk given about scopes (jonas) - Friday 10am in ReleaseEngineering vidyo room
  • Get a talk given about generic worker (pmoore) http://docs.taskcluster.net/presentations
    • brought anhad up to speed
    • selena - get a copy of the videos to send them to youtube
  • dustin: what's the plan for x86 fennec builds (specifically that architecture)? Let's get clarification. Also, is this something we have to move? We can likely do it easily.
    • yes, we need to move this
    • tests run _much_ faster on x86. there is even talk about running some tests here and not elsewhere (autophone), so we definitely need the builds
    • blocked on tests for now
    • sorry, to be specific: that architecture shouldn't be more problematic than the existing arch
  • [selena] - treeherder dev followup bug 1165469#c35 (escalation for this?). selena to follow up on this one; turns out it was not a TaskCluster problem
  • [selena] - make a roadmap


Agenda

  • Whistler results
    • ~10 groups became aware that taskcluster exists
    • triaged all of trello
  • Notifications [armen]
    • [selena] - send 'dev-tree-management' email when things impact TH
    • hal to start an email thread on ... global notifications
  • Issues for production TaskCluster
    • bug 1080265#c2 - selena to schedule a meeting with armen and go over TODAY
    • Security - hwine taking lead, RRAs
      • next thing: scopes
      • bhearsum: build promotion
      • Getting an RRA on funsize -- first production bits to run on TC (rail) :)
      • Buildbot Bridge (bhearsum)
  • Scopes - meeting with hal/dustin/jonas? Friday in the office??
    • rough timeline?
    • Give access to releng to scope config viewer (lightsofapollo) -selena will follow up
    • Audit the scopes - (hal)
  • OPEN ITEMS
    • [ahal] Either need to upgrade tester images (to Ubuntu 14.04) or downgrade builder ones (to 12.04) - bug 1175938
      • Have upgraded tester uploaded, but blocked on build failure
      • According to dustin, bug 1171033 should fix the issue, so tests are blocked on that or on upgrading the image
    • FOR NEXT WEEK: put





  • Q3 goals
    • IMPORTANT: let's focus on goals where we have a good set of known-knowns
    • Builds (coop)
      • OS X cross compile (ted)
        • Reproduced mshal's prior work inside mrrgn's desktop-build container.
        • Next steps: try running in taskcluster, look at packaging+symbols
        • Got prerequisite tools in tooltool, landed build-cctools script
      • all try builds running in TC?
        • reluctant to commit to production builds (e.g. nightly) until we figure out some of the security bits
      • anhad - generic worker porting (will hand off to anthony)
    • Deploy a Periodic task tool (selena) - hooks.taskcluster.net
    • Linux Tests (ahal)
    • Release Promotion (nthomas, bhearsum, rail)
      • IT WILL BE AWESOME
      • Release scheduling in TC (nthomas, bhearsum, rail)
    • Proxies and secret handling (dustin)
      • talking with jhford (secrets.taskcluster.net)
      • TODO [selena/jhford] RRA for secrets.taskcluster.net
    • We're not doing self-serve for TC anymore [X]
    • What else?
      • security? where do we need to get to for shipping.
        • Docker feature to prevent modification
          • (rail) Bug 1175561 - docker-worker: pull images by hash - almost done, needs deployment
          • TODO [garndt] get clarity on what S3 might do for us instead of docker/quay security wise, incl hal
        • Stop sharing machines
          • two classes of service: release and other (?)
          • checklist we need to address.. mostly scopes issue, configuration for release workers
          • TODO [selena] find a home for this checklist :)
        • Credentials management (try especially)
          • q3 builds will be shipped by buildbot, scheduled by taskcluster
          • build promotion?
          • funsize will be out of taskcluster in q3 -- partial updates -- and the only issue is the docker image issue
        • Locking down scopes
        • Authentication - Hawk?
          • how access keys get spread
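
The "pull images by hash" item above (Bug 1175561) can be illustrated with a small sketch. This is not docker-worker's actual implementation; it only assumes the common `name@sha256:<hex>` image reference form, and the function names and the example registry path are hypothetical. The content check hashes an arbitrary byte payload, which is a simplification of how a registry digest is really computed.

```python
import hashlib
import re

# Assumed reference form: "repo/name@sha256:<64 lowercase hex chars>".
_DIGEST_RE = re.compile(r"^(?P<name>[^@\s]+)@sha256:(?P<hex>[0-9a-f]{64})$")


def parse_pinned_image(ref):
    """Return (name, hex_digest) if ref is pinned by digest, else None."""
    match = _DIGEST_RE.match(ref)
    if match is None:
        return None
    return match.group("name"), match.group("hex")


def content_matches(expected_hex, payload):
    """Check that the SHA-256 of payload (bytes) equals the pinned digest."""
    return hashlib.sha256(payload).hexdigest() == expected_hex
```

A worker following this idea would refuse any task whose image reference is a mutable tag (for example `:latest`) and would only run content whose hash matches the pinned value.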


  • Communication
    • How should we publish information about future meetings?
      • Mailing list? release-engineering@mozilla.com, sheriffs@
      • Blog posts? yes




Who to contact about X

  • overall migration and figuring out who can help: selena
  • build migration: coop
  • tests migration: armen
  • individual goal work: see above for contacts


Components of build/test process that need to be addressed

  • Signing (rail)
  • Authentication, management of routing (dustin)
  • Updates (completes, partials) (rail)
  • Symbols (ted, amiyaguchi)
  • Tests (likely through the buildbot bridge)
  • Periodic Tasks - can do out of tree with something like crontabber (selena)
  • Task configuration for each task (need to enumerate the long tail)
  • Runner - deployments on Windows are an issue (coop)


Big Questions

  • Can runner help us avoid continuing to have buildbot infra?
    • Is this TaskCluster support of hardware pools? No, it would be using runner itself, which is currently launched by Puppet separately from TC or BB.
  • bhearsum built the BBB, who is moving the remaining buildbot scheduling to TC scheduling?
    • IIUC the setup is there but we have to move the scheduling to the TC decision tasks - please correct me if I misunderstood -- We are trying to figure this out now.
  • "sendchange" issues -> can we do tests and make check in parallel?
    • bug open for reducing make check (ted): https://bugzil.la/992323
    • parallelizing: possible if we want to go there (lightsofapollo)


Firefox releases/build promotion


Bridge (scheduling in TC, runs in either)


Signing - rail

  • MAR (and Linux) signing: https://bugzil.la/1149147
    • presentation at Whistler
    • Doesn't block signing on other platforms, might be part of a pipeline later


Nightly and periodic tasks (PGO, autoland, B2G Partners/Devices)


== Builds ==
FxOS Builds/tests


Firefox desktop builds



  • Windows builds - needs a Windows worker, a Windows AMI, and Windows infra to support builds
    • Cross compiling
      • Why? This would save us $ on Windows licenses for builders. The license cost for compiling is the same either way, but total cost is likely higher on Windows because compiling on Windows hosts is likely slower than cross compiling (need data, but strongly suspect)
      • Options:
        • Visual Studio under WINE (no one active)
        • clang-cl (longer term?)
    • We shouldn't be spending time trying to cross-compile Windows builds. We should just fix TaskCluster to be able to support Windows workers on AWS. +1
      • not as a first pass, sure, but I think this bears further investigation once we're running Windows builds on TC in AWS



Haz builds


Spidermonkey


l10n repacks


Thunderbird?

==================================================
Tests

  • Talos (perf tests) - jonas has done some experiments
    • probably stuck on hardware forever
    • possible plans for linux: running for 1 quarter on AWS side by side to see if it works or docker images on hardware
    • Hardware for Windows/Mac
    • Pandas? Off them by Q3, to move to AutoPhone
    • jmaher believes that we need more experimentation in the cloud
      • jmaher's current Q3 deliverable: Android talos tests off the pandas (moving to Autophone)
      • We have not been able to get clear answers about regressions in the past
      • 2 ways to solve this
        • all on the cloud
        • all on the hardware
      • take every build on inbound or fx-team and run talos on there
        • jonas said it is too much volume; what was the reason?
          • TODO: investigate more
        • We could take some tests off but we need most of them
        • Some tests require graphics
        • The media tests have specialized I/O devices
        • It would be meaningful if we get all tests running
        • Report to graph server
        • Once we see a regression we can compare if it is properly reported
        • All or None
          • For 28 tests we run on Linux32, if we cannot run all of them on VMs we should stick to hardware, otherwise, developers would get confused
          • The more vectors we add to a regression, the harder it will be to get traction from developers to fix it
          • For instance, tp5 fails on hardware and developers wonder why the other perf suites did not fail and then they discover that it is because it was running on a VM instead
          • It becomes two platforms to support instead of 1
            • Different network access, disks, memory
        • It would be nice if they could run the same VM setup locally
  • Linux tests - ahal - https://bugzil.la/1171033
    • [blocker] waiting on Linux64 builds to stop timing out before we can see them
      • ask Morgan or ahal about it
    • Containerized/Virtualized tests (have some in VMs now)
    • linux, including Android emulator tests - probably easy, since we already have them running in AWS
  • Windows (?) vmware! - will need some amount of effort to green up tests in a new environment, hard to estimate until we do a first pass
    • Buildbot Bridge?
    • Containers?
      • winjail?
  • Mac OS X
    • runner to replace buildbot master management?
    • Buildbot Bridge?
    • Containers?
  • All the rest of our tests
    • mac hardware
    • windows hardware
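
The "All or None" rule from the Talos discussion above can be written down as a tiny decision helper. This is only a sketch of the rule as stated in the notes (if any suite cannot run on VMs, keep everything on hardware); the function name and the suite names in the usage note are hypothetical.

```python
def choose_talos_platform(all_suites, vm_capable_suites):
    """Apply the all-or-none rule.

    Returns ("vm", []) only when every suite can run on VMs. Otherwise
    returns ("hardware", missing), where missing is the sorted list of
    suites that block the move.
    """
    missing = sorted(set(all_suites) - set(vm_capable_suites))
    if missing:
        return "hardware", missing
    return "vm", []
```

For example, calling it with suites `["suite_a", "suite_b"]` where only `suite_a` is VM-capable keeps everything on hardware and reports `suite_b` as the blocker, so developers only ever see perf results from one platform.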


Data source:


== Blog Posts and other resources ==
TC tutorial recording:
https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH
http://www.chesnok.com/daily/2015/05/29/migrating-to-taskcluster-work-underway/
http://www.chesnok.com/daily/2015/06/02/taskcluster-migration-a-hello-world-for-worker-task-creator/
https://etherpad.mozilla.org/taskcluster-hello-world
https://etherpad.mozilla.org/taskcluster-migration-scheduler

Buildbot Bridge
https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879

Funsize


Build Promotion/release promotion

=== ARCHIVE ===

  • Status of Q2 goals
    • Linux builds available thru TC (mrrgn)
      • debug & opt hooked up & triggered on try
      • working on asan builds
      • 32-bit on its way
    • Fennec builds available thru TC (dustin)
      • no 32-bit fennec (WOOO)
      • Fennec needs tooltool - working on a proxy
      • x86 builds? (q for catlee - we need them but probably hasn't looked at them)
    • Generic worker running on Windows by EOQ (pmoore)
      • Can we get a recorded presentation about this? (coop gonna bug him - sure - i'll make a recording)
    • https://etherpad.mozilla.org/tc-builds-whistler-tasks (let's do things! android, other things welcome! room will be 10-12 people! HEY JONAS you can use our room if you want to hack, but not if you want to meet&talk) Look in sched.org
    • https://etherpad.mozilla.org/jonasfj-taskcluster-whistler-subjects
    • armenzg: talk about self-serve for TC
      • TODO: selena remove this entry when you read this
      • change of plans to remove dependencies from a task graph
      • http://bit.ly/1HkhrAz
    • dustin: permacreds
      • a millennium of credentials, ask jonas for your creds
      • we need to be able to expire these :)
    • armenzg: talos
      • jonathan and joel: need to be running all of them on hardware, or all of them on VM
      • current plan for hardware is to use generic worker; will keep enough hardware (non-virtualized) to run talos
      • XP, 7, 8, (probably) Windows 10 - need GPO integration for the generic worker for helping with the tests. 2008 will be running puppet in AWS and will need puppet integration
      • Need work on Mac for Generic Worker
      • Q3: Linux, Q4/Q1: Windows and Mac
    • the scheduler: https://etherpad.mozilla.org/taskcluster-migration-scheduler
    • project: coalescing as a service
      • please don't lose the ability to schedule only using the tree