Taskcluster migration status: Difference between revisions

From MozillaWiki
(Page generated from https://etherpad.mozilla.org/taskcluster-migration-status by Special:ImportFromEtherpad)
 
 
'''Tracker Bug for Taskcluster Migration work'''<br />
See https://public.etherpad-mozilla.org/p/buildbot-to-taskcluster-migration
[https://bugzil.la/1141248 https://bugzil.la/1141248]<br />
Purpose: '''Bringing Up Blockers And Celebrating'''<br />
To organize and track work toward moving to TaskCluster from Buildbot.
 
Attempted trello board:<br />
[https://trello.com/b/asIJ2pGC/taskcluster-migration https://trello.com/b/asIJ2pGC/taskcluster-migration]
 
'''Announcements'''<br />
TaskCluster Meeting recording<br />
 
* May 29: [https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH]
* June 12: [https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=vyeYxrpVarOQTZN https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=vyeYxrpVarOQTZN]
* June 19: [https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=dpasghNdnSWOnYf https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=dpasghNdnSWOnYf]
* July 10: [https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=17FI2iUmQE5J4ts https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=17FI2iUmQE5J4ts]
 
 
TODO:<br />
 
* <s>Get a talk given about scopes (jonas) - Friday 10am in ReleaseEngineering vidyo room</s>
* <s>Get a talk given about generic worker (pmoore)</s> [http://docs.taskcluster.net/presentations http://docs.taskcluster.net/presentations]
** brought anhad up to speed
** selena - get a copy of the videos to send them to youtube
* dustin: what's the plan for x86 fennec builds (specifically that architecture)? Let's get clarification. Also, is this something we have to move? Can likely do it easily.
** yes, we need to move this
** tests run _much_ faster on x86. there is even talk about running some tests here and not elsewhere (autophone), so we definitely need the builds
** blocked on tests for now
** sorry specifically that architecture shouldn't be more problematic than the existing arch
* [selena] - treeherder dev followup {{bug|1165469#c35}} (escalation for this?). selena to follow up on this one; turns out it was not a Taskcluster problem
* [selena] - make a roadmap
 
 
'''Agenda'''<br />
 
* Whistler results
** ~10 groups became aware that taskcluster exists
** triaged all of trello
* Notifications [armen]
** [selena] - send 'dev-tree-management' email when things impact TH
** hal to start an email thread on ... global notifications
* Issues for production TaskCluster
** {{bug|1080265#c2}} - selena to schedule a meeting with armen and go over TODAY
** Security - hwine taking lead, RRAs
*** next thing: scopes
*** bhearsum: build promotion
*** Getting an RRA on funsize -- first production bits to run on TC (rail) :)
*** Buildbot Bridge (bhearsum)
 
* Scopes - meeting with hal/dustin/jonas? Friday in the office??
** rough timeline?
** Give access to releng to scope config viewer (lightsofapollo) -selena will follow up
** Audit the scopes - (hal)
* OPEN ITEMS
** [ahal] Either need to upgrade tester images (to Ubuntu 14.04) or downgrade builder ones (to 12.04) - bug 1175938
*** Have upgraded tester uploaded, but blocked on build failure
*** According to dustin, bug 1171033 should fix the issue, so tests are blocked on that or on upgrading the image
** FOR NEXT WEEK: put
 
 
* '''What bugs need to be created?'''
** '''Authentication broker/proxy for TC workers (dustin)'''
*** '''Note from Jonas:''' [https://etherpad.mozilla.org/jonasfj-auth-tc-summary '''https://etherpad.mozilla.org/jonasfj-auth-tc-summary''']
** '''l10n dep builds still use buildbot factory: need bug to migrate to mozharness? (coop)'''
* '''Operations'''
** '''Sheriff concerns:''' [https://bugzil.la/1147867 '''https://bugzil.la/1147867''']
*** '''selena went to this meeting, Callek will report progress to Sheriff meeting weekly'''
*** '''james is going to send someone from taskcluster to the Thursday Treeherder meeting'''
*** [https://bugzilla.mozilla.org/show_bug.cgi?id=1080265 '''https://bugzilla.mozilla.org/show_bug.cgi?id=1080265'''] '''not this?'''
** '''Document key pieces of infra required for RelEng supported system to work'''
*** '''TaskCluster''' [http://status.taskcluster.net/ '''http://status.taskcluster.net/'''], [https://wiki.mozilla.org/Auto-tools/Projects/TaskCluster#Availability '''https://wiki.mozilla.org/Auto-tools/Projects/TaskCluster#Availability''']
*** '''Pulse'''
*** '''Buildbot #releng, #buildduty and #moc coverage 24/7'''
*** '''AWS S3 EC2'''
*** '''Azure Storage'''
** '''Document process for updating key objects'''
*** '''Docker images (README in-tree) ['''[https://dxr.mozilla.org/mozilla-central/source/testing/docker/README.md '''https://dxr.mozilla.org/mozilla-central/source/testing/docker/README.md''']''']'''
*** '''AMIs (config and keys with TC team) ['''[https://github.com/taskcluster/docker-worker/blob/master/deploy/checklist.md '''https://github.com/taskcluster/docker-worker/blob/master/deploy/checklist.md''']''']'''
** '''Future TC workers will have their configs in tree'''
*** '''Documentation for this?'''
 
 
 
 
 
 
* Q3 goals
** IMPORTANT: let's focus on goals where we have a good set of known-knowns
** Builds (coop)
*** OS X cross compile (ted)
**** Reproduced mshal's prior work inside mrrrgn's desktop-build container.
**** Next steps: try running in taskcluster, look at packaging+symbols
**** Got prerequisite tools in tooltool, landed build-cctools script
*** all try builds running in TC?
**** reluctant to commit to production builds (e.g. nightly) until we figure out some of the security bits
*** anhad - generic worker porting (will hand off to anthony)
** Deploy a Periodic task tool (selena) - hooks.taskcluster.net
** Linux Tests (ahal)
*** see [https://bugzil.la/1171033 https://bugzil.la/1171033]
*** is blocked on glibc/build issues
** Release Promotion (nthomas, bhearsum, rail)
*** IT WILL BE AWESOME
*** Release scheduling in TC (nthomas, bhearsum, rail)
** Proxies and secret handling (dustin)
*** talking with jhford (secrets.taskcluster.net)
*** TODO [selena/jhford] RRA for secrets.taskcluster.net
** We're not doing self-serve for TC anymore [X]
*** Plan to remove dependencies from tasks:
*** [http://bit.ly/1HkhrAz http://bit.ly/1HkhrAz]
*** <s>Self-serve for TC</s>
*** <s>see</s> [https://bugzil.la/1174236 <s>https://bugzil.la/1174236</s>] <s>and dep bug</s>
** What else?
*** security? where do we need to get to for shipping.
**** Docker feature to prevent modification
***** (rail) '''Bug 1175561''' - docker-worker: pull images by hash - almost done, needs deployment
***** TODO [garndt] get clarity on what S3 might do for us instead of docker/quay security wise, incl hal
**** Stop sharing machines
***** two classes of service: release and other (?)
***** checklist we need to address.. mostly scopes issue, configuration for release workers
***** TODO [selena] find a home for this checklist :)
**** Credentials management (try especially)
***** q3 builds will be shipped by buildbot, scheduled by taskcluster
***** build promotion?
***** funsize will be out of taskcluster in q3 -- partial updates -- and only issue is the docker image issue
**** Locking down scopes
**** Authentication - Hawk?
***** how access keys get spread
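
The "pull images by hash" item above (Bug 1175561) rests on a simple idea: if a task names an image by its content hash rather than by a mutable tag, any later modification of the image is detectable. The following is only a minimal sketch of that idea, assuming the image is available as a byte blob and the expected digest is recorded with the task. The names <code>sha256_hex</code> and <code>verify_image</code> and the sample bytes are hypothetical and are not part of docker-worker.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_image(image_bytes: bytes, expected_sha256: str) -> bool:
    """Accept an image only if its content hash matches the pinned digest."""
    actual = sha256_hex(image_bytes)
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, expected_sha256.lower())


# Hypothetical example: pin the digest when the image is published ...
published = b"example image tarball contents"
pinned = sha256_hex(published)

# ... and check it again before a worker would run the image.
print(verify_image(published, pinned))                 # unchanged image
print(verify_image(published + b" tampered", pinned))  # modified image
```

A tag such as "latest" can be repointed at different content, which is why the notes treat hash pinning as a way to prevent modification.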
 
 
* Communication
** How should we publish information about future meetings?
*** Mailing list? release-engineering@mozilla.com, sheriffs@
*** Blog posts? yes
 
* Armen: ryan & sheriffs, [https://bugzil.la/1080265 https://bugzil.la/1080265]
** Treeherder meeting: thursday at 8am
 
 
 
 
 
 
'''Who to contact about X'''<br />
 
* overall migration and figuring out who can help: selena
* build migration: coop
* tests migration: armen
* individual goal work: see above for contacts
 
 
'''Components of build/test process that need to be addressed'''<br />
 
* Signing (rail)
* Authentication, management of routing (dustin)
* Updates (completes, partials) (rail)
* Symbols (ted, amiyaguchi)
* Tests (likely through the buildbot bridge)
* Periodic Tasks - can do out of tree with something like crontabber (selena)
* Task configuration for each task (need to enumerate the long tail)
* Runner - deployments on Windows are an issue (coop)
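
The "Periodic Tasks" item above suggests scheduling out of tree with something like crontabber. As a rough illustration of what such a scheduler has to decide, here is a minimal sketch of a "due" check. It does not use crontabber's API; <code>is_due</code>, <code>WEEKLY</code> and the dates are hypothetical names and values chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_run, interval, now):
    """Return True when a periodic task should be scheduled again."""
    if last_run is None:
        # Never run before: schedule it immediately.
        return True
    return now - last_run >= interval


# Hypothetical weekly job, checked at a fixed point in time.
WEEKLY = timedelta(weeks=1)
now = datetime(2015, 7, 10, tzinfo=timezone.utc)

print(is_due(None, WEEKLY, now))                      # never run
print(is_due(now - timedelta(days=3), WEEKLY, now))   # ran 3 days ago
print(is_due(now - timedelta(days=8), WEEKLY, now))   # ran 8 days ago
```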
 
 
'''Big Questions'''<br />
 
* Can runner help us avoid continuing to have buildbot infra?
** Is this TaskCluster support of hardware pools? No, it would be using runner itself, which is currently launched by Puppet separately from TC or BB.
* bhearsum built the BBB, who is moving the remaining buildbot scheduling to TC scheduling?
** IIUC the setup is there but we have to move the scheduling to the TC decision tasks - please correct me if I misunderstood -- We are trying to figure this out now.
* "sendchange" issues -> can we run tests and make check in parallel?
** bug open for reducing make check ted: [https://bugzil.la/992323 https://bugzil.la/992323]
** parallelizing: possible if we want to go there (lightsofapollo)
 
 
'''Firefox releases/build promotion'''<br />
 
* nthomas: [https://bugzil.la/1118794 https://bugzil.la/1118794]
 
 
'''Bridge (scheduling in TC, runs in either)'''<br />
 
* bhearsum: [https://bugzil.la/1135192 https://bugzil.la/1135192]
** overview video presentation: [https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879 https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879]
*** Notes: [http://www.chesnok.com/daily/?p=5289 http://www.chesnok.com/daily/?p=5289]
** [https://github.com/mozilla/buildbot-bridge/ https://github.com/mozilla/buildbot-bridge/]
** [[ReleaseEngineering/Applications/BuildbotBridge]]
** currently connected to 'alder'
* In production and stable
* Needs someone to start porting schedulers over
** Initial work in {{bug|1157242}}
* may have impacts on Treeherder (avoid showing things twice, builds per hour)
** it's hard to figure out where the logs are... lightsofapollo is asking for brain cycles
** XXX: who has more specifics or a bug where this is explained in detail?
 
 
'''Signing''' - rail<br />
 
* MAR (and Linux) signing: [https://bugzil.la/1149147 https://bugzil.la/1149147]
** presentation at Whistler
** Doesn't block signing on other platforms, might be part of a pipeline later
 
 
'''Nightly and periodic tasks (PGO, autoland, B2G Partners/Devices)'''<br />
 
* <s>HG bundles:</s> [https://bugzil.la/1171190 <s>https://bugzil.la/1171190</s>] &lt;- WONTFIXED in favor of [https://bugzil.la/1144872 https://bugzil.la/1144872]
* blocklist/HSTS/HPKP updates (weekly): [https://bugzil.la/1171193 https://bugzil.la/1171193]
** relies on keys to push updates back in tree
** TC option 1: {{bug|1088350}}
** TC option 2: (preferred) All scheduling is in the tree... (XXX: file a bug for this)
 
 
'''== Builds =='''<br />
'''FxOS Builds/tests'''<br />
 
* lightsofapollo: Mulet
** Linux: builds and tests done
** Mac: builds - [https://bugzil.la/1171592 https://bugzil.la/1171592]
** Windows: builds - [https://bugzil.la/1171601 https://bugzil.la/1171601]
* lightsofapollo: B2G emulator builds/tests still happening in buildbot Q2
** [https://bugzil.la/1130763 https://bugzil.la/1130763] - gecko: Emulator ICS Mochitest 11 perma fail
** [https://bugzil.la/1146713 https://bugzil.la/1146713] - mach mochitest-remote fails: expected to find ssltunnel
** in buildbot: ICS {opt,debug}
*** already migrated to TC, lightsofapollo greening up tests
** in TC: all other emulators builds on TC
* b2gdesktop still happening in buildbot
** Windows - [https://bugzil.la/1171616 https://bugzil.la/1171616]
** Linux - already in TC
** OSX - [https://bugzil.la/1171615 https://bugzil.la/1171615]
*** blocked by needing platform support for OSX - [https://bugzil.la/1171618 https://bugzil.la/1171618]
 
 
 
'''Firefox desktop builds'''<br />
 
* '''Linux''' - mrrrgn [https://bugzil.la/1135206 https://bugzil.la/1135206]
** Successful Linux64 Opt Build: [https://tools.taskcluster.net/task-inspector/#imXseNwETUi1R4lwpnkDRQ/0 https://tools.taskcluster.net/task-inspector/#imXseNwETUi1R4lwpnkDRQ/0]
** [https://bugzil.la/1154826 https://bugzil.la/1154826] - Have Ubuntu-based containers (shared with Dustin) which build successfully, but fail at running the gtest suite
*** see bug: [https://bugzil.la/1162965 https://bugzil.la/1162965]
** Working in parallel with Dustin to figure out caches/artifacts uploads.
** [https://bugzil.la/1155749 https://bugzil.la/1155749] - After the gtest suite bug is fixed, we will move Linux Opt builds on try to TC. After that (Q3) we'll get the opt builds working as the default everywhere.
** After 64-bit builds, we need to tackle 32-bit builds. Because containers are 64-bit only, we'll need to cross compile. My plan is to add an option to MozBoot (the way we install dependencies) to force it to create a 32-bit build environment on a 64-bit machine. That bug: [https://bugzil.la/1159534 https://bugzil.la/1159534]
*** switching to ubuntu requires a little more work than CentOS for cross compiling
*** do we trigger test jobs on the buildbot side? No - turned off sendchange. ahal is going to start looking at linux64 tests.
*** TODO: Determine how to trigger test jobs on buildbot OR wait for Linux test jobs on TC
** Using mozboot now
*** [http://mxr.mozilla.org/mozilla-central/source/python/mozboot/bin/bootstrap.py http://mxr.mozilla.org/mozilla-central/source/python/mozboot/bin/bootstrap.py]
 
 
* '''Fennec/linux builds''' - dustin
** collaborating with mrrrgn; have a build working but need to hook up caches and figure out stuff like symbol uploads and artifacts
** notes at [https://gist.github.com/djmitche/9ca81f91798d512d543d https://gist.github.com/djmitche/9ca81f91798d512d543d]
** "shot in the dark" -- trying to get things working on an Ubuntu image, hacking through bug after bug, rather than trying to replicate existing build infra
** [https://bugzil.la/1118394 https://bugzil.la/1118394] - tracker
*** current issues include but are not limited to: authentication/proxy
** [https://bugzil.la/1125973 https://bugzil.la/1125973] - Docker images for Android builds - focus of current work
** [https://bugzil.la/1155349 https://bugzil.la/1155349] - mshal's work to port to mozharness
** x86 builds - [https://bugzil.la/1174206 https://bugzil.la/1174206]
 
 
* '''Windows builds''' - needs a windows worker and windows AMI & windows infra to support builds
** pmoore: generic worker: [https://bugzil.la/1119546 https://bugzil.la/1119546] - TaskCluster Windows worker is a Q2 goal, so starting Q3 it should be possible to set up windows jobs in taskcluster. Will work with existing aws provisioner. [http://petemoore.github.io/generic-worker/ http://petemoore.github.io/generic-worker/]
** ffledgling: [https://bugzil.la/1180775 https://bugzil.la/1180775] - Implement Windows builds using the generic worker
** Windows builds in AWS (arr)
*** Runner on windows
*** Puppet on windows
*** AMI generation for windows
*** Performance issues - we're pretty sure this is solved now!
**** {{bug|1168812#c23}}
 
** Cross compiling
*** Why? This will save us $ on licenses for Windows itself for builders. Compiling on Windows hosts also likely costs more, because it is likely slower than cross compiling (need data, but strongly suspect)
*** Options:
**** Visual Studio under WINE (no one active)
**** clang-cl (longer term?)
 
** We shouldn't be spending time trying to cross-compile Windows builds. We should just fix TaskCluster to be able to support Windows workers on AWS. +1
*** not as a first pass, sure, but I think this bears further investigation once we're running Windows builds on TC in AWS
 
 
* '''Mac OS X builds'''
** Ted/coop: [https://bugzil.la/921040 https://bugzil.la/921040] - Cross-compile Firefox for Mac on Linux
** Last time stalled out on:
*** buildsymbols -- better story there now, llvm people are working on a dsymutil: [https://github.com/llvm-mirror/llvm/tree/master/tools/dsymutil https://github.com/llvm-mirror/llvm/tree/master/tools/dsymutil]
*** packaging -- was pretty close, libdmg-hfsplus does 95% of what we need: [https://bugzil.la/935237 https://bugzil.la/935237]
** next step: working docker image
 
 
'''Haz builds'''<br />
 
* TBD - [https://bugzil.la/1171632 https://bugzil.la/1171632]
 
 
'''Spidermonkey'''<br />
 
* ffledgling: [https://bugzil.la/1164656 https://bugzil.la/1164656]
 
 
'''l10n repacks'''<br />
 
* Linux
** Nightly: [https://bugzil.la/1171736 https://bugzil.la/1171736]
** Dep:
* Linux64
** Nightly: [https://bugzil.la/1171738 https://bugzil.la/1171738]
** Dep:
* Mac
** Nightly: [https://bugzil.la/1171741 https://bugzil.la/1171741]
** Dep:
* Win32
** Nightly: [https://bugzil.la/1171743 https://bugzil.la/1171743]
** Dep:
* Win64
** Nightly: [https://bugzil.la/1171745 https://bugzil.la/1171745]
** Dep:
* Android (single locale)
** Nightly: [https://bugzil.la/1171787 https://bugzil.la/1171787]
** Dep:
 
 
Thunderbird?
 
==================================================<br />
'''Tests'''<br />
 
* Talos (perf tests) - jonas has done some experiments
** probably stuck on hardware forever
** possible plans for linux: running for 1 quarter on AWS side by side to see if it works or docker images on hardware
** Hardware for Windows/Mac
** Pandas? Off them by Q3, to move to AutoPhone
** jmaher believes that we need more experimentation in the cloud
*** jmaher's current Q3 deliverable: Android talos tests off the pandas (moving to Autophone)
*** We have not been able to get clear answers about regressions in the past
*** 2 ways to solve this
**** all on the cloud
**** all on the hardware
*** take every build on inbound or fx-team and run talos on there
**** jonas said it is too much volume; what was the reason?
***** TODO: investigate more
**** We could take some tests off but we need most of them
**** Some tests require graphics
**** The media tests have specialized I/O devices
**** It would be meaningful if we get all tests running
**** Report to graph server
**** Once we see a regression we can compare if it is properly reported
**** All or None
***** For 28 tests we run on Linux32, if we cannot run all of them on VMs we should stick to hardware, otherwise, developers would get confused
***** The more vectors we add on a regression, the harder it will be to get traction from developers to fix it
***** For instance, tp5 fails on hardware and developers wonder why the other perf suites did not fail and then they discover that it is because it was running on a VM instead
***** It becomes two platforms to support instead of 1
****** Different network access, disks, memory
 
**** It would be nice that they could run the same VM setup locally
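
The "All or None" rule in the Talos notes above can be stated as a small decision function: a platform's perf tests move to VMs only if every one of them can run there, otherwise they all stay on hardware. This is a minimal sketch under that reading of the notes. <code>choose_talos_platform</code> and the capability table are hypothetical, and only tp5 is a suite actually named in the notes.

```python
def choose_talos_platform(tests, runs_on_vm):
    """Return "vm" only when every test can run on a VM, else "hardware"."""
    # A test missing from the table is treated as not VM-capable.
    can_all_run_on_vm = all(runs_on_vm.get(name, False) for name in tests)
    return "vm" if can_all_run_on_vm else "hardware"


# Hypothetical capability table for illustration only.
runs_on_vm = {"tp5": True, "graphics_suite": False, "media_suite": False}

print(choose_talos_platform(["tp5"], runs_on_vm))
print(choose_talos_platform(["tp5", "graphics_suite"], runs_on_vm))
```

Keeping the decision per platform rather than per test avoids the situation described above, where one suite regresses on hardware while the others ran on a VM.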
 
* Linux tests - ahal - [https://bugzil.la/1171033 https://bugzil.la/1171033]
** '''[blocker]''' waiting on L64 builds to stop timing out before we can see them
*** ask Morgan or ahal about it
** Containerized/Virtualized tests (have some in VMs now)
** linux, including Android emulator tests - probably easy, since we already have them running in AWS
* Windows (?) vmware! - will need some amount of effort to green up tests in a new environment, hard to estimate until we do a first pass
** Buildbot Bridge?
** Containers?
*** winjail?
 
* Mac OS X
** runner to replace buildbot master management?
** Buildbot Bridge?
** Containers?
* All the rest of our tests
** mac hardware
** windows hardware
 
 
'''Data source:'''<br />
* All buildbot builders: [http://people.mozilla.org/~armenzg/permanent/all_builders.txt http://people.mozilla.org/~armenzg/permanent/all_builders.txt]<br />
** Generated every night
 
<br />
'''== Blog Posts and other resources =='''<br />
TC tutorial recording:<br />
[https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH]<br />
[http://www.chesnok.com/daily/2015/05/29/migrating-to-taskcluster-work-underway/ http://www.chesnok.com/daily/2015/05/29/migrating-to-taskcluster-work-underway/]<br />
[http://www.chesnok.com/daily/2015/06/02/taskcluster-migration-a-hello-world-for-worker-task-creator/ http://www.chesnok.com/daily/2015/06/02/taskcluster-migration-a-hello-world-for-worker-task-creator/]<br />
[https://etherpad.mozilla.org/taskcluster-hello-world https://etherpad.mozilla.org/taskcluster-hello-world]<br />
[https://etherpad.mozilla.org/taskcluster-migration-scheduler https://etherpad.mozilla.org/taskcluster-migration-scheduler]
 
Buildbot Bridge<br />
[https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879 https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879]
 
Funsize
 
<br />
Build Promotion/release promotion
 
=== ARCHIVE ===<br />
 
* Status of Q2 goals
** Linux builds available thru TC (mrrrgn)
*** debug & opt hooked up & triggered on try
*** working on asan builds
*** 32-bit on its way
** Fennec builds available thru TC (dustin)
*** no 32-bit fennec (WOOO)
*** Fennec needs tooltool - working on a proxy
*** x86 builds? (q for catlee - we need them but probably hasn't looked at them)
** Generic worker running on Windows by EOQ (pmoore)
*** Can we get a recorded presentation about this? (coop gonna bug him - sure - i'll make a recording)
** [https://etherpad.mozilla.org/tc-builds-whistler-tasks https://etherpad.mozilla.org/tc-builds-whistler-tasks] (let's do things! android, other things welcome! room will be 10-12 people! HEY JONAS you can use our room if you want to hack, but not if you want to meet & talk) Look in sched.org
*** [http://juneworkweekwhistler2015.sched.org/mobile/#session:0319364894adb0f090250ef89f75ae03 http://juneworkweekwhistler2015.sched.org/mobile/#session:0319364894adb0f090250ef89f75ae03] (note that the room only holds 10-12 people and is shared with the windows hacking session [http://juneworkweekwhistler2015.sched.org/mobile/#session:0fb58a5d8b99a43dac064e7f892ebdc7 http://juneworkweekwhistler2015.sched.org/mobile/#session:0fb58a5d8b99a43dac064e7f892ebdc7], so space is limited to about 6-7 people)
** [https://etherpad.mozilla.org/jonasfj-taskcluster-whistler-subjects https://etherpad.mozilla.org/jonasfj-taskcluster-whistler-subjects]
** armenzg: talk about self-serve for TC
*** TODO: selena remove this entry when you read this
*** change of plans to remove dependencies from a task graph
*** [http://bit.ly/1HkhrAz http://bit.ly/1HkhrAz]
** dustin: permacreds
*** a millennium of credentials, ask jonas for your creds
*** we need to be able to expire these :)
** armenzg: talos
*** jonathan and joel: need to be running all of them on hardware, or all of them on VM
*** current plan for hardware is to use generic worker; will keep enough hardware (non-virtualized) to run talos
*** XP, 7, 8, (probably) Windows 10 - need GPO integration for the generic worker for helping with the tests. 2008 will be running puppet in AWS and will need puppet integration
*** Need work on Mac for Generic Worker
*** Q3: Linux, Q4/Q1: Windows and Mac
** the scheduler: [https://etherpad.mozilla.org/taskcluster-migration-scheduler https://etherpad.mozilla.org/taskcluster-migration-scheduler]
** project: coalescing as a service
*** please don't lose the ability to schedule only using the tree

Latest revision as of 16:38, 20 October 2015