Taskcluster/Release Integration Meetings/2015-07-10
Tracker Bug for Taskcluster Migration work
https://bugzil.la/1141248
Purpose: Bringing Up Blockers And Celebrating
To organize and track work toward moving to TaskCluster from Buildbot.
Attempted trello board:
https://trello.com/b/asIJ2pGC/taskcluster-migration
Announcements
TaskCluster Meeting recording
- May 29: https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH
- June 12: https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=vyeYxrpVarOQTZN
- June 19: https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=dpasghNdnSWOnYf
- July 10: https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=17FI2iUmQE5J4ts
TODO:
- Get a talk given about scopes (jonas) - Friday 10am in ReleaseEngineering vidyo room
- Get a talk given about generic worker (pmoore) http://docs.taskcluster.net/presentations - brought anhad up to speed
- selena - get a copy of the videos to send them to youtube
- dustin: what's the plan for x86 Fennec builds (specifically that architecture)? Let's get clarification. Also, is this something we have to move? Can likely do it easily.
- yes, we need to move this
- tests run _much_ faster on x86. there is even talk about running some tests here and not elsewhere (autophone), so we definitely need the builds
- blocked on tests for now
- specifically, that architecture shouldn't be more problematic than the existing arch
- [selena] - treeherder dev followup bug 1165469#c35 (escalation for this?) - selena to follow up on this one; turns out it was not a TaskCluster problem
- [selena] - make a roadmap
Agenda
- Whistler results
- ~10 groups became aware that taskcluster exists
- triaged all of trello
- Notifications [armen]
- [selena] - send 'dev-tree-management' email when things impact TH
- hal to start an email thread on ... global notifications
- Issues for production TaskCluster
- bug 1080265#c2 - selena to schedule a meeting with armen and go over TODAY
- Security - hwine taking lead, RRAs
- next thing: scopes
- bhearsum: build promotion
- Getting an RRA on funsize -- first production bits to run on TC (rail) :)
- Buildbot Bridge (bhearsum)
- Scopes - meeting with hal/dustin/jonas? Friday in the office??
- rough timeline?
- Give access to releng to scope config viewer (lightsofapollo) - selena will follow up
- Audit the scopes - (hal)
- OPEN ITEMS
- [ahal] Either need to upgrade tester images (to Ubuntu 14.04) or downgrade builder ones (to 12.04) - bug 1175938
- Have upgraded tester uploaded, but blocked on build failure
- According to dustin, bug 1171033 should fix the issue, so tests are blocked on that or on upgrading the image
- FOR NEXT WEEK: put
- What bugs need to be created?
- Operations
- Sheriff concerns: https://bugzil.la/1147867
- selena went to this meeting, Callek will report progress to Sheriff meeting weekly
- james is going to send someone from taskcluster to the Thursday Treeherder meeting
- https://bugzilla.mozilla.org/show_bug.cgi?id=1080265 not this?
- Document key pieces of infra required for RelEng supported system to work
- TaskCluster: http://status.taskcluster.net/, https://wiki.mozilla.org/Auto-tools/Projects/TaskCluster#Availability
- Pulse
- Buildbot #releng, #buildduty and #moc coverage 24/7
- AWS S3 EC2
- Azure Storage
- Document process for updating key objects
- Docker images (README in-tree) [https://dxr.mozilla.org/mozilla-central/source/testing/docker/README.md]
- AMIs (config and keys with TC team) [https://github.com/taskcluster/docker-worker/blob/master/deploy/checklist.md]
- Future TC workers will have their configs in tree
- Documentation for this?
- Q3 goals
- IMPORTANT: let's focus on goals where we have a good set of known-knowns
- Builds (coop)
- OS X cross compile (ted)
- Reproduced mshal's prior work inside mrrgn's desktop-build container.
- Next steps: try running in taskcluster, look at packaging+symbols
- Got prerequisite tools in tooltool, landed build-cctools script
- all try builds running in TC?
- reluctant to commit to production builds (e.g. nightly) until we figure out some of the security bits
- anhad - generic worker porting (will hand off to anthony)
- Deploy a Periodic task tool (selena) - hooks.taskcluster.net
- Linux Tests (ahal)
- see https://bugzil.la/1171033
- is blocked on glibc/build issues
- Release Promotion (nthomas, bhearsum, rail)
- IT WILL BE AWESOME
- Release scheduling in TC (nthomas, bhearsum, rail)
- Proxies and secret handling (dustin)
- talking with jhford (secrets.taskcluster.net)
- TODO [selena/jhford] RRA for secrets.taskcluster.net
- We're not doing self-serve for TC anymore [X]
- Plan to remove dependencies from tasks:
- http://bit.ly/1HkhrAz
- Self-serve for TC - see https://bugzil.la/1174236 and dep bug
- What else?
- security? where do we need to get to for shipping.
- Docker feature to prevent modification
- (rail) Bug 1175561 - docker-worker: pull images by hash - almost done, needs deployment
- TODO [garndt] get clarity on what S3 might do for us instead of docker/quay security wise, incl hal
- Stop sharing machines
- two classes of service: release and other (?)
- checklist we need to address.. mostly scopes issue, configuration for release workers
- TODO [selena] find a home for this checklist :)
- Credentials management (try especially)
- q3 builds will be shipped by buildbot, scheduled by taskcluster
- build promotion?
- funsize will be out of taskcluster in q3 -- partial updates -- and only issue is the docker image issue
- Locking down scopes
- Authentication - Hawk?
- how access keys get spread
- Communication
- How should we publish information about future meetings?
- Mailing list? release-engineering@mozilla.com, sheriffs@
- Blog posts? yes
- Armen: ryan & sheriffs, https://bugzil.la/1080265
- Treeherder meeting: thursday at 8am
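The security item above about docker-worker pulling images by hash (bug 1175561) can be illustrated with a small check. This is only a sketch, not docker-worker's actual logic: it assumes a pinned reference has the form `<name>@sha256:<64 hex chars>`, and the function names and the example image names are hypothetical.

```python
import re

# Assumption: a digest-pinned image reference looks like
# "<name>@sha256:<64 lowercase hex chars>". Tag-only references
# such as "name:latest" are treated as not pinned.
_DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref):
    """Return True if the image reference is pinned to a sha256 digest."""
    return _DIGEST_RE.match(image_ref) is not None


def unpinned_images(image_refs):
    """Return the references that could still be changed behind a tag."""
    return [ref for ref in image_refs if not is_pinned_by_digest(ref)]


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    refs = [
        "quay.io/example/builder:latest",
        "quay.io/example/builder@sha256:" + "a" * 64,
    ]
    for ref in unpinned_images(refs):
        print("not pinned:", ref)
```

A check like this could run over task definitions during an audit to list images that are referenced only by a mutable tag.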
Who to contact about X
- overall migration and figuring out who can help: selena
- build migration: coop
- tests migration: armen
- individual goal work: see above for contacts
Components of build/test process that need to be addressed
- Signing (rail)
- Authentication, management of routing (dustin)
- Updates (completes, partials) (rail)
- Symbols (ted, amiyaguchi)
- Tests (likely through the buildbot bridge)
- Periodic Tasks - can do out of tree with something like crontabber (selena)
- Task configuration for each task (need to enumerate the long tail)
- Runner - deployments on Windows are an issue (coop)
Big Questions
- Can runner help us avoid continuing to have buildbot infra?
- Is this TaskCluster support of hardware pools? No, it would be using runner itself, which is currently launched by Puppet separately from TC or BB.
- bhearsum built the BBB, who is moving the remaining buildbot scheduling to TC scheduling?
- IIUC the setup is there but we have to move the scheduling to the TC decision tasks - please correct me if I misunderstood -- We are trying to figure this out now.
- "sendchange" issues -> can we do tests and make check in parallel?
- bug open for reducing make check (ted): https://bugzil.la/992323
- parallelizing: possible if we want to go there (lightsofapollo)
Firefox releases/build promotion
- nthomas: https://bugzil.la/1118794
Bridge (scheduling in TC, runs in either)
- bhearsum: https://bugzil.la/1135192
- overview video presentation: https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879
- https://github.com/mozilla/buildbot-bridge/
- ReleaseEngineering/Applications/BuildbotBridge
- currently connected to 'alder'
- In production and stable
- Needs someone to start porting schedulers over
- Initial work in bug 1157242
- may have impacts on Treeherder (avoid showing things twice, builds per hour)
- it's hard to figure out where the logs are... lightsofapollo is asking for brain cycles
- XXX: who has more specifics or a bug where this is explained in detail?
Signing - rail
- MAR (and Linux) signing: https://bugzil.la/1149147
- presentation at Whistler
- Doesn't block signing on other platforms, might be part of a pipeline later
Nightly and periodic tasks (PGO, autoland, B2G Partners/Devices)
- HG bundles: https://bugzil.la/1171190 <- WONTFIXED in favor of https://bugzil.la/1144872
- blocklist/HSTS/HPKP updates (weekly): https://bugzil.la/1171193
- relies on keys to push updates back in tree
- TC option 1: bug 1088350
- TC option 2: (preferred) All scheduling is in the tree... (XXX: file a bug for this)
== Builds ==
FxOS Builds/tests
- lightsofapollo: Mulet
- Linux: builds and tests done
- Mac: builds - https://bugzil.la/1171592
- Windows: builds - https://bugzil.la/1171601
- lightsofapollo: B2G emulator builds/tests still happening in buildbot Q2
- https://bugzil.la/1130763 - gecko: Emulator ICS Mochitest 11 perma fail
- https://bugzil.la/1146713 - mach mochitest-remote fails: expected to find ssltunnel
- in buildbot: ICS {opt,debug}
- already migrated to TC, lightsofapollo greening up tests
- in TC: all other emulators builds on TC
- b2gdesktop still happening in buildbot
- Windows - https://bugzil.la/1171616
- Linux - already in TC
- OSX - https://bugzil.la/1171615
- blocked by needing platform support for OSX - https://bugzil.la/1171618
Firefox desktop builds
- Linux - mrrgn https://bugzil.la/1135206
- Successful Linux64 Opt Build: https://tools.taskcluster.net/task-inspector/#imXseNwETUi1R4lwpnkDRQ/0
- (https://bugzil.la/1154826) Have Ubuntu-based containers (shared with Dustin) which build successfully, but fail at running the gtest suite
- see bug: https://bugzil.la/1162965
- Working in parallel with Dustin to figure out caches/artifacts uploads.
- (https://bugzil.la/1155749) - After the gtest suite bug is fixed, will move Linux Opt builds on try to TC. After that (Q3) we'll get the opt builds working as the default everywhere.
- After 64-bit builds, we need to tackle 32-bit builds. Because containers are 64-bit only, we'll need to cross compile. For that, my plan is to add an option to MozBoot (the way we install dependencies) to force it to create a 32-bit build environment on a 64-bit machine. That bug: https://bugzil.la/1159534
- switching to ubuntu requires a little more work than CentOS for cross compiling
- do we trigger test jobs on the buildbot side? No - turned off sendchange. ahal is going to start looking at linux64 tests.
- TODO: Determine how to trigger test jobs on buildbot OR wait for Linux test jobs on TC
- Using mozboot now
- Fennec/linux builds - dustin
- collaborating with mrrrgn; have a build working but need to hook up caches and figure out stuff like symbol uploads and artifacts
- notes at https://gist.github.com/djmitche/9ca81f91798d512d543d
- "shot in the dark" -- trying to get things working on an Ubuntu image, hacking through bug after bug, rather than trying to replicate existing build infra
- https://bugzil.la/1118394 - tracker
- current issues include but are not limited to: authentication/proxy
- https://bugzil.la/1125973 - Docker images for Android builds - focus of current work
- https://bugzil.la/1155349 - mshal's work to port to mozharness
- x86 builds - https://bugzil.la/1174206
- Windows builds - needs a windows worker and windows AMI & windows infra to support builds
- pmoore: generic worker: https://bugzil.la/1119546 - TaskCluster Windows worker is a Q2 goal, so starting Q3 it should be possible to set up windows jobs in taskcluster. Will work with existing aws provisioner. http://petemoore.github.io/generic-worker/
- ffledgling: https://bugzil.la/1180775 - Implement Windows builds using the generic worker
- Windows builds in AWS (arr)
- Runner on windows
- Puppet on windows
- AMI generation for windows
- Performance issues - we're pretty sure this is solved now!
- Cross compiling
- Why? This will save us $ on licenses for Windows itself for builders. The compiler license cost is the same either way, but the overall cost is likely higher on Windows hosts because compiling there is probably slower than cross compiling (need data, but strongly suspect)
- Options:
- Visual Studio under WINE (no one active)
- clang cl (longer term?)
- We shouldn't be spending time trying to cross-compile Windows builds. We should just fix TaskCluster to be able to support Windows workers on AWS. +1
- not as a first pass, sure, but I think this bears further investigation once we're running Windows builds on TC in AWS
- Mac OS X builds
- Ted/coop: https://bugzil.la/921040 - Cross-compile Firefox for Mac on Linux
- Last time stalled out on:
- buildsymbols -- better story there now, llvm people are working on a dsymutil: https://github.com/llvm-mirror/llvm/tree/master/tools/dsymutil
- packaging -- was pretty close, libdmg-hfsplus does 95% of what we need: https://bugzil.la/935237
- next step: working docker image
Haz builds
Spidermonkey
- ffledgling: https://bugzil.la/1164656
l10n repacks
- Linux
- Nightly: https://bugzil.la/1171736
- Dep:
- Linux64
- Nightly: https://bugzil.la/1171738
- Dep:
- Mac
- Nightly: https://bugzil.la/1171741
- Dep:
- Win32
- Nightly: https://bugzil.la/1171743
- Dep:
- Win64
- Nightly: https://bugzil.la/1171745
- Dep:
- Android (single locale)
- Nightly: https://bugzil.la/1171787
- Dep:
Thunderbird?
==================================================
Tests
- Talos (perf tests) - jonas has done some experiments
- probably stuck on hardware forever
- possible plans for linux: running for 1 quarter on AWS side by side to see if it works or docker images on hardware
- Hardware for Windows/Mac
- Pandas? Off them by Q3, to move to AutoPhone
- jmaher believes that we need more experimentation in the cloud
- jmaher's current Q3 deliverable: Android talos tests off the pandas (moving to Autophone)
- We have not been able to get clear answers about regressions in the past
- 2 ways to solve this
- all on the cloud
- all on the hardware
- take every build on inbound or fx-team and run talos on there
- jonas said it is too much volume; what was the reason?
- TODO: investigate more
- We could take some tests off but we need most of them
- Some tests require graphics
- The media tests have specialized I/O devices
- It would be meaningful if we get all tests running
- Report to graph server
- Once we see a regression we can compare if it is properly reported
- All or None
- For 28 tests we run on Linux32, if we cannot run all of them on VMs we should stick to hardware, otherwise, developers would get confused
- The more vectors we add on a regression, the harder it will be to get traction from developers to fix it
- For instance, tp5 fails on hardware and developers wonder why the other perf suites did not fail and then they discover that it is because it was running on a VM instead
- It becomes two platforms to support instead of 1
- Different network access, disks, memory
- It would be nice if they could run the same VM setup locally
- Linux tests - ahal - https://bugzil.la/1171033
- [blocker] waiting on L64 builds to stop timing out before we can see them
- ask Morgan or ahal about it
- Containerized/Virtualized tests (have some in VMs now)
- linux, including Android emulator tests - probably easy, since we already have them running in AWS
- Windows (?) vmware! - will need some amount of effort to green up tests in a new environment, hard to estimate until we do a first pass
- Buildbot Bridge?
- Containers?
- winjail?
- Mac OS X
- runner to replace buildbot master management?
- Buildbot Bridge?
- Containers?
- All the rest of our tests
- mac hardware
- windows hardware
Data source:
- All buildbot builders: http://people.mozilla.org/~armenzg/permanent/all_builders.txt
- Generated every night
== Blog Posts and other resources ==
TC tutorial recording:
https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=bhKVEns4kzkzdYH
http://www.chesnok.com/daily/2015/05/29/migrating-to-taskcluster-work-underway/
http://www.chesnok.com/daily/2015/06/02/taskcluster-migration-a-hello-world-for-worker-task-creator/
https://etherpad.mozilla.org/taskcluster-hello-world
https://etherpad.mozilla.org/taskcluster-migration-scheduler
Buildbot Bridge
https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879
Funsize
Build Promotion/release promotion
=== ARCHIVE ===
- Status of Q2 goals
- Linux builds available thru TC (mrrgn)
- debug & opt hooked up & triggered on try
- working on asan builds
- 32-bit on its way
- Fennec builds available thru TC (dustin)
- no 32-bit fennec (WOOO)
- Fennec needs tooltool - working on a proxy
- x86 builds? (q for catlee - we need them but probably hasn't looked at them)
- Generic worker running on Windows by EOQ (pmoore)
- Can we get a recorded presentation about this? (coop gonna bug him - sure - i'll make a recording)
- https://etherpad.mozilla.org/tc-builds-whistler-tasks (lets do things! android, other things welcome! room will be 10-12 people! HEY JONAS you can use our room if you want to hack, but not if you want to meet&talk) Look in sched.org
- http://juneworkweekwhistler2015.sched.org/mobile/#session:0319364894adb0f090250ef89f75ae03 (note that the room only holds 10-12 people and is shared with the windows hacking session http://juneworkweekwhistler2015.sched.org/mobile/#session:0fb58a5d8b99a43dac064e7f892ebdc7, so space is limited to about 6-7 people)
- https://etherpad.mozilla.org/jonasfj-taskcluster-whistler-subjects
- armenzg: talk about self-serve for TC
- TODO: selena remove this entry when you read this
- change of plans to remove dependencies from a task graph
- http://bit.ly/1HkhrAz
- dustin: permacreds
- a millennium of credentials, ask jonas for your creds
- we need to be able to expire these :)
- armenzg: talos
- jonathan and joel: need to be running all of them on hardware, or all of them on VM
- current plan for hardware is to use generic worker; will keep enough hardware (non-virtualized) to run talos
- XP, 7, 8, (probably) Windows 10 - need GPO integration for the generic worker for helping with the tests. 2008 will be running puppet in AWS and will need puppet integration
- Need work on Mac for Generic Worker
- Q3: Linux, Q4/Q1: Windows and Mac
- the scheduler: https://etherpad.mozilla.org/taskcluster-migration-scheduler
- project: coalescing as a service
- please don't lose the ability to schedule only using the tree