https://wiki.mozilla.org/api.php?action=feedcontributions&user=Bear&feedformat=atom
MozillaWiki - User contributions [en] 2024-03-29T13:40:44Z
User contributions - MediaWiki 1.27.4

https://wiki.mozilla.org/index.php?title=Conductors&diff=713958 Conductors 2013-09-23T16:56:58Z
<p>Bear: /* Location: Philadelphia, PA Timezone: EST */</p>
<hr />
<div>__NOTOC__ <br />
<br />
Hi! We're the Conductors - a team of Mozilla community members who are available as ''mentors'' to help conversations run more smoothly and harmoniously. We're not police or referees, just a group of people who have offered to be available to give advice, counsel and support to our fellow community members when a discussion or debate gets a little tense.<br><br />
<br><br />
It's not always easy to discuss contentious issues through primarily text-based systems like email, newsgroups and Bugzilla comments. Sometimes it can help to have someone to assist with phrasing, interpretation and the flow of discussion. If you ever find yourself in a tricky conversation within the Mozilla community and want some assistance, just hop into #conductors on irc.mozilla.org, or send an email either to the [mailto:conductors@mozilla.org whole group] or any of us individually and we'll be happy to help.<br><br />
<br><br />
(We'd also love to hear from you if you just want to learn more about strategies and techniques for online discussions, or want to join this mentorship group!)<br />
<br />
'''[mailto:conductors@mozilla.org Email conductors@mozilla.org]''' or jump to an individual ...<br />
<small><center>[[Conductors#David_Ascher|David Ascher]] | [[Conductors#Dietrich_Ayala_.28.40dietrich.29|Dietrich Ayala]] | [[Conductors#Mike_Beltzner_.28.40beltzner.29|Mike Beltzner]] | [[Conductors#Matt_Claypotch_.28.40potch.29|Matt Claypotch]] | [[Conductors#David_Eaves_.28.40daeaves.29|David Eaves]] | [[Conductors#Gen_Kanai_.28.40gen.29|Gen Kanai]] | [[Conductors#Michelle_Luna|Michelle Luna]] | <br />
[[Conductors#Johnathan_Nightingale|Johnathan Nightingale]] | [[Conductors#Stormy_Peters_.28.40storming.29|Stormy Peters]] | [[Conductors#Melissa_Shapiro_.28.40shappy.29|Melissa Shapiro]] | [[Conductors#Gavin_Sharp|Gavin Sharp]] | [[Conductors#Benjamin_Smedberg|Benjamin Smedberg]] | [[Conductors#Mike_Taylor_.28.40bear.29|Mike Taylor (Bear)]] | [[Conductors#David_Tenser|David Tenser]] | [[Conductors#Daniel_Veditz_.28.40dveditz.29|Daniel Veditz]]</center></small><br />
<br />
=== [https://mozillians.org/en-US/u/d950a0fa3c David Ascher] ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
[[Image:DavidA.jpg|frame|left]] <br />
I work in Mozilla Labs, with a focus on the social web and communications software. I'm currently based in Vancouver, Canada, speak French & English, and believe the world would be better if everyone had lived in more than one country. I've been involved in open source communities for a long time, and involved with Mozilla since 1999 (and an employee for a few years).<br />
<br />
'''Email'''<br><br />
[mailto:da@mozilla.com da@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
davida on irc<br><br />
david.ascher@gmail.com (gtalk)<br><br />
@davidascher (twitter)<br><br />
davidascher (skype)<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/54a415e4c8 Dietrich Ayala] ([http://www.twitter.com/dietrich @dietrich])===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
<br />
[[Image:Dietrich-ayala.jpg|frame|left]] <br />
<br />
Dietrich Ayala is a Firefox front-end developer and engineering manager. He's been working on Firefox for over 5 years, on projects like session restore, Places (bookmarks, history and the awesomebar), Jetpack and various other bits and pieces. He also enjoys talking to The People of Earth about Firefox and Mozilla, so he gets himself out to meetups, tweetups, user groups and conferences as much as possible. He thinks you should come hang out with him in Portland, Oregon, sometime, and tell him stories about places that are not cold and wet.<br />
<br />
'''Email'''<br><br />
[mailto:dietrich@mozilla.com dietrich@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
[http://www.twitter.com/dietrich dietrich on twitter]<br><br />
autonome on gtalk<br><br />
autonome on skype<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/590e458f89 Mike Beltzner] ([http://www.twitter.com/beltzner @beltzner]) ===<br />
Location: Toronto, Ontario (EST/UTC-5)<br><br />
[[Image:Mike-beltzner.jpg|frame|left]] <br />
Hi. I'm [https://mozillians.org/en-US/u/590e458f89 Mike], and I've been a Mozilla community member since 2004, first working on Calendar user experience issues and eventually joining the Mozilla Corporation and becoming the product director for Firefox. Although I left the company in 2011, you can still find me puttering about in the community and I currently act as moderator for the mozilla.dev.planning newsgroup. I've learned how to manage difficult online conversations the hard way, and can show you my scars if you think it would help. Feel free to drop me a line if you want to chat about any difficult situation, or commiserate about how Firefly was cancelled far, far too soon.<br />
<br />
'''Email''': [mailto:mbeltzner@gmail.com mbeltzner@gmail.com]<br><br />
'''IM/IRC''': beltzner on irc.mozilla.org, mbeltzner@gmail on gtalk<br><br />
'''Languages''': English, some French<br><br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/73d2e3295b Matt Claypotch] ([http://www.twitter.com/potch @potch]) ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
[[Image:potch_headshot.jpg|frame|left]] <br />
<br />
'''Hello!'''<br />
<br />
I'm Potch (well not legally, but functionally) and I'm a WebDev working on the Add-ons team. In addition to hugs, I'm also good at JavaScript, CSS, HTML5, Photoshopping things, and cracking wise. I'd like to think I make a tasty meatball. My favorite movie is Groundhog Day.<br />
<br />
If you're looking for me, I can usually be found on the 3rd floor of the Mountain View office over by 10 Forward.<br />
<br />
I'm a [https://mozillians.org/en-US/u/73d2e3295b Mozillian]!<br />
<br />
'''Email'''<br><br />
[mailto:potch@mozilla.com potch@mozilla.com]<br />
<br />
'''IM/IRC'''<br />
potch on IRC<br/><br />
thepotch on GTalk<br/><br />
[http://www.twitter.com/potch @potch on Twitter]<br/><br />
(for Mozilla Corporation employees, I also Yammer from time to time!)<br/><br />
<br />
<br clear="all"><br />
<br />
=== David Eaves ([http://www.twitter.com/daeaves @daeaves]) ===<br />
[[Image:deaves.png|frame|left]] <br />
David Eaves is a negotiation and conflict management expert with over 10 years' experience advising companies, non-profits and open source communities on their critical negotiations. David has been contributing to Mozilla for over 5 years, with a strong interest in community engagement and in workflow in systems like Bugzilla. He's also big into open data and gov2.0. Feel free to drop me a line if you'd like some advice or help!<br />
<br />
'''Email'''<br><br />
[mailto:david@eaves.ca david@eaves.ca]<br />
<br />
'''IM/IRC'''<br><br />
deaves on IRC<br/><br />
david@eaves.ca on GTalk<br/><br />
[http://www.twitter.com/daeaves @daeaves on Twitter]<br/><br />
I blog at [http://www.eaves.ca eaves.ca]<br/><br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/60db2889ab Gen Kanai] ([http://www.twitter.com/gen @gen]) ===<br />
==== Location [http://goo.gl/OJa8A Tokyo, Japan] Timezone: [http://www3.nict.go.jp/cgi-bin/JST_E.pl JST] ====<br />
[[Image:GenKanai_008.jpeg|frame|left]] <br />
I am currently a member of the Contributor Engagement team and the Director of Asia Community Engagement for Mozilla. My area of responsibility is Asia, excluding China and Japan (those two markets have dedicated teams supporting them). I previously held the Director of Marketing role at Mozilla Japan. I am involved in many activities, including supporting Mozilla’s localization communities, marketing efforts and evangelism across much of Asia. I am based out of Mozilla's Tokyo office and have been involved in Mozilla since 2006. <br />
<br />
'''Email'''<br><br />
[mailto:gen@mozilla.com gen@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
gen on IRC<br/><br />
[http://www.twitter.com/gen @gen on Twitter]<br/><br />
I blog at [https://blog.mozilla.com/gen/ https://blog.mozilla.com/gen/]<br/><br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/cc44df8d28 Michelle Luna] ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
[[Image:Me3.png|frame|left]] <br />
I'm a [https://mozillians.org/en-US/u/cc44df8d28 Mozillian]. I joined Mozilla to coordinate Firefox for Android support in July of 2011. I'm a long-time Firefox user and a newbie [http://blog.mozilla.com/sumo/2011/10/26/how-i-got-involved-with-mozilla/ contributor]. And I like hats!<br />
<br />
'''Email'''<br><br />
[mailto:mluna@mozilla.com mluna@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
michelleluna<br><br />
<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/4ac6e7ba5a Johnathan Nightingale] ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Eastern/d/-5 EST] ====<br />
[[Image:Johnathan-nightingale.jpg|frame|left]] <br />
My business cards say "Director of Firefox Engineering" but mostly I float around trying to add Canadian-ness to things. I was a Mozilla fan well before I joined the company, and when I did join it was to work on security and usability code in the front end. These days they don't really let me anywhere near a text editor.<br />
<br />
'''Email'''<br><br />
[mailto:johnath@mozilla.com johnath@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
* johnath (irc)<br />
* johnath (gmail)<br />
* [https://mozillians.org/en-US/u/4ac6e7ba5a Mozillians Page]<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/81e2e569d1 Stormy Peters] ([http://www.twitter.com/storming @storming]) ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Mountain/d/-7 MST] ====<br />
[[Image:Stormy.jpg|frame|left]] <br />
I'm Stormy and I work on the Developer Engagement team at Mozilla. I've been involved in open source software since 1999 and I'm fascinated with how well communities work online and globally. I'm part of many communities including Mozilla, GNOME and Kids on Computers and I feel privileged to work with such passionate, hard-working and world-changing people. If I can help you make the world a better place, let me know.<br />
<br />
'''Email''': [mailto:stormy@mozilla.com stormy@mozilla.com]<br />
<br />
'''IM/IRC''': stormy on irc.mozilla.org, stormy on irc.gnome.org<br><br />
'''Languages''': English, Spanish<br><br />
[http://www.twitter.com/storming @storming on Twitter]<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/4ee1babee0 Melissa Shapiro] [https://twitter.com/shappy (@shappy)] ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
<br />
[[Image:Newheadshot.jpg|frame|left]] I head up Mozilla's global PR team and have been part of the Project since 2007. I worked with a few other open source communities prior to joining Mozilla (KDE - hi Stormy!, PHP). I'm a [http://www.flickr.com/photos/misdemeanor/6317471106/ Baltimore native] and an avid fan of John Waters, blue crabs, and the [http://www.flickr.com/photos/shutter-yid/225430309/ Bromoseltzer Tower]. I'm based out of Mozilla's [http://www.flickr.com/photos/misdemeanor/6317465120/in/photostream/ San Francisco] office. <br />
<br />
'''Email'''<br> [mailto:melissashapiro1@gmail.com melissashapiro1@gmail.com] <br />
<br />
'''IM/IRC'''<br> shappy on irc.mozilla.org<br> <br />
<br />
<br clear ="all"><br />
<br />
=== [https://mozillians.org/en-US/gavin Gavin Sharp] ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
<br />
[[Image:GavinSharp.jpg|frame|left]] <br />
I'm the owner of the [[Modules/Firefox|Firefox module]], and have been involved with the Mozilla project since 2004. I'm on IRC and in Bugzilla a fair bit. I'm Canadian, and I'm a [https://mozillians.org/en-US/gavin Mozillian].<br />
<br />
'''Email'''<br><br />
[mailto:gavin@gavinsharp.com gavin@gavinsharp.com]<br />
<br />
'''IM/IRC'''<br />
* gavin on irc.mozilla.org<br />
* [https://twitter.com/gavinsharp @gavinsharp]<br />
<br clear="all"><br />
<br />
=== Benjamin Smedberg ===<br />
==== Timezone: [http://www.time.gov/timezone.cgi?Eastern/d/-5 EST] ====<br />
<br />
[[Image:Bsmedberg-small.jpg|frame|left]] <br />
<br />
I started as a volunteer programmer with Mozilla in 2002 and became an employee in 2005. I have worked with diverse parts of the code including localization, the build system, the extension manager, the crash reporting system, and multi-process support. I was formerly a professional organist and choir director and have seven children.<br />
<br />
'''Website/Blog'''<br><br />
[http://benjamin.smedbergs.us/ http://benjamin.smedbergs.us/]<br />
<br />
'''Email'''<br><br />
[mailto:benjamin@smedbergs.us benjamin@smedbergs.us]<br />
<br />
'''IM/IRC'''<br><br />
bsmedberg on irc.mozilla.org<br><br />
bsmedberg on skype (normally offline)<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/16b1971f3d Mike Taylor] ([http://www.twitter.com/bear @bear]) ===<br />
==== Location: [http://g.co/maps/az2be Philadelphia, PA] Timezone: [http://www.time.gov/timezone.cgi?Eastern/d/-5 EST] ====<br />
[[Image:Mike-Taylor.jpg|frame|left]] <br />
Bear started out in 2006 as a volunteer working on Bonsai and Tinderbox(en) while he was the Build/Release Engineer at the Open Source Applications Foundation, and since 2010 he has continued as a Mozilla employee working on Build/Release and Release/Operations. He has been part of online communities since the days of CompuServe and BBSes and, having seen almost every variation of bad behaviour from both sides of the fence, would love to help folks enjoy the community.<br/><br />
He is a [https://mozillians.org/en-US/u/bear/ Mozillian].<br />
<br />
'''Email'''<br/><br />
[mailto:bear@bear.im bear@bear.im]<br />
<br />
'''IM/IRC'''<br/><br />
bear on irc.mozilla.org<br/><br />
bear on irc.freenode.net<br/><br />
bear@bear.im (xmpp)<br/><br />
bear42@gmail.com (gtalk)<br />
<br />
<br clear="all"><br />
<br />
=== [https://mozillians.org/en-US/u/a51deffc2e David Tenser] ===<br />
[[Image:Djst-mugshot.jpg|frame|left]] <br />
Hi! I'm [https://mozillians.org/en-US/u/a51deffc2e David Tenser] and I'm the director of User Support, also known as [http://support.mozilla.com SUMO]. I started to volunteer for Mozilla in 2001 and focused mainly on helping other people with Firefox -- and I still do today. Feel free to reach out anytime if you need any help!<br />
<br />
'''Email'''<br><br />
[mailto:djst@mozilla.com djst@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
IRC: djst<br><br />
<br />
<br clear="all"><br />
<br />
<br />
=== [https://mozillians.org/en-US/u/71ba821019 Daniel Veditz] ([http://www.twitter.com/dveditz @dveditz])===<br />
==== Location [http://g.co/maps/8k5bx Ben Lomond, CA] Timezone: [http://www.time.gov/timezone.cgi?Pacific/d/-8 PST] ====<br />
<br />
[[Image:Daniel-veditz.jpg|frame|left]] <br />
Chief Nervous Nellie: worrying about what bad guys are up to so I can protect Mozillians and our users. When not staring at screens I like tactile things: playing music, reading books made of paper, turning dirt into food.<br />
<br />
'''Email'''<br><br />
[mailto:dveditz@mozilla.com dveditz@mozilla.com]<br />
<br />
'''IM/IRC'''<br><br />
IRC: dveditz<br><br />
[http://www.twitter.com/dveditz @dveditz on Twitter]<br />
<br />
<br clear="all"></div>
Bear

https://wiki.mozilla.org/index.php?title=ReleaseEngineering&diff=452979 ReleaseEngineering 2012-07-20T21:35:17Z
<p>Bear: /* Team */</p>
<hr />
<div><div style="float: right; width: 50%; font-size: 80%"><br />
'''Quick Links For You'''<br />
* [http://tbpl.mozilla.org/ TBPL] - TinderBox Push Log<br />
* [http://bit.ly/cqaNWF File a new RelEng bug]<br />
* [[ReleaseEngineering:Maintenance | RelEng Changes]] - what did RelEng break?<br />
* [https://build.mozilla.org/ build.m.o] - external RelEng services<br />
</div><br />
__NOTOC__<br />
<br />
'''Quick Links For Us'''<br />
* [[ReleaseEngineering/How To|How Tos]]<br />
* [https://wiki.mozilla.org/ReleaseEngineering:Buildduty Build Duty]<br />
* [https://intranet.mozilla.org/RelEngWiki/index.php/Machine_Bookings Machine Bookings] (auth required)<br />
* [[ReleaseEngineering:Maintenance | Maintenance Page]] - pending and completed changes<br />
<br />
= Team =<br />
[https://mozillians.org/en-US/u/68bfc023cd John O'Duinn (joduinn)]<br />
* [https://mozillians.org/en-US/u/de053fdf40 Aki Sasaki (aki)], [https://mozillians.org/en-US/u/73f6ec084b Hal Wine (hwine)]<br />
<br />
[https://mozillians.org/en-US/u/6df79a9114 Chris Atlee (catlee)]<br />
* [https://mozillians.org/en-US/u/c924120121 Ben Hearsum (bhearsum)], [https://mozillians.org/en-US/u/e24b1b187f Lukas Blakk (lsblakk)], [https://mozillians.org/en-US/u/331bba28b0 Nick Thomas (nthomas)], [https://mozillians.org/en-US/u/b023f854c8 Rail Aliiev (rail)]<br />
<br />
[https://mozillians.org/en-US/u/bec21cdba6 Chris Cooper (coop)]<br />
* [https://mozillians.org/en-US/u/20ad2c73e3 Joey Armstrong (joey)], [https://mozillians.org/en-US/u/1a64e680b9 Armen Zambrano Gasparnian (armenzg)], [https://mozillians.org/en-US/u/kmoir Kim Moir (kmoir)], [https://mozillians.org/en-US/u/050cc87a8a John Hopkins (jhopkins)]<br />
<br />
Check out our [https://wiki.mozilla.org/ReleaseEngineering:Blogs blogs!]<br />
<br />
= Documentation =<br />
* [[ReleaseEngineering/How Tos|How Tos]]<br />
* [[ReleaseEngineering/Applications|Applications]] - Various applications and services that RelEng provides<br />
* [[ReleaseEngineering/Meeting Notes | Public Meeting Notes]]<br />
* Buildbot<br />
** Master<br />
*** [[ReleaseEngineering/Master Naming|Master Naming]]<br />
*** [[ReleaseEngineering/Master Setup|Master Setup]]<br />
*** [[ReleaseEngineering/Buildbot Best Practices|Buildbot Best Practices]]<br />
*** [[ReleaseEngineering/Upgrading Buildbot|Upgrading Buildbot]]<br />
*** [[ReleaseEngineering/Managing Buildbot with Fabric|Managing Buildbot with Fabric]]<br />
*** [[ReleaseEngineering/Preproduction|Preproduction]]<br />
*** [https://intranet.mozilla.org/RelEngWiki/index.php/Masters Masters] (authentication required)<br />
*** [[ReleaseEngineering/Landing Buildbot Master Changes|Landing Buildbot Master Changes]]<br />
*** [[ReleaseEngineering/Queue directories|Queue directories]]<br />
** Slave<br />
*** [[ReleaseEngineering/Buildslave Versions|Buildslave Versions]]<br />
*** [[ReleaseEngineering/Buildslave Startup Process|Buildslave Startup Process]]<br />
* [[ReferencePlatforms|Reference Platforms]]<br />
* [[ReleaseEngineering:RelEngITSharedDowntime | RelEng+IT shared downtime]]<br />
* [[ReleaseEngineering/DisposableProjectBranches | Disposable Project Branch Bookings]]<br />
* [[ReleaseEngineering/Bugzilla/Triage | Bug Triage]]<br />
* [[ReleaseEngineering:TestingTechniques|Methods for testing your changes]]<br />
* [[ReleaseEngineering:StagingMaster|How to work on staging master]]<br />
* [[Release:Release_Automation_on_Mercurial:Documentation|Release Automation on Mercurial]]<br />
** [https://intranet.mozilla.org/Build:Release:Primer Release Primer]<br />
*** [https://intranet.mozilla.org/User:Armenzg@mozilla.com:Release:Primer:Hg Release Primer DRAFT CVS and HG combined]<br />
* [[ReleaseEngineering/Applications#slavealloc|Slave Allocator]]<br />
* [[UpdateGeneration|Update Generation]]<br />
** [[ReleaseEngineering/PatcherTags|Patcher tags for release updates]]<br />
* [[ReleaseEngineering/Official Platform Support Checklist|Official Platform Support Checklist]]<br />
* [[ReleaseEngineering/TryserverAsBranch|Tryserver As Branch]]<br />
* [[ReleaseEngineering/TryServer | TryServer]]<br />
* [[ReleaseEngineering/TryChooser | TryChooser]]<br />
* Testing<br />
** [[ReleaseEngineering:GraphServer | Graph Server Notes]]<br />
** [[ReleaseEngineering:IntermittentErrors | Intermittent Errors]]<br />
* Configuration Management<br />
** [[ReleaseEngineering:Puppet | Puppet]] (current)<br />
** [[ReleaseEngineering/PuppetAgain | PuppetAgain]] (new)<br />
** [[ReleaseEngineering/OPSI|OPSI]]<br />
* [[Tinderbox_Push_Log | TBPL]] - Tinderbox Push Log<br />
* Bugzilla<br />
** [[ReleaseEngineering/Bugzilla/Flags|Flags]]<br />
** [[ReleaseEngineering/Bugzilla/Whiteboard|Whiteboard]]<br />
** [[ReleaseEngineering/Bugzilla/Triage|Triage]]<br />
* [[ReleaseEngineering/UsefulTricks|Useful Tricks]]<br />
* Python<br />
** [[ReleaseEngineering/Virtualenv|Virtualenv]] - How to set up and use python virtual environments<br />
* [https://intranet.mozilla.org/RelEngWiki/index.php/Deployed_binaries Deployed Binaries] (authentication required)<br />
* [[ReleaseEngineering:ProjectBranchPlanning|Project Branch Planning: how to request a new project branch]]<br />
*Tinderbox / BuildBot<br />
** [[ReleaseEngineering:Farm|Mozilla.org Build Farm Roster]]<br />
** [[Build:OutageReports|Outage Reports]]<br />
** [[Build:Tinderbox Setup|Tinderbox Setup]]<br />
** [[Build:ClobberingATinderbox|Clobbering Your Own Tinderbox or Unit Test Builds]]<br />
** [[Build:TryServer| "Try" server - test patches before checking in]]<br />
** [[Build:TinderboxErrors| Tinderbox Error/Warning Reference: Debugging red and orange on Tinderbox]]<br />
<br />
= Subpages =<br />
* [[ReleaseEngineering/Planning|Planning]]<br />
* [[ReleaseEngineering/Policies|Policies]]<br />
* [[ReleaseEngineering/Wiki Guidelines|Wiki Guidelines]] - how to wiki garden for fun and profit<br />
* [[ReleaseEngineering/Obsolete Pages]]<br />
* [[Special:PrefixIndex/{{FULLPAGENAME}}/]] - all subpages of [[{{FULLPAGENAME}}]] in the wiki<br />
<br />
= Work Weeks =<br />
* [[ReleaseEngineering/2011-Q3-Workweek| 2011 Q3 Work Week]]</div>
Bear

https://wiki.mozilla.org/index.php?title=User:Bear:AndroidTegraTodo&diff=452970 User:Bear:AndroidTegraTodo 2012-07-20T21:22:27Z
<p>Bear: Blanked the page</p>
<hr />
<div></div>
Bear

https://wiki.mozilla.org/index.php?title=CIDuty&diff=451874 CIDuty 2012-07-18T07:27:19Z
<p>Bear: /* Kitten */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
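For example, with a typical IRC client this amounts to something like the following (the nick shown is just an illustration; any nick with 'buildduty' appended works):<br />
<pre><br />
/nick yournick|buildduty<br />
/join #developers,#buildduty,#build,#mozbuild<br />
/join #mobile,#planning,#release-drivers,#ateam<br />
</pre><br />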
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize them appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, like {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiguring the Buildbot masters <b>every Monday and Thursday</b>, in their own local time. During this, buildduty needs to merge the default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step-by-step instructions]. Additional reconfigs can also be done at any time.<br />
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
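As a rough sketch only (the repository shown is an example; follow the step-by-step page linked above for the authoritative procedure), the default -> production merge looks like this for a config repository such as buildbot-configs:<br />
<pre><br />
# illustrative sketch - see the "Landing Buildbot Master Changes" page for the real steps<br />
hg clone https://hg.mozilla.org/build/buildbot-configs<br />
cd buildbot-configs<br />
hg update production<br />
hg merge default<br />
hg commit -m "Merge default -> production."<br />
hg push<br />
</pre><br />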
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have dev file it) and then poke in #ops noahm<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''MUST''' specify the branch, so that there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
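To tail the scheduler master, something along these lines works (the hostname and log path below are assumptions; substitute the actual try scheduler master and its master directory):<br />
<pre><br />
ssh buildbot-master10<br />
tail -f /builds/buildbot/*/master/twistd.log<br />
</pre><br />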
<br />
* If tryserver was just reset verify that [[ReleaseEngineering/How_To/Reset_the_Try_Server#Try_Hg_Poller_state|the scheduler has been reset]]<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is set up on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
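If you want the current tip changeset from the command line instead of the web UI, plain Mercurial can report it (no releng tooling assumed):<br />
<pre><br />
# prints the short changeset id of the remote tip of mozilla-central<br />
hg identify https://hg.mozilla.org/mozilla-central<br />
</pre><br />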
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
Update Procedure:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of which revision is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
<br />
== Kitten ==<br />
kitten.py is a command-line tool designed to make information gathering and basic host-management tasks easier. You can get information about a host and also request a reboot of it, all in one command.<br />
<br />
A buildduty environment has been created on Cruncher to make it easier to work with all of the briarpatch tools (of which kitten.py is one).<br />
<br />
sudo su - buildduty<br />
cd /home/buildduty/briarpatch<br />
. bin/activate<br />
<br />
From there you can run:<br />
<br />
python kitten.py <hostname><br />
<br />
Example output (lines numbers added for reference):<br />
<br />
1 talos-r3-xp-013: enabled<br />
2 farm: moz<br />
3 colo: scl1<br />
4 distro: winxp<br />
5 pool: tests-scl1-windows<br />
6 trustlevel: try<br />
7 master: bm16-tests1-windows<br />
8 fqdn: talos-r3-xp-013.build.scl1.mozilla.com.<br />
9 PDU?: False<br />
10 IPMI?: False<br />
11 reachable: True<br />
12 buildbot: running; active; job 1 minute ago<br />
13 tacfile: found<br />
14 lastseen: 1 minute ago<br />
15 master: buildbot-master16.build.scl1.mozilla.com<br />
<br />
1. hostname and its status according to slavealloc<br />
2. farm: aws or moz<br />
3. colo: which colo the host is located in (from slavealloc)<br />
4. distro: what OS distribution slavealloc lists<br />
5. pool: what build/test pool slavealloc lists<br />
6. trustlevel: the host's trustlevel per slavealloc<br />
7. master: the master that slavealloc lists for the host<br />
8. fqdn: the FQDN that was returned from the DNS lookup<br />
9. PDU?: does Inventory (or tegras.json) list a PDU for this host<br />
10. IPMI?: does a -mgmt DNS entry exist for this host<br />
11. Was briarpatch able to successfully ping and SSH to the host<br />
12. buildbot: the status of buildbot and what the last activity was<br />
13. tacfile: was a buildbot.tac file found<br />
14. lastseen: the timestamp of the last entry in twistd.log<br />
15. master: what the buildbot.tac file lists as the host's master<br />
<br />
Example of a host that cannot be reached:<br />
<br />
(production)[buildduty@cruncher production]$ python kitten.py -v talos-r3-xp-019<br />
talos-r3-xp-019: enabled<br />
farm: moz<br />
colo: scl1<br />
distro: winxp<br />
pool: tests-scl1-windows<br />
trustlevel: try<br />
master: bm15-tests1-windows<br />
fqdn: talos-r3-xp-019.build.scl1.mozilla.com.<br />
PDU?: False<br />
IPMI?: False<br />
ERROR Unable to control host remotely<br />
reachable: False<br />
buildbot: <br />
tacfile: <br />
lastseen: unknown<br />
master: <br />
error: current master is different than buildbot.tac master []<br />
<br />
The output up to the "ERROR" line shows all of the metadata for a host; if the host was reachable via SSH, the lines after it would show the details of the buildbot environment and its status.<br />
<br />
Kitten.py has the following commands:<br />
<br />
kitten.py [--info | -i ] [--reboot | -r] [--verbose | -v] [--debug]<br />
<br />
* --info will show only the metadata and will not try to SSH to the host<br />
* --reboot will try to gracefully shut down buildbot and reboot the host if it appears to be idle or hung<br />
* --verbose will show what SSH commands are being run<br />
* --debug shows everything --verbose shows and also displays the SSH output<br />
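For example, to check a host and have kitten.py gracefully shut down buildbot and reboot it if it looks idle or hung (reusing the host from the example above):<br />
<pre><br />
python kitten.py --reboot talos-r3-xp-013<br />
</pre><br />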
<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug to be the hostname<br />
* Add any depend bugs IT actions or the slave's issue<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
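For illustration, a row for a hypothetical Windows XP test slave might look like the following (every value here is made up; copy the conventions from an existing slave in the same silo rather than trusting this sketch):<br />
<pre><br />
talos-r3-xp-099,/builds/slave,winxp,32,tests,scl1,try,mid,prod,tests-scl1-windows<br />
</pre><br />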
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';  # check what will be removed first<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for a list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows current ignores)<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
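For example, if a burst of scl1 talos alerts is drowning out everything else while you work the underlying problem, an ignore/unignore cycle with the bot might look like this (the pattern is only an illustration):<br />
<pre><br />
nagios: ignore talos-r3-.*\.build\.scl1<br />
(deal with the problem, and scan the unacknowledged-problems pages above in the meantime)<br />
nagios: unignore talos-r3-.*\.build\.scl1<br />
</pre><br />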
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This is normally due to releases. We might have to bump the threshold.<br />
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note''' because a change to the Talos bundle always causes changes in the baseline times, the following should be done for *any* change...<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is not scary anymore, since we no longer replace the file in place and the a-team has to land a change in the tree to pick up the new version.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -Otalos.bug#.cset.zip <whatever>talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: unlike talos.zip, we overwrite this file in place since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
<br />
For talos.zip changes: once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
# for the production foopies (05-24, excluding foopy21)<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
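To double-check what each foopy ended up on, you can read back the working-directory revision in the same csshX session (just a convenience check, not part of the official procedure):<br />
<pre><br />
cd /builds/talos-data/talos<br />
hg id -i<br />
</pre><br />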
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code> (see the example fetch after this list)<br />
** You can see all build and test jobs for a certain branch for a certain revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef])<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
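For example, to fetch one of those reports as JSON from a script (a hypothetical invocation: the revision is the sample one linked above, and it assumes the <code>format</code> parameter is honoured on the revision report as well):<br />
<pre><br />
curl 'http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef?format=json'<br />
</pre><br />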
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know off the top of your head<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug to the RelEng component (connected to bug in point #1) to track the re-imaging and returning of the slave to its pool when it's returned -- I've been asking the dev to please comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
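A minimal sketch of breaking that reboot cycle by hand, assuming you can get a shell in between reboots (the hostname is only an example, and you may need root to kill the init-run script):<br />
<pre><br />
ssh root@linux-ix-slave07.build.mtv1.mozilla.com<br />
# CentOS slaves:<br />
pkill -f S98puppet<br />
# Fedora slaves:<br />
pkill -f run-puppet-and-slave.sh<br />
</pre><br />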
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only" yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* if you see that a host is reporting to ganglia in an incorrect manner, it might just take the following to fix it (e.g. {{bug|674233}}):<br />
switch to root, service gmond restart<br />
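That is, on the misbehaving host (assuming you already have an SSH session there):<br />
<pre><br />
su -                     # switch to root<br />
service gmond restart<br />
</pre><br />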
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
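A rough sketch of the retry, assuming the usual dead/ and new/ layout described on that page (the working directory is deliberately left out here -- find the master's queue directory from the wiki page or the alert itself rather than trusting a hard-coded path):<br />
<pre><br />
# on the alerting master, inside its command queue directory<br />
ls dead/<br />
# read the logs for the dead item, fix the underlying issue, then:<br />
mv dead/<item>.json new/     # move only the .json file, nothing else<br />
</pre><br />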
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=451863CIDuty2012-07-18T07:17:00Z<p>Bear: /* Kitten */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, for example {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiging the Buildbot masters <b>every Monday and Thursday</b>, their time. During this, buildduty needs to merge default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. It is also valid to do other additional reconfigs anytime you want.<br />
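A rough sketch of the default -> production merge, assuming a local clone of buildbot-configs (the repository name and commit message are assumptions; follow the linked wiki page for the canonical steps):<br />
<pre><br />
# sketch only - see the Landing Buildbot Master Changes page for the real procedure<br />
hg pull<br />
hg update production<br />
hg merge default<br />
hg commit -m "Merge default -> production"<br />
hg push<br />
</pre><br />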
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have the dev file it) and then poke noahm in #ops<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''MUST''' specify the branch, so there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
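To tail the scheduler master, something like the following works (the master directory below is an assumption and depends on that master's install layout):<br />
<pre><br />
ssh buildbot-master10<br />
tail -f /builds/buildbot/<master-dir>/master/twistd.log   # exact directory is an assumption<br />
</pre><br />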
<br />
* If tryserver was just reset verify that [[ReleaseEngineering/How_To/Reset_the_Try_Server#Try_Hg_Poller_state|the scheduler has been reset]]<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is set up on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
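For example, a quick check on cruncher using the paths above:<br />
<pre><br />
tail -n 20 /home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log<br />
cat /home/lsblakk/autoland/tools/scripts/autoland/try_cache<br />
</pre><br />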
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build and use the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
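One quick way to get an explicit tip revision to paste into the form (a sketch; any Mercurial client will do):<br />
<pre><br />
# prints the remote tip changeset id of mozilla-central<br />
hg identify http://hg.mozilla.org/mozilla-central<br />
</pre><br />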
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0 (a sketch follows this list).<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
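A heavily hedged sketch of the size-0 check and fix on aus3-staging (all paths below are hypothetical; the bugs above show the real snippet locations):<br />
<pre><br />
# paths are hypothetical - use the real snippet directory from the bugs above<br />
cd /opt/aus2/snippets/staging<br />
find . -name complete.txt -size 0                              # list size-0 snippets<br />
cp <good-buildid>/complete.txt <broken-buildid>/complete.txt   # the most recent buildid is allowed to be size 0<br />
</pre><br />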
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
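If the update turns out to be bad, a revert sketch using the revision noted above:<br />
<pre><br />
# roll back to the revision noted before the pull, then re-sync the other webhosts<br />
hg update -r <previous-revision><br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />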
<br />
Keep track of which revision is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
<br />
== Kitten ==<br />
kitten.py is a command-line tool that makes information gathering and basic host-management tasks easier. You can get information about a host and also request a reboot, all in one command.<br />
<br />
A buildduty environment has been created on Cruncher to make it easier to work with all of the briarpatch tools (of which kitten.py is one).<br />
<br />
sudo su - buildduty<br />
cd /home/buildduty/production<br />
. bin/activate<br />
<br />
From there you can run:<br />
<br />
python kitten.py <hostname><br />
<br />
Example output (line numbers added for reference):<br />
<br />
1 talos-r3-xp-013: enabled<br />
2 farm: moz<br />
3 colo: scl1<br />
4 distro: winxp<br />
5 pool: tests-scl1-windows<br />
6 trustlevel: try<br />
7 master: bm16-tests1-windows<br />
8 fqdn: talos-r3-xp-013.build.scl1.mozilla.com.<br />
9 PDU?: False<br />
10 IPMI?: False<br />
11 reachable: True<br />
12 buildbot: running; active; job 1 minute ago<br />
13 tacfile: found<br />
14 lastseen: 1 minute ago<br />
15 master: buildbot-master16.build.scl1.mozilla.com<br />
<br />
1. hostname and its status according to slavealloc<br />
2. farm: aws or moz<br />
3. colo: what colo the host is located in (from slavealloc)<br />
4. distro: what OS distribution slavealloc lists<br />
5. pool: what build/test pool slavealloc lists<br />
6. trustlevel: the host's trustlevel per slavealloc<br />
7. master: the master that slavealloc lists for the host<br />
8. fqdn: the FQDN that was returned from the DNS lookup<br />
9. PDU?: does Inventory (or tegras.json) list a PDU for this host<br />
10. IPMI?: does a -mgmt DNS entry exist for this host<br />
11. reachable: was briarpatch able to successfully ping and SSH to the host<br />
12. buildbot: the status of buildbot and what the last activity was<br />
13. tacfile: was a buildbot.tac file found<br />
14. lastseen: the timestamp of the last entry in twistd.log<br />
15. master: what the buildbot.tac file lists as the host's master<br />
<br />
Example of a host that cannot be reached:<br />
<br />
(production)[buildduty@cruncher production]$ python kitten.py -v talos-r3-xp-019<br />
talos-r3-xp-019: enabled<br />
farm: moz<br />
colo: scl1<br />
distro: winxp<br />
pool: tests-scl1-windows<br />
trustlevel: try<br />
master: bm15-tests1-windows<br />
fqdn: talos-r3-xp-019.build.scl1.mozilla.com.<br />
PDU?: False<br />
IPMI?: False<br />
ERROR Unable to control host remotely<br />
reachable: False<br />
buildbot: <br />
tacfile: <br />
lastseen: unknown<br />
master: <br />
error: current master is different than buildbot.tac master []<br />
<br />
The output up to the "ERROR" line shows all of the metadata for a host, and if the host was reachable via SSH the lines after would show the details of the buildbot environment and its status.<br />
<br />
kitten.py has the following options:<br />
<br />
kitten.py [--info | -i ] [--reboot | -r] [--verbose | -v] [--debug]<br />
<br />
* --info will show only the metadata and will not try to SSH to the host<br />
* --reboot will try to do a graceful shutdown of buildbot and reboot the host if it appears to be idle or hung<br />
* --verbose will show what SSH commands are being run<br />
* --debug shows everything --verbose shows and also displays the SSH output<br />
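For example, based on the options above:<br />
<pre><br />
python kitten.py -i talos-r3-xp-013    # metadata only, no SSH<br />
python kitten.py -r talos-r3-xp-013    # graceful buildbot, then reboot if idle or hung<br />
</pre><br />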
<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug the hostname<br />
* Add any dependent bugs (IT actions or the slave's underlying issue)<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
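A hypothetical one-row CSV matching those headers (every value below is made up for illustration):<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
talos-r3-xp-099,C:\talos-slave\test,winxp,32,tests,scl1,try,0,prod,tests-scl1-windows<br />
</pre><br />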
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows ignores)<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This is normally due to releases. We might have to bump the threshold.<br />
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes section has grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note:''' because a change to the Talos bundle always causes changes in the baseline times, the following should be done for *any* change:<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is no longer risky, since we no longer replace the file in place and the a-team has to land a change in the tree.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -O talos.bug#.cset.zip <whatever>/talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: We override it unlike with talos.zip since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
<br />
For talos.zip changes: Once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
# for foopy05-24 (foopy21 excluded)<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code><br />
** You can see all build and test jobs for a certain branch and revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef]); see the example after this list<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
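For example, to fetch the JSON version of the revision report mentioned above (a sketch; use & instead of ? if the URL already has query parameters):<br />
<pre><br />
curl 'http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef?format=json'<br />
</pre><br />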
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know it off the top of your head:<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug to the RelEng component (connected to bug in point #1) to track the re-imaging and returning of the slave to its pool when it's returned -- I've been asking the dev to please comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
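A quick sketch of stopping the cycle by hand (the exact process name differs per platform, as noted above):<br />
<pre><br />
# on the looping slave; process names come from the paragraph above<br />
ps aux | grep -E 'S98puppet|run-puppet-and-slave'<br />
kill <pid>    # stops the reboot cycle so you can investigate the puppet failure<br />
</pre><br />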
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only" yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* if you see that a host is reporting to ganglia in an incorrect manner, it might just take this to fix it (e.g. {{bug|674233}}):<br />
switch to root, service gmond restart<br />
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
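A minimal sketch of the retry step, assuming a typical command-queue layout (the path and <item> placeholder are assumptions; the Queue directories page has the real details):<br />
<pre><br />
# on the affected master; the queue path is an assumption - see the Queue directories page<br />
cd /dev/shm/queue/commands<br />
ls dead/                    # inspect the dead items and their logs first<br />
mv dead/<item>.json new/    # retry by moving *only* the json file to the "new" queue<br />
</pre><br />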
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=451709CIDuty2012-07-17T20:22:12Z<p>Bear: /* Kitten */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, for example {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiging the Buildbot masters <b>every Monday and Thursday</b>, their time. During this, buildduty needs to merge default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. It is also valid to do other additional reconfigs anytime you want.<br />
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have the dev file it) and then poke noahm in #ops<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''MUST''' specify the branch, so there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
<br />
* If tryserver was just reset verify that [[ReleaseEngineering/How_To/Reset_the_Try_Server#Try_Hg_Poller_state|the scheduler has been reset]]<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is set up on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build and use the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of which revision is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
<br />
== Kitten ==<br />
kitten.py is a command-line tool that makes information gathering and basic host-management tasks easier. You can get information about a host and also request a reboot, all in one command.<br />
<br />
A buildduty environment has been created on Cruncher to make it easier to work with all of the briarpatch tools (of which kitten.py is one).<br />
<br />
sudo su - buildduty<br />
cd /home/buildduty/production<br />
. bin/activate<br />
<br />
From there you can run:<br />
<br />
python kitten.py <hostname><br />
<br />
For example (run with -v, against a host that is not reachable):<br />
<br />
(production)[buildduty@cruncher production]$ python kitten.py -v talos-r3-xp-019<br />
ERROR socket error establishing ssh connection<br />
Traceback (most recent call last):<br />
File "/home/buildduty/production/briar-patch/releng/remote.py", line 151, in __init__<br />
self.client.connect(self.fqdn, username=remoteEnv.sshuser, password=remoteEnv.sshPassword, allow_agent=False, look_for_keys=True)<br />
File "/home/buildduty/production/lib/python2.6/site-packages/ssh/client.py", line 296, in connect<br />
sock.connect(addr)<br />
File "<string>", line 1, in connect<br />
error: [Errno 111] Connection refused<br />
talos-r3-xp-019: enabled<br />
farm: moz<br />
colo: scl1<br />
distro: winxp<br />
pool: tests-scl1-windows<br />
trustlevel: try<br />
master: bm15-tests1-windows<br />
fqdn: talos-r3-xp-019.build.scl1.mozilla.com.<br />
PDU?: False<br />
IPMI?: False<br />
ERROR Unable to control host remotely<br />
reachable: False<br />
buildbot: <br />
tacfile: <br />
lastseen: unknown<br />
master: <br />
error: current master is different than buildbot.tac master []<br />
<br />
The output up to the "ERROR" line shows all of the metadata for a host, and if the host was reachable via SSH the lines after would show the details of the buildbot environment and its status.<br />
<br />
kitten.py has the following options:<br />
<br />
kitten.py [--info | -i ] [--reboot | -r] [--verbose | -v] [--debug]<br />
<br />
* --info will show only the metadata and will not try to SSH to the host<br />
* --reboot will try to do a graceful shutdown of buildbot and reboot the host if it appears to be idle or hung<br />
* --verbose will show what SSH commands are being run<br />
* --debug shows everything --verbose shows and also displays the SSH output<br />
<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug the hostname<br />
* Add any dependent bugs (IT actions or the slave's underlying issue)<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows ignores)<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This (the hg.mozilla.org socket timeout) is normally due to release load. We might have to bump the threshold.<br />
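A quick manual check to tell "slow under release load" from "actually down" (any URL on hg.mozilla.org will do; curl assumed to be available wherever you run this):<br />
<pre><br />
curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' https://hg.mozilla.org/mozilla-central<br />
</pre><br />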
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes documentation has grown quite large and now lives on its own page. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note:''' because a change to the Talos bundle always changes the baseline times, the following should be done for *any* change:<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is no longer scary: we no longer replace the file in place, and the a-team has to land a change in the tree before the new zip is used.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -O talos.bug#.cset.zip <whatever>/talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: We override it unlike with talos.zip since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
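Before notifying anyone, you can sanity-check that the new zip is actually being served (this assumes /var/www/html on build.mozilla.org maps to http://build.mozilla.org/ -- adjust the URL if not):<br />
<pre><br />
curl -I http://build.mozilla.org/build/talos/zips/talos.bug#.cset.zip<br />
</pre><br />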
<br />
For talos.zip changes: once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
# for foopy05-20 and foopy22-24<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
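To confirm every foopy ended up on the same revision, something like the following (run in the same csshX session) should be enough -- each window should print the same changeset hash:<br />
<pre><br />
hg -R /builds/talos-data/talos identify<br />
</pre><br />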
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code><br />
** You can see all build and test jobs for a certain branch and revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef])<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know it off the top of your head:<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug in the RelEng component (connected to the bug in point #1) to track re-imaging the slave and returning it to its pool when it comes back -- ask the dev to comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 failed attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
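A minimal sketch of breaking the cycle by hand (hostname is just an example; the ssh user is assumed to be cltbld as elsewhere on this page):<br />
<pre><br />
ssh cltbld@linux-ix-slave07.build.mtv1.mozilla.com<br />
# find the puppet wrapper that keeps rebooting the box ...<br />
ps auxww | grep -E 'S98puppet|run-puppet-and-slave' | grep -v grep<br />
# ... kill it, then work out why puppet is failing<br />
kill <pid><br />
</pre><br />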
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* if you see that a host is reporting to ganglia incorrectly, it might just take the following to fix it (e.g. {{bug|674233}}):<br />
switch to root, service gmond restart<br />
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
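As a rough sketch of that flow (the queue location and item names below are assumptions -- the real paths are on the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] page):<br />
<pre><br />
ssh buildbot-master12.build.scl1.mozilla.com<br />
cd <queue-dir>              # assumption: the command queue directory for this master<br />
ls dead/                    # each dead item has a .json (the command) plus a log<br />
less dead/<item>.log        # find and fix the underlying failure first<br />
mv dead/<item>.json new/    # then retry by moving *only* the json file<br />
</pre><br />
<br />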
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=451708CIDuty2012-07-17T20:19:59Z<p>Bear: /* Kitten */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, for example {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiguring the Buildbot masters <b>every Monday and Thursday</b>, in their local time. During this, buildduty needs to merge the default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. Additional reconfigs can be done at any time.<br />
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have the dev file it) and then poke noahm in #ops<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''must''' specify the branch, so there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
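The "tail the scheduler master" part isn't shown above; a minimal sketch (the master's base directory is an assumption -- adjust for the scheduler master in question; the revision is the one from the sendchange):<br />
<pre><br />
# on the scheduler master<br />
tail -f /builds/buildbot/<scheduler-master-dir>/master/twistd.log | grep -i 923103d5a656<br />
</pre><br />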
<br />
* If tryserver was just reset verify that [[ReleaseEngineering/How_To/Reset_the_Try_Server#Try_Hg_Poller_state|the scheduler has been reset]]<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is set up on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
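To get an explicit revision to plug into the force build, you can ask the remote repo directly (sketch):<br />
<pre><br />
# prints the current tip changeset of mozilla-central<br />
hg identify http://hg.mozilla.org/mozilla-central<br />
</pre><br />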
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of which revision is being run.<br />
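If the update turns out to be bad, a rollback sketch using the revision you noted with hg id earlier (shown here as a placeholder):<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
hg update -r <previous-rev><br />
# push the rolled-back tree out to the other two webhosts again<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />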
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
<br />
== Kitten ==<br />
kitten.py is a command line tool designed to make information gathering and basic host management tasks easier. You can get information about a host and also request a reboot, all in one command.<br />
<br />
A buildduty environment has been created on Cruncher to make it easier to work with all of the briarpatch tools (of which kitten.py is one).<br />
<br />
sudo su - buildduty<br />
cd /home/buildduty/production<br />
. bin/activate<br />
<br />
From there you can run:<br />
<br />
python kitten.py <hostname><br />
<br />
For example:<br />
<br />
(production)[buildduty@cruncher production]$ python kitten.py -v talos-r3-xp-019<br />
ERROR socket error establishing ssh connection<br />
Traceback (most recent call last):<br />
File "/home/buildduty/production/briar-patch/releng/remote.py", line 151, in __init__<br />
self.client.connect(self.fqdn, username=remoteEnv.sshuser, password=remoteEnv.sshPassword, allow_agent=False, look_for_keys=True)<br />
File "/home/buildduty/production/lib/python2.6/site-packages/ssh/client.py", line 296, in connect<br />
sock.connect(addr)<br />
File "<string>", line 1, in connect<br />
error: [Errno 111] Connection refused<br />
talos-r3-xp-019: enabled<br />
farm: moz<br />
colo: scl1<br />
distro: winxp<br />
pool: tests-scl1-windows<br />
trustlevel: try<br />
master: bm15-tests1-windows<br />
fqdn: talos-r3-xp-019.build.scl1.mozilla.com.<br />
PDU?: False<br />
IPMI?: False<br />
ERROR Unable to control host remotely<br />
reachable: False<br />
buildbot: <br />
tacfile: <br />
lastseen: unknown<br />
master: <br />
error: current master is different than buildbot.tac master []<br />
<br />
The output up to the "ERROR" line shows all of the metadata for a host; if the host had been reachable via SSH, the lines after it would show the details of the buildbot environment and its status.<br />
<br />
Kitten.py has the following commands:<br />
<br />
kitten.py [--info | -i ] [--reboot | -r]<br />
<br />
where --info will show only the metadata and will not try to SSH to the host, and --reboot will try to gracefully shut down the buildbot slave and reboot the host if it appears to be idle or hung.<br />
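For example, from the buildduty environment described above (assuming the flags combine with a hostname the same way as the earlier invocation):<br />
<pre><br />
# metadata only, no SSH to the host<br />
python kitten.py --info talos-r3-xp-019<br />
# graceful the buildbot slave and reboot the host if it looks idle or hung<br />
python kitten.py --reboot talos-r3-xp-019<br />
</pre><br />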
<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug to be the hostname<br />
* Add any bugs it depends on (IT action bugs or the slave's underlying issue)<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
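For illustration only, a row for one of the test slaves mentioned above might look like this (basedir, bitlength, purpose, speed and environment are guesses here -- copy them from an existing row of the same silo instead):<br />
<pre><br />
talos-r3-xp-019,C:\talos-slave,winxp,32,tests,scl1,try,mini,prod,tests-scl1-windows<br />
</pre><br />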
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=451706CIDuty2012-07-17T20:14:01Z<p>Bear: /* Slave Maintenance */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, for example {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiguring the Buildbot masters <b>every Monday and Thursday</b>, in their local time. During this, buildduty needs to merge the default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. Additional reconfigs can be done at any time.<br />
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have the dev file it) and then poke noahm in #ops<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''must''' specify the branch, so there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
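<br />
To tail the scheduler master, watch its twistd.log while the sendchange goes through. The master's base directory varies from host to host, so the path below is only illustrative:<br />
<pre><br />
# illustrative only - the actual master directory differs per host<br />
ssh cltbld@buildbot-master10<br />
tail -f /builds/buildbot/<master_dir>/master/twistd.log<br />
</pre><br />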
<br />
* If tryserver was just reset verify that [[ReleaseEngineering/How_To/Reset_the_Try_Server#Try_Hg_Poller_state|the scheduler has been reset]]<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is set up on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the buildapi revision report (see Useful Links below) to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
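<br />
For a quick check from a shell on cruncher, something like the following works (paths taken from the list above):<br />
<pre><br />
tail -n 20 /home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log<br />
cat /home/lsblakk/autoland/tools/scripts/autoland/try_cache<br />
</pre><br />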
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is the update procedure:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of which revision is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
<br />
== Kitten ==<br />
<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug the slave's hostname<br />
* Add any dependent bugs for IT actions or for the slave's underlying issue<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
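<br />
For illustration, a hypothetical row matching those headers (every value below is made up - use real distro, datacenter, and pool names from the existing slavealloc entries):<br />
<pre><br />
talos-r3-fed-099,/builds/slave,fedora12,32,tests,scl1,try,fast,prod,tests-scl1<br />
</pre><br />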
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows current ignores)<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This is normally due to releases. We might have to bump the threshold.<br />
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes section grew quite large, so it now lives on its own page. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note:''' because a change to the Talos bundle always shifts the baseline times, the following should be done for *any* change:<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is no longer scary: we don't replace the file in place anymore, and the a-team has to land a change in the tree to pick up the new version.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -O talos.bug#.cset.zip <whatever>/talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: unlike talos.zip, we overwrite pageloader.xpi in place since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
<br />
For talos.zip changes: once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
# for foopy05-24 (foopy21 is not in the list)<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code><br />
** You can see all build and test jobs for a certain branch and revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef])<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know it off the top of your head:<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug in the RelEng component (linked to the bug from step 1) to track re-imaging the slave and returning it to its pool once the loan is over -- ask the dev to comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
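<br />
A minimal sketch of breaking that reboot cycle (the process names come from the paragraph above; the slave name is just an example):<br />
<pre><br />
ssh root@linux-ix-slave07<br />
# CentOS slaves:<br />
pkill -f S98puppet<br />
# Fedora slaves:<br />
pkill -f run-puppet-and-slave.sh<br />
</pre><br />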
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* If a host is reporting to ganglia incorrectly, restarting gmond is often all it takes (e.g. {{bug|674233}}):<br />
switch to root, then run: service gmond restart<br />
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
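<br />
As a rough sketch of the retry step (the queue location below is an assumption - confirm the real paths on the Queue directories page before moving anything):<br />
<pre><br />
ssh cltbld@buildbot-master12.build.scl1<br />
cd /dev/shm/queue/commands        # assumed location of the command queue<br />
ls dead/                          # inspect the dead items and their logs first<br />
mv dead/<item>.json new/          # retry only the .json file once the underlying issue is fixed<br />
</pre><br />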
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=444505CIDuty2012-06-22T14:51:34Z<p>Bear: /* Deploy new tegra-host-utils.zip */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's now to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in releng-triage query at *least* a few times each day. Doesnt mean you have to *fix* them all immediately, finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems and categorize appropriately. Sometimes we get urgent security bugs here, which need to be jumped on immediately, like example {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT for send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiging the Buildbot masters <b>every Monday and Thursday</b>, their time. During this, buildduty needs to merge default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. It is also valid to do other additional reconfigs anytime you want.<br />
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have dev file it) and then poke in #ops noahm<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that the '''YOU MUST''' specify the branch, so there's no null keys in the builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is setup on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to do initiate this build and use the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
<br />
== Update mobile talos webhosts ==<br />
We have a balance loader (bm-remote) that is in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
Update Procedure:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of what revisions is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a balance loader.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/tegra<br />
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/<br />
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills up few needed tags and priority<br />
* Make the subject and alias of the bug to be the hostname<br />
* Add any depend bugs IT actions or the slave's issue<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows ignores<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This is normally due to releases. We might have to bump the threshold.<br />
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note''' because a change to the Talos bundle always causes changes in the baseline times, the following should be done for *any* change...<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is not scary anymore as we don't replace the file anymore and the a-team has to land a change in the tree.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -Otalos.bug#.cset.zip <whatever>talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: We override it unlike with talos.zip since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
<br />
For taloz.zip changes: Once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
# for foopy05-11<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code><br />
** You cam see all build and test jobs for a certain branch for a certain revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef])<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know off the top of your head<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug to the RelEng component (connected to bug in point #1) to track the re-imaging and returning of the slave to its pool when it's returned -- I've been asking the dev to please comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add release@ alias - people get updates but not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* if you see that a host is reporting to ganglia in an incorrect manner it might just take this to fix it (e.g. {{bug|674233}}:<br />
switch to root, service gmond restart<br />
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=CIDuty&diff=444504CIDuty2012-06-22T14:49:44Z<p>Bear: /* Updating talos for Tegras */</p>
<hr />
<div>'''Looking for who is on buildduty?''' - check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]<br /><br />
'''Buildduty not around?''' - please [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Release%20Engineering open a bug]<br />
<br />
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."<br />
<br />
Here's how to do it.<br />
<br />
__TOC__<br />
<br />
= Schedule =<br />
Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])<br />
<br />
= General Duties =<br />
== How should I make myself available for duty? ==<br />
* Add 'buildduty' to your IRC nick<br />
* be in at least #developers, #buildduty and #build (as well as #mozbuild of course)<br />
** also useful to be in #mobile, #planning, #release-drivers, and #ateam<br />
* watch http://tbpl.mozilla.org<br />
<br />
== What else should I take care of? ==<br />
You will need to<br />
* Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation.<br />
* Keep [https://wiki.mozilla.org/ReleaseEngineering:Maintenance wiki.m.o/ReleaseEngineering:Maintenance] up to date with any significant changes<br />
<br />
You should keep on top of<br />
* pending builds - available in [http://build.mozilla.org/builds/pending/ graphs] or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.<br />
* all bugs tagged with [https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;list_id=2941844;status_whiteboard=%5Bbuildduty%5D;;resolution=---;product=mozilla.org buildduty] in the whiteboard (make a saved search)<br />
* The [https://bugzilla.mozilla.org/buglist.cgi?priority=--&columnlist=bug_severity%2Cpriority%2Cop_sys%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cstatus_whiteboard&resolution=---&resolution=DUPLICATE&emailtype1=exact&query_based_on=releng-triage&emailassigned_to1=1&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=nobody%40mozilla.org&component=Release%20Engineering&component=Release%20Engineering%3A%20Custom%20Builds&product=mozilla.org&known_name=releng-triage releng-triage search] - part of buildduty is leaning on your colleagues to take bugs<br />
** the BuildDuty person needs to go through all the bugs in the releng-triage query at *least* a few times each day. That doesn't mean you have to *fix* them all immediately; finding other owners is part of triaging the queue. However, you do need to at least *see* them, know if there are any urgent problems, and categorize them appropriately. Sometimes we get urgent security bugs here which need to be jumped on immediately, e.g. {{bug|635638}}<br />
* Bum slaves - you should see to it that bum slaves aren't burning builds, and that all slaves are tracked on their way back to operational status<br />
** Check the [https://bugzilla.mozilla.org/buglist.cgi?list_id=2938171;resolution=---;status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=%5Bhardware%5D;;product=mozilla.org hardware] whiteboard tag, too, for anything that slipped between the cracks.<br />
** See the sections below on [[#Requesting Reboots]]<br />
* Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp])<br />
** '''wait times''' - either [https://build.mozilla.org/buildapi/reports/waittimes this page] or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)<br />
** there is a cronjob in anamarias' account on cruncher that runs this for each pool:<br />
/usr/local/bin/python $HOME/buildapi/buildapi/scripts/mailwaittimes.py \<br />
-S smtp.mozilla.org \<br />
-f nobody@cruncher.build.mozilla.org \<br />
-p testpool \<br />
-W http://cruncher.build.mozilla.org/buildapi/reports/waittimes \<br />
-e $(date -d "$(date +%Y-%m-%d)" +%s) -t 10 -z 10 \<br />
-a dev-tree-management@lists.mozilla.org<br />
<br />
* You may need to plan a reconfig or a full downtime<br />
** Reconfigs: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-reconfig releng-needs-reconfig broken query] to see what's pending. Reconfigs can be done at any time. <br />
** Downtimes: look at [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=releng-needs-treeclosure releng-needs-treeclosure broken query] to see what's pending. Coordinate with Zandr and IT to send downtime notices with enough advance notice. <br />
<br />
You will also be responsible for coordinating master reconfigs - see the releng-needs-reconfig search.<br />
<br />
== Scheduled Reconfigs ==<br />
Buildduty is responsible for reconfiging the Buildbot masters <b>every Monday and Thursday</b>, their time. During this, buildduty needs to merge default -> production branches and reconfig the affected masters. [https://wiki.mozilla.org/ReleaseEngineering/Landing_Buildbot_Master_Changes This wiki page has step by step instructions]. It is also valid to do other additional reconfigs anytime you want.<br />
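A rough sketch of the default -> production merge for one of the repos (buildbot-configs shown here); the exact branch names and any per-repo extras are covered by the wiki page linked above, which remains the authoritative procedure:<br />
<pre><br />
# in an up-to-date clone of buildbot-configs<br />
hg pull<br />
hg update production<br />
hg merge default<br />
hg commit -m "merge default -> production"<br />
hg push<br />
</pre><br />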
<br />
If the reconfig gets stuck, see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master How To/Unstick a Stuck Slave From A Master].<br />
<br />
You should [https://wiki.mozilla.org/ReleaseEngineering/Managing_Buildbot_with_Fabric use Fabric to do the reconfig!]<br />
<br />
The person doing reconfigs should also update https://wiki.mozilla.org/ReleaseEngineering:Maintenance#Reconfigs_.2F_Deployments<br />
<br />
= Tree Maintenance =<br />
== Repo Errors ==<br />
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:<br />
* File a bug (or have the dev file it) and then poke noahm in #ops<br />
** If he doesn't respond, then escalate the bug to page on-call<br />
* Follow the steps below for "How do I close the tree"<br />
== How do I see problems in TBPL? ==<br />
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.<br />
== How do I close the tree? ==<br />
See [[ReleaseEngineering/How_To/Close_or_Open_the_Tree]]<br />
<br />
== How do I claim a rentable project branch? ==<br />
See [[ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE]]<br />
<br />
= Re-run jobs =<br />
== How to trigger Talos jobs ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-trigger all Talos runs for a build (by using sendchange) ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== How to re-run a build ==<br />
Do ''not'' go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.<br />
<br />
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you '''MUST''' specify the branch, so there are no null keys in builds-running.js.<br />
<br />
= Try Server =<br />
== Jobs not scheduled at all? ==<br />
Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare it to make sure it is correct.<br />
<br />
Then do a sendchange and tail the scheduler master:<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
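If you want to watch the scheduler master while the sendchange lands, tailing its twistd.log works. The basedir below is only an assumption - use the real basedir of the master you are on:<br />
<pre><br />
ssh buildbot-master10<br />
# master basedir is an assumption; adjust to the scheduler master's actual basedir<br />
tail -f /builds/buildbot/scheduler1/master/twistd.log<br />
</pre><br />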
<br />
<br />
== How do I trigger additional talos/test runs for a given try build? ==<br />
see [[ReleaseEngineering/How_To/Trigger_Talos_Jobs]]<br />
<br />
== Using the TryChooser to submit build/test requests ==<br />
<br />
buildduty can also use the same [https://wiki.mozilla.org/Build:TryChooser TryChooser] syntax as developers use to (re)submit build and testing requests. Here is an example:<br />
<br />
<pre><br />
buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit<br />
</pre><br />
== How do I cancel existing jobs? ==<br />
<br />
The cancellator.py script is setup on pm02. Here is a standard example:<br />
<br />
# Dry run first to see what would be cancelled. <br />
python cancellator.py -b try -r 5ff84b660e90<br />
# Same command run again with the force option specified (--yes-really) to actually cancel the builds<br />
python cancellator.py -b try -r 5ff84b660e90 --yes-really<br />
<br />
The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.<br />
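One way to double-check, using the buildapi revision report described under "Useful Links" below (the branch and revision here are just the ones from the example above; drop ?format=json for the HTML view):<br />
<pre><br />
# list every job buildapi knows about for this branch/revision before cancelling anything<br />
curl 'http://cruncher.build.mozilla.org/buildapi/revision/try/5ff84b660e90?format=json'<br />
</pre><br />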
== Bug Commenter ==<br />
This is on cruncher and is run in a crontab in lsblakk's account:<br />
source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \<br />
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v<br />
<br />
You can see quickly if things are working by looking at:<br />
/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log # this shows what's been posted lately<br />
/home/lsblakk/autoland/tools/scripts/autoland/try_cache # this shows what the script thinks is 'pending' completion<br />
<br />
= Nightlies =<br />
<br />
== How do I re-spin mozilla-central nightlies? ==<br />
To rebuild the same nightly, buildbot's Rebuild button works fine.<br />
<br />
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.<br />
<br />
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.<br />
<br />
= Mobile =<br />
== Android Tegras ==<br />
<br />
[[ReleaseEngineering:How To:Android Tegras | Android Tegra BuildDuty Notes]]<br />
<br />
== Android Updates aren't working! ==<br />
<br />
* Did the version number just change? If so, you may be hitting {{bug|629528}}. Kick off another Android nightly.<br />
* Check aus3-staging for size 0 complete.txt snippets:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5<br />
** If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0 (see the sketch after this list for a quick way to find them).<br />
* Check aus3-staging to see if the checksum is correct:<br />
** https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2<br />
** If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.<br />
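A quick way to spot the size-0 snippets mentioned in the first bullet above; the snippet root below is an assumption - use whatever path the bugs above point at:<br />
<pre><br />
# on aus3-staging; the snippet root here is an assumption<br />
find /opt/aus2/snippets -name complete.txt -size 0 -ls<br />
</pre><br />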
<br />
== Update mobile talos webhosts ==<br />
We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-0{1,2,3}).<br />
Here is how you update them:<br />
<pre><br />
ssh root@bm-remote-talos-webhost-01<br />
cd /var/www/html/talos-repo<br />
# NOTICE that we have uncommitted files<br />
hg st<br />
# ? talos/page_load_test/tp4<br />
# Take note of the current revision to revert to (just in case)<br />
hg id<br />
hg pull -u<br />
# 488bc187a3ef tip<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.<br />
</pre><br />
<br />
Keep track of which revision is being run.<br />
<br />
== Deploy new tegra-host-utils.zip ==<br />
There are three hosts behind a load balancer.<br />
* See {{bug|742597}} for previous instance of this case.<br />
<pre><br />
# ssh root@bm-remote-talos-webhost-01 - probably connected to MTV VPN<br />
[root@bm-remote-talos-webhost-01 tegra]# wget -Otegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip<br />
[root@bm-remote-talos-webhost-01 tegra]# rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/.<br />
[root@bm-remote-talos-webhost-01 tegra]# rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/.<br />
</pre><br />
<br />
= Slave Maintenance =<br />
In general, slave maintenance involves:<br />
* keeping as many slaves up as possible, including<br />
** proactively checking for hung/broken slaves (see links below)<br />
** moving known-down slaves toward an operational state<br />
* handling nagios alerts for slaves<br />
* interacting with IT regarding slave maintenance<br />
== File a bug ==<br />
* Use [https://bugzilla.mozilla.org/enter_bug.cgi?alias=&assigned_to=nobody%40mozilla.org&blocked=&bug_file_loc=http%3A%2F%2F&bug_severity=normal&bug_status=NEW&cf_crash_signature=&comment=&component=Release%20Engineering%3A%20Machine%20Management&contenttypeentry=&contenttypemethod=autodetect&contenttypeselection=text%2Fplain&data=&defined_groups=1&dependson=&description=&flag_type-4=X&flag_type-481=X&flag_type-607=X&flag_type-674=X&flag_type-720=X&flag_type-721=X&flag_type-737=X&flag_type-775=X&flag_type-780=X&form_name=enter_bug&keywords=&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=All&priority=P3&product=mozilla.org&qa_contact=armenzg%40mozilla.com&rep_platform=All&requestee_type-4=&requestee_type-607=&requestee_type-753=&short_desc=&status_whiteboard=%5Bbuildduty%5D%5Bbuildslaves%5D%5Bcapacity%5D&target_milestone=---&version=other this template] so it fills in a few needed tags and the priority<br />
* Make the subject and alias of the bug the slave's hostname<br />
* Add any dependent bugs for IT actions or the slave's issue<br />
* Submit<br />
<br />
== Slave Tracking ==<br />
* Slave tracking is done via the [http://slavealloc.build.mozilla.org/ui/#slaves Slave Allocator]. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.<br />
<br />
== Slavealloc ==<br />
=== Adding a slave ===<br />
Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.<br />
<br />
You'll want a command line something like<br />
<pre><br />
/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv<br />
</pre><br />
<br />
where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
</pre><br />
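For illustration only, such a CSV might look like the following; every value in the data row is made up, so check existing rows in the slaves table for the values your silo actually uses:<br />
<pre><br />
name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool<br />
linux-ix-slave99,/builds/slave,centos5,32,build,mtv1,core,fast,prod,build-pool<br />
</pre><br />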
<br />
Adding masters is similar - see dbimport's help for more information.<br />
=== Removing slaves ===<br />
Connect to slavealloc@slavealloc and look at the history for a command looking like this:<br />
<pre><br />
mysql -h $host_ip -p -u buildslaves buildslaves<br />
# type the password<br />
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';<br />
</pre><br />
<br />
== How Tos ==<br />
see [[ReleaseEngineering/How_To]] for a list of public How To documents<br /><br />
see [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo RelEngWiki/Category:HowTo] for list of private How To documents<br />
<br />
= Nagios =<br />
== What's the difference between a downtime and an ack? ==<br />
Both will make nagios stop alerting, but there's an important difference: acks are forever. '''Never''' ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.<br />
<br />
== How do I interact with the nagios IRC bot? ==<br />
nagios: status (gives current server stats)<br />
nagios: status $regexp (gives status for a particular host)<br />
nagios: status host:svc (gives status for a particular service)<br />
nagios: ignore (shows current ignores)<br />
nagios: ignore $regexp (ignores alerts matching $regexp)<br />
nagios: unignore $regexp (unignores an existing ignore)<br />
nagios: ack $num $comment (adds an acknowledgement comment; $num comes from [brackets] in the alert)<br />
(note that the numbers only count up to 100, so ack things quickly or use the web interface)<br />
nagios: unack $num (reverse an acknowledgement)<br />
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)<br />
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988<br />
<br />
== How do I scan all problems Nagios has detected? ==<br />
* All unacknowledged problems:<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10<br />
* All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346<br />
* Group hosts check<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01<br />
** https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary<br />
<br />
== How do I deal with Nagios problems? ==<br />
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.<br />
<br />
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ''ever'' disable notifications.<br />
<br />
You can '''acknowledge''' a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.<br />
<br />
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.<br />
<br />
You can also mark a service or host for '''downtime'''. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.<br />
<br />
At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.<br />
<br />
== Known nagios alerts ==<br />
<pre><br />
[28] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
armenzg_buildduty<br />
arr: should I be worrying about this message? [26] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds<br />
nthomas<br />
depends if ssh is down<br />
nagios-sjc1<br />
[29] talos-r3-fed-018.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
joduinn-mtg is now known as joduinn-brb<br />
nthomas<br />
seems to work ok still, so people can push<br />
16:53 nthomas<br />
I get the normal |No interactive shells allowed here!| and it kicks me out as expected<br />
</pre><br />
This is normally due to releases. We might have to bump the threshold.<br />
<pre><br />
[30] signing1.build.scl1:Ganglia IO is WARNING: CHECKGANGLIA WARNING: cpu_wio is 30.60<br />
</pre><br />
<br />
= Downtimes =<br />
The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the [[ReleaseEngineering:Buildduty:Downtimes|Downtimes]] page.<br />
<br />
= Talos =<br />
'''Note''' because a change to the Talos bundle always causes changes in the baseline times, the following should be done for *any* change...<br />
<br />
# close all trees that are impacted by the change<br />
# ensure all pending builds are done and GREEN<br />
# do the update step below<br />
# send a Talos changeset to all trees to generate new baselines<br />
<br />
== How to update the talos/pageloader zips ==<br />
NOTE: Deploying talos.zip is not scary anymore: we no longer replace the file in place, and the a-team has to land a change in the tree before the new zip is used.<br />
<br />
You may need to get IT to turn on access to build.mozilla.org.<br />
<pre><br />
#use your short ldap name (jford not jford@mozilla.com)<br />
ssh jford@build.mozilla.org<br />
cd /var/www/html/build/talos/zips/<br />
# NOTE: bug# and talos cset helps tracking back<br />
wget -Otalos.bug#.cset.zip <whatever>talos.zip<br />
<br />
cd /var/www/html/build/talos/xpis<br />
# NOTE: We override it unlike with talos.zip since it has not been ported to the talos.json system<br />
wget <whatever>/pageloader.xpi<br />
</pre><br />
<br />
For talos.zip changes: once deployed, notify the a-team and let them know that they can land at their own convenience.<br />
<br />
=== Updating talos for Tegras ===<br />
<br />
To update talos on Android,<br />
<br />
 # connect to the production foopies (05-20 and 22-24)<br />
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}<br />
cd /builds/talos-data/talos<br />
hg pull -u<br />
<br />
This will update talos on each foopy to the tip of default.<br />
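To confirm that every foopy ended up on the same revision, a quick check (still in the same csshX session and directory) is:<br />
<br />
 # still in /builds/talos-data/talos on each foopy<br />
 hg identify<br />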
<br />
=== Updating talos for N900s ===<br />
<br />
ssh cltbld@production-mobile-master<br />
cd checkouts<br />
./update.sh<br />
<br />
This will update the fennecmark, maemkit, talos, and pageloader tarballs on pmm to the latest in their repos.<br />
<br />
= TBPL =<br />
== How to deploy changes ==<br />
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.<br />
<br />
== How to hide/unhide builders ==<br />
* In the 'Tree Info' menu select 'Open tree admin panel'<br />
* Filter/select the builders you want to change<br />
* Save changes<br />
* Enter the sheriff password and a description (with bug number if available) of your changes<br />
<br />
= Useful Links =<br />
* [http://cruncher.build.mozilla.org/buildapi/index.html Build Dashboard Main Page]<br />
** You can get JSON dumps for people to analyze by adding <code>&format=json</code><br />
** You can see all build and test jobs for a certain branch and revision by appending branch/revision to this [http://cruncher.build.mozilla.org/buildapi/revision/ link] (e.g. [http://cruncher.build.mozilla.org/buildapi/revision/places/c4f8232c7aef revision/places/c4f8232c7aef])<br />
* http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.<br />
* http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)<br />
<br />
= L10n Nightly Dashboard =<br />
* [http://l10n.mozilla.org/~axel/nightlies L10n Nightly Dashboard]<br />
<br />
= Slave Handling =<br />
You'll need to be familiar with the location of slaves. You can find this with 'host' if you don't know off the top of your head<br />
host linux-ix-slave07<br />
linux-ix-slave07.build.mozilla.org is an alias for linux-ix-slave07.build.'''mtv1'''.mozilla.com.<br />
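If you need to look up several slaves at once, a simple loop works too (the hostnames here are just examples):<br />
 for h in linux-ix-slave07 talos-r3-fed-018; do host $h; done<br />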
<br />
== Restarting Wedged Slaves ==<br />
See [https://wiki.mozilla.org/ReleaseEngineering/How_To/Get_a_Missing_Slave_Back_Online How To/Get a Missing Slave Back Online].<br />
<br />
Reboot an IX slave:<br />
[[ReleaseEngineering/How_To/Connect_To_IPMI|Connect To IPMI]]<br />
<br />
== Requesting Reboots ==<br />
Some slaves run on unmanaged hardware, meaning that the hardware can get into a state where someone must be onsite to unwedge it. Note that iX systems and VMs are '''not''' unmanaged, and should not be on a reboots bug. When an unmanaged host becomes unresponsive, it gets added to a reboots bug, based on its datacenter:<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-scl1 (by far the most common, since about 10 talos machines die per week)<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-sjc1<br />
* https://bugzilla.mozilla.org/show_bug.cgi?id=reboots-mtv1<br />
'''NOTE:''' these bugs are formulaic. Don't get creative! Just add the hostname of the slave in a comment, or if you are adding multiple slaves at once, list each on its own line. If there's something the onsite person needs to know, include it after the hostname, on the same line. '''Do not''' try to "summarize" all of the slaves on the bug in a single comment.<br />
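For example, a typical comment on a reboots bug looks something like this (the hostnames below are only illustrative):<br />
 talos-r3-fed-025.build.scl1<br />
 talos-r3-fed-031.build.scl1  (no response on the mgmt interface either)<br />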
<br />
Simultaneously, 'ack' the alert in #build:<br />
10:27 < nagios-sjc1> [25] talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%<br />
10:51 < dustin> nagios-sjc1: ack 25 reboots-scl1<br />
10:51 < nagios-sjc1> talos-r3-fed-025.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%;dustin;reboots-scl1<br />
<br />
== When Requested Reboots are Done ==<br />
=== Checking Slaves ===<br />
Once a reboots bug is closed by an onsite person, read the update to see which hosts were rebooted, and which (if any) require further work. Such further work should be deferred to a new bug, which you should open if relops did not (often time is tight at the datacenter). Update the slave tracking spreadsheet accordingly:<br />
* for slaves that were rebooted normally: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "rebooted"; and set "Blocked On" to "check" (which will turn the cell yellow). Check BuildAPI a few hours later to see if these slaves are building properly, and delete the rows from the spreadsheet if so.<br />
* for slaves that were reimaged during the reboot process: replace the "Bug #" column with the actual reboots bug number; set "Issue" to "reimaged"; and set "Blocked On" to "set up". That set-up is your responsibility, too -- see [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave].<br />
* for slaves that require further work from relops, change the "Bug #" column to reflect the bug tracking that work, and set the "Issue" and "Blocked On" columns appropriately<br />
If any slaves were missed in the reboot process, add them to a new reboots bug.<br />
<br />
=== New Bug ===<br />
Once a reboots bug is closed, you will need to open a new one for any subsequent reboots. You don't have to wait until you need a reboot to do so. Here's how:<br />
# remove the 'reboots-xxxx' alias from the previous reboots bug, and copy the bug's URL to your clipboard<br />
# create a bug in "Server Operations: RelEng", with subject "reboot requests (xxxx)". You can leave the description blank if you don't have any slaves requiring reboot yet. Submit.<br />
# edit the bug's colo-trip field to indicate the correct datacenter, and paste the previous reboot request's URL into the "See Also" field.<br />
<br />
== DNR ==<br />
Slaves that are dead and not worth repairing are marked as "DNR" in the slave tracking spreadsheet. The types of slaves that are acceptable for DNR are listed in the "DNR'd Silos" sheet of the [http://is.gd/jsHeh slave tracking spreadsheet]. Such slaves should be acked in nagios, but are not tracked in any bug.<br />
<br />
== Loans ==<br />
We need to track a slave from the time it is loaned out until it is back in its proper place (be that staging, preprod, or production). Currently we use bugs to track this flow.<br />
<br />
# Bug from dev requesting loaner (build or test slave, platform, bug this is being used to help with)<br />
# Loan it: [https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Send_a_slave_out_for_loan How To/Send a slave out for loan]<br />
# File a bug in the RelEng component (connected to the bug in point #1) to track re-imaging the slave and returning it to its pool when it comes back; ask the dev to comment in that bug when they are done with the loaner<br />
# File a bug on ServerOps asking for re-image (blocking bug in #3) [https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_That_a_Machine_Be_Reimaged How To/Request That a Machine Be Reimaged]<br />
# When it's re-imaged, put it back in the pool [https://wiki.mozilla.org/ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave How To/Set Up a Freshly Imaged Slave]<br />
<br />
== Maintenance ==<br />
Periodically scan the slave spreadsheet. Check slaves marked "check". Set up slaves marked "set up". Ask developers who have borrowed slaves to see if they're done with them. Ask relops about progress on broken slaves.<br />
<br />
== Common Failure Modes ==<br />
Some slaves, especially linux VMs, will fail to clobber and repeatedly restart. In nagios, this causes all of the checks on that host to bounce up and down, because the reboots occur on a similar schedule to nagios's checks. Sometimes you can catch this via SSH, but the reboots are *very* quick and it may be easier to use vSphere Client to boot the host into single-user mode and clean out the build dirs.<br />
<br />
All of the linux slaves will reboot after 10 attempts to run puppet. A puppet failure, then, will manifest as buildbot failing to start on that host. To stop the reboot cycle, log in to the slave and kill S98puppet (centos) or run-puppet-and-slave.sh (fedora).<br />
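A minimal sketch of breaking the reboot cycle, assuming the script names above show up in the process list on the slave:<br />
 # CentOS slaves<br />
 pkill -f S98puppet<br />
 # Fedora slaves<br />
 pkill -f run-puppet-and-slave.sh<br />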
<br />
= Standard Bugs =<br />
* The current downtime bug should always be aliased as "releng-downtime": http://is.gd/cQO7I<br />
* Reboots bugs have the Bugzilla aliases shown above.<br />
* For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:<br />
** :aki, :armenzg, :bear, :bhearsum, :catlee, :coop, :hwine, :jhford, :joduinn, :joey, :lsblakk, :nthomas, :rail<br />
<br />
= Ganglia =<br />
* if you see that a host is reporting to ganglia incorrectly, it might just take this to fix it (e.g. {{bug|674233}}):<br />
switch to root, service gmond restart<br />
<br />
= Queue Directories =<br />
* [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories]<br />
<br />
If you see this in #build:<br />
<br />
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items<br />
<br />
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the [https://wiki.mozilla.org/ReleaseEngineering/Queue_directories Queue directories] wiki page for details.<br />
= Cruncher = <br />
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):<br />
<br />
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):<br />
As root:<br />
du -s -h /var/spool/*<br />
# confirm that mqueue or clientmqueue is the oversized culprit<br />
# stop sendmail, clean out the queues, restart sendmail<br />
/etc/init.d/sendmail stop<br />
rm -rf /var/spool/clientmqueue/*<br />
rm -rf /var/spool/mqueue/*<br />
/etc/init.d/sendmail start</div>Bearhttps://wiki.mozilla.org/index.php?title=User:Bear:My_Environment&diff=439270User:Bear:My Environment2012-06-08T03:21:57Z<p>Bear: </p>
<hr />
<div>== Useful Tools ==<br />
<br />
* http://code.google.com/p/csshx/<br />
* http://www.iterm2.com/<br />
* http://synergy-foss.org/<br />
* http://www.jinx.de/JollysFastVNC.html<br />
<br />
== Mac OS X 10.7 (aka Lion) ==<br />
=== core dev tools ===<br />
<br />
* Install Firefox<br />
* Install from Apple the XCode package and then ensure that you also install from it the Command Line Tools<br />
* Install the HomeBrew environment<br />
** https://github.com/mxcl/homebrew/wiki/installation<br />
<br />
/usr/bin/ruby -e "$(/usr/bin/curl -fsSL https://raw.github.com/mxcl/homebrew/master/Library/Contributions/install_homebrew.rb)"<br />
<br />
* Edit your bash profile to make sure the HomeBrew environment is preferred:<br />
<br />
export PATH=/Users/bear/bin:/usr/local/bin:/usr/local/share/python:/usr/bin:/bin:/usr/sbin:/sbin:/opt/bin:/opt/sbin:/usr/X11/bin<br />
<br />
* Install tools<br />
<br />
brew install python<br />
brew install git<br />
brew install mercurial<br />
brew install zeromq<br />
brew install pyzmq<br />
easy_install pip<br />
<br />
* Install gnupg<br />
<br />
brew install pth<br />
brew install libksba<br />
brew install libgcrypt<br />
brew install libassuan<br />
cd ~/Downloads<br />
wget ftp://ftp.gnupg.org/gcrypt/gnupg/gnupg-2.0.19.tar.bz2<br />
cd ~/installs<br />
 tar xf ~/Downloads/gnupg-2.0.19.tar.bz2<br />
cd gnupg-2.0.19<br />
./configure<br />
make install<br />
<br />
<br />
<br />
== Mac OS X 10.6 (aka Snow Leopard) ==<br />
<br />
=== Bash .profile ===<br />
<br />
export PATH="/opt/bin:/opt/sbin:$PATH"<br />
<br />
<br />
=== Git ===<br />
<br />
curl -O http://kernel.org/pub/software/scm/git/git-1.7.0.tar.gz<br />
tar xzf git-1.7.0.tar.gz<br />
cd git-1.7.0/<br />
./configure --prefix=/opt<br />
make<br />
sudo make install<br />
<br />
=== Mercurial ===<br />
<br />
curl -O http://mercurial.selenic.com/release/mercurial-1.4.tar.gz<br />
tar xzf mercurial-1.4.tar.gz<br />
cd mercurial-1.4/<br />
make all<br />
sudo make install<br />
<br />
=== VirtualEnv ===<br />
<br />
'''need to find notes on this'''<br />
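(A minimal placeholder sketch until the real notes turn up, assuming the system Python's easy_install is used:)<br />
<br />
 sudo easy_install virtualenv<br />
 virtualenv --version<br />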
<br />
=== Buildbot ===<br />
<br />
==== Twisted ====<br />
<br />
cd ~/Downloads<br />
curl -O http://tmrc.mit.edu/mirror/twisted/Twisted/2.4/Twisted-2.4.0.tar.bz2<br />
<br />
cd ~/src<br />
virtualenv buildbot<br />
cd buildbot<br />
source bin/activate<br />
 tar xjf ~/Downloads/Twisted-2.4.0.tar.bz2<br />
cd Twisted-2.4.0<br />
python setup.py install<br />
<br />
==== buildbot ====<br />
<br />
cd ~/src/buildbot<br />
source bin/activate<br />
hg clone http://hg.mozilla.org/build/buildbot<br />
cd buildbot<br />
python setup.py install<br />
<br />
==== creating buildbot master ====<br />
<br />
'''Note''': need to finish documenting the master and slave config changes for local setups<br />
<br />
cd ~/src/buildbot<br />
source bin/activate<br />
buildbot create-master master<br />
buildbot create-slave slave localhost:9010 moz-slave-name<br />
<br />
=== GnuPG v2 ===<br />
<br />
==== libgpg-error ====<br />
<br />
curl -O http://ftp.gnupg.org/gcrypt/libgpg-error/libgpg-error-1.7.tar.bz2<br />
tar xjf libgpg-error-1.7.tar.bz2<br />
cd libgpg-error-1.7/<br />
./configure CC="gcc -arch i386" --prefix=/opt<br />
make<br />
sudo make install<br />
<br />
==== libgcrypt ====<br />
<br />
curl -O http://ftp.gnupg.org/gcrypt/libgcrypt/libgcrypt-1.4.5.tar.bz2<br />
tar xjf libgcrypt-1.4.5.tar.bz2<br />
cd libgcrypt-1.4.5/<br />
./configure CC="gcc -arch i386" --prefix=/opt/ --with-gpg-error-prefix=/opt<br />
make<br />
sudo make install<br />
<br />
==== libksba ====<br />
<br />
curl -O http://ftp.gnupg.org/gcrypt/libksba/libksba-1.0.7.tar.bz2<br />
tar xjf libksba-1.0.7.tar.bz2<br />
cd libksba-1.0.7/<br />
./configure CC="gcc -arch i386" --prefix=/opt/ --with-gpg-error-prefix=/opt<br />
make<br />
sudo make install<br />
<br />
==== pth ====<br />
<br />
curl -O http://ftp.gnu.org/gnu/pth/pth-2.0.7.tar.gz<br />
tar xzf pth-2.0.7.tar.gz<br />
cd pth-2.0.7/<br />
./configure CC="gcc -arch i386" --prefix=/opt/ --with-gpg-error-prefix=/opt<br />
make<br />
sudo make install<br />
<br />
==== libassuan ====<br />
<br />
curl -O http://ftp.gnupg.org/gcrypt/libassuan/libassuan-1.0.5.tar.bz2<br />
tar xjf libassuan-1.0.5.tar.bz2<br />
cd libassuan-1.0.5/<br />
./configure CC="gcc -arch i386" --prefix=/opt/ --with-pth-prefix=/opt<br />
make<br />
sudo make install<br />
<br />
==== gnupg ====<br />
<br />
 curl -O http://ftp.gnupg.org/gcrypt/gnupg/gnupg-2.0.9.tar.bz2<br />
 tar xjf gnupg-2.0.9.tar.bz2<br />
 cd gnupg-2.0.9/<br />
./configure CC="gcc -arch i386" --prefix=/opt/ --with-pth-prefix=/opt --with-ksba-prefix=/opt --with-libassuan-prefix=/opt --with-libgcrypt-prefix=/opt --with-gpg-error-prefix=/opt <br />
make<br />
sudo make install<br />
<br />
=== pkg-config ===<br />
<br />
curl -O http://pkgconfig.freedesktop.org/releases/pkg-config-0.23.tar.gz<br />
tar xzf pkg-config-0.23.tar.gz<br />
cd pkg-config-0.23<br />
./configure --prefix=/opt/<br />
make<br />
sudo make install<br />
<br />
=== gettext ===<br />
<br />
curl -O http://ftp.gnu.org/pub/gnu/gettext/gettext-0.17.tar.gz<br />
tar xzf gettext-0.17.tar.gz<br />
cd gettext-0.17<br />
./configure --prefix=/opt/<br />
make<br />
sudo make install<br />
<br />
=== libiconv ===<br />
<br />
curl -O http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.13.1.tar.gz<br />
tar xzf libiconv-1.13.1.tar.gz<br />
 cd libiconv-1.13.1<br />
./configure --prefix=/opt/<br />
make<br />
sudo make install<br />
<br />
'''Note''': to fully enable gettext, it's best to rebuild it after installing libiconv (thanks [http://letsneverdie.net/blog/?p=75])<br />
<br />
cd ../gettext-0.17<br />
make distclean<br />
./configure --prefix=/opt/<br />
make<br />
sudo make install<br />
<br />
=== glib2 ===<br />
<br />
'''Note''': the LDFLAGS and CPPFLAGS values are so that the /opt version of gettext and libiconv are used<br />
<br />
'''Note''': thanks to [http://wiki.zmanda.com/index.php/Installation/OS_Specific_Notes/Installing_Amanda_on_Mac_OS_X#Complete_set-up_for_OS_X_Snow_Leopard_10.6.2_on_2010-01-08 Amanda Notes for OS X Installs] for the *FLAGS clue on how to get glib2 to compile <br />
<br />
curl -O http://ftp.gnome.org/pub/gnome/sources/glib/2.22/glib-2.22.4.tar.bz2<br />
tar xjf glib-2.22.4.tar.bz2<br />
cd glib-2.22.4<br />
./configure --prefix=/opt LDFLAGS="-L/opt/lib" CPPFLAGS="-I/opt/include"<br />
make<br />
sudo make install<br />
<br />
=== libIDL ===<br />
<br />
curl -O http://ftp.acc.umu.se/pub/gnome/sources/libIDL/0.8/libIDL-0.8.13.tar.gz<br />
tar xzf libIDL-0.8.13.tar.gz<br />
cd libIDL-0.8.13<br />
./configure --prefix=/opt<br />
make<br />
sudo make install<br />
<br />
=== autoconf213 ===<br />
<br />
 curl -O http://ftp.gnu.org/gnu/autoconf/autoconf-2.13.tar.gz<br />
 tar xzf autoconf-2.13.tar.gz<br />
 cd autoconf-2.13<br />
./configure --prefix=/opt<br />
make<br />
sudo make install<br />
 sudo ln -s /opt/bin/autoconf /opt/bin/autoconf213</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=436860ReleaseEngineering/Maintenance2012-06-02T02:13:35Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure: buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page to help debug whether a reconfig caused intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production (try masters only)<br />
| 2012-06-01 1913 PDT<br />
|<br />
* {{bug|709480}} update PDBSTR_PATH for Try builders<br />
|-<br />
| in production (try masters only)<br />
| 2012-06-01 0845 PDT<br />
|<br />
* {{bug|709480}} - switch try win32 jobs to win64 slaves.<br />
|-<br />
| in production<br />
| 2012-05-28 1500 PDT<br />
|<br />
* {{bug|701559}} - add a try pgo strategy<br />
* {{bug|712244}} - increase test MAX_BROKER_REFS to 2048 <br />
* {{bug|712244}} - increase builder limit to 2048 on build masters<br />
* {{bug|759073}} - Please disable Android debug builds on the profiling branch<br />
* {{bug|757829}} - UrlPoller doesn't poll win32_signing_buildN.log for unsigned Thunderbird releases.<br />
* {{bug|723479}} - Turn Devtools, Jaegermonkey and Larch trees back on, add 10.7 tests to Devtools<br />
|-<br />
| in production<br />
| 2012-05-24 1300 PDT<br />
|<br />
* Build duty reconfig<br />
|-<br />
| in production<br />
| 2012-05-23 950 PDT<br />
|<br />
* {{bug|756463}} - please bump the priority of the oak branch<br />
|-<br />
| in production<br />
| 2012-05-22 1700 PDT<br />
|<br />
* {{bug|754291}} - buildbot master Makefile refers to the wrong hg<br />
* {{bug|756463}} - Please reset the priority of the Oak branch<br />
* {{bug|753501}} - followup, make try look for the Android XUL tooltool manifest in the path where it actually is<br />
|-<br />
| in production<br />
| 2012-05-18 1313 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
* {{bug|753132}} - Fix env for nightly pgo builds<br />
* {{bug|573722}} - set IS_NIGHTLY env var for nightly builds<br />
* {{bug|753132}} - Do periodic PGO builds on pgo_platform too<br />
|-<br />
| in production<br />
| 2012-05-18 0815 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
|-<br />
| in production<br />
| 2012-05-17 0750 PDT<br />
|<br />
* {{bug|755434}} - l10n repacks should not execute config.py<br />
* {{bug|753132}} - Do 32-bit PGO builds on win64<br />
* {{bug|751158}} - Create tcheckboard3 to measure checkerboard with low res off<br />
* {{bug|755989}} - setup-master.py doesn't set up staging symlinks properly<br />
* {{bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py<br />
* {{bug|753501}} - Add empty tooltool manifests to some platforms<br />
|-<br />
| in production<br />
| 2012-05-14 1118 PDT<br />
|<br />
* {{bug|753488}} - Android native on Aurora -> multilocale.<br />
|-<br />
| in production<br />
| 2012-05-14 0830 PDT<br />
|<br />
* {{bug|750837}} - 13.0b2 build 2 configs. r=hwine<br />
* {{bug|754517}} - disable larch and enable pine rentable branches, r=bear<br />
* {{bug|747500}} - setup-master.py refers to files which have been removed. r=catlee<br />
* {{bug|753132}} - Use win64 machines for 32-bit pgo builds on build-system branch. r=rail<br />
* {{bug|754373}} - Use firefox-tuxedo.ini for Thunderbird builds. r=standard8<br />
* {{bug|754397}} - Disable signing at build time for Thunderbird. r=nthomas<br />
* {{bug|754430}} - Missing mozilla/ dir in Thunderbird beta build. p=standard8,r=jhopkins <br />
* {{bug|701783}} - remove scratchbox references from buildbotcustom. r=catlee<br />
* {{bug|753132}} - Support 'pgo_platform' key for deciding which machines do PGO builds. r=rail<br />
* {{bug|750744}} - Test and deploy SUT agent 1.08. r=bear <br />
|-<br />
| in production<br />
| 2012-05-11 0930 PDT<br />
|<br />
* {{Bug|754297}} - add sys.stdout.flush() to sut_tools' scripts <br />
|-<br />
| in production<br />
| 2012-05-11 0800 PDT<br />
|<br />
* {{Bug|746260}} - disable the screen resolution changing on android for jsreftest and crashtest, leave it on for reftest.<br />
* {{Bug|753868}} - use aus3-staging for Thunderbird release builds, r=jhopkins<br />
* {{Bug|744601}} - tracking bug for build and release of Thunderbird 13.0b2. r=standard8<br />
* {{Bug|752531}} - migrate dev-stage01 to scl3. r=rail <br />
|-<br />
| in production<br />
| 2012-05-10 1801 PDT<br />
|<br />
* {{bug|753488}} - make FN multi on m-c only, reenable nightly updates.<br />
* {{Bug|753625}} - Move all Thunderbird branches onto Firefox infra<br />
* {{bug|749748}} - kill l10n verify.<br />
* {{Bug|753868}} - Use aus3-staging.mozilla.org for Thunderbird release builds.<br />
* {{Bug|753865}} - Email thunderbird-drivers for Thunderbird release builds.<br />
* {{bug|748157}} - Load thunderbird_release_branches from master_config.json<br />
|-<br />
| in production<br />
| 2012-05-09 0800 PDT<br />
|<br />
* {{bug|752373}} - Stop running Android crashtest-1 until someone's ready to fix it<br />
* {{bug|751070}} - retire sjc1 VMs<br />
* {{bug|750031}} - moz2-darwin10-slave02 problem tracking<br />
* {{bug|746201}} - Remove unresolved machines from buildbot-configs<br />
* {{bug|752430}} - Swap comm-aurora over to Firefox infra<br />
* {{bug|749051}} - TryChooser: could -n be the default?<br />
* {{bug|751878}} - OSError: [Errno 13] Permission denied: '/home/ftp' for pvtbuilds2.dmz.scl3.mozilla.com<br />
|-<br />
| in production<br />
| 2012-05-03 1325 PDT<br />
|<br />
* {{Bug|744067}} - add them back<br />
|-<br />
| backed out<br />
| 2012-05-03 1200 PDT<br />
|<br />
* Backout 4ab5af03cce1 (new scl3 slaves). r=backout<br />
|-<br />
| in production<br />
| 2012-05-03 1000 PDT<br />
|<br />
* {{Bug|751165}} - revert higher priority for m-i. r=philor,ehsan<br />
* {{Bug|744067}} - new scl3 slaves; r=coop<br />
* {{Bug|744067}} - new scl3 slaves (must be in staging); r=aki<br />
* Add ACTIVE_THUNDERBIRD_RELEASE_BRANCHES. r=armenzg<br />
* {{Bug|751895}} - Preproduction release master fails trying to checkconfig. r=jhopkins<br />
* {{Bug|750973}} - copy in-tree m-a linux32 mozconfig into mozilla2 to fix aurora source release. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-03 08:00 PDT<br />
|<br />
* {{bug|751506}} - No 10.7 32-bit debug builders on Thunderbird trees. r=coop<br />
* {{bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. Add in the 'TB ' prefix to match the other Thunderbird builders. r=jhopkins<br />
* {{bug|744864}} - Update list of l10n modules that trigger changes. r=Pike<br />
* {{bug|751560}} - Temporarily disable uploading symbols on Windows 32 bit try-comm-central builds. r=jhopkins <br />
* {{bug|751514}} - Thunderbird bloat test builders should warn and halt on failure, not error on failure. r=jhopkins<br />
|-<br />
| in production<br />
| 2012-05-02 12:00 PDT<br />
|<br />
* {{Bug|750635}} - Swap try-comm-central over to pushing to the thunderbird product directory, and get it running unit tests.<br />
* Follow-up to {{bug|748628}}, fix some more issues with the Thunderbird lion builders - the names and the ccache settings. <br />
* {{Bug|739994}} - Remove references to 10.5 platform and associated slaves in configs - r=jhford <br />
|-<br />
| in production<br />
| 2012-05-02 08:00 PDT<br />
|<br />
* {{Bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. r=coop<br />
* {{Bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py. r=catlee<br />
* {{Bug|751165}} - Bump priority of mozilla-inbound to help open the tree earlier. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-01 19:00 PDT<br />
|<br />
* {{Bug|554343}} - Release builders should always clobber <br />
* {{Bug|750514}} - Disable codesighs on Thunderbird try<br />
* {{Bug|750013}} - Revert Birch customizations from {{bug|746159}}<br />
|-<br />
| in production<br />
| 2012-04-30 13:30 PDT<br />
|<br />
* {{Bug|750305}} - Use comm-central as reference branch for try-comm-central builds<br />
* {{Bug|749596}} - Enable aurora nightly updates (April 27th, 2012 edition)<br />
|-<br />
| in production<br />
| 2012-04-30 11:30 PDT<br />
|<br />
* {{Bug|749867}} - Don't try to build SpiderMonkey --enable-shark builds on 10.7 where there is no Shark, r=coop<br />
* buildbot-configs patch to reflect new all-locales locations (Bug 711534 - Configure Thunderbird release builders) r=standard8<br />
* {{Bug|669428}} - Run Jetpack tests on mozilla-inbound, r=armenzg<br />
* {{Bug|748633}} - Thunderbird try logs failing to upload. r=rail <br />
|-<br />
| in production<br />
| 2012-04-27 11:30 PDT<br />
|<br />
* {{Bug|749524}} - Upload comm-aurora snippets to comm-aurora-test channel<br />
* {{Bug|711534}} - Configure Thunderbird release builders<br />
* {{Bug|749288}} - linux comm-central builds use wrong python when calling balrog client<br />
* {{Bug|749494}} - Re-enable graph server for staging/preproduction<br />
* {{Bug|729392}} - Install toolchain needed for SPDY testing onto test machines<br />
* {{Bug|745300}} - Do Mac spidermonkey builds on 10.7<br />
<br />
|-<br />
| in production<br />
| 2012-04-26 11:00 PDT<br />
|<br />
* {{Bug|749076}} - tooltool should be invoked with -o (--overwrite) option<br />
* {{Bug|739802}} - disable b2g on aurora, beta, release<br />
|-<br />
| '''backed-out'''<br />
| 2012-04-26 09:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-26 07:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-25 22:26 PDT<br />
|<br />
* {{Bug|742131}} - fix upload host for windows try symbols<br />
|-<br />
| in production<br />
| 2012-04-25 12:00 PDT<br />
|<br />
* {{Bug|743977}} - turn off balrog client for staging and preproduction builds<br />
* {{Bug|723340}} - move dm-pvtbuild01 to a new datacenter<br />
* {{Bug|747821}} - Need to run tpr_responsiveness on Try until it's not run anywhere anymore<br />
* {{Bug|729667}} - re-create the services on dm-wwwbuild01 in scl3<br />
|-<br />
| in production<br />
| 2012-04-24 6:30 PDT<br />
|<br />
* {{Bug|747966}} - comm-central builds not firing automatically<br />
* {{bug|747862}} - Disable shark nightly builds on Thunderbird builders<br />
|-<br />
| in production<br />
| 2012-04-23 15:30 PDT<br />
|<br />
* {{Bug|746708}} - Updates builder fails running backupsnip and pushsnip<br />
* {{bug|747756}} - Bump "make hg-bundle" timeout<br />
* {{bug|747892}} - mozilla-release's releasetestUptake value should be set to 1<br />
* {{bug|747460}} - consolidate windows peptest config files<br />
|-<br />
| in production<br />
| 2012-04-18 0645 PDT<br />
|<br />
* {{bug|745545}} - Handle Thunderbird revisions in NightlyRepackFactory.<br />
* {{bug|745547}} - Move talosCmd into SUITES loop (generateTalosBranchObjects).<br />
* {{bug|745299}} - Add hg-internal as a mirror.<br />
* {{bug|745500}} - Turn on robocop testCheck2 on tinderbox builds.<br />
* {{bug|735390}} - 12.0b6 configs + fix test-masters.sh + move l10n-changesets_mobile-aurora.json into mozilla/.<br />
* {{bug|746537}} - Increase priority for Birch, drop Maple back down<br />
|-<br />
| in production<br />
| 2012-04-17 1211 PDT<br />
|<br />
* {{bug|746159}} - make birch be like inbound<br />
* {{bug|739994}} - turn off spidermonkey builds on 10.5<br />
* {{bug|744098}} - switch xulrunner osx builds to upload tarballs<br />
* {{bug|732976}} - singlesourcefactory should generate checksums<br />
|-<br />
| in production<br />
| 2012-04-17 0630 PDT<br />
|<br />
* {{bug|739778}} - preproduction in scl3<br />
* {{bug|744119}} - decommission osx builder<br />
* {{bug|744958}} - updateSUT.py fixes<br />
* {{bug|741751}} - partner repack signing fixes<br />
* {{bug|745538}} - TB mozmill test steps<br />
* {{bug|745469}} - Turn off tinderbox mail for spidermonkey builds<br />
|-<br />
| in production<br />
| 2012-04-12 1830-45 PDT<br />
|<br />
* {{bug|722759}} - switch non-try symbols to symbols1.dmz.phx1.mozilla.com<br />
* {{bug|741657}} - Switch to aus3-staging<br />
* {{bug|730325}} - Pass product name to reallyShort()<br />
* {{bug|744495}} - xulrunner pulse messages<br />
|-<br />
| in production<br />
| 2012-04-10 various times<br />
|<br />
* {{bug|720027}} - enable lion builders<br />
|-<br />
| in production<br />
| 2012-04-10 1100 PST<br />
| <br />
* {{bug|744049}} - tcheckerboard always reports 1.0 (tegra talos web server updated to talos tip)<br />
|-<br />
| in production<br />
| 2012-04-09 0700 PDT<br />
|<br />
* {{bug|607392}} - split tagging into en-US and other<br />
* {{bug|721885}} - shut off unused branch<br />
* {{bug|400296}} - Have release automation support signing OSX builds (up to 10.7 support)<br />
|-<br />
| in production<br />
| 2012-04-04 11:00 PDT<br />
|<br />
* {{bug|690311}} - deploy newer version of cleanup.py to the foopies<br />
|-<br />
| in production<br />
| 2012-03-30 6:15 PDT<br />
|<br />
* {{bug|738588}} - add ts_paint to the android tests.<br />
* {{bug|737458}} - replace tpr_responsiveness for tp5row.<br />
* {{bug|737458}} - add tpr_responsiveness temporarily for mozilla-central and larch.<br />
* {{bug|740599}} - update staging release config files;<br />
|-<br />
| in production<br />
| 2012-03-29 6:55 PDT<br />
|<br />
* {{Bug|715193}} - If a branch does not use talos_from_source_code we should fallback to talos.mobile.old.zip (fixes esr10).<br />
|-<br />
| in production<br />
| 2012-03-28 16:35 PDT<br />
|<br />
* {{Bug|740196}} - ts_paint on Android doesn't actually work<br />
|-<br />
| in production<br />
| 2012-03-28 11:55 PDT<br />
|<br />
* {{Bug|737632}} - Remove jaegermonkey, graphics and pine to reduce builders<br />
* {{Bug|723667}} - fix Android trobocheck and ts_paint tests.<br />
* {{Bug|739486}} - test-masters.sh should run ./setup_master.py -t<br />
* add option to setup_master.py to print error logs when hit<br />
* {{Bug|723667}} - enable trobopan and tcheckerboard by default (not for m-a/m-b/m-r/1.9.2)<br />
* {{Bug|627182}} - Automate signing and publishing of XULRunner builds. r=bhearsum <br />
|-<br />
| in production<br />
| 2012-03-27 12:30 PDT<br />
|<br />
* {{Bug|723667}} - Add trobopan and trobocheck to m-c/m-i. r=jmaher<br />
|-<br />
| in production<br />
| 2012-03-23 11:30 PDT<br />
|<br />
*{{bug|627182}}<br />
*{{bug|738685}}<br />
*{{bug|734223}}<br />
*{{bug|738286}}<br />
*{{bug|719491}}<br />
*{{bug|737656}}<br />
*{{bug|715193}}<br />
*{{bug|702595}}<br />
*{{bug|735383}}<br />
|-<br />
| in production<br />
| 2012-03-27 01:35 PDT<br />
|<br />
* {{Bug|739505}} - [http://hg.mozilla.org/build/buildbot-configs/rev/3c424821358a Fix talos] on beta<br />
|-<br />
| in production<br />
| 2012-03-23 7:00 PDT<br />
|<br />
* {{Bug|737864}} - Tweak release category for Thunderbird.<br />
* {{Bug|737458}} - add tp5row side by side and cleanup config.py.<br />
* {{Bug|737581}} - enable peptest on m-c and m-i.<br />
* {{Bug|713846}} - Treat 'fennec' builds as having product 'mobile' for the purposes of uploading logs.<br />
|-<br />
| backout<br />
| 2012-03-21 11:45 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
|-<br />
| in production<br />
| 2012-03-21 9:30 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
* {{Bug|697150}}. (Bv1) Remove 'ac_add_options --disable-installer' for XulRunner current branches.<br />
* {{Bug|733394}}. Add leak test logic to mozilla-beta.<br />
* {{Bug|736587}}. Enable Android for pine.<br />
|-<br />
| in production<br />
| 2012-03-20 8:30 PDT<br />
|<br />
* {{bug|723667}} - enable talos robocop for pine and only for native tests<br />
* {{bug|737077}} - re-enable aurora updates<br />
* {{bug|713846}} - unified log handling<br />
* {{bug|734320}} - fix jetpack log parsing<br />
* {{bug|737049}} - run reftest-no-accel correctly<br />
* {{bug|723386}} - fix reserved slaves handling<br />
|-<br />
| in production<br />
| 2012-03-19 10:30 PDT<br />
|<br />
* {{bug|736284}} - re-enable aurora updates<br />
|-<br />
| in production<br />
| 2012-03-16 8:45 PDT<br />
|<br />
* {{bug|734996}} - fennec beta release update channel -> beta.<br />
* {{Bug|734221}} - deploy updateSUT.py and upgrade the boards to SUT Agent version 1.07.<br />
* {{Bug|734996}} - source: get a nonce earlier<br />
|-<br />
| in production<br />
| 2012-03-13 17:00 PDT<br />
|<br />
* {{bug|735201}} - Remove leading ../ from symbols path for tegras<br />
* {{bug|735421}} - Disable Aurora updates until the Aurora 13 has stabilized<br />
|-<br />
| in production<br />
| 2012-03-12 16:00 PDT<br />
|<br />
* {{bug|734417}} - enable mobile builds on the profiling branch<br />
* {{bug|731617}} - No nightly builds on maple branch since 27 Feb<br />
* {{bug|732285}} - Set MINIDUMP_STACKWALK for Android<br />
* {{bug|733668}} - Include "ERROR: We tried to download the talos.json file but something failed" and "ERROR 500: Internal Server Error" for Talos hgweb operations to RETRY<br />
* {{bug|630518}} - l10n verify, update verify, and final verification builders need to set "branch" when reporting to clobberer<br />
|-<br />
| in production<br />
| 2012-03-08 09:00 PT<br />
|<br />
* {{Bug|731814}} - Add checks that we're not exceeding max # of builders per slave.<br />
* {{Bug|731617}} - Remove win64 for now in maple.<br />
* {{Bug|731339}} - Remove slaves that are not production<br />
* {{Bug|732730}} - Remove non-functional and unwanted pgo_platforms overrides<br />
* {{bug|732110}} - remove buildbot-configs/mozilla2/mobile<br />
* {{Bug|728271}} - Post to graphs.m.o instead of graphs-old.m.o<br />
* {{Bug|729144}} - Post to graphs.allizom.org.<br />
* {{Bug|723667}} - Add robocop disabled.<br />
* {{bug|730050}} - TryBuildFactory looks in the wrong place for malloc.log<br />
* {{Bug|712538}} - leaktest parity on try<br />
* {{Bug|723667}} - Use talos.zip for tegras and prep work for talos robocop<br />
|-<br />
| in production<br />
| 2012-03-06 06:30 PT<br />
|<br />
* {{bug|732500}} - Enable nightly updates on maple<br />
* {{bug|732699}} - ESR release automation should push to mirrors automatically<br />
* {{bug|730918}} - Android on esr10 is busted, no doubt by branding since that always seems to be the problem<br />
* {{bug|561754}} - Don't download symbols for test runs, pass symbol zip URL as symbols path<br />
* {{bug|732516}} - l10n verification shouldn't rsync zip files<br />
* {{bug|732468}} - Add the ridiculous "abort: error:" to the list of hg errors that trigger RETRY<br />
|-<br />
| in production<br />
| 2012-03-01 7:30 PT<br />
|<br />
* {{Bug|721360}} - Do what changeset 9a0c428bdb69 really wanted to do.<br />
* {{Bug|561754}} - Disable symbol download on demand for mozilla-1.9.2 branch.<br />
* {{Bug|660480}} - mark as RETRY for common tegra errors<br />
* {{Bug|729918}} - start_uptake_monitoring builder uses wrong script_repo_revision property.<br />
* {{Bug|561754}} - Download symbols on demand by default for desktop unittests.<br />
|-<br />
| in production<br />
| 2012-02-27 7:45 PT<br />
|<br />
* {{bug|729077}} - recycle talos-r4-lion-083 and talos-r3-snow-081 as mac-signing[12]<br />
* Fix up staging and preproduction test slave lists.<br />
* {{Bug|729426}} - Do periodic PGO on services-central<br />
* {{bug|727580}} - linux-android for esr10, without merging 11.0 to m-r.<br />
|-<br />
| in production<br />
| 2012-02-21 9:30 PT<br />
|<br />
* {{bug|719511}} - add optional reboot command to ScriptFactory<br />
* {{Bug|725292}} - some repacks failed in 11.0b2 because of missing tokens<br />
* {{Bug|728104}} - AggregatingScheduler resets its state on reconfig<br />
* {{Bug|722608}} - Remove android signature verification<br />
* {{Bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|719511}} - Reenable peptest + add reboot_command<br />
* {{bug|712678}} - android-xul different update channel from android<br />
<br />
|-<br />
| in production<br />
| 20120217 1148 PST<br />
|<br />
* {{bug|721822}} - remove talos_from_code.py from the tools repo<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remot<br />
|}<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 488bc187a3ef<br />
| {{bug|753822}}<br />
| 20120510 1045 AM PDT<br />
| armenzg<br />
|}<br />
<br />
Instructions for updating this can be found [[ReleaseEngineering:Buildduty#Update_mobile_talos_webhosts|here]].<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=436626ReleaseEngineering/Puppet/Usage2012-06-01T16:03:24Z<p>Bear: /* Working with Puppet Servers */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
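<br />
For illustration, here is a minimal sketch of two resource types in a manifest; the attribute values below are invented for the example rather than taken from our manifests:<br />
<pre><br />
# hypothetical example: a "user" resource and a "package" resource<br />
user { "cltbld":<br />
    ensure => present,<br />
    shell  => "/bin/bash",<br />
    home   => "/home/cltbld",<br />
}<br />
<br />
package { "mercurial":<br />
    ensure => installed,<br />
}<br />
</pre><br />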
<br />
== Masters ==<br />
<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Puppet Master !! Type<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org || prehistoric<br />
|-<br />
| build slave || mtv1 || mv-production-puppet.build.mtv1.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl1 || releng-puppet1.build.scl1.mozilla.com || puppetagain<br />
|-<br />
| build slave || scl3 || releng-puppet1.srv.releng.scl3.mozilla.com || puppetagain<br />
|-<br />
| staging || scl3 || staging-puppet.build.mozilla.org || prehistoric<br />
|-<br />
| staging || mtv1 || relabs-puppet.build.mtv1.mozilla.com || puppetagain<br />
|}<br />
<br />
<small>To reference the above table from other wiki pages, use: <tt><nowiki>[[ReleaseEngineering/Puppet/Usage#PuppetServers]]</nowiki></tt></small><br />
<br />
=== Working with Puppet Servers ===<br />
<br />
There are several common conventions for all "old school" puppet servers:<br />
* log in as <tt>root</tt><br />
* configuration files are under <tt>/etc/puppet/manifests</tt><br />
* configuration repository is [http://hg.mozilla.org/build/puppet-manifests puppet-manifests]<br />
* set up personal environments under <tt>/etc/puppet/$USER</tt><br />
** for staging, use <tt>/etc/puppet/manifests-$USER</tt><br />
<br />
See [[ReleaseEngineering/PuppetAgain]] for details on how to work with the new hotness<br />
<br />
=== The Slave-Master Link ===<br />
You can find out which puppet master a slave connects to by checking the contents of the appropriate file below:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If the slaves have to be moved between masters be sure to remove the certs after you modify this file and before their next reboot. You may also need to do 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, as well as base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
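<br />
As a rough sketch of how ${platform_fileroot} and ${platform_httproot} typically get used (the file and package names below are made up, not real entries in our manifests):<br />
<pre><br />
# hypothetical example of the two variables in use<br />
file { "/usr/local/bin/sometool":<br />
    source => "${platform_fileroot}/usr/local/bin/sometool",<br />
    mode   => 755,<br />
}<br />
<br />
package { "sometool.dmg":<br />
    ensure   => installed,<br />
    provider => pkgdmg,<br />
    source   => "${platform_httproot}/DMGs/sometool.dmg",<br />
}<br />
</pre><br />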
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
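<br />
As a sketch of how this hierarchy looks in practice (the bodies here are invented; see nodes.pp below for the real definitions):<br />
<pre><br />
# sketch only -- not copied from base/nodes.pp<br />
node "slave" {<br />
    # things common to ALL slaves<br />
}<br />
node "build" inherits "slave" {<br />
    include buildslave<br />
}<br />
node "centos5-i686-build" inherits "build" {<br />
    # bits specific to 32-bit CentOS 5 build slaves<br />
}<br />
</pre><br />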
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves and each slave has defined which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "Types", such as "exec", "user", "package", "file" or "service" -- see the sketch below.<br />
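<br />
A minimal sketch of such a class (the tool name and paths are invented for illustration, not taken from buildslave.pp):<br />
<pre><br />
# hypothetical class showing the "package", "file", "service" and "exec" types together<br />
class exampletool {<br />
    package { "exampletool":<br />
        ensure => installed,<br />
    }<br />
<br />
    file { "/etc/exampletool.conf":<br />
        source  => "${platform_fileroot}/etc/exampletool.conf",<br />
        require => Package["exampletool"],<br />
    }<br />
<br />
    service { "exampletool":<br />
        ensure    => running,<br />
        enable    => true,<br />
        subscribe => File["/etc/exampletool.conf"],<br />
    }<br />
<br />
    exec { "exampletool-initial-setup":<br />
        command => "/usr/sbin/exampletool --init",<br />
        creates => "/var/lib/exampletool/initialized",<br />
    }<br />
}<br />
</pre><br />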
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
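<br />
As a sketch, a hypothetical module named ''ntp'' would be laid out roughly like this, with its interface documented in comments at the top of init.pp:<br />
<pre><br />
modules/ntp/<br />
    manifests/init.pp       # class ntp { ... }; "include ntp" is the whole interface<br />
    files/ntp.conf          # served as puppet:///modules/ntp/ntp.conf<br />
    templates/ntp.conf.erb  # optional template variant<br />
</pre><br />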
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first three levels of the tree are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node of the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$slaveType' are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
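<br />
For example (an illustrative path, not necessarily a real file in the tree), a production 64-bit CentOS 5 build slave would be served that library from:<br />
 production/centos5-x86_64/build/usr/lib/libsomethinghuge.so<br />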
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Exec's. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet, as well as any slaves you intend to test on, before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering:Puppet:Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated: In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* \<br />
--exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* \<br />
--exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* \<br />
--delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are in place on '''all''' masters or the whole set of slaves will be going down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** see [[#PuppetServers|Masters]] for list (above)<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=436621ReleaseEngineering/Puppet/Usage2012-06-01T15:58:38Z<p>Bear: /* Masters */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Puppet Master !! Type<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org || prehistoric<br />
|-<br />
| build slave || mtv1 || mv-production-puppet.build.mtv1.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com || prehistoric<br />
|-<br />
| build slave || scl1 || releng-puppet1.build.scl1.mozilla.com || puppetagain<br />
|-<br />
| build slave || scl3 || releng-puppet1.srv.releng.scl3.mozilla.com || puppetagain<br />
|-<br />
| staging || scl3 || staging-puppet.build.mozilla.org || prehistoric<br />
|-<br />
| staging || mtv1 || relabs-puppet.build.mtv1.mozilla.com || puppetagain<br />
|}<br />
<br />
<small>To reference the above table from other wiki pages, use: <tt><nowiki>[[ReleaseEngineering/Puppet/Usage#PuppetServers]]</nowiki></tt></small><br />
<br />
=== Working with Puppet Servers ===<br />
<br />
There are several common conventions for all "old school" puppet servers:<br />
* log in as <tt>root</tt><br />
* configuration files are under <tt>/etc/puppet/manifests</tt><br />
* configuration repository is [http://hg.mozilla.org/build/puppet-manifests puppet-manifests]<br />
* set up personal environments under <tt>/etc/puppet/$USER</tt><br />
** for staging, use <tt>/etc/puppet/manifests-$USER</tt><br />
<br />
For PuppetAgain servers:<br />
<br />
* log in as your ldap username<br />
* production configuration files are under <tt>/etc/puppet/production</tt><br />
* dev/test configuration files are under <tt>/etc/puppet/environments/$USER</tt><br />
<br />
=== The Slave-Master Link ===<br />
You can find out which puppet master a slave connects to by checking the contents of the appropriate file below:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If the slaves have to be moved between masters be sure to remove the certs after you modify this file and before their next reboot. You may also need to do 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, as well as base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
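<br />
As a rough illustration of how these variables get used (a hypothetical sketch, not copied from the real manifests - the file and package names here are made up):<br />
 # hypothetical resources showing ${platform_fileroot} vs. ${platform_httproot}<br />
 file { "/tools/example.conf":<br />
     source => "${platform_fileroot}/tools/example.conf",<br />
 }<br />
 package { "example-tool":<br />
     ensure   => installed,<br />
     provider => rpm,<br />
     source   => "${platform_httproot}/RPMs/example-tool-1.0.i386.rpm",<br />
 }<br />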
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
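<br />
For illustration, a concrete node of the first kind boils down to something like the following sketch (not copied from nodes.pp):<br />
 # hypothetical "$platform-$arch-$type" node definition<br />
 node "centos5-i686-build" inherits "build" {<br />
     include buildslave<br />
 }<br />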
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves and each slave has defined which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "Types", such as "exec", "user", "package", "file", and "service" (see the sketch below).<br />
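<br />
As a hypothetical sketch of what a small class using a few of those types might look like (all names here are made up; see the file and package sketch in the previous section for the other two types):<br />
 class exampletool {<br />
     user { "exampletool-user":<br />
         ensure => present,<br />
     }<br />
     service { "exampletool":<br />
         ensure  => running,<br />
         require => User["exampletool-user"],<br />
     }<br />
     exec { "exampletool-initial-setup":<br />
         command => "/usr/bin/exampletool --init",<br />
         creates => "/var/lib/exampletool/initialized",<br />
     }<br />
 }<br />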
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
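<br />
A minimal sketch of what such a module's entry point might look like (module and class names are hypothetical):<br />
 # modules/exampletool/manifests/init.pp<br />
 #<br />
 # Installs exampletool.<br />
 # Interface: include exampletool<br />
 # Depends on: no other modules<br />
 class exampletool {<br />
     case $operatingsystem {<br />
         "CentOS": {<br />
             package { "exampletool": ensure => installed }<br />
         }<br />
         default: {<br />
             fail("exampletool is not implemented for $operatingsystem")<br />
         }<br />
     }<br />
 }<br />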
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first 3 levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type', are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
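<br />
For example (a sketch - adjust the level and platform directory to the slaves you are actually targeting), staging that library for 64-bit CentOS 5 build slaves would mean copying it to the matching spot in the staging tree:<br />
 cp libsomethinghuge.so /N/staging/centos5-x86_64/build/usr/lib/libsomethinghuge.so<br />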
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Exec's. This should really be a Makefile - {{bug|635067}}<br />
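<br />
If you want a quick syntax check of a single manifest outside of that script, something along these lines should work (a sketch; which form you need depends on the puppet version installed locally):<br />
 # newer puppet clients<br />
 puppet parser validate site.pp<br />
 # older clients<br />
 puppet --parseonly site.pp<br />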
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
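<br />
A rough sketch of that mq workflow, assuming your clone lives at ''/etc/puppet/manifests-$USER'' and using an arbitrary patch name:<br />
 cd /etc/puppet/manifests-$USER<br />
 hg qnew local-staging-edits<br />
 # ... make the cltbld.pp and site.pp edits described above, then fold them into the patch ...<br />
 hg qrefresh<br />
 # later, to move to a newer upstream revision:<br />
 hg qpop -a<br />
 hg pull -u<br />
 hg qpush -a<br />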
<br />
If you have a patch to apply to the repository now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done, you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
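<br />
For example, a more verbose no-op run against staging would look like:<br />
 puppetd --test --logdest console --noop --evaltrace --debug --server staging-puppet.build.mozilla.org<br />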
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually clean up the installed files, and remove the marker file for your package (see the sketch below). The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
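<br />
For example, to force a re-install of a hypothetical ''sometool.dmg'' package on a Mac slave:<br />
 rm /var/db/.puppet_pkgdmg_installed_sometool.dmg<br />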
<br />
You can now re-test your package install with [[ReleaseEngineering:Puppet:Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated: In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* \<br />
--exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* \<br />
--exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* \<br />
--delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
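<br />
The master-side half of this is an environment section in the master's puppet.conf pointing at your own checkout. Roughly (a sketch - check the existing setup on master-puppet1 for the real paths):<br />
 # in /etc/puppet/puppet.conf on the master<br />
 [dustin]<br />
 manifest   = /etc/puppet/environments/dustin/manifests/site.pp<br />
 modulepath = /etc/puppet/environments/dustin/modules<br />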
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are in place across '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** see [[#PuppetServers|Masters]] for list (above)<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=436620ReleaseEngineering/Puppet/Usage2012-06-01T15:57:00Z<p>Bear: /* Working with Puppet Servers */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Puppet Master <br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org <br />
|-<br />
| build slave || mtv1 || mv-production-puppet.build.mtv1.mozilla.com<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|-<br />
| build slave || scl1 || releng-puppet1.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || releng-puppet1.srv.releng.scl3.mozilla.com<br />
|-<br />
| staging || scl3 || staging-puppet.build.mozilla.org<br />
|-<br />
| staging || mtv1 || relabs-puppet.build.mtv1.mozilla.com<br />
|}<br />
<br />
<small>To reference the above table from other wiki pages, use: <tt><nowiki>[[ReleaseEngineering/Puppet/Usage#PuppetServers]]</nowiki></tt></small><br />
<br />
=== Working with Puppet Servers ===<br />
<br />
There are several common conventions for all "old school" puppet servers:<br />
* log in as <tt>root</tt><br />
* configuration files are under <tt>/etc/puppet/manifests</tt><br />
* configuration repository is [http://hg.mozilla.org/build/puppet-manifests puppet-manifests]<br />
* set up personal environments under <tt>/etc/puppet/$USER</tt><br />
** for staging, use <tt>/etc/puppet/manifests-$USER</tt><br />
<br />
For PuppetAgain servers:<br />
<br />
* log in as your ldap username<br />
* production configuration files are under <tt>/etc/puppet/production</tt><br />
* dev/test configuration files are under <tt>/etc/puppet/environments/$USER</tt><br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of the appropriate file for its platform:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If the slaves have to be moved between masters be sure to remove the certs after you modify this file and before their next reboot. You may also need to do 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, and also define the base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves and each slave has defined which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "Types" that can be "exec", "user", "package", "file", "service"<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first 3 levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type', are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Exec's. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done, you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering:Puppet:Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated: In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* \<br />
--exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* \<br />
--exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* \<br />
--delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are in place across '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** see [[#PuppetServers|Masters]] for list (above)<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=436614ReleaseEngineering/Puppet/Usage2012-06-01T15:45:09Z<p>Bear: /* Masters */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Puppet Master <br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org <br />
|-<br />
| build slave || mtv1 || mv-production-puppet.build.mtv1.mozilla.com<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|-<br />
| build slave || scl1 || releng-puppet1.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || releng-puppet1.srv.releng.scl3.mozilla.com<br />
|-<br />
| staging || scl3 || staging-puppet.build.mozilla.org<br />
|-<br />
| staging || mtv1 || relabs-puppet.build.mtv1.mozilla.com<br />
|}<br />
<br />
<small>To reference the above table from other wiki pages, use: <tt><nowiki>[[ReleaseEngineering/Puppet/Usage#PuppetServers]]</nowiki></tt></small><br />
<br />
=== Working with Puppet Servers ===<br />
<br />
There are several common conventions for all puppet servers:<br />
* log in as <tt>root</tt><br />
* configuration files are under <tt>/etc/puppet/manifests</tt><br />
* configuration repository is [http://hg.mozilla.org/build/puppet-manifests puppet-manifests]<br />
* set up personal environments under <tt>/etc/puppet/$USER</tt><br />
** for staging, use <tt>/etc/puppet/manifests-$USER</tt><br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of the appropriate file for its platform:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If the slaves have to be moved between masters be sure to remove the certs after you modify this file and before their next reboot. You may also need to do 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, and also define the base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves and each slave has defined which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "Types" that can be "exec", "user", "package", "file", "service"<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first 3 levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type', are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Exec's. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done, you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering:Puppet:Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated: In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* \<br />
--exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* \<br />
--exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* \<br />
--delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are in place across '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** see [[#PuppetServers|Masters]] for list (above)<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Reset_a_Password_with_Puppet&diff=436085ReleaseEngineering/How To/Reset a Password with Puppet2012-05-31T14:09:19Z<p>Bear: </p>
<hr />
<div>{{Release Engineering How To|Reset the cltbld Password with Puppet}}<br />
User passwords are stored in a hashed format alongside other user information. We do not put the hashes in a public location for hopefully obvious reasons - please make sure you don't do this by accident.<br />
<br />
Let's say you want to update cltbld's password. First, you need to generate the new hash. You can do that by running the following:<br />
openssl passwd -1 <br />
# now type the password and confirmation<br />
Now, copy and paste that password hash into /etc/puppet/manifests/secrets.pp as the 'password' for the cltbld user. Do this on all active puppet masters. '''do not check this change in!'''<br />
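<br />
What the entry in secrets.pp looks like depends on how the manifests reference it; as a purely hypothetical sketch, it is simply the hash string attached to the user (the hash below is fake):<br />
 user { "cltbld":<br />
     password => '$1$XXXXXXXX$YYYYYYYYYYYYYYYYYYYYYY',<br />
 }<br />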
<br />
Both the root and cltbld passwords can be updated this way.</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=434367ReleaseEngineering/Puppet/Usage2012-05-25T02:01:44Z<p>Bear: /* Current Puppet Servers */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
* staging-puppet.build.mozilla.org (staging, in SCL3)<br />
* mv-production-puppet.build.mozilla.org (MV)<br />
* scl-production-puppet.build.scl1.mozilla.com (SCL1)<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com (SCL3)<br />
* master-puppet1.build.mozilla.org (for buildbot-masters, in SCL1)<br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of the appropriate file for its platform:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If the slaves have to be moved between masters be sure to remove the certs after you modify this file and before their next reboot. You may also need to do 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, and also define the base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves and each slave has defined which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "Types" that can be "exec", "user", "package", "file", "service"<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first 3 levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type', are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Execs. This should really be a Makefile - {{bug|635067}}<br />
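<br />
A typical invocation, assuming the script lives at the top of your manifests checkout (adjust the path if it lives elsewhere), looks like:<br />
cd /etc/puppet/manifests<br />
./test-manifests.sh<br />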
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
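<br />
A minimal mq workflow for this, with a hypothetical patch name, might look like:<br />
hg qnew -f site-config.patch   # fold your local site.pp / cltbld.pp edits into a patch<br />
hg qpop -a                     # drop the patch before pulling<br />
hg pull -u<br />
hg qpush -a                    # re-apply your local configuration on top of the new revision<br />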
<br />
If you have a patch to apply to the repository, now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
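<br />
For example, a verbose no-op run against staging would look like:<br />
puppetd --test --logdest console --noop --evaltrace --debug --server staging-puppet.build.mozilla.org<br />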
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed (see the example after the list below).<br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
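<br />
As a sketch, for a hypothetical package called ''somethinghuge'':<br />
# Linux (CentOS/Fedora): remove the rpm so Puppet installs it again<br />
rpm -e somethinghuge<br />
# Mac: clean up the installed files by hand, then clear the marker so Puppet re-installs the DMG<br />
rm /var/db/.puppet_pkgdmg_installed_somethinghuge.dmg<br />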
<br />
You can now re-test your package install with [[ReleaseEngineering/Puppet/Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
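<br />
For a single-file change, a targeted copy (the file below is hypothetical) avoids the rsync whack-a-mole entirely:<br />
cp -p /N/staging/centos5-i686/build/usr/lib/libsomethinghuge.so /N/production/centos5-i686/build/usr/lib/libsomethinghuge.so<br />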
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated. In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* --exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* --exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* --delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
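<br />
In practice this means running the client against a named environment; the environment name below is hypothetical and has to be mapped to a separate manifests checkout in the master's puppet.conf:<br />
puppetd --test --noop --environment=jsmith --server master-puppet1.build.mozilla.org<br />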
<br />
== Deploy changes ==<br />
* deploy the files you need (if any); a condensed example follows this list<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are present on '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** scl-production-puppet<br />
** scl3-production-puppet<br />
** mv-production-puppet<br />
** master-puppet1<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes<br />
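<br />
Condensed into a copy-pasteable form (run as root on each of the masters listed above):<br />
cd /etc/puppet/manifests/<br />
hg pull -u<br />
tail -F /var/log/messages   # watch for errors as slaves pick up the change<br />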
<br />
== Current Puppet Servers ==<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Slave Puppet Master<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|-<br />
| build slave || mtv1 || mv-production-puppet.build.mtv1.mozilla.com<br />
|}</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=434365ReleaseEngineering/Puppet/Usage2012-05-25T02:00:21Z<p>Bear: /* Moving file updates to production */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
* staging-puppet.build.mozilla.org (staging, in SCL3)<br />
* mv-production-puppet.build.mozilla.org (MV)<br />
* scl-production-puppet.build.scl1.mozilla.com (SCL1)<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com (SCL3)<br />
* master-puppet1.build.mozilla.org (for buildbot-masters, in SCL1)<br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of one of these files:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If a slave has to be moved between masters, be sure to remove its certs after you modify this file and before its next reboot. You may also need to run 'puppetca --clean <FQDN>' on the new puppet master (a combined example follows the commands below).<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
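<br />
Putting that together, re-pointing a CentOS builder at a different master might look like the following; the hostnames are purely illustrative:<br />
# on the slave<br />
sed -i 's/old-master.build.mozilla.org/new-master.build.mozilla.org/' /etc/sysconfig/puppet<br />
find /var/lib/puppet/ssl -type f -delete<br />
# then on the new master<br />
puppetca --clean slave-hostname.build.mozilla.org<br />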
<br />
== Our Puppet Manifests ==<br />
Our Puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, as well as base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves, and each slave definition specifies which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "types", such as "exec", "user", "package", "file" and "service".<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first three levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type' are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Execs. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository, now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering/Puppet/Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
* master-puppet1.build.scl1.mozilla.com <br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On scl3-production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on scl3-production-puppet, the other production puppet masters need to be updated. In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* --exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* --exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* --delete filesync@scl3-production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@scl3-production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are present on '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** scl-production-puppet<br />
** scl3-production-puppet<br />
** mv-production-puppet<br />
** master-puppet1<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes<br />
<br />
== Current Puppet Servers ==<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Slave Puppet Master<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|}</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=434364ReleaseEngineering/Puppet/Usage2012-05-25T01:58:05Z<p>Bear: /* Deploy changes */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
* staging-puppet.build.mozilla.org (staging, in SCL3)<br />
* mv-production-puppet.build.mozilla.org (MV)<br />
* scl-production-puppet.build.scl1.mozilla.com (SCL1)<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com (SCL3)<br />
* master-puppet1.build.mozilla.org (for buildbot-masters, in SCL1)<br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of one of these files:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If a slave has to be moved between masters, be sure to remove its certs after you modify this file and before its next reboot. You may also need to run 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our Puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, as well as base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves, and each slave definition specifies which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "types", such as "exec", "user", "package", "file" and "service".<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first three levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type' are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Execs. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository, now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering/Puppet/Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
<br />
Once you've landed into /N/production on production-puppet, the other production puppet masters need to be updated. In theory, this is done as 'filesync', but that user does not have permission to update the relevant directories, so in practice I suspect it's done as root. Anyway, here's the example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* --exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* --exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* --delete filesync@production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different envrionment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
<br />
== Deploy changes ==<br />
* deploy the files you need (if you do)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are present on '''all''' masters or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** scl-production-puppet<br />
** scl3-production-puppet<br />
** mv-production-puppet<br />
** master-puppet1<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors<br />
** tail -F /var/log/messages<br />
** once you see a slave listed go and check to see that it got the changes<br />
<br />
== Current Puppet Servers ==<br />
An accurate list of puppet servers needs to be referenced by various procedures. Please keep the following list up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Slave Puppet Master<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|}</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Puppet/Usage&diff=434363ReleaseEngineering/Puppet/Usage2012-05-25T01:56:36Z<p>Bear: /* Moving file updates to production */</p>
<hr />
<div>{{ReleaseEngineering Puppet Header}}<br />
<br />
This document is intended to serve as a guide to interacting with our Puppet servers and manifests.<br />
<br />
== Definitions ==<br />
<br />
* Type - Puppet documentation talks a lot about this. Each different "type" deals with a different aspect of the system. For example, the "user" type can do most things related to user management (passwords, UID/GID, homedirs, shells, etc). The 'package' type deals with package management (eg, apt, rpm, fink, etc). And so on.<br />
<br />
== Masters ==<br />
<br />
* staging-puppet.build.mozilla.org (staging, in SCL3)<br />
* mv-production-puppet.build.mozilla.org (MV)<br />
* scl-production-puppet.build.scl1.mozilla.com (SCL1)<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com (SCL3)<br />
* master-puppet1.build.mozilla.org (for buildbot-masters, in SCL1)<br />
<br />
=== The Slave-Master Link ===<br />
You can find which puppet master a slave connects to by checking the contents of one of these files:<br />
# for linux testers (fedora)<br />
~cltbld/.config/autostart/gnome-terminal.desktop<br />
# for linux builders (centos)<br />
/etc/sysconfig/puppet<br />
# for osx<br />
/Library/LaunchDaemons/com.reductivelabs.puppet.plist<br />
If a slave has to be moved between masters, be sure to remove its certs after you modify this file and before its next reboot. You may also need to run 'puppetca --clean <FQDN>' on the new puppet master.<br />
# for linux<br />
find /var/lib/puppet/ssl -type f -delete<br />
# for mac<br />
find /etc/puppet/ssl -type f -delete<br />
<br />
== Our Puppet Manifests ==<br />
Our Puppet manifests are organized into a few different parts:<br />
* Site files<br />
* Basic includes<br />
* Packages that make changes<br />
* Modules<br />
We are pushing toward organizing everything into modules, although this is not a particularly rapid process at the moment. Talk to Dustin.<br />
<br />
=== Site Files & Basic Includes ===<br />
Each Puppet master has its own site file which contains a few things:<br />
* Variable definitions specific to that master<br />
* Import statements which load other parts of the manifests<br />
* Node (slave) definitions<br />
<br />
<p>The basic includes are located in the 'base' directory. These files set variables which are referenced in the packages, as well as base nodes for slaves.</p><br />
<br />
The most important variables to take note of are:<br />
* ${platform_fileroot} -- Used wherever the puppet:// protocol is supported, most notably with the File type.<br />
* ${platform_httproot} -- Used with the Package type and other places that don't support puppet://<br />
<br />
<p>There are also ${local_[file,http]} variables which point to the 'local' directory inside of each platform's root. See the following section for more on that.</p><br />
<br />
We have a few base nodes shared by multiple pools of slaves as well as a base node for each concrete slave type. The shared ones are:<br />
* "slave" -- For things common to ALL slaves managed by Puppet<br />
* "build" -- For things common to all build slaves<br />
* "test" -- For things common to all test slaves<br />
<br />
There are two different types of concrete nodes. Firstly, we have "$platform-$arch-$type" nodes, which are used on all Puppet masters for slaves which are local to them. Two examples are: "centos5-i686-build" (32-bit, CentOS 5, build slaves) and "darwin10-i386-test" (32-bit, Mac 10.6, test slaves). Secondly, there are "$location-$type-node" nodes, which only apply to the MPT master. All nodes which are not local to MPT production are listed in its configuration file as this type of node. These nodes ensure that new slaves get redirected to their local master when they first come up. Examples include "mv-build-node" and "staging-test-node".<br />
<br />
See [http://hg.mozilla.org/build/puppet-manifests/file/tip/base/nodes.pp base/nodes.pp] for the full listing of nodes.<br />
<br />
=== Packages ===<br />
* The [http://hg.mozilla.org/build/puppet-manifests/file/tip/site-production.pp site-{staging,production}.pp] files declare the list of slaves, and each slave definition specifies which classes to include.<br />
* The classes [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/buildslave.pp buildslave.pp] and [http://hg.mozilla.org/build/puppet-manifests/file/tip/classes/staging-buildslave.pp staging-buildslave.pp] include most of the packages (devtools, nagios, mercurial, buildbot, extras, etc) we want.<br />
* The packages can have different sections or "types", such as "exec", "user", "package", "file" and "service".<br />
<br />
=== Modules ===<br />
Going forward, puppet functionality should be encapsulated into modules. Modules include the relevant manifests, as well as files, templates, and (with some minor changes to our puppet client configs) even custom facts or types!<br />
<br />
Modules should be generic in their purpose, and well-encapsulated. They should not be specific to one operating system or distro by design, although it's OK to omit implementations we do not need (for example, it's OK for a module providing resources only used by build slaves to error out if it's used on a Fedora system - if and when we start building on Fedora, we'll need to extend the implementation).<br />
<br />
A module should be self-contained and have a well-documented and commented interface. If it depends on any other modules, that should also be highlighted in the comments.<br />
<br />
== Puppet Files ==<br />
The files that Puppet serves up (using <tt>File</tt>) are in <tt>/N</tt> on each puppet master. The MPT masters share this via an NFS mount, so it's easy to sync files from staging to MPT production. The other servers have a local copy of this data.<br />
<br />
The first three levels of the drive are laid out as follows:<br />
$level/$os-$hardwaremodel/$slaveType<br />
* $level is support level (production, staging, pre-production)<br />
* $os is generally one of 'centos5', 'fedora12', 'darwin9', or 'darwin10'.<br />
* $hardwaremodel is whatever 'facter' identifies the machine's CPU as (x86_64, i686, i386, etc).<br />
* $slaveType is the "type" of node the slave is: 'build', 'test', 'stage', 'master', etc.<br />
<br />
Below '$type' are all of the files served by Puppet. They are organized according to where they'll end up on the slave. For example, if ''/usr/lib/libsomethinghuge.so'' is to be synced to the slave, it should live in ''usr/lib/libsomethinghuge.so''. Note that as much as possible, text files should not be kept in puppet-files -- use a module and its ''files/'' subdirectory instead.<br />
<br />
There are two special directories for each level/os/hardwaremodel/type combination, too:<br />
* local -- This directory contains files which should NOT be synced between staging <-> production or between different locations. Files such as the Puppet configs which have different contents depending on location and support level live here. Try not to use this.<br />
* DMGs (Mac) / RPMs (Fedora/CentOS) -- These directories contain platform specific packages which Puppet installs.<br />
<br />
== Common Use Cases ==<br />
* [[ReleaseEngineering/How To/Reset a Password with Puppet]]<br />
* [[ReleaseEngineering/How_To/Install_a_Package_with_Puppet]]<br />
<br />
== Testing ==<br />
Before you test on the Puppet server it's good to run the 'test-manifests.sh' script locally. This script will test the syntax of the manifest files and catch very basic issues. It will not catch any issues with run-time code such as Execs. This should really be a Makefile - {{bug|635067}}<br />
<br />
Staging of updates is done with ''staging-puppet.build.mozilla.org'' and staging slaves. You should book staging-puppet as well as any slaves you intend to test on before making any changes to the manifests on the Puppet server. All Puppet server work is done as the root user.<br />
<br />
=== Setting up the server ===<br />
If you've never used the Puppet server before you'll want to start a clone of the manifests for yourself. You can clone the main manifests repo or your own user repo to a directory under ''/etc/puppet''. Once you have your clone, two edits are necessary:<br />
<br />
* Copy the password hash into your clone's build/cltbld.pp. This can be done with the following command, run from the root of your clone:<br />
hg -R /etc/puppet/manifests.real diff /etc/puppet/manifests.real/build/cltbld.pp | patch -p1<br />
or more easily<br />
patch -p1 < /etc/puppet/password<br />
* Copy ''staging.pp'' to ''site.pp'' and comment out all of the "node" entries except for those which you have booked.<br />
<br />
It's easiest to use the ''mq'' extension to make these changes in a patch on your queue. Then, when you want to change revisions, just pop the patch, use 'hg pull -u', and re-push your patch.<br />
<br />
If you have a patch to apply to the repository, now is the time to do it.<br />
<br />
Finally, if your changes involve edits to any files served by Puppet, apply those changes in the appropriate places under /N/staging. It's usually easiest to keep a text file tracking these changes - then you can post the contents of that file to the bug for review, so that it's clear to reviewers what changes are being made here. Because puppet-files are unversioned, try to minimize the amount of change you must make here.<br />
<br />
Once all of that is done you can swap your manifests in with ''/etc/puppet/set-manifests.sh YOURNAME''. Omit the name to reset them to the default ("real") manifests. If you've added new files or changed staging-fileserver.conf you'll need to restart the Puppetmaster process with:<br />
service puppetmaster restart<br />
although note that the daemon will pick up the changes after some short delay if you do not restart.<br />
<br />
Now, you're ready to test.<br />
<br />
=== Testing a slave ===<br />
Puppet needs to run as root on the slaves, so equip yourself thusly and run the following command:<br />
puppetd --test --logdest console --noop --server staging-puppet.build.mozilla.org<br />
<br />
<p>This will pull updated manifests from the server, see what needs to be done, and output that. The --noop argument tells Puppet to not make any changes to the slave. Once you're satisfied with the output of that, you can run it without the --noop to have Puppet make the changes. The output should be coloured, and indicate success/fail/exception.</p><br />
<br />
<p>If you're encountering errors or weird behaviour and the normal output isn't sufficient for debugging you can enhance it with --evaltrace and --debug. Together, they will print out every command that Puppet runs, including things which are used to determine whether a file or package needs updating.</p><br />
<br />
=== Forcing a package re-install ===<br />
Especially when testing, you may have to iterate on a single package install to get it right. If you need to re-install an existing package, you'll need to remove the package contents and/or the marker file that flags that package as installed. <br />
<br />
* Linux: packages installed as rpms should be removed as one normally would for an rpm, i.e. <code>rpm -e rpmname</code>, which will delete all of the files and remove the package from the db, or <code>rpm -e --justdb rpmname</code>, which will leave all of the files and remove the package from the db<br />
* Mac: manually cleanup the installed files, and remove the marker file for your package. The marker file lives under <code>/var/db/</code> and will be named <code>.puppet_pkgdmg_installed_pkgname.dmg</code>.<br />
<br />
You can now re-test your package install with [[ReleaseEngineering/Puppet/Usage#Testing_a_slave|the command above]], i.e. <code>puppetd --test ...</code>.<br />
<br />
=== Cleaning up ===<br />
Once you're finished testing, the manifests symlink needs to be re-adjusted with:<br />
cd /etc/puppet<br />
./set-manifests.sh<br />
<br />
== Moving file updates to production ==<br />
'''Production Puppet Masters:'''<br />
* mv-production-puppet.build.mozilla.org<br />
* scl-production-puppet.build.scl1.mozilla.com<br />
* scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
<br />
'''NOTE: there are a lot of files that differ between the various directories, so using rsync involves a lot of whack-a-mole to avoid syncing files that aren't part of your change. It may be easier to simply use 'cp' for this step'''<br />
<br />
When you're ready to land in production it's important to sync your files from staging to ensure you don't end up with a different result in production. Here's the process to do that. On production-puppet as root, run:<br />
rsync -n --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
After verifying that only the things you want are being synced, run it without -n to push them for real:<br />
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/<br />
<br />
If there are things that shouldn't be synced, carefully adjust the rsync command with --exclude or more specific paths.<br />
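<br />
For example, to keep one more path out of the sync you might add another exclude to the dry run (the extra excluded path here is purely illustrative):<br />
rsync -n --delete -av --include="**usr/local" --exclude=local --exclude="**/some/unrelated/file" /N/staging/ /N/production/<br />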
<br />
Once you've landed into /N/production on production-puppet, the other production Puppet masters need to be updated. In theory this is done as the 'filesync' user, but that user does not have permission to update the relevant directories, so in practice it is likely done as root. Here's an example:<br />
sudo su - filesync<br />
rsync -av --exclude=**/local/etc/sysconfig/puppet* --exclude=**/local/Library/LaunchDaemons/com.reductivelabs.puppet.plist* --exclude=**/local/home/cltbld/.config/autostart/gnome-terminal.desktop* --delete filesync@production-puppet.build.mozilla.org:/N/production/ /N/production/<br />
<br />
Again, rsync is finicky, so scp may be your friend here:<br />
# mv-production-puppet <br />
scp -p {root@production-puppet.build.mozilla.org:/N/production,/N/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
# scl-production-puppet (bug 615313)<br />
scp -p {root@production-puppet.build.mozilla.org:/N/production,/builds/production}/darwin9-i386/build/Library/Preferences/com.apple.Bluetooth.plist<br />
<br />
When you're ready, update the manifests on the masters with:<br />
hg -R /etc/puppet/manifests pull<br />
hg -R /etc/puppet/manifests update<br />
Note that some changes may require manifest updates first - think carefully about the intermediate state and what it will do to slaves!<br />
<br />
Be sure to do this on all Puppet masters.<br />
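<br />
One way to avoid missing a master is to loop over them from a trusted host (a sketch only, assuming root ssh access; extend the host list to match the current set of masters):<br />
<pre><br />
for m in mv-production-puppet.build.mozilla.org scl-production-puppet.build.scl1.mozilla.com scl3-production-puppet.srv.releng.scl3.mozilla.com; do<br />
    ssh root@$m 'hg -R /etc/puppet/manifests pull && hg -R /etc/puppet/manifests update'<br />
done<br />
</pre><br />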
<br />
== Staging changes (environments) ==<br />
<pre><br />
armenzg: if you know of a script or a command that could catch stupid things like this<br />
dustin: I used to use environments for this purpose<br />
armenzg: what do you mean?<br />
armenzg: what are environments?<br />
dustin: you can specify a different environment on the client:<br />
dustin: puppetd --test --environment=dustin<br />
dustin: and then that can be configured to point to a different directory on the master<br />
dustin: so I would push my mq'd repo there<br />
dustin: and test with it, confident that only the slave I was messing with would be affected<br />
catlee: armenzg: we have that set up on master-puppet1 if you want to look<br />
</pre><br />
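<br />
For reference, a config-file environment on the master might be wired up roughly like this (a sketch only; the environment name and paths are illustrative, and the real setup lives on master-puppet1):<br />
<pre><br />
# /etc/puppet/puppet.conf on the master<br />
[dustin]<br />
manifest   = /etc/puppet/environments/dustin/manifests/site.pp<br />
modulepath = /etc/puppet/environments/dustin/modules<br />
<br />
# then, on the single slave you want to affect:<br />
# puppetd --test --noop --environment=dustin --server YOURMASTER<br />
</pre><br />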
<br />
== Deploy changes ==<br />
* deploy the files you need (if any)<br />
** [[ReleaseEngineering/Puppet/Usage#Moving_file_updates_to_production]]<br />
** you can try this instead:<br />
csshX --login root {production-puppet,mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
** be sure that the files are present on '''all''' masters, or the whole set of slaves will go down<br />
* make sure you deploy the changes to all puppet masters (ssh as root)<br />
** production-puppet (same as mpt-production-puppet)<br />
** scl-production-puppet<br />
** scl3-production-puppet<br />
** mv-production-puppet<br />
** master-puppet1<br />
* cd /etc/puppet/manifests/<br />
* hg pull -u<br />
* watch for a few minutes to make sure there are no errors (see the sketch after this list)<br />
** tail -F /var/log/messages<br />
** once you see a slave listed, go and check that it got the changes<br />
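<br />
Pulled together, the deploy steps above might look roughly like this (a sketch; the csshX invocation is copied from the list above and the host names may need adjusting as masters change):<br />
<pre><br />
# open a broadcast shell on all of the production masters<br />
csshX --login root {production-puppet,mv-production-puppet,scl3-production-puppet,scl-production-puppet}.build.mozilla.org<br />
<br />
# then, in the broadcast window (runs on every master at once):<br />
cd /etc/puppet/manifests<br />
hg pull -u<br />
tail -F /var/log/messages    # watch for a few minutes and confirm slaves pick up the change cleanly<br />
</pre><br />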
<br />
== Current Puppet Servers ==<br />
Various procedures reference this list of Puppet servers, so please keep it accurate and up to date.<br />
<br />
{{Anchor|PuppetServers}}<br />
<br />
{| class="wikitable" style="text-align: center;"<br />
!Role !! Data Center !! Slave Puppet Master<br />
|-<br />
| build master || ''all'' || master-puppet1.build.mozilla.org<br />
|-<br />
| build slave || scl1 || scl-production-puppet.build.scl1.mozilla.com<br />
|-<br />
| build slave || scl3 || scl3-production-puppet.srv.releng.scl3.mozilla.com<br />
|}</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=433557ReleaseEngineering/Maintenance2012-05-23T01:02:04Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure; buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also which changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page when debugging whether a reconfig caused intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 2012-05-22 1700 PDT<br />
|<br />
* {{bug|754291}} - buildbot master Makefile refers to the wrong hg<br />
* {{bug|756463}} - Please reset the priority of the Oak branch<br />
* {{bug|753501}} - followup, make try look for the Android XUL tooltool manifest in the path where it actually is<br />
|-<br />
| in production<br />
| 2012-05-18 1313 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
* {{bug|753132}} - Fix env for nightly pgo builds<br />
* {{bug|573722}} - set IS_NIGHTLY env var for nightly builds<br />
* {{bug|753132}} - Do periodic PGO builds on pgo_platform too<br />
|-<br />
| in production<br />
| 2012-05-18 0815 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
|-<br />
| in production<br />
| 2012-05-17 0750 PDT<br />
|<br />
* {{bug|755434}} - l10n repacks should not execute config.py<br />
* {{bug|753132}} - Do 32-bit PGO builds on win64<br />
* {{bug|751158}} - Create tcheckboard3 to measure checkerboard with low res off<br />
* {{bug|755989}} - setup-master.py doesn't set up staging symlinks properly<br />
* {{bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py<br />
* {{bug|753501}} - Add empty tooltool manifests to some platforms<br />
|-<br />
| in production<br />
| 2012-05-14 1118 PDT<br />
|<br />
* {{bug|753488}} - Android native on Aurora -> multilocale.<br />
|-<br />
| in production<br />
| 2012-05-14 0830 PDT<br />
|<br />
* {{bug|750837}} - 13.0b2 build 2 configs. r=hwine<br />
* {{bug|754517}} - disable larch and enable pine rentable branches, r=bear<br />
* {{bug|747500}} - setup-master.py refers to files which have been removed. r=catlee<br />
* {{bug|753132}} - Use win64 machines for 32-bit pgo builds on build-system branch. r=rail<br />
* {{bug|754373}} - Use firefox-tuxedo.ini for Thunderbird builds. r=standard8<br />
* {{bug|754397}} - Disable signing at build time for Thunderbird. r=nthomas<br />
* {{bug|754430}} - Missing mozilla/ dir in Thunderbird beta build. p=standard8,r=jhopkins <br />
* {{bug|701783}} - remove scratchbox references from buildbotcustom. r=catlee<br />
* {{bug|753132}} - Support 'pgo_platform' key for deciding which machines do PGO builds. r=rail<br />
* {{bug|750744}} - Test and deploy SUT agent 1.08. r=bear <br />
|-<br />
| in production<br />
| 2012-05-11 0930 PDT<br />
|<br />
* {{Bug|754297}} - add sys.stdout.flush() to sut_tools' scripts <br />
|-<br />
| in production<br />
| 2012-05-11 0800 PDT<br />
|<br />
* {{Bug|746260}} - disable the screen resolution changing on android for jsreftest and crashtest, leave it on for reftest.<br />
* {{Bug|753868}} - use aus3-staging for Thunderbird release builds, r=jhopkins<br />
* {{Bug|744601}} - tracking bug for build and release of Thunderbird 13.0b2. r=standard8<br />
* {{Bug|752531}} - migrate dev-stage01 to scl3. r=rail <br />
|-<br />
| in production<br />
| 2012-05-10 1801 PDT<br />
|<br />
* {{bug|753488}} - make FN multi on m-c only, reenable nightly updates.<br />
* {{Bug|753625}} - Move all Thunderbird branches onto Firefox infra<br />
* {{bug|749748}} - kill l10n verify.<br />
* {{Bug|753868}} - Use aus3-staging.mozilla.org for Thunderbird release builds.<br />
* {{Bug|753865}} - Email thunderbird-drivers for Thunderbird release builds.<br />
* {{bug|748157}} - Load thunderbird_release_branches from master_config.json<br />
|-<br />
| in production<br />
| 2012-05-09 0800 PDT<br />
|<br />
* {{bug|752373}} - Stop running Android crashtest-1 until someone's ready to fix it<br />
* {{bug|751070}} - retire sjc1 VMs<br />
* {{bug|750031}} - moz2-darwin10-slave02 problem tracking<br />
* {{bug|746201}} - Remove unresolved machines from buildbot-configs<br />
* {{bug|752430}} - Swap comm-aurora over to Firefox infra<br />
* {{bug|749051}} - TryChooser: could -n be the default?<br />
* {{bug|751878}} - OSError: [Errno 13] Permission denied: '/home/ftp' for pvtbuilds2.dmz.scl3.mozilla.com<br />
|-<br />
| in production<br />
| 2012-05-03 1325 PDT<br />
|<br />
* {{Bug|744067}} - add them back<br />
|-<br />
| backed out<br />
| 2012-05-03 1200 PDT<br />
|<br />
* Backout 4ab5af03cce1 (new scl3 slaves). r=backout<br />
|-<br />
| in production<br />
| 2012-05-03 1000 PDT<br />
|<br />
* {{Bug|751165}} - revert higher priority for m-i. r=philor,ehsan<br />
* {{Bug|744067}} - new scl3 slaves; r=coop<br />
* {{Bug|744067}} - new scl3 slaves (must be in staging); r=aki<br />
* Add ACTIVE_THUNDERBIRD_RELEASE_BRANCHES. r=armenzg<br />
* {{Bug|751895}} - Preproduction release master fails trying to checkconfig. r=jhopkins<br />
* {{Bug|750973}} - copy in-tree m-a linux32 mozconfig into mozilla2 to fix aurora source release. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-03 08:00 PDT<br />
|<br />
* {{bug|751506}} - No 10.7 32-bit debug builders on Thunderbird trees. r=coop<br />
* {{bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. Add in the 'TB ' prefix to match the other Thunderbird builders. r=jhopkins<br />
* {{bug|744864}} - Update list of l10n modules that trigger changes. r=Pike<br />
* {{bug|751560}} - Temporarily disable uploading symbols on Windows 32 bit try-comm-central builds. r=jhopkins <br />
* {{bug|751514}} - Thunderbird bloat test builders should warn and halt on failure, not error on failure. r=jhopkins<br />
|-<br />
| in production<br />
| 2012-05-02 12:00 PDT<br />
|<br />
* {{Bug|750635}} - Swap try-comm-central over to pushing to the thunderbird product directory, and get it running unit tests.<br />
* Follow-up to {{bug|748628}}, fix some more issues with the Thunderbird lion builders - the names and the ccache settings. <br />
* {{Bug|739994}} - Remove references to 10.5 platform and associated slaves in configs - r=jhford <br />
|-<br />
| in production<br />
| 2012-05-02 08:00 PDT<br />
|<br />
* {{Bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. r=coop<br />
* {{Bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py. r=catlee<br />
* {{Bug|751165}} - Bump priority of mozilla-inbound to help open the tree earlier. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-01 19:00 PDT<br />
|<br />
* {{Bug|554343}} - Release builders should always clobber <br />
* {{Bug|750514}} - Disable codesighs on Thunderbird try<br />
* {{Bug|750013}} - Revert Birch customizations from {{bug|746159}}<br />
|-<br />
| in production<br />
| 2012-04-30 13:30 PDT<br />
|<br />
* {{Bug|750305}} - Use comm-central as reference branch for try-comm-central builds<br />
* {{Bug|749596}} - Enable aurora nightly updates (April 27th, 2012 edition)<br />
|-<br />
| in production<br />
| 2012-04-30 11:30 PDT<br />
|<br />
* {{Bug|749867}} - Don't try to build SpiderMonkey --enable-shark builds on 10.7 where there is no Shark, r=coop<br />
* buildbot-configs patch to reflect new all-locales locations (Bug 711534 - Configure Thunderbird release builders) r=standard8<br />
* {{Bug|669428}} - Run Jetpack tests on mozilla-inbound, r=armenzg<br />
* {{Bug|748633}} - Thunderbird try logs failing to upload. r=rail <br />
|-<br />
| in production<br />
| 2012-04-27 11:30 PDT<br />
|<br />
* {{Bug|749524}} - Upload comm-aurora snippets to comm-aurora-test channel<br />
* {{Bug|711534}} - Configure Thunderbird release builders<br />
* {{Bug|749288}} - linux comm-central builds use wrong python when calling balrog client<br />
* {{Bug|749494}} - Re-enable graph server for staging/preproduction<br />
* {{Bug|729392}} - Install toolchain needed for SPDY testing onto test machines<br />
* {{Bug|745300}} - Do Mac spidermonkey builds on 10.7<br />
<br />
|-<br />
| in production<br />
| 2012-04-26 11:00 PDT<br />
|<br />
* {{Bug|749076}} - tooltool should be invoked with -o (--overwrite) option<br />
* {{Bug|739802}} - disable b2g on aurora, beta, release<br />
|-<br />
| '''backed-out'''<br />
| 2012-04-26 09:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-26 07:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-25 22:26 PDT<br />
|<br />
* {{Bug|742131}} - fix upload host for windows try symbols<br />
|-<br />
| in production<br />
| 2012-04-25 12:00 PDT<br />
|<br />
* {{Bug|743977}} - turn off balrog client for staging and preproduction builds<br />
* {{Bug|723340}} - move dm-pvtbuild01 to a new datacenter<br />
* {{Bug|747821}} - Need to run tpr_responsiveness on Try until it's not run anywhere anymore<br />
* {{Bug|729667}} - re-create the services on dm-wwwbuild01 in scl3<br />
|-<br />
| in production<br />
| 2012-04-24 6:30 PDT<br />
|<br />
* {{Bug|747966}} - comm-central builds not firing automatically<br />
* {{bug|747862}} - Disable shark nightly builds on Thunderbird builders<br />
|-<br />
| in production<br />
| 2012-04-23 15:30 PDT<br />
|<br />
* {{Bug|746708}} - Updates builder fails running backupsnip and pushsnip<br />
* {{bug|747756}} - Bump "'make hg-bundle" timeout<br />
* {{bug|747892}} - mozilla-release's releasetestUptake value should be set to 1<br />
* {{bug|747460}} - consolidate windows peptest config files<br />
|-<br />
| in production<br />
| 2012-04-18 0645 PDT<br />
|<br />
* {{bug|745545}} - Handle Thunderbird revisions in NightlyRepackFactory.<br />
* {{bug|745547}} - Move talosCmd into SUITES loop (generateTalosBranchObjects).<br />
* {{bug|745299}} - Add hg-internal as a mirror.<br />
* {{bug|745500}} - Turn on robocop testCheck2 on tinderbox builds.<br />
* {{bug|735390}} - 12.0b6 configs + fix test-masters.sh + move l10n-changesets_mobile-aurora.json into mozilla/.<br />
* {{bug|746537}} - Increase priority for Birch, drop Maple back down<br />
|-<br />
| in production<br />
| 2012-04-17 1211 PDT<br />
|<br />
* {{bug|746159}} - make birch be like inbound<br />
* {{bug|739994}} - turn off spidermonkey builds on 10.5<br />
* {{bug|744098}} - switch xulrunner osx builds to upload tarballs<br />
* {{bug|732976}} - singlesourcefactory should generate checksums<br />
|-<br />
| in production<br />
| 2012-04-17 0630 PDT<br />
|<br />
* {{bug|739778}} - preproduction in scl3<br />
* {{bug|744119}} - decommission osx builder<br />
* {{bug|744958}} - updateSUT.py fixes<br />
* {{bug|741751}} - partner repack signing fixes<br />
* {{bug|745538}} - TB mozmill test steps<br />
* {{bug|745469}} - Turn off tinderbox mail for spidermonkey builds<br />
|-<br />
| in production<br />
| 2012-04-12 1830-45 PDT<br />
|<br />
* {{bug|722759}} - switch non-try symbols to symbols1.dmz.phx1.mozilla.com<br />
* {{bug|741657}} - Switch to aus3-staging<br />
* {{bug|730325}} - Pass product name to reallyShort()<br />
* {{bug|744495}} - xulrunner pulse messages<br />
|-<br />
| in production<br />
| 2012-04-10 various times<br />
|<br />
* {{bug|720027}} - enable lion builders<br />
|-<br />
| in production<br />
| 2012-04-10 1100 PST<br />
| <br />
* {{bug|744049}} - tcheckerboard always reports 1.0 (tegra talos web server updated to talos tip)<br />
|-<br />
| in production<br />
| 2012-04-09 0700 PDT<br />
|<br />
* {{bug|607392}} - split tagging into en-US and other<br />
* {{bug|721885}} - shut off unused branch<br />
* {{bug|400296}} - Have release automation support signing OSX builds (up to 10.7 support)<br />
|-<br />
| in production<br />
| 2012-04-04 11:00 PDT<br />
|<br />
* {{bug|690311}} - deploy newer version of cleanup.py to the foopies<br />
|-<br />
| in production<br />
| 2012-03-30 6:15 PDT<br />
|<br />
* {{bug|738588}} - add ts_paint to the android tests.<br />
* {{bug|737458}} - replace tpr_responsiveness for tp5row.<br />
* {{bug|737458}} - add tpr_responsiveness temporarily for mozilla-central and larch.<br />
* {{bug|740599}} - update staging release config files;<br />
|-<br />
| in production<br />
| 2012-03-29 6:55 PDT<br />
|<br />
* {{Bug|715193}} - If a branch does not use talos_from_source_code we should fallback to talos.mobile.old.zip (fixes esr10).<br />
|-<br />
| in production<br />
| 2012-03-28 16:35 PDT<br />
|<br />
* {{Bug|740196}} - ts_paint on Android doesn't actually work<br />
|-<br />
| in production<br />
| 2012-03-28 11:55 PDT<br />
|<br />
* {{Bug|737632}} - Remove jaegermonkey, graphics and pine to reduce builders<br />
* {{Bug|723667}} - fix Android trobocheck and ts_paint tests.<br />
* {{Bug|739486}} - test-masters.sh should run ./setup_master.py -t<br />
* add option to setup_master.py to print error logs when hit<br />
* {{Bug|723667}} - enable trobopan and tcheckerboard by default (not for m-a/m-b/m-r/1.9.2)<br />
* {{Bug|627182}} - Automate signing and publishing of XULRunner builds. r=bhearsum <br />
|-<br />
| in production<br />
| 2012-03-27 12:30 PDT<br />
|<br />
* {{Bug|723667}} - Add trobopan and trobocheck to m-c/m-i. r=jmaher<br />
|-<br />
| in production<br />
| 2012-03-23 11:30 PDT<br />
|<br />
*{{bug|627182}}<br />
*{{bug|738685}}<br />
*{{bug|734223}}<br />
*{{bug|738286}}<br />
*{{bug|719491}}<br />
*{{bug|737656}}<br />
*{{bug|715193}}<br />
*{{bug|702595}}<br />
*{{bug|735383}}<br />
|-<br />
| in production<br />
| 2012-03-27 01:35 PDT<br />
|<br />
* {{Bug|739505}} - [http://hg.mozilla.org/build/buildbot-configs/rev/3c424821358a Fix talos] on beta<br />
|-<br />
| in production<br />
| 2012-03-23 7:00 PDT<br />
|<br />
* {{Bug|737864}} - Tweak release category for Thunderbird.<br />
* {{Bug|737458}} - add tp5row side by side and cleanup config.py.<br />
* {{Bug|737581}} - enable peptest on m-c and m-i.<br />
* {{Bug|713846}} - Treat 'fennec' builds as having product 'mobile' for the purposes of uploading logs.<br />
|-<br />
| backout<br />
| 2012-03-21 11:45 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
|-<br />
| in production<br />
| 2012-03-21 9:30 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
* {{Bug|697150}}. (Bv1) Remove 'ac_add_options --disable-installer' for XulRunner current branches.<br />
* {{Bug|733394}}. Add leak test logic to mozilla-beta.<br />
* {{Bug|736587}}. Enable Android for pine.<br />
|-<br />
| in production<br />
| 2012-03-20 8:30 PDT<br />
|<br />
* {{bug|723667}} - enable talos robocop for pine and only for native tests<br />
* {{bug|737077}} - re-enable aurora updates<br />
* {{bug|713846}} - unified log handling<br />
* {{bug|734320}} - fix jetpack log parsing<br />
* {{bug|737049}} - run reftest-no-accel correctly<br />
* {{bug|723386}} - fix reserved slaves handling<br />
|-<br />
| in production<br />
| 2012-03-19 10:30 PDT<br />
|<br />
* {{bug|736284}} - re-enable aurora updates<br />
|-<br />
| in production<br />
| 2012-03-16 8:45 PDT<br />
|<br />
* {{bug|734996}} - fennec beta release update channel -> beta.<br />
* {{Bug|734221}} - deploy updateSUT.py and upgrade the boards to SUT Agent version 1.07.<br />
* {{Bug|734996}} - source: get a nonce earlier<br />
|-<br />
| in production<br />
| 2012-03-13 17:00 PDT<br />
|<br />
* {{bug|735201}} - Remove leading ../ from symbols path for tegras<br />
* {{bug|735421}} - Disable Aurora updates until the Aurora 13 has stabilized<br />
|-<br />
| in production<br />
| 2012-03-12 16:00 PDT<br />
|<br />
* {{bug|734417}} - enable mobile builds on the profiling branch<br />
* {{bug|731617}} - No nightly builds on maple branch since 27 Feb<br />
* {{bug|732285}} - Set MINIDUMP_STACKWALK for Android<br />
* {{bug|733668}} - Include "ERROR: We tried to download the talos.json file but something failed" and "ERROR 500: Internal Server Error" for Talos hgweb operations to RETRY<br />
* {{bug|630518}} - l10n verify, update verify, and final verification builders need to set "branch" when reporting to clobberer<br />
|-<br />
| in production<br />
| 2012-03-08 09:00 PT<br />
|<br />
* {{Bug|731814}} - Add checks that we're not exceeding max # of builders per slave.<br />
* {{Bug|731617}} - Remove win64 for now in maple.<br />
* {{Bug|731339}} - Remove slaves that are not production<br />
* {{Bug|732730}} - Remove non-functional and unwanted pgo_platforms overrides<br />
* {{bug|732110}} - remove buildbot-configs/mozilla2/mobile<br />
* {{Bug|728271}} - Post to graphs.m.o instead of graphs-old.m.o<br />
* {{Bug|729144}} - Post to graphs.allizom.org.<br />
* {{Bug|723667}} - Add robocop disabled.<br />
* {{bug|730050}} - TryBuildFactory looks in the wrong place for malloc.log<br />
* {{Bug|712538}} - leaktest parity on try<br />
* {{Bug|723667}} - Use talos.zip for tegras and prep work for talos robocop<br />
|-<br />
| in production<br />
| 2012-03-06 06:30 PT<br />
|<br />
* {{bug|732500}} - Enable nightly updates on maple<br />
* {{bug|732699}} - ESR release automation should push to mirrors automatically<br />
* {{bug|730918}} - Android on esr10 is busted, no doubt by branding since that always seems to be the problem<br />
* {{bug|561754}} - Don't download symbols for test runs, pass symbol zip URL as symbols path<br />
* {{bug|732516}} - l10n verification shouldn't rsync zip files<br />
* {{bug|732468}} - Add the ridiculous "abort: error:" to the list of hg errors that trigger RETRY<br />
|-<br />
| in production<br />
| 2012-03-01 7:30 PT<br />
|<br />
* {{Bug|721360}} - Do what changeset 9a0c428bdb69 really wanted to do.<br />
* {{Bug|561754}} - Disable symbol download on demand for mozilla-1.9.2 branch.<br />
* {{Bug|660480}} - mark as RETRY for common tegra errors<br />
* {{Bug|729918}} - start_uptake_monitoring builder uses wrong script_repo_revision property.<br />
* {{Bug|561754}} - Download symbols on demand by default for desktop unittests.<br />
|-<br />
| in production<br />
| 2012-02-27 7:45 PT<br />
|<br />
* {{bug|729077}} - recycle talos-r4-lion-083 and talos-r3-snow-081 as mac-signing[12]<br />
* Fix up staging and preproduction test slave lists.<br />
* {{Bug|729426}} - Do periodic PGO on services-central<br />
* {{bug|727580}} - linux-android for esr10, without merging 11.0 to m-r.<br />
|-<br />
| in production<br />
| 2012-02-21 9:30 PT<br />
|<br />
* {{bug|719511}} - add optional reboot command to ScriptFactory<br />
* {{Bug|725292}} - some repacks failed in 11.0b2 because of missing tokens<br />
* {{Bug|728104}} - AggregatingScheduler resets its state on reconfig<br />
* {{Bug|722608}} - Remove android signature verification<br />
* {{Bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|719511}} - Reenable peptest + add reboot_command<br />
* {{bug|712678}} - android-xul different update channel from android<br />
<br />
|-<br />
| in production<br />
| 20120217 1148 PST<br />
|<br />
* {{bug|721822}} - remove talos_from_code.py from the tools repo<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remot<br />
|}<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 488bc187a3ef<br />
| {{bug|753822}}<br />
| 20120510 1045 AM PDT<br />
| armenzg<br />
|}<br />
<br />
Find [[ReleaseEngineering:Buildduty#Update_mobile_talos_webhosts|here]] instructions to see how to update this.<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=433556ReleaseEngineering/Maintenance2012-05-23T01:01:19Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure; buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also which changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page when debugging whether a reconfig caused intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 2012-05-22 1700 PDT<br />
|<br />
* {{bug|754291}} - buildbot master Makefile refers to the wrong hg<br />
* {{bug|756463}} - Please reset the priority of the Oak branch<br />
* {{bug|753501}} - followup, make try look for the Android XUL tooltool manifest in the path where it actually is<br />
|-<br />
| in production<br />
| 2012-05-18 1313 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
* {{bug|753132}} - Fix env for nightly pgo builds<br />
* {{bug|573722}} - set IS_NIGHTLY env var for nightly builds<br />
* {{bug|753132}} - Do periodic PGO builds on pgo_platform too<br />
|-<br />
| in production<br />
| 2012-05-18 0815 PDT<br />
|<br />
* {{bug|756463}} - Please bump the priority of the Oak branch temporarily<br />
|-<br />
| in production<br />
| 2012-05-17 0750 PDT<br />
|<br />
* {{bug|755434}} - l10n repacks should not execute config.py<br />
* {{bug|753132}} - Do 32-bit PGO builds on win64<br />
* {{bug|751158}} - Create tcheckboard3 to measure checkerboard with low res off<br />
* {{bug|755989}} - setup-master.py doesn't set up staging symlinks properly<br />
* {{bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py<br />
* {{bug|753501}} - Add empty tooltool manifests to some platforms<br />
|-<br />
| in production<br />
| 2012-05-14 1118 PDT<br />
|<br />
* {{bug|753488}} - Android native on Aurora -> multilocale.<br />
|-<br />
| in production<br />
| 2012-05-14 0830 PDT<br />
|<br />
* {{bug|750837}} - 13.0b2 build 2 configs. r=hwine<br />
* {{bug|754517}} - disable larch and enable pine rentable branches, r=bear<br />
* {{bug|747500}} - setup-master.py refers to files which have been removed. r=catlee<br />
* {{bug|753132}} - Use win64 machines for 32-bit pgo builds on build-system branch. r=rail<br />
* {{bug|754373}} - Use firefox-tuxedo.ini for Thunderbird builds. r=standard8<br />
* {{bug|754397}} - Disable signing at build time for Thunderbird. r=nthomas<br />
* {{bug|754430}} - Missing mozilla/ dir in Thunderbird beta build. p=standard8,r=jhopkins <br />
* {{bug|701783}} - remove scratchbox references from buildbotcustom. r=catlee<br />
* {{bug|753132}} - Support 'pgo_platform' key for deciding which machines do PGO builds. r=rail<br />
* {{bug|750744}} - Test and deploy SUT agent 1.08. r=bear <br />
|-<br />
| in production<br />
| 2012-05-11 0930 PDT<br />
|<br />
* {{Bug|754297}} - add sys.stdout.flush() to sut_tools' scripts <br />
|-<br />
| in production<br />
| 2012-05-11 0800 PDT<br />
|<br />
* {{Bug|746260}} - disable the screen resolution changing on android for jsreftest and crashtest, leave it on for reftest.<br />
* {{Bug|753868}} - use aus3-staging for Thunderbird release builds, r=jhopkins<br />
* {{Bug|744601}} - tracking bug for build and release of Thunderbird 13.0b2. r=standard8<br />
* {{Bug|752531}} - migrate dev-stage01 to scl3. r=rail <br />
|-<br />
| in production<br />
| 2012-05-10 1801 PDT<br />
|<br />
* {{bug|753488}} - make FN multi on m-c only, reenable nightly updates.<br />
* {{Bug|753625}} - Move all Thunderbird branches onto Firefox infra<br />
* {{bug|749748}} - kill l10n verify.<br />
* {{Bug|753868}} - Use aus3-staging.mozilla.org for Thunderbird release builds.<br />
* {{Bug|753865}} - Email thunderbird-drivers for Thunderbird release builds.<br />
* {{bug|748157}} - Load thunderbird_release_branches from master_config.json<br />
|-<br />
| in production<br />
| 2012-05-09 0800 PDT<br />
|<br />
* {{bug|752373}} - Stop running Android crashtest-1 until someone's ready to fix it<br />
* {{bug|751070}} - retire sjc1 VMs<br />
* {{bug|750031}} - moz2-darwin10-slave02 problem tracking<br />
* {{bug|746201}} - Remove unresolved machines from buildbot-configs<br />
* {{bug|752430}} - Swap comm-aurora over to Firefox infra<br />
* {{bug|749051}} - TryChooser: could -n be the default?<br />
* {{bug|751878}} - OSError: [Errno 13] Permission denied: '/home/ftp' for pvtbuilds2.dmz.scl3.mozilla.com<br />
|-<br />
| in production<br />
| 2012-05-03 1325 PDT<br />
|<br />
* {{Bug|744067}} - add them back<br />
|-<br />
| backed out<br />
| 2012-05-03 1200 PDT<br />
|<br />
* Backout 4ab5af03cce1 (new scl3 slaves). r=backout<br />
|-<br />
| in production<br />
| 2012-05-03 1000 PDT<br />
|<br />
* {{Bug|751165}} - revert higher priority for m-i. r=philor,ehsan<br />
* {{Bug|744067}} - new scl3 slaves; r=coop<br />
* {{Bug|744067}} - new scl3 slaves (must be in staging); r=aki<br />
* Add ACTIVE_THUNDERBIRD_RELEASE_BRANCHES. r=armenzg<br />
* {{Bug|751895}} - Preproduction release master fails trying to checkconfig. r=jhopkins<br />
* {{Bug|750973}} - copy in-tree m-a linux32 mozconfig into mozilla2 to fix aurora source release. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-03 08:00 PDT<br />
|<br />
* {{bug|751506}} - No 10.7 32-bit debug builders on Thunderbird trees. r=coop<br />
* {{bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. Add in the 'TB ' prefix to match the other Thunderbird builders. r=jhopkins<br />
* {{bug|744864}} - Update list of l10n modules that trigger changes. r=Pike<br />
* {{bug|751560}} - Temporarily disable uploading symbols on Windows 32 bit try-comm-central builds. r=jhopkins <br />
* {{bug|751514}} - Thunderbird bloat test builders should warn and halt on failure, not error on failure. r=jhopkins<br />
|-<br />
| in production<br />
| 2012-05-02 12:00 PDT<br />
|<br />
* {{Bug|750635}} - Swap try-comm-central over to pushing to the thunderbird product directory, and get it running unit tests.<br />
* Follow-up to {{bug|748628}}, fix some more issues with the Thunderbird lion builders - the names and the ccache settings. <br />
* {{Bug|739994}} - Remove references to 10.5 platform and associated slaves in configs - r=jhford <br />
|-<br />
| in production<br />
| 2012-05-02 08:00 PDT<br />
|<br />
* {{Bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. r=coop<br />
* {{Bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py. r=catlee<br />
* {{Bug|751165}} - Bump priority of mozilla-inbound to help open the tree earlier. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-01 19:00 PDT<br />
|<br />
* {{Bug|554343}} - Release builders should always clobber <br />
* {{Bug|750514}} - Disable codesighs on Thunderbird try<br />
* {{Bug|750013}} - Revert Birch customizations from {{bug|746159}}<br />
|-<br />
| in production<br />
| 2012-04-30 13:30 PDT<br />
|<br />
* {{Bug|750305}} - Use comm-central as reference branch for try-comm-central builds<br />
* {{Bug|749596}} - Enable aurora nightly updates (April 27th, 2012 edition)<br />
|-<br />
| in production<br />
| 2012-04-30 11:30 PDT<br />
|<br />
* {{Bug|749867}} - Don't try to build SpiderMonkey --enable-shark builds on 10.7 where there is no Shark, r=coop<br />
* buildbot-configs patch to reflect new all-locales locations (Bug 711534 - Configure Thunderbird release builders) r=standard8<br />
* {{Bug|669428}} - Run Jetpack tests on mozilla-inbound, r=armenzg<br />
* {{Bug|748633}} - Thunderbird try logs failing to upload. r=rail <br />
|-<br />
| in production<br />
| 2012-04-27 11:30 PDT<br />
|<br />
* {{Bug|749524}} - Upload comm-aurora snippets to comm-aurora-test channel<br />
* {{Bug|711534}} - Configure Thunderbird release builders<br />
* {{Bug|749288}} - linux comm-central builds use wrong python when calling balrog client<br />
* {{Bug|749494}} - Re-enable graph server for staging/preproduction<br />
* {{Bug|729392}} - Install toolchain needed for SPDY testing onto test machines<br />
* {{Bug|745300}} - Do Mac spidermonkey builds on 10.7<br />
<br />
|-<br />
| in production<br />
| 2012-04-26 11:00 PDT<br />
|<br />
* {{Bug|749076}} - tooltool should be invoked with -o (--overwrite) option<br />
* {{Bug|739802}} - disable b2g on aurora, beta, release<br />
|-<br />
| '''backed-out'''<br />
| 2012-04-26 09:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-26 07:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-25 22:26 PDT<br />
|<br />
* {{Bug|742131}} - fix upload host for windows try symbols<br />
|-<br />
| in production<br />
| 2012-04-25 12:00 PDT<br />
|<br />
* {{Bug|743977}} - turn off balrog client for staging and preproduction builds<br />
* {{Bug|723340}} - move dm-pvtbuild01 to a new datacenter<br />
* {{Bug|747821}} - Need to run tpr_responsiveness on Try until it's not run anywhere anymore<br />
* {{Bug|729667}} - re-create the services on dm-wwwbuild01 in scl3<br />
|-<br />
| in production<br />
| 2012-04-24 6:30 PDT<br />
|<br />
* {{Bug|747966}} - comm-central builds not firing automatically<br />
* {{bug|747862}} - Disable shark nightly builds on Thunderbird builders<br />
|-<br />
| in production<br />
| 2012-04-23 15:30 PDT<br />
|<br />
* {{Bug|746708}} - Updates builder fails running backupsnip and pushsnip<br />
* {{bug|747756}} - Bump "'make hg-bundle" timeout<br />
* {{bug|747892}} - mozilla-release's releasetestUptake value should be set to 1<br />
* {{bug|747460}} - consolidate windows peptest config files<br />
|-<br />
| in production<br />
| 2012-04-18 0645 PDT<br />
|<br />
* {{bug|745545}} - Handle Thunderbird revisions in NightlyRepackFactory.<br />
* {{bug|745547}} - Move talosCmd into SUITES loop (generateTalosBranchObjects).<br />
* {{bug|745299}} - Add hg-internal as a mirror.<br />
* {{bug|745500}} - Turn on robocop testCheck2 on tinderbox builds.<br />
* {{bug|735390}} - 12.0b6 configs + fix test-masters.sh + move l10n-changesets_mobile-aurora.json into mozilla/.<br />
* {{bug|746537}} - Increase priority for Birch, drop Maple back down<br />
|-<br />
| in production<br />
| 2012-04-17 1211 PDT<br />
|<br />
* {{bug|746159}} - make birch be like inbound<br />
* {{bug|739994}} - turn off spidermonkey builds on 10.5<br />
* {{bug|744098}} - switch xulrunner osx builds to upload tarballs<br />
* {{bug|732976}} - singlesourcefactory should generate checksums<br />
|-<br />
| in production<br />
| 2012-04-17 0630 PDT<br />
|<br />
* {{bug|739778}} - preproduction in scl3<br />
* {{bug|744119}} - decommission osx builder<br />
* {{bug|744958}} - updateSUT.py fixes<br />
* {{bug|741751}} - partner repack signing fixes<br />
* {{bug|745538}} - TB mozmill test steps<br />
* {{bug|745469}} - Turn off tinderbox mail for spidermonkey builds<br />
|-<br />
| in production<br />
| 2012-04-12 1830-45 PDT<br />
|<br />
* {{bug|722759}} - switch non-try symbols to symbols1.dmz.phx1.mozilla.com<br />
* {{bug|741657}} - Switch to aus3-staging<br />
* {{bug|730325}} - Pass product name to reallyShort()<br />
* {{bug|744495}} - xulrunner pulse messages<br />
|-<br />
| in production<br />
| 2012-04-10 various times<br />
|<br />
* {{bug|720027}} - enable lion builders<br />
|-<br />
| in production<br />
| 2012-04-10 1100 PST<br />
| <br />
* {{bug|744049}} - tcheckerboard always reports 1.0 (tegra talos web server updated to talos tip)<br />
|-<br />
| in production<br />
| 2012-04-09 0700 PDT<br />
|<br />
* {{bug|607392}} - split tagging into en-US and other<br />
* {{bug|721885}} - shut off unused branch<br />
* {{bug|400296}} - Have release automation support signing OSX builds (up to 10.7 support)<br />
|-<br />
| in production<br />
| 2012-04-04 11:00 PDT<br />
|<br />
* {{bug|690311}} - deploy newer version of cleanup.py to the foopies<br />
|-<br />
| in production<br />
| 2012-03-30 6:15 PDT<br />
|<br />
* {{bug|738588}} - add ts_paint to the android tests.<br />
* {{bug|737458}} - replace tpr_responsiveness for tp5row.<br />
* {{bug|737458}} - add tpr_responsiveness temporarily for mozilla-central and larch.<br />
* {{bug|740599}} - update staging release config files;<br />
|-<br />
| in production<br />
| 2012-03-29 6:55 PDT<br />
|<br />
* {{Bug|715193}} - If a branch does not use talos_from_source_code we should fallback to talos.mobile.old.zip (fixes esr10).<br />
|-<br />
| in production<br />
| 2012-03-28 16:35 PDT<br />
|<br />
* {{Bug|740196}} - ts_paint on Android doesn't actually work<br />
|-<br />
| in production<br />
| 2012-03-28 11:55 PDT<br />
|<br />
* {{Bug|737632}} - Remove jaegermonkey, graphics and pine to reduce builders<br />
* {{Bug|723667}} - fix Android trobocheck and ts_paint tests.<br />
* {{Bug|739486}} - test-masters.sh should run ./setup_master.py -t<br />
* add option to setup_master.py to print error logs when hit<br />
* {{Bug|723667}} - enable trobopan and tcheckerboard by default (not for m-a/m-b/m-r/1.9.2)<br />
* {{Bug|627182}} - Automate signing and publishing of XULRunner builds. r=bhearsum <br />
|-<br />
| in production<br />
| 2012-03-27 12:30 PDT<br />
|<br />
* {{Bug|723667}} - Add trobopan and trobocheck to m-c/m-i. r=jmaher<br />
|-<br />
| in production<br />
| 2012-03-23 11:30 PDT<br />
|<br />
*{{bug|627182}}<br />
*{{bug|738685}}<br />
*{{bug|734223}}<br />
*{{bug|738286}}<br />
*{{bug|719491}}<br />
*{{bug|737656}}<br />
*{{bug|715193}}<br />
*{{bug|702595}}<br />
*{{bug|735383}}<br />
|-<br />
| in production<br />
| 2012-03-27 01:35 PDT<br />
|<br />
* {{Bug|739505}} - [http://hg.mozilla.org/build/buildbot-configs/rev/3c424821358a Fix talos] on beta<br />
|-<br />
| in production<br />
| 2012-03-23 7:00 PDT<br />
|<br />
* {{Bug|737864}} - Tweak release category for Thunderbird.<br />
* {{Bug|737458}} - add tp5row side by side and cleanup config.py.<br />
* {{Bug|737581}} - enable peptest on m-c and m-i.<br />
* {{Bug|713846}} - Treat 'fennec' builds as having product 'mobile' for the purposes of uploading logs.<br />
|-<br />
| backout<br />
| 2012-03-21 11:45 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
|-<br />
| in production<br />
| 2012-03-21 9:30 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
* {{Bug|697150}}. (Bv1) Remove 'ac_add_options --disable-installer' for XulRunner current branches.<br />
* {{Bug|733394}}. Add leak test logic to mozilla-beta.<br />
* {{Bug|736587}}. Enable Android for pine.<br />
|-<br />
| in production<br />
| 2012-03-20 8:30 PDT<br />
|<br />
* {{bug|723667}} - enable talos robocop for pine and only for native tests<br />
* {{bug|737077}} - re-enable aurora updates<br />
* {{bug|713846}} - unified log handling<br />
* {{bug|734320}} - fix jetpack log parsing<br />
* {{bug|737049}} - run reftest-no-accel correctly<br />
* {{bug|723386}} - fix reserved slaves handling<br />
|-<br />
| in production<br />
| 2012-03-19 10:30 PDT<br />
|<br />
* {{bug|736284}} - re-enable aurora updates<br />
|-<br />
| in production<br />
| 2012-03-16 8:45 PDT<br />
|<br />
* {{bug|734996}} - fennec beta release update channel -> beta.<br />
* {{Bug|734221}} - deploy updateSUT.py and upgrade the boards to SUT Agent version 1.07.<br />
* {{Bug|734996}} - source: get a nonce earlier<br />
|-<br />
| in production<br />
| 2012-03-13 17:00 PDT<br />
|<br />
* {{bug|735201}} - Remove leading ../ from symbols path for tegras<br />
* {{bug|735421}} - Disable Aurora updates until the Aurora 13 has stabilized<br />
|-<br />
| in production<br />
| 2012-03-12 16:00 PDT<br />
|<br />
* {{bug|734417}} - enable mobile builds on the profiling branch<br />
* {{bug|731617}} - No nightly builds on maple branch since 27 Feb<br />
* {{bug|732285}} - Set MINIDUMP_STACKWALK for Android<br />
* {{bug|733668}} - Include "ERROR: We tried to download the talos.json file but something failed" and "ERROR 500: Internal Server Error" for Talos hgweb operations to RETRY<br />
* {{bug|630518}} - l10n verify, update verify, and final verification builders need to set "branch" when reporting to clobberer<br />
|-<br />
| in production<br />
| 2012-03-08 09:00 PT<br />
|<br />
* {{Bug|731814}} - Add checks that we're not exceeding max # of builders per slave.<br />
* {{Bug|731617}} - Remove win64 for now in maple.<br />
* {{Bug|731339}} - Remove slaves that are not production<br />
* {{Bug|732730}} - Remove non-functional and unwanted pgo_platforms overrides<br />
* {{bug|732110}} - remove buildbot-configs/mozilla2/mobile<br />
* {{Bug|728271}} - Post to graphs.m.o instead of graphs-old.m.o<br />
* {{Bug|729144}} - Post to graphs.allizom.org.<br />
* {{Bug|723667}} - Add robocop disabled.<br />
* {{bug|730050}} - TryBuildFactory looks in the wrong place for malloc.log<br />
* {{Bug|712538}} - leaktest parity on try<br />
* {{Bug|723667}} - Use talos.zip for tegras and prep work for talos robocop<br />
|-<br />
| in production<br />
| 2012-03-06 06:30 PT<br />
|<br />
* {{bug|732500}} - Enable nightly updates on maple<br />
* {{bug|732699}} - ESR release automation should push to mirrors automatically<br />
* {{bug|730918}} - Android on esr10 is busted, no doubt by branding since that always seems to be the problem<br />
* {{bug|561754}} - Don't download symbols for test runs, pass symbol zip URL as symbols path<br />
* {{bug|732516}} - l10n verification shouldn't rsync zip files<br />
* {{bug|732468}} - Add the ridiculous "abort: error:" to the list of hg errors that trigger RETRY<br />
|-<br />
| in production<br />
| 2012-03-01 7:30 PT<br />
|<br />
* {{Bug|721360}} - Do what changeset 9a0c428bdb69 really wanted to do.<br />
* {{Bug|561754}} - Disable symbol download on demand for mozilla-1.9.2 branch.<br />
* {{Bug|660480}} - mark as RETRY for common tegra errors<br />
* {{Bug|729918}} - start_uptake_monitoring builder uses wrong script_repo_revision property.<br />
* {{Bug|561754}} - Download symbols on demand by default for desktop unittests.<br />
|-<br />
| in production<br />
| 2012-02-27 7:45 PT<br />
|<br />
* {{bug|729077}} - recycle talos-r4-lion-083 and talos-r3-snow-081 as mac-signing[12]<br />
* Fix up staging and preproduction test slave lists.<br />
* {{Bug|729426}} - Do periodic PGO on services-central<br />
* {{bug|727580}} - linux-android for esr10, without merging 11.0 to m-r.<br />
|-<br />
| in production<br />
| 2012-02-21 9:30 PT<br />
|<br />
* {{bug|719511}} - add optional reboot command to ScriptFactory<br />
* {{Bug|725292}} - some repacks failed in 11.0b2 because of missing tokens<br />
* {{Bug|728104}} - AggregatingScheduler resets its state on reconfig<br />
* {{Bug|722608}} - Remove android signature verification<br />
* {{Bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|719511}} - Reenable peptest + add reboot_command<br />
* {{bug|712678}} - android-xul different update channel from android<br />
<br />
|-<br />
| in production<br />
| 20120217 1148 PST<br />
|<br />
* {{bug|721822}} - remove talos_from_code.py from the tools repo<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remot<br />
|}<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 488bc187a3ef<br />
| {{bug|753822}}<br />
| 20120510 1045 AM PDT<br />
| armenzg<br />
|}<br />
<br />
Find [[ReleaseEngineering:Buildduty#Update_mobile_talos_webhosts|here]] instructions to see how to update this.<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=432019ReleaseEngineering/Maintenance2012-05-17T14:55:04Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure; buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also which changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page when debugging whether a reconfig caused intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 2012-05-17 0750 PDT<br />
|<br />
* {{bug|755434}} - l10n repacks should not execute config.py<br />
* {{bug|753132}} - Do 32-bit PGO builds on win64<br />
* {{bug|751158}} - Create tcheckboard3 to measure checkerboard with low res off<br />
* {{bug|755989}} - setup-master.py doesn't set up staging symlinks properly<br />
* {{bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py<br />
* {{bug|753501}} - Add empty tooltool manifests to some platforms<br />
|-<br />
| in production<br />
| 2012-05-14 1118 PDT<br />
|<br />
* {{bug|753488}} - Android native on Aurora -> multilocale.<br />
|-<br />
| in production<br />
| 2012-05-14 0830 PDT<br />
|<br />
* {{bug|750837}} - 13.0b2 build 2 configs. r=hwine<br />
* {{bug|754517}} - disable larch and enable pine rentable branches, r=bear<br />
* {{bug|747500}} - setup-master.py refers to files which have been removed. r=catlee<br />
* {{bug|753132}} - Use win64 machines for 32-bit pgo builds on build-system branch. r=rail<br />
* {{bug|754373}} - Use firefox-tuxedo.ini for Thunderbird builds. r=standard8<br />
* {{bug|754397}} - Disable signing at build time for Thunderbird. r=nthomas<br />
* {{bug|754430}} - Missing mozilla/ dir in Thunderbird beta build. p=standard8,r=jhopkins <br />
* {{bug|701783}} - remove scratchbox references from buildbotcustom. r=catlee<br />
* {{bug|753132}} - Support 'pgo_platform' key for deciding which machines do PGO builds. r=rail<br />
* {{bug|750744}} - Test and deploy SUT agent 1.08. r=bear <br />
|-<br />
| in production<br />
| 2012-05-11 0930 PDT<br />
|<br />
* {{Bug|754297}} - add sys.stdout.flush() to sut_tools' scripts <br />
|-<br />
| in production<br />
| 2012-05-11 0800 PDT<br />
|<br />
* {{Bug|746260}} - disable the screen resolution changing on android for jsreftest and crashtest, leave it on for reftest.<br />
* {{Bug|753868}} - use aus3-staging for Thunderbird release builds, r=jhopkins<br />
* {{Bug|744601}} - tracking bug for build and release of Thunderbird 13.0b2. r=standard8<br />
* {{Bug|752531}} - migrate dev-stage01 to scl3. r=rail <br />
|-<br />
| in production<br />
| 2012-05-10 1801 PDT<br />
|<br />
* {{bug|753488}} - make FN multi on m-c only, reenable nightly updates.<br />
* {{Bug|753625}} - Move all Thunderbird branches onto Firefox infra<br />
* {{bug|749748}} - kill l10n verify.<br />
* {{Bug|753868}} - Use aus3-staging.mozilla.org for Thunderbird release builds.<br />
* {{Bug|753865}} - Email thunderbird-drivers for Thunderbird release builds.<br />
* {{bug|748157}} - Load thunderbird_release_branches from master_config.json<br />
|-<br />
| in production<br />
| 2012-05-09 0800 PDT<br />
|<br />
* {{bug|752373}} - Stop running Android crashtest-1 until someone's ready to fix it<br />
* {{bug|751070}} - retire sjc1 VMs<br />
* {{bug|750031}} - moz2-darwin10-slave02 problem tracking<br />
* {{bug|746201}} - Remove unresolved machines from buildbot-configs<br />
* {{bug|752430}} - Swap comm-aurora over to Firefox infra<br />
* {{bug|749051}} - TryChooser: could -n be the default?<br />
* {{bug|751878}} - OSError: [Errno 13] Permission denied: '/home/ftp' for pvtbuilds2.dmz.scl3.mozilla.com<br />
|-<br />
| in production<br />
| 2012-05-03 1325 PDT<br />
|<br />
* {{Bug|744067}} - add them back<br />
|-<br />
| backed out<br />
| 2012-05-03 1200 PDT<br />
|<br />
* Backout 4ab5af03cce1 (new scl3 slaves). r=backout<br />
|-<br />
| in production<br />
| 2012-05-03 1000 PDT<br />
|<br />
* {{Bug|751165}} - revert higher priority for m-i. r=philor,ehsan<br />
* {{Bug|744067}} - new scl3 slaves; r=coop<br />
* {{Bug|744067}} - new scl3 slaves (must be in staging); r=aki<br />
* Add ACTIVE_THUNDERBIRD_RELEASE_BRANCHES. r=armenzg<br />
* {{Bug|751895}} - Preproduction release master fails trying to checkconfig. r=jhopkins<br />
* {{Bug|750973}} - copy in-tree m-a linux32 mozconfig into mozilla2 to fix aurora source release. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-03 08:00 PDT<br />
|<br />
* {{bug|751506}} - No 10.7 32-bit debug builders on Thunderbird trees. r=coop<br />
* {{bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. Add in the 'TB ' prefix to match the other Thunderbird builders. r=jhopkins<br />
* {{bug|744864}} - Update list of l10n modules that trigger changes. r=Pike<br />
* {{bug|751560}} - Temporarily disable uploading symbols on Windows 32 bit try-comm-central builds. r=jhopkins <br />
* {{bug|751514}} - Thunderbird bloat test builders should warn and halt on failure, not error on failure. r=jhopkins<br />
|-<br />
| in production<br />
| 2012-05-02 12:00 PDT<br />
|<br />
* {{Bug|750635}} - Swap try-comm-central over to pushing to the thunderbird product directory, and get it running unit tests.<br />
* Follow-up to {{bug|748628}}, fix some more issues with the Thunderbird lion builders - the names and the ccache settings. <br />
* {{Bug|739994}} - Remove references to 10.5 platform and associated slaves in configs - r=jhford <br />
|-<br />
| in production<br />
| 2012-05-02 08:00 PDT<br />
|<br />
* {{Bug|748628}} - Switch Thunderbird builds to use OS X 10.7 build machines. r=coop<br />
* {{Bug|743304}} - After SSH failure, Android XUL mozilla-central nightly builder spams a bunch of "SyntaxError: invalid syntax" when running retry.py / balrog-client.py. r=catlee<br />
* {{Bug|751165}} - Bump priority of mozilla-inbound to help open the tree earlier. r=catlee <br />
|-<br />
| in production<br />
| 2012-05-01 19:00 PDT<br />
|<br />
* {{Bug|554343}} - Release builders should always clobber <br />
* {{Bug|750514}} - Disable codesighs on Thunderbird try<br />
* {{Bug|750013}} - Revert Birch customizations from {{bug|746159}}<br />
|-<br />
| in production<br />
| 2012-04-30 13:30 PDT<br />
|<br />
* {{Bug|750305}} - Use comm-central as reference branch for try-comm-central builds<br />
* {{Bug|749596}} - Enable aurora nightly updates (April 27th, 2012 edition)<br />
|-<br />
| in production<br />
| 2012-04-30 11:30 PDT<br />
|<br />
* {{Bug|749867}} - Don't try to build SpiderMonkey --enable-shark builds on 10.7 where there is no Shark, r=coop<br />
* buildbot-configs patch to reflect new all-locales locations (Bug 711534 - Configure Thunderbird release builders) r=standard8<br />
* {{Bug|669428}} - Run Jetpack tests on mozilla-inbound, r=armenzg<br />
* {{Bug|748633}} - Thunderbird try logs failing to upload. r=rail <br />
|-<br />
| in production<br />
| 2012-04-27 11:30 PDT<br />
|<br />
* {{Bug|749524}} - Upload comm-aurora snippets to comm-aurora-test channel<br />
* {{Bug|711534}} - Configure Thunderbird release builders<br />
* {{Bug|749288}} - linux comm-central builds use wrong python when calling balrog client<br />
* {{Bug|749494}} - Re-enable graph server for staging/preproduction<br />
* {{Bug|729392}} - Install toolchain needed for SPDY testing onto test machines<br />
* {{Bug|745300}} - Do Mac spidermonkey builds on 10.7<br />
<br />
|-<br />
| in production<br />
| 2012-04-26 11:00 PDT<br />
|<br />
* {{Bug|749076}} - tooltool should be invoked with -o (--overwrite) option<br />
* {{Bug|739802}} - disable b2g on aurora, beta, release<br />
|-<br />
| '''backed-out'''<br />
| 2012-04-26 09:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-26 07:00 PDT<br />
|<br />
* {{Bug|742131}} - deploy node.exe to fedora slaves<br />
|-<br />
| in production<br />
| 2012-04-25 22:26 PDT<br />
|<br />
* {{Bug|742131}} - fix upload host for windows try symbols<br />
|-<br />
| in production<br />
| 2012-04-25 12:00 PDT<br />
|<br />
* {{Bug|743977}} - turn off balrog client for staging and preproduction builds<br />
* {{Bug|723340}} - move dm-pvtbuild01 to a new datacenter<br />
* {{Bug|747821}} - Need to run tpr_responsiveness on Try until it's not run anywhere anymore<br />
* {{Bug|729667}} - re-create the services on dm-wwwbuild01 in scl3<br />
|-<br />
| in production<br />
| 2012-04-24 6:30 PDT<br />
|<br />
* {{Bug|747966}} - comm-central builds not firing automatically<br />
* {{bug|747862}} - Disable shark nightly builds on Thunderbird builders<br />
|-<br />
| in production<br />
| 2012-04-23 15:30 PDT<br />
|<br />
* {{Bug|746708}} - Updates builder fails running backupsnip and pushsnip<br />
* {{bug|747756}} - Bump "make hg-bundle" timeout<br />
* {{bug|747892}} - mozilla-release's releasetestUptake value should be set to 1<br />
* {{bug|747460}} - consolidate windows peptest config files<br />
|-<br />
| in production<br />
| 2012-04-18 0645 PDT<br />
|<br />
* {{bug|745545}} - Handle Thunderbird revisions in NightlyRepackFactory.<br />
* {{bug|745547}} - Move talosCmd into SUITES loop (generateTalosBranchObjects).<br />
* {{bug|745299}} - Add hg-internal as a mirror.<br />
* {{bug|745500}} - Turn on robocop testCheck2 on tinderbox builds.<br />
* {{bug|735390}} - 12.0b6 configs + fix test-masters.sh + move l10n-changesets_mobile-aurora.json into mozilla/.<br />
* {{bug|746537}} - Increase priority for Birch, drop Maple back down<br />
|-<br />
| in production<br />
| 2012-04-17 1211 PDT<br />
|<br />
* {{bug|746159}} - make birch be like inbound<br />
* {{bug|739994}} - turn off spidermonkey builds on 10.5<br />
* {{bug|744098}} - switch xulrunner osx builds to upload tarballs<br />
* {{bug|732976}} - singlesourcefactory should generate checksums<br />
|-<br />
| in production<br />
| 2012-04-17 0630 PDT<br />
|<br />
* {{bug|739778}} - preproduction in scl3<br />
* {{bug|744119}} - decommission osx builder<br />
* {{bug|744958}} - updateSUT.py fixes<br />
* {{bug|741751}} - partner repack signing fixes<br />
* {{bug|745538}} - TB mozmill test steps<br />
* {{bug|745469}} - Turn off tinderbox mail for spidermonkey builds<br />
|-<br />
| in production<br />
| 2012-04-12 1830-45 PDT<br />
|<br />
* {{bug|722759}} - switch non-try symbols to symbols1.dmz.phx1.mozilla.com<br />
* {{bug|741657}} - Switch to aus3-staging<br />
* {{bug|730325}} - Pass product name to reallyShort()<br />
* {{bug|744495}} - xulrunner pulse messages<br />
|-<br />
| in production<br />
| 2012-04-10 various times<br />
|<br />
* {{bug|720027}} - enable lion builders<br />
|-<br />
| in production<br />
| 2012-04-10 1100 PST<br />
| <br />
* {{bug|744049}} - tcheckerboard always reports 1.0 (tegra talos web server updated to talos tip)<br />
|-<br />
| in production<br />
| 2012-04-09 0700 PDT<br />
|<br />
* {{bug|607392}} - split tagging into en-US and other<br />
* {{bug|721885}} - shut off unused branch<br />
* {{bug|400296}} - Have release automation support signing OSX builds (up to 10.7 support)<br />
|-<br />
| in production<br />
| 2012-04-04 11:00 PDT<br />
|<br />
* {{bug|690311}} - deploy newer version of cleanup.py to the foopies<br />
|-<br />
| in production<br />
| 2012-03-30 6:15 PDT<br />
|<br />
* {{bug|738588}} - add ts_paint to the android tests.<br />
* {{bug|737458}} - replace tpr_responsiveness for tp5row.<br />
* {{bug|737458}} - add tpr_responsiveness temporarily for mozilla-central and larch.<br />
* {{bug|740599}} - update staging release config files;<br />
|-<br />
| in production<br />
| 2012-03-29 6:55 PDT<br />
|<br />
* {{Bug|715193}} - If a branch does not use talos_from_source_code we should fallback to talos.mobile.old.zip (fixes esr10).<br />
|-<br />
| in production<br />
| 2012-03-28 16:35 PDT<br />
|<br />
* {{Bug|740196}} - ts_paint on Android doesn't actually work<br />
|-<br />
| in production<br />
| 2012-03-28 11:55 PDT<br />
|<br />
* {{Bug|737632}} - Remove jaegermonkey, graphics and pine to reduce builders<br />
* {{Bug|723667}} - fix Android trobocheck and ts_paint tests.<br />
* {{Bug|739486}} - test-masters.sh should run ./setup_master.py -t<br />
* add option to setup_master.py to print error logs when hit<br />
* {{Bug|723667}} - enable trobopan and tcheckerboard by default (not for m-a/m-b/m-r/1.9.2)<br />
* {{Bug|627182}} - Automate signing and publishing of XULRunner builds. r=bhearsum <br />
|-<br />
| in production<br />
| 2012-03-27 12:30 PDT<br />
|<br />
* {{Bug|723667}} - Add trobopan and trobocheck to m-c/m-i. r=jmaher<br />
|-<br />
| in production<br />
| 2012-03-23 11:30 PDT<br />
|<br />
*{{bug|627182}}<br />
*{{bug|738685}}<br />
*{{bug|734223}}<br />
*{{bug|738286}}<br />
*{{bug|719491}}<br />
*{{bug|737656}}<br />
*{{bug|715193}}<br />
*{{bug|702595}}<br />
*{{bug|735383}}<br />
|-<br />
| in production<br />
| 2012-03-27 01:35 PDT<br />
|<br />
* {{Bug|739505}} - [http://hg.mozilla.org/build/buildbot-configs/rev/3c424821358a Fix talos] on beta<br />
|-<br />
| in production<br />
| 2012-03-23 7:00 PDT<br />
|<br />
* {{Bug|737864}} - Tweak release category for Thunderbird.<br />
* {{Bug|737458}} - add tp5row side by side and cleanup config.py.<br />
* {{Bug|737581}} - enable peptest on m-c and m-i.<br />
* {{Bug|713846}} - Treat 'fennec' builds as having product 'mobile' for the purposes of uploading logs.<br />
|-<br />
| backout<br />
| 2012-03-21 11:45 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
|-<br />
| in production<br />
| 2012-03-21 9:30 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
* {{Bug|697150}}. (Bv1) Remove 'ac_add_options --disable-installer' for XulRunner current branches.<br />
* {{Bug|733394}}. Add leak test logic to mozilla-beta.<br />
* {{Bug|736587}}. Enable Android for pine.<br />
|-<br />
| in production<br />
| 2012-03-20 8:30 PDT<br />
|<br />
* {{bug|723667}} - enable talos robocop for pine and only for native tests<br />
* {{bug|737077}} - re-enable aurora updates<br />
* {{bug|713846}} - unified log handling<br />
* {{bug|734320}} - fix jetpack log parsing<br />
* {{bug|737049}} - run reftest-no-accel correctly<br />
* {{bug|723386}} - fix reserved slaves handling<br />
|-<br />
| in production<br />
| 2012-03-19 10:30 PDT<br />
|<br />
* {{bug|736284}} - re-enable aurora updates<br />
|-<br />
| in production<br />
| 2012-03-16 8:45 PDT<br />
|<br />
* {{bug|734996}} - fennec beta release update channel -> beta.<br />
* {{Bug|734221}} - deploy updateSUT.py and upgrade the boards to SUT Agent version 1.07.<br />
* {{Bug|734996}} - source: get a nonce earlier<br />
|-<br />
| in production<br />
| 2012-03-13 17:00 PDT<br />
|<br />
* {{bug|735201}} - Remove leading ../ from symbols path for tegras<br />
* {{bug|735421}} - Disable Aurora updates until the Aurora 13 has stabilized<br />
|-<br />
| in production<br />
| 2012-03-12 16:00 PDT<br />
|<br />
* {{bug|734417}} - enable mobile builds on the profiling branch<br />
* {{bug|731617}} - No nightly builds on maple branch since 27 Feb<br />
* {{bug|732285}} - Set MINIDUMP_STACKWALK for Android<br />
* {{bug|733668}} - Include "ERROR: We tried to download the talos.json file but something failed" and "ERROR 500: Internal Server Error" for Talos hgweb operations to RETRY<br />
* {{bug|630518}} - l10n verify, update verify, and final verification builders need to set "branch" when reporting to clobberer<br />
|-<br />
| in production<br />
| 2012-03-08 09:00 PT<br />
|<br />
* {{Bug|731814}} - Add checks that we're not exceeding max # of builders per slave.<br />
* {{Bug|731617}} - Remove win64 for now in maple.<br />
* {{Bug|731339}} - Remove slaves that are not production<br />
* {{Bug|732730}} - Remove non-functional and unwanted pgo_platforms overrides<br />
* {{bug|732110}} - remove buildbot-configs/mozilla2/mobile<br />
* {{Bug|728271}} - Post to graphs.m.o instead of graphs-old.m.o<br />
* {{Bug|729144}} - Post to graphs.allizom.org.<br />
* {{Bug|723667}} - Add robocop disabled.<br />
* {{bug|730050}} - TryBuildFactory looks in the wrong place for malloc.log<br />
* {{Bug|712538}} - leaktest parity on try<br />
* {{Bug|723667}} - Use talos.zip for tegras and prep work for talos robocop<br />
|-<br />
| in production<br />
| 2012-03-06 06:30 PT<br />
|<br />
* {{bug|732500}} - Enable nightly updates on maple<br />
* {{bug|732699}} - ESR release automation should push to mirrors automatically<br />
* {{bug|730918}} - Android on esr10 is busted, no doubt by branding since that always seems to be the problem<br />
* {{bug|561754}} - Don't download symbols for test runs, pass symbol zip URL as symbols path<br />
* {{bug|732516}} - l10n verification shouldn't rsync zip files<br />
* {{bug|732468}} - Add the ridiculous "abort: error:" to the list of hg errors that trigger RETRY<br />
|-<br />
| in production<br />
| 2012-03-01 7:30 PT<br />
|<br />
* {{Bug|721360}} - Do what changeset 9a0c428bdb69 really wanted to do.<br />
* {{Bug|561754}} - Disable symbol download on demand for mozilla-1.9.2 branch.<br />
* {{Bug|660480}} - mark as RETRY for common tegra errors<br />
* {{Bug|729918}} - start_uptake_monitoring builder uses wrong script_repo_revision property.<br />
* {{Bug|561754}} - Download symbols on demand by default for desktop unittests.<br />
|-<br />
| in production<br />
| 2012-02-27 7:45 PT<br />
|<br />
* {{bug|729077}} - recycle talos-r4-lion-083 and talos-r3-snow-081 as mac-signing[12]<br />
* Fix up staging and preproduction test slave lists.<br />
* {{Bug|729426}} - Do periodic PGO on services-central<br />
* {{bug|727580}} - linux-android for esr10, without merging 11.0 to m-r.<br />
|-<br />
| in production<br />
| 2012-02-21 9:30 PT<br />
|<br />
* {{bug|719511}} - add optional reboot command to ScriptFactory<br />
* {{Bug|725292}} - some repacks failed in 11.0b2 because of missing tokens<br />
* {{Bug|728104}} - AggregatingScheduler resets its state on reconfig<br />
* {{Bug|722608}} - Remove android signature verification<br />
* {{Bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|719511}} - Reenable peptest + add reboot_command<br />
* {{bug|712678}} - android-xul different update channel from android<br />
<br />
|-<br />
| in production<br />
| 20120217 1148 PST<br />
|<br />
* {{bug|721822}} - remove talos_from_code.py from the tools repo<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remote-tpan by default<br />
|}<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 488bc187a3ef<br />
| {{bug|753822}}<br />
| 20120510 1045 AM PDT<br />
| armenzg<br />
|}<br />
<br />
See [[ReleaseEngineering:Buildduty#Update_mobile_talos_webhosts|these instructions]] for how to update this.<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=418629ReleaseEngineering/Maintenance2012-04-10T18:59:04Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure; buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also of what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page when debugging intermittent problems to track down whether a reconfig caused them.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 2012-04-10 1100 PST<br />
| <br />
* {{bug|744049}} - tcheckerboard always reports 1.0 (tegra talos web server updated to talos tip)<br />
|-<br />
| in production<br />
| 2012-04-09 0700 PDT<br />
|<br />
* {{bug|607392}} - split tagging into en-US and other<br />
* {{bug|721885}} - shut off unused branch<br />
* {{bug|400296}} - Have release automation support signing OSX builds (up to 10.7 support)<br />
|-<br />
| in production<br />
| 2012-04-04 11:00 PDT<br />
|<br />
* {{bug|690311}} - deploy newer version of cleanup.py to the foopies<br />
|-<br />
| in production<br />
| 2012-03-30 6:15 PDT<br />
|<br />
* {{bug|738588}} - add ts_paint to the android tests.<br />
* {{bug|737458}} - replace tpr_responsiveness for tp5row.<br />
* {{bug|737458}} - add tpr_responsiveness temporarily for mozilla-central and larch.<br />
* {{bug|740599}} - update staging release config files;<br />
|-<br />
| in production<br />
| 2012-03-29 6:55 PDT<br />
|<br />
* {{Bug|715193}} - If a branch does not use talos_from_source_code we should fallback to talos.mobile.old.zip (fixes esr10).<br />
|-<br />
| in production<br />
| 2012-03-28 16:35 PDT<br />
|<br />
* {{Bug|740196}} - ts_paint on Android doesn't actually work<br />
|-<br />
| in production<br />
| 2012-03-28 11:55 PDT<br />
|<br />
* {{Bug|737632}} - Remove jaegermonkey, graphics and pine to reduce builders<br />
* {{Bug|723667}} - fix Android trobocheck and ts_paint tests.<br />
* {{Bug|739486}} - test-masters.sh should run ./setup_master.py -t<br />
* add option to setup_master.py to print error logs when hit<br />
* {{Bug|723667}} - enable trobopan and tcheckerboard by default (not for m-a/m-b/m-r/1.9.2)<br />
* {{Bug|627182}} - Automate signing and publishing of XULRunner builds. r=bhearsum <br />
|-<br />
| in production<br />
| 2012-03-27 12:30 PDT<br />
|<br />
* {{Bug|723667}} - Add trobopan and trobocheck to m-c/m-i. r=jmaher<br />
|-<br />
| in production<br />
| 2012-03-23 11:30 PDT<br />
|<br />
*{{bug|627182}}<br />
*{{bug|738685}}<br />
*{{bug|734223}}<br />
*{{bug|738286}}<br />
*{{bug|719491}}<br />
*{{bug|737656}}<br />
*{{bug|715193}}<br />
*{{bug|702595}}<br />
*{{bug|735383}}<br />
|-<br />
| in production<br />
| 2012-03-27 01:35 PDT<br />
|<br />
* {{Bug|739505}} - [http://hg.mozilla.org/build/buildbot-configs/rev/3c424821358a Fix talos] on beta<br />
|-<br />
| in production<br />
| 2012-03-23 7:00 PDT<br />
|<br />
* {{Bug|737864}} - Tweak release category for Thunderbird.<br />
* {{Bug|737458}} - add tp5row side by side and cleanup config.py.<br />
* {{Bug|737581}} - enable peptest on m-c and m-i.<br />
* {{Bug|713846}} - Treat 'fennec' builds as having product 'mobile' for the purposes of uploading logs.<br />
|-<br />
| backout<br />
| 2012-03-21 11:45 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
|-<br />
| in production<br />
| 2012-03-21 9:30 PDT<br />
|<br />
* {{Bug|737427}}. Use 1024x768 as the screen resolution for the tegras.<br />
* {{Bug|697150}}. (Bv1) Remove 'ac_add_options --disable-installer' for XulRunner current branches.<br />
* {{Bug|733394}}. Add leak test logic to mozilla-beta.<br />
* {{Bug|736587}}. Enable Android for pine.<br />
|-<br />
| in production<br />
| 2012-03-20 8:30 PDT<br />
|<br />
* {{bug|723667}} - enable talos robocop for pine and only for native tests<br />
* {{bug|737077}} - re-enable aurora updates<br />
* {{bug|713846}} - unified log handling<br />
* {{bug|734320}} - fix jetpack log parsing<br />
* {{bug|737049}} - run reftest-no-accel correctly<br />
* {{bug|723386}} - fix reserved slaves handling<br />
|-<br />
| in production<br />
| 2012-03-19 10:30 PDT<br />
|<br />
* {{bug|736284}} - re-enable aurora updates<br />
|-<br />
| in production<br />
| 2012-03-16 8:45 PDT<br />
|<br />
* {{bug|734996}} - fennec beta release update channel -> beta.<br />
* {{Bug|734221}} - deploy updateSUT.py and upgrade the boards to SUT Agent version 1.07.<br />
* {{Bug|734996}} - source: get a nonce earlier<br />
|-<br />
| in production<br />
| 2012-03-13 17:00 PDT<br />
|<br />
* {{bug|735201}} - Remove leading ../ from symbols path for tegras<br />
* {{bug|735421}} - Disable Aurora updates until the Aurora 13 has stabilized<br />
|-<br />
| in production<br />
| 2012-03-12 16:00 PDT<br />
|<br />
* {{bug|734417}} - enable mobile builds on the profiling branch<br />
* {{bug|731617}} - No nightly builds on maple branch since 27 Feb<br />
* {{bug|732285}} - Set MINIDUMP_STACKWALK for Android<br />
* {{bug|733668}} - Include "ERROR: We tried to download the talos.json file but something failed" and "ERROR 500: Internal Server Error" for Talos hgweb operations to RETRY<br />
* {{bug|630518}} - l10n verify, update verify, and final verification builders need to set "branch" when reporting to clobberer<br />
|-<br />
| in production<br />
| 2012-03-08 09:00 PT<br />
|<br />
* {{Bug|731814}} - Add checks that we're not exceeding max # of builders per slave.<br />
* {{Bug|731617}} - Remove win64 for now in maple.<br />
* {{Bug|731339}} - Remove slaves that are not production<br />
* {{Bug|732730}} - Remove non-functional and unwanted pgo_platforms overrides<br />
* {{bug|732110}} - remove buildbot-configs/mozilla2/mobile<br />
* {{Bug|728271}} - Post to graphs.m.o instead of graphs-old.m.o<br />
* {{Bug|729144}} - Post to graphs.allizom.org.<br />
* {{Bug|723667}} - Add robocop disabled.<br />
* {{bug|730050}} - TryBuildFactory looks in the wrong place for malloc.log<br />
* {{Bug|712538}} - leaktest parity on try<br />
* {{Bug|723667}} - Use talos.zip for tegras and prep work for talos robocop<br />
|-<br />
| in production<br />
| 2012-03-06 06:30 PT<br />
|<br />
* {{bug|732500}} - Enable nightly updates on maple<br />
* {{bug|732699}} - ESR release automation should push to mirrors automatically<br />
* {{bug|730918}} - Android on esr10 is busted, no doubt by branding since that always seems to be the problem<br />
* {{bug|561754}} - Don't download symbols for test runs, pass symbol zip URL as symbols path<br />
* {{bug|732516}} - l10n verification shouldn't rsync zip files<br />
* {{bug|732468}} - Add the ridiculous "abort: error:" to the list of hg errors that trigger RETRY<br />
|-<br />
| in production<br />
| 2012-03-01 7:30 PT<br />
|<br />
* {{Bug|721360}} - Do what changeset 9a0c428bdb69 really wanted to do.<br />
* {{Bug|561754}} - Disable symbol download on demand for mozilla-1.9.2 branch.<br />
* {{Bug|660480}} - mark as RETRY for common tegra errors<br />
* {{Bug|729918}} - start_uptake_monitoring builder uses wrong script_repo_revision property.<br />
* {{Bug|561754}} - Download symbols on demand by default for desktop unittests.<br />
|-<br />
| in production<br />
| 2012-02-27 7:45 PT<br />
|<br />
* {{bug|729077}} - recycle talos-r4-lion-083 and talos-r3-snow-081 as mac-signing[12]<br />
* Fix up staging and preproduction test slave lists.<br />
* {{Bug|729426}} - Do periodic PGO on services-central<br />
* {{bug|727580}} - linux-android for esr10, without merging 11.0 to m-r.<br />
|-<br />
| in production<br />
| 2012-02-21 9:30 PT<br />
|<br />
* {{bug|719511}} - add optional reboot command to ScriptFactory<br />
* {{Bug|725292}} - some repacks failed in 11.0b2 because of missing tokens<br />
* {{Bug|728104}} - AggregatingScheduler resets its state on reconfig<br />
* {{Bug|722608}} - Remove android signature verification<br />
* {{Bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|719511}} - Reenable peptest + add reboot_command<br />
* {{bug|712678}} - android-xul different update channel from android<br />
<br />
|-<br />
| in production<br />
| 20120217 1148 PST<br />
|<br />
* {{bug|721822}} - remove talos_from_code.py from the tools repo<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remote-tpan by default<br />
|}<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 3efbac1f685a<br />
| unknown<br />
| unknown<br />
| unknown<br />
|}<br />
<br />
Update Procedure:<br />
ssh to bm-remote-talos-webhost-01<br />
cd /var/www/html/talos<br />
hg pull && hg up<br />
rsync -azf --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -azf --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
<br />
Servers:<br />
* bm-remote-talos-webhost-01.build.mozilla.org<br />
* bm-remote-talos-webhost-02.build.mozilla.org<br />
* bm-remote-talos-webhost-03.build.mozilla.org<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Process_release_email&diff=404967ReleaseEngineering/How To/Process release email2012-03-07T16:57:51Z<p>Bear: </p>
<hr />
<div>{{Release Engineering How To|Process email to release@}}<br />
This is a list of automatically generated emails you should expect to receive as a release engineer at Mozilla. It is not complete.<br />
<br />
= Subject Index =<br />
Zimbra glob/wildcard syntax, in alpha order<br />
<br />
{| cellpadding="10" cellspacing="0" border="1"<br />
<br />
!Field !! Wildcard !! Further Notes<br />
|-<br />
|Subject || idle kittens report || [[#briar-patch idle kittens reporting]]<br />
|-<br />
|Subject || Humpty Dumpty Error * || [[#Puppet failing too many times on a slave]]<br />
|-<br />
|Subject ||[puppet-monitoring]* ||[[#Puppet Log Monitoring]]<br />
|-<br />
|Subject ||Talos Suspected machine issue *|| ''if you don't know, you don't care''<br />
|-<br />
|Subject ||Try submission *||to: autolanduser@mozilla.com<br />
|}<br />
<br />
=briar-patch idle kittens reporting=<br />
== Why we get them ==<br />
An email report outlining the status of any host that has been flagged as "idle".<br />
<br />
== What is sending them ==<br />
A cron job that runs the kittenreaper.py task with the following parameters:<br />
<br />
python kittenreaper.py -w 1 -e<br />
<br />
It pulls the list of hosts to check from http://build.mozilla.org/builds/slaves_needing_reboot.txt<br />
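<br />
For reference, the flow is roughly "fetch the host list, decide which hosts are idle, mail a report". The sketch below is ''not'' the real kittenreaper.py: how a host is judged idle is stubbed out, and the mail address is a placeholder; only the host-list URL and the report subject come from this page.<br />
<pre>
#!/usr/bin/env python
# Hypothetical sketch of the reporting side of the idle-kitten check.
# NOT the real kittenreaper.py; is_idle() is a stub and MAIL_TO is a placeholder.
import smtplib
import urllib2
from email.mime.text import MIMEText

HOST_LIST = "http://build.mozilla.org/builds/slaves_needing_reboot.txt"
MAIL_TO = "release@example.com"       # placeholder address

def fetch_hosts():
    return [h.strip() for h in urllib2.urlopen(HOST_LIST).read().splitlines() if h.strip()]

def is_idle(host):
    # Placeholder: the real tool inspects the slave/master to decide this.
    return True

def main():
    idle = [h for h in fetch_hosts() if is_idle(h)]
    msg = MIMEText("Idle kittens:\n" + "\n".join(idle))
    msg["Subject"] = "idle kittens report"    # matches the subject in the index above
    msg["From"] = "cltbld@localhost"
    msg["To"] = MAIL_TO
    s = smtplib.SMTP("localhost")
    s.sendmail(msg["From"], [MAIL_TO], msg.as_string())
    s.quit()

if __name__ == "__main__":
    main()
</pre>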
<br />
== What to do when one is received ==<br />
Not sure yet; if you're on buildduty, you should be watching these.<br />
<br />
== Future plans ==<br />
This will be replaced by the briar-patch dashboard<br />
<br />
== How to best filter these emails ==<br />
Filtering can be done by matching the subject line which will not change<br />
<br />
<br />
=Puppet Log Monitoring=<br />
== Why we get them ==<br />
There are messages in the puppet master logs that indicate something is wrong with a slave or master. Since we have no other master monitoring tools, we are defaulting to sending email.<br />
== What is sending them ==<br />
scl-production-puppet and soon all puppet masters have an instance of 'watch-puppet.py' running under screen as root.<br />
<br />
The code for this script is stored [https://github.com/jhford/monitor-puppet here]<br />
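<br />
For a rough idea of what it does: the general shape is a small loop that tails the log and sends a mail whenever a known pattern shows up. The sketch below is ''not'' the real watch-puppet.py (see the github link above for that); the log path, patterns and mail address are made-up placeholders.<br />
<pre>
#!/usr/bin/env python
# Minimal sketch of a "tail the log, mail on match" loop in the spirit of
# watch-puppet.py.  NOT the real script: LOG, PATTERNS and MAIL_TO below are
# assumptions used only to illustrate the idea.
import time
import smtplib
from email.mime.text import MIMEText

LOG = "/var/log/messages"                      # assumed log location
PATTERNS = {
    "waiting to be signed": "is waiting to be signed",
    "invalid cert": "has invalid cert",
}
MAIL_TO = "release@example.com"                # placeholder address

def send_alert(subject, body):
    msg = MIMEText(body)
    msg["Subject"] = "[puppet-monitoring] %s" % subject
    msg["From"] = "root@localhost"
    msg["To"] = MAIL_TO
    s = smtplib.SMTP("localhost")
    s.sendmail(msg["From"], [MAIL_TO], msg.as_string())
    s.quit()

def tail(path):
    """Yield new lines appended to path, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                            # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)
                continue
            yield line

if __name__ == "__main__":
    for line in tail(LOG):
        for name, needle in PATTERNS.items():
            if needle in line:
                send_alert(name, line.strip())
</pre>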
<br />
== What to do when one is received ==<br />
* if the title contains "[puppet-monitoring][master_name] <slavename> is waiting to be signed", this is for information and requires no immediate action<br />
* if the title contains "[puppet-monitoring][master_name] &lt;slavename&gt; has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "&lt;slavename&gt; is waiting to be signed" email once there is a waiting signing request, and the key will be automatically signed by a cronjob.<br />
<br />
== How to silence or acknowledge this alert ==<br />
It is not currently possible to silence this email. This script will send email each time the corresponding line pattern is seen in /var/log/messages. This means that most likely, each time a slave tries to puppet, an email will be sent.<br />
<br />
== Future plans ==<br />
In the short term, we'd like to have this script monitor the puppet logs for more error conditions. It would also make sense to monitor all puppet masters.<br />
<br />
== How to best filter these emails ==<br />
* subject includes [puppet-monitoring]<br />
<br />
=Puppet failing too many times on a slave=<br />
== Why we get them ==<br />
We have no other monitoring for slaves failing to run puppet successfully. This became a large issue with the rev4 talos machines due to {{bug|700672}}. We are now doing an exponential backoff on these slaves with a set number of iterations. Once the maximum number of iterations is reached, the slave will send this email and then reboot. This helps us avoid puppet master load, and also lets the machines try to fix themselves by rebooting.<br />
<br />
== What is sending them ==<br />
Each machine that has these emails enabled sends the email itself when its final puppet attempt fails, right before it reboots.<br />
<br />
The code that sends them is unversioned, but is deployed to the slaves from <br />
scl-production-puppet:/N/production/darwin10-i386/test/usr/local/bin/run-puppet.sh <br />
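<br />
The behaviour described above (retry with exponential backoff, then mail and reboot after the last failure) looks roughly like the sketch below. This is only an illustrative Python sketch, not the deployed run-puppet.sh (which is a shell script); the retry limit, delays, puppet invocation and mail address are all assumptions.<br />
<pre>
#!/usr/bin/env python
# Rough illustration of the retry logic described above: try puppet with an
# exponential backoff, and after the last failure send the "Humpty Dumpty"
# email and reboot.  The real run-puppet.sh is a shell script; the limit,
# delays, commands and addresses here are assumptions for illustration only.
import subprocess
import time

MAX_TRIES = 5          # assumed maximum number of iterations
BASE_DELAY = 60        # seconds; doubled after each failure

def run_puppet():
    # assumed invocation; real success detection may differ
    return subprocess.call(["puppetd", "--test"]) == 0

def send_failure_mail():
    subject = "Humpty Dumpty Error on this slave"      # matches the filterable subject
    subprocess.call("echo 'puppet failed %d times' | mail -s '%s' release@example.com"
                    % (MAX_TRIES, subject), shell=True)  # placeholder address

def main():
    delay = BASE_DELAY
    for attempt in range(1, MAX_TRIES + 1):
        if run_puppet():
            return
        if attempt < MAX_TRIES:
            time.sleep(delay)
            delay *= 2
    # maximum number of iterations reached: mail, then reboot to self-heal
    send_failure_mail()
    subprocess.call(["shutdown", "-r", "now"])

if __name__ == "__main__":
    main()
</pre>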
<br />
== What to do when one is received ==<br />
* Either ignore the email, or find the root of the problem and fix it. <br />
<br />
== How to silence or acknowledge this alert ==<br />
This email is a temporary workaround until we get a real puppet client monitoring tool. The email will be sent each time the maximum number of retries is reached, which is every couple of hours.<br />
<br />
== Future plans ==<br />
We would really like to replace these emails with real puppet monitoring.<br />
<br />
== How to best filter these emails ==<br />
These emails are best filtered by having "Humpty Dumpty Error" in their subject. Because the hostname on the slave might not be correct all the time, filtering on domain names might not catch every case.<br />
<br />
<hr /><br />
=Sample=<br />
== Why we get them ==<br />
Give a brief explanation of what this email is for, what it helps us do, and why it should be watched.<br />
<br />
== What is sending them ==<br />
Include a link to the source of the program sending the email. Include information on which hosts are sending the email, and give information on how the program runs. Is it a daemon? Does it have an init script? Do you run it under screen? <br />
<br />
== What to do when one is received ==<br />
* if the title contains "[scl-production-puppet-new] <slavename> is waiting to be signed", this is for information and requires no immediate action<br />
* if the title contains "[scl-production-puppet-new] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "<slavename> is waiting to be signed" email. The key will be automatically signed<br />
<br />
== How to silence or acknowledge this alert ==<br />
Include information on how to make the emails stop<br />
<br />
== Future plans ==<br />
provide any future plans for this email. Is it temporary? Is it going to be replaced by a real dashboard? Are you going to add/change things people filter on?<br />
<br />
== How to best filter these emails ==<br />
provide insight on how to filter these emails. Is there a distinguishing header? Is it always from a specific host, or family of hosts? Is there a distinctive subject?</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Archive/Android_Tegras&diff=403183ReleaseEngineering/Archive/Android Tegras2012-03-01T23:30:40Z<p>Bear: /* Tegra Dashboard */</p>
<hr />
<div>{{Release Engineering How To|Android Tegras}}<br />
= First time? =<br />
Is it the first time you're dealing with tegras and foopies? Here are a few things not to do:<br />
# Do not update the talos checkout under /builds or you will hit new bugs<br />
# Do not start a tegra unless you use "screen -x"<br />
<br />
= Tegra Dashboard =<br />
The current status of each Tegra, and other informational links, can be seen on the [http://mobile-dashboard.pub.build.mozilla.org/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''<br />
<br />
The page is broken up into three sections: Summary, Production and Staging, where Production and Staging show the same information but focus on their respective set of Tegras.<br />
<br />
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.<br />
<br />
{| <br />
! !! Production !! Staging<br />
|-<br />
| Tegra and buildslave online || 57 || 8<br />
|-<br />
| Tegra online but buildslave is not || 0 || 0<br />
|-<br />
| Both Tegra and buildslave are offline || 19 || 2<br />
|}<br />
<br />
<br />
The Production/Staging section is a detailed list of all Tegras that fall into the given category.<br />
<br />
ID Tegra CP BS Msg Online Active Foopy PDU active bar<br />
<br />
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master<br />
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE <br />
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE<br />
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE<br />
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra<br />
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir<br />
* '''PDU''' Which PDU page can be used to power-cycle the Tegra. PDU0 is used for those not connected as of yet<br />
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries<br />
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active<br />
<br />
= What Do I Do When... =<br />
<br />
== PING checks are failing ==<br />
See the section [[ReleaseEngineering:How_To:Android_Tegras#power_cycle_a_Tegra|power cycle a tegra]].<br />
<br />
== tegra agent check is CRITICAL ==<br />
Check the dashboard; the tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", then treat it as [[#PING checks are failing]].<br />
<br />
= How Do I... =<br />
<br />
== recover a foopy ==<br />
<br />
If a foopy has been shut down without having cleanly stopped all Tegras, you will need to do the following:<br />
<br />
'''Note''': Establish the base screen session if needed, by trying screen -x first.<br />
<br />
ssh cltbld@foopy##<br />
screen -x<br />
cd /builds<br />
./stop_cp.sh<br />
./start_cp.sh<br />
<br />
== find what foopy a Tegra is on ==<br />
<br />
Open the Tegra Dashboard - the foopy number is shown to the right<br />
<br />
== check status of Tegra(s) ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./check.sh -t tegra-###<br />
<br />
To check on the status of all Tegras covered by that foopy<br />
<br />
./check.sh<br />
<br />
check.sh is found in /builds on a foopy<br />
<br />
== power cycle a Tegra ==<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
ssh cltbld@foopy##<br />
/builds/check.sh -t tegra-## -c<br />
<br />
You have to wait approximately 5 minutes before you can check the status of the slave.<br />
<br />
What "check.py -c" does is to check that a tegra is really offline and then reboot through the PDU.<br />
"Reboot a Tegra through the PDU" is doing a hardcore reboot without checking that the tegra is really down.<br />
This means that if this section does not recover you will need to file a bug for ServerOps::Releng to get to it.<br />
<br />
<strike><br />
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU]].<br />
</strike><br />
<br />
== clear an error flag ==<br />
<br />
Find the Tegra on the Dashboard, ssh to that foopy and then<br />
<br />
ssh cltbld@foopy05<br />
./check.sh -t tegra-002 -r<br />
<br />
== restart Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
ssh cltbld@foopy##<br />
screen -x # or you will hit bug 642369<br />
cd /builds<br />
./stop_cp.sh tegra-###<br />
<br />
check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found then you will need to kill them manually. Once clear, run<br />
<br />
./start_cp.sh tegra-###<br />
<br />
== start Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
screen -x # or you will hit bug 642369<br />
cd /builds<br />
./start_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter then it will only attempt to start that Tegra, otherwise it will walk thru all Tegras found in /builds/tegra-*<br />
<br />
== stop Tegra(s) ==<br />
<br />
First find the foopy server for the Tegra and then run:<br />
screen -x # or you will hit bug 642369<br />
cd /builds<br />
./stop_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter then it will only attempt to stop that Tegra, otherwise it will walk thru all Tegras found in /builds/tegra-*<br />
<br />
At the end of the shutdown process, stop_cp.sh will run<br />
<br />
ps auxw | grep "tegra-###"<br />
<br />
to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie, and that just ruins any summer picnic.<br />
<br />
== find Tegras that are hung ==<br />
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.<br />
<br />
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse model name with a tegra host name, e.g. <tt>tegra-250</tt>.)''. Currently (2011-12-20) all tegras on a foopy use the same build master:<br />
<br />
{| border="1" cellpadding="2"<br />
!foopy #!!Master URL<br />
|-<br />
| <18<br />
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]<br />
|-<br />
| >=18 & even<br />
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]<br />
|-<br />
| >18 & odd<br />
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]<br />
|}<br />
<br />
Look for Tegras that have a "Last heard from" of &gt;4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.<br />
<br />
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)<br />
<br />
=== whack a hung Tegra ===<br />
The only way currently to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.<br />
<br />
The manual way to do it is to run:<br />
<br />
ps auxw | grep server.js | grep tegra-### <br />
<br />
and then kill the result PID. To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code> which automates that task.<br />
<br />
cd /builds<br />
./kill_stalled.sh 042 050 070 099<br />
<br />
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.<br />
<br />
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.<br />
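<br />
For the curious, kill_stalled.sh amounts to the ps/grep/kill procedure above run in a loop. A rough Python equivalent is sketched below; it is ''not'' the real script (which is a shell script in /builds) and is shown only to illustrate the logic.<br />
<pre>
#!/usr/bin/env python
# Rough Python equivalent of the ps/grep/kill procedure described above.
# NOT the real kill_stalled.sh; this just illustrates "find the server.js
# process for each given tegra and kill it".
import os
import signal
import subprocess
import sys

def find_server_js_pid(tegra_id):
    """Return the PID of the server.js process for tegra-<id>, or None."""
    out = subprocess.Popen(["ps", "auxww"], stdout=subprocess.PIPE).communicate()[0]
    for line in out.decode("utf-8", "replace").splitlines():
        if "server.js" in line and ("tegra-%s" % tegra_id) in line:
            return int(line.split()[1])      # second ps column is the PID
    return None

if __name__ == "__main__":
    for tegra_id in sys.argv[1:]:            # e.g. ./kill_stalled.py 042 050
        pid = find_server_js_pid(tegra_id)
        if pid is None:
            print("tegra-%s: none found" % tegra_id)
        else:
            print("tegra-%s: killing pid %d" % (tegra_id, pid))
            os.kill(pid, signal.SIGTERM)
</pre>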
<br />
== Reboot a Tegra through the PDU ==<br />
cd /builds<br />
python sut_tools/tegra_powercycle.py ###<br />
<br />
You will see the snmpset call result if it worked.<br />
<br />
If rebooting via PDU does not clear the problem, here are things to try:<br />
* reboot again - fairly common to have 2nd one clear it<br />
** especially if box responsive to ping & telnet (port 20701) after first reboot<br />
<br />
== check.py options ==<br />
<br />
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] '''find the appropriate foopy server''' and<br />
<br />
cd /builds<br />
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]<br />
<br />
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction<br />
* -r reset any error.flg semaphore if found and send "rebt" command to tegra<br />
* -c powercycle the Tegra by telneting to the appropriate PDU<br />
<br />
This will scan a given Tegra (or all of them) and report back its status.<br />
<br />
== Start ADB ==<br />
On the Tegra do:<br />
telnet tegra-### 20701<br />
exec su -c "setprop service.adb.tcp.port 5555"<br />
exec su -c "stop adbd"<br />
exec su -c "start adbd"<br />
<br />
On your computer do:<br />
adb tcpip 5555<br />
adb connect <ipaddr of tegra><br />
adb shell<br />
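<br />
If you need to do this for several tegras, the same agent commands can be sent programmatically. Below is a minimal sketch using Python's telnetlib, assuming the SUT agent accepts exactly the commands above over a plain telnet session on port 20701; the "$&gt;" prompt used here is an assumption, so adjust it if the agent answers differently.<br />
<pre>
#!/usr/bin/env python
# Minimal sketch: send the SUT agent commands above to a tegra over port 20701.
# Assumes the agent accepts these commands over a plain telnet session, as in
# the manual procedure; prompt handling here is simplified.
import telnetlib

COMMANDS = [
    'exec su -c "setprop service.adb.tcp.port 5555"',
    'exec su -c "stop adbd"',
    'exec su -c "start adbd"',
]

def enable_adb_over_tcp(tegra_host, port=20701):
    tn = telnetlib.Telnet(tegra_host, port, 30)
    for cmd in COMMANDS:
        tn.write((cmd + "\n").encode("ascii"))
        # "$>" is assumed to be the agent prompt; adjust if it differs
        print(tn.read_until(b"$>", 10))
    tn.close()

if __name__ == "__main__":
    enable_adb_over_tcp("tegra-042")     # example host name
</pre>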
<br />
== Move a tegra from one foopy to another ==<br />
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.<br />
<br />
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)<br />
<br />
# update foopies.sh & tegras.json in your working directory<br />
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt><br />
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt><br />
# in buildbot, request a "graceful shutdown"<br />
#* wait for tegra to show "idle"<br />
# on the old foopy:<br />
#* stop the tegra via <tt>/builds/stop_cp.sh</tt><br />
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file<br />
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}<br />
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):<br />
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt><br />
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file<br />
#* manually run <tt>cd /builds; ./create_dirs.sh</tt><br />
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt><br />
# on the new foopy:<br />
#* restart the tegras using <tt>screen -x # or you will hit bug 642369; cd /builds ; ./start_cp.sh</tt><br />
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.<br />
<br />
== Move a tegra from staging to production ==<br />
<br />
# If the tegra is running, stop it: <tt>/builds/stop_cp.sh tegra-###</tt><br />
# Edit the tegra's buildbot.tac: <tt>/builds/tegra-###/buildbot.tac</tt><br />
# Adjust the master, port and password to the appropriate server<br />
# Save and restart the Tegra: <tt>screen -x # or you will hit bug 642369; /builds/start_cp.sh tegra-###</tt><br />
<br />
'''Note''' - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc<br />
<br />
= Environment =<br />
<br />
The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegra's per foopy. Each Tegra has it's own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.<br />
<br />
* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py<br />
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it<br />
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)<br />
<br />
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra/tegra-devkit-features for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.<br />
<br />
Unlike the N900s, we don't run a buildbot environment on the device; instead we communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.<br />
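<br />
To give a feel for clientproxy.py's job, here is a stripped-down sketch of that kind of monitor loop. It is purely illustrative and ''not'' the real clientproxy.py; the buildslave commands, paths and poll interval are assumptions (the port is the SUT agent port used elsewhere on this page).<br />
<pre>
#!/usr/bin/env python
# Illustrative sketch of a clientproxy-style loop: watch whether the tegra's
# SUT agent is reachable and start/stop the matching buildslave accordingly.
# NOT the real clientproxy.py; commands, paths and timings are assumptions.
import socket
import subprocess
import time

SUT_PORT = 20701          # SUT agent port (see "Start ADB" above)
POLL_SECONDS = 60

def sut_agent_alive(host, port=SUT_PORT, timeout=10):
    try:
        s = socket.create_connection((host, port), timeout)
        s.close()
        return True
    except (socket.error, socket.timeout):
        return False

def buildslave(action, tegra_dir):
    # assumed invocation; the real environment uses /builds/tegra-###/
    return subprocess.call(["buildslave", action, tegra_dir])

def monitor(tegra, tegra_dir):
    slave_running = False
    while True:
        alive = sut_agent_alive(tegra)
        if alive and not slave_running:
            buildslave("start", tegra_dir)
            slave_running = True
        elif not alive and slave_running:
            buildslave("stop", tegra_dir)
            slave_running = False
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    monitor("tegra-042", "/builds/tegra-042")   # example values
</pre>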
<br />
= References =<br />
<br />
== One source of truth ==<br />
<br />
As of Oct 2011, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/tegras.json <tt>tools/buildfarm/mobile/tegras.json</tt>] should be the most authoritative document.<br />
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>tegras.json</tt> updated.<br />
* if you find a PDU not labeled per the <tt>tegras.json</tt> file, file a releng bug to update the human labels.</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=397355ReleaseEngineering/Maintenance2012-02-14T20:54:11Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure; buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also of what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page when debugging intermittent problems to track down whether a reconfig caused them.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 20120214 1245 PST<br />
|<br />
* {{bug|726901}} - adjust resolution for reftests to 1600x1200<br />
* {{bug|689989}} - restore /system/etc/hosts on testing tegras<br />
|-<br />
| in production<br />
| 20120213 1200 PST<br />
|<br />
* {{bug|725727}} - reduce # of chunks for update_verify.<br />
* {{Bug|607392}} - split tagging into en-US and other. <br />
|-<br />
| in production<br />
| 20120208 01:20 PST<br />
|<br />
* {{bug|723954}} - 11.0b2 configs<br />
* {{bug|718385}} - android single locale updates<br />
* {{bug|717106}} - Release automation for ESR<br />
|-<br />
| in production<br />
| 20120207 13:00 PST<br />
|<br />
* {{bug|719443}} - add robocop unittest testtype<br />
* {{bug|715715}} - download & install robocop for robocop test suites<br />
* {{Bug|725046}} - Re-enable mobile aurora updates<br />
* {{Bug|554324}} - Only set MOZ_PKG_VERSION when appVersion != version<br />
* [BACKED OUT] - <strike>{{bug|719511}} - optional ScriptFactory reboot().</strike><br />
|-<br />
| in production<br />
| 20120202 15:50 PST<br />
|<br />
* {{bug|723743}} - android native to en-US (no multilocale); disable android-xul single locale repacks.<br />
* {{bug|719697}} - --disable-tests on android* l10n-mozconfigs.<br />
* {{Bug|723277}} - don't enable remote-tpan by default<br />
|-<br />
| in production<br />
| 20120202 14:20 PST<br />
|<br />
* [http://hg.mozilla.org/build/buildbotcustom/rev/c2f1c5472785 Bustage fix] - --no-check-certificate for Leopard and Windows<br />
|-<br />
| in production<br />
| 20120202 13:05 PST<br />
|<br />
* [http://hg.mozilla.org/build/tools/rev/ab74fec66359 Bustage fix] - talos.zip rather than talos_zip<br />
|-<br />
| in production<br />
| 20120202 11:30 PST<br />
|<br />
*{{bug|723608}} + bustage fix - add locking to surf related actions and change where the talos_from_code.py script comes from<br />
|-<br />
| in production<br />
| 20120202 0950 PST<br />
|<br />
* {{Bug|719567}} - support both formats of talos.json (no reconfig needed as it is a [http://hg.mozilla.org/build/tools/rev/d4eb54621cce tools] check-in)<br />
|- <br />
| in production<br />
| 20120202 0900 PST<br />
|<br />
* {{Bug|721822}} - Backout talos_from_code.py from repo.<br />
|-<br />
| in production<br />
| 20120202 0820 PST<br />
|<br />
* {{Bug|719567}} - expand talos.json to support pageloader.xpi. r=jmaher<br />
* {{Bug|721822}} - Download talos_from_code.py from source repo. r=jmaher <br />
* {{Bug|718777}} - Backout multi locale changes. r=catlee <br />
* {{Bug|721822}} - Clean up code for merge days. r=jmaher<br />
|- <br />
| in production<br />
| 20120201 0115 PST<br />
|<br />
* backout {{bug|660480}} - RETRY on common tegra errors<br />
|-<br />
| in production<br />
| 20120131 2100 PST<br />
|<br />
* {{bug|718777}} - updating configs for mozilla-beta so we can get native builds going<br />
* {{bug|722719}} - change to SDK 14<br />
* {{bug|722951}} - (temporarily) redirect aurora updates to test channel<br />
* {{bug|722940}} - codesize upload broken for SeaMonkey [and Thunderbird] due to tools dir being incorrect<br />
* {{bug|718777}} - Tracking bug for build and release of Firefox/Fennec 11.0b1. Poll signed Fennec APKs for all signed <br />
* {{bug|708656}} - Use signing on demand for releases. Use AggregatingScheduler for repack_complete<br />
* {{bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|660480}} - RETRY on common tegra errors<br />
|-<br />
| in production<br />
| 20120127 1130 PST<br />
|<br />
* {{bug|719697}} - robocop isn't signed properly from buildbot builds<br />
|-<br />
| in production<br />
| 20120127 1040 PST<br />
|<br />
* {{bug|721488}} - deployed new pageloader.xpi<br />
|-<br />
| in production<br />
| 20120127 0730 PST<br />
|<br />
* {{Bug|719544}}. talos_from_source.py - Make the pine branch to allow downloading talos.zip from any place like on 'try'<br />
* {{bug|717662}} - Please disable debug builds and tests on the profiling branch<br />
* {{bug|720782}} - If we dont_build a platform on project_branches we should not add testers for it<br />
* {{bug|721360}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|721573}} - Sign the profile branch nightlies using the m-c nightly key<br />
* {{bug|717106}} - Release automation for ESR<br />
* {{bug|698827}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|715966}} - branch 1.9.2 confusingly set on talos tbpl logs<br />
* {{bug|718828}} - Don't wait for NFS cache at the end of the updates builder<br />
* {{bug|705403}} - <strike>Sendchanges [on windows] from build steps are being done from old buildbot version</strike> - backed out<br />
* {{bug|683417}} - retry.py didn't actually kill process tree for a timed-out pushsnip<br />
* {{bug|673834}} - Obsolete ReleaseRepackFactory, fold logic into CCReleaseRepackFactory<br />
|-<br />
| in production<br />
| 20120123 1435 PST<br />
|<br />
* {{bug|719859}} - remove double posting ts_paint and tpaint. p=armenzg<br />
* {{bug|718445}} - stage-old should be referenced as stage in scripts/configs. p=bhearsum<br />
|-<br />
| in production<br />
| 20120123 1405 PST<br />
|<br />
* {{Bug|649641}} - use ntpd on linux32/linux64 ix slaves<br />
|-<br />
| in production<br />
| 20120123 1140 PST<br />
|<br />
* {{Bug|711619}} - Add Android builds+tests and periodic PGO on the Fx-Team branch, p=philor<br />
* {{Bug|719859}} - Side by side on mozilla-central for ignore_first changes. p=jmaher<br />
|-<br />
| in production<br />
| 20120123 0730 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
* {{bug|719772}} - Sign Callek up for the full release process e-mails<br />
* {{bug|716561}} - reevaluate which release mail gets sent to release-drivers<br />
* {{bug|561198}} - compress leak test / codesighs logs prior to uploading<br />
* {{bug|699219}} - Add automated clean up of hg-shared directory<br />
* {{bug|714284}} - L10n mac dep builds busted on central and aurora<br />
* {{bug|719261}} - Add more logging to AggregatingScheduler<br />
|-<br />
| in production<br />
| 20120119 1200 PST<br />
|<br />
* {{bug|719504}} - disable peptest.<br />
* {{bug|715219}} - off-by-one bustage fix for tegra android range<br />
|-<br />
| in production<br />
| 20120119 1100 PST<br />
|<br />
* {{bug|699219}} - purge shared hg repos<br />
|-<br />
| in production<br />
| 20120117 1230 PST<br />
|<br />
* {{bug|695351}} - android mochitests to use in-tree manifest<br />
* {{bug|700415}} - peptest on try<br />
* {{bug|712750}} - print more data for screenresolution in buildbot factories<br />
|-<br />
| in production<br />
| 20120117 0800 PST<br />
|<br />
* {{Bug|698827}} - Run 10.5 leak builds on 10.6 machines for try. p=armenzg<br />
|-<br />
| in production<br />
| 20120116 1325 PST<br />
|<br />
* Require branch parameter to clobberer HTML interface<br />
|-<br />
| in production<br />
| 20120113 07:00 PST<br />
|<br />
* {{bug|714490}} - make hgtool handle mirror/master hg outages better<br />
|-<br />
| in production<br />
| 20120112 16:40 PST<br />
|<br />
* {{bug|712422}} - add a --bootstrap cli flag to reftest/crashtest/jsreftest for android<br />
* {{bug|698425}} - enable android and android-xul l10n repacks<br />
* Bustage fix. Changeset fa1c76238b7c<br />
* {{bug|713442}} - point 1.9.2 release configs to the compare-locales RELEASE_0_8_2 tag<br />
* {{bug|717621}} - Remove decommissioned slaves<br />
* {{bug|698425}} - android and android-xul l10n mozconfig<br />
* {{bug|567274}} - Talos should halt on download or unzip failure<br />
|-<br />
| in production<br />
| 20120109 1806 PST<br />
|<br />
* stage rather than masters<br />
* {{bug|712008}} - Always trim revision to 12 chars<br />
* {{bug|716431}} - Block asc files for partial mars in latest-<branch> dirs (stage)<br />
|-<br />
| in production<br />
| 20120106 1300 PST<br />
|<br />
* {{bug|715623}} - add --cachedir support to signtool.py<br />
|-<br />
| in production<br />
| 20120104 1315 PDT<br />
|<br />
* Back out 7a7847f7fc05 ({{bug|711275}}: Make sure appVersion changes with every Firefox 10 beta)<br />
* {{bug|712008}} - Pass platform to post_upload.py for shark<br />
* {{bug|681948}} - Automatically retry after a devicemanager.DMError<br />
* {{bug|715119}} - [signing-server] Bump token TTL<br />
* {{bug|713161}} - new high tegra added<br />
* {{bug|711221}} - turn on create_snippet and create_partial for profiling branch<br />
* {{bug|712150}} - bustage fix for linux,m-r and xulrunner in-tree mozconfig path<br />
|-<br />
| in production<br />
| 20111222 0800 PDT<br />
|<br />
* {{Bug|710350}} - Don't hard-code 'firefox' and 'fennec' in misc.py.<br />
* {{Bug|707152}} - enable leaktest for 10.6 everywhere except some release branches.<br />
* {{bug|711367}} - enable android-xul tests<br />
* {{Bug|673131}} - Enable talos_from_source_code.<br />
* {{bug|712094}} - re-enable aurora updates. <br />
* {{bug|711275}} - Make sure appVersion changes with every Firefox 10 beta. r=rail<br />
|-<br />
| in production<br />
| 20111221<br />
|<br />
* {{bug|683734}} - added a bunch of talos-r3 slaves to production<br />
|-<br />
| in production<br />
| 20111221 1300 PST<br />
|<br />
* {{bug|558180}} - use in-tree mozconfigs for releases<br />
* {{bug|709114}} - add locales to aurora<br />
* {{bug|710842}} - re-enable symbols for nightly fennec xul builds<br />
* {{bug|711221}} - rename private-browsing branch to 'profiling'<br />
* {{bug|712133}} - firefox 10.0b1 release configs<br />
|-<br />
| in production<br />
| 20111221 1100 PST<br />
|<br />
* {{bug|712208}} - update binutils to 2.22<br />
|-<br />
| in production<br />
| 20111220 0610 PST<br />
|<br />
* <strike>{{bug|673131}} - when minor talos changes land, the a-team should be able to deploy with minimal releng time required</strike> - backed-out<br />
* {{bug|704582}} - [tracking bug] deploy 83 tegras<br />
* {{bug|712115}} - L10n mac nightlies busted on central and aurora<br />
* {{bug|710453}} - Release Engineering changes for the Firefox 11 merge to Aurora on Dec 20<br />
* {{bug|712094}} - push mozilla-aurora updates to auroratest channel until merge stabilizes<br />
* {{bug|712068}} - Adjust default releasetestUptake value<br />
|-<br />
| in production<br />
| 20111219 1000 PST<br />
| <br />
* {{bug|707941}} - Improve token generation step<br />
* {{bug|711179}} - fix for missing symbols for non-mobile tests<br />
* {{bug|710453}} - android-xul mozilla-release mozconfigs<br />
* {{bug|711978}} - Refresh staging release configs<br />
|-<br />
| in production<br />
| 20111217 0800 PST<br />
|<br />
* {{bug|509158}} - enable signing on all branches<br />
|-<br />
| backed out<br />
| 20111216 1700 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
|-<br />
| in production<br />
| 20111215 0830 PST<br />
|<br />
* {{bug|711064}} Fix puppet dependencies<br />
|-<br />
| in production<br />
| 20111214 0800 PST<br />
|<br />
* {{bug|509158}} Reduce default token time to 2 hours; fix last-complete-mar detection<br />
* {{bug|683734}} Add new rev3 machines.<br />
* {{bug|708475}} accept 'mochitest' and 'reftests' as synonyms for 'mochitests' and 'reftest' (with tests)<br />
* {{bug|708859}} android signature verification should look for android-arm.apk<br />
* {{bug|709233}} reenable android and android-xul multilocale for m-c nightlies<br />
* {{bug|709383}} Turn off win64 signing on m-c<br />
* {{bug|709979}} Set the branch property for projects/addon-sdk jobs to just addon-sdk<br />
* {{bug|710048}} decrease interval between mozilla-inbound pgo builds<br />
* {{bug|710050}} never merge pgo builds<br />
* {{bug|710085}} Pass mozillaDir argument to NightlyBuildFactory<br />
* {{bug|710221}} Implement AggregatingScheduler<br />
|-<br />
| in production<br />
| 20111208 0920 PST<br />
|<br />
* {{bug|509158}} Fix nightly snippet generation, reduce default token time, and enable signing on inbound<br />
* {{bug|707666}} Enable win64 signing on elm<br />
* {{bug|708341}} Turn off android-xul talos tests<br />
|-<br />
| in production<br />
| 20111206 1300 PST ish<br />
|<br />
* {{bug|509158}} Don't enable signing for l10n check steps.<br />
* {{bug|509158}} Sign builds as part of the build process: enable signing server for debug builds; disable pre-signed updater on elm.<br />
* {{bug|671450}} Try different sources for revision in log_uploader<br />
* {{bug|706832}} Implement master side token generation for signing on demand.<br />
* {{bug|509158}} Enable signing for mozilla-central windows builds.<br />
* {{bug|704549}} reenable android native on m-c.<br />
* {{bug|703772}} disable android-xul updates + uploadsymbols.<br />
|-<br />
| in production<br />
| 20111205 0800 PST<br />
|<br />
* {{bug|509158}} - signing builds (elm/oak only, hopefully)<br />
* {{bug|706832}} - Implement master side token generation for signing on demand. r=catlee,bhearsum<br />
* {{bug|671450}} - Try different sources for revision in log_uploader - r=nthomas<br />
* {{bug|707152}} - enable leaktests for m-i, try and m-c on macos64-debug. r=rail.<br />
* {{bug|706720}} - Post to graphs-old. r=catlee<br />
|-<br />
| in production<br />
| 20111202 1600 PST<br />
|<br />
* {{bug|509158}} - tools for signing builds<br />
|-<br />
| in production<br />
| 20111201 1100 PST<br />
|<br />
* {{bug|694332}} - Use make tier_nspr when building for l10n - r=armenzg<br />
* {{bug|693352}} r=aki add minidump_stackwalk and symbols to the android automation<br />
* {{bug|705936}} - reconfigs should re-generate master_config.json a=aki<br />
|-<br />
| in production<br />
| 20111201 0900 PST<br />
| <br />
* {{bug|704555}} - deploy rss for tp4m on android (required android talos update)<br />
|-<br />
| in production<br />
| 20111128 1448 PST<br />
|<br />
* {{bug|701684}} - remove mozilla-1.9.1 from config.py. r=bhearsum<br />
* add r4 slaves 080-085 to configs r=catlee<br />
* {{bug|705040}} - reenable native android builds on try. r=bhearsum<br />
* {{bug|691483}} - update MU to 3.6.24 -> 8.0.1, r=lsblakk<br />
|-<br />
| in production<br />
| 20111124 0815 PST<br />
|<br />
* {{bug|703010}} - backfill unresponsive tegras<br />
* {{bug|702390}} - reimage buildbot-master2 and buildbot-master5 as w32-ix-slave43 and w32-ix-slave44<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
* {{bug|699838}} - Set up a project branch to allow us to run several iterations for metrics<br />
* {{bug|700534}} - make local buildbot-config modification on test-master01 permanent<br />
* {{bug|700860}} - Put mw32-ix-slave26 into the production pool<br />
* {{bug|676155}} - install r3 mini 02456 as talos-r3-w7-065<br />
* {{bug|704028}} - xulrunner release bundles often timeout<br />
* {{bug|697802}} - https://bugzilla.mozilla.org/show_bug.cgi?id=697802<br />
|-<br />
| in production<br />
| 20111121 1300 PST<br />
|<br />
* {{bug|702351}} - enable tp_responsiveness on m-c<br />
* {{bug|700705}} - remove more slaves<br />
* add talos-r4-snow-060 to 080 back to the pool<br />
* {{bug|692692}} - re-enable PGO for Win64<br />
* {{bug|701766}} - Remove tegra slaves that had not taken any jobs and are not coming back to production any time soon<br />
* {{bug|704200}} - android dep builds permared after bug 701864; sometimes causing nightlies not to trigger - disable native android builders everywhere except birch<br />
|-<br />
| in production<br />
| 20111118 0700 PST<br />
|<br />
* {{bug|700513}} - set BINSCOPE for win32 on try<br />
* {{bug|702631}} - linux, linux64 and mac partner repacks aren't triggered<br />
* {{bug|703280}} - Use dev-stage01 as SYMBOL_SERVER_HOST for staging try builds<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
|-<br />
| in production<br />
| 20111117 0600 PST<br />
|<br />
* {{bug|702834}} - Pass mozillaDir to dep factory.<br />
* {{bug|701864}} - support mobile builds+repacks out of mobile/, mobile/xul/, and mobile/android/.<br />
* {{bug|701766}} - remove staging tegras.<br />
* {{bug|700513}} - Add BINSCOPE env var to win32, win32-debug, and win32-mobile<br />
* {{bug|701476}} - split android reftests from 2 chunks to 3 chunks.<br />
* {{bug|702357}} - enable new tegras for production<br />
* {{bug|702368}} - add hangmonitor.timeout=0 pref to dirty jobs.<br />
* {{bug|702645}} - win32_repack_beta broken due to "LINK : fatal error LNK1104: cannot open file 'mozcrt.lib'".<br />
* {{bug|548551}} - Turn off arm nanojit builds.<br />
* {{bug|700705}} - Remove a bunch of decomissioned slaves.<br />
* {{bug|683734}} - remove talos-r3-snow machines, remove snowleopard-r4 platform, move talos-r4-snow to snowleopard platform<br />
|-<br />
| in production<br />
| 20111116 0700 PST<br />
|<br />
* {{Bug|702351}} - deploy talos.zip which includes responsiveness <br />
|-<br />
| in production<br />
| 20111111 1712 PST<br />
|<br />
* {{bug|697389}} - multilocale birch android nightlies, against l10n-central.<br />
* {{bug|697404}} - disable tp4m for birch<br />
|-<br />
| in production<br />
| 20111110 1200 PST<br />
|<br />
* {{bug|700901}} - reorder mozconfig to get past mozconfig diff. p=aki<br />
* {{bug|700901}} - fix l10n relbranch. p=aki<br />
* {{bug|701116}} - Mobile desktop builds should be nightly-only. p=rail<br />
* {{bug|701113}} - maemo tier 3 (removing all maemo references except mobile/) p=aki<br />
* {{Bug|672132}} - Run beta and release releases in preproduction. p=rail<br />
* {{bug|698946}} - further setup-masters.py improvements p=jhford<br />
|-<br />
| in production<br />
| 20111108 1630 PST<br />
|<br />
* {{Bug|699407}} - Set mirror / bundle URLs. p=catlee<br />
* {{bug|700721}} - update buildbot-configs for merge of nightly->aurora and aurora->beta p=lsblakk<br />
* {{bug|700453}} - make test-master01 tegra specific. p=aki<br />
* {{Bug|700794}} - Disable aurora daily updates until merge to mozilla-aurora is good. p=armenzg<br />
* {{Bug|700737}} - Remove slaves given to Thunderbird. p=armenzg<br />
|-<br />
| in production<br />
| 20111108 1100 PST<br />
| {{bug|687064}} - hgtool work. p=catlee<br />
|-<br />
| in production<br />
| 20111107 0930 PDT<br />
| {{bug|660124}} - remove "paint" set. p=armenzg<br />
|-<br />
| in production<br />
| 20111107 0845 PDT<br />
|<br />
* {{bug|692812}} - add ability to have pgo strategies p=jhford<br />
* {{bug|693771}} - add 10.7 test slaves to buildbot configs p=jhford<br />
* {{bug|698837}} - use signed updater.exe for elm and oak branches. p=bhearsum<br />
* {{Bug|695921}} - removing duplicated entry for ftp_url on jetpack p=lsblakk <br />
* {{bug|698837}} - use signed updater.exe for elm and oak project branches. p=bhearsum<br />
* {{Bug|660124}} - replace ts/twinopen with ts_paint/tpaint and some cleanup. p=armenzg<br />
* {{Bug|699802}} - enable_leaktests for m-i and try. p=armenzg<br />
|-<br />
| in production<br />
| 20111028 1205 PDT<br />
|<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
* {{bug|695921}} - test per checkin addons-sdk against opt & debug across mozilla-{beta,central,aurora,release} latest tinderbox builds<br />
|-<br />
| in production<br />
| 20111025 1200 PDT<br />
|<br />
* {{bug|681855}} - Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"<br />
* {{bug|697112}} - add more twigs<br />
* {{bug|689649}} - update buildbot config.py to adjust side by side talos staging for mozafterpaint<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
|-<br />
| in production<br />
| 20111021 0932 PDT<br />
|<br />
* {{bug|683448}} - Permission check and virus scan tests shouldn't fail if files pushed to the releases directory<br />
* {{bug|689649}} - disable old_suites for mozilla-beta<br />
* {{bug|692504}} - push betas to internal mirrors automatically<br />
* {{bug|693015}} - disable android debug tests<br />
* {{bug|694077}} - add aus2_mobile_* to the "update branch vars loop" in config.py<br />
* {{bug|694893}} - Bump disk space requirement for codecoverage to 7G<br />
* {{bug|695161}} - backout 1318d1bbc15a to re-enable Win64 updates<br />
* {{bug|695429}} - FF8 beta4 config changes<br />
* {{bug|696165}} - enable tegras 129 - 153<br />
|-<br />
| in production<br />
| 20111019 1100 PDT<br />
| {{bug|695525}} Pulse enabled on test-master01<br />
|-<br />
| in production (build only)<br />
| 20111017 1728 PDT<br />
|<br />
* {{bug|695161}} Disable updates to broken Win x64 builds<br />
|-<br />
| in production<br />
| 20111017 1100 PDT<br />
|<br />
* {{bug|690860}} enable android debug nightly on m-c<br />
* {{bug|694235}} config tests shouldn't fail if there are no try slaves<br />
* {{bug|694106}} remove tegra try pool<br />
* {{bug|676879}} Config changes required to run valgrind as a nightly builder<br />
* {{bug|694716}} patch by joel to fix broken mochitests due to bug 691411<br />
* {{bug|694077}} Enable nightlies builds and updates for birch branch<br />
|-<br />
| in production<br />
| 20111017 0900 PDT<br />
|<br />
* {{bug|694579}} deployed newer talos.zip<br />
|-<br />
| in production<br />
| 20111012 0735 PDT<br />
|<br />
* backout {{bug|692928}}.jhford<br />
* {{Bug|693903}} Update slaves for staging and preproduction configs. rail<br />
* {{Bug|692823}} Reduce PGO sets to 6 hours until bug 691675 is fixed. armenzg<br />
* {{Bug|693686}} PGO talos is submitting to the Firefox-Non-PGO tree. armenzg<br />
|-<br />
| in production<br />
| 20111011 1515 PDT (for build masters, others later)<br />
|<br />
* {{bug|693350}} - Don't try to add bouncer entries in preproduction<br />
* {{bug|692388}} - mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
* No Bug, do compare_attrs better for DependentL10n, so we don't throw in dump_masters. Will followup later to get compare_attrs better for all of buildbotcustom. Not used for Firefox builds, so NPOTB<br />
* {{bug|693686}} - PGO talos builds reporting to Non-PGO branches<br />
* {{bug|693794}} - remove unneeded usebuildbot=1 from tbpl links in try emails<br />
|-<br />
| in production<br />
| 20111007 1550 PDT<br />
|<br />
* {{bug|692928}} turn off rev4 on try <br />
* {{bug|692910}} Update preproduction test slave list<br />
* {{bug|688296}} python module conflict with xcode module<br />
* {{bug|692646}} enable PGO on release builds again<br />
* {{bug|692388}} mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
|-<br />
| in production<br />
| 20111006 1230 PDT<br />
| <br />
* {{bug|681834}} Insert finished jobs in the statusdb more frequently<br />
* {{bug|686578}} SpiderMonkey builds on IonMonkey TBPL - enable all debug spidermonkey builds on ionmonkey<br />
* {{bug|687832}} create generic RETRY signifier, and make retry.py print it when it fails to successfully run<br />
* {{bug|692358}} Fix log uploading for PGO builds and tests<br />
* {{bug|692370}} Add branch name to PGO scheduler so that it shows up on self-serve<br />
|-<br />
| in production<br />
| 20111005 1050 PDT<br />
| <br />
* {{bug|558180}} - use in tree mozconfigs for win64<br />
* {{bug|658313}} - disable PGO for per-checkin builds<br />
* {{bug|683721}} - add rev4 testers to buildbot-configs<br />
* {{bug|668724}} - ensure branch is not None when needed<br />
|-<br />
| in production<br />
| 20111005 1821 PDT<br />
|<br />
* Backed out: {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|-<br />
| in production<br />
| 20111005 1632 PDT<br />
|<br />
* {{bug|671450}} - Backout log_uploader.py change, as got_revision doesn't exist on test jobs<br />
|-<br />
| in production<br />
| 20111005 1546 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (buildbot part)<br />
* {{bug|686831}} - Stop TinderboxPrint-ing the rev early for try<br />
* {{bug|691483}} - Do 3.6.23 -> 7.0.1 advertised major update<br />
* {{bug|689750}} - stop sending sendchanges to jhfords personal master<br />
|-<br />
| in production<br />
| 20111005 1526 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|}<br />
<br />
==Archive==<br />
<br />
[[ReleaseEngineering:BuildbotMasterChanges:Archive | Older Changes]]<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 3efbac1f685a<br />
| unknown<br />
| unknown<br />
| unknown<br />
|}<br />
<br />
Update Procedure:<br />
ssh to bm-remote-talos-webhost-01<br />
cd /var/www/html/talos<br />
hg pull && hg up<br />
 rsync -azf --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
 rsync -azf --delete . bm-remote-talos-webhost-03:/var/www/html/.<br />
<br />
Servers:<br />
* bm-remote-talos-webhost-01.build.mozilla.org<br />
* bm-remote-talos-webhost-02.build.mozilla.org<br />
* bm-remote-talos-webhost-03.build.mozilla.org<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Applications/Tegra_Dashboard&diff=397158ReleaseEngineering/Applications/Tegra Dashboard2012-02-14T17:25:36Z<p>Bear: </p>
<hr />
<div>== Application Description ==<br />
Tegra Dashboard is a static HTML page generated by a cron job. It gathers all of the tegra status files (which each foopy sends to it) and builds the page found at mobile-dashboard1.build.mtv1.mozilla.com/tegras/ (a rough sketch of the idea is shown below).<br />
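<br />
A very rough sketch of what such a generator could look like (the status-file location, naming and format here are assumptions for illustration only; this is not the actual dashboard.py):<br />
<br />
 import glob<br />
 import os<br />
 # Hypothetical paths; the real drop directory and output page may differ.<br />
 STATUS_DIR = "/var/www/tegras/status"<br />
 OUTPUT = "/var/www/tegras/index.html"<br />
 rows = []<br />
 for path in sorted(glob.glob(os.path.join(STATUS_DIR, "tegra-*.status"))):<br />
     name = os.path.basename(path).replace(".status", "")<br />
     with open(path) as f:<br />
         state = f.readline().strip() or "unknown"<br />
     rows.append("<tr><td>%s</td><td>%s</td></tr>" % (name, state))<br />
 with open(OUTPUT, "w") as f:<br />
     f.write("<html><body><table>%s</table></body></html>" % "".join(rows))<br />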
<br />
== Requirements ==<br />
The server depends on<br />
* Python 2.6<br />
<br />
== External Resources ==<br />
Tegra dashboard uses data pulled from two json files found in http://hg.mozilla.org/build/tools<br />
* tools/buildfarm/mobile/tegras.json<br />
* tools/buildfarm/maintenance/production-masters.json<br />
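<br />
For illustration, a minimal Python sketch of reading those two files once the tools repo has been cloned (the top-level layout assumed here is a guess, not the real schema):<br />
<br />
 import json<br />
 # Paths assume the symlinks created in the Deployment section below.<br />
 with open("tegras.json") as f:<br />
     tegras = json.load(f)<br />
 with open("production-masters.json") as f:<br />
     masters = json.load(f)<br />
 # Assumption: tegras.json is an object keyed by tegra host name and<br />
 # production-masters.json is a list of master records.<br />
 print("%d tegras listed" % len(tegras))<br />
 print("%d masters listed" % len(masters))<br />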
<br />
== Security ==<br />
None - it's a static page<br />
<br />
== Monitoring ==<br />
None currently<br />
<br />
== Deployment ==<br />
The script and the generated HTML are deployed on a single host, <tt>mobile-dashboard1.build.mtv1.mozilla.com</tt>.<br />
<br />
=== Server Setup ===<br />
IT installed RHEL6<br />
<br />
The following system packages were installed via yum:<br />
* http://download.fedora.redhat.com/pub/epel/6/i386/epel-release-6-5.noarch.rpm (for EPEL packages; use --nogpgcheck)<br />
* hg<br />
* python26</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Applications/Tegra_Dashboard&diff=397152ReleaseEngineering/Applications/Tegra Dashboard2012-02-14T17:19:23Z<p>Bear: Created page with "packages installed after VM setup: rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm yum install hg yum install python26 Required crontab:..."</p>
<hr />
<div>packages installed after VM setup:<br />
<br />
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm<br />
<br />
yum install hg<br />
yum install python26<br />
<br />
Required crontab:<br />
<br />
*/5 * * * * python2.6 /var/www/tegras/dashboard.py<br />
<br />
Setting up of the tegra dashboard:<br />
<br />
cd /var/www/<br />
mkdir tegras<br />
cd html<br />
ln -s /var/www/tegras .<br />
cd ../tegras<br />
hg clone http://hg.mozilla.org/build/tools<br />
ln -s tools/buildfarm/mobile/tegras.json .<br />
ln -s tools/buildfarm/maintenance/production-masters.json .</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Applications&diff=397151ReleaseEngineering/Applications2012-02-14T17:18:14Z<p>Bear: /* App Store */</p>
<hr />
<div>This page is an incomplete attempt to categorize all of the applications (or systems, or infrastructures, or whatever you'd like to call them) installed and in use by release engineering.<br />
<br />
= App Store =<br />
* [[ReleaseEngineering/Applications/Buildbot]]<br />
* [[ReleaseEngineering/Applications/Clobberer]]<br />
* [[ReleaseEngineering/Applications/BuildAPI]]<br />
** [[ReleaseEngineering/Applications/BuildAPI data for TBPL]]<br />
* [[ReleaseEngineering/Applications/Slavealloc]]<br />
* [[ReleaseEngineering/Applications/Regression Detection]]<br />
* [[ReleaseEngineering/Applications/Tegra Dashboard]]<br />
* Talos<br />
** [[ReleaseEngineering/Applications/Talos Dirty Profiles]]<br />
<br />
= What the heck? =<br />
We're not quite sure yet what these pages should contain. Here are some questions that should be answered for each application:<br />
<br />
== Deployment Questions ==<br />
* what languages are needed and their version<br />
** if python, what python modules - can they be run in a virtualenv<br />
** if perl, what cpan modules are needed<br />
** if php, what php version and what php.ini entries are needed<br />
* what are the command line parameters to start the web service<br />
* does it have any special configuration or init files<br />
* does it require root or sudo<br />
* does it require a special directory layout<br />
* will it generate/use temp files or non-database assets?<br />
* what version of mysql and what database config, where is the sql to init the tables<br />
* does memcached have a min/max memory? which processes read/write to it - can it be on a different IP<br />
* what ports will be opened for listening<br />
* cronjobs that need to be run<br />
* if outside services are utilized, what is that list<br />
<br />
== Maintenance Questions ==<br />
* what are common issues we run into, and how to debug?<br />
** place in buildduty docs?<br />
* where is the code/schema for hacking/reading?<br />
* are there special passwords/accounts/acl's, and what are they? (not in public docs)</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Archive/Android_Tegras&diff=395496ReleaseEngineering/Archive/Android Tegras2012-02-08T21:18:31Z<p>Bear: </p>
<hr />
<div>{{Release Engineering How To|Android Tegras}}<br />
= Tegra Dashboard =<br />
The current status of each Tegra, and other informational links, can be seen on the [http://bm-remote-talos-webhost-01.build.mozilla.org/tegras/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''<br />
<br />
The page is broken up into three sections: Summary, Production and Staging, where the Production and Staging sections show the same information but for their respective sets of Tegras.<br />
<br />
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | <br />
| style="background:#cccccc" | '''Production'''<br />
| style="background:#cccccc" | '''Staging'''<br />
|-<br />
| Tegra and buildslave online<br />
| 57<br />
| 8<br />
|-<br />
| Tegra online but buildslave is not<br />
| 0<br />
| 0<br />
|-<br />
| Both Tegra and buildslave are offline<br />
| 19<br />
| 2<br />
|}<br />
<br />
<br />
The Production/Staging section is a detailed list of all Tegras that fall into the given category.<br />
<br />
ID Tegra CP BS Msg Online Active Foopy PDU active bar<br />
<br />
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master<br />
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE <br />
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE<br />
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE<br />
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra<br />
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir<br />
* '''PDU''' Which PDU page can be used to power-cycle the Tegra. PDU0 is used for those not connected as of yet<br />
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries<br />
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active<br />
<br />
= What Do I Do When... =<br />
<br />
== PING checks are failing ==<br />
Reboot the Tegra through the PDU<br />
<br />
== tegra agent check is CRITICAL ==<br />
Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", treat it as [[#PING checks are failing]]<br />
<br />
= How Do I... =<br />
<br />
== recover a foopy ==<br />
<br />
If a foopy has been shut down without having cleanly stopped all Tegras, you will need to do the following:<br />
<br />
'''Note''': establish the base screen session if needed; try <tt>screen -x</tt> first to attach to an existing one<br />
<br />
ssh cltbld@foopy##<br />
screen -x<br />
cd /builds<br />
./stop_cp.sh<br />
./start_cp.sh<br />
<br />
== find what foopy a Tegra is on ==<br />
<br />
Open the Tegra Dashboard - the foopy number is shown to the right<br />
<br />
== check status of Tegra(s) ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./check.sh -t tegra-###<br />
<br />
To check on the status of all Tegras covered by that foopy<br />
<br />
./check.sh<br />
<br />
check.sh is found in /builds on a foopy<br />
<br />
== power cycle a Tegra ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
./check.sh -t tegra-## -c<br />
<br />
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU]].<br />
<br />
== clear an error flag ==<br />
<br />
Find the Tegra on the Dashboard, ssh to that foopy and then<br />
<br />
ssh cltbld@foopy05<br />
./check.sh -t tegra-002 -r<br />
<br />
== restart Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./stop_cp.sh tegra-###<br />
<br />
check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found then you will need to kill them manually. Once clear, run<br />
<br />
./start_cp.sh tegra-###<br />
<br />
== start Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
cd /builds<br />
./start_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter then it will only attempt to start that Tegra, otherwise it will walk thru all Tegras found in /builds/tegra-*<br />
<br />
== stop Tegra(s) ==<br />
<br />
First find the foopy server for the Tegra and then run:<br />
<br />
cd /builds<br />
./stop_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter then it will only attempt to stop that Tegra, otherwise it will walk thru all Tegras found in /builds/tegra-*<br />
<br />
At the end of the shutdown process, stop_cp.sh will run<br />
<br />
ps auxw | grep "tegra-###"<br />
<br />
to allow you to check that all associated or spawned child processes have also been stopped. Sadly some of them love to zombie and that just ruins any summer picnic.<br />
<br />
== find Tegras that are hung ==<br />
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.<br />
<br />
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse model name with a tegra host name, e.g. <tt>tegra-250</tt>.)''. Currently (2011-12-20) all tegras on a foopy use the same build master:<br />
<br />
{| border="1" cellpadding="2"<br />
!foopy #!!Master URL<br />
|-<br />
| <18<br />
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]<br />
|-<br />
| >=18 & even<br />
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]<br />
|-<br />
| >18 & odd<br />
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]<br />
|}<br />
<br />
Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.<br />
<br />
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)<br />
<br />
=== whack a hung Tegra ===<br />
The only way currently to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.<br />
<br />
The manual way to do it is to run:<br />
<br />
ps auxw | grep server.js | grep tegra-### <br />
<br />
and then kill the result PID. To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code> which automates that task.<br />
<br />
cd /builds<br />
./kill_stalled.sh 042 050 070 099<br />
<br />
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.<br />
<br />
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.<br />
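<br />
For reference, a rough Python equivalent of that ps/kill loop (a sketch only, not the actual <tt>kill_stalled.sh</tt>):<br />
<br />
 import os<br />
 import signal<br />
 import subprocess<br />
 import sys<br />
 # Usage: python kill_stalled_sketch.py 042 050 070  (hypothetical helper name)<br />
 for tegra_id in sys.argv[1:]:<br />
     ps = subprocess.Popen(["ps", "auxww"], stdout=subprocess.PIPE)<br />
     out = ps.communicate()[0].decode()<br />
     pids = [line.split()[1] for line in out.splitlines()<br />
             if "server.js" in line and ("tegra-%s" % tegra_id) in line]<br />
     if not pids:<br />
         print("tegra-%s: none found" % tegra_id)<br />
         continue<br />
     for pid in pids:<br />
         print("tegra-%s: killing server.js pid %s" % (tegra_id, pid))<br />
         os.kill(int(pid), signal.SIGTERM)<br />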
<br />
== Reboot a Tegra through the PDU ==<br />
cd /builds<br />
python sut_tools/tegra_powercycle.py ###<br />
<br />
You will see the snmpset call result if it worked.<br />
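<br />
Presumably tegra_powercycle.py maps the tegra number to its PDU outlet and issues an snmpset; the sketch below only illustrates that idea -- the PDU host, community, OID and value are all placeholders, not the real mapping:<br />
<br />
 import subprocess<br />
 # All values are placeholders for illustration; the real PDU/outlet mapping<br />
 # lives in the production tooling (see tegras.json in the References section).<br />
 PDU_HOST = "pdu1.build.mozilla.org"<br />
 OUTLET_OID = "1.3.6.1.4.1.318.1.1.4.4.2.1.3.5"   # example outlet-control OID<br />
 REBOOT_VALUE = "3"                                # example "reboot" value<br />
 cmd = ["snmpset", "-v", "1", "-c", "private", PDU_HOST,<br />
        OUTLET_OID, "i", REBOOT_VALUE]<br />
 print(" ".join(cmd))<br />
 subprocess.call(cmd)<br />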
<br />
If rebooting via PDU does not clear the problem, here are things to try:<br />
* reboot again - fairly common to have 2nd one clear it<br />
** especially if box responsive to ping & telnet (port 20701) after first reboot<br />
<br />
== check.py options ==<br />
<br />
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py], '''find the appropriate foopy server''' and run:<br />
<br />
cd /builds<br />
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]<br />
<br />
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction<br />
* -r reset any error.flg semaphore if found and send "rebt" command to tegra<br />
* -c powercycle the Tegra by telneting to the appropriate PDU<br />
<br />
This will scan a given Tegra (or all of them) and report back its status.<br />
<br />
== Start ADB ==<br />
On the Tegra do:<br />
telnet tegra-### 20701<br />
exec su -c "setprop service.adb.tcp.port 5555"<br />
exec su -c "stop adbd"<br />
exec su -c "start adbd"<br />
<br />
On your computer do:<br />
adb tcpip 5555<br />
adb connect <ipaddr of tegra><br />
adb shell<br />
<br />
== Move a tegra from one foopy to another ==<br />
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.<br />
<br />
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)<br />
<br />
# update foopies.sh & tegras.json in your working directory<br />
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt><br />
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt><br />
# in buildbot, request a "graceful shutdown"<br />
#* wait for tegra to show "idle"<br />
# on the old foopy:<br />
#* stop the tegra via <tt>/builds/stop_cp.sh</tt><br />
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file<br />
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}<br />
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):<br />
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt><br />
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file<br />
#* manually run <tt>cd /builds; ./create_dirs.sh</tt><br />
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt><br />
# on the new foopy:<br />
#* restart the tegras using <tt>cd /builds ; ./start_cp.sh</tt><br />
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.<br />
<br />
== Move a tegra from staging to production ==<br />
<br />
# If the tegra is running, stop it: <tt>/builds/stop_cp.sh tegra-###</tt><br />
# Edit the tegra's buildbot.tac: <tt>/builds/tegra-###/buildbot.tac</tt><br />
# Adjust the master, port and password to the appropriate server (see the example tac sketch below)<br />
# Save and restart the Tegra: <tt>/builds/start_cp.sh tegra-###</tt><br />
<br />
'''Note''' - yes, it's a manual process (sorry) until Tegras are in SlaveAlloc<br />
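<br />
For orientation, a stock buildbot-slave 0.8-era <tt>buildbot.tac</tt> looks roughly like the sketch below; every value shown is a placeholder, not a production setting:<br />
<br />
 from twisted.application import service<br />
 from buildslave.bot import BuildSlave<br />
 # All values below are placeholders, not the production settings.<br />
 basedir = '/builds/tegra-042'<br />
 buildmaster_host = 'buildbot-masterNN.build.mozilla.org'<br />
 port = 9201<br />
 slavename = 'tegra-042'<br />
 passwd = 'changeme'<br />
 keepalive = 600<br />
 usepty = 0<br />
 application = service.Application('buildslave')<br />
 s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,<br />
                keepalive, usepty)<br />
 s.setServiceParent(application)<br />
<br />
The master, port and password from step 3 correspond to <tt>buildmaster_host</tt>, <tt>port</tt> and <tt>passwd</tt> here.<br />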
<br />
= Environment =<br />
<br />
The Tegra builders are run on multiple "foopy" servers with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment and they share common tool and talos environments -- all found inside of '''/builds'''.<br />
<br />
* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py<br />
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it<br />
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)<br />
<br />
Tegra is the short name for the Tegra 250 Developer Kit test board, see http://developer.nvidia.com/tegra/tegra-devkit-features for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.<br />
<br />
Unlike the N900s, we don't run a buildbot environment on the device; instead we communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.<br />
<br />
= References =<br />
<br />
== One source of truth ==<br />
<br />
As of Oct 2011, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/tegras.json <tt>tools/buildfarm/mobile/tegras.json</tt>] should be the most authoritative document.<br />
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>tegras.json</tt> updated.<br />
* if you find a PDU not labeled per the <tt>tegras.json</tt> file, file a releng bug to update the human labels.</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=392697ReleaseEngineering/Maintenance2012-02-01T05:11:18Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure: buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page to track down whether a reconfig caused intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 20120131 2100 PST<br />
|<br />
* {{bug|718777}} - updating configs for mozilla-beta so we can get native builds going<br />
* {{bug|722719}} - change to SDK 14<br />
* {{bug|722951}} - (temporarily) redirect aurora updates to test channel<br />
* {{bug|722940}} - codesize upload broken for SeaMonkey [and Thunderbird] due to tools dir being incorrect<br />
* {{bug|718777}} - Tracking bug for build and release of Firefox/Fennec 11.0b1. Poll signed Fennec APKs for all signed <br />
* {{bug|708656}} - Use signing on demand for releases. Use AggregatingScheduler for repack_complete<br />
* {{bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|660480}} - RETRY on common tegra errors<br />
|-<br />
| in production<br />
| 20120127 1130 PST<br />
|<br />
* {{bug|719697}} - robocop isn't signed properly from buildbot builds<br />
|-<br />
| in production<br />
| 20120127 1040 PST<br />
|<br />
* {{bug|721488}} - deployed new pageloader.xpi<br />
|-<br />
| in production<br />
| 20120127 0730 PST<br />
|<br />
* {{Bug|719544}}. talos_from_source.py - Make the pine branch to allow downloading talos.zip from any place like on 'try'<br />
* {{bug|717662}} - Please disable debug builds and tests on the profiling branch<br />
* {{bug|720782}} - If we dont_build a platform on project_branches we should not add testers for it<br />
* {{bug|721360}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|721573}} - Sign the profile branch nightlies using the m-c nightly key<br />
* {{bug|717106}} - Release automation for ESR<br />
* {{bug|698827}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|715966}} - branch 1.9.2 confusingly set on talos tbpl logs<br />
* {{bug|718828}} - Don't wait for NFS cache at the end of the updates builder<br />
* {{bug|705403}} - <strike>Sendchanges [on windows] from build steps are being done from old buildbot version</strike> - backed out<br />
* {{bug|683417}} - retry.py didn't actually kill process tree for a timed-out pushsnip<br />
* {{bug|673834}} - Obsolete ReleaseRepackFactory, fold logic into CCReleaseRepackFactory<br />
|-<br />
| in production<br />
| 20120123 1435 PST<br />
|<br />
* {{bug|719859}} - remove double posting ts_paint and tpaint. p=armenzg<br />
* {{bug|718445}} - stage-old should be referenced as stage in scripts/configs. p=bhearsum<br />
|-<br />
| in production<br />
| 20120123 1405 PST<br />
|<br />
* {{Bug|649641}} - use ntpd on linux32/linux64 ix slaves<br />
|-<br />
| in production<br />
| 20120123 1140 PST<br />
|<br />
* {{Bug|711619}} - Add Android builds+tests and periodic PGO on the Fx-Team branch, p=philor<br />
* {{Bug|719859}} - Side by side on mozilla-central for ignore_first changes. p=jmaher<br />
|-<br />
| in production<br />
| 20120123 0730 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
* {{bug|719772}} - Sign Callek up for the full release process e-mails<br />
* {{bug|716561}} - reevaluate which release mail gets sent to release-drivers<br />
* {{bug|561198}} - compress leak test / codesighs logs prior to uploading<br />
* {{bug|699219}} - Add automated clean up of hg-shared directory<br />
* {{bug|714284}} - L10n mac dep builds busted on central and aurora<br />
* {{bug|719261}} - Add more logging to AggregatingScheduler<br />
|-<br />
| in production<br />
| 20120119 1200 PST<br />
|<br />
* {{bug|719504}} - disable peptest.<br />
* {{bug|715219}} - off-by-one bustage fix for tegra android range<br />
|-<br />
| in production<br />
| 20120119 1100 PST<br />
|<br />
* {{bug|699219}} - purge shared hg repos<br />
|-<br />
| in production<br />
| 20120117 1230 PST<br />
|<br />
* {{bug|695351}} - android mochitests to use in-tree manifest<br />
* {{bug|700415}} - peptest on try<br />
* {{bug|712750}} - print more data for screenresolution in buildbot factories<br />
|-<br />
| in production<br />
| 20120117 0800 PST<br />
|<br />
* {{Bug|698827}} - Run 10.5 leak builds on 10.6 machines for try. p=armenzg<br />
|-<br />
| in production<br />
| 20120116 1325 PST<br />
|<br />
* Require branch parameter to clobberer HTML interface<br />
|-<br />
| in production<br />
| 20120113 07:00 PST<br />
|<br />
* {{bug|714490}} - make hgtool handle mirror/master hg outages better<br />
|-<br />
| in production<br />
| 20120112 16:40 PST<br />
|<br />
* {{bug|712422}} - add a --bootstrap cli flag to reftest/crashtest/jsreftest for android<br />
* {{bug|698425}} - enable android and android-xul l10n repacks<br />
* Bustage fix. Changeset fa1c76238b7c<br />
* {{bug|713442}} - point 1.9.2 release configs to the compare-locales RELEASE_0_8_2 tag<br />
* {{bug|717621}} - Remove decommissioned slaves<br />
* {{bug|698425}} - android and android-xul l10n mozconfig<br />
* {{bug|567274}} - Talos should halt on download or unzip failure<br />
|-<br />
| in production<br />
| 20120109 1806 PST<br />
|<br />
* stage rather than masters<br />
* {{bug|712008}} - Always trim revision to 12 chars<br />
* {{bug|716431}} - Block asc files for partial mars in latest-<branch> dirs (stage)<br />
|-<br />
| in production<br />
| 20120106 1300 PST<br />
|<br />
* {{bug|715623}} - add --cachedir support to signtool.py<br />
|-<br />
| in production<br />
| 20120104 1315 PDT<br />
|<br />
* Back out 7a7847f7fc05 ({{bug|711275}}: Make sure appVersion changes with every Firefox 10 beta)<br />
* {{bug|712008}} - Pass platform to post_upload.py for shark<br />
* {{bug|681948}} - Automatically retry after a devicemanager.DMError<br />
* {{bug|715119}} - [signing-server] Bump token TTL<br />
* {{bug|713161}} - new high tegra added<br />
* {{bug|711221}} - turn on create_snippet and create_partial for profiling branch<br />
* {{bug|712150}} - bustage fix for linux,m-r and xulrunner in-tree mozconfig path<br />
|-<br />
| in production<br />
| 20111222 0800 PDT<br />
|<br />
* {{Bug|710350}} - Don't hard-code 'firefox' and 'fennec' in misc.py.<br />
* {{Bug|707152}} - enable leaktest for 10.6 everywhere except some release branches.<br />
* {{bug|711367}} - enable android-xul tests<br />
* {{Bug|673131}} - Enable talos_from_source_code.<br />
* {{bug|712094}} - re-enable aurora updates. <br />
* {{bug|711275}} - Make sure appVersion changes with every Firefox 10 beta. r=rail<br />
|-<br />
| in production<br />
| 20111221<br />
|<br />
* {{bug|683734}} - added a bunch of talos-r3 slaves to production<br />
|-<br />
| in production<br />
| 20111221 1300 PST<br />
|<br />
* {{bug|558180}} - use in-tree mozconfigs for releases<br />
* {{bug|709114}} - add locales to aurora<br />
* {{bug|710842}} - re-enable symbols for nightly fennec xul builds<br />
* {{bug|711221}} - rename private-browsing branch to 'profiling'<br />
* {{bug|712133}} - firefox 10.0b1 release configs<br />
|-<br />
| in production<br />
| 20111221 1100 PST<br />
|<br />
* {{bug|712208}} - update binutils to 2.22<br />
|-<br />
| in production<br />
| 20111220 0610 PST<br />
|<br />
* <strike>{{bug|673131}} - when minor talos changes land, the a-team should be able to deploy with minimal releng time required</strike> - backed-out<br />
* {{bug|704582}} - [tracking bug] deploy 83 tegras<br />
* {{bug|712115}} - L10n mac nightlies busted on central and aurora<br />
* {{bug|710453}} - Release Engineering changes for the Firefox 11 merge to Aurora on Dec 20<br />
* {{bug|712094}} - push mozilla-aurora updates to auroratest channel until merge stabilizes<br />
* {{bug|712068}} - Adjust default releasetestUptake value<br />
|-<br />
| in production<br />
| 20111219 1000 PST<br />
| <br />
* {{bug|707941}} - Improve token generation step<br />
* {{bug|711179}} - fix for missing symbols for non-mobile tests<br />
* {{bug|710453}} - android-xul mozilla-release mozconfigs<br />
* {{bug|711978}} - Refresh staging release configs<br />
|-<br />
| in production<br />
| 20111217 0800 PST<br />
|<br />
* {{bug|509158}} - enable signing on all branches<br />
|-<br />
| backed out<br />
| 20111216 1700 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
|-<br />
| in production<br />
| 20111215 0830 PST<br />
|<br />
* {{bug|711064}} Fix puppet dependencies<br />
|-<br />
| in production<br />
| 20111214 0800 PST<br />
|<br />
* {{bug|509158}} Reduce default token time to 2 hours; fix last-complete-mar detection<br />
* {{bug|683734}} Add new rev3 machines.<br />
* {{bug|708475}} accept 'mochitest' and 'reftests' as synonyms for 'mochitests' and 'reftest' (with tests)<br />
* {{bug|708859}} android signature verification should look for android-arm.apk<br />
* {{bug|709233}} reenable android and android-xul multilocale for m-c nightlies<br />
* {{bug|709383}} Turn off win64 signing on m-c<br />
* {{bug|709979}} Set the branch property for projects/addon-sdk jobs to just addon-sdk<br />
* {{bug|710048}} decrease interval between mozilla-inbound pgo builds<br />
* {{bug|710050}} never merge pgo builds<br />
* {{bug|710085}} Pass mozillaDir argument to NightlyBuildFactory<br />
* {{bug|710221}} Implement AggregatingScheduler<br />
|-<br />
| in production<br />
| 20111208 0920 PST<br />
|<br />
* {{bug|509158}} Fix nightly snippet generation, reduce default token time, and enable signing on inbound<br />
* {{bug|707666}} Enable win64 signing on elm<br />
* {{bug|708341}} Turn off android-xul talos tests<br />
|-<br />
| in production<br />
| 20111206 1300 PST ish<br />
|<br />
* {{bug|509158}} Don't enable signing for l10n check steps.<br />
* {{bug|509158}} Sign builds as part of the build process: enable signing server for debug builds; disable pre-signed updater on elm.<br />
* {{bug|671450}} Try different sources for revision in log_uploader<br />
* {{bug|706832}} Implement master side token generation for signing on demand.<br />
* {{bug|509158}} Enable signing for mozilla-central windows builds.<br />
* {{bug|704549}} reenable android native on m-c.<br />
* {{bug|703772}} disable android-xul updates + uploadsymbols.<br />
|-<br />
| in production<br />
| 20111205 0800 PST<br />
|<br />
* {{bug|509158}} - signing builds (elm/oak only, hopefully)<br />
* {{bug|706832}} - Implement master side token generation for signing on demand. r=catlee,bhearsum<br />
* {{bug|671450}} - Try different sources for revision in log_uploader - r=nthomas<br />
* {{bug|707152}} - enable leaktests for m-i, try and m-c on macos64-debug. r=rail.<br />
* {{bug|706720}} - Post to graphs-old. r=catlee<br />
|-<br />
| in production<br />
| 20111202 1600 PST<br />
|<br />
* {{bug|509158}} - tools for signing builds<br />
|-<br />
| in production<br />
| 20111201 1100 PST<br />
|<br />
* {{bug|694332}} - Use make tier_nspr when building for l10n - r=armenzg<br />
* {{bug|693352}} r=aki add minidump_stackwalk and symbols to the android automation<br />
* {{bug|705936}} - reconfigs should re-generate master_config.json a=aki<br />
|-<br />
| in production<br />
| 20111201 0900 PST<br />
| <br />
* {{bug|704555}} - deploy rss for tp4m on android (required android talos update)<br />
|-<br />
| in production<br />
| 20111128 1448 PST<br />
|<br />
* {{bug|701684}} - remove mozilla-1.9.1 from config.py. r=bhearsum<br />
* add r4 slaves 080-085 to configs r=catlee<br />
* {{bug|705040}} - reenable native android builds on try. r=bhearsum<br />
* {{bug|691483}} - update MU to 3.6.24 -> 8.0.1, r=lsblakk<br />
|-<br />
| in production<br />
| 20111124 0815 PST<br />
|<br />
* {{bug|703010}} - backfill unresponsive tegras<br />
* {{bug|702390}} - reimage buildbot-master2 and buildbot-master5 as w32-ix-slave43 and w32-ix-slave44<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
* {{bug|699838}} - Set up a project branch to allow us to run several iterations for metrics<br />
* {{bug|700534}} - make local buildbot-config modification on test-master01 permanent<br />
* {{bug|700860}} - Put mw32-ix-slave26 into the production pool<br />
* {{bug|676155}} - install r3 mini 02456 as talos-r3-w7-065<br />
* {{bug|704028}} - xulrunner release bundles often timeout<br />
* {{bug|697802}} - https://bugzilla.mozilla.org/show_bug.cgi?id=697802<br />
|-<br />
| in production<br />
| 20111121 1300 PST<br />
|<br />
* {{bug|702351}} - enable tp_responsiveness on m-c<br />
* {{bug|700705}} - remove more slaves<br />
* add talos-r4-snow-060 to 080 back to the pool<br />
* {{bug|692692}} - re-enable PGO for Win64<br />
* {{bug|701766}} - Remove tegra slaves that had not taken any jobs and are not coming back to production any time soon<br />
* {{bug|704200}} - android dep builds permared after bug 701864; sometimes causing nightlies not to trigger - disable native android builders everywhere except birch<br />
|-<br />
| in production<br />
| 20111118 0700 PST<br />
|<br />
* {{bug|700513}} - set BINSCOPE for win32 on try<br />
* {{bug|702631}} - linux, linux64 and mac partner repacks aren't triggered<br />
* {{bug|703280}} - Use dev-stage01 as SYMBOL_SERVER_HOST for staging try builds<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
|-<br />
| in production<br />
| 20111117 0600 PST<br />
|<br />
* {{bug|702834}} - Pass mozillaDir to dep factory.<br />
* {{bug|701864}} - support mobile builds+repacks out of mobile/, mobile/xul/, and mobile/android/.<br />
* {{bug|701766}} - remove staging tegras.<br />
* {{bug|700513}} - Add BINSCOPE env var to win32, win32-debug, and win32-mobile<br />
* {{bug|701476}} - split android reftests from 2 chunks to 3 chunks.<br />
* {{bug|702357}} - enable new tegras for production<br />
* {{bug|702368}} - add hangmonitor.timeout=0 pref to dirty jobs.<br />
* {{bug|702645}} - win32_repack_beta broken due to "LINK : fatal error LNK1104: cannot open file 'mozcrt.lib'".<br />
* {{bug|548551}} - Turn off arm nanojit builds.<br />
* {{bug|700705}} - Remove a bunch of decommissioned slaves.<br />
* {{bug|683734}} - remove talos-r3-snow machines, remove snowleopard-r4 platform, move talos-r4-snow to snowleopard platform<br />
|-<br />
| in production<br />
| 20111116 0700 PST<br />
|<br />
* {{Bug|702351}} - deploy talos.zip which includes responsiveness <br />
|-<br />
| in production<br />
| 20111111 1712 PST<br />
|<br />
* {{bug|697389}} - multilocale birch android nightlies, against l10n-central.<br />
* {{bug|697404}} - disable tp4m for birch<br />
|-<br />
| in production<br />
| 20111110 1200 PST<br />
|<br />
* {{bug|700901}} - reorder mozconfig to get past mozconfig diff. p=aki<br />
* {{bug|700901}} - fix l10n relbranch. p=aki<br />
* {{bug|701116}} - Mobile desktop builds should be nightly-only. p=rail<br />
* {{bug|701113}} - maemo tier 3 (removing all maemo references except mobile/) p=aki<br />
* {{Bug|672132}} - Run beta and release releases in preproduction. p=rail<br />
* {{bug|698946}} - further setup-masters.py improvements p=jhford<br />
|-<br />
| in production<br />
| 20111108 1630 PST<br />
|<br />
* {{Bug|699407}} - Set mirror / bundle URLs. p=catlee<br />
* {{bug|700721}} - update buildbot-configs for merge of nightly->aurora and aurora->beta p=lsblakk<br />
* {{bug|700453}} - make test-master01 tegra specific. p=aki<br />
* {{Bug|700794}} - Disable aurora daily updates until merge to mozilla-aurora is good. p=armenzg<br />
* {{Bug|700737}} - Remove slaves given to Thunderbird. p=armenzg<br />
|-<br />
| in production<br />
| 20111108 1100 PST<br />
| {{bug|687064}} - hgtool work. p=catlee<br />
|-<br />
| in production<br />
| 20111107 0930 PDT<br />
| {{bug|660124}} - remove "paint" set. p=armenzg<br />
|-<br />
| in production<br />
| 20111107 0845 PDT<br />
|<br />
* {{bug|692812}} - add ability to have pgo strategies p=jhford<br />
* {{bug|693771}} - add 10.7 test slaves to buildbot configs p=jhford<br />
* {{bug|698837}} - use signed updater.exe for elm and oak branches. p=bhearsum<br />
* {{Bug|695921}} - removing duplicated entry for ftp_url on jetpack p=lsblakk <br />
* {{bug|698837}} - use signed updater.exe for elm and oak project branches. p=bhearsum<br />
* {{Bug|660124}} - replace ts/twinopen with ts_paint/tpaint and some cleanup. p=armenzg<br />
* {{Bug|699802}} - enable_leaktests for m-i and try. p=armenzg<br />
|-<br />
| in production<br />
| 20111028 1205 PDT<br />
|<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
* {{bug|695921}} - test per checkin addons-sdk against opt & debug across mozilla-{beta,central,aurora,release} latest tinderbox builds<br />
|-<br />
| in production<br />
| 20111025 1200 PDT<br />
|<br />
* {{bug|681855}} - Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"<br />
* {{bug|697112}} - add more twigs<br />
* {{bug|689649}} - update buildbot config.py to adjust side by side talos staging for mozafterpaint<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
|-<br />
| in production<br />
| 20111021 0932 PDT<br />
|<br />
* {{bug|683448}} - Permission check and virus scan tests shouldn't fail if files pushed to the releases directory<br />
* {{bug|689649}} - disable old_suites for mozilla-beta<br />
* {{bug|692504}} - push betas to internal mirrors automatically<br />
* {{bug|693015}} - disable android debug tests<br />
* {{bug|694077}} - add aus2_mobile_* to the "update branch vars loop" in config.py<br />
* {{bug|694893}} - Bump disk space requirement for codecoverage to 7G<br />
* {{bug|695161}} - backout 1318d1bbc15a to re-enable Win64 updates<br />
* {{bug|695429}} - FF8 beta4 config changes<br />
* {{bug|696165}} - enable tegras 129 - 153<br />
|-<br />
| in production<br />
| 20111019 1100 PDT<br />
| {{bug|695525}} Pulse enabled on test-master01<br />
|-<br />
| in production (build only)<br />
| 20111017 1728 PDT<br />
|<br />
* {{bug|695161}} Disable updates to broken Win x64 builds<br />
|-<br />
| in production<br />
| 20111017 1100 PDT<br />
|<br />
* {{bug|690860}} enable android debug nightly on m-c<br />
* {{bug|694235}} config tests shouldn't fail if there are no try slaves<br />
* {{bug|694106}} remove tegra try pool<br />
* {{bug|676879}} Config changes required to run valgrind as a nightly builder<br />
* {{bug|694716}} patch by joel to fix broken mochitests due to bug 691411<br />
* {{bug|694077}} Enable nightlies builds and updates for birch branch<br />
|-<br />
| in production<br />
| 20111017 0900 PDT<br />
|<br />
* {{bug|694579}} deployed newer talos.zip<br />
|-<br />
| in production<br />
| 20111012 0735 PDT<br />
|<br />
* backout {{bug|692928}}.jhford<br />
* {{Bug|693903}} Update slaves for staging and preproduction configs. rail<br />
* {{Bug|692823}} Reduce PGO sets to 6 hours until bug 691675 is fixed. armenzg<br />
* {{Bug|693686}} PGO talos is submitting to the Firefox-Non-PGO tree. armenzg<br />
|-<br />
| in production<br />
| 20111011 1515 PDT (for build masters, others later)<br />
|<br />
* {{bug|693350}} - Don't try to add bouncer entries in preproduction<br />
* {{bug|692388}} - mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
* No Bug, do compare_attrs better for DependentL10n, so we don't throw in dump_masters. Will followup later to get compare_attrs better for all of buildbotcustom. Not used for Firefox builds, so NPOTB<br />
* {{bug|693686}} - PGO talos builds reporting to Non-PGO branches<br />
* {{bug|693794}} - remove unneeded usebuildbot=1 from tbpl links in try emails<br />
|-<br />
| in production<br />
| 20111007 1550 PDT<br />
|<br />
* {{bug|692928}} turn off rev4 on try <br />
* {{bug|692910}} Update preproduction test slave list<br />
* {{bug|688296}} python module conflict with xcode module<br />
* {{bug|692646}} enable PGO on release builds again<br />
* {{bug|692388}} mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
|-<br />
| in production<br />
| 20111006 1230 PDT<br />
| <br />
* {{bug|681834}} Insert finished jobs in the statusdb more frequently<br />
* {{bug|686578}} SpiderMonkey builds on IonMonkey TBPL - enable all debug spidermonkey builds on ionmonkey<br />
* {{bug|687832}} create generic RETRY signifier, and make retry.py print it when it fails to successfully run<br />
* {{bug|692358}} Fix log uploading for PGO builds and tests<br />
* {{bug|692370}} Add branch name to PGO scheduler so that it shows up on self-serve<br />
|-<br />
| in production<br />
| 20111005 1050 PDT<br />
| <br />
* {{bug|558180}} - use in tree mozconfigs for win64<br />
* {{bug|658313}} - disable PGO for per-checkin builds<br />
* {{bug|683721}} - add rev4 testers to buildbot-configs<br />
* {{bug|668724}} - ensure branch is not None when needed<br />
|-<br />
| in production<br />
| 20111005 1821 PDT<br />
|<br />
* Backed out: {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|-<br />
| in production<br />
| 20111005 1632 PDT<br />
|<br />
* {{bug|671450}} - Backout log_uploader.py change, as got_revision doesn't exist on test jobs<br />
|-<br />
| in production<br />
| 20111005 1546 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (buildbot part)<br />
* {{bug|686831}} - Stop TinderboxPrint-ing the rev early for try<br />
* {{bug|691483}} - Do 3.6.23 -> 7.0.1 advertised major update<br />
* {{bug|689750}} - stop sending sendchanges to jhford's personal master<br />
|-<br />
| in production<br />
| 20111005 1526 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|}<br />
<br />
==Archive==<br />
<br />
[[ReleaseEngineering:BuildbotMasterChanges:Archive | Older Changes]]<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 3efbac1f685a<br />
| unknown<br />
| unknown<br />
| unknown<br />
|}<br />
<br />
Update Procedure:<br />
ssh to bm-remote-talos-webhost-01<br />
cd /var/www/html/talos<br />
hg pull && hg up<br />
rsync -az --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -az --delete . bm-remote-talos-webhost-03:/var/www/html/.<br />
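<br />
The same procedure can be scripted; this is a minimal sketch of the steps above, assuming it is run on bm-remote-talos-webhost-01, that rsync to the mirror hosts works without a password prompt, and that every host in the Servers list below should receive the sync:<br />
<br />
 #!/bin/bash<br />
 # Run on bm-remote-talos-webhost-01; mirrors the manual update steps above.<br />
 set -e<br />
 cd /var/www/html/talos<br />
 # Pull and update the talos checkout on the primary webhost.<br />
 hg pull && hg up<br />
 # Fan the updated tree out to the other webhosts (host list taken from the Servers section below).<br />
 for host in bm-remote-talos-webhost-02 bm-remote-talos-webhost-03; do<br />
     rsync -az --delete . "$host:/var/www/html/."<br />
 done<br />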
<br />
Servers:<br />
* bm-remote-talos-webhost-01.build.mozilla.org<br />
* bm-remote-talos-webhost-02.build.mozilla.org<br />
* bm-remote-talos-webhost-03.build.mozilla.org<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=392696ReleaseEngineering/Maintenance2012-02-01T05:06:48Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure: buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
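<br />
For a local working copy, the hg repositories above can be cloned directly. This is a minimal sketch; it only covers the four hg repositories, since the talos link above points at a source browser rather than an hg repository:<br />
<br />
 # Clone the RelEng hg repositories listed above into the current directory.<br />
 for repo in buildbot buildbot-configs buildbotcustom tools; do<br />
     hg clone http://hg.mozilla.org/build/$repo<br />
 done<br />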
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page to track down if reconfigs caused debug intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
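<br />
For reference, a new entry is just another row added at the top of the table below, in the same format as the existing rows; the outcome, date, bug number and description here are purely illustrative:<br />
<br />
 |-<br />
 | in production<br />
 | 20120201 0900 PST<br />
 |<br />
 * {{bug|123456}} - short description of the change. p=whoever<br />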
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 20120131 2100 PST<br />
|<br />
* {{bug|722719}} - change to SDK 14<br />
* {{bug|722951}} - (temporarily) redirect aurora updates to test channel<br />
* {{bug|722940}} - codesize upload broken for SeaMonkey [and Thunderbird] due to tools dir being incorrect<br />
* {{bug|718777}} - Tracking bug for build and release of Firefox/Fennec 11.0b1. Poll signed Fennec APKs for all signed <br />
* {{bug|708656}} - Use signing on demand for releases. Use AggregatingScheduler for repack_complete<br />
* {{bug|719260}} - Investigate why updates builder triggered twice for 10.0b5<br />
* {{bug|660480}} - RETRY on common tegra errors<br />
|-<br />
| in production<br />
| 20120127 1130 PST<br />
|<br />
* {{bug|719697}} - robocop isn't signed properly from buildbot builds<br />
|-<br />
| in production<br />
| 20120127 1040 PST<br />
|<br />
* {{bug|721488}} - deployed new pageloader.xpi<br />
|-<br />
| in production<br />
| 20120127 0730 PST<br />
|<br />
* {{Bug|719544}} - talos_from_source.py - Make the pine branch allow downloading talos.zip from any place, like on 'try'<br />
* {{bug|717662}} - Please disable debug builds and tests on the profiling branch<br />
* {{bug|720782}} - If we dont_build a platform on project_branches we should not add testers for it<br />
* {{bug|721360}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|721573}} - Sign the profile branch nightlies using the m-c nightly key<br />
* {{bug|717106}} - Release automation for ESR<br />
* {{bug|698827}} - Bug 698827 - Run 10.5 leak builds on 10.6 machines for aurora<br />
* {{bug|715966}} - branch 1.9.2 confusingly set on talos tbpl logs<br />
* {{bug|718828}} - Don't wait for NFS cache at the end of the updates builder<br />
* {{bug|705403}} - <strike>Sendchanges [on windows] from build steps are being done from old buildbot version</strike> - backed out<br />
* {{bug|683417}} - retry.py didn't actually kill process tree for a timed-out pushsnip<br />
* {{bug|673834}} - Obsolete ReleaseRepackFactory, fold logic into CCReleaseRepackFactory<br />
|-<br />
| in production<br />
| 20120123 1435 PST<br />
|<br />
* {{bug|719859}} - remove double posting ts_paint and tpaint. p=armenzg<br />
* {{bug|718445}} - stage-old should be referenced as stage in scripts/configs. p=bhearsum<br />
|-<br />
| in production<br />
| 20120123 1405 PST<br />
|<br />
* {{Bug|649641}} - use ntpd on linux32/linux64 ix slaves<br />
|-<br />
| in production<br />
| 20120123 1140 PST<br />
|<br />
* {{Bug|711619}} - Add Android builds+tests and periodic PGO on the Fx-Team branch, p=philor<br />
* {{Bug|719859}} - Side by side on mozilla-central for ignore_first changes. p=jmaher<br />
|-<br />
| in production<br />
| 20120123 0730 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
* {{bug|719772}} - Sign Callek up for the full release process e-mails<br />
* {{bug|716561}} - reevaluate which release mail gets sent to release-drivers<br />
* {{bug|561198}} - compress leak test / codesighs logs prior to uploading<br />
* {{bug|699219}} - Add automated clean up of hg-shared directory<br />
* {{bug|714284}} - L10n mac dep builds busted on central and aurora<br />
* {{bug|719261}} - Add more logging to AggregatingScheduler<br />
|-<br />
| in production<br />
| 20120119 1200 PST<br />
|<br />
* {{bug|719504}} - disable peptest.<br />
* {{bug|715219}} - off-by-one bustage fix for tegra android range<br />
|-<br />
| in production<br />
| 20120119 1100 PST<br />
|<br />
* {{bug|699219}} - purge shared hg repos<br />
|-<br />
| in production<br />
| 20120117 1230 PST<br />
|<br />
* {{bug|695351}} - android mochitests to use in-tree manifest<br />
* {{bug|700415}} - peptest on try<br />
* {{bug|712750}} - print more data for screenresolution in buildbot factories<br />
|-<br />
| in production<br />
| 20120117 0800 PST<br />
|<br />
* {{Bug|698827}} - Run 10.5 leak builds on 10.6 machines for try. p=armenzg<br />
|-<br />
| in production<br />
| 20120116 1325 PST<br />
|<br />
* Require branch parameter to clobberer HTML interface<br />
|-<br />
| in production<br />
| 20120113 07:00 PST<br />
|<br />
* {{bug|714490}} - make hgtool handle mirror/master hg outages better<br />
|-<br />
| in production<br />
| 20120112 16:40 PST<br />
|<br />
* {{bug|712422}} - add a --bootstrap cli flag to reftest/crashtest/jsreftest for android<br />
* {{bug|698425}} - enable android and android-xul l10n repacks<br />
* Bustage fix. Changeset fa1c76238b7c<br />
* {{bug|713442}} - point 1.9.2 release configs to the compare-locales RELEASE_0_8_2 tag<br />
* {{bug|717621}} - Remove decommissioned slaves<br />
* {{bug|698425}} - android and android-xul l10n mozconfig<br />
* {{bug|567274}} - Talos should halt on download or unzip failure<br />
|-<br />
| in production<br />
| 20120109 1806 PST<br />
|<br />
* stage rather than masters<br />
* {{bug|712008}} - Always trim revision to 12 chars<br />
* {{bug|716431}} - Block asc files for partial mars in latest-<branch> dirs (stage)<br />
|-<br />
| in production<br />
| 20120106 1300 PST<br />
|<br />
* {{bug|715623}} - add --cachedir support to signtool.py<br />
|-<br />
| in production<br />
| 20120104 1315 PDT<br />
|<br />
* Back out 7a7847f7fc05 ({{bug|711275}}: Make sure appVersion changes with every Firefox 10 beta)<br />
* {{bug|712008}} - Pass platform to post_upload.py for shark<br />
* {{bug|681948}} - Automatically retry after a devicemanager.DMError<br />
* {{bug|715119}} - [signing-server] Bump token TTL<br />
* {{bug|713161}} - new high tegra added<br />
* {{bug|711221}} - turn on create_snippet and create_partial for profiling branch<br />
* {{bug|712150}} - bustage fix for linux,m-r and xulrunner in-tree mozconfig path<br />
|-<br />
| in production<br />
| 20111222 0800 PDT<br />
|<br />
* {{Bug|710350}} - Don't hard-code 'firefox' and 'fennec' in misc.py.<br />
* {{Bug|707152}} - enable leaktest for 10.6 everywhere except some release branches.<br />
* {{bug|711367}} - enable android-xul tests<br />
* {{Bug|673131}} - Enable talos_from_source_code.<br />
* {{bug|712094}} - re-enable aurora updates. <br />
* {{bug|711275}} - Make sure appVersion changes with every Firefox 10 beta. r=rail<br />
|-<br />
| in production<br />
| 20111221<br />
|<br />
* {{bug|683734}} - added a bunch of talos-r3 slaves to production<br />
|-<br />
| in production<br />
| 20111221 1300 PST<br />
|<br />
* {{bug|558180}} - use in-tree mozconfigs for releases<br />
* {{bug|709114}} - add locales to aurora<br />
* {{bug|710842}} - re-enable symbols for nightly fennec xul builds<br />
* {{bug|711221}} - rename private-browsing branch to 'profiling'<br />
* {{bug|712133}} - firefox 10.0b1 release configs<br />
|-<br />
| in production<br />
| 20111221 1100 PST<br />
|<br />
* {{bug|712208}} - update binutils to 2.22<br />
|-<br />
| in production<br />
| 20111220 0610 PST<br />
|<br />
* <strike>{{bug|673131}} - when minor talos changes land, the a-team should be able to deploy with minimal releng time required</strike> - backed-out<br />
* {{bug|704582}} - [tracking bug] deploy 83 tegras<br />
* {{bug|712115}} - L10n mac nightlies busted on central and aurora<br />
* {{bug|710453}} - Release Engineering changes for the Firefox 11 merge to Aurora on Dec 20<br />
* {{bug|712094}} - push mozilla-aurora updates to auroratest channel until merge stabilizes<br />
* {{bug|712068}} - Adjust default releasetestUptake value<br />
|-<br />
| in production<br />
| 20111219 1000 PST<br />
| <br />
* {{bug|707941}} - Improve token generation step<br />
* {{bug|711179}} - fix for missing symbols for non-mobile tests<br />
* {{bug|710453}} - android-xul mozilla-release mozconfigs<br />
* {{bug|711978}} - Refresh staging release configs<br />
|-<br />
| in production<br />
| 20111217 0800 PST<br />
|<br />
* {{bug|509158}} - enable signing on all branches<br />
|-<br />
| backed out<br />
| 20111216 1700 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
|-<br />
| in production<br />
| 20111215 0830 PST<br />
|<br />
* {{bug|711064}} Fix puppet dependencies<br />
|-<br />
| in production<br />
| 20111214 0800 PST<br />
|<br />
* {{bug|509158}} Reduce default token time to 2 hours; fix last-complete-mar detection<br />
* {{bug|683734}} Add new rev3 machines.<br />
* {{bug|708475}} accept 'mochitest' and 'reftests' as synonyms for 'mochitests' and 'reftest' (with tests)<br />
* {{bug|708859}} android signature verification should look for android-arm.apk<br />
* {{bug|709233}} reenable android and android-xul multilocale for m-c nightlies<br />
* {{bug|709383}} Turn off win64 signing on m-c<br />
* {{bug|709979}} Set the branch property for projects/addon-sdk jobs to just addon-sdk<br />
* {{bug|710048}} decrease interval between mozilla-inbound pgo builds<br />
* {{bug|710050}} never merge pgo builds<br />
* {{bug|710085}} Pass mozillaDir argument to NightlyBuildFactory<br />
* {{bug|710221}} Implement AggregatingScheduler<br />
|-<br />
| in production<br />
| 20111208 0920 PST<br />
|<br />
* {{bug|509158}} Fix nightly snippet generation, reduce default token time, and enable signing on inbound<br />
* {{bug|707666}} Enable win64 signing on elm<br />
* {{bug|708341}} Turn off android-xul talos tests<br />
|-<br />
| in production<br />
| 20111206 1300 PST ish<br />
|<br />
* {{bug|509158}} Don't enable signing for l10n check steps.<br />
* {{bug|509158}} Sign builds as part of the build process: enable signing server for debug builds; disable pre-signed updater on elm.<br />
* {{bug|671450}} Try different sources for revision in log_uploader<br />
* {{bug|706832}} Implement master side token generation for signing on demand.<br />
* {{bug|509158}} Enable signing for mozilla-central windows builds.<br />
* {{bug|704549}} reenable android native on m-c.<br />
* {{bug|703772}} disable android-xul updates + uploadsymbols.<br />
|-<br />
| in production<br />
| 20111205 0800 PST<br />
|<br />
* {{bug|509158}} - signing builds (elm/oak only, hopefully)<br />
* {{bug|706832}} - Implement master side token generation for signing on demand. r=catlee,bhearsum<br />
* {{bug|671450}} - Try different sources for revision in log_uploader - r=nthomas<br />
* {{bug|707152}} - enable leaktests for m-i, try and m-c on macos64-debug. r=rail.<br />
* {{bug|706720}} - Post to graphs-old. r=catlee<br />
|-<br />
| in production<br />
| 20111202 1600 PST<br />
|<br />
* {{bug|509158}} - tools for signing builds<br />
|-<br />
| in production<br />
| 20111201 1100 PST<br />
|<br />
* {{bug|694332}} - Use make tier_nspr when building for l10n - r=armenzg<br />
* {{bug|693352}} r=aki add minidump_stackwalk and symbols to the android automation<br />
* {{bug|705936}} - reconfigs should re-generate master_config.json a=aki<br />
|-<br />
| in production<br />
| 20111201 0900 PST<br />
| <br />
* {{bug|704555}} - deploy rss for tp4m on android (required android talos update)<br />
|-<br />
| in production<br />
| 20111128 1448 PST<br />
|<br />
* {{bug|701684}} - remove mozilla-1.9.1 from config.py. r=bhearsum<br />
* add r4 slaves 080-085 to configs r=catlee<br />
* {{bug|705040}} - reenable native android builds on try. r=bhearsum<br />
* {{bug|691483}} - update MU to 3.6.24 -> 8.0.1, r=lsblakk<br />
|-<br />
| in production<br />
| 20111124 0815 PST<br />
|<br />
* {{bug|703010}} - backfill unresponsive tegras<br />
* {{bug|702390}} - reimage buildbot-master2 and buildbot-master5 as w32-ix-slave43 and w32-ix-slave44<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
* {{bug|699838}} - Set up a project branch to allow us to run several iterations for metrics<br />
* {{bug|700534}} - make local buildbot-config modification on test-master01 permanent<br />
* {{bug|700860}} - Put mw32-ix-slave26 into the production pool<br />
* {{bug|676155}} - install r3 mini 02456 as talos-r3-w7-065<br />
* {{bug|704028}} - xulrunner release bundles often timeout<br />
* {{bug|697802}} - https://bugzilla.mozilla.org/show_bug.cgi?id=697802<br />
|-<br />
| in production<br />
| 20111121 1300 PST<br />
|<br />
* {{bug|702351}} - enable tp_responsiveness on m-c<br />
* {{bug|700705}} - remove more slaves<br />
* add talos-r4-snow-060 to 080 back to the pool<br />
* {{bug|692692}} - re-enable PGO for Win64<br />
* {{bug|701766}} - Remove tegra slaves that had not taken any jobs and are not coming back to production any time soon<br />
* {{bug|704200}} - android dep builds permared after bug 701864; sometimes causing nightlies not to trigger - disable native android builders everywhere except birch<br />
|-<br />
| in production<br />
| 20111118 0700 PST<br />
|<br />
* {{bug|700513}} - set BINSCOPE for win32 on try<br />
* {{bug|702631}} - linux, linux64 and mac partner repacks aren't triggered<br />
* {{bug|703280}} - Use dev-stage01 as SYMBOL_SERVER_HOST for staging try builds<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
|-<br />
| in production<br />
| 20111117 0600 PST<br />
|<br />
* {{bug|702834}} - Pass mozillaDir to dep factory.<br />
* {{bug|701864}} - support mobile builds+repacks out of mobile/, mobile/xul/, and mobile/android/.<br />
* {{bug|701766}} - remove staging tegras.<br />
* {{bug|700513}} - Add BINSCOPE env var to win32, win32-debug, and win32-mobile<br />
* {{bug|701476}} - split android reftests from 2 chunks to 3 chunks.<br />
* {{bug|702357}} - enable new tegras for production<br />
* {{bug|702368}} - add hangmonitor.timeout=0 pref to dirty jobs.<br />
* {{bug|702645}} - win32_repack_beta broken due to "LINK : fatal error LNK1104: cannot open file 'mozcrt.lib'".<br />
* {{bug|548551}} - Turn off arm nanojit builds.<br />
* {{bug|700705}} - Remove a bunch of decommissioned slaves.<br />
* {{bug|683734}} - remove talos-r3-snow machines, remove snowleopard-r4 platform, move talos-r4-snow to snowleopard platform<br />
|-<br />
| in production<br />
| 20111116 0700 PST<br />
|<br />
* {{Bug|702351}} - deploy talos.zip which includes responsiveness <br />
|-<br />
| in production<br />
| 20111111 1712 PST<br />
|<br />
* {{bug|697389}} - multilocale birch android nightlies, against l10n-central.<br />
* {{bug|697404}} - disable tp4m for birch<br />
|-<br />
| in production<br />
| 20111110 1200 PST<br />
|<br />
* {{bug|700901}} - reorder mozconfig to get past mozconfig diff. p=aki<br />
* {{bug|700901}} - fix l10n relbranch. p=aki<br />
* {{bug|701116}} - Mobile desktop builds should be nightly-only. p=rail<br />
* {{bug|701113}} - maemo tier 3 (removing all maemo references except mobile/) p=aki<br />
* {{Bug|672132}} - Run beta and release releases in preproduction. p=rail<br />
* {{bug|698946}} - further setup-masters.py improvements p=jhford<br />
|-<br />
| in production<br />
| 20111108 1630 PST<br />
|<br />
* {{Bug|699407}} - Set mirror / bundle URLs. p=catlee<br />
* {{bug|700721}} - update buildbot-configs for merge of nightly->aurora and aurora->beta p=lsblakk<br />
* {{bug|700453}} - make test-master01 tegra specific. p=aki<br />
* {{Bug|700794}} - Disable aurora daily updates until merge to mozilla-aurora is good. p=armenzg<br />
* {{Bug|700737}} - Remove slaves given to Thunderbird. p=armenzg<br />
|-<br />
| in production<br />
| 20111108 1100 PST<br />
| {{bug|687064}} - hgtool work. p=catlee<br />
|-<br />
| in production<br />
| 20111107 0930 PDT<br />
| {{bug|660124}} - remove "paint" set. p=armenzg<br />
|-<br />
| in production<br />
| 20111107 0845 PDT<br />
|<br />
* {{bug|692812}} - add ability to have pgo strategies p=jhford<br />
* {{bug|693771}} - add 10.7 test slaves to buildbot configs p=jhford<br />
* {{bug|698837}} - use signed updater.exe for elm and oak branches. p=bhearsum<br />
* {{Bug|695921}} - removing duplicated entry for ftp_url on jetpack p=lsblakk <br />
* {{bug|698837}} - use signed updater.exe for elm and oak project branches. p=bhearsum<br />
* {{Bug|660124}} - replace ts/twinopen with ts_paint/tpaint and some cleanup. p=armenzg<br />
* {{Bug|699802}} - enable_leaktests for m-i and try. p=armenzg<br />
|-<br />
| in production<br />
| 20111028 1205 PDT<br />
|<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
* {{bug|695921}} - test per checkin addons-sdk against opt & debug across mozilla-{beta,central,aurora,release} latest tinderbox builds<br />
|-<br />
| in production<br />
| 20111025 1200 PDT<br />
|<br />
* {{bug|681855}} - Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"<br />
* {{bug|697112}} - add more twigs<br />
* {{bug|689649}} - update buildbot config.py to adjust side by side talos staging for mozafterpaint<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
|-<br />
| in production<br />
| 20111021 0932 PDT<br />
|<br />
* {{bug|683448}} - Permission check and virus scan tests shouldn't fail if files pushed to the releases directory<br />
* {{bug|689649}} - disable old_suites for mozilla-beta<br />
* {{bug|692504}} - push betas to internal mirrors automatically<br />
* {{bug|693015}} - disable android debug tests<br />
* {{bug|694077}} - add aus2_mobile_* to the "update branch vars loop" in config.py<br />
* {{bug|694893}} - Bump disk space requirement for codecoverage to 7G<br />
* {{bug|695161}} - backout 1318d1bbc15a to re-enable Win64 updates<br />
* {{bug|695429}} - FF8 beta4 config changes<br />
* {{bug|696165}} - enable tegras 129 - 153<br />
|-<br />
| in production<br />
| 20111019 1100 PDT<br />
| {{bug|695525}} Pulse enabled on test-master01<br />
|-<br />
| in production (build only)<br />
| 20111017 1728 PDT<br />
|<br />
* {{bug|695161}} Disable updates to broken Win x64 builds<br />
|-<br />
| in production<br />
| 20111017 1100 PDT<br />
|<br />
* {{bug|690860}} enable android debug nightly on m-c<br />
* {{bug|694235}} config tests shouldn't fail if there are no try slaves<br />
* {{bug|694106}} remove tegra try pool<br />
* {{bug|676879}} Config changes required to run valgrind as a nightly builder<br />
* {{bug|694716}} patch by joel to fix broken mochitests due to bug 691411<br />
* {{bug|694077}} Enable nightlies builds and updates for birch branch<br />
|-<br />
| in production<br />
| 20111017 0900 PDT<br />
|<br />
* {{bug|694579}} deployed newer talos.zip<br />
|-<br />
| in production<br />
| 20111012 0735 PDT<br />
|<br />
* backout {{bug|692928}}.jhford<br />
* {{Bug|693903}} Update slaves for staging and preproduction configs. rail<br />
* {{Bug|692823}} Reduce PGO sets to 6 hours until bug 691675 is fixed. armenzg<br />
* {{Bug|693686}} PGO talos is submitting to the Firefox-Non-PGO tree. armenzg<br />
|-<br />
| in production<br />
| 20111011 1515 PDT (for build masters, others later)<br />
|<br />
* {{bug|693350}} - Don't try to add bouncer entries in preproduction<br />
* {{bug|692388}} - mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
* No Bug, do compare_attrs better for DependentL10n, so we don't throw in dump_masters. Will followup later to get compare_attrs better for all of buildbotcustom. Not used for Firefox builds, so NPOTB<br />
* {{bug|693686}} - PGO talos builds reporting to Non-PGO branches<br />
* {{bug|693794}} - remove unneeded usebuildbot=1 from tbpl links in try emails<br />
|-<br />
| in production<br />
| 20111007 1550 PDT<br />
|<br />
* {{bug|692928}} turn off rev4 on try <br />
* {{bug|692910}} Update preproduction test slave list<br />
* {{bug|688296}} python module conflict with xcode module<br />
* {{bug|692646}} enable PGO on release builds again<br />
* {{bug|692388}} mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
|-<br />
| in production<br />
| 20111006 1230 PDT<br />
| <br />
* {{bug|681834}} Insert finished jobs in the statusdb more frequently<br />
* {{bug|686578}} SpiderMonkey builds on IonMonkey TBPL - enable all debug spidermonkey builds on ionmonkey<br />
* {{bug|687832}} create generic RETRY signifier, and make retry.py print it when it fails to successfully run<br />
* {{bug|692358}} Fix log uploading for PGO builds and tests<br />
* {{bug|692370}} Add branch name to PGO scheduler so that it shows up on self-serve<br />
|-<br />
| in production<br />
| 20111005 1050 PDT<br />
| <br />
* {{bug|558180}} - use in tree mozconfigs for win64<br />
* {{bug|658313}} - disable PGO for per-checkin builds<br />
* {{bug|683721}} - add rev4 testers to buildbot-configs<br />
* {{bug|668724}} - ensure branch is not None when needed<br />
|-<br />
| in production<br />
| 20111005 1821 PDT<br />
|<br />
* Backed out: {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|-<br />
| in production<br />
| 20111005 1632 PDT<br />
|<br />
* {{bug|671450}} - Backout log_uploader.py change, as got_revision doesn't exist on test jobs<br />
|-<br />
| in production<br />
| 20111005 1546 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (buildbot part)<br />
* {{bug|686831}} - Stop TinderboxPrint-ing the rev early for try<br />
* {{bug|691483}} - Do 3.6.23 -> 7.0.1 advertised major update<br />
* {{bug|689750}} - stop sending sendchanges to jhford's personal master<br />
|-<br />
| in production<br />
| 20111005 1526 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|}<br />
<br />
==Archive==<br />
<br />
[[ReleaseEngineering:BuildbotMasterChanges:Archive | Older Changes]]<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 3efbac1f685a<br />
| unknown<br />
| unknown<br />
| unknown<br />
|}<br />
<br />
Update Procedure:<br />
ssh to bm-remote-talos-webhost-01<br />
cd /var/www/html/talos<br />
hg pull && hg up<br />
rsync -az --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -az --delete . bm-remote-talos-webhost-03:/var/www/html/.<br />
<br />
Servers:<br />
* bm-remote-talos-webhost-01.build.mozilla.org<br />
* bm-remote-talos-webhost-02.build.mozilla.org<br />
* bm-remote-talos-webhost-03.build.mozilla.org<br />
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering:2012-Q1-Workweek&diff=391889ReleaseEngineering:2012-Q1-Workweek2012-01-30T21:13:20Z<p>Bear: /* Topics to discuss */</p>
<hr />
<div>=Details=<br />
==Location==<br />
Oahu, Hawaii<br />
<br />
==Hotel==<br />
[http://bit.ly/seLYES Queen Kapiolani]<br />
<br />
==Important Dates==<br />
* Travel: <br />
** Arrive on or before Feb 5th, 2012 <br />
** Depart on or after Feb 11th, 2012<br />
* Working: Feb 6th - 10th, 2012 inclusive<br />
<br />
==Topics to discuss==<br />
* Please start adding your ideas<br />
* mini sprints on some code hygiene items (e.g. {{bug|644578}}) (hwine)<br />
* Improving efficiency<br />
** Process changes<br />
** Email/Bug triage<br />
<br />
* (jhopkins) publish/subscribe release automation system<br />
* (catlee) RaaS (releng-as-a-service)<br />
* (hwine) Roadmap for git support<br />
* (bear) Metrics, Dashboard, Buildduty Automation<br />
<br />
==Schedule of Talks/Presentations==<br />
* Monday<br />
** morning: reviews<br />
** afternoon: autoland code-review<br />
<br />
* Tuesday<br />
** morning: reviews<br />
<br />
* Wednesday<br />
** morning: reviews<br />
<br />
* Thursday<br />
<br />
* Friday<br />
<br />
<br />
==Meal Planning==<br />
TBD<br />
==Things to Do==<br />
Coming soon....</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_11&diff=388103Mobile/Testing/11 18 112012-01-18T18:24:09Z<p>Bear: </p>
<hr />
<div>silly bear created the wrong entry<br />
<br />
it should be https://wiki.mozilla.org/Mobile/Testing/01_18_12</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing&diff=388081Mobile/Testing2012-01-18T17:59:10Z<p>Bear: /* Notes */</p>
<hr />
<div>= Status =<br />
[https://bugzilla.mozilla.org/buglist.cgi?resolution=---&resolution=DUPLICATE&status_whiteboard_type=allwordssubstr&query_format=advanced&status_whiteboard=%5Bmobile-testing%5D mobile testing bugs]<br />
== Overview ==<br />
<br />
{| cellspacing="1" cellpadding="1" border="1" style="width: 578px; height: 194px;"<br />
|-<br />
! scope="col" | <br />
! scope="col" | standalone <br />
! scope="col" | production <br />
! scope="col" | native UI<br />
! scope="col" | number running<br />
|-<br />
! scope="row" | reftests <br />
| align="center" | yes <br />
| align="center" | yes <br />
| align="center" | no <br />
| align="center" | 3557/6113<br />
|-<br />
! scope="row" | mochitests <br />
| align="center" | yes <br />
| align="center" | yes <br />
| align="center" | yes <br />
| align="center" | 17873/230492<br />
|-<br />
! scope="row" | browser chrome <br />
| align="center" | no<br />
| align="center" | yes <br />
| align="center" | no <br />
| align="center" | 422/422<br />
|-<br />
! scope="row" | xpcshell <br />
| align="center" | yes <br />
| align="center" | no<br />
| align="center" | yes <br />
| align="center" | 848/1202<br />
|-<br />
! scope="row" | js reftests <br />
| align="center" | no<br />
| align="center" | yes<br />
| align="center" | no <br />
| align="center" | 54570/55357<br />
|-<br />
! scope="row" | crash tests <br />
| align="center" | yes<br />
| align="center" | yes<br />
| align="center" | no <br />
| align="center" | 1880/1888<br />
|-<br />
! scope="row" | talos <br />
| align="center" | no <br />
| align="center" | yes<br />
| align="center" | yes <br />
| align="center" | 9/9<br />
|}<br />
<br />
== Reftests ==<br />
* over half the tests are not run on mobile<br />
* most of this is due to not running large directories of tests<br />
* somebody needs to go in and get more details as to what is not running.<br />
* does running the tests locally work? A lot of failures show up when the proper screen resolution isn't available and the --ignore-window-size flag is used.<br />
<br />
== Mochitests ==<br />
* currently running 11 directories <br />
* last week releng had 16 more directories running in staging which are green (m5-8), those should be turned on this week<br />
* layout/style has 96000+ tests and :mw22 is looking at cleaning those tests up<br />
** some require scrollbars and we don't have those on mobile<br />
** 1 has a e10s requirement which isn't an obvious fix<br />
* content/* tests have patches to run with e10s and on mobile {{bug|668283}}<br />
** seems to be blocked on {{bug|621363}}<br />
<br />
== Browser Chrome ==<br />
[https://bugzilla.mozilla.org/buglist.cgi?resolution=---&resolution=DUPLICATE&status_whiteboard_type=allwordssubstr&query_format=advanced&status_whiteboard=%5Bmobile-testing%5D%20%5Bbrowser-chrome%5D browser-chrome test bugs]<br />
(Broken out of Mochitests because of special requirements)<br />
These require being run out of the package-tests directory. When doing a 'make package-tests', we create a tests.jar file, which we copy to the device (into the profile directory) and run the tests from there.<br />
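<br />
A rough sketch of the packaging and copy step described above; the object directory, the location of tests.jar inside it, and the on-device profile path are all placeholders rather than the actual harness configuration:<br />
<br />
 # Build the test package; this produces tests.jar among the packaged tests.<br />
 make -C $OBJDIR package-tests<br />
 <br />
 # Copy tests.jar into the application profile directory on the device.<br />
 # Both paths below are hypothetical examples - substitute the real ones.<br />
 adb push $OBJDIR/dist/tests.jar /data/data/org.mozilla.fennec/files/mozilla/test.profile/tests.jar<br />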
<br />
== XPCShell ==<br />
Patches landed: developers can run xpcshell on Android via ADB. About 2/3 of the tests pass; bugs opened for the remainder:<br />
[https://bugzilla.mozilla.org/buglist.cgi?resolution=---&resolution=DUPLICATE&status_whiteboard_type=allwordssubstr&query_format=advanced&status_whiteboard=%5Bmobile-testing%5D%20%5Bxpcshell%5D xpcshell test bugs]<br />
<br />
== JS Reftests ==<br />
* we already run the majority of these in production<br />
* we need to create a link to the jstests.list file so we can run these on a developer machine<br />
* we should document the commented out tests with more details<br />
<br />
== Crash tests ==<br />
* almost all are running; we should look into the tests we have turned off in the manifest files and document them better.<br />
<br />
== Talos ==<br />
* we don't run tp5, but we have tp4m. <br />
* we need to turn off ts, txul and replace with ts_paint and tpaint<br />
* tpan/tzoom/tp4m are failing frequently {{bug|662936}}<br />
<br />
= Status Meetings =<br />
There will be weekly meetings to discuss the current status of testing on Mobile and coordinate the required work between teams.<br />
Details<br />
* Wednesdays @ 10:30am PST/PDT<br />
* Meeting in Warp Core<br />
* Vidyo in Warp Core<br />
* #mobile for back channel<br />
== Notes ==<br />
[[template]]<br />
<br />
Q1<br />
# [[Mobile/Testing/01_18_12 | 01/18/12]]<br />
# [[Mobile/Testing/01_11_12 | 01/11/12]]<br />
# [[Mobile/Testing/01_04_12 | 01/04/12]]<br />
<br />
2010<br />
Q4<br />
# [[Mobile/Testing/12_28_11 | 12/28/11]]<br />
# [[Mobile/Testing/12_21_11 | 12/21/11]]<br />
# [[Mobile/Testing/12_14_11 | 12/14/11]]<br />
# [[Mobile/Testing/12_07_11 | 12/07/11]]<br />
# [[Mobile/Testing/11_30_11 | 11/30/11]]<br />
# [[Mobile/Testing/11_23_11 | 11/23/11]]<br />
# [[Mobile/Testing/11_16_11 | 11/16/11]]<br />
# [[Mobile/Testing/11_09_11 | 11/09/11]]<br />
# [[Mobile/Testing/11_02_11 | 11/02/11]]<br />
# [[Mobile/Testing/10_26_11 | 10/26/11]]<br />
# [[Mobile/Testing/10_19_11 | 10/19/11]]<br />
# [[Mobile/Testing/10_12_11 | 10/12/11]]<br />
# [[Mobile/Testing/10_05_11 | 10/05/11]]<br />
Q3<br />
# [[Mobile/Testing/09_28_11 | 09/28/11]]<br />
# [[Mobile/Testing/09_21_11 | 09/21/11]]<br />
# [[Mobile/Testing/08_24_11 | 08/24/11]]<br />
# [[Mobile/Testing/08_17_11 | 08/17/11]]<br />
# [[Mobile/Testing/08_08_11 | 08/08/11]]<br />
# [[Mobile/Testing/08_01_11 | 08/01/11]]<br />
# [[Mobile/Testing/07_06_11 | 07/06/11]]</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_11&diff=388078Mobile/Testing/11 18 112012-01-18T17:57:53Z<p>Bear: Blanked the page</p>
<hr />
<div></div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_12&diff=388076Mobile/Testing/11 18 122012-01-18T17:57:22Z<p>Bear: Blanked the page</p>
<hr />
<div></div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/01_18_12&diff=388073Mobile/Testing/01 18 122012-01-18T17:55:46Z<p>Bear: Created page with "= Previous Action Items = = Status reports = == Dev team == == Rel Eng == [http://is.gd/ZEOCi7 android_tier_1] * evaluating {{bug|715193}} (talos.json for tegra environment) * ..."</p>
<hr />
<div>= Previous Action Items =<br />
<br />
= Status reports =<br />
== Dev team ==<br />
== Rel Eng ==<br />
[http://is.gd/ZEOCi7 android_tier_1]<br />
<br />
* evaluating {{bug|715193}} (talos.json for tegra environment)<br />
* work progressing for {{bug|715215}} - install robocop.apk and add a robocop test type for native android<br />
* landed<br />
** {{bug|695351}} - mochitest intree manifests<br />
** {{bug|712422}} (--bootstrap)<br />
<br />
== A Team ==<br />
== S1/S2 Automation ==<br />
<br />
= Round Table =<br />
<br />
<br />
= Action Items =</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_12&diff=388069Mobile/Testing/11 18 122012-01-18T17:55:20Z<p>Bear: Created page with "= Previous Action Items = = Status reports = == Dev team == == Rel Eng == [http://is.gd/ZEOCi7 android_tier_1] * evaluating {{bug|715193}} (talos.json for tegra environment) * ..."</p>
<hr />
<div>= Previous Action Items =<br />
<br />
= Status reports =<br />
== Dev team ==<br />
== Rel Eng ==<br />
[http://is.gd/ZEOCi7 android_tier_1]<br />
<br />
* evaluating {{bug|715193}} (talos.json for tegra environment)<br />
* work progressing for {{bug|715215}} - install robocop.apk and add a robocop test type for native android<br />
* landed<br />
** {{bug|695351}} - mochitest intree manifests<br />
** {{bug|712422}} (--bootstrap)<br />
<br />
== A Team ==<br />
== S1/S2 Automation ==<br />
<br />
= Round Table =<br />
<br />
<br />
= Action Items =</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_11&diff=388066Mobile/Testing/11 18 112012-01-18T17:54:38Z<p>Bear: /* Rel Eng */</p>
<hr />
<div>= Previous Action Items =<br />
<br />
= Status reports =<br />
== Dev team ==<br />
== Rel Eng ==<br />
[http://is.gd/ZEOCi7 android_tier_1]<br />
<br />
* evaluating {{bug|715193}} (talos.json for tegra environment)<br />
* work progressing for {{bug|715215}} - install robocop.apk and add a robocop test type for native android<br />
* landed<br />
** {{bug|695351}} - mochitest intree manifests<br />
** {{bug|712422}} (--bootstrap)<br />
<br />
== A Team ==<br />
== S1/S2 Automation ==<br />
<br />
= Round Table =<br />
<br />
<br />
= Action Items =</div>Bearhttps://wiki.mozilla.org/index.php?title=Mobile/Testing/11_18_11&diff=388058Mobile/Testing/11 18 112012-01-18T17:46:44Z<p>Bear: Created page with "= Previous Action Items = = Status reports = == Dev team == == Rel Eng == [http://is.gd/ZEOCi7 android_tier_1] * evaluating {{bug|715193}} (talos.json for tegra environment) * ..."</p>
<hr />
<div>= Previous Action Items =<br />
<br />
= Status reports =<br />
== Dev team ==<br />
== Rel Eng ==<br />
[http://is.gd/ZEOCi7 android_tier_1]<br />
<br />
* evaluating {{bug|715193}} (talos.json for tegra environment)<br />
* landed<br />
** {{bug|695351}} - mochitest intree manifests<br />
** {{bug|712422}} (--bootstrap)<br />
<br />
== A Team ==<br />
== S1/S2 Automation ==<br />
<br />
= Round Table =<br />
<br />
<br />
= Action Items =</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Maintenance&diff=387714ReleaseEngineering/Maintenance2012-01-17T19:36:57Z<p>Bear: /* Reconfigs / Deployments */</p>
<hr />
<div>This page is to track upcoming changes to any part of RelEng infrastructure: buildbot masters, slaves, ESX hosts, etc. This should allow us to keep track of what we're doing in a downtime, and also what changes can be rolled out to production without needing a downtime. This should be helpful if we need to track what changes were made when troubleshooting problems.<br />
<br />
[[ReleaseEngineering:BuildbotBestPractices]] describes how we manage changes to our masters.<br />
<br />
= Relevant repositories =<br />
* [http://hg.mozilla.org/build/buildbot/ buildbot]<br />
* [http://hg.mozilla.org/build/buildbot-configs/ buildbot-configs]<br />
* [http://hg.mozilla.org/build/buildbotcustom/ buildbotcustom]<br />
* [http://hg.mozilla.org/build/tools/ tools]<br />
* [http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ talos]<br />
<br />
'''Are you changing the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
= Reconfigs / Deployments =<br />
This page is updated by the person who does a reconfig on production systems. Please give accurate times, as we use this page to track down if reconfigs caused debug intermittent problems.<br />
<br />
'''Did you change the tool chain on a master? If so, let auto-tools know so they can update their masters'''<br />
<br />
Outcome should be 'backed out' or 'In production' or some such. Reverse date order pretty please.<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Outcome'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Bug #(s)''' - '''Description(s)'''<br />
|-<br />
| in production<br />
| 20120117 1330 PST<br />
|<br />
* {{bug|695351}} - android mochitests to use in-tree manifest<br />
* {{bug|700415}} - peptest on try<br />
* {{bug|712750}} - print more data for screenresolution in buildbot factories<br />
|-<br />
| in production<br />
| 20120117 0800 PST<br />
|<br />
* {{Bug|698827}} - Run 10.5 leak builds on 10.6 machines for try. p=armenzg<br />
|-<br />
| in production<br />
| 20120116 1325 PST<br />
|<br />
* Require branch parameter to clobberer HTML interface<br />
|-<br />
| in production<br />
| 20120113 07:00 PST<br />
|<br />
* {{bug|714490}} - make hgtool handle mirror/master hg outages better<br />
|-<br />
| in production<br />
| 20120112 16:40 PST<br />
|<br />
* {{bug|712422}} - add a --bootstrap cli flag to reftest/crashtest/jsreftest for android<br />
* {{bug|698425}} - enable android and android-xul l10n repacks<br />
* Bustage fix. Changeset fa1c76238b7c<br />
* {{bug|713442}} - point 1.9.2 release configs to the compare-locales RELEASE_0_8_2 tag<br />
* {{bug|717621}} - Remove decommissioned slaves<br />
* {{bug|698425}} - android and android-xul l10n mozconfig<br />
* {{bug|567274}} - Talos should halt on download or unzip failure<br />
|-<br />
| in production<br />
| 20120109 1806 PST<br />
|<br />
* stage rather than masters<br />
* {{bug|712008}} - Always trim revision to 12 chars<br />
* {{bug|716431}} - Block asc files for partial mars in latest-<branch> dirs (stage)<br />
|-<br />
| in production<br />
| 20120106 1300 PST<br />
|<br />
* {{bug|715623}} - add --cachedir support to signtool.py<br />
|-<br />
| in production<br />
| 20120104 1315 PDT<br />
|<br />
* Back out 7a7847f7fc05 ({{bug|711275}}: Make sure appVersion changes with every Firefox 10 beta)<br />
* {{bug|712008}} - Pass platform to post_upload.py for shark<br />
* {{bug|681948}} - Automatically retry after a devicemanager.DMError<br />
* {{bug|715119}} - [signing-server] Bump token TTL<br />
* {{bug|713161}} - new high tegra added<br />
* {{bug|711221}} - turn on create_snippet and create_partial for profiling branch<br />
* {{bug|712150}} - bustage fix for linux,m-r and xulrunner in-tree mozconfig path<br />
|-<br />
| in production<br />
| 20111222 0800 PDT<br />
|<br />
* {{Bug|710350}} - Don't hard-code 'firefox' and 'fennec' in misc.py.<br />
* {{Bug|707152}} - enable leaktest for 10.6 everywhere except some release branches.<br />
* {{bug|711367}} - enable android-xul tests<br />
* {{Bug|673131}} - Enable talos_from_source_code.<br />
* {{bug|712094}} - re-enable aurora updates. <br />
* {{bug|711275}} - Make sure appVersion changes with every Firefox 10 beta. r=rail<br />
|-<br />
| in production<br />
| 20111221<br />
|<br />
* {{bug|683734}} - added a bunch of talos-r3 slaves to production<br />
|-<br />
| in production<br />
| 20111221 1300 PST<br />
|<br />
* {{bug|558180}} - use in-tree mozconfigs for releases<br />
* {{bug|709114}} - add locales to aurora<br />
* {{bug|710842}} - re-enable symbols for nightly fennec xul builds<br />
* {{bug|711221}} - rename private-browsing branch to 'profiling'<br />
* {{bug|712133}} - firefox 10.0b1 release configs<br />
|-<br />
| in production<br />
| 20111221 1100 PST<br />
|<br />
* {{bug|712208}} - update binutils to 2.22<br />
|-<br />
| in production<br />
| 20111220 0610 PST<br />
|<br />
* <strike>{{bug|673131}} - when minor talos changes land, the a-team should be able to deploy with minimal releng time required</strike> - backed-out<br />
* {{bug|704582}} - [tracking bug] deploy 83 tegras<br />
* {{bug|712115}} - L10n mac nightlies busted on central and aurora<br />
* {{bug|710453}} - Release Engineering changes for the Firefox 11 merge to Aurora on Dec 20<br />
* {{bug|712094}} - push mozilla-aurora updates to auroratest channel until merge stabilizes<br />
* {{bug|712068}} - Adjust default releasetestUptake value<br />
|-<br />
| in production<br />
| 20111219 1000 PST<br />
| <br />
* {{bug|707941}} - Improve token generation step<br />
* {{bug|711179}} - fix for missing symbols for non-mobile tests<br />
* {{bug|710453}} - android-xul mozilla-release mozconfigs<br />
* {{bug|711978}} - Refresh staging release configs<br />
|-<br />
| in production<br />
| 20111217 0800 PST<br />
|<br />
* {{bug|509158}} - enable signing on all branches<br />
|-<br />
| backed out<br />
| 20111216 1700 PST<br />
|<br />
* {{bug|705403}} - Sendchanges [on windows] from build steps are being done from old buildbot version<br />
|-<br />
| in production<br />
| 20111215 0830 PST<br />
|<br />
* {{bug|711064}} Fix puppet dependencies<br />
|-<br />
| in production<br />
| 20111214 0800 PST<br />
|<br />
* {{bug|509158}} Reduce default token time to 2 hours; fix last-complete-mar detection<br />
* {{bug|683734}} Add new rev3 machines.<br />
* {{bug|708475}} accept 'mochitest' and 'reftests' as synonyms for 'mochitests' and 'reftest' (with tests)<br />
* {{bug|708859}} android signature verification should look for android-arm.apk<br />
* {{bug|709233}} reenable android and android-xul multilocale for m-c nightlies<br />
* {{bug|709383}} Turn off win64 signing on m-c<br />
* {{bug|709979}} Set the branch property for projects/addon-sdk jobs to just addon-sdk<br />
* {{bug|710048}} decrease interval between mozilla-inbound pgo builds<br />
* {{bug|710050}} never merge pgo builds<br />
* {{bug|710085}} Pass mozillaDir argument to NightlyBuildFactory<br />
* {{bug|710221}} Implement AggregatingScheduler<br />
|-<br />
| in production<br />
| 20111208 0920 PST<br />
|<br />
* {{bug|509158}} Fix nightly snippet generation, reduce default token time, and enable signing on inbound<br />
* {{bug|707666}} Enable win64 signing on elm<br />
* {{bug|708341}} Turn off android-xul talos tests<br />
|-<br />
| in production<br />
| 20111206 1300 PST ish<br />
|<br />
* {{bug|509158}} Don't enable signing for l10n check steps.<br />
* {{bug|509158}} Sign builds as part of the build process: enable signing server for debug builds; disable pre-signed updater on elm.<br />
* {{bug|671450}} Try different sources for revision in log_uploader<br />
* {{bug|706832}} Implement master side token generation for signing on demand.<br />
* {{bug|509158}} Enable signing for mozilla-central windows builds.<br />
* {{bug|704549}} reenable android native on m-c.<br />
* {{bug|703772}} disable android-xul updates + uploadsymbols.<br />
|-<br />
| in production<br />
| 20111205 0800 PST<br />
|<br />
* {{bug|509158}} - signing builds (elm/oak only, hopefully)<br />
* {{bug|706832}} - Implement master side token generation for signing on demand. r=catlee,bhearsum<br />
* {{bug|671450}} - Try different sources for revision in log_uploader - r=nthomas<br />
* {{bug|707152}} - enable leaktests for m-i, try and m-c on macos64-debug. r=rail.<br />
* {{bug|706720}} - Post to graphs-old. r=catlee<br />
|-<br />
| in production<br />
| 20111202 1600 PST<br />
|<br />
* {{bug|509158}} - tools for signing builds<br />
|-<br />
| in production<br />
| 20111201 1100 PST<br />
|<br />
* {{bug|694332}} - Use make tier_nspr when building for l10n - r=armenzg<br />
* {{bug|693352}} r=aki add minidump_stackwalk and symbols to the android automation<br />
* {{bug|705936}} - reconfigs should re-generate master_config.json a=aki<br />
|-<br />
| in production<br />
| 20111201 0900 PST<br />
| <br />
* {{bug|704555}} - deploy rss for tp4m on android (required android talos update)<br />
|-<br />
| in production<br />
| 20111128 1448 PST<br />
|<br />
* {{bug|701684}} - remove mozilla-1.9.1 from config.py. r=bhearsum<br />
* add r4 slaves 080-085 to configs r=catlee<br />
* {{bug|705040}} - reenable native android builds on try. r=bhearsum<br />
* {{bug|691483}} - update MU to 3.6.24 -> 8.0.1, r=lsblakk<br />
|-<br />
| in production<br />
| 20111124 0815 PST<br />
|<br />
* {{bug|703010}} - backfill unresponsive tegras<br />
* {{bug|702390}} - reimage buildbot-master2 and buildbot-master5 as w32-ix-slave43 and w32-ix-slave44<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
* {{bug|699838}} - Set up a project branch to allow us to run several iterations for metrics<br />
* {{bug|700534}} - make local buildbot-config modification on test-master01 permanent<br />
* {{bug|700860}} - Put mw32-ix-slave26 into the production pool<br />
* {{bug|676155}} - install r3 mini 02456 as talos-r3-w7-065<br />
* {{bug|704028}} - xulrunner release bundles often timeout<br />
* {{bug|697802}}<br />
|-<br />
| in production<br />
| 20111121 1300 PST<br />
|<br />
* {{bug|702351}} - enable tp_responsiveness on m-c<br />
* {{bug|700705}} - remove more slaves<br />
* add talos-r4-snow-060 to 080 back to the pool<br />
* {{bug|692692}} - re-enable PGO for Win64<br />
* {{bug|701766}} - Remove tegra slaves that had not taken any jobs and are not coming back to production any time soon<br />
* {{bug|704200}} - android dep builds permared after bug 701864; sometimes causing nightlies not to trigger - disable native android builders everywhere except birch<br />
|-<br />
| in production<br />
| 20111118 0700 PST<br />
|<br />
* {{bug|700513}} - set BINSCOPE for win32 on try<br />
* {{bug|702631}} - linux, linux64 and mac partner repacks aren't triggered<br />
* {{bug|703280}} - Use dev-stage01 as SYMBOL_SERVER_HOST for staging try builds<br />
* {{bug|702351}} - deploy talos.zip which includes responsiveness<br />
|-<br />
| in production<br />
| 20111117 0600 PST<br />
|<br />
* {{bug|702834}} - Pass mozillaDir to dep factory.<br />
* {{bug|701864}} - support mobile builds+repacks out of mobile/, mobile/xul/, and mobile/android/.<br />
* {{bug|701766}} - remove staging tegras.<br />
* {{bug|700513}} - Add BINSCOPE env var to win32, win32-debug, and win32-mobile<br />
* {{bug|701476}} - split android reftests from 2 chunks to 3 chunks.<br />
* {{bug|702357}} - enable new tegras for production<br />
* {{bug|702368}} - add hangmonitor.timeout=0 pref to dirty jobs.<br />
* {{bug|702645}} - win32_repack_beta broken due to "LINK : fatal error LNK1104: cannot open file 'mozcrt.lib'".<br />
* {{bug|548551}} - Turn off arm nanojit builds.<br />
* {{bug|700705}} - Remove a bunch of decomissioned slaves.<br />
* {{bug|683734}} - remove talos-r3-snow machines, remove snowleopard-r4 platform, move talos-r4-snow to snowleopard platform<br />
|-<br />
| in production<br />
| 20111116 0700 PST<br />
|<br />
* {{Bug|702351}} - deploy talos.zip which includes responsiveness <br />
|-<br />
| in production<br />
| 20111111 1712 PST<br />
|<br />
* {{bug|697389}} - multilocale birch android nightlies, against l10n-central.<br />
* {{bug|697404}} - disable tp4m for birch<br />
|-<br />
| in production<br />
| 20111110 1200 PST<br />
|<br />
* {{bug|700901}} - reorder mozconfig to get past mozconfig diff. p=aki<br />
* {{bug|700901}} - fix l10n relbranch. p=aki<br />
* {{bug|701116}} - Mobile desktop builds should be nightly-only. p=rail<br />
* {{bug|701113}} - maemo tier 3 (removing all maemo references except mobile/) p=aki<br />
* {{Bug|672132}} - Run beta and release releases in preproduction. p=rail<br />
* {{bug|698946}} - further setup-masters.py improvements p=jhford<br />
|-<br />
| in production<br />
| 20111108 1630 PST<br />
|<br />
* {{Bug|699407}} - Set mirror / bundle URLs. p=catlee<br />
* {{bug|700721}} - update buildbot-configs for merge of nightly->aurora and aurora->beta p=lsblakk<br />
* {{bug|700453}} - make test-master01 tegra specific. p=aki<br />
* {{Bug|700794}} - Disable aurora daily updates until merge to mozilla-aurora is good. p=armenzg<br />
* {{Bug|700737}} - Remove slaves given to Thunderbird. p=armenzg<br />
|-<br />
| in production<br />
| 20111108 1100 PST<br />
| {{bug|687064}} - hgtool work. p=catlee<br />
|-<br />
| in production<br />
| 20111107 0930 PDT<br />
| {{bug|660124}} - remove "paint" set. p=armenzg<br />
|-<br />
| in production<br />
| 20111107 0845 PDT<br />
|<br />
* {{bug|692812}} - add ability to have pgo strategies p=jhford<br />
* {{bug|693771}} - add 10.7 test slaves to buildbot configs p=jhford<br />
* {{bug|698837}} - use signed updater.exe for elm and oak branches. p=bhearsum<br />
* {{Bug|695921}} - removing duplicated entry for ftp_url on jetpack p=lsblakk <br />
* {{bug|698837}} - use signed updater.exe for elm and oak project branches. p=bhearsum<br />
* {{Bug|660124}} - replace ts/twinopen with ts_paint/tpaint and some cleanup. p=armenzg<br />
* {{Bug|699802}} - enable_leaktests for m-i and try. p=armenzg<br />
|-<br />
| in production<br />
| 20111028 1205 PDT<br />
|<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
* {{bug|695921}} - test per checkin addons-sdk against opt & debug across mozilla-{beta,central,aurora,release} latest tinderbox builds<br />
|-<br />
| in production<br />
| 20111025 1200 PDT<br />
|<br />
* {{bug|681855}} - Frequent Tegra "Cleanup Device exception" or "Configure Device exception" from "Remote Device Error: devRoot from devicemanager [None] is not correct"<br />
* {{bug|697112}} - add more twigs<br />
* {{bug|689649}} - update buildbot config.py to adjust side by side talos staging for mozafterpaint<br />
* {{bug|695707}} - mozharness should be tagged automatically for 8.0+ releases<br />
|-<br />
| in production<br />
| 20111021 0932 PDT<br />
|<br />
* {{bug|683448}} - Permission check and virus scan tests shouldn't fail if files pushed to the releases directory<br />
* {{bug|689649}} - disable old_suites for mozilla-beta<br />
* {{bug|692504}} - push betas to internal mirrors automatically<br />
* {{bug|693015}} - disable android debug tests<br />
* {{bug|694077}} - add aus2_mobile_* to the "update branch vars loop" in config.py<br />
* {{bug|694893}} - Bump disk space requirement for codecoverage to 7G<br />
* {{bug|695161}} - backout 1318d1bbc15a to re-enable Win64 updates<br />
* {{bug|695429}} - FF8 beta4 config changes<br />
* {{bug|696165}} - enable tegras 129 - 153<br />
|-<br />
| in production<br />
| 20111019 1100 PDT<br />
| {{bug|695525}} Pulse enabled on test-master01<br />
|-<br />
| in production (build only)<br />
| 20111017 1728 PDT<br />
|<br />
* {{bug|695161}} Disable updates to broken Win x64 builds<br />
|-<br />
| in production<br />
| 20111017 1100 PDT<br />
|<br />
* {{bug|690860}} enable android debug nightly on m-c<br />
* {{bug|694235}} config tests shouldn't fail if there are no try slaves<br />
* {{bug|694106}} remove tegra try pool<br />
* {{bug|676879}} Config changes required to run valgrind as a nightly builder<br />
* {{bug|694716}} patch by joel to fix broken mochitests due to bug 691411<br />
* {{bug|694077}} Enable nightlies builds and updates for birch branch<br />
|-<br />
| in production<br />
| 20111017 0900 PDT<br />
|<br />
* {{bug|694579}} deployed newer talos.zip<br />
|-<br />
| in production<br />
| 20111012 0735 PDT<br />
|<br />
* backout {{bug|692928}}. jhford<br />
* {{Bug|693903}} Update slaves for staging and preproduction configs. rail<br />
* {{Bug|692823}} Reduce PGO sets to 6 hours until bug 691675 is fixed. armenzg<br />
* {{Bug|693686}} PGO talos is submitting to the Firefox-Non-PGO tree. armenzg<br />
|-<br />
| in production<br />
| 20111011 1515 PDT (for build masters, others later)<br />
|<br />
* {{bug|693350}} - Don't try to add bouncer entries in preproduction<br />
* {{bug|692388}} - mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
* No bug - improve compare_attrs for DependentL10n so we don't throw in dump_masters. Will follow up later to improve compare_attrs for all of buildbotcustom. Not used for Firefox builds, so NPOTB<br />
* {{bug|693686}} - PGO talos builds reporting to Non-PGO branches<br />
* {{bug|693794}} - remove unneeded usebuildbot=1 from tbpl links in try emails<br />
|-<br />
| in production<br />
| 20111007 1550 PDT<br />
|<br />
* {{bug|692928}} turn off rev4 on try <br />
* {{bug|692910}} Update preproduction test slave list<br />
* {{bug|688296}} python module conflict with xcode module<br />
* {{bug|692646}} enable PGO on release builds again<br />
* {{bug|692388}} mozharness MercurialVCS with HG_SHARE_BASE_DIR set completely ignores specified revision<br />
|-<br />
| in production<br />
| 20111006 1230 PDT<br />
| <br />
* {{bug|681834}} Insert finished jobs in the statusdb more frequently<br />
* {{bug|686578}} SpiderMonkey builds on IonMonkey TBPL - enable all debug spidermonkey builds on ionmonkey<br />
* {{bug|687832}} create generic RETRY signifier, and make retry.py print it when it fails to successfully run<br />
* {{bug|692358}} Fix log uploading for PGO builds and tests<br />
* {{bug|692370}} Add branch name to PGO scheduler so that it shows up on self-serve<br />
|-<br />
| in production<br />
| 20111005 1050 PDT<br />
| <br />
* {{bug|558180}} - use in tree mozconfigs for win64<br />
* {{bug|658313}} - disable PGO for per-checkin builds<br />
* {{bug|683721}} - add rev4 testers to buildbot-configs<br />
* {{bug|668724}} - ensure branch is not None when needed<br />
|-<br />
| in production<br />
| 20111005 1821 PDT<br />
|<br />
* Backed out: {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|-<br />
| in production<br />
| 20111005 1632 PDT<br />
|<br />
* {{bug|671450}} - Backout log_uploader.py change, as got_revision doesn't exist on test jobs<br />
|-<br />
| in production<br />
| 20111005 1546 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (buildbot part)<br />
* {{bug|686831}} - Stop TinderboxPrint-ing the rev early for try<br />
* {{bug|691483}} - Do 3.6.23 -> 7.0.1 advertised major update<br />
* {{bug|689750}} - stop sending sendchanges to jhford's personal master<br />
|-<br />
| in production<br />
| 20111005 1526 PDT<br />
|<br />
* {{bug|671450}} - Use buildid and rev to create tinderbox-builds path (post_upload.py part)<br />
|}<br />
<br />
==Archive==<br />
<br />
[[ReleaseEngineering:BuildbotMasterChanges:Archive | Older Changes]]<br />
<br />
=Android Testing=<br />
== Web Server Cluster ==<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Revision'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 3efbac1f685a<br />
| unknown<br />
| unknown<br />
| unknown<br />
|}<br />
<br />
Update Procedure:<br />
ssh bm-remote-talos-webhost-01<br />
cd /var/www/html/talos<br />
hg pull && hg up<br />
rsync -az --delete . bm-remote-talos-webhost-02:/var/www/html/.<br />
rsync -az --delete . bm-remote-talos-webhost-03:/var/www/html/.<br />
<br />
Servers:<br />
* bm-remote-talos-webhost-01.build.mozilla.org<br />
* bm-remote-talos-webhost-02.build.mozilla.org<br />
* bm-remote-talos-webhost-03.build.mozilla.org<br />
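<br />
To confirm the mirrors actually match the copy on webhost-01 after a push, a dry-run comparison can be scripted. The sketch below is only a suggestion and not part of the documented procedure: the rsync dry-run flags and the Python wrapper are assumptions, while the hostnames and paths are the ones listed on this page. Run it from /var/www/html/talos on bm-remote-talos-webhost-01.<br />
<br />
 # Sketch: report whether each talos mirror differs from the copy on webhost-01.<br />
 # rsync -n (dry run) plus -i (itemize changes) prints one line per difference,<br />
 # so empty output means the mirror already matches.<br />
 import subprocess<br />
 MIRRORS = ["bm-remote-talos-webhost-02.build.mozilla.org",<br />
            "bm-remote-talos-webhost-03.build.mozilla.org"]<br />
 for host in MIRRORS:<br />
     cmd = ["rsync", "-azni", "--delete", ".", "%s:/var/www/html/." % host]<br />
     out = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]<br />
     print("%s: %s" % (host, "in sync" if not out.strip() else "OUT OF DATE"))<br />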
<br />
== clientproxy servers ==<br />
<br />
Production<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| 2a995b4ed124<br />
| 31249cbe4f19<br />
| bfc910cd8dd3<br />
| ae5d6911905a<br />
| talos: {{bug|629503}}<br />
| 20110202 23:00 PDT<br />
| bear<br />
|}<br />
<br />
Pending<br />
<br />
{| class="fullwidth-table sortable"<br />
| style="background:#cccccc" | '''Talos Rev'''<br />
| style="background:#cccccc" | '''Pageloader Rev'''<br />
| style="background:#cccccc" | '''Taras Bench Rev'''<br />
| style="background:#cccccc" | '''sut_tools'''<br />
| style="background:#cccccc" | '''Bug #'''<br />
| style="background:#cccccc" | '''When'''<br />
| style="background:#cccccc" | '''Who'''<br />
|-<br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
| <br />
|}<br />
<br />
Servers:<br />
* bm-foopy01.build.mozilla.org<br />
* bm-foopy02.build.mozilla.org<br />
<br />
/builds/cp<br />
/builds/talos-data/talos<br />
/builds/talos-data/talos/pageloader@mozilla.org<br />
/builds/talos-data/talos/bench@taras.glek<br />
/builds/sut_tools</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Archive/Android_Tegras&diff=387597ReleaseEngineering/Archive/Android Tegras2012-01-17T17:59:46Z<p>Bear: /* power cycle a Tegra */</p>
<hr />
<div>{{Release Engineering How To|Android Tegras}}<br />
= Tegra Dashboard =<br />
The current status of each Tegra, and other informational links, can be seen on the [http://bm-remote-talos-webhost-01.build.mozilla.org/tegras/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''<br />
<br />
The page is broken up into three sections: Summary, Production and Staging where Production/Staging have the same information but focus on the named set of Tegras.<br />
<br />
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.<br />
<br />
Production Staging<br />
Tegra and buildslave online 57 8<br />
Tegra online but buildslave is not 0 0<br />
Both Tegra and buildslave are offline 19 2<br />
<br />
<br />
The Production/Staging section is a detailed list of all Tegras that fall into the given category.<br />
<br />
ID Tegra CP BS Msg Online Active Foopy PDU active bar<br />
<br />
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master<br />
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE <br />
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE<br />
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE<br />
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra<br />
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir<br />
* '''PDU''' Which PDU page can be used to power-cycle the Tegra. PDU0 is used for those not connected as of yet<br />
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries<br />
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active<br />
<br />
= What Do I Do When... =<br />
<br />
== PING checks are failing ==<br />
Reboot the Tegra through the PDU<br />
<br />
== tegra agent check is CRITICAL ==<br />
Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", then treat it as if [[#PING checks are failing]]<br />
<br />
= How Do I... =<br />
<br />
== recover a foopy ==<br />
<br />
If a foopy has been shutdown without having cleanly stopped all Tegras, you will need to do the following:<br />
<br />
'''Note''': Establish the base screen session if needed; try screen -x first to attach to an existing one<br />
<br />
ssh cltbld@foopy##<br />
screen -x<br />
cd /builds<br />
./stop_cp.sh<br />
./start_cp.sh<br />
<br />
== find what foopy a Tegra is on ==<br />
<br />
Open the Tegra Dashboard - the foopy number is shown to the right<br />
<br />
== check status of Tegra(s) ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./check.sh -t tegra-###<br />
<br />
To check on the status of all Tegras covered by that foopy<br />
<br />
./check.sh<br />
<br />
check.sh is found in /builds on a foopy<br />
<br />
== power cycle a Tegra ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
./check.sh -t tegra-### -c<br />
<br />
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU]].<br />
<br />
== clear an error flag ==<br />
<br />
Find the Tegra on the Dashboard, ssh to that foopy and then<br />
<br />
ssh cltbld@foopy05<br />
./check.sh -t tegra-002 -r<br />
<br />
== restart Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./stop_cp.sh tegra-###<br />
<br />
Check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found, you will need to kill them manually. Once clear, run<br />
<br />
./start_cp.sh tegra-###<br />
<br />
== start Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
cd /builds<br />
./start_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter, it will only attempt to start that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*<br />
<br />
== stop Tegra(s) ==<br />
<br />
First find the foopy server for the Tegra and then run:<br />
<br />
cd /builds<br />
./stop_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter, it will only attempt to stop that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*<br />
<br />
At the end of the shutdown process, stop_cp.sh will run<br />
<br />
ps auxw | grep "tegra-###"<br />
<br />
to allow you to check that all associated or spawned child processes have also been stopped. Sadly, some of them love to zombie, and that just ruins any summer picnic.<br />
<br />
== find Tegras that are hung ==<br />
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.<br />
<br />
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse model name with a tegra host name, e.g. <tt>tegra-250</tt>.)''. Currently (2011-12-20) all tegras on a foopy use the same build master:<br />
<br />
{| border="1" cellpadding="2"<br />
!foopy #!!Master URL<br />
|-<br />
| <18<br />
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]<br />
|-<br />
| >=18 & even<br />
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]<br />
|-<br />
| >18 & odd<br />
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]<br />
|}<br />
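<br />
Since the mapping above is purely numeric, a small helper can save the table lookup. The sketch below simply encodes the table as of 2011-12-20; it is not an official releng tool, and it will need updating whenever the table does.<br />
<br />
 # Sketch: buildslaves page of the master a foopy's tegras report to, per the<br />
 # table above (rule as of 2011-12-20). Not an official tool.<br />
 def master_for_foopy(foopy_number):<br />
     if foopy_number < 18:<br />
         return "http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1"<br />
     if foopy_number % 2 == 0:  # >= 18 and even<br />
         return "http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1"<br />
     return "http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1"<br />
 print(master_for_foopy(22))  # expect buildbot-master20<br />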
<br />
Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.<br />
<br />
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)<br />
<br />
=== whack a hung Tegra ===<br />
Currently the only way to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.<br />
<br />
The manual way to do it is to run:<br />
<br />
ps auxw | grep server.js | grep tegra-### <br />
<br />
and then kill the resulting PID. To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code>, which automates that task.<br />
<br />
cd /builds<br />
./kill_stalled.sh 042 050 070 099<br />
<br />
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.<br />
<br />
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.<br />
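<br />
For reference, the manual ps-and-kill dance described above amounts to roughly the loop below. This is only an illustration of what gets automated, not the actual contents of <code>kill_stalled.sh</code>.<br />
<br />
 # Sketch of the manual procedure: find the server.js process tied to each tegra<br />
 # id given on the command line and kill it, which lets the tegra be power-cycled<br />
 # automatically. Illustrative only - not the real kill_stalled.sh.<br />
 import os, signal, subprocess, sys<br />
 def kill_stalled(tegra_id):  # e.g. "042"<br />
     ps = subprocess.Popen(["ps", "auxw"], stdout=subprocess.PIPE).communicate()[0]<br />
     found = False<br />
     for line in ps.decode("utf-8", "ignore").splitlines():<br />
         if "server.js" in line and ("tegra-%s" % tegra_id) in line:<br />
             pid = int(line.split()[1])  # PID is the second column of ps auxw<br />
             print("killing server.js for tegra-%s (pid %d)" % (tegra_id, pid))<br />
             os.kill(pid, signal.SIGTERM)<br />
             found = True<br />
     if not found:<br />
         print("tegra-%s: none found - powercycle it manually" % tegra_id)<br />
 for tegra in sys.argv[1:]:<br />
     kill_stalled(tegra)<br />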
<br />
== Reboot a Tegra through the PDU ==<br />
cd /builds<br />
python sut_tools/tegra_powercycle.py ###<br />
<br />
You will see the snmpset call result if it worked.<br />
<br />
If rebooting via PDU does not clear the problem, here are things to try:<br />
* reboot again - fairly common to have 2nd one clear it<br />
** especially if box responsive to ping & telnet (port 20701) after first reboot<br />
<br />
== check.py options ==<br />
<br />
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] '''find the appropriate foopy server''' and<br />
<br />
cd /builds<br />
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]<br />
<br />
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction<br />
* -r reset any error.flg semaphore if found and send "rebt" command to tegra<br />
* -c powercycle the Tegra by telnetting to the appropriate PDU<br />
<br />
This will scan a given Tegra (or all of them) and report back its status.<br />
<br />
== Start ADB ==<br />
On the Tegra do:<br />
telnet tegra-### 20701<br />
exec su -c "setprop service.adb.tcp.port 5555"<br />
exec su -c "stop adbd"<br />
exec su -c "start adbd"<br />
<br />
On your computer do:<br />
adb tcpip 5555<br />
adb connect <ipaddr of tegra><br />
adb shell<br />
<br />
== Move a tegra from one foopy to another ==<br />
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.<br />
<br />
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)<br />
<br />
# update foopies.sh & tegras.json in your working directory<br />
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt><br />
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt> (a slightly fuller check is sketched after this list)<br />
# in buildbot, request a "graceful shutdown"<br />
#* wait for tegra to show "idle"<br />
# on the old foopy:<br />
#* stop the tegra via <tt>/builds/stop_cp.sh</tt><br />
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file<br />
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}<br />
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):<br />
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt><br />
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file<br />
#* manually run <tt>cd /builds; ./create_dirs.sh</tt><br />
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt><br />
# on the new foopy:<br />
#* restart the tegras using <tt>cd /builds ; ./start_cp.sh</tt><br />
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.<br />
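<br />
A slightly fuller version of the JSON sanity check from the commit step is sketched below. The internal layout of <tt>tegras.json</tt> is not described on this page, so the script assumes nothing beyond "it is JSON": it confirms the file still parses and that the tegra being moved is at least mentioned among the top-level entries. The example id is hypothetical.<br />
<br />
 # Sketch: confirm tegras.json still parses and mentions the tegra being moved.<br />
 # Nothing about the file's internal layout is assumed beyond "it is JSON";<br />
 # the membership test is just a string search over the top-level entries.<br />
 import json, sys<br />
 tegra = sys.argv[1] if len(sys.argv) > 1 else "tegra-042"  # hypothetical example id<br />
 data = json.loads(open("tegras.json").read())  # raises ValueError if the edit broke it<br />
 print("tegras.json parses OK (%d top-level entries)" % len(data))<br />
 if any(tegra in str(entry) for entry in data):<br />
     print("%s is present" % tegra)<br />
 else:<br />
     print("WARNING: %s not found at the top level - double-check your edit" % tegra)<br />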
<br />
= Environment =<br />
<br />
The Tegra builders are run on multiple "foopy" servers, with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment, and they share common tool and talos environments -- all found inside '''/builds'''.<br />
<br />
* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py<br />
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it<br />
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)<br />
<br />
Tegra is the short name for the Tegra 250 Developer Kit test board; see http://developer.nvidia.com/tegra/tegra-devkit-features for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.<br />
<br />
Unlike the N900s, we don't run a buildbot environment on the device; instead we communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.<br />
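<br />
To make that division of labour concrete, the loop below sketches the kind of supervision clientproxy.py performs. The helper names are hypothetical placeholders rather than the real module's API; the actual program lives at /builds/tools/sut_tools/clientproxy.py.<br />
<br />
 # Sketch of the clientproxy idea: keep one tegra's buildslave in step with the<br />
 # device's health. All four helpers are hypothetical placeholders; the real<br />
 # logic lives in /builds/tools/sut_tools/clientproxy.py.<br />
 import time<br />
 def monitor(tegra_id, device_is_responsive, buildslave_is_running,<br />
             start_buildslave, stop_buildslave, poll_seconds=300):<br />
     while True:<br />
         if device_is_responsive(tegra_id):<br />
             if not buildslave_is_running(tegra_id):<br />
                 start_buildslave(tegra_id)  # device is back: let it take jobs<br />
         else:<br />
             if buildslave_is_running(tegra_id):<br />
                 stop_buildslave(tegra_id)  # device is gone: stop taking jobs<br />
         time.sleep(poll_seconds)  # poll interval is a guess, not the real value<br />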
<br />
= References =<br />
<br />
== One source of truth ==<br />
<br />
As of Oct 2011, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/tegras.json <tt>tools/buildfarm/mobile/tegras.json</tt>] should be the most authoritative document.<br />
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>tegras.json</tt> updated.<br />
* if you find a PDU not labeled per the <tt>tegras.json</tt> file, file a releng bug to update the human labels.</div>Bearhttps://wiki.mozilla.org/index.php?title=ReleaseEngineering/Archive/Android_Tegras&diff=387596ReleaseEngineering/Archive/Android Tegras2012-01-17T17:59:06Z<p>Bear: </p>
<hr />
<div>{{Release Engineering How To|Android Tegras}}<br />
= Tegra Dashboard =<br />
The current status of each Tegra, and other informational links, can be seen on the [http://bm-remote-talos-webhost-01.build.mozilla.org/tegras/ Tegra Dashboard]. ''Dashboard is only updated every 8 minutes; use [[#check status of Tegra(s)|./check.sh]] on the foopy for live status.''<br />
<br />
The page is broken up into three sections: Summary, Production and Staging where Production/Staging have the same information but focus on the named set of Tegras.<br />
<br />
The Summary section has the current start/end date range of the displayed Tegras and a grid of counts.<br />
<br />
Production Staging<br />
Tegra and buildslave online 57 8<br />
Tegra online but buildslave is not 0 0<br />
Both Tegra and buildslave are offline 19 2<br />
<br />
<br />
The Production/Staging section is a detailed list of all Tegras that fall into the given category.<br />
<br />
ID Tegra CP BS Msg Online Active Foopy PDU active bar<br />
<br />
* '''ID''' Tegra-### identifier. Links to the buildslave detail page on the master<br />
* '''Tegra''' Shows if the Tegra is powered and responding: online|OFFLINE <br />
* '''CP''' Shows if the ClientProxy daemon is running: active|INACTIVE<br />
* '''BS''' Shows if the buildslave for the Tegra is running: active|OFFLINE<br />
* '''Msg''' The info message from the last [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] run for that Tegra<br />
* '''Foopy''' Which foopy server the Tegra is run on. Links to the hostname:tegra-dir<br />
* '''PDU''' Which PDU page can be used to power-cycle the Tegra. PDU0 is used for those not connected as of yet<br />
* '''Log''' Links to the text file that contains the cumulative [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] log entries<br />
* '''active bar''' A single character summary of the last 10 status checks where '_' is offline and 'A' is active<br />
<br />
= What Do I Do When... =<br />
<br />
== PING checks are failing ==<br />
Reboot the Tegra through the PDU<br />
<br />
== tegra agent check is CRITICAL ==<br />
Check the dashboard; the Tegra may be rebooting. Give it up to 15 minutes, then [[#check status of Tegra(s)|verify current status]]. If it is still "rebooting", then treat it as if [[#PING checks are failing]]<br />
<br />
= How Do I... =<br />
<br />
== recover a foopy ==<br />
<br />
If a foopy has been shutdown without having cleanly stopped all Tegras, you will need to do the following:<br />
<br />
'''Note''': Establish the base screen session if needed; try screen -x first to attach to an existing one<br />
<br />
ssh cltbld@foopy##<br />
screen -x<br />
cd /builds<br />
./stop_cp.sh<br />
./start_cp.sh<br />
<br />
== find what foopy a Tegra is on ==<br />
<br />
Open the Tegra Dashboard - the foopy number is shown to the right<br />
<br />
== check status of Tegra(s) ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./check.sh -t tegra-###<br />
<br />
To check on the status of all Tegras covered by that foopy<br />
<br />
./check.sh<br />
<br />
check.sh is found in /builds on a foopy<br />
<br />
== power cycle a Tegra ==<br />
<br />
Find the Tegra on the Dashboard and then ssh to that foopy<br />
<br />
ssh cltbld@foopy##<br />
./check.sh -t tegra-### -c<br />
<br />
If the above did not work, then you will need to [[#Reboot a Tegra through the PDU|cycle it via the PDU]]<br />
<br />
== clear an error flag ==<br />
<br />
Find the Tegra on the Dashboard, ssh to that foopy and then<br />
<br />
ssh cltbld@foopy05<br />
./check.sh -t tegra-002 -r<br />
<br />
== restart Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
ssh cltbld@foopy##<br />
cd /builds<br />
./stop_cp.sh tegra-###<br />
<br />
Check the '''ps''' output that is generated at the end to ensure that nothing has glitched. If any zombie processes are found, you will need to kill them manually. Once clear, run<br />
<br />
./start_cp.sh tegra-###<br />
<br />
== start Tegra(s) ==<br />
<br />
Find out which foopy server you need to be on and then run:<br />
<br />
cd /builds<br />
./start_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter, it will only attempt to start that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*<br />
<br />
== stop Tegra(s) ==<br />
<br />
First find the foopy server for the Tegra and then run:<br />
<br />
cd /builds<br />
./stop_cp.sh [tegra-###]<br />
<br />
If you specify the tegra-### parameter, it will only attempt to stop that Tegra; otherwise it will walk through all Tegras found in /builds/tegra-*<br />
<br />
At the end of the shutdown process, stop_cp.sh will run<br />
<br />
ps auxw | grep "tegra-###"<br />
<br />
to allow you to check that all associated or spawned child processes have also been stopped. Sadly, some of them love to zombie, and that just ruins any summer picnic.<br />
<br />
== find Tegras that are hung ==<br />
If you see a Tegra that has been running for 4+ hours, then it most likely has a hung fennec process. There will be a matching server.js daemon on the foopy.<br />
<br />
The easiest way to find Tegras that are in this state is via the buildbot-master. ''(N.B. in buildbot reports, all tegras report their [https://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_2_series model #], e.g. "Tegra 250". Do not confuse model name with a tegra host name, e.g. <tt>tegra-250</tt>.)''. Currently (2011-12-20) all tegras on a foopy use the same build master:<br />
<br />
{| border="1" cellpadding="2"<br />
!foopy #!!Master URL<br />
|-<br />
| <18<br />
| [http://test-master01.build.mozilla.org:8012/buildslaves?no_builders=1 test-master01]<br />
|-<br />
| >=18 & even<br />
| [http://buildbot-master20.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master20]<br />
|-<br />
| >18 & odd<br />
| [http://buildbot-master19.build.mozilla.org:8201/buildslaves?no_builders=1 buildbot-master19]<br />
|}<br />
<br />
Look for Tegras that have a "Last heard from" of >4 hours. If the list of "Recent builds" for the Tegra is flapping between exceptions/failures/warnings, i.e. the status is all sorts of different pretty colours, that's a good sign that there's a stray fennec process fouling things up.<br />
<br />
Another way to identify stalled tegras is to look on the dashboard for tegras showing INACTIVE status for both the tegra ''and'' the client proxy. (These often also have a "not connected" status on the buildslaves page.)<br />
<br />
=== whack a hung Tegra ===<br />
Currently the only way to kick Tegras in this state is to kill the server.js daemon on the appropriate foopy.<br />
<br />
The manual way to do it is to run:<br />
<br />
ps auxw | grep server.js | grep tegra-### <br />
<br />
and then kill the resulting PID. To keep from going crazy typing that over and over again, I created <code>kill_stalled.sh</code>, which automates that task.<br />
<br />
cd /builds<br />
./kill_stalled.sh 042 050 070 099<br />
<br />
This will run the above ps and grep for each tegra id given and if a PID is found, kill it. This will cause the Tegra to be power-cycled automatically, getting it back into service.<br />
<br />
If <tt>./kill_stalled.sh</tt> reports "none found", then manually powercycle the tegra.<br />
<br />
== Reboot a Tegra through the PDU ==<br />
cd /builds<br />
python sut_tools/tegra_powercycle.py ###<br />
<br />
You will see the snmpset call result if it worked.<br />
<br />
If rebooting via PDU does not clear the problem, here are things to try:<br />
* reboot again - fairly common to have 2nd one clear it<br />
** especially if box responsive to ping & telnet (port 20701) after first reboot<br />
<br />
== check.py options ==<br />
<br />
To manually run [http://hg.mozilla.org/build/tools/file/default/sut_tools/check.py check.py] '''find the appropriate foopy server''' and<br />
<br />
cd /builds<br />
python sut_tools/check.py [-m [s|p]] [-r] [-c] [-t tegra-###]<br />
<br />
* -m [s|p] restrict Tegra list to 's'taging or 'p'roduction<br />
* -r reset any error.flg semaphore if found and send "rebt" command to tegra<br />
* -c powercycle the Tegra by telnetting to the appropriate PDU<br />
<br />
This will scan a given Tegra (or all of them) and report back its status.<br />
<br />
== Start ADB ==<br />
On the Tegra do:<br />
telnet tegra-### 20701<br />
exec su -c "setprop service.adb.tcp.port 5555"<br />
exec su -c "stop adbd"<br />
exec su -c "start adbd"<br />
<br />
On your computer do:<br />
adb tcpip 5555<br />
adb connect <ipaddr of tegra><br />
adb shell<br />
<br />
== Move a tegra from one foopy to another ==<br />
The steps are written for moving one tegra. If you're moving a bunch, then you may want to apply each major step to all tegras involved, and use the "reconfigure foopy" approach to save work.<br />
<br />
'''NOTE:''' use this technique to replace a tegra as well. (It's really two moves: move old to dust bin, then move replacement to live.)<br />
<br />
# update foopies.sh & tegras.json in your working directory<br />
# commit the changes to <tt>foopies.sh</tt> and <tt>tegras.json</tt><br />
#* make sure json is clean: <tt>python -c 'import json; json.loads(open("tegras.json").read())'</tt><br />
# in buildbot, request a "graceful shutdown"<br />
#* wait for tegra to show "idle"<br />
# on the old foopy:<br />
#* stop the tegra via <tt>/builds/stop_cp.sh</tt><br />
#* manually remove the tegra from the <tt>/builds/create_dirs.sh</tt> file<br />
#** <strike>'''OR''' run <tt>./foopies.sh old_foopy_number</tt> from your working directory</strike> blocked by: {{bug|713690}}<br />
# on the new foopy, if it is already in service (if the foopy is brand new, the initial install will do all these steps):<br />
#* update the local tools: <tt>cd /builds/tools ; hg pull --update; cd -</tt><br />
#* manually add the tegra to the <tt>/builds/create_dirs.sh</tt> file<br />
#* manually run <tt>cd /builds; ./create_dirs.sh</tt><br />
#* if this is a replacement tegra, manually push the ini files by judicious use of: <tt>grep python update_tegra_ini.sh | sed 's/$TEGRA/tegra-xxx/'</tt><br />
# on the new foopy:<br />
#* restart the tegras using <tt>cd /builds ; ./start_cp.sh</tt><br />
#** '''NOTE:''' do not start any new tegras, which require a reconfig to be active, until after the reconfig is complete.<br />
<br />
= Environment =<br />
<br />
The Tegra builders are run on multiple "foopy" servers, with about 15-20 Tegras per foopy. Each Tegra has its own buildslave environment, and they share common tool and talos environments -- all found inside '''/builds'''.<br />
<br />
* Each Tegra has a '''/builds/tegra-###''' containing a buildslave and a symlink to /builds/tools/sut_tools/clientproxy.py<br />
* All of the shared talos info is in '''/builds/talos-data''' and HG is used to maintain it<br />
* All of the sut related helper code is found '''/builds/sut_tools''' (a symlink to /builds/tools/sut_tools/)<br />
<br />
Tegra is the short name for the Tegra 250 Developer Kit test board; see http://developer.nvidia.com/tegra/tegra-devkit-features for details. It allows us to install and test Firefox on a device that runs Android Froyo while also allowing for debugging.<br />
<br />
Unlike the N900s, we don't run a buildbot environment on the device; instead we communicate with the device via the sutAgentAndroid program that the a-team maintains. All of the buildslave activities are handled by the clientproxy.py program, which monitors the Tegra and its state and starts/stops the buildslave as needed.<br />
<br />
= References =<br />
<br />
== One source of truth ==<br />
<br />
As of Oct 2011, [https://hg.mozilla.org/build/tools/file/default/buildfarm/mobile/tegras.json <tt>tools/buildfarm/mobile/tegras.json</tt>] should be the most authoritative document.<br />
* if you find a tegra deployed that is not listed here, check [https://docs.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFJHSWN4WVNVZEhlREtUNWdTYnVtMlE&hl=en_US#gid=0 bear's master list]. If there, file a releng bug to get <tt>tegras.json</tt> updated.<br />
* if you find a PDU not labeled per the <tt>tegras.json</tt> file, file a releng bug to update the human labels.</div>Bear