Mobile/Maemo4 Testfarm Notes

From MozillaWiki

Here is a writeup of a number of the issues we had to solve, work around, or live with (to this day) in ramping up the automated Maemo4 test device pool. Hopefully this will give some insight into the hurdles we may face as we attempt to automate a large number of devices on other platforms.

I'm not sure if this is a comprehensive list of the issues we've faced, but it's what I can recall from memory. Hope it helps.


Power and Battery

Maemo Power Management

On the N810s, power management is software-based (and buggy). Also, the power delivered by the power supply appears to be less than the power a running device needs, resulting in constant battery drain even when the device is plugged in and powered on.

On reboot [?], the devices will occasionally go into a) an infinite reboot cycle, where the device goes to the Nokia screen, blanks the screen, then comes back, forever, or b) a hibernate mode. Battery power appears to be the cause of the latter; some sort of "pretend dead" status in the boot scripts appears to be the cause of the former.

Diablo has a bug where a reboot with the power plugged in can cause some of these infinite reboot situations. The fix is to unplug the power and reboot, then replug the power once it's booted; this is not really an option in an automated 24/7 farm of dozens of devices. The workaround, which I haven't gotten to work, is to let the battery drain all the way, then launch a terminal and leave this loop running inside it:

while [ 1 ] ; do sleep 1; done

We attempted this and found that all it really did was annoy us and hide the fact that the device hadn't connected to wifi.

(This bug is supposedly fixed in Fremantle.)

Test Harnesses or Power Bench

These devices were never meant to be turned on and left on 24/7. We've spent a significant amount of time getting them to do this.

The real solution here is [appears to be] to get rid of the battery and get the device onto a power supply that gives enough power. There are test harnesses for the Nokias that cost $500 a pop that do this (and little else that we need); we're also investigating hacking some sort of bench power supply and soldering connections to the back of the devices.

($500 may seem trivial until you realize we may need 30 more of them, at which point it starts looking like real money. Also, our current count of 40 devices is nowhere near enough to keep up with 3 branches throttled. I hear rumors of covering all branches, as well as Try, per-checkin; this would require hundreds of working devices. All of a sudden this becomes tens to hundreds of thousands of dollars on power supplies alone.)


Screen Blanking and Power Save

As mentioned above, we've had to attempt a number of solutions to power saving. Power saving takes a number of guises:

  • wifi power saving (mentioned below)
  • screen power saving
  • cpu power saving

You can disable a number of these in the Control Panel, to a degree. After setting the screen dimming to 2min dim, 5min off, never turn off while charging, you can force it further by installing the 3rd party MoreDimmingOptions package and setting those times to 1440min (24 hours).

However, when the battery is full, the N810 no longer sees itself as "charging", even if it's still plugged in.

I've tried the Advanced Power Management package, which is great at keeping the device on with lots of information about the charge, but has a way of constantly popping up an annoying notification about the battery that you have to click to make go away. I suspect this will affect unit tests that require focus. I haven't tried rolling this out to production.

For cpu, we added a Talos buildbot step to

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

which should keep the cpu from falling asleep mid-testsuite.
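A slightly more defensive version of that step might read the governor back to confirm the write took effect. This is only a sketch, not the step we actually run: the set_governor name and its optional second argument (an alternate governor file, handy for dry runs off-device) are assumptions made for illustration.

```shell
# Hypothetical helper: write a cpufreq governor and confirm it stuck.
# $1 = governor name
# $2 = governor file (assumed optional; defaults to the path used above)
set_governor() {
    gov="$1"
    file="${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor}"
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    echo "$gov" > "$file" || return 1
    # Read the value back so a rejected write fails the step loudly.
    [ "$(cat "$file")" = "$gov" ]
}
```

Calling set_governor performance at the start of a run would then fail the buildbot step if the device refused the setting, rather than letting the suite run on a throttled cpu.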

Network and Wifi

Wired vs. Wireless

In limited comparisons between wired and wireless ethernet, I found wireless more reliable. Very few people believe me when I state this fact. On the N8x0s, wired ethernet involves:

  • enabling host USB on the N8x0
  • attaching a USB ethernet device to the N8x0 via a USB cable and adapter which aren't terribly physically robust, so bumping into the table might disconnect the network
  • cross-compiling a third party driver that isn't supported by Nokia
  • expecting this whole thing to be at a level of enterprise-testing robustness

When I say that wireless was more reliable, I mean that the wired setup suffered from any of

  • physical decoupling from the wire,
  • finding the network down on the wired device more often (/etc/init.d/network restart or reboot), or
  • finding the wired device hibernating/turned off/infinite rebooting more often than the wireless device.

The last of these may be due to additional power drain from the USB connection.

Wifi Specific Issues

Sometimes the devices have issues reconnecting to the wifi network automatically. This has been mostly resolved or worked around.

Turn wifi power saving off. Leaving it on can cause disconnects in processes that need to stay connected (ssh, twisted), and dougt says it can crash your wireless router.

We reduced the transmit power from 100mW to 10mW since the routers are right there and we don't want the devices interfering as much with each other.

zandr says we should keep the N810s 10cm apart for the same reason.

We're waiting on an RF-shielded room which should hopefully help our wifi stability; there are so many wifi networks all clamoring for the 802.11b/g space and possibly people attempting to hack in from elsewhere. A wifi network inside an RF shielded room should help with both those scenarios.

[Non] Disconnects

As a mobile device, the N810 [correctly] defaults to keeping network connections open even when there is a disconnect. This is useful for mobile users moving from one tower's (or wifi router's) range to another.

When the N810 is used as a networked 24/7 test device, this can become a point of aggravation or confusion. SSH connections take a long time to drop after the device loses network or reboots, and buildbot thinks that the device is still available days after it's dropped off the map.

We're still living with this but it certainly isn't the biggest problem we have.

(jhford thinks the OSX network stack is at least partially to blame here; we'll know more if/when we move the buildbot master to linux.)
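One way to make stale SSH sessions drop sooner, at least on the client side, is to turn on OpenSSH's keepalive probes. The sketch below is not something we have rolled out; the device_ssh wrapper name, the hostname in the usage note, and the specific timeout values are all assumptions. ServerAliveInterval, ServerAliveCountMax and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical wrapper around ssh for talking to a test device.
# $1 = host, remaining arguments = command to run on the device.
device_ssh() {
    host="$1"
    shift
    # Probe every 15s and give up after 3 unanswered probes (about 45s),
    # instead of waiting on the default TCP timeouts.
    ssh -o ServerAliveInterval=15 -o ServerAliveCountMax=3 \
        -o ConnectTimeout=10 "$host" "$@"
}
```

For example, device_ssh root@n810-001 uptime (hostname made up) should return or fail within a minute or so even if the device has dropped off the network. This does nothing for buildbot's own idea of whether the device is connected.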

/dev/random

At dougt's suggestion I moved /dev/random elsewhere and made /dev/random a soft link to /dev/urandom, specifically to relax the strictness of the random numbers so that ssh connections can happen faster. This makes sense since availability and speed matter more to us in the test infrastructure than data security.

However, after I noticed that a) /dev/random comes back at boot, and b) the devices on which I made this change seemed to fall over [need reimaging] 2-3 times more often than devices without it, I stopped making this change. It's solving a non-critical issue and introducing a new one.
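For reference, the change amounted to something like the sketch below. The function name, the .orig backup name, and the optional directory argument (so it can be tried against a scratch directory instead of /dev) are assumptions, not a record of the exact commands used.

```shell
# Sketch of the (since abandoned) /dev/random swap.
# $1 = device directory (assumed optional; defaults to /dev)
use_urandom_for_random() {
    dev="${1:-/dev}"
    # Already a symlink: nothing to do.
    [ -L "$dev/random" ] && return 0
    # Keep the original node around under a hypothetical backup name.
    mv "$dev/random" "$dev/random.orig" || return 1
    # Relative link, so /dev/random resolves to /dev/urandom.
    ln -s urandom "$dev/random"
}
```

Because /dev/random comes back at boot, something like this would have to run on every boot to stay in effect, which is one more reason not to bother.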


Update Icon

Every now and again, the little orange icon will blink forever, notifying you that updates are available for your Debian packages.

Interestingly enough, when this was going on we noticed there was a large discrepancy between devices in performance numbers on certain Talos suites. This discrepancy disappeared once I clicked on the orange update icon, chose "update now", then cancelled out of the update inside of App Manager. That made the orange icon go away, but kept my packages at the same revision. And stabilized perf numbers.

I'm not sure what was causing the perf regression exactly: perhaps multiple pings to the repositories, or maybe just the cpu/graphics time needed to render a blinking icon. I'm sadly guessing the latter.

We can disable most of the update notifications, but not all (I think the system repo is hardcoded and not disable-able in the Control Panel) so this problem has mostly gone away. Also, we'd rather have each device at a known state of packages and only roll out updates when we plan on doing so, so this is an acceptable solution for us for the moment.


Disk and Filesystems

Internal Flash

Out of the box, the N810 had /, which is jffs2, and /media/mmc2, which has much more disk but is formatted vfat for some reason. Due to vfat limitations, running executables is problematic, even if you fix the vfat mount options [1]. I've hit enough errors trying to run with fennec in /media/mmc2 that I've given up and installed as much as I can in /. All that remained in /media/mmc2 were the tp3/tp4 pagesets and maemkit (python was on / though).

[1] https://wiki.mozilla.org/ReferencePlatforms/Test/Maemo#Fix_mount_options

/ is small enough that there isn't enough disk space to download and extract two tarballs of fennec, especially if the unit tests are there as well. jhford noticed that if we *image* the filesystem rather than writing onto the filesystem, it uses much less space... jffs2 is not very optimized for writing.

/media/mmc2 tends to become corrupted easily. This may have to do with swap living on /media/mmc2/.swap by default, maybe it's something else, but when /media/mmc2 goes read-only it's time to re-image, as I don't trust it after a fsck.vfat -y. (See the imaging section below.)
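Spotting the read-only state without poking at each device by hand could look something like this. It is a sketch under assumptions: the is_readonly name and the optional mounts-table argument are made up for illustration, and it relies on /proc/mounts listing ro in the options field once the kernel has remounted the filesystem read-only.

```shell
# Hypothetical check: succeed if the given mount point is mounted read-only.
# $1 = mount point
# $2 = mounts table (assumed optional; defaults to /proc/mounts)
is_readonly() {
    mnt="$1"
    table="${2:-/proc/mounts}"
    # Fields: device, mount point, fs type, options, dump, pass.
    # The last entry for a mount point wins; "ro" must be a whole option,
    # so something like errors=remount-ro does not count.
    awk -v m="$mnt" '
        $2 == m { found = 1; ro = ($4 ~ /(^|,)ro(,|$)/) }
        END { exit (found && ro) ? 0 : 1 }
    ' "$table"
}
```

A periodic is_readonly /media/mmc2 && echo "reimage me" on each device would at least flag the ones that need attention.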

However, we have since switched to booting off of...


SD Cards

The main win here is reimaging turnaround time; we could have dozens of spare, pre-imaged SD cards that are ready to hot-swap into a device that needs reimaging. Power on, change the hostname, and you're set.

If we get multiple SD card readers and/or a 20x SD card cloner, that could reduce the serial 13min/card time significantly.
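As a rough illustration of the per-card step, assuming the master image is a raw file written out with dd (these notes don't record the actual imaging procedure), a clone-and-verify helper might look like this. The clone_card name and both arguments are hypothetical, and the card's device name will vary by reader.

```shell
# Hypothetical helper: copy a master image onto a card and verify the copy.
# $1 = image file
# $2 = target (the card's block device, or a plain file for a dry run)
clone_card() {
    img="$1"
    target="$2"
    dd if="$img" of="$target" bs=1048576 2>/dev/null || return 1
    sync
    # Compare only as many bytes as the image holds; the card may be larger.
    # Note: the read-back may be served from cache rather than the card.
    size=$(( $(wc -c < "$img") ))
    head -c "$size" "$target" | cmp -s - "$img"
}
```

With several readers attached, one clone_card per reader could be started in the background and collected with wait, which is where the time savings over the serial approach would come from.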

We're running fairly well on these. We're still hitting a number of the above issues as we try to enable some sort of intelligent monitoring and/or logging so we can maintain this many devices without having to look at or poke at each one individually to tell if they're running smoothly.

I have, however, run into some un-removable files. This might be ext2 corruption.


Memory and Swap

Max Swap

Corruption

Swap Required

Manual Intervention and Maintenance

prompts "touch my power button" "power me up without the power connected" "reimage me"


More on Imaging

Talos Setup

Unittest Setup