Closed Bug 538516 Opened 15 years ago Closed 15 years ago

Order 30 n900s

Categories

(Infrastructure & Operations Graveyard :: Servicedesk, task)

Hardware: ARM
OS: Maemo
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: sean)

References

Details

We'll place this order once we have vetted a proof-of-concept imaging process. Hoping to have the order in by mid-next week (by Jan 13?) so we can have them in hand by Feb 15.
Blocks: 538517
Assigning to joduinn. Do you want to close this out, as we have 20, or keep open to track the full order?
Assignee: jhford → joduinn
(In reply to comment #1)
> Assigning to joduinn.
>
> Do you want to close this out, as we have 20, or keep open to track the full
> order?

Let's keep this open; for now, it's ok to leave with me. Once we get these 20 online, mobile devs can see what the load/wait times/coverage is like with these. That will let us feel more informed and hopefully make a better decision about how many more to order.
No longer blocks: 538517
Depends on: 538517
Justin: Regression-finding came up as the #2 priority in the Fennec 1.0 post-mortem. 20 n900s is just not going to cut it. Who do I need to yell at to get more?
No yelling needed, just ask :-) How many? Pav, can you sign off on this from the dev side?
(In reply to comment #4)
> How many?

My estimates remain unchanged from when I originally filed this bug.

* 80 n810s are keeping up fairly well, without Try builds (and a couple of other branches), without adding new tests, with only GTK testing.
** Estimate similar throughput for n900s.
** Try builds will most likely swamp those 80 due to the inability to collapse the queue.
** New test suites are in the works. Also, we'll have to test both QT and GTK on n900s.

If we can de-prioritize, say, unit tests and QT (or GTK) talos runs on certain branches, then the lower number would probably be fine. We can probably fudge things by running certain tests or branches nightly/weekly, or by using the n810s to patch holes in coverage. If the n900s need to run all tests on all branches per checkin, 150 will be low.

I decided on the numbers based on estimates, aiming for coverage and maintainability. If I need to gate on budget over either of those, or if you need more statistics on how the n810s currently perform over time, let me know what my constraints are and what information you need. (I'm guessing vendor availability will be another constraint.)
Pav - can you please comment on the test coverage you need? 100 of these is over $50k, so no small expenditure...
Right. I was just thinking -- given these prices, the ramp-up time, and the short lifespan of each device model, I think we also need to ask whether we're getting as much out of our automated mobile tests as we would from manual testing across a number of device models. That's an honest question; I'm not trying to sway opinion one way or the other.
I agree with Aki that, for the reasons he states, QA, eng and rel eng should talk through testing strategy to make sure that this is the right direction. Matt should have some insight here also.
So I owe some answers here. Some rough timing per test suite -- rough in that I'm just taking sample times on m-c.

== unit tests ==
mochitests: ~8 hrs, split across 4 columns
chrome: ~1:15
crashtest: ~1:30
reftest: 3:45-4:30
xpcshell: 0:45
full: ~16 hrs per run

However, a number of these are dying mid-test or otherwise skipping tests; if we resolve that, these tests may take longer.

== talos ==
Tzoom: ~1:00
Twinopen: ~:15
Tsvg: ~:20
Tsspider: ~:20
Ts: ~:40
Tpan: ~1:00
Tp4 nochrome: crashes between :20 and :30
Tp4: crashes between :20 and :30
Tgfx: ~:20
Tdhtml: ~:20
full: 5:15

However, I believe Tp4 and Tp4 nochrome take 2+ hours each when they actually go green, so closer to 8 hours.

Unit tests are definitely taking the lion's share of phone compute time. The mochitests have been turned off for now, but AIUI we're going to turn those back on at some point.

I will build a per-branch profile from joduinn's infrastructure load blog posts (e.g. http://oduinn.com/2010/01/16/infrastructure-load-for-december-2009/ ) to break down those numbers as well.
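A minimal sketch of how these suite times roll up into the per-build phone-hours used below (this is my arithmetic, not output from the actual automation; Tp4 and Tp4 nochrome are counted at their ~2-hour green-run times, and the other values are approximations of the figures above):

    unit_hours = {
        "mochitests (4 columns)": 8.0,
        "chrome": 1.25,
        "crashtest": 1.5,
        "reftest": 4.0,   # midpoint of 3:45-4:30
        "xpcshell": 0.75,
    }
    talos_hours = {
        "Tzoom": 1.0, "Twinopen": 0.25, "Tsvg": 0.33, "Tsspider": 0.33,
        "Ts": 0.66, "Tpan": 1.0, "Tp4 nochrome": 2.0, "Tp4": 2.0,
        "Tgfx": 0.33, "Tdhtml": 0.33,
    }
    unit_total = sum(unit_hours.values())    # ~15.5 hrs, i.e. roughly 16
    talos_total = sum(talos_hours.values())  # ~8.2 hrs with green Tp4 runs
    print("unit %.1f + talos %.1f = %.1f phone-hours per build"
          % (unit_total, talos_total, unit_total + talos_total))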
Using http://oduinn.com/images/2009/blogpost_2009_12_math.png (even though December was a low-checkin month):

mozilla-central: ~14 checkins/day
mozilla-1.9.2: ~5 checkins/day
tracemonkey: ~4 checkins/day
places: <1 checkin/day
electrolysis: ~2 checkins/day

(special cases)
mobile-browser: ~3 checkins/day
try: ~11 checkins/day

We've also added the firefox-lorentz branch.

The first 5 branches add up to ~26 checkins/day. For Maemo4 and WinMo, we build one type of build per checkin; for Maemo5 we would build two (QT and non-QT). 2 builds/checkin * 26 checkins/day * 24 phone-hours/build = 1248 phone-hours/day, ignoring mobile-browser, try, and nightly builds/respins. If we assume ~21 phone-hours/build, this number goes down to 1092 phone-hours/day.

The mobile-browser repository is a special case because it's a multiplier: a single checkin in mobile-browser will trigger builds across branches. 3 checkins/day can spawn jobs across the first 6 branches (m-c, m-1.9.2, tm, places, electrolysis, lorentz). So 2 builds/branch/checkin * 3 checkins/day * 6 branches * 24 phone-hours/build = 864 additional phone-hours/day.

Try is also a special case. On the above branches, if we happen to queue-collapse (skipping builds because of a large backlog), we're still testing the same changes, just not per-checkin. Meaning, if 3 checkins happen on mozilla-central while the phones are all busy and we test build #3, we're also testing checkins 1 and 2; we just don't know which checkin broke things if something breaks. However, on Try, checkin 1 may have nothing to do with checkin 2 or 3. We cannot skip builds and still say that we've tested everything. Also, with the variability of results, a single run on a Try build will not give us reliably accurate perf/unit test results. 2 builds/checkin * 11 checkins/day * 24 phone-hours/build = 528 phone-hours/day for Try. If we run tests twice per build for some redundancy, double that. And we can't reduce this number and still fully test the Try builds.

Add all these up and we have 2484 hours of phone time per day. Assuming we're at 100% phone capacity 24/7 (which is never the case), 100 n900s will not keep up with December's checkins, which were low for the year. Add in mfinkle's request that we run talos on a single checkin multiple times, and we're in the negatives as far as phone time.
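A minimal sketch of the arithmetic in the previous comment, assuming the December checkin rates and the 2-builds-per-checkin, ~24-phone-hours-per-build figures quoted there (nothing here is measured independently):

    PHONE_HOURS_PER_BUILD = 24   # ~21 on a smooth run
    BUILDS_PER_CHECKIN = 2       # QT and non-QT for Maemo5

    moz_checkins = 14 + 5 + 4 + 1 + 2   # m-c, 1.9.2, tracemonkey, places, electrolysis: ~26/day
    mobile_browser_checkins = 3         # each fans out across 6 branches
    try_checkins = 11

    moz_hours = BUILDS_PER_CHECKIN * moz_checkins * PHONE_HOURS_PER_BUILD                    # 1248
    mobile_hours = BUILDS_PER_CHECKIN * mobile_browser_checkins * 6 * PHONE_HOURS_PER_BUILD  # 864
    try_hours = BUILDS_PER_CHECKIN * try_checkins * PHONE_HOURS_PER_BUILD                    # 528

    print(moz_hours + mobile_hours + try_hours)  # 2640; 2484 if the first group runs at ~21 hrs/build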
Nightlies should add the equivalent of 6 checkins a day, one per moz branch... ~144 phone-hours. Assuming 24 hours of phone time per build, the above 2484 is incorrect and should be 2640. Add 144 and we get 2784 hours of phone time with nightlies.

This is for an average day, assuming checkins are even across all hours of the day and days of the month (they're not), all checkins result in builds (not all do), and all phones are up and running at full capacity in production (never the case, though we can spend all our time trying to reach this metric if that's the priority). Sadly, the days/hours where checkins spike are often the times when we need the most granular testing.
Sorry, 6 checkins/day should spawn 2 builds each, so that's 288 phone-hours/day for nightlies, bringing the total to 2928 phone-hours/day.
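In other words (again my arithmetic, following the corrections above):

    NIGHTLIES_PER_DAY = 6                         # one per moz branch
    nightly_hours = NIGHTLIES_PER_DAY * 2 * 24    # 2 builds each, 24 phone-hours/build = 288
    print(2640 + nightly_hours)                   # 2928 phone-hours/day with nightlies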
My first gut reaction is "do we need real devices for all of these branches and per checkin?" Real devices are absolutely needed for Talos. Real devices are "nice to have, but not required" for Unit Tests. And real devices are not required at all for Builds.
Just running Talos on the devices would reduce this number by 16 phone-hours/build (or 2/3), meaning full coverage would be reduced to 976 phone-hours/day (assuming one run per checkin per branch). Just running tests on either QT or non-QT would reduce the number by half. Removing branches would remove time based on that branch's activity, plus a nightly.
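Roughly, starting from the 2928 phone-hours/day figure (my arithmetic; the combined Talos-only-plus-single-toolkit number is an extrapolation, not something stated above):

    full = 2928
    talos_only = full * 8 / 24   # keep only the ~8 talos hours of each ~24-hour build -> 976
    single_toolkit = full / 2    # test only QT or only non-QT -> 1464
    both = talos_only / 2        # extrapolation: both reductions combined -> 488
    print(talos_only, single_toolkit, both)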
The automation for running tests on N900s makes it easy to add different devices to our mobile testing pool. I was able to add a Linux PC to the pool and run unit tests on it using builds that we already produce. We could do:

- Talos on device (per commit, if we have the desire and hardware budget)
- Unit tests on Linux PC hardware (as above)
- Unit tests on device once per night/week, to make sure that unit tests still work on device

For the unit tests we could either get an ARM board or use Linux PCs. If we went with ARM, the beagleboard [1] might be an option. I would be reluctant to go with an ARM board if it required doing a separate build. If we went with Linux PCs, I don't know what hardware we would use.

[1] http://beagleboard.org/ has the same Cortex-A8 as the N900, costs $150, and runs Debian Linux. It would require us to use a USB network controller, though. It would also require someone to take the time to verify that things run correctly on the device.
I don't think adding another pool of linux boxes is ideal here -- we already have linux pools on talos masters and production masters. Since there isn't anything mobile-specific we need to do on them, other than the buildbot configs master-side, I'd lean towards running them there.
For Stuart's spreadsheet -- we'd also need staging devices and a cushion of production devices.

- Staging devices help us test new changes before they roll out to production. Limiting these will slow down new changes, whether that's new test suites, a new image, changes to the configs, etc.
- The cushion of production devices means we can lose a number of these to network/power issues, devices needing reimaging, devices hung on dialog prompts, etc. before we cross that threshold of too-little-testing. If we hit this threshold daily, we'll need to do n900 maintenance daily; if we hit it once every week or two, we'll be able to put off n900 maintenance for that period of time.

If these were free I'd ask for a 50% cushion in total, staging+production. They aren't free; I'm going to ask for 20-25%. If we get much less than that, we may rely on n810s in staging and kind of assume that the changes work on n900s.
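As a rough illustration of how the cushion translates into an order size (the 20-25% and 50% figures are from above; the production-pool number in the example is hypothetical):

    import math

    def devices_to_order(production_needed, cushion):
        # Total pool = production devices plus a staging/spare cushion.
        return math.ceil(production_needed * (1 + cushion))

    # e.g. a hypothetical 100-device production pool:
    print(devices_to_order(100, 0.20), devices_to_order(100, 0.25))  # 120, 125
    print(devices_to_order(100, 0.50))                               # 150, the "if they were free" case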
(In reply to comment #9)
> So I owe some answers here.
> Some rough timing per test suite -- rough in that I'm just taking sample times
> on m-c.

Are these times based on the N810 or N900?
(In reply to comment #18)
> Are these times based on the N810 or N900?

N810. I got a couple of rough numbers from jhford; the N900 is in the same ballpark. Unfortunately we don't have a strong history of runs from the N900 yet. The test run on the N900 was about half the time of the full N810 run end-to-end, but we're not sure how long the other steps are. Primarily, how long will it take to reboot, reconnect to wifi, and reset itself on average (we reformat the /builds partition on boot on the N810; I'm not sure how long the equivalent will take on the N900)... also, how long to download the build+tests. So my rough, imprecise guess is the N900 might be anywhere from 75% to 110% of the N810 time. We can revisit after the N900s are running in production if need be; I just assumed it'd be roughly equivalent.
One thing I was thinking about to improve the stability of perf numbers would be to increase the talos test cycle count. We currently do 3 cycles per test. A lot of device time is spent setting up the run (downloads + checkouts), so total run time would not scale linearly with the number of cycles. The downside to doing more cycles is that if the device/browser crashes mid-run, that whole run is a wash, in addition to the test taking longer. That considered, I don't know that doing 10 cycles of Tp4 would be realistic on device.

From IRC:
alice: for pageload: the median of each individual page excluding the max load for that page, averaged excluding the max median
alice: ts/txul: run 20 times, exclude max, take average
alice: experimentally, for pageload tests you need at least 10 cycles, 25 is best, 50 shows no improvement on 25
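A minimal sketch of the aggregation Alice describes (my reading of it, not the actual talos code; the sample numbers are made up):

    import statistics

    def pageload_score(per_page_loads):
        # For each page: median of its load times, excluding that page's max load.
        # Then average the per-page medians, excluding the max median.
        medians = [statistics.median(sorted(loads)[:-1])
                   for loads in per_page_loads.values()]
        return statistics.mean(sorted(medians)[:-1])

    def startup_score(times):
        # Ts/Txul: run N times (20), exclude the max, take the average.
        return statistics.mean(sorted(times)[:-1])

    # Hypothetical numbers, just to show the shape of the calculation.
    print(pageload_score({"page1": [120, 100, 98], "page2": [210, 205, 400]}))
    print(startup_score([530, 510, 505, 900, 520]))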
According to bug 542910 comment 25, we'll need to add a new branch to this list almost immediately. I think a larger cushion for new branches, new tests, and peak checkin times would be better. However, if we're fine with gaps in coverage and want to revisit after we see some more concrete numbers in production, that's fine too.
Based on initial estimates, we should get our total pool up to 50 devices (from 20), and continue to evaluate test times and coverage before deciding whether to order additional ones.
OK - 30 more n900s. Sean, can you get these ordered and update the bug with an ETA today? Might be best to go through Rich for this volume if he can produce them.
Assignee: joduinn → sean
Component: Release Engineering → Server Operations: Desktop Issues
QA Contact: release → mrz
Summary: Order 100-150 n900s → Order 30 n900s
30 N900s on order, ETA is middle to late next week.
Status: NEW → ASSIGNED
Whiteboard: On order
Whiteboard: On order → On order (ETA 3/19)
Had to source these elsewhere. New ETA is next Monday.
Whiteboard: On order (ETA 3/19) → On order (ETA 3/22)
30 N900s tagged, inventoried, and deployed.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Whiteboard: On order (ETA 3/22)
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard