Basic sanity testing around core phone features needs to happen per commit on a physical device that would otherwise be missed during b2g desktop and emulator testing. Discussions have taken place already and some notes can be found at: https://etherpad.mozilla.org/10minutesuite
A basic set of tests have been collected and added to a job in Jenkins: flame.b2g-inbound.ui.functional.sanity.[restart|no_restart] We decided to create jobs for both restart and no_restart to determine if it's necessary to restart the phone in between test runs. Restarting the phone adds about 2 minutes to the total run time with just the basic set of tests.
If we don't restart between tests do we still recover when there's a crash?
Good point, I didn't think about that. Still learning all the nuances. I wonder if there is a way that we could determine if there was a crash and only reboot if necessary. That also reminds me of a conversation I was having with Stephen on how we could recover from crashes without cleanup_data() blowing away the Crash Reports when setting up the next test. Seems like chasing crash reports is a hit or miss type thing.
I can't be sure but I think it will still recover insomuch as it will wait for the marionette server to be ready again. The only thing I am not sure about is whether the existing setUp will 'kill' the crash recovery prompt or not. It would depend on the nature of the crash.
Crash reporting will be handled by bug 1039626, which will at least output the crash details to the console.
Clint, James, can you clarify the definition of the 10 minute of time? Does the 10 minutes include flashing/configuring the environment? It takes around 4.5 minutes but can easily be twice that if checking out the git repo is slow. I think the 10 minutes should exclude environment setup. At the moment the 5 tests it runs are not representative enough of a sanity test, IMO. With 10 minutes of just test runtime we could run 2x as many tests.
I've just configured the job with restarts to post results to the Treeherder staging instance. You can see an example revision with results here: https://treeherder.allizom.org/ui/#/jobs?repo=b2g-inbound&revision=104a51033ef9
(In reply to Zac C (:zac) from comment #6) > Clint, James, can you clarify the definition of the 10 minute of time? > > Does the 10 minutes include flashing/configuring the environment? It takes > around 4.5 minutes but can easily be twice that if checking out the git repo > is slow. > > I think the 10 minutes should exclude environment setup. At the moment the 5 > tests it runs are not representative enough of a sanity test, IMO. With 10 > minutes of just test runtime we could run 2x as many tests. The Flame builds are happening in 39 minutes. So, if we have near constant landings, in order for this test to keep up with the backlog is that it needs to run at least once every 39 minutes. And, build times continue to improve, so you'd best not try to keep up by running a test job at 38 minutes. So, while there is some obvious overkill in the number, I think the 10 minutes needs to include everything. That way you have the most flexibility and you can perform a quick job (one that devs might even run locally) just to see if core functionality on the phone is broken. Honestly though, now that all the world is shifted, it's no longer my call. I direct you to James. Dave, the job you have running and uploading to Treeherder looks great! The only thing I think that is missing is that Treeherder is not finding the test log. I'm not sure if that's because it isn't getting uploaded, or if there is a bug treeherder side. But, in the detail pane output there is no "log" icon. So that might be something good to chase down. Follow up with Jeads/Mauro/Camd if you think there's a bug Treeherder side.
>I think the 10 minutes should exclude environment setup. At the moment the 5 tests it runs are not representative enough of a sanity test, IMO. With 10 minutes of just test runtime we could run 2x as many tests. TLDR; 10 minutes is the goal (in my mind) and higher is acceptable up to the highest number of another test suite run (assuming we have enough phones!) In my mind the 10 minute number should be the entire end-to-end number inclusive of setup... The war will be on setup and build times as we bring more phones online and begin running multiple "chunks" in parallel... One thing to keep in mind is the 10 minute number would actually be a very fast test right now often times a single chunk takes between 10-30 minutes so as long as we have enough phones to keep up with the pace of b2g-inbound slower is okay initially... I would like to see more data once we are up and running and have a good sense of what our failure (or intermittent) rate looks like. In the near future I am bringing someone onboard (1 month~) to optimize our phone builds like crazy... we are doing many suboptimal things there outside of the ongoing optimization of gecko build times that we can improve upon ... Some what pessimistically 20 minute builds are in the near future and lower is certainly possible.
With the current 10 minute suite, the device test results are arriving faster into treeherder than the linux64 and macos desktopb2g test results! We have the capacity to add 5 minutes more of tests now.
(In reply to Clint Talbert ( :ctalbert ) from comment #8) > Dave, the job you have running and uploading to Treeherder looks great! The > only thing I think that is missing is that Treeherder is not finding the > test log. I'm not sure if that's because it isn't getting uploaded, or if > there is a bug treeherder side. It's not currently being uploaded, see bug 1049723.
Hey Greg, do you want to 'upgrade' this to a pull request? Also we should omit the activesync tests entirely because of the ongoing bug 1028192 issue and because they're very long running, hence expensive. We could get two smaller tests for one email test.
After a suggestion by someone (I think :davehunt), I am just specifying the tests specifically in the job rather than altering the manifest. Of course the more we add to this, the more unwieldy it becomes and there wouldn't be an easy way to run the sanity tests locally without duplicating the command. For instance: gaiatest [...options] gaiatest/tests/functional/browser/test_browser_lan.py gaiatest/tests/functional/camera/test_camera_capture_photo.py gaiatest/tests/functional/dialer/test_dialer_airplane_mode.py gaiatest/tests/functional/lockscreen/test_lockscreen_unlock_to_homescreen.py gaiatest/tests/functional/messages/test_sms.py Are we in agreement to just use the "sanity" flag in the manifests for specifying tests and just have it point to the root manifest? I can do that, no problem. We could also review it and see if there are more tests we would like to add.
I suggested that but I only meant it as a short term. We should use manifest long term because then if we enable/disable tests then we don't have to adjust the job. I'm Ok with 'sanity'.
Created attachment 8475148 [details] [review] Github PR: #23037
Updated the PR to include the tag in the root manifest file at gaiatest/tests/manifest.ini and updated the docs to include information about the 'sanity' tag.
Comment on attachment 8475148 [details] [review] Github PR: #23037 In camera.ini, the new tag splits the comment and the tag below it so we just need to move the sanity=true to line 14. Also I think we should have just 1 settings test in there and we can spare it considering that we've dropped the email test. I think test_settings_bluetooth would be perfect because it encompasses a piece of hardware and the settings app in one go.
Comment on attachment 8475148 [details] [review] Github PR: #23037 Adjusted manifest for camera and added test for bluetooth. Bluetooth test added 1 minute 41 seconds to an adhoc job I did in CI.
Comment on attachment 8475148 [details] [review] Github PR: #23037 r+
OK I've merged this https://github.com/mozilla-b2g/gaia/commit/d56ac44b9cd12f662e2e92b1e546a312302e4bae and modified the jobs to use --type=b2g+sanity. The set of tests we are using has changed slightly but I'm more happy with the breadth of the hardware and gaia coverage now.
This has been working for some time now and gone through multiple iterations/adjustments. Original work completed.