Closed Bug 1004668 Opened 11 years ago Closed 10 years ago

Intermittent Android 4.0 robocop-5 shutdown hang following testUITelemetry

Categories

(Firefox for Android Graveyard :: General, defect)

All
Android
defect
Not set
normal

Tracking

(firefox30 unaffected, firefox31 fixed, firefox32 fixed, firefox-esr24 unaffected)

RESOLVED FIXED
Firefox 32
Tracking Status
firefox30 --- unaffected
firefox31 --- fixed
firefox32 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1004657 +++ Breaking out the Android panda instances from bug 918759 in robocop-5. These seem to always follow testUITelemetry, which was disabled on Android 2.2 for a different error. eg: (In reply to TBPL Robot from comment #2789) > philor > https://tbpl.mozilla.org/php/getParsedLog.php?id=38749817&tree=Fx-Team > Android 4.0 Panda fx-team opt test robocop-5 on 2014-04-29 16:11:17 > revision: 3e0ff6c29303 > slave: panda-0613 > > TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds > with no output > Return code: 2 (In reply to TBPL Robot from comment #2805) > philor > https://tbpl.mozilla.org/php/getParsedLog.php?id=38852462&tree=Mozilla- > Inbound > Android 4.0 Panda mozilla-inbound opt test robocop-5 on 2014-04-30 21:00:15 > revision: 658330efb152 > slave: panda-0519 > > TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds > with no output > Return code: 2
After the test ends, there is a service timeout, then an anr... 16:38:42 INFO - 04-29 16:28:04.578 I/Robocop ( 2883): 44 INFO SimpleTest FINISHED 16:38:42 INFO - 04-29 16:28:04.578 I/GeckoDump( 2883): Robocop:Quit received -- requesting quit 16:38:42 INFO - 04-29 16:28:04.882 D/GeckoThumbnailHelper( 2883): Using new thumbnail size: 186472 (width 286) 16:38:42 INFO - 04-29 16:28:04.882 D/GeckoThumbnailHelper( 2883): Sending thumbnail event: 286, 163 ... 16:38:42 INFO - 04-29 16:32:23.843 W/ActivityManager( 1402): Timeout executing service: ServiceRecord{4159cd90 org.mozilla.fennec/org.mozilla.gecko.background.healthreport.upload.HealthReportUploadService} 16:38:42 INFO - 04-29 16:32:23.874 I/Process ( 1402): Sending signal. PID: 2883 SIG: 3 16:38:42 INFO - 04-29 16:32:24.117 I/Process ( 1402): Sending signal. PID: 1402 SIG: 3 16:38:42 INFO - 04-29 16:32:24.117 I/dalvikvm( 1402): threadid=3: reacting to signal 3 16:38:42 INFO - 04-29 16:32:24.132 I/dalvikvm( 1402): Wrote stack traces to '/data/anr/traces.txt' 16:38:42 INFO - 04-29 16:32:24.148 I/Process ( 1402): Sending signal. PID: 1602 SIG: 3 16:38:42 INFO - 04-29 16:32:24.148 I/dalvikvm( 1602): threadid=3: reacting to signal 3 16:38:42 INFO - 04-29 16:32:24.164 I/dalvikvm( 1602): Wrote stack traces to '/data/anr/traces.txt' 16:38:42 INFO - 04-29 16:32:24.164 I/Process ( 1402): Sending signal. PID: 1500 SIG: 3 16:38:42 INFO - 04-29 16:32:24.164 I/dalvikvm( 1500): threadid=3: reacting to signal 3 16:38:42 INFO - 04-29 16:32:24.164 I/dalvikvm( 1500): Wrote stack traces to '/data/anr/traces.txt' 16:38:42 INFO - 04-29 16:32:25.351 I/Process ( 1402): Sending signal. PID: 1894 SIG: 3 16:38:42 INFO - 04-29 16:32:25.351 I/dalvikvm( 1894): threadid=3: reacting to signal 3 16:38:42 INFO - 04-29 16:32:25.359 I/dalvikvm( 1894): Wrote stack traces to '/data/anr/traces.txt' 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): ANR in org.mozilla.fennec 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): Reason: Executing service org.mozilla.fennec/org.mozilla.gecko.background.healthreport.upload.HealthReportUploadService 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): Load: 0.01 / 0.16 / 0.15 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): CPU usage from 18771ms to 0ms ago with 99% awake: 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 2.2% 1894/com.mozilla.SUTAgentAndroid: 1.9% user + 0.3% kernel / faults: 46 minor 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0.5% 1291/surfaceflinger: 0.2% user + 0.2% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0.2% 1402/system_server: 0% user + 0.1% kernel / faults: 2 minor 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0.1% 4/kworker/0:0: 0% user + 0.1% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0% 325/sync_supers: 0% user + 0% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0% 2872/app_process: 0% user + 0% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 0.3% TOTAL: 0.2% user + 0% kernel + 0% iowait 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): CPU usage from 917ms to 1472ms later: 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 5.4% 1402/system_server: 0% user + 5.4% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 5.4% 1416/ActivityManager: 0% user + 5.4% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 1.8% 1894/com.mozilla.SUTAgentAndroid: 1.8% user + 0% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 3.6% 1961/RedirOutputThre: 3.6% user + 0% kernel 16:38:42 INFO - 04-29 16:32:25.367 E/ActivityManager( 1402): 5% TOTAL: 2% user + 3% kernel
(Tweaking summary to reduce mis-stars later :-))
Summary: Intermittent Android 4.0 TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output...following testUITelemetry → Intermittent Android 4.0 *robocop-5* TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output...following testUITelemetry
Well, if the machines don't have network access to the FHR upload endpoints, then the upload service will time out. That's expected. The uploader will run unless you've opted out of upload, but it shouldn't be running soon after an install… Document generation might block one of the browser threads, but we shouldn't be blocking in that way while we time out on upload.
(In reply to Richard Newman [:rnewman] from comment #3) > Well, if the machines don't have network access to the FHR upload endpoints, > then the upload service will time out. That's expected. We should be preffing off FHR rather than relying on network blocks; filed bug 1004682.
There is a slight variation on Comment 1 in some runs: https://tbpl.mozilla.org/php/getParsedLog.php?id=38998335&tree=Try&full=1#error0 10:27:13 INFO - 05-03 10:25:21.906 W/ActivityManager( 1403): Timeout executing service: ServiceRecord{415218c0 org.mozilla.fennec/org.mozilla.gecko.background.announcements.AnnouncementsService} 10:27:13 INFO - 05-03 10:25:21.937 I/Process ( 1403): Sending signal. PID: 2344 SIG: 3 10:27:13 INFO - 05-03 10:25:22.249 I/Process ( 1403): Sending signal. PID: 1403 SIG: 3 10:27:13 INFO - 05-03 10:25:22.249 I/dalvikvm( 1403): threadid=3: reacting to signal 3 10:27:13 INFO - 05-03 10:25:22.289 I/dalvikvm( 1403): Wrote stack traces to '/data/anr/traces.txt' 10:27:13 INFO - 05-03 10:25:22.296 I/Process ( 1403): Sending signal. PID: 1572 SIG: 3 10:27:13 INFO - 05-03 10:25:22.296 I/dalvikvm( 1572): threadid=3: reacting to signal 3 10:27:13 INFO - 05-03 10:25:22.296 I/dalvikvm( 1572): Wrote stack traces to '/data/anr/traces.txt' 10:27:13 INFO - 05-03 10:25:22.296 I/Process ( 1403): Sending signal. PID: 1502 SIG: 3 10:27:13 INFO - 05-03 10:25:22.304 I/dalvikvm( 1502): threadid=3: reacting to signal 3 10:27:13 INFO - 05-03 10:25:22.312 I/dalvikvm( 1502): Wrote stack traces to '/data/anr/traces.txt' 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): ANR in org.mozilla.fennec 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): Reason: Executing service org.mozilla.fennec/org.mozilla.gecko.background.announcements.AnnouncementsService 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): Load: 0.0 / 0.04 / 0.08 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): CPU usage from 16707ms to 0ms ago: 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 1.1% 1897/com.mozilla.SUTAgentAndroid: 1% user + 0.1% kernel / faults: 62 minor 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0.1% 1403/system_server: 0.1% user + 0% kernel / faults: 1 minor 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0% 1292/surfaceflinger: 0% user + 0% kernel 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0% 1359/kworker/0:2: 0% user + 0% kernel 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0% 2001/com.android.gallery3d: 0% user + 0% kernel / faults: 7 minor 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0% 2016/com.svox.pico: 0% user + 0% kernel / faults: 33 minor 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0% 2333/app_process: 0% user + 0% kernel 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 0.3% TOTAL: 0.2% user + 0% kernel + 0% iowait 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): CPU usage from 1020ms to 1562ms later: 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 7.4% 1403/system_server: 0% user + 7.4% kernel / faults: 2 minor 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 7.4% 1417/ActivityManager: 0% user + 7.4% kernel 10:27:13 INFO - 05-03 10:25:23.507 E/ActivityManager( 1403): 4.2% TOTAL: 3.1% user + 1% kernel Notice org.mozilla.gecko.background.announcements.AnnouncementsService instead of org.mozilla.gecko.background.healthreport.upload.HealthReportUploadService.
Attached patch work in progress (obsolete) — Splinter Review
I was not able to determine the cause of the ANRs, so took a closer look at the test itself. Surprise -- most of the test is not running! "let type = m[0]" results in type being undefined, so the rest of the test does not check anything: All of the checks for expected events and sessions simply do not run. Correcting for type turned up a couple of other oversights (s/method/reason, add TEST_PREFIX to sessions), presumably never caught because that code never ran? Correcting those problems sometimes gave me a clean run on Android 4.0; in the intermittent failed cases, we received "home.1" as a session. If I allow for "home.1", the test seems to pass consistently on Android 4.0 and Android 4.2 x86, with no sign of ANRs or other problems: https://tbpl.mozilla.org/php/getParsedLog.php?id=39223869&tree=Try&full=1. But on Android 2.2 and Android 2.3, there are *still* frequent failures on session because both "home.1" and "homepanel.1.<guid>" are received: https://tbpl.mozilla.org/php/getParsedLog.php?id=39225984&tree=Try&full=1. Is there a better way to handle "home.1" and "homepanel.1.xxx>, or shall I continue to hack?
Attachment #8418920 - Flags: feedback?(rnewman)
*shakes fist at sola*
(In reply to Geoff Brown [:gbrown] from comment #20) Thanks for doing the sleuthing! > "let type = m[0]" results in type being undefined, so the rest of the test > does not check anything: All of the checks for expected events and sessions > simply do not run. Why so? That bit of code didn't change from when I wrote this, and it worked then... each measurement should be ["event", ...] so m[0] should be correct, no? > Correcting those problems sometimes gave me a clean run on Android 4.0; in > the intermittent failed cases, we received "home.1" as a session. The test is racing with the real app's UI telemetry. > Is there a better way to handle "home.1" and "homepanel.1.xxx>, or shall I > continue to hack? That set will extend over time, so we shouldn't hard-code to accept those. The right thing to do is to -- as with events -- filter out sessions that don't start with the test prefix, and add the test prefix to the sessions the test creates.
(In reply to Richard Newman [:rnewman] from comment #26) > (In reply to Geoff Brown [:gbrown] from comment #20) > > > "let type = m[0]" results in type being undefined, so the rest of the test > > does not check anything: All of the checks for expected events and sessions > > simply do not run. > > Why so? That bit of code didn't change from when I wrote this, and it worked > then... each measurement should be > > ["event", ...] > > so m[0] should be correct, no? I'm sure you know this code better than I do, but I read it as: - measurements is the return value of getUIMeasurements() (filtered, but of the same form), which looks to me like an array of objects: eg. measurements = [ {type: "session", name: "blah", reason: "blah", ...}, {type: "event", action: "blah", method: "blah", ...}, ... ] - m = measurements[i], an object eg. m = {type: "session", name: "blah", reason: "blah" ...} -> m.type == "session" / m[0] == ??
(In reply to Geoff Brown [:gbrown] from comment #27) > of the same form), which looks to me like an array of objects: Good catch! I think events started life as arrays, and ended up as objects... and the test continued to pass, so wasn't fixed. Damn you, dynamic languages.
Comment on attachment 8418920 [details] [diff] [review] work in progress Review of attachment 8418920 [details] [diff] [review]: ----------------------------------------------------------------- ::: mobile/android/base/tests/testUITelemetry.js @@ +79,5 @@ > if (type == "event") { > let [type, action, method, sessions, extras] = expected[i]; > do_check_eq(m.action, action); > do_check_eq(m.method, method); > + do_check_session_array(m.sessions, sessions); I'd prefer simply: do_check_array_eq(m.sessions.filter(TEST_REGEX.test), sessions);
Attachment #8418920 - Flags: feedback?(rnewman) → feedback+
Status: NEW → ASSIGNED
Hardware: ARM → All
Incorporated your suggestion for sessions -- thanks! Latest try run looks very solid and resolves bug 993813 as a bonus: https://tbpl.mozilla.org/?tree=Try&rev=98387b9caf43
Attachment #8418920 - Attachment is obsolete: true
Attachment #8419883 - Flags: review?(rnewman)
Comment on attachment 8419883 [details] [diff] [review] update testUITelemetry Review of attachment 8419883 [details] [diff] [review]: ----------------------------------------------------------------- ::: mobile/android/base/tests/testUITelemetry.js @@ +68,5 @@ > do_check_eq(m.action, action); > do_check_eq(m.method, method); > + // might receive real sessions in addition to the test ones -- remove the real ones > + let test_sessions = m.sessions.filter(function(s) { > + return TEST_REGEX.test(s); You can just use TEST_REGEX.test as the argument.
Attachment #8419883 - Flags: review?(rnewman) → review+
Backed-out: https://hg.mozilla.org/integration/mozilla-inbound/rev/c80cc079cbc8 for a silly mistake. Will correct when I am more alert.
(In reply to TBPL Robot from comment #40) > RyanVM > https://tbpl.mozilla.org/php/getParsedLog.php?id=39379587&tree=Mozilla-Inbound > Android 4.0 Panda mozilla-inbound opt test robocop-5 on 2014-05-09 12:04:31 It seems my change does not eliminate the ANR + time out -- more work to do here.
Whiteboard: [leave open]
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → Firefox 32
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(mcMerge now looks primarily for the keyword "leave-open" rather than the whiteboard value; though don't know why the backward compat support for the whiteboard didn't kick in)
We got the test running better, but the shutdown hang persists. I don't know why. I won't be able to get back to this before May 22nd.
Assignee: gbrown → nobody
Another possible dup of bug 924622 (landed 05/23)?
Assignee: nobody → gbrown
Let's call the shutdown hang fixed by bug 924622 & the test cleanup fixed by the comment 35 landing. As such think we're all done here :-) (Tweaking summary to avoid future TBPL false positives)
Status: REOPENED → RESOLVED
Closed: 11 years ago10 years ago
Resolution: --- → FIXED
Summary: Intermittent Android 4.0 *robocop-5* TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output...following testUITelemetry → Intermittent Android 4.0 robocop-5 shutdown hang following testUITelemetry
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: