Closed Bug 1553971 Opened 6 years ago Closed 9 months ago

Intermittent layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output

Categories

(Core :: CSS Parsing and Computation, defect, P5)

Tracking

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: crash, intermittent-failure, regression, Whiteboard: [stockwell disabled])

Attachments

(2 files, 1 obsolete file)

Filed by: aciure [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=248019165&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/OL6FdGIxRvCPluOeO2NjtA/runs/1/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/OL6FdGIxRvCPluOeO2NjtA/runs/1/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2019-05-23T17:55:21.129Z] 17:55:21 INFO - REFTEST TEST-PASS | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | (LOAD ONLY)
[task 2019-05-23T17:55:21.129Z] 17:55:21 INFO - REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
[task 2019-05-23T18:01:52.889Z] 18:01:52 INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
[task 2019-05-23T18:01:52.992Z] 18:01:52 INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...
[task 2019-05-23T18:01:52.992Z] 18:01:52 INFO - REFTEST TEST-INFO | started process screentopng
[task 2019-05-23T18:01:53.316Z] 18:01:53 INFO - REFTEST TEST-INFO | screentopng: exit 0
[task 2019-05-23T18:02:07.657Z] 18:02:07 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output
[task 2019-05-23T18:02:07.657Z] 18:02:07 INFO - remoteautomation.py | Application ran for: 0:13:54.845250
[task 2019-05-23T18:02:08.192Z] 18:02:08 INFO - REFTEST INFO | Downloading symbols from: https://queue.taskcluster.net/v1/task/cm0pFFzpTsmXUojTHNF9aA/artifacts/public/build/target.crashreporter-symbols.zip
[task 2019-05-23T18:02:12.945Z] 18:02:12 INFO - REFTEST INFO | Copy/paste: /builds/worker/workspace/build/linux64-minidump_stackwalk /tmp/tmpRHOTix/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp /tmp/tmpXilQ5r
[task 2019-05-23T18:02:18.804Z] 18:02:18 INFO - REFTEST INFO | Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp
[task 2019-05-23T18:02:18.804Z] 18:02:18 INFO - REFTEST INFO | Saved app info as /builds/worker/workspace/build/blobber_upload_dir/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.extra
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - REFTEST PROCESS-CRASH | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | application crashed [@ libc.so + 0x8c66a]
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - Crash dump filename: /tmp/tmpRHOTix/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - Operating system: Android
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 0.0.0 Linux 3.10.0+ #1 PREEMPT Thu Jan 5 00:46:30 UTC 2017 x86_64
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - CPU: amd64
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - family 6 model 2 stepping 3
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 1 CPU
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - GPU: UNKNOWN
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Crash reason: SIGABRT
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Crash address: 0xd5c
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Process uptime: not available
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Thread 0 (crashed)
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 0 libc.so + 0x8c66a
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rax = 0xfffffffffffffffc rdx = 0x0000000000000010
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rcx = 0xffffffffffffffff rbx = 0x00000000ffffffff
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rsi = 0x00007fff36f14580 rdi = 0x000000000000001c
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rbp = 0x00000000ffffffff rsp = 0x00007fff36f14508
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000008
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r10 = 0x00000000ffffffff r11 = 0x0000000000000246
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r12 = 0x00000000ffffffff r13 = 0x00007708fe5e3f40
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r14 = 0x0000000000000000 r15 = 0x0000000000000000
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rip = 0x00007708ffc0866a
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: given as instruction pointer in context
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 1 libc.so + 0x2a0e6
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14510 rip = 0x00007708ffba60e6
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 2 libutils.so + 0x184e7
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14530 rip = 0x00007708ffc914e7
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 3 libart.so + 0x14cd91
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14540 rip = 0x00007708fde58d91
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 4 system@framework@boot-framework.art + 0x3d5350
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14558 rip = 0x0000000071675350
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning

The priority flag is not set for this bug.
:emilio, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Do we know why there are no usable stacks on Android? I'm not sure whether you're the right person to ask, Mike; feel free to forward as needed.

Flags: needinfo?(emilio) → needinfo?(mh+mozilla)
Regressed by: 1546255

minidump_stackwalk can't do anything about system libraries, for which it doesn't have breakpad symbol files.

Flags: needinfo?(mh+mozilla)

The priority flag is not set for this bug.
:heycam, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(cam)

These are all on Android 7.0 (geckoview). Tests hang at the end of 1546255.html. This logcat seems typical:

https://taskcluster-artifacts.net/e_Wost4ISb2X-Bg91-GLvA/0/public/test_info//logcat-emulator-5554.log

06-08 13:00:32.290  2417  2432 I Gecko   : REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | 3167 / 3761 (84%)
06-08 13:00:32.290  2417  2432 I Gecko   : 
06-08 13:00:32.290  2417  2432 I Gecko   : {"action":"log","time":1559995232290,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"START http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html"}
06-08 13:00:32.320  2477  2492 W Web Content: [JavaScript Warning: "The stylesheet data:;base64,QGN7IGJ9CkRBTVAsIEhUTUwgey was loaded as CSS even though its MIME type, “text/plain”, is not “text/css”." {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 0}]
06-08 13:00:32.320  2477  2492 E Web Content: [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 0}]
06-08 13:00:32.330  2417  2432 I Gecko   : 
06-08 13:00:32.330  2417  2432 I Gecko   : {"action":"log","time":1559995232330,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] OnDocumentLoad triggering AfterOnLoadScripts"}
06-08 13:00:32.330  2477  2492 W Web Content: [JavaScript Warning: "Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user’s experience. For more help http://xhr.spec.whatwg.org/" {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 21}]
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Initializing canvas snapshot"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterOnLoadScripts belatedly entering WaitForTestEnd"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] WaitForTestEnd: Adding listeners"}
06-08 13:00:32.340  2477  2492 W Web Content: [JavaScript Warning: "Use of Mutation Events is deprecated. Use MutationObserver instead." {file: "resource://reftest/reftest-content.js" line: 784}]
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Initializing canvas snapshot"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: dispatching MozReftestInvalidate"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_REFTEST_WAIT_REMOVAL"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_SPELL_CHECKS"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_APZ_FLUSH"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: APZ flush not required"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_TO_FINISH"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: Doing sync flush to compositor"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: Completed"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] RecordResult fired"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"RecordResult fired"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"test_status","time":1559995232340,"thread":null,"pid":null,"source":"reftest","test":"http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html","subtest":"(LOAD ONLY)","status":"PASS"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"test_end","time":1559995232340,"thread":null,"pid":null,"source":"reftest","test":"http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html","status":"OK"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Loading a blank page"}
06-08 13:00:32.400  2417  2432 I Gecko   : 
06-08 13:00:32.400  2417  2432 I Gecko   : {"action":"log","time":1559995232400,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterPaintListener in data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E"}
06-08 13:00:32.400  2417  2432 I Gecko   : 
06-08 13:00:32.400  2417  2432 I Gecko   : {"action":"log","time":1559995232400,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterPaintListener in data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E"}
06-08 13:07:12.090  2417  2423 I art     : Thread[3,tid=2423,WaitingInMainSignalCatcherLoop,Thread*=0x73c2032b7000,peer=0x12c4dd30,"Signal Catcher"]: reacting to signal 3
Summary: Intermittent layout/style/crashtests/1546255.html | application crashed [@ libc.so + 0x8c66a] after application timed out after 370 seconds with no output → Intermittent layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output

We had 102 failures in the past week, all on android-em-7-0-x86_64 opt.
Recent log failure:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=255173994&repo=autoland&lineNumber=14569

This looks to be caused by Bug 1556058. Joel, can we star these timeouts against that bug, or should we have bugs for each test timing out?

Flags: needinfo?(jmaher)

the plan is to keep on the new emulator, so we should skip this crashtest on android&&!debug and move on. Every time we switch platforms or run on a new config, many tests change behavior; the new emulators run more tests and run them more reliably, and this seems to be one of the few failures that slipped by in early testing.

Flags: needinfo?(jmaher)

good question Cosmin, yes let's do that, but please indicate in bug 1556058 the tests or bug numbers; you could even mark this bug as blocking bug 1556058 as a quick way to do it.

Flags: needinfo?(jmaher)

Thanks for the info Joel. Will let the other sheriffs know about this.

Flags: needinfo?(cam)
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Pushed by malexandru@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e839eae27aa4 Disable crashtests/1546255.html for frequent failures on android 7 opt r=jmaher
Attached file Bug 1553971 - test (obsolete) —
Attachment #9077061 - Attachment is obsolete: true

How sure are we that this failure had anything to do with this test? Before the timeout, I see:

REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

So this test has ended (and we've probably even navigated away from it) by the time we hit the problem.

It looks like GeckoView is hitting a shutdown hang, or something -- not a hang in this test. And the harness is (kinda-mistakenly) attributing the failure to this test because it happens to be the most recent test that was running, but it seems entirely likely that the shutdown hang was caused by something else.

In other words, I'm wondering whether this failure will now suddenly start being reported for another nearby test (whichever test ends up being the last one now that this test has been disabled)... Worth keeping an eye out for.

(In reply to Daniel Holbert [:dholbert] from comment #30)

How sure are we that this failure had anything to do with this test?

Comment 19 makes me even more curious about this... Sounds like this may be a problem with the emulator and disabling individual harness-"blamed" tests is unlikely to help?

see also e.g. bug 1471222, where we had a similar class of failures for ASAN test runs: it seemed to be a shutdown hang plus a harness issue, and the failure message mistakenly implied that the unlucky last test was somehow involved (it was not). It looks like we later improved the logging in bug 1475141 to make that clearer, but I'm not sure whether that fix benefited the crashtest harness or just reftests... ahal, do you know?

Flags: needinfo?(ahal)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #20)

the plan is to keep on the new emulator, so we should skip this crashtest on android&&!debug and move on.

Joel, are you still sure about this^^ assessment, given that this failure looks like a shutdown hang & potentially has nothing to do with this test?

Perhaps a more-robust fix here is to revert to the old emulator for crashtests, as we did for xpcshell it seems? (bug 1556058 comment 19)

Flags: needinfo?(jmaher)

it is fine to revert to the old emulator; I am not clear on what issues were fixed for crashtests by switching to the new emulator. :gbrown knows a lot more about both emulators and will be back from PTO on Monday; let's get this on his radar.

Flags: needinfo?(jmaher) → needinfo?(gbrown)

I also wonder if this is the same issue as bug 1563997. The first comment there is for a debug android Mochitest run (specifically "Android 7.0 x86-64 opt Mochitests test-android-em-7.0-x86_64/opt-mochitest-gpu-e10s"), and it ends with the same pattern that I'm seeing here and quoted in comment 30. In particular, it's got:

[logging for last test finishing]
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

...and a 370-second harness timeout.

Assuming this is a more general harness/emulator problem and unrelated to this particular test, we should probably backout comment 27 (unless we're noticing less orange as a result of it, but I'm skeptical that we would be).

Flags: needinfo?(malexandru)
See Also: → 1563997
See Also: → 1565286

I think bug 1475141 should have also benefited crashtest. But if you're seeing evidence to the contrary then I wouldn't be too surprised if crashtest wasn't working.

Here's a reftest task where it's working fwiw:
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=218864193&revision=75a0b52395c94a2fdc77642d5dfc6cf0e027ad58

Flags: needinfo?(ahal)

(In reply to Andrew Halberstadt [:ahal] from comment #35)

I think bug 1475141 should have also benefited crashtest. But if you're seeing evidence to the contrary then I wouldn't be too surprised if crashtest wasn't working.

I spun this off as bug 1565286 -- let's move this discussion over there. Thanks!

(In reply to Cosmin Sabou [:CosminS] from comment #19)

This looks to be caused of Bug 1556058. Joel, can we star these timeouts against that bug or should we have bugs for each test timing out?

We have a history of failures in this bug beginning May 22, while bug 1556058 did not land anything until June 28; there may have been an increase in failure frequency with the emulator change, but the failure before/after bug 1556058 looks to be the same.

(In reply to Daniel Holbert [:dholbert] from comment #30)

In other words, I'm wondering whether this failure will now suddenly start being reported for another nearby test (whichever test ends up being the last one now that this test has been disabled)... Worth keeping an eye out for.

I had the same concern, but actually, I see no new failures since this test was disabled: There are no shutdown timeouts in Android 7.0 crashtests since comment 29.

It is okay to revert to the old emulator for crashtests if that is desired, but in light of comment 38 and comment 39 I think the best course of action is to do nothing and/or actually investigate the cause of the hang.

Flags: needinfo?(gbrown)

(In reply to Daniel Holbert [:dholbert] from comment #34)

I also wonder if this is the same issue as bug 1563997. The first comment there is for a debug android Mochitest run (specifically "Android 7.0 x86-64 opt Mochitests test-android-em-7.0-x86_64/opt-mochitest-gpu-e10s"), and it ends with the same pattern that I'm seeing here and quoted in comment 30. In particular, it's got:

[logging for last test finishing]
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

...and a 370-second harness timeout.

Assuming this is a more general harness/emulator problem and unrelated to this particular test, we should probably backout comment 27 (unless we're noticing less orange as a result of it, but I'm skeptical that we would be).

Hi Daniel. There aren't any failures since June 11th according to OF so I presume that the backout you mentioned is not necessary. Please correct me if I'm wrong.

Flags: needinfo?(malexandru) → needinfo?(dholbert)

(In reply to Andreea Pavel [:apavel] from comment #41)

Hi Daniel. There aren't any failures since June 11th according to OF

No Android crashtest "application timed out after 370 seconds with no output" issues at all?

I was guessing that this issue would just pop up as a similar failure attributed to another nearby testcase (whichever testcase became the new final test in this bucket), and it would spawn a new intermittent-failure bug. But if that hasn't happened, then I agree the hypothetical backout would not be necessary & the test-disabling should stick.

Flags: needinfo?(dholbert)

(In reply to Daniel Holbert [:dholbert] from comment #43)

(In reply to Andreea Pavel [:apavel] from comment #41)

Hi Daniel. There aren't any failures since June 11th according to OF

No Android crashtest "application timed out after 370 seconds with no output" issues at all?

I was guessing that this issue would just pop up as a similar failure attributed to another nearby testcase (whichever testcase became the new final test in this bucket), and it would spawn a new intermittent-failure bug. But if that hasn't happened, then I agree the hypothetical backout would not be necessary & the test-disabling should stick.

I was referring solely to this failure, which doesn't occur anymore.

(In reply to Andreea Pavel [:apavel] from comment #44)

I was referring solely to this failure, which doesn't occur anymore.

Well, all we know from OrangeFactor + this bug is that this failure doesn't occur anymore with this test mentioned in the failure message, which isn't particularly surprising since the test has been disabled. But since this is an error that happens when the whole crashtest run is complete, I was worried that it'd keep failing & just not be reported here since it'd be attributed to some other test.

But, having said that: it looks like there haven't been any new crashtest "timed out after 370 seconds" bugs filed since Comment 27 (when this test was disabled). I did a bugzilla search for that "timed out" phrase in bugs filed since ~July 9th, and I didn't see any results that were for a crashtest run. So, I think this did indeed go away, which means the test-disabling was in fact a Good Thing.

So that means something about 1546255.html is problematic and can cause shutdown hangs, at least on Android.

I agree, looks like the same issue. But now it's apparently affecting Android Debug as well.

We could paper over it by adjusting the skip-if(Android&&!isDebugBuild) annotation to remove the "!isDebugBuild" restriction. But given that we've got a tight range (this particular autoland-->m-c merge), it'd be nice to see if we can figure out where/why this started perma-failing.

:malexandru, one request for you: is it possible for us to backfill this testrun (from comment 50) on Autoland and see if there's one push there where it got bad? (It doesn't look to me like we have any recent "Android 4.3" runs of any sort at all on Autoland right now, but I'm wondering if we can spawn some over the potentially-guilty range...)

Also, one observation (knocking down a possible explanation): I don't think this is just a case of "rebucketing", because the logs from m-c's last good push and first bad push both have this identical messaging/bucket-description:

REFTEST INFO | Running chunk 9 out of 10 chunks.  tests 3039-3417/3795
REFTEST SUITE-START | Running 379 tests
REFTEST TEST-START | http://10.0.2.2:8854/tests/layout/style/crashtests/1245260-1.html

Notably: exact same total number of tests, exact same range of tests (3039-3417), exact same first test name, and the problematic test does get run in both the last-good and the first-bad build (as the final test before we time out, in the latter case).

Flags: needinfo?(dholbert) → needinfo?(malexandru)

Thanks, malexandru! For now: if this perma-orange is showing up and causing sheriff pain, I think we're fine to land a followup to your earlier patch to remove &&!isDebugBuild (i.e. just skip this crashtest on Android for the moment).
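
For reference, here is a sketch of what that followup would look like in the crashtest manifest (layout/style/crashtests/crashtests.list), assuming the usual reftest/crashtest annotation syntax; the exact line in the tree may differ:

    # before: skipped only on Android non-debug (opt) builds
    skip-if(Android&&!isDebugBuild) load 1546255.html
    # after: skipped on all Android builds
    skip-if(Android) load 1546255.html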

emilio, it looks like you wrote the crashtest in question, 1546255.html - do you have any idea how the just-landed Bug 1564594 ( https://hg.mozilla.org/integration/autoland/rev/2fc4f931f041ebab1d02fe3433fdaba6f9a49fdf ) could've tickled it and made us reliably time out after running it? I'm guessing maybe there's a latent race condition that Bug 1564594 caused us to "lose", but I don't really know how...

:twisniewski, same question to you... Do you know if any of your Bug 1564594 code is intended to be active during tests? Is it conceivable that something about that patch could've led to a deadlock/hang somehow?

For reference/convenience, here's one of the failing logs from comment 50:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=263027894&repo=mozilla-central&lineNumber=2058

Relevant lines:


[task 2019-08-22T23:12:19.196Z] 23:12:19     INFO -  REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | 156 / 379 (41%)
[task 2019-08-22T23:12:30.112Z] 23:12:30     INFO -  REFTEST TEST-PASS | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | (LOAD ONLY)
[task 2019-08-22T23:12:30.112Z] 23:12:30     INFO -  REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
[task 2019-08-22T23:25:01.899Z] 23:25:01     INFO -  wait for org.mozilla.fennec_aurora complete; top activity=org.mozilla.fennec_aurora
[task 2019-08-22T23:25:02.204Z] 23:25:02     INFO -  org.mozilla.fennec_aurora unexpectedly found running. Killing...

(notice the 12m31s delay between REFTEST TEST-END and the next line, where we detect that something went wrong & bail)

Flags: needinfo?(dholbert) → needinfo?(emilio)
Flags: needinfo?(twisniewski)

I don't expect the extension to run on crashtests, but that failure may highlight a deeper issue.

That crashtest does a sync AJAX request, so I suspect it's not totally unexpected that it ends up exercising the extension code?

But there's no race condition there afaik; the crashtest just loads and finishes, as you can see from the REFTEST TEST-END. So I think something in the landed extension code or related infrastructure gets confused by the sync XHR and we end up running something for a long time.
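
For context, here is a minimal sketch of the kind of synchronous request the crashtest appears to issue; the actual test source isn't quoted in this bug, so the URL and error handling below are assumptions for illustration:

    // Hypothetical reconstruction of the sync XHR in 1546255.html, matching the
    // "Synchronous XMLHttpRequest on the main thread is deprecated" warning
    // logged at line 21 of the test in the logcat above.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", location.href, false); // third argument false = synchronous
    try {
      xhr.send(); // blocks the main thread until the request completes or fails
    } catch (e) {
      // the crashtest only needs the request to be issued, not to succeed
    }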

Flags: needinfo?(emilio)

Do you know if any of your Bug 1564594 code is intended to be active during tests? Is it conceivable that something about that patch could've led to a deadlock/hang somehow?

No, the changes in that patch should not have any impact on desktop Firefox at all, and should definitely have nothing to do with this specific test.

All the patch did was update a system addon (the webcompat addon), and in a way that shouldn't impact anything any more than the addon did before.

Flags: needinfo?(twisniewski)

(In reply to Thomas Wisniewski [:twisniewski] from comment #56)

No, the changes in that patch should not have any impact on desktop Firefox at all, and should definitely have nothing to do with this specific test.

This is an android failure, fwiw.

Ah, I see. In that case, I'm afraid that I still don't see how the patch would affect this test, because it ultimately just overrides the user-agent string for navigations to some Google URLs, and blocks specific network requests to the Google serviceworker. I don't see the test making requests to either.

The only things the patch does outside of the system addon (on Fennec) are to add two properties to the core telemetry ping and to inform Fennec of an upcoming Fretboard experiment. Both of these are trivial changes to the relevant Java files.

The only thing I could see happening here is that there may be very slight changes to how long the webcompat system addon takes to load, or that this specific patch somehow triggers the addon to fail loading (which shouldn't be the case, or I'd expect many other tests to trip as well). I don't see any references to the webcompat addon in the full log, either.

Pushed by ncsoregi@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/327cefb3bf5e Skip crashtest 1545177.html on Android. r=jmaher

Note that bug 1564594 (the thing that made this perma-orange) has a pending beta-uplift request, so we may need to uplift the test-skipping-patch (comment 60) along with that in order to avoid perma-orange on Beta.

However, I'd very much like to understand what piece of bug 1564594 is causing this failure before we actually do the uplift... I believe Tom is investigating (thanks Tom!)

One initial piece of data: I tried removing bug 1564594's new ua-overrides.js Google snippet in a Try run, and that was not sufficient to fix the perma-orange:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b9bcfe5f16234eb9437a4521b1c0f07f7551854
However, a full backout of bug 1564594 did fix the perma-orange:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=01802c227ec583ad8f265a1f8c7cc27ca4853e96
...so some other piece of bug 1564594 (besides the Google ua-overrides.js snippet) seems to be responsible.

I ran a series of try-runs over the past couple of days to determine what the issue is:

So the crash is somehow being caused by the addon registering at least one blocking webRequest onBeforeSendHeaders/onBeforeRequest listener with the match pattern ["*://*/*"].

This all-URLs listener does not have to actually do anything, merely having one of them is enough.

I'm not sure why this would cause the test to time out, though.
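
To make that concrete, here is a minimal sketch of the kind of catch-all blocking listener being described; the listener body and the choice of event are placeholders, not the actual webcompat addon code:

    // Registering any listener with "blocking" and a catch-all match pattern is
    // enough: the webRequest internals suspend every matching request until the
    // listener's result is delivered, even if the listener itself does nothing.
    browser.webRequest.onBeforeRequest.addListener(
      details => {
        // intentionally empty
      },
      { urls: ["*://*/*"] },  // matches every http(s) request
      ["blocking"]
    );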

I don't see any strange timeouts or other behavior when I view the same HTML as the test-case on a local build on my own phone (though it cannot access SpecialPowers, of course).

So maybe it's specific to older Android versions somehow? (My phone is running Android 8).

Luca, do you have any thoughts on what we could try here?

mixedpuppy also suggested that I try using <all_urls> instead of *://*/* as my match-pattern, but that also fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=152349dfe7e8346c7e70557b94528a963f846a0e

Flags: needinfo?(lgreco)

(Adding :robwu to the ni?)

Rob, maybe you have some clues here? Note that in bug 1543639 it doesn't seem as though it's just webRequest listeners that might trigger test failures, but they do seem to be the problem here.

Flags: needinfo?(rob)

(In reply to Thomas Wisniewski [:twisniewski] from comment #64)

Luca, do you have any thoughts on what we could try here?

mixedpuppy also suggested that I try using <all_urls> instead of *://*/* as my match-pattern, but that also fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=152349dfe7e8346c7e70557b94528a963f846a0e

Apologies for the lag time; I have been trying to reproduce the issue locally in an Android emulator to make it easier to take a deeper look into it (it is faster than pushing to try), but the feedback loop while investigating a bug on the Android emulator is still a bit slow (it feels like a slow-motion scene from a movie :-)).

Anyway, I'm pretty sure that using <all_urls> instead of *://*/* is not going to make any difference, based on what seems to be the underlying reason for the failure: from what I see, the test case that is getting stuck is the only one in this group of crashtests using a synchronous XMLHttpRequest.

When that request is intercepted by the webRequest API internals and a blocking webRequest listener is already registered ([1]), the request is suspended but it is never resumed because the promise returned by fire.sync(...) is never resolved.

It doesn't matter what the webRequest listener contains; it could even be an empty function and the issue would still be triggered.

On Fennec, e10s is set to false and both the extension pages and the webpages run in the same process (the main process), so I think a synchronous XMLHttpRequest may prevent the "API:RunListener" event from even being received by the extension code until the synchronous XMLHttpRequest completes (or fails).

The reftest harness is stuck because, while the synchronous XMLHttpRequest is still suspended, it calls LoadURI(BLANK_URL_FOR_CLEARING) and then waits for the BLANK_URL_FOR_CLEARING data URL to be loaded, which never happens ([2]).

It seems like a legit bug on Fennec (I haven't tried yet, but I would assume a similar issue could be triggered by doing a synchronous XMLHttpRequest from a webpage while webcompat or an adblocker extension is installed), but the same scenario shouldn't be a problem for GeckoView, where the webpages are loaded in remote browsers and the extension runs in the main process.

[1]: which happens if the webcompat extension has already started and has been able to register its webRequest listeners. When the test file is run alone, a single time, it seems to complete successfully because the webcompat extension has not yet subscribed its listeners and so doesn't intercept and suspend the synchronous XMLHttpRequest; but if I run the test case more than once in a row (e.g. using ./mach test --repeat 1 ...) it gets consistently stuck on the second run (when webcompat is already started and listening for webRequest API events) and fails by timing out.

[2]: loading the data URL has the side effect of unblocking the XMLHttpRequest, but the data URL itself is never loaded, the reftest onDocumentLoad is never called, and the test harness is stuck until the timeout.

Flags: needinfo?(lgreco)

Thanks Luca. Would you have any ideas I could try to work around this? I'm seeing other, similar timeout regressions from my patch which likewise make no sense to me, as they only seem to fail in the test framework (not when I run the tests on a nightly build with my patch applied, outside of the framework). That makes me suspect that either the test framework uses sync XHRs internally somewhere, or maybe there's more to the story (you can see all the regressions on parent bug 1564594). It's also worth noting that the system addon was already using webRequest listeners for other site patches, so it's unclear to me why my patch would suddenly start causing these errors.

Flags: needinfo?(lgreco)
Flags: needinfo?(rob)

(In reply to Thomas Wisniewski [:twisniewski] from comment #68)

Thanks Luca. Would you have any ideas I could try to work around this?

As we briefly discussed over IRC on Friday, using a more specific match pattern (more specific than "*://*/*" and/or "<all_urls>") should be enough to prevent the issue triggered in this crashtest (because the webRequest listener subscribed by webcompat would no longer match the synchronous XMLHttpRequest that this crashtest is using, and the crashtest shouldn't get stuck anymore).

I just tried the updated webcompat version (which no longer uses the "*://*/*" catch-all) in my local dev environment, and the crashtest now consistently completes without getting stuck anymore (which confirms that the fix prevents this failure, and very likely also the other regression tracked in the parent Bug 1564594).
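
To illustrate the shape of that fix, here is a hedged sketch of a listener scoped to the hosts an intervention actually targets rather than the catch-all pattern; the host list and listener body are illustrative assumptions, not the actual webcompat patch:

    // Only requests matching these narrow patterns are suspended, so the
    // crashtest's synchronous XHR (to a 10.0.2.2 test URL) is never intercepted
    // and the reftest harness is no longer blocked.
    browser.webRequest.onBeforeSendHeaders.addListener(
      details => {
        // UA-override / interception logic for the targeted site
      },
      { urls: ["*://*.google.com/*"] },  // specific hosts instead of "*://*/*"
      ["blocking", "requestHeaders"]
    );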

Flags: needinfo?(lgreco)
Priority: -- → P5
Has Regression Range: --- → yes
Severity: critical → S2

No reports in 3 years, which I guess is unsurprising given that this was on Android (per comment 8) and we landed patches here to skip the test on Android (comment 59).

I guess we can leave this open to track the fact that this is presumably still an issue if we were to un-skip these tests.

Dropping severity to S4, given that there's nothing actively causing trouble at the moment & it's not clear there were any user-facing issues here.

Severity: S2 → S4
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → INCOMPLETE