Closed Bug 1553971 Opened 6 years ago Closed 9 months ago

Intermittent layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output

Categories

(Core :: CSS Parsing and Computation, defect, P5)

Tracking

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: crash, intermittent-failure, regression, Whiteboard: [stockwell disabled])

Attachments

(2 files, 1 obsolete file)

Filed by: aciure [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=248019165&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/OL6FdGIxRvCPluOeO2NjtA/runs/1/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/OL6FdGIxRvCPluOeO2NjtA/runs/1/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2019-05-23T17:55:21.129Z] 17:55:21 INFO - REFTEST TEST-PASS | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | (LOAD ONLY)
[task 2019-05-23T17:55:21.129Z] 17:55:21 INFO - REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
[task 2019-05-23T18:01:52.889Z] 18:01:52 INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
[task 2019-05-23T18:01:52.992Z] 18:01:52 INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...
[task 2019-05-23T18:01:52.992Z] 18:01:52 INFO - REFTEST TEST-INFO | started process screentopng
[task 2019-05-23T18:01:53.316Z] 18:01:53 INFO - REFTEST TEST-INFO | screentopng: exit 0
[task 2019-05-23T18:02:07.657Z] 18:02:07 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output
[task 2019-05-23T18:02:07.657Z] 18:02:07 INFO - remoteautomation.py | Application ran for: 0:13:54.845250
[task 2019-05-23T18:02:08.192Z] 18:02:08 INFO - REFTEST INFO | Downloading symbols from: https://queue.taskcluster.net/v1/task/cm0pFFzpTsmXUojTHNF9aA/artifacts/public/build/target.crashreporter-symbols.zip
[task 2019-05-23T18:02:12.945Z] 18:02:12 INFO - REFTEST INFO | Copy/paste: /builds/worker/workspace/build/linux64-minidump_stackwalk /tmp/tmpRHOTix/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp /tmp/tmpXilQ5r
[task 2019-05-23T18:02:18.804Z] 18:02:18 INFO - REFTEST INFO | Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp
[task 2019-05-23T18:02:18.804Z] 18:02:18 INFO - REFTEST INFO | Saved app info as /builds/worker/workspace/build/blobber_upload_dir/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.extra
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - REFTEST PROCESS-CRASH | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | application crashed [@ libc.so + 0x8c66a]
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - Crash dump filename: /tmp/tmpRHOTix/37a6e1c4-40e7-2c33-8a1c-492691cd8c13.dmp
[task 2019-05-23T18:02:18.807Z] 18:02:18 INFO - Operating system: Android
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 0.0.0 Linux 3.10.0+ #1 PREEMPT Thu Jan 5 00:46:30 UTC 2017 x86_64
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - CPU: amd64
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - family 6 model 2 stepping 3
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 1 CPU
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - GPU: UNKNOWN
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Crash reason: SIGABRT
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Crash address: 0xd5c
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Process uptime: not available
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - Thread 0 (crashed)
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - 0 libc.so + 0x8c66a
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rax = 0xfffffffffffffffc rdx = 0x0000000000000010
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rcx = 0xffffffffffffffff rbx = 0x00000000ffffffff
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rsi = 0x00007fff36f14580 rdi = 0x000000000000001c
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rbp = 0x00000000ffffffff rsp = 0x00007fff36f14508
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000008
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r10 = 0x00000000ffffffff r11 = 0x0000000000000246
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r12 = 0x00000000ffffffff r13 = 0x00007708fe5e3f40
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - r14 = 0x0000000000000000 r15 = 0x0000000000000000
[task 2019-05-23T18:02:18.808Z] 18:02:18 INFO - rip = 0x00007708ffc0866a
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: given as instruction pointer in context
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 1 libc.so + 0x2a0e6
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14510 rip = 0x00007708ffba60e6
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 2 libutils.so + 0x184e7
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14530 rip = 0x00007708ffc914e7
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 3 libart.so + 0x14cd91
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14540 rip = 0x00007708fde58d91
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - 4 system@framework@boot-framework.art + 0x3d5350
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - rsp = 0x00007fff36f14558 rip = 0x0000000071675350
[task 2019-05-23T18:02:18.809Z] 18:02:18 INFO - Found by: stack scanning

The priority flag is not set for this bug.
:emilio, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(emilio)

Do we know why there are no usable stacks on Android? I'm not sure whether you're the right person to ask, Mike; feel free to forward as needed.

Flags: needinfo?(emilio) → needinfo?(mh+mozilla)
Regressed by: 1546255

minidump_stackwalk can't do anything about system libraries, for which it doesn't have breakpad symbol files.

Flags: needinfo?(mh+mozilla)

The priority flag is not set for this bug.
:heycam, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(cam)

These are all on Android 7.0 (geckoview). Tests hang at the end of 1546255.html. This logcat seems typical:

https://taskcluster-artifacts.net/e_Wost4ISb2X-Bg91-GLvA/0/public/test_info//logcat-emulator-5554.log

06-08 13:00:32.290  2417  2432 I Gecko   : REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | 3167 / 3761 (84%)
06-08 13:00:32.290  2417  2432 I Gecko   : 
06-08 13:00:32.290  2417  2432 I Gecko   : {"action":"log","time":1559995232290,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"START http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html"}
06-08 13:00:32.320  2477  2492 W Web Content: [JavaScript Warning: "The stylesheet data:;base64,QGN7IGJ9CkRBTVAsIEhUTUwgey was loaded as CSS even though its MIME type, “text/plain”, is not “text/css”." {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 0}]
06-08 13:00:32.320  2477  2492 E Web Content: [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 0}]
06-08 13:00:32.330  2417  2432 I Gecko   : 
06-08 13:00:32.330  2417  2432 I Gecko   : {"action":"log","time":1559995232330,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] OnDocumentLoad triggering AfterOnLoadScripts"}
06-08 13:00:32.330  2477  2492 W Web Content: [JavaScript Warning: "Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user’s experience. For more help http://xhr.spec.whatwg.org/" {file: "http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html" line: 21}]
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Initializing canvas snapshot"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterOnLoadScripts belatedly entering WaitForTestEnd"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] WaitForTestEnd: Adding listeners"}
06-08 13:00:32.340  2477  2492 W Web Content: [JavaScript Warning: "Use of Mutation Events is deprecated. Use MutationObserver instead." {file: "resource://reftest/reftest-content.js" line: 784}]
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Initializing canvas snapshot"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: dispatching MozReftestInvalidate"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_REFTEST_WAIT_REMOVAL"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_SPELL_CHECKS"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_FOR_APZ_FLUSH"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: APZ flush not required"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: STATE_WAITING_TO_FINISH"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: Doing sync flush to compositor"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] MakeProgress: Completed"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] RecordResult fired"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"RecordResult fired"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"test_status","time":1559995232340,"thread":null,"pid":null,"source":"reftest","test":"http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html","subtest":"(LOAD ONLY)","status":"PASS"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"test_end","time":1559995232340,"thread":null,"pid":null,"source":"reftest","test":"http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html","status":"OK"}
06-08 13:00:32.340  2417  2432 I Gecko   : 
06-08 13:00:32.340  2417  2432 I Gecko   : {"action":"log","time":1559995232340,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"Loading a blank page"}
06-08 13:00:32.400  2417  2432 I Gecko   : 
06-08 13:00:32.400  2417  2432 I Gecko   : {"action":"log","time":1559995232400,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterPaintListener in data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E"}
06-08 13:00:32.400  2417  2432 I Gecko   : 
06-08 13:00:32.400  2417  2432 I Gecko   : {"action":"log","time":1559995232400,"thread":null,"pid":null,"source":"reftest","level":"DEBUG","message":"[CONTENT] AfterPaintListener in data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E"}
06-08 13:07:12.090  2417  2423 I art     : Thread[3,tid=2423,WaitingInMainSignalCatcherLoop,Thread*=0x73c2032b7000,peer=0x12c4dd30,"Signal Catcher"]: reacting to signal 3
Summary: Intermittent layout/style/crashtests/1546255.html | application crashed [@ libc.so + 0x8c66a] after application timed out after 370 seconds with no output → Intermittent layout/style/crashtests/1546255.html | application timed out after 370 seconds with no output

We had 102 failures in the past week, all on android-em-7-0-x86_64 opt.
Recent log failure:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=255173994&repo=autoland&lineNumber=14569

This looks to be caused by Bug 1556058. Joel, can we star these timeouts against that bug, or should we have bugs for each test timing out?

Flags: needinfo?(jmaher)

the plan is to keep on the new emulator, so we should skip this crashtest on android&&!debug and move on. Every time we switch platforms or run on a new config, many tests change behavior; the new emulators run more tests and run them more reliably, and this seems to be one of the few failures that slipped by in early testing.

Flags: needinfo?(jmaher)

good question Cosmin, yes let's do that, but please indicate in bug 1556058 the tests or bug numbers; you could even mark this bug as blocking bug 1556058 as a quick way to do it.

Flags: needinfo?(jmaher)

Thanks for the info Joel. Will let the other sheriffs know about this.

Flags: needinfo?(cam)
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Pushed by malexandru@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e839eae27aa4 Disable crashtests/1546255.html for frequent failures on android 7 opt r=jmaher
Attached file Bug 1553971 - test (obsolete) —
Attachment #9077061 - Attachment is obsolete: true

How sure are we that this failure had anything to do with this test? Before the timeout, I see:

REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

So this test has ended (and we've probably even navigated away from it) by the time we hit the problem.

It looks like GeckoView is hitting a shutdown hang, or something -- not a hang in this test. And the harness is (kinda-mistakenly) attributing the failure to this test because it happens to be the most recent test that was running, but it seems entirely likely that the shutdown hang was caused by something else.

In other words, I'm wondering whether this failure will now suddenly start being reported for another nearby test (whichever test ends up being the last one now that this test has been disabled)... Worth keeping an eye out for.

(In reply to Daniel Holbert [:dholbert] from comment #30)

How sure are we that this failure had anything to do with this test?

Comment 19 makes me even more curious about this... Sounds like this may be a problem with the emulator and disabling individual harness-"blamed" tests is unlikely to help?

see also e.g. bug 1471222, where we had a similar class of failures for ASAN test runs: it seemed to be a shutdown hang plus a harness issue, and the failure message mistakenly implied that the unlucky last test was somehow involved (it was not). It looks like we later improved the logging in bug 1475141 to make that clearer, but I'm not sure whether that fix benefited the crashtest harness or just reftests... ahal, do you know?

Flags: needinfo?(ahal)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #20)

the plan is to keep on the new emulator, so we should skip this crashtest on android&&!debug and move on.

Joel, are you still sure about this^^ assessment, given that this failure looks like a shutdown hang & potentially has nothing to do with this test?

Perhaps a more-robust fix here is to revert to the old emulator for crashtests, as we did for xpcshell it seems? (bug 1556058 comment 19)

Flags: needinfo?(jmaher)

it is fine to revert to the old emulator; I am not clear on what issues were fixed for crashtests by switching to the new emulator. :gbrown knows a lot more about both emulators and will be back from PTO on Monday; let's get this on his radar.

Flags: needinfo?(jmaher) → needinfo?(gbrown)

I also wonder if this is the same issue as bug 1563997. The first comment there is for a debug android Mochitest run (specifically "Android 7.0 x86-64 opt Mochitests test-android-em-7.0-x86_64/opt-mochitest-gpu-e10s"), and it ends with the same pattern that I'm seeing here and quoted in comment 30. In particular, it's got:

[logging for last test finishing]
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

...and a 370-second harness timeout.

Assuming this is a more general harness/emulator problem and unrelated to this particular test, we should probably backout comment 27 (unless we're noticing less orange as a result of it, but I'm skeptical that we would be).

Flags: needinfo?(malexandru)
See Also: → 1563997
See Also: → 1565286

I think bug 1475141 should have also benefited crashtest. But if you're seeing evidence to the contrary then I wouldn't be too surprised if crashtest wasn't working.

Here's a reftest task where it's working fwiw:
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=218864193&revision=75a0b52395c94a2fdc77642d5dfc6cf0e027ad58

Flags: needinfo?(ahal)

(In reply to Andrew Halberstadt [:ahal] from comment #35)

I think bug 1475141 should have also benefited crashtest. But if you're seeing evidence to the contrary then I wouldn't be too surprised if crashtest wasn't working.

I spun this off as bug 1565286 -- let's move this discussion over there. Thanks!

(In reply to Cosmin Sabou [:CosminS] from comment #19)

This looks to be caused of Bug 1556058. Joel, can we star these timeouts against that bug or should we have bugs for each test timing out?

We have a history of failures in this bug beginning May 22, while bug 1556058 did not land anything until June 28; there may have been an increase in failure frequency with the emulator change, but the failure before/after bug 1556058 looks to be the same.

(In reply to Daniel Holbert [:dholbert] from comment #30)

In other words, I'm wondering whether this failure will now suddenly start being reported for another nearby test (whichever test ends up being the last one now that this test has been disabled)... Worth keeping an eye out for.

I had the same concern, but actually, I see no new failures since this test was disabled: There are no shutdown timeouts in Android 7.0 crashtests since comment 29.

It is okay to revert to the old emulator for crashtests if that is desired, but in light of comment 38 and comment 39 I think the best course of action is to do nothing and/or actually investigate the cause of the hang.

Flags: needinfo?(gbrown)

(In reply to Daniel Holbert [:dholbert] from comment #34)

I also wonder if this is the same issue as bug 1563997. The first comment there is for a debug android Mochitest run (specifically "Android 7.0 x86-64 opt Mochitests test-android-em-7.0-x86_64/opt-mochitest-gpu-e10s"), and it ends with the same pattern that I'm seeing here and quoted in comment 30. In particular, it's got:

[logging for last test finishing]
INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...

...and a 370-second harness timeout.

Assuming this is a more general harness/emulator problem and unrelated to this particular test, we should probably backout comment 27 (unless we're noticing less orange as a result of it, but I'm skeptical that we would be).

Hi Daniel. There aren't any failures since June 11th according to OF so I presume that the backout you mentioned is not necessary. Please correct me if I'm wrong.

Flags: needinfo?(malexandru) → needinfo?(dholbert)

(In reply to Andreea Pavel [:apavel] from comment #41)

Hi Daniel. There aren't any failures since June 11th according to OF

No Android crashtest "application timed out after 370 seconds with no output" issues at all?

I was guessing that this issue would just pop up as a similar failure attributed to another nearby testcase (whichever testcase became the new final test in this bucket), and it would spawn a new intermittent-failure bug. But if that hasn't happened, then I agree the hypothetical backout would not be necessary & the test-disabling should stick.

Flags: needinfo?(dholbert)

(In reply to Daniel Holbert [:dholbert] from comment #43)

(In reply to Andreea Pavel [:apavel] from comment #41)

Hi Daniel. There aren't any failures since June 11th according to OF

No Android crashtest "application timed out after 370 seconds with no output" issues at all?

I was guessing that this issue would just pop up as a similar failure attributed to another nearby testcase (whichever testcase became the new final test in this bucket), and it would spawn a new intermittent-failure bug. But if that hasn't happened, then I agree the hypothetical backout would not be necessary & the test-disabling should stick.

I was referring solely to this failure, which doesn't occur anymore.

(In reply to Andreea Pavel [:apavel] from comment #44)

I was referring solely to this failure, which doesn't occur anymore.

Well, all we know from OrangeFactor + this bug is that this failure doesn't occur anymore with this test mentioned in the failure message, which isn't particularly surprising since the test has been disabled. But since this is an error that happens when the whole crashtest run is complete, I was worried that it'd keep failing & just not be reported here since it'd be attributed to some other test.

But, having said that: it looks like there haven't been any new crashtest "timed out after 370 seconds" bugs filed since Comment 27 (when this test was disabled). I did a bugzilla search for that "timed out" phrase in bugs filed since ~July 9th, and I didn't see any results that were for a crashtest run. So, I think this did indeed go away, which means the test-disabling was in fact a Good Thing.

So that means something about 1546255.html is problematic and can cause shutdown hangs, at least on Android.

I agree, looks like the same issue. But now it's apparently affecting Android Debug as well.

We could paper over it by adjusting the skip-if(Android&&!isDebugBuild) annotation to remove the "!isDebugBuild" restriction. But given that we've got a tight range (this particular autoland-->m-c merge), it'd be nice to see if we can figure out where/why this started perma-failing.

:malexandru, one request for you: is it possible for us to backfill this testrun (from comment 50) on Autoland and see if there's one push there where it got bad? (It doesn't look to me like we have any recent "Android 4.3" runs of any sort at all on Autoland right now, but I'm wondering if we can spawn some over the potentially-guilty range...)

Also, one observation (knocking down a possible explanation): I don't think this is just a case of "rebucketing", because the logs from m-c's last good push and first bad push both have this identical messaging/bucket-description:

REFTEST INFO | Running chunk 9 out of 10 chunks.  tests 3039-3417/3795
REFTEST SUITE-START | Running 379 tests
REFTEST TEST-START | http://10.0.2.2:8854/tests/layout/style/crashtests/1245260-1.html

Notably: exact same total number of tests, exact same range of tests (3039-3417), exact same first test name, and the problematic test does get run in both the last-good and the first-bad build (as the final test before we time out, in the latter case).

Flags: needinfo?(dholbert) → needinfo?(malexandru)

Thanks, malexandru! For now: if this perma-orange is showing up and causing sheriff pain, I think we're fine to land a followup to your earlier patch to remove &&!isDebugBuild (i.e. just skip this crashtest on Android for the moment).
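
For reference, here is a sketch of what that followup would look like in the crashtest manifest (layout/style/crashtests/crashtests.list), assuming the usual reftest/crashtest annotation syntax; the exact line in the tree may differ:

    # before: skipped only on Android non-debug (opt) builds
    skip-if(Android&&!isDebugBuild) load 1546255.html
    # after: skipped on all Android builds
    skip-if(Android) load 1546255.html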

emilio, it looks like you wrote the crashtest in question, 1546255.html - do you have any idea how the just-landed Bug 1564594 ( https://hg.mozilla.org/integration/autoland/rev/2fc4f931f041ebab1d02fe3433fdaba6f9a49fdf ) could've tickled it and made us reliably time out after running it? I'm guessing maybe there's a latent race condition that Bug 1564594 caused us to "lose", but I don't really know how...

:twisniewski, same question to you... Do you know if any of your Bug 1564594 code is intended to be active during tests? Is it conceivable that something about that patch could've led to a deadlock/hang somehow?

For reference/convenience, here's one of the failing logs from comment 50:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=263027894&repo=mozilla-central&lineNumber=2058

Relevant lines:


[task 2019-08-22T23:12:19.196Z] 23:12:19     INFO -  REFTEST TEST-LOAD | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | 156 / 379 (41%)
[task 2019-08-22T23:12:30.112Z] 23:12:30     INFO -  REFTEST TEST-PASS | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html | (LOAD ONLY)
[task 2019-08-22T23:12:30.112Z] 23:12:30     INFO -  REFTEST TEST-END | http://10.0.2.2:8854/tests/layout/style/crashtests/1546255.html
[task 2019-08-22T23:25:01.899Z] 23:25:01     INFO -  wait for org.mozilla.fennec_aurora complete; top activity=org.mozilla.fennec_aurora
[task 2019-08-22T23:25:02.204Z] 23:25:02     INFO -  org.mozilla.fennec_aurora unexpectedly found running. Killing...

(notice the 12m31s delay between REFTEST TEST-END and the next line, where we detect that something went wrong & bail)

Flags: needinfo?(dholbert) → needinfo?(emilio)
Flags: needinfo?(twisniewski)

I don't expect the extension to run on crashtests, but that failure may highlight a deeper issue.

That crashtest does a sync AJAX request, so I suspect it's not totally unexpected that it ends up exercising the extension code?

But there's no race condition there afaik; the crashtest just loads and finishes, as you can see from the REFTEST TEST-END. So I think something in the landed extension code or related infrastructure gets confused by the sync XHR and we end up running something for a long time.
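
For context, here is a minimal sketch of the kind of synchronous request the crashtest appears to issue; the actual test source isn't quoted in this bug, so the URL and error handling below are assumptions for illustration:

    // Hypothetical reconstruction of the sync XHR in 1546255.html, matching the
    // "Synchronous XMLHttpRequest on the main thread is deprecated" warning
    // logged at line 21 of the test in the logcat above.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", location.href, false); // third argument false = synchronous
    try {
      xhr.send(); // blocks the main thread until the request completes or fails
    } catch (e) {
      // the crashtest only needs the request to be issued, not to succeed
    }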

Flags: needinfo?(emilio)

Do you know if any of your Bug 1564594 code is intended to be active during tests? Is it conceivable that something about that patch could've led to a deadlock/hang somehow?

No, the changes in that patch should not have any impact on desktop Firefox at all, and should definitely have nothing to do with this specific test.

All the patch did was update a system addon (the webcompat addon), and in a way that shouldn't impact anything any more than the addon did before.

Flags: needinfo?(twisniewski)

(In reply to Thomas Wisniewski [:twisniewski] from comment #56)

No, the changes in that patch should not have any impact on desktop Firefox at all, and should definitely have nothing to do with this specific test.

This is an android failure, fwiw.

Ah, I see. In that case, I'm afraid that I still don't see how the patch would affect this test, because it ultimately just overrides the user-agent string for navigations to some Google URLs, and blocks specific network requests to the Google serviceworker. I don't see the test making requests to either.

The only things the patch does outside of the system addon (on Fennec) are to add two properties to the core telemetry ping and to inform Fennec of an upcoming Fretboard experiment. Both of these are trivial changes to the relevant Java files.

The only thing I could see happening here is that there may be very slight changes to how long the webcompat system addon takes to load, or that this specific patch somehow triggers the addon to fail loading (which shouldn't be the case, or I'd expect many other tests to trip as well). I don't see any references to the webcompat addon in the full log, either.

Pushed by ncsoregi@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/327cefb3bf5e Skip crashtest 1545177.html on Android. r=jmaher

Note that bug 1564594 (the thing that made this perma-orange) has a pending beta-uplift request, so we may need to uplift the test-skipping-patch (comment 60) along with that in order to avoid perma-orange on Beta.

However, I'd very much like to understand what piece of bug 1564594 is causing this failure before we actually do the uplift... I believe Tom is investigating (thanks Tom!)

One initial piece of data: I tried removing bug 1564594's new ua-overrides.js Google snippet in a Try run, and that was not sufficient to fix the perma-orange:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b9bcfe5f16234eb9437a4521b1c0f07f7551854
However, a full backout of bug 1564594 did fix the perma-orange:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=01802c227ec583ad8f265a1f8c7cc27ca4853e96
...so some other piece of bug 1564594 (besides the Google ua-overrides.js snippet) seems to be responsible.

I ran a series of try-runs over the past couple of days to determine what the issue is:

So the crash is somehow being caused by the addon registering at least one blocking webRequest onBeforeSendHeaders/onBeforeRequest listener with the match pattern ["*://*/*"].

This all-URLs listener does not have to actually do anything, merely having one of them is enough.

I'm not sure why this would cause the test to time out, though.
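
To make that concrete, here is a minimal sketch of the kind of catch-all blocking listener being described; the listener body and the choice of event are placeholders, not the actual webcompat addon code:

    // Registering any listener with "blocking" and a catch-all match pattern is
    // enough: the webRequest internals suspend every matching request until the
    // listener's result is delivered, even if the listener itself does nothing.
    browser.webRequest.onBeforeRequest.addListener(
      details => {
        // intentionally empty
      },
      { urls: ["*://*/*"] },  // matches every http(s) request
      ["blocking"]
    );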

I don't see any strange timeouts or other behavior when I view the same HTML as the test-case on a local build on my own phone (though it cannot access SpecialPowers, of course).

So maybe it's specific to older Android versions somehow? (My phone is running Android 8).

Luca, do you have any thoughts on what we could try here?

mixedpuppy also suggested that I try using <all_urls> instead of *://*/* as my match-pattern, but that also fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=152349dfe7e8346c7e70557b94528a963f846a0e

Flags: needinfo?(lgreco)

(Adding :robwu to the ni?)

Rob, maybe you have some clues here? Note that in bug 1543639 it doesn't seem as though it's just webRequest listeners that might trigger test failures, but they do seem to be the problem here.

Flags: needinfo?(rob)

(In reply to Thomas Wisniewski [:twisniewski] from comment #64)

Luca, do you have any thoughts on what we could try here?

mixedpuppy also suggested that I try using <all_urls> instead of *://*/* as my match-pattern, but that also fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=152349dfe7e8346c7e70557b94528a963f846a0e

Apologies for the lag time; I have been trying to reproduce the issue locally in an Android emulator to make it easier to take a deeper look into it (it is faster than pushing to try), but the feedback loop while investigating a bug on the Android emulator is still a bit slow (it feels like a slow-motion scene from a movie :-)).

Anyway, I'm pretty sure that using <all_urls> instead of *://*/* is not going to make any difference, based on what seems to be the underlying reason for the failure: from what I see, the test case that is getting stuck is the only one in this group of crashtests using a synchronous XMLHttpRequest.

When that request is intercepted by the webRequest API internals and a blocking webRequest listener is already registered ([1]), the request is suspended but it is never resumed because the promise returned by fire.sync(...) is never resolved.

It doesn't matter what the webRequest listener contains; it could even be an empty function and the issue would still be triggered.

On Fennec, e10s is set to false and both the extension pages and the webpages run in the same process (the main process), so I think a synchronous XMLHttpRequest may prevent the "API:RunListener" event from even being received by the extension code until the synchronous XMLHttpRequest completes (or fails).

The reftest harness is stuck because, while the synchronous XMLHttpRequest is still suspended, it calls LoadURI(BLANK_URL_FOR_CLEARING) and then waits for the BLANK_URL_FOR_CLEARING data URL to be loaded, which never happens ([2]).

It seems like a legit bug on Fennec (I haven't tried yet, but I would assume a similar issue could be triggered by doing a synchronous XMLHttpRequest from a webpage while webcompat or an adblocker extension is installed), but the same scenario shouldn't be a problem for GeckoView, where the webpages are loaded in remote browsers and the extension runs in the main process.

[1]: which happens if the webcompat extension has already started and has been able to register its webRequest listeners. When the test file is run alone, a single time, it seems to complete successfully because the webcompat extension has not yet subscribed its listeners and so doesn't intercept and suspend the synchronous XMLHttpRequest; but if I run the test case more than once in a row (e.g. using ./mach test --repeat 1 ...) it gets consistently stuck on the second run (when webcompat is already started and listening for webRequest API events) and fails by timing out.

[2]: loading the data URL has the side effect of unblocking the XMLHttpRequest, but the data URL itself is never loaded, the reftest onDocumentLoad is never called, and the test harness is stuck until the timeout.

Flags: needinfo?(lgreco)

Thanks Luca. Would you have any ideas I could try to work around this? I'm seeing other, similar timeout regressions from my patch which likewise make no sense to me, as they only seem to fail in the test framework (not when I run the tests on a nightly build with my patch applied, outside of the framework). That makes me suspect that either the test framework uses sync XHRs internally somewhere, or maybe there's more to the story (you can see all the regressions on parent bug 1564594). It's also worth noting that the system addon was already using webRequest listeners for other site patches, so it's unclear to me why my patch would suddenly start causing these errors.

Flags: needinfo?(lgreco)
Flags: needinfo?(rob)

(In reply to Thomas Wisniewski [:twisniewski] from comment #68)

Thanks Luca. Would you have any ideas I could try to work around this?

As we briefly discussed over IRC on Friday, using a more specific match pattern (more specific than "*://*/*" and/or "<all_urls>") should be enough to prevent the issue triggered in this crashtest (because the webRequest listener subscribed by webcompat would no longer match the synchronous XMLHttpRequest that this crashtest is using, and the crashtest shouldn't get stuck anymore).

I just tried the updated webcompat version (which no longer uses the "*://*/*" catch-all) in my local dev environment, and the crashtest now consistently completes without getting stuck anymore (which confirms that the fix prevents this failure, and very likely also the other regression tracked in the parent Bug 1564594).
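
To illustrate the shape of that fix, here is a hedged sketch of a listener scoped to the hosts an intervention actually targets rather than the catch-all pattern; the host list and listener body are illustrative assumptions, not the actual webcompat patch:

    // Only requests matching these narrow patterns are suspended, so the
    // crashtest's synchronous XHR (to a 10.0.2.2 test URL) is never intercepted
    // and the reftest harness is no longer blocked.
    browser.webRequest.onBeforeSendHeaders.addListener(
      details => {
        // UA-override / interception logic for the targeted site
      },
      { urls: ["*://*.google.com/*"] },  // specific hosts instead of "*://*/*"
      ["blocking", "requestHeaders"]
    );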

Flags: needinfo?(lgreco)
Priority: -- → P5
Has Regression Range: --- → yes
Severity: critical → S2

No reports in 3 years, which I guess is unsurprising given that this was on Android (per comment 8) and we landed patches here to skip the test on Android (comment 59).

I guess we can leave this open to track the fact that this is presumably still an issue if we were to un-skip these tests.

Dropping severity to S4, given that there's nothing actively causing trouble at the moment & it's not clear there were any user-facing issues here.

Severity: S2 → S4
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → INCOMPLETE