Open Bug 1112322 Opened 10 years ago Updated 2 years ago

Improve error reporting on xpcshell ASAN failures

Categories

(Testing :: XPCShell Harness, defect)

Tracking

(Not tracked)

People

(Reporter: jgriffin, Unassigned)

Details

(Whiteboard: [ateam_harness_work])

We see a lot of intermittent xpcshell ASAN failures in which the process seems to die but we don't get a stack trace.  Things like:

12:23:34     INFO -  TEST-INFO | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | running test ...
12:28:34  WARNING -  TEST-UNEXPECTED-FAIL | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | Test timed out
12:28:34     INFO -  Can't trigger Breakpad, just killing process

See bug 1065640 for logs.

Because we don't get actionable data for these failures, they are simply ignored.  We should figure out what's going on and fix it, otherwise a number of tests will likely be hidden.
I'd like to understand what's going on here, so I asked decoder whether he could try some testing on a local ASAN build.
Blocks: 1115282
Whiteboard: [ateam_harness_work]
ASAN xpcshell has been hidden for 8 months now. What can we do to move this forward again?
Flags: needinfo?(jgriffin)
If someone is able to build these locally, recording them with rr may shed some light on the situation.
Ted, any idea how to investigate this?
Flags: needinfo?(jgriffin) → needinfo?(ted)
I don't have any good ideas; I don't know that much about ASAN. The first thing to do would be to have someone try to reproduce this locally. Anyone can do it: just download an ASAN build + test package and run the command from the logs. If you can reproduce it locally, you should be able to attach a debugger and figure out what's happening (ASAN builds are not stripped).
Flags: needinfo?(ted)
I will try and look at this in the next week or so.
Flags: needinfo?(nfroyd)
This is a little difficult to reproduce with current-ish clang because there are a slew of leaks reported by LSan.  I'll see about disabling LSan or using an older version without LSan.
Disabling LSan completely doesn't seem to work; using suppressions and limiting max leaks reported to 1 does seem to work (lots of leaks are deep in the JS engine, and there's no obvious Gecko caller to use for suppressing leaks).  I get a number of failing tests for each test run.

But the output (mach xpcshell-test --log-tbpl -) is typically:

TEST-TIMEOUT | dom/push/test/xpcshell/test_quota_exceeded.js | took 300000ms
Can't trigger Breakpad, just killing process
19:52.72 Can't trigger Breakpad, just killing process
xpcshell return code: None
dom/push/test/xpcshell/test_quota_exceeded.js | Process still running after test!
TEST-START | intl/uconv/tests/unit/test_decode_gb18030.js
dom/push/test/xpcshell/test_quota_exceeded.js failed or timed out, will retry.

which sure looks like the harness is doing something with the failing test.  Ted?
Flags: needinfo?(nfroyd) → needinfo?(ted)
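For anyone trying to repeat the LSan setup described above (suppressions plus a cap of one reported leak), here is a rough sketch of how the environment might be put together before launching the tests. `suppressions` and `max_leaks` are LSan runtime flags passed through `LSAN_OPTIONS`; the helper name `build_lsan_env` and the suppressions file path are made up for illustration and are not part of the harness.

```python
import os


def build_lsan_env(suppressions_path, max_leaks=1, base_env=None):
    """Return a copy of the environment with LSAN_OPTIONS extended.

    Hypothetical helper: it only builds the environment, it does not
    launch anything. Existing LSAN_OPTIONS entries are kept in front.
    """
    env = dict(os.environ if base_env is None else base_env)
    options = [
        "suppressions=" + suppressions_path,  # file listing leaks to ignore
        "max_leaks=%d" % max_leaks,           # cap on the number of leaks reported
    ]
    existing = env.get("LSAN_OPTIONS")
    if existing:
        options.insert(0, existing)
    # LSan accepts colon-separated flags in LSAN_OPTIONS.
    env["LSAN_OPTIONS"] = ":".join(options)
    return env


if __name__ == "__main__":
    # "lsan_suppressions.txt" is a placeholder path.
    print(build_lsan_env("lsan_suppressions.txt", base_env={})["LSAN_OPTIONS"])
```

The resulting dictionary could then be passed as `env=` to whatever launches the xpcshell run.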
I never really noticed that it wasn't actually "any xpcshell test at all", as the summary of bug 1065640 says, but instead "a test in the set of things that run right at the end".

Now ASan xpcshell has slowed down to the point that it hits the 7200 second maxtime 100% of the time, before it ever reaches those tests and thus before it can hit this. That makes me wonder whether it would still hit this at all if we split the run in two, as we would have to before we could do anything with it.
So fundamentally this is a test timing out (reasons still unknown) and then we can't trigger Breakpad (because ASAN builds --disable-crashreporter) so we try to send it a SIGKILL:
https://dxr.mozilla.org/mozilla-central/rev/82d0a583a9a39bf0b0000bccbf6d5c9ec2596bcc/build/automation.py.in#282

If that doesn't work I really don't know what else we can usefully do. For other builds triggering a crash with Breakpad is our last resort but that's not an option here. I think unless someone with ASAN knowledge digs into this it's just not ever going to get fixed (and these tests will stay hidden forever).
Flags: needinfo?(ted)
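To make the sequence above concrete, here is a minimal sketch of that kind of timeout handling. It is not the harness's actual code. `kill_hung_process`, `grace_seconds` and `crashreporter_enabled` are hypothetical names, and using SIGABRT as the way to provoke a crash report is an assumption. The sketch is POSIX-only.

```python
import signal
import subprocess
import sys


def kill_hung_process(proc, grace_seconds=5.0, crashreporter_enabled=False):
    """Try to stop a timed-out child; return its exit status or None.

    With a crash reporter we first ask for a crash (assumed: SIGABRT) so
    that a stack could be captured; without one, SIGKILL is all we have.
    """
    signals = [signal.SIGKILL]
    if crashreporter_enabled:
        signals.insert(0, signal.SIGABRT)
    for sig in signals:
        if proc.poll() is not None:
            break  # already exited
        proc.send_signal(sig)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            continue  # still alive, escalate to the next signal
    # None here means the process survived every signal we sent.
    return proc.poll()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("return code:", kill_hung_process(child))
```

In this sketch a `None` result means the process outlived even SIGKILL within the grace period, which appears to be the situation the "Process still running after test!" line in the log above describes.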
Do we know why ASAN builds --disable-crashreporter? I just did a local ASAN build that (I think) didn't disable the crashreporter and it appears to have succeeded - perhaps the original reason for this has been fixed?
Were you able to crash successfully and have it work? I don't remember the exact reason but I'd guess it has to do with ASan and the crash reporter both wanting to intercept crashes.
Yeah, I think things get weird when crashes do happen. I'm not sure what happens if you build in that configuration and then send it a SIGABRT, though. If that works that'd meet this use case, and maybe we could just make all the crashreporter tests skip-if = asan?
Crash reporter tests should already be disabled on ASan builds, because they don't work, because we have no crash reporter. :)
Right, but I meant "could we make ASAN builds *not* --disable-crashreporter, and just skip the crashreporter tests instead?"
Severity: normal → S3