1112322 - Improve error reporting on xpcshell ASAN failures

Reporter

Description

11 years ago

We see a lot of intermittent xpcshell ASAN failures in which the process seems to die but we don't get a stack trace. Things like:

12:23:34 INFO - TEST-INFO | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | running test ...
12:28:34 WARNING - TEST-UNEXPECTED-FAIL | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | Test timed out
12:28:34 INFO - Can't trigger Breakpad, just killing process

See bug 1065640 for logs.

Because we don't get actionable data for these failures, they are simply ignored. We should figure out what's going on and fix it, otherwise a number of tests will likely be hidden.

(not currently active) Ted Mielczarek

Comment 1

11 years ago

I'd like to understand what's going on here. I asked decoder whether he could try some testing on a local ASAN build.

Phil Ringnalda (:philor)

Updated

11 years ago

Blocks: 1115282

Joel Maher ( :jmaher ) (UTC -8)

Updated

10 years ago

Whiteboard: [ateam_harness_work]

Ryan VanderMeulen [:RyanVM]

Comment 2

10 years ago

ASAN xpcshell has been hidden for 8 months now. What can we do to move this forward again?

Ryan VanderMeulen [:RyanVM]

Updated

10 years ago

Flags: needinfo?(jgriffin)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 3

10 years ago

If someone is capable of building these locally, recording them with rr may shed some light on the situation.

Jonathan Griffin (:jgriffin)

Reporter

Comment 4

10 years ago

Ted, any idea how to investigate this?

Flags: needinfo?(jgriffin) → needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 5

10 years ago

I don't have any good ideas; I don't know that much about ASAN. The first thing to do would be to have someone try to reproduce this locally. Anyone can do it: just download an ASAN build + test package and run the command from the logs. If you can reproduce locally, you should be able to attach a debugger and figure out what's happening (ASAN builds are not stripped).

Flags: needinfo?(ted)

Nathan Froyd [:froydnj]

Comment 6

10 years ago

I will try to look at this in the next week or so.

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 7

10 years ago

This is a little difficult to reproduce with current-ish clang because there are a slew of leaks reported by LSan. I'll see about disabling LSan or using an older version without LSan.

Nathan Froyd [:froydnj]

Comment 8

10 years ago

Disabling LSan completely doesn't seem to work; using suppressions and limiting max leaks reported to 1 does seem to work (lots of leaks are deep in the JS engine, and there's no obvious Gecko caller to use for suppressing leaks).

I get a number of failing tests for each test run. But the output (mach xpcshell-test --log-tbpl -) is typically:

TEST-TIMEOUT | dom/push/test/xpcshell/test_quota_exceeded.js | took 300000ms
Can't trigger Breakpad, just killing process
19:52.72 Can't trigger Breakpad, just killing process
xpcshell return code: None
dom/push/test/xpcshell/test_quota_exceeded.js | Process still running after test!
TEST-START | intl/uconv/tests/unit/test_decode_gb18030.js
dom/push/test/xpcshell/test_quota_exceeded.js failed or timed out, will retry.

which sure looks like the harness is doing something with the failing test. Ted?
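
For anyone trying to reproduce this locally, the LSan workaround from comment 8 (a suppressions file plus a cap of one reported leak) might be set up along these lines. This is only a sketch: `build_lsan_env` and the suppressions path are hypothetical names, not part of the xpcshell harness. `suppressions` and `max_leaks` are LeakSanitizer runtime flags passed through the `LSAN_OPTIONS` environment variable.

```python
import os


def build_lsan_env(suppressions_path, max_leaks=1, base_env=None):
    """Return a copy of the environment with LSAN_OPTIONS configured.

    Hypothetical helper for local reproduction; not harness code.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    options = [
        "suppressions=%s" % suppressions_path,  # file listing leaks to ignore
        "max_leaks=%d" % max_leaks,             # cap the number of leaks reported
    ]
    # Preserve any LSAN_OPTIONS that were already set, placing them first.
    existing = env.get("LSAN_OPTIONS")
    if existing:
        options.insert(0, existing)
    # Sanitizer option strings accept ':' as a separator.
    env["LSAN_OPTIONS"] = ":".join(options)
    return env
```

The returned dictionary could then be passed as `env=` to whatever launches the test run, so the settings apply only to that child process.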

Flags: needinfo?(nfroyd) → needinfo?(ted)

Phil Ringnalda (:philor)

Comment 9

10 years ago

I never really noticed that it wasn't actually "any xpcshell test at all" like the summary of bug 1065640 says, but instead "a test in the set of things that run right at the end." Now ASan xpcshell has slowed down to the point of hitting the 7200 second maxtime 100% of the time before it reaches those tests and thus this failure, which makes me wonder whether it would still hit this at all if we split it in two, as we would have to before we could do anything with it.

(not currently active) Ted Mielczarek

Comment 10

9 years ago

So fundamentally this is a test timing out (reasons still unknown) and then we can't trigger Breakpad (because ASAN builds --disable-crashreporter) so we try to send it a SIGKILL:

https://dxr.mozilla.org/mozilla-central/rev/82d0a583a9a39bf0b0000bccbf6d5c9ec2596bcc/build/automation.py.in#282

If that doesn't work I really don't know what else we can usefully do. For other builds, triggering a crash with Breakpad is our last resort, but that's not an option here. I think unless someone with ASAN knowledge digs into this it's just not ever going to get fixed (and these tests will stay hidden forever).

Flags: needinfo?(ted)

Dana Keeler (she/her) [:keeler]

Comment 11

9 years ago

Do we know why ASAN builds --disable-crashreporter? I just did a local ASAN build that (I think) didn't disable the crashreporter and it appears to have succeeded - perhaps the original reason for this has been fixed?

Andrew McCreight (out of office until 8/21) [:mccr8]

Comment 12

9 years ago

Were you able to crash successfully and have it work? I don't remember the exact reason but I'd guess it has to do with ASan and the crash reporter both wanting to intercept crashes.

(not currently active) Ted Mielczarek

Comment 13

9 years ago

Yeah, I think things get weird when crashes do happen. I'm not sure what happens if you build in that configuration and then send it a SIGABRT, though. If that works that'd meet this use case, and maybe we could just make all the crashreporter tests skip-if = asan?
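
To make the SIGABRT idea concrete, an escalation like the following could be tried against a hung process. This is a rough sketch under stated assumptions, not the harness's actual kill path: `kill_with_abort_first` and `grace_seconds` are hypothetical names, and it is POSIX only. Whether an ASAN build emits anything useful on SIGABRT is exactly the open question here; the sketch only shows the order of signals.

```python
import signal
import subprocess


def kill_with_abort_first(proc, grace_seconds=10):
    """Send SIGABRT, wait briefly, then fall back to SIGKILL.

    Hypothetical helper; `proc` is a subprocess.Popen object.
    Returns the process's return code.
    """
    # Nothing to do if the process has already exited.
    if proc.poll() is not None:
        return proc.returncode
    # Ask the process to abort so any abort-time reporting can run.
    proc.send_signal(signal.SIGABRT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or blocked SIGABRT; SIGKILL cannot be caught.
        proc.kill()
        return proc.wait()
```

On POSIX, a negative return code from `Popen` is the number of the signal that ended the process, so the caller can tell whether SIGABRT was enough or the SIGKILL fallback was needed.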

Andrew McCreight (out of office until 8/21) [:mccr8]

Comment 14

9 years ago

Crash reporter tests should already be disabled on ASan builds, because they don't work, because we have no crash reporter. :)

(not currently active) Ted Mielczarek

Comment 15

9 years ago

Right, but I meant "could we make ASAN builds *not* --disable-crashreporter, and just skip the crashreporter tests instead?"

BMO Automation

Updated

3 years ago

Severity: normal → S3

Categories

(Testing :: XPCShell Harness, defect)

Tracking

(Not tracked)

People

(Reporter: jgriffin, Unassigned)

References

Details

(Whiteboard: [ateam_harness_work])

Crash Data

Security

(public)

User Story
