Closed Bug 1065640 Opened 10 years ago Closed 9 years ago

Intermittent ASAN any xpcshell test at all | Test timed out

Categories

(Testing :: XPCShell Harness, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure)

https://tbpl.mozilla.org/php/getParsedLog.php?id=47810174&tree=Mozilla-Inbound

Ubuntu ASAN VM 12.04 x64 mozilla-inbound opt test xpcshell on 2014-09-10 11:16:13 PDT for push 09d86eb69c01
slave: tst-linux64-spot-137

12:23:34     INFO -  TEST-INFO | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | running test ...
12:28:34  WARNING -  TEST-UNEXPECTED-FAIL | /builds/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_stapling_expired.js | Test timed out
12:28:34     INFO -  Can't trigger Breakpad, just killing process
https://tbpl.mozilla.org/php/getParsedLog.php?id=47815288&tree=Mozilla-Inbound
Summary: Intermittent ASAN test_ocsp_stapling_expired.js | Test timed out → Intermittent ASAN test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out
https://tbpl.mozilla.org/php/getParsedLog.php?id=47973992&tree=Mozilla-Central

We're sorely lacking in anything useful from the logs, unfortunately. Ted, should we be getting some sort of force-kill and stack?
Flags: needinfo?(ted)
Summary: Intermittent ASAN test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out → Intermittent ASAN test_ocsp_caching.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out
ASAN builds disable the crashreporter:
http://hg.mozilla.org/mozilla-central/annotate/59d4326311e0/build/unix/mozconfig.asan#l25

so we fall through killAndGetStackNoScreenshot without doing anything:
http://hg.mozilla.org/mozilla-central/annotate/59d4326311e0/build/automation.py.in#l654

decoder: I feel like we talked about the ASAN+crashreporter interaction on IRC recently. My recollection was that it just needed some work to be usable, am I correct?
Flags: needinfo?(ted)
https://tbpl.mozilla.org/php/getParsedLog.php?id=48592590&tree=Fx-Team
Summary: Intermittent ASAN test_ocsp_caching.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out → Intermittent ASAN test_ocsp_caching.js,test_ocsp_required.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #7)
> ASAN builds disable the crashreporter:
> http://hg.mozilla.org/mozilla-central/annotate/59d4326311e0/build/unix/mozconfig.asan#l25
> 
> so we fall through killAndGetStackNoScreenshot without doing anything:
> http://hg.mozilla.org/mozilla-central/annotate/59d4326311e0/build/automation.py.in#l654
> 
> decoder: I feel like we talked about the ASAN+crashreporter interaction on
> IRC recently. My recollection was that it just needed some work to be
> usable, am I correct?
Flags: needinfo?(choller)
This manifests in other test suites as well (some recent links below). And apparently it happens across different trees (I see it pretty regularly on Aurora/Beta as well). I suspect there's something funky going on infra-wise here, but I have nothing to go off in the logs to prove it.

https://tbpl.mozilla.org/php/getParsedLog.php?id=48692883&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=48692237&tree=B2g-Inbound
Component: Security: PSM → XPCShell Harness
Product: Core → Testing
Summary: Intermittent ASAN test_ocsp_caching.js,test_ocsp_required.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out → Intermittent ASAN test_bug324121.js,test_bug335238.js,test_ocsp_caching.js,test_ocsp_required.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out
I don't know exactly what this particular problem has to do with the crash reporter at all. As far as I can see, automation is sending a SIGABRT to the process, to get a stack. If you send a SIGABRT or SIGSEGV to an ASan process, then I suspect it will output a trace on stderr like with any other ASan error. So maybe we don't even need the crash reporter here at all.

The reason for the timeouts here is another story. I suspect either excessive memory usage or some deadlock that only happens under ASan because of thread scheduling.
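
A rough sketch of the "signal it and read stderr" idea, assuming nothing about the real harness: the child here is a plain Python process with `faulthandler` enabled, standing in for an ASan-instrumented xpcshell (whether ASan actually prints a trace on SIGABRT is the suspicion above, not something this demonstrates). The names `abort_and_collect`, `demo` and `CHILD_SOURCE` are made up for illustration.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a hung test process. faulthandler makes the
# child dump a traceback to stderr when it receives SIGABRT, which is the
# behaviour we are hoping an ASan build gives us for free.
CHILD_SOURCE = """
import faulthandler, time
faulthandler.enable()
print("ready", flush=True)
time.sleep(60)  # simulate the hang
"""


def abort_and_collect(proc, grace_seconds=10.0):
    """Send SIGABRT, collect stderr, and fall back to SIGKILL if needed.

    Returns (returncode, stderr_text).
    """
    proc.send_signal(signal.SIGABRT)
    try:
        _, err = proc.communicate(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored SIGABRT; kill it and keep whatever it wrote.
        proc.kill()
        _, err = proc.communicate()
    return proc.returncode, err


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # Wait until the child has installed its handler before signalling it.
    proc.stdout.readline()
    return abort_and_collect(proc)


if __name__ == "__main__":
    code, stderr_text = demo()
    print("return code:", code)
    print(stderr_text)
```

The point of the fallback is that the harness still gets whatever was written to stderr even if the process has to be killed outright.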
Flags: needinfo?(choller)
We've gotten nowhere with this bug for 3 months now. Now it's #3 on OrangeFactor and would almost certainly be contending for #1 if we properly included every other bug that's on file for other ASAN xpcshell timeouts (bug 1100364, for example).

As far as I can tell, this bug is blocked on the harness giving useful output for debugging. Can we please find some resources to help improve that situation? Otherwise I'm probably going to resort to just mass-disabling tests on ASAN, which seems bad :)
Flags: needinfo?(jgriffin)
I'm still not totally sure what's happening here. From comment 676's log:
09:27:58     INFO -  Can't trigger Breakpad, just killing process
09:27:58     INFO -  xpcshell return code: None

It looks like we should be calling killPid there, which sends a SIGKILL to the process. However, we're not hitting the "still alive" check in postCheck: http://hg.mozilla.org/mozilla-central/annotate/a3030140d5df/testing/xpcshell/runxpcshelltests.py#l274 so it must actually be dead. In light of that I'm not sure why the harness isn't returning properly.
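
For reference, a minimal sketch of the kill-then-verify sequence described above, written against Python's `subprocess` module rather than the actual harness code (`kill_and_confirm` and `demo` are hypothetical names, not functions in runxpcshelltests.py). One thing it illustrates: `Popen.returncode` stays `None` until `poll()` or `wait()` has observed the process exit. Whether that explains the "return code: None" line in the log is only a guess.

```python
import signal
import subprocess
import sys


def kill_and_confirm(proc, wait_seconds=5.0):
    """SIGKILL the process and report whether it is confirmed dead."""
    proc.kill()
    try:
        # wait() reaps the child and fills in proc.returncode.
        proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return False
    # Equivalent of a "still alive" check: poll() is None while running.
    return proc.poll() is not None


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    before = proc.returncode  # None: nothing has reaped the process yet
    dead = kill_and_confirm(proc)
    return before, dead, proc.returncode


if __name__ == "__main__":
    before, dead, after = demo()
    print("before kill:", before)
    print("confirmed dead:", dead)
    print("after kill:", after)
```

If the harness logs the return code before anything has waited on the process, it could print `None` for a process that is in fact already dead.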
See Also: → 1112322
Bug 1112322 filed for fixing this.  It's easier to track as a distinct bug without all the orange bugspam.
Flags: needinfo?(jgriffin)
Now that the suite this bug is actually about is long since hidden, we're not doing ourselves any favors by having that little set of test names in the summary of an open intermittent bug.
Summary: Intermittent ASAN test_bug324121.js,test_bug335238.js,test_ocsp_caching.js,test_ocsp_required.js,test_ocsp_stapling.js,test_ocsp_stapling_expired.js | Test timed out → Intermittent ASAN any xpcshell test at all | Test timed out
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME