Closed Bug 1361476 Opened 7 years ago Closed 7 years ago

Intermittent LeakSanitizer | LLVMSymbolizer was unable to allocate memory.

Categories

(Testing :: General, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox56 wontfix, firefox57 fixed, firefox58 fixed)

RESOLVED FIXED
mozilla58

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Also shows up as e.g. browser_ext_windows_update.js | application terminated with exit code -9
https://treeherder.mozilla.org/logviewer.html#?job_id=96013177&repo=mozilla-inbound

[task 2017-05-02T18:23:43.144614Z] 18:23:43     INFO - TEST-START | browser/components/extensions/test/browser/browser_ext_windows_update.js
[task 2017-05-02T18:23:53.044524Z] 18:23:53     INFO - GECKO(1092) | MEMORY STAT | vsize 20974052MB | residentFast 922MB
[task 2017-05-02T18:23:53.046592Z] 18:23:53     INFO - TEST-OK | browser/components/extensions/test/browser/browser_ext_windows_update.js | took 9898ms
[task 2017-05-02T18:23:53.268178Z] 18:23:53     INFO - checking window state
[task 2017-05-02T18:24:02.640290Z] 18:24:02     INFO - GECKO(1092) | Completed ShutdownLeaks collections in process 1218
[task 2017-05-02T18:24:02.883361Z] 18:24:02     INFO - GECKO(1092) | Completed ShutdownLeaks collections in process 1176
[task 2017-05-02T18:24:03.060844Z] 18:24:03     INFO - GECKO(1092) | Completed ShutdownLeaks collections in process 1198
[task 2017-05-02T18:24:03.230339Z] 18:24:03     INFO - GECKO(1092) | Completed ShutdownLeaks collections in process 1144
[task 2017-05-02T18:24:06.335985Z] 18:24:06     INFO - GECKO(1092) | Completed ShutdownLeaks collections in process 1092
[task 2017-05-02T18:24:06.336370Z] 18:24:06     INFO - TEST-START | Shutdown
[task 2017-05-02T18:24:06.336597Z] 18:24:06     INFO - Browser Chrome Test Summary
[task 2017-05-02T18:24:06.347194Z] 18:24:06     INFO - Passed:  3936
[task 2017-05-02T18:24:06.347467Z] 18:24:06     INFO - Failed:  0
[task 2017-05-02T18:24:06.349556Z] 18:24:06     INFO - Todo:    0
[task 2017-05-02T18:24:06.352062Z] 18:24:06     INFO - Mode:    e10s
[task 2017-05-02T18:24:06.354072Z] 18:24:06     INFO - *** End BrowserChrome Test Results ***
[task 2017-05-02T18:24:07.359801Z] 18:24:07     INFO - GECKO(1092) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800B2,name=PContent::Msg_ConsoleMessage) Channel closing: too late to send/recv, messages will be lost
[task 2017-05-02T18:24:07.485290Z] 18:24:07     INFO - GECKO(1092) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800B2,name=PContent::Msg_ConsoleMessage) Channel closing: too late to send/recv, messages will be lost
[task 2017-05-02T18:24:07.588698Z] 18:24:07     INFO - GECKO(1092) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800B2,name=PContent::Msg_ConsoleMessage) Channel closing: too late to send/recv, messages will be lost
[task 2017-05-02T18:24:07.778473Z] 18:24:07     INFO - GECKO(1092) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800B2,name=PContent::Msg_ConsoleMessage) Channel closing: too late to send/recv, messages will be lost
[task 2017-05-02T18:24:11.450698Z] 18:24:11     INFO - GECKO(1092) | 1493749451447	Marionette	INFO	Ceased listening
[task 2017-05-02T18:24:14.688861Z] 18:24:14     INFO - GECKO(1092) | -----------------------------------------------------
[task 2017-05-02T18:24:14.689320Z] 18:24:14     INFO - GECKO(1092) | Suppressions used:
[task 2017-05-02T18:24:14.689621Z] 18:24:14     INFO - GECKO(1092) |   count      bytes template
[task 2017-05-02T18:24:14.689869Z] 18:24:14     INFO - GECKO(1092) |       1       2032 libglib-2.0.so
[task 2017-05-02T18:24:14.690174Z] 18:24:14     INFO - GECKO(1092) | -----------------------------------------------------
[task 2017-05-02T18:24:39.488139Z] 18:24:39     INFO - GECKO(1092) | LLVM ERROR: IO failure on output stream.
[task 2017-05-02T18:24:39.540687Z] 18:24:39     INFO - GECKO(1092) | LLVM ERROR: IO failure on output stream.
[task 2017-05-02T18:24:39.540964Z] 18:24:39     INFO - GECKO(1092) | LLVM ERROR: IO failure on output stream.
[task 2017-05-02T18:24:39.541190Z] 18:24:39     INFO - GECKO(1092) | LLVM ERROR: IO failure on output stream.
[task 2017-05-02T18:24:39.542811Z] 18:24:39     INFO - TEST-INFO | Main app process: killed by SIGKILL
[task 2017-05-02T18:24:39.543673Z] 18:24:39     INFO - Buffered messages finished
[task 2017-05-02T18:24:39.544626Z] 18:24:39    ERROR - TEST-UNEXPECTED-FAIL | browser/components/extensions/test/browser/browser_ext_windows_update.js | application terminated with exit code -9
Blocks: LSan
This seemed to be a one-day spike and hasn't recurred since then.
Whiteboard: [stockwell unknown]
I've seen some LSan logs where stack depth truncation is not working.

For instance, here there's a leak report where it is reporting all the way to the top of the stack:
https://treeherder.mozilla.org/logviewer.html#?job_id=93338084&repo=mozilla-esr52&lineNumber=9225

However, I've seen other instances where it was properly truncated. That could certainly cause memory issues, though I don't know why that would happen intermittently.
This bug was dormant for almost a month until July 17.

On July 17, all failures were on try, in https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9941a7448eea58445e4bfc35fe828015809cf53, for bug 1351148. But that bug has not landed any patches yet!

On July 18, failures started on autoland:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=linux%20asan%20mochitest&tochange=8ff4f17b266db9a780efe06f7fbdae629e49f5bc&fromchange=0be298129be1221d42b269a61ab3911349738e70

suggesting failures started with the backout at https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=8ff4f17b266db9a780efe06f7fbdae629e49f5bc, bug 1363361. The backout looks correct/complete to me, and I don't see any sign of this happening before those patches landed...so, this isn't making sense to me.


:mccr8 -- Thoughts?
Flags: needinfo?(continuation)
Whiteboard: [stockwell unknown] → [stockwell needswork]
One odd thing about this is that if it fails, it seems to fail repeatedly. But it is scattered across many test chunks, so it isn't like there's one problem. I think one theory glandium had was that maybe sometimes we produce a binary that is extremely large, but you'd think that it would fail every chunk in that case rather than just one of them at a high rate. Maybe somebody could check if the binaries are oddly large in failing runs?

Glandium, do you have any ideas here? This is pretty baffling.
Flags: needinfo?(continuation) → needinfo?(mh+mozilla)
There's an IDL change in bug 1363361, so maybe this is something a clobber would fix? I don't know how often we clobber ASan builds.
Decoder said upgrading from ASan 3.9 to 4.0 might help. Or might make it worse...
See Also: → 1303021
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Decoder said upgrading from ASan 3.9 to 4.0 might help.
Sorry, I misunderstood what he said. He was talking about some other kind of error.
(In reply to Andrew McCreight [:mccr8] from comment #14)
> One odd thing about this is that if it fails, it seems to fail repeatedly.
> But it is scattered across many test chunks, so it isn't like there's one
> problem. I think one theory glandium had was that maybe sometimes we produce
> a binary that is extremely large, but you'd think that it would fail every
> chunk in that case rather than just one of them at a high rate. Maybe
> somebody could check if the binaries are oddly large in failing runs?
> 
> Glandium, do you have any ideas here? This is pretty baffling.

I don't remember having come up with a theory, but one observation to be made here is that those jobs are running on m1.medium instances, which have 3.75GiB of physical memory. That's not a lot... and I wouldn't be surprised if it's the core problem.
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(gbrown)
(In reply to Mike Hommey [:glandium] from comment #21)
> I don't remember having come up with a theory, but one observation to be
> made here is that those jobs are running on m1.medium instances, which have
> 3.75GiB of physical memory. That's not a lot... and I wouldn't be surprised
> if it's the core problem.

There have been 3 "outbreaks" of this failure over the last month. In each one, all failures were on linux64-asan/opt and occurred in 3 suites:

mochitest
mochitest-browser-chrome
mochitest-devtools-chrome

These suites have taskcluster configurations with "instance-size" of "legacy", with a comment referencing bug 1281241. I'll check on try to see if tests fail when these are changed to "default" or "xlarge". If not, perhaps we can "upgrade" one of the suites to see if that eliminates failures on that suite the next time this failure returns.
My motivation here is to modify one type of test (devtools) to see if, the next time we get a spike/outbreak of this failure, there is a clear difference between devtools and, say, browser-chrome. This is experimental only and intended to test the theory of comment 21.

I'm changing to xlarge instead of default since I think legacy and default have the same amount of memory.
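
For reference, the experimental change is roughly of the shape sketched below. This is an illustration only, not the literal patch in attachment 8893525: the file path (taskcluster/ci/test/tests.yml), the by-test-platform keying, and the surrounding structure are assumptions; only the instance-size key, the legacy/xlarge values, and the linux64-asan/opt platform come from this bug.

```yaml
# Sketch of the experiment: run this suite on xlarge instances for ASan only.
# Path and structure are assumed (something like taskcluster/ci/test/tests.yml).
mochitest-devtools-chrome:
    # ...other suite configuration (treeherder symbol, mozharness args, etc.)...
    instance-size:
        by-test-platform:
            linux64-asan/opt: xlarge   # experiment: more memory for ASan runs
            default: legacy            # unchanged elsewhere (see bug 1281241)
```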

Changing the instance type to xlarge seems to do no harm:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=99c6be8436b1ac9fbb67c678582f3282b71091b5

In fact, it may reduce the frequency of other intermittent failures.

Do you have any concerns with this? I'm not sure I understand the history of bug 1281241.


I'm taking the bug so that I won't lose track of it, but I don't have plans other than follow-up on this experiment; others are welcome to take it!
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Attachment #8893525 - Flags: review?(jmaher)
Attachment #8893525 - Flags: review?(jmaher) → review+
This is happening with unnerving regularity on Beta now too since the uplift yesterday, but I guess it's not anything specific to that.
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d0c24a37df9
Run linux64-asan/opt mochitest-devtools-chrome tests on xlarge instances, experimentally; r=jmaher
If this doesn't help, maybe you could talk to some of the fuzzing people. I think they have experience with running ASan builds on AWS.
Whiteboard: [stockwell needswork] → [stockwell needswork][checkin-needed-beta]
https://hg.mozilla.org/releases/mozilla-beta/rev/5fb77be6beb6

Not that I actually expect this to help much on Beta since they've all been mochitest-bc there.
Whiteboard: [stockwell needswork][checkin-needed-beta] → [stockwell needswork]
No devtools failures since comment 29. Can we consider doing the same for browser-chrome now?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork] → [stockwell fixed:other]
(In reply to Ryan VanderMeulen [:RyanVM] from comment #33)
> No devtools failures since comment 29. Can we consider doing the same for
> browser-chrome now?

But there have also been only 3 other failures since then. I'd like to wait for more definitive evidence. Will continue to watch...
Flags: needinfo?(gbrown)
Keywords: leave-open
Whiteboard: [stockwell fixed:other] → [stockwell unknown]
There have been no failures reported here since Aug 21, on any test type. I don't think my devtools change made any difference...this just fixed itself somehow.
Attachment #8910977 - Flags: review?(ryanvm)
Comment on attachment 8910977 [details] [diff] [review]
return to legacy instances for linux64-asan/opt mochitest-devtools-chrome

Review of attachment 8910977 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me, I guess.
Attachment #8910977 - Flags: review?(ryanvm) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad80a6d082c4
Change instance type for linux64-asan mochitest-devtools tests; r=ryanvm
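
The revert is presumably just the inverse of the earlier experiment, restoring the original setting. Again a sketch under the same assumptions, not the literal diff from attachment 8910977:

```yaml
# Illustrative revert: drop the ASan override so the suite runs on
# legacy instances everywhere again (structure assumed as above).
mochitest-devtools-chrome:
    # ...other suite configuration...
    instance-size: legacy    # back to the pre-experiment setting
```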
https://hg.mozilla.org/mozilla-central/rev/ad80a6d082c4
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Blocks: 1429595