Open Bug 1395424 Opened 7 years ago Updated 2 years ago

AddressSanitizer isn't using symbols

Categories

(Testing :: General, defect, P3)

Version 3

Tracking

(Not tracked)

People

(Reporter: KWierso, Unassigned)

References

Details

I'm trying to figure out what's causing bug 1395422, but it's made more difficult because the symbolizer doesn't appear to be working.

Backscroll from #developers:
15:29:32 <RyanVM> mccr8: https://public-artifacts.taskcluster.net/aTs05-UsTpO16aCmgs5pbQ/0/public/logs/live_backing.log is concerning
15:29:45 <RyanVM> mccr8: "==1596==WARNING: Failed to use and restart external symbolizer!"
15:30:13 <RyanVM> we think we know what push is causing the leaks - *really* hoping that whatever's causing the leaks is what's breaking LSAN too, though...
15:30:41 <mccr8> RyanVM: well, that sounds like an existing intermittent
15:30:54 <RyanVM> this is consistent
15:31:06 <RyanVM> (trying to track down a leak on autoland and every instance is hitting it)
15:31:12 <mccr8> yeah, "} else for" is really terrifying...
15:31:23 <mccr8> RyanVM: Ah. Well, a big leak could certainly cause that.
15:31:37 <RyanVM>  GECKO(2492) | ==2649==ERROR: AddressSanitizer failed to allocate 0x22000 (139264) bytes of LargeMmapAllocator (error code: 12) 
15:31:40 <RyanVM> from another log
15:31:49 <mccr8> RyanVM: the theory is that the symbolizer uses a ton of memory, so if you don't have much free memory then it fails.
15:32:01 <mccr8> because it has to load in the whole Firefox binary, or something.
15:32:09 <RyanVM> interesting, we upgraded the instances the devtools asan runs are on too IIRC
15:32:35 <mccr8> yeah, I think that helped...
15:32:37 <RyanVM> welp, I feel bad for the dev that needs to hunt down his leaks after he gets backed out :P
15:41:01 <mccr8> Hmm the leak itself is very small so I can't see it causing the symbolizer to break. That is unfortunate....
15:50:02 <%KWierso> mccr8: yeah, only a few kb as far as I can see...
16:29:58 <%KWierso> mccr8: hrm, still happening on the backout of the main suspect
16:31:50 <mccr8> KWierso: So, I think that the real problem is the OOM, not the leak. If that makes sense.
16:32:11 <mccr8> KWierso: When we can't run the symbolizer, then the leak white list does not work. 
16:32:40 <RyanVM|bbl> KWierso: i still think the backout was justified given the various netmonitor timeouts it was causing :P
16:32:46 <RyanVM|bbl> but still, boooo
16:32:49 <mccr8> because it matches against the stack, and obviously libxul.so isn't something in the list.
16:33:04 <mccr8> ==1596==WARNING: failed to fork (errno 12)
16:33:07 <mccr8> I see a lot of that.
16:33:31 <%KWierso> really wish there wasn't hours of build failures...
16:34:16 <mccr8> I see "GECKO(1442) | Completed ShutdownLeaks collections in process 1596" but no earlier references to process pid 1596, so I'm not sure what that process is, or what it is doing...
16:34:27 <%KWierso> and the history rewriting makes it harder to tell what started when
16:34:49 <%KWierso> I guess I could just back out anything touching devtools as the next guess?
16:35:51 <mccr8> yeah. something that deals with preallocated processes might be suspect too....
16:36:10 <mccr8> I don't remember anything like that landing today but I could be wrong.
16:37:21 <mccr8> KWierso: which branch are the failures on?
16:37:28 <%KWierso> autoland
16:37:34 <%KWierso> https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=32607ab7ecb69318f7c98d2a0b7428dbcfb89793&noautoclassify&filter-searchStr=asan%20dt&group_state=expanded
16:37:50 <%KWierso> YMMV looking at these, since the chunks are probably shuffling tests around at some point
16:37:57 <mccr8> Ah, ok. I saw something for preallocated processes, but that's in inbound.
16:40:02 <%KWierso> mccr8: how about https://hg.mozilla.org/integration/autoland/rev/ce0752c07ff698c8dd7c94928e5160812318edfd ?
16:40:12 <mccr8> Bah, I don't see any of these pids in the log, so I guess that's just a red herring theory.
16:41:08 <mccr8> KWierso: that does sound a little scary, and the initial bug talks about devtools, so I guess it is worth a shot...
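To illustrate mccr8's point in the log above about the leak whitelist: suppression patterns are matched against the text of each stack frame, so a frame that only names a module and offset has nothing for a function-name pattern to hit. The following is a minimal sketch of that idea, not the actual LSan or harness code; `IsSuppressed`, the frame strings, and the pattern name are all hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true if any suppression pattern occurs as a
// substring of any frame in the reported stack.
bool IsSuppressed(const std::vector<std::string>& stackFrames,
                  const std::vector<std::string>& patterns) {
  for (const auto& frame : stackFrames) {
    for (const auto& pattern : patterns) {
      if (frame.find(pattern) != std::string::npos) {
        return true;
      }
    }
  }
  return false;
}
```

With a symbolized frame such as "#0 0x1 in HypotheticalLeakyFunction foo.cpp:10", a pattern of "HypotheticalLeakyFunction" matches. With the unsymbolized form "#0 0x1 (/builds/libxul.so+0x1234)", the same pattern finds nothing, so every known leak is reported as new.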
Blocks: 1245527
Priority: -- → P3

I've got a try push with ASAN failures where I can't get any symbols at all because of this. How can I diagnose this without any symbolication?

Summary: AddressSanitizer isn't using symbols on at least autoland today → AddressSanitizer isn't using symbols

Andrew, are you the person to ask about this?

Flags: needinfo?(continuation)

First off, I'll say that whatever you are hitting is different from the original issue in this bug, which was about running out of memory ("AddressSanitizer failed to allocate").

Off-hand, I'd guess that this is due to something Windows-specific. Unfortunately, it doesn't look like there's anything useful in the log here. The only error I see is:
==3832==WARNING: Failed to use and restart external symbolizer!

This indicates where the symbolizer is:
INFO | runtests.py | ASan using symbolizer at Z:\task_1568141858\build\application\firefox\llvm-symbolizer.exe

I don't know what could be going wrong there. I think the error message would be different if it couldn't find the symbolizer at all.

I looked around at other people hitting this error message. Usually ASan would still dump out an unsymbolized stack (for instance, bug 1477490 comment 29), so you'd maybe be able to symbolize it yourself. Here, though, stackwalking may be broken entirely, which I guess is the bigger issue.
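As a rough sketch of what "symbolize it yourself" could look like when an unsymbolized stack is available: each frame of the form "#N 0xADDR (module+0xOFFSET)" can be rewritten as a "module offset" line and piped to llvm-symbolizer on stdin. The helper name `FrameToSymbolizerInput` and the sample paths are hypothetical, and the frame format is assumed from typical ASan output rather than taken from the logs in this bug.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: converts one unsymbolized ASan frame such as
//   "    #0 0x7f3a12 (/builds/libxul.so+0x1234)"
// into "/builds/libxul.so 0x1234". Returns an empty string when the line
// is not an unsymbolized frame.
std::string FrameToSymbolizerInput(const std::string& line) {
  static const std::regex kFrame(
      R"(#\d+\s+0x[0-9a-fA-F]+\s+\((.+)\+(0x[0-9a-fA-F]+)\))");
  std::smatch match;
  if (!std::regex_search(line, match, kFrame)) {
    return std::string();
  }
  // match[1] is the module path, match[2] is the offset within the module.
  return match[1].str() + " " + match[2].str();
}
```

Module paths containing spaces would need quoting before being handed to the symbolizer; this sketch does not handle that.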

Nathan, do you have any ideas why ASan on Windows might not be producing stacks?

If Nathan doesn't know, perhaps Decoder or one of the other fuzzer people might have an idea. They are the ones with the most experience at running ASan, and might have run it on Windows recently.

Flags: needinfo?(continuation) → needinfo?(nfroyd)

I don't have any good ideas as to why things are failing. You might try setting verbosity=5 in __asan_default_options():

https://searchfox.org/mozilla-central/source/mozglue/build/AsanOptions.cpp

to get a little more logging, particularly around what asan thinks is happening with process creation.

But I also see that the code is inside an #ifndef _MSC_VER... maybe that ifdef can go away now?

Flags: needinfo?(nfroyd)
Severity: normal → S3