Closed Bug 1468541 Opened 6 years ago Closed 5 years ago

Crash in AllocInfo::Get<T>

Categories

(Core :: DOM: Core & HTML, defect, P3)

ARM
Android
defect

Tracking


RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox60 --- unaffected
firefox61 + fixed
firefox62 - wontfix
firefox63 - verified disabled
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [geckoview:p3])

Crash Data

This bug was filed from the Socorro interface and is report bp-4df0ccf6-c30e-4cc2-9191-c3fa80180613.

Top 10 frames of crashing thread:

0  libmozglue.so  AllocInfo::Get<true>  memory/build/mozjemalloc.cpp:3234
1  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
2  libxul.so  nsAttrAndChildArray::SizeOfExcludingThis  dom/base/nsAttrAndChildArray.cpp:962
3  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
4  libxul.so  mozilla::dom::FragmentOrElement::AddSizeOfExcludingThis  dom/base/FragmentOrElement.cpp:2363
5  libxul.so  mozilla::dom::Element::AddSizeOfExcludingThis  dom/base/Element.cpp:4312
6  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
7  libxul.so  AddSizeOfNodeTree  dom/base/nsDocument.cpp:11822
8  libxul.so  mozilla::dom::ExplicitChildIterator::GetNextChild  dom/base/ChildIterator.cpp:91
9  libxul.so  AddSizeOfNodeTree  dom/base/nsDocument.cpp:11850

This is a crash signature on Fennec builds with Snapdragon 820/821 processors that started to show up in larger numbers in 61.0b12 & b13.
This is the top crash in b13, with 875 crashes across 311 installs.
Keywords: topcrash
Lots of custom element stuff on the stack. I'm worried that this is another version of the random Snapdragon 820/821 Stylo crashes we've been seeing previously, but the volume of this one is scary high :(. And b15 is also showing as affected. Emilio, anything in the b11->b13 pushlog range here that looks possibly-related? https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FENNEC_61_0b11_RELEASE&tochange=FENNEC_61_0b13_RELEASE
Flags: needinfo?(emilio)
No, plus custom elements are nightly-only: https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/modules/libpref/init/all.js#1331 I don't know what's going on with those processors, or with those stacks...
Flags: needinfo?(emilio)
Hey Olli, looks like custom elements might be turned on in 61 beta, which merges to release next week. Can you take a look?
Flags: needinfo?(bugs)
For release drivers, we should block on figuring out what's going on here.
The stack is totally bogus. Looking at https://crash-stats.mozilla.com/report/index/4df0ccf6-c30e-4cc2-9191-c3fa80180613, at least the #7 -> #6 transition is broken, and so is #2 -> #1.
Stuart do we have snapdragon 820/821 processors in our inventory? (Maybe SV can recreate?)
Flags: needinfo?(sphilp)
Adding a ni on Ioana to see if she can help. I know that they have Pixels, which have Snapdragon 821.
Flags: needinfo?(ioana.chiorean)
There is a note in today's Channel meeting regarding this: we tried to reproduce bug 1468541 but none of us succeeded. Devices used: Pixel, Samsung S7 and Tab, S8 Plus, HTC 10, LG G5.
Flags: needinfo?(ioana.chiorean)
On our side (SV inventory) we tried on Pixel, Samsung S7, Galaxy Tab, S8 Plus, HTC 10, LG G5, and OnePlus 3, and we were not able to reproduce at all. We tried several scenarios from the comments and links but none turned out successful. // I had the tab open and got a mid-air collision; commenting here too, as I said in the call. @Kevin - can you try with the S8 you got? Low chance, but still worth trying.
Flags: needinfo?(sphilp)
Flags: needinfo?(kbrosnan)
The 61 builds pushed to the release population are affected by this crash as well.
This is currently the #2 top crash of the RC build too.
This is the #1 topcrash on Fennec 61 now and I think it needs to block wider rollout in the Play Store. This needs an owner ASAP.
Flags: needinfo?(snorp)
Flags: needinfo?(sdaswani)
Priority: -- → P1
Vlad and Co. can you look at this ASAP?
Flags: needinfo?(sdaswani) → needinfo?(vlad.baicu)
See Also: → 1470925
Top affected devices:
* OnePlus 3 & 3T
* Google Pixel & Pixel XL
* LG G6 (H870, H871, and H872 models)
* Xiaomi Mi 5s
* Samsung Galaxy Tab S3
The stacks do not make much sense, but I do think it's SVG related. The root frame always seems to be mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() which then calls nsDocument::DocAddSizeOfExcludingThis(). The stacks seem to go south from here, as we're obviously not going to be calling things like mozilla::dom::Element::SetCustomElementDefinition(). I seem to recall a similar weirdness where a loop got miscompiled, but it was years ago. I think gcp found it? Anyway, as this seems to be DOM code, maybe someone on Andrew's team should own it.
Flags: needinfo?(snorp)
Flags: needinfo?(overholt)
Flags: needinfo?(gpascutto)
I tried to repro this on my Pixel (which is an affected device) by going to some of the URLs that show up in crash-stats and had no luck. I imagine any page with an SVG will work, but Wikipedia article pages show up several times. The logo is an SVG.
I guess we're probably using _msize() somewhere in DocAddSizeOfExcludingThis(), so it makes sense that we could end up in AllocInfo::Get<true>(). Glandium, do you have any idea what could be going on here?
Flags: needinfo?(mh+mozilla)
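For readers following along, here is a minimal sketch of the call shape under discussion. This is not the actual mozjemalloc source: the struct layout and the "my" prefix are invented, and the real AllocInfo::Get<Validate>() consults the allocator's chunk metadata. The point is only that malloc_usable_size builds an AllocInfo on its own stack frame and has Get<true>() fill it in through a pointer, which matches the "mov r0, sp" visible in the disassembly a couple of comments further down.

```cpp
// Hedged sketch only -- simplified from the behaviour described in this bug,
// not copied from memory/build/mozjemalloc.cpp.
#include <cstddef>

struct AllocInfo {
  size_t mSize = 0;  // usable size of the allocation, 0 if not one of ours

  // The real Get<Validate>() looks aPtr up in the allocator's chunk map /
  // huge-allocation tree; this stub only shows the calling shape.  The
  // hidden first argument is `this`, which the caller allocates on its own
  // stack -- a bogus destination pointer here is what the crash shows.
  template <bool Validate>
  static AllocInfo Get(const void* aPtr) {
    AllocInfo info;
    (void)aPtr;  // metadata lookup elided in this sketch
    return info;
  }
};

size_t my_malloc_usable_size(const void* aPtr) {
  // Get<true>() writes its result into a stack-allocated AllocInfo.
  return AllocInfo::Get<true>(aPtr).mSize;
}
```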
The crash address (0xd23d3829) is *in* libmozglue.so: "base_addr": "0xd2385000", "end_addr": "0xd24da000"
Specifically, the crash address is in the read/executable part of libmozglue.so. The crashing code says:

4d1ee: e9c4 0600 strd r0, r6, [r4]

and r4 is 0xd23d3829, so this is actively trying to write to a non-writable part of memory.

Looking at the disassembly for the function shows that r4 comes directly from r0, which is the first argument to the function. So whatever is calling this is giving a bogus argument in the first place. The function itself starts at 0x4d1b8, is .hidden, and there's only one caller in libmozglue.so: malloc_usable_size, which feeds r0 with... sp. That is consistent with the C++ code (the first argument to AllocInfo::Get being `this`, and in that case, it's allocated on the stack). That is however not consistent with the value of sp in the crash report. The value of lr is also bogus.

Now, an interesting fact about 0xd23d3829 is that it is actually a valid function pointer, to moz_malloc_size_of... Looking at a few crash reports other than the one from comment 0 shows they all have the same characteristics:
- r4 containing the address of moz_malloc_size_of (as well as r8)
- lr = pc + 1, which is bogus
- sp clearly not what it's supposed to be considering r4

Because lr is bogus, stack traces are bogus too, which is why the second frame makes no sense. They're all found by stack scanning. Considering this is all supposed to come from malloc_usable_size, it would make sense that the AddSizeOfExcludingThis frames, or at least some of them, are accurate. That doesn't explain how we get to the crashing address with r4 holding a value that is not sp + 32.

Copy/pasting the relevant assembly:

0004ae12 <malloc_usable_size@@Base>:
 4ae12: f000 b800 b.w 4ae16 <malloc_usable_size@@Base+0x4>
 4ae16: b5e0 push {r5, r6, r7, lr}
 4ae18: 4601 mov r1, r0
 4ae1a: 4668 mov r0, sp
 4ae1c: f002 f9cc bl 4d1b8 <AllocInfo AllocInfo::Get<true>(void const*)>
 4ae20: 9800 ldr r0, [sp, #0]
 4ae22: b002 add sp, #8
 4ae24: bd80 pop {r7, pc}

0004d1b8: <AllocInfo AllocInfo::Get<true>(void const*)>:
 4d1b8: b570 push {r4, r5, r6, lr}
 4d1ba: b088 sub sp, #32
 4d1bc: 4604 mov r4, r0
 4d1be: 481b ldr r0, [pc, #108] ; (4d22c <AllocInfo AllocInfo::Get<true>(void const*)+0x74>)
 4d1c0: 4478 add r0, pc
 4d1c2: 6bc0 ldr r0, [r0, #60] ; 0x3c
 4d1c4: f3bf 8f5b dmb ish
 4d1c8: b1a0 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1ca: 460e mov r6, r1
 4d1cc: f36f 0613 bfc r6, #0, #20
 4d1d0: b186 cbz r6, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1d2: 4817 ldr r0, [pc, #92] ; (4d230 <AllocInfo AllocInfo::Get<true>(void const*)+0x78>)
 4d1d4: 0d0a lsrs r2, r1, #20
 4d1d6: 4478 add r0, pc
 4d1d8: 6d40 ldr r0, [r0, #84] ; 0x54
 4d1da: eb00 0082 add.w r0, r0, r2, lsl #2
 4d1de: b148 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1e0: 6800 ldr r0, [r0, #0]
 4d1e2: b138 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1e4: 428e cmp r6, r1
 4d1e6: d00a beq.n 4d1fe <AllocInfo AllocInfo::Get<true>(void const*)+0x46>
 4d1e8: 4608 mov r0, r1
 4d1ea: f7ff ff43 bl 4d074 <arena_salloc(void const*)>
 4d1ee: e9c4 0600 strd r0, r6, [r4]
 4d1f2: e002 b.n 4d1fa <AllocInfo AllocInfo::Get<true>(void const*)+0x42>
 4d1f4: 2000 movs r0, #0
 4d1f6: 6020 str r0, [r4, #0]
 4d1f8: 6060 str r0, [r4, #4]
 4d1fa: b008 add sp, #32
 4d1fc: bd70 pop {r4, r5, r6, pc}
 4d1fe: 4d0d ldr r5, [pc, #52] ; (4d234 <AllocInfo AllocInfo::Get<true>(void const*)+0x7c>)
 4d200: 9605 str r6, [sp, #20]
 4d202: 447d add r5, pc
 4d204: 4628 mov r0, r5
 4d206: f7f0 ed74 blx 3dcf0 <pthread_mutex_lock@plt>
 4d20a: 1d28 adds r0, r5, #4
 4d20c: a901 add r1, sp, #4
 4d20e: f001 f9c4 bl 4e59a <RedBlackTree<extent_node_t, ExtentTreeTrait>::Search(RedBlackTree<extent_node_t, ExtentTreeTrait>::TreeNode)>
 4d212: b110 cbz r0, 4d21a <AllocInfo AllocInfo::Get<true>(void const*)+0x62>
 4d214: 6941 ldr r1, [r0, #20]
 4d216: 6060 str r0, [r4, #4]
 4d218: e001 b.n 4d21e <AllocInfo AllocInfo::Get<true>(void const*)+0x66>
 4d21a: 2100 movs r1, #0
 4d21c: 6061 str r1, [r4, #4]
 4d21e: 4806 ldr r0, [pc, #24] ; (4d238 <AllocInfo AllocInfo::Get<true>(void const*)+0x80>)
 4d220: 6021 str r1, [r4, #0]
 4d222: 4478 add r0, pc
 4d224: f7f0 ed6a blx 3dcfc <pthread_mutex_unlock@plt>
 4d228: e7e7 b.n 4d1fa <AllocInfo AllocInfo::Get<true>(void const*)+0x42>
 4d22a: bf00 nop
 4d22c: 768c strb r4, [r1, #26]
 4d22e: 0010 movs r0, r2
 4d230: 7676 strb r6, [r6, #25]
 4d232: 0010 movs r0, r2
 4d234: 764a strb r2, [r1, #25]
 4d236: 0010 movs r0, r2
 4d238: 762a strb r2, [r5, #24]
 4d23a: 0010 movs r0, r2
Flags: needinfo?(mh+mozilla)
> I seem to recall a similar weirdness where a loop got miscompiled, but it was years ago. I think gcp found it?

I think I know what bug you mean: https://bugzilla.mozilla.org/show_bug.cgi?format=default&id=1323277
We initially thought this was a core bug, but it seems to have been caused by a bad system zlib being shipped with the chipset.
Flags: needinfo?(gpascutto)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #18)
> The stacks do not make much sense, but I do think it's SVG related.

Not sure if it's relevant, but in the pushlog to Fennec 61.0b12 (where the crash signature first showed up), there's bug 1457078, which in particular touched an SVG file for Android. https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FENNEC_61_0b11_RELEASE&tochange=FENNEC_61_0b12_RELEASE&full=1
We're missing full stacks, and stacks look rather broken. Do we even know the mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() call is right? Why would we call that method so often?
(In reply to Olli Pettay [:smaug] from comment #25)
> We're missing full stacks, and stacks look rather broken. Do we even know
> the mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() call is
> right? Why would we call that method so often?

We automatically call the memory reporter periodically for telemetry purposes.
Flags: needinfo?(vlad.baicu)
Does anyone know why we might get such very short stacks? Normal crash reports on Android do look reasonable - nsThread::ProcessNextEvent somewhere close to the starting point of the stack. Are the crash reports all bogus/broken?
(In reply to Olli Pettay [:smaug] from comment #27)
> Does anyone know why we might get such very short stacks? Normal crash reports
> on Android do look reasonable - nsThread::ProcessNextEvent somewhere close
> to the starting point of the stack.
> Are the crash reports all bogus/broken?

The stack walker is using stack scanning to find the other frames. Since stack scanning is based on heuristics it can fail, especially if the crash caused garbage to be dumped on the stack.
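To illustrate what "stack scanning" means here, a rough sketch of the heuristic follows. This is not Breakpad's actual code; the module bookkeeping is invented. The idea is simply that the walker reads raw words off the stack and keeps anything that looks like a return address into an executable module.

```cpp
// Rough sketch of a stack-scanning heuristic, for illustration only; this is
// not Breakpad's implementation and the module/validation logic is invented.
#include <cstddef>
#include <cstdint>
#include <vector>

struct ModuleRange {
  uintptr_t start;
  uintptr_t end;  // [start, end) covers the module's executable pages
};

// Returns candidate return addresses found by scanning raw stack words.
// Garbage on the stack (e.g. stale function pointers such as
// moz_malloc_size_of) produces bogus frames, which is why the reported
// stacks in this bug make little sense.
std::vector<uintptr_t> ScanStackForFrames(const uintptr_t* stackWords,
                                          size_t wordCount,
                                          const std::vector<ModuleRange>& code) {
  std::vector<uintptr_t> frames;
  for (size_t i = 0; i < wordCount; ++i) {
    uintptr_t value = stackWords[i];
    for (const ModuleRange& m : code) {
      if (value >= m.start && value < m.end) {
        frames.push_back(value);  // looks like a return address; keep it
        break;
      }
    }
  }
  return frames;
}
```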
Whiteboard: [geckoview:klar:p1]
Assigning to Jim to coordinate investigation
Assignee: nobody → nchen
I don't have any of the affected devices. The Galaxy S8 is not on the list of affected devices.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(overholt)
I was just able to reproduce this on my Pixel: https://crash-stats.mozilla.com/report/index/f7b8788b-c648-4195-b625-dd1bb0180711. It happened when the phone was idle, after I installed AdBlock Plus.
OK, I think the fact that it crashed while idle backs up the claim in comment #26 that maybe we're doing a memory measurement for telemetry. Frank, do you know how often this happens? Is there a way we can trigger an immediate collection to try to reproduce this bug?
Flags: needinfo?(fbertsch)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #32)
> OK, I think the fact that it crashed while idle backs up the claim in
> comment #26 that maybe we're doing a memory measurement for telemetry.
> Frank, do you know how often this happens? Is there a way we can trigger an
> immediate collection to try to reproduce this bug?

Redirecting to Alessio, who will be better able to answer this question.
Flags: needinfo?(fbertsch) → needinfo?(alessio.placitelli)
If it's using the regular memory reporting infrastructure it should be possible to reproduce this by requesting a report from about:memory.
(In reply to Gabriele Svelto [:gsvelto] (PTO until 30/07) from comment #34)
> If it's using the regular memory reporting infrastructure it should be
> possible to reproduce this by requesting a report from about:memory.

I did try that, but I haven't been able to get a crash report that way yet.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #32)
> OK, I think the fact that it crashed while idle backs up the claim in
> comment #26 that maybe we're doing a memory measurement for telemetry.
> Frank, do you know how often this happens? Is there a way we can trigger an
> immediate collection to try to reproduce this bug?

Telemetry gathers memory information at most once per minute [1]. Gathering the memory happens when the "cycle-collector-begin" topic is observed and the operation happens on the main thread idle queue [3].

It should be possible to manually trigger this from a scratchpad that can access browser code by:

> let TelemetrySession = ChromeUtils.import("resource://gre/modules/TelemetrySession.jsm", {});
> TelemetrySession.observe("", "cycle-collector-begin");

Alternatively, one could simply refresh the about:telemetry page, which will end up calling the memory gathering function [4]. However, this won't happen in the idle queue on the main thread.

[1] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#59,1679
[2] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#85
[3] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#1686
[4] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/content/aboutTelemetry.js#382
Flags: needinfo?(alessio.placitelli)
I tried refreshing the about:telemetry page, but that didn't result in a crash. I did try pasting the Telemetry code in the console but I was getting errors. So far since yesterday's 2 crashes, I haven't had any additional crashes on the Google Pixel.
Since the memory reporters run on the main thread, this could be a race that results from the memory reporter obtaining a reference to an off-main-thread object without sufficient locking/strong references being obtained, resulting in a use-after-free calling into the sizing code because the object has been deallocated on the non-main-thread.

For example, the imgReport memory reporter's call-stack seems to look like:
- imgMemoryReporter::CollectReports https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/imgLoader.cpp#81
- RecordCounterForRequest, which obtains a RefPtr<image::Image> https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/imgLoader.cpp#459
- ImageMemoryCounter::ImageMemoryCounter takes a bare Image* (which is covered, but...) https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/Image.cpp#21
- Image::CollectSizeOfSurfaces is invoked inside there https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/Image.cpp#44

And then there are a bunch of those implementations, as we can see via: https://searchfox.org/mozilla-central/search?q=symbol:_ZNK7mozilla5image5Image21CollectSizeOfSurfacesER8nsTArrayINS0_20SurfaceMemoryCounterEEPFmPKvE&redirect=false

But in particular:
- RasterImage::CollectSizeOfSurfaces conditionally calls mFrameAnimator->CollectSizeOfCompositingSurfaces https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/RasterImage.cpp#705
- But mFrameAnimator is nullable in case of error (in DoError), which means that maybe mFrameAnimator was not null when the main thread started to do stuff, but it got freed out from under the reporter. https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/RasterImage.cpp#1615
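To make the suspected pattern concrete, here is a hedged sketch of the kind of race described above. The class and member names are invented stand-ins, not the real imagelib code.

```cpp
// Hypothetical illustration of the suspected race; these types do not
// correspond to the real RasterImage/FrameAnimator classes.
#include <cstddef>

using MallocSizeOf = size_t (*)(const void*);

struct Animator {
  size_t CollectSizeOfSurfaces(MallocSizeOf aMallocSizeOf) const {
    // The real code would hand each surface allocation to aMallocSizeOf;
    // here we just size the animator object itself.
    return aMallocSizeOf(this);
  }
};

struct Image {
  Animator* mAnimator = nullptr;  // cleared/deleted in an error path

  size_t SizeOfSurfaces(MallocSizeOf aMallocSizeOf) const {
    // If another thread deleted mAnimator between this null check and the
    // call below, the reporter would walk freed memory -- the
    // use-after-free pattern this comment is suspicious of.
    if (mAnimator) {
      return mAnimator->CollectSizeOfSurfaces(aMallocSizeOf);
    }
    return 0;
  }
};
```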
Let me point you back to comment 22. Don't forget the crash happens because the code is trying to write to a memory address that points to a function in libmozglue, while the address is supposed to point to the stack! And, considering the normal flow of the function, that's actually impossible!
Right. To clarify, I'm suggesting that things go off the rails in the manner you described thanks to CollectSizeOfCompositingSurfaces or something like it running with a 'this' whose contents are no longer what it thinks they are due to a cross-thread race. The race is important because it explains why this doesn't happen 100% of the time and creates a scenario where heap memory may change between accesses. I don't have a direct explanation for the stack pointer getting corrupted, but it's worth pointing out that the image memory reporters pass around a `MallocSizeOf` function pointer in order to call into the moz_malloc_size_of/moz_malloc_usable_size/malloc_usable_size_impl functions. Since it's an explicit argument that's passed around a lot, it may get spilled to the stack in many places, which means that some prior cause of stack corruption could more easily snowball into what we're seeing. That said, the small amount of ARM disassembly I've looked at does not look like it likes to overwrite the stack pointer versus adding/subtracting, so... I still have no idea how that would work. (And unfortunately radare2 doesn't like breakpad's linux-extended minidump format, so my brief attempt to see if the minidump's raw stack data had anything illuminating in it didn't work out. Ted's https://github.com/luser/rust-minidump does work, but looking at raw hex data isn't super illuminating on its own either.)
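For context on the function pointer mentioned above, here is a hedged sketch of how a MallocSizeOf callback gets threaded through a sizing call chain. The typedef follows the general shape used in the tree, but the Node type is invented; the point is that the same function-pointer value gets spilled into many stack frames, so stale copies of it are easy for a stack scanner to misread.

```cpp
// Hedged illustration only; Node is an invented type, not a Gecko class.
#include <cstddef>

using MallocSizeOf = size_t (*)(const void*);

struct Node {
  Node* mFirstChild = nullptr;
  Node* mNextSibling = nullptr;

  size_t AddSizeOfExcludingThis(MallocSizeOf aMallocSizeOf) const {
    size_t total = 0;
    for (const Node* child = mFirstChild; child; child = child->mNextSibling) {
      // aMallocSizeOf is passed (and typically spilled to the stack) at
      // every level of the recursion.
      total += aMallocSizeOf(child);
      total += child->AddSizeOfExcludingThis(aMallocSizeOf);
    }
    return total;
  }
};
```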
FWIW, there's a tool to convert minidumps to core files in the breakpad repository.
That being said, the pointer we try to write to is contained in r4, which is directly assigned from r0, which is the first argument to the function, and the caller sets it directly from sp (as in, the stack pointer, not a value on the stack), so there's theoretically no room for r4 to contain anything other than that pointer to the stack. Now, what's printed in the crash report is only as good as what's available in memory when breakpad reads it, and maybe the problem we have here and in other recent crashes is that those devices have some bug that trashes the register state breakpad has access to. So let's assume that r4 is just plain bogus. Let's assume for a second that the stack pointer is right, and that, per the code, that's also the actual value of r4. Then crashing on this instruction still makes no sense, because the stack address is writable per the minidump.
Susheel: Advice on this thread?
Flags: needinfo?(sdaswani)
I just wanted to call out that this signature has generated over 30K crashes in a 30-day period. If anyone has suggestions for other things to try in terms of STR or investigation, they would be greatly appreciated.
Marion, no advice; this seems like a Gecko 'platform' issue, and not one my Softvision team is best placed to handle (the current assignee seems like the right person).
Flags: needinfo?(sdaswani)
I filed bug 1478707 on my suspicions from comment 38 in the graphics component to try and get graphics eyes on it and potentially a fix.
We're also waiting on some leads at Qualcomm since this appears to happen on specific Snapdragon chips. There is an analysis of the crash at https://gist.github.com/darchons/c64fc6460b1d3613b0369865dfdc28ea
Is https://gist.github.com/darchons/c64fc6460b1d3613b0369865dfdc28ea#file-gistfile1-txt-L70 speculation from the stack contents or did you manage somehow to actually step through the crash happening and that's what happened?
I should have mentioned it's my speculation from looking at the crash dump.
Discussed offline with asuth about the possible RasterImage::mFrameAnimator concerns, but mFrameAnimator should never be freed off the main thread, so I'm not convinced it could be causing the problem.

I've been looking at this crash report: https://crash-stats.mozilla.com/report/index/67d3ab01-baca-4322-a5f9-2fba50180728#tab-details which is the only non-Android one. It might provide a different perspective / a more reliable stack trace.

VectorImage::SizeOfSourceWithComputedFallback gets a pointer to the SVGDocument (an indirect subclass of nsDocument). It seems to get lost in the weeds iterating over the nodes in that document. I think the pointer we use should be valid because it should be refcounted all the way from VectorImage to the SVGDocument pointer, and it checks for nulls. I note that it does call nsIDocument::AsSVGDocument(), which is a simple static_cast (maybe we somehow got the wrong document?!).

- VectorImage has a refcounted pointer to SVGDocumentWrapper (mSVGDocumentWrapper)
- SVGDocumentWrapper has a refcounted pointer to nsIContentViewer (mViewer)
- nsDocumentViewer seems to be the only impl of nsIContentViewer, and it has a refcounted pointer to nsIDocument (mDocument)

Scanning the Android reports, there doesn't seem to be any consistency as to the type of node the report suggests we failed on, just that we are iterating over the nodes until we find a "bad" one (assuming the document itself is fine...).
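A hedged outline of the ownership chain described above, using std::shared_ptr as a stand-in for the refcounted pointers (RefPtr/nsCOMPtr); these are not the real class definitions, only the "each link keeps the next alive" shape being argued about.

```cpp
// Illustration only: std::shared_ptr stands in for Gecko's refcounting, and
// the structs are stripped down to the members discussed in this comment.
#include <memory>

struct SVGDocument {};                 // indirect subclass of nsDocument
struct nsDocumentViewer {              // only impl of nsIContentViewer
  std::shared_ptr<SVGDocument> mDocument;
};
struct SVGDocumentWrapper {
  std::shared_ptr<nsDocumentViewer> mViewer;
};
struct VectorImage {
  std::shared_ptr<SVGDocumentWrapper> mSVGDocumentWrapper;

  // SizeOfSourceWithComputedFallback walks this chain; if every link is
  // held like this, the document pointer itself should stay valid while
  // the reporter iterates over its nodes.
  SVGDocument* GetDocumentForSizing() const {
    if (mSVGDocumentWrapper && mSVGDocumentWrapper->mViewer) {
      return mSVGDocumentWrapper->mViewer->mDocument.get();
    }
    return nullptr;
  }
};
```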
Again, also not Android, but similar reports that may also provide useful information from 62.0b:
https://crash-stats.mozilla.com/report/index/f71bd73b-6a96-4d24-b07b-fc63b0180620
https://crash-stats.mozilla.com/report/index/f84314c7-3e72-4732-a1c8-b306e0180721

It is also worth noting there was a similar report from as early as 58: https://crash-stats.mozilla.com/report/index/c8733391-e53e-4d18-bb90-732ca0180208

Unfortunately the volume isn't high enough for non-Android for me to put too much faith in these isolated reports.
If we look at the crash volumes for release/beta, the beta reports dropped off quite a bit from their peak at the end of June. This presumably lines up with the release of 61 in that time frame, so the beta users got moved to 62. Is it possible this was fixed or otherwise we landed something that blunted the effect in 62?
Hey Jim, any updates here from Qualcomm?
Flags: needinfo?(nchen)
Not yet. We're reaching out to other contacts.
Flags: needinfo?(nchen)
Whiteboard: [geckoview:klar:p1]
Noting crash volume on 62 beta is still very low while we are seeing 30K+ crashes per week on release.
This crash appears to be gone entirely in 61.0.2 which shipped last week (and had no related changes).
Assignee: nchen → nobody
1 crash on 63 since we shipped, a few crashes on pre-release, ESR has most of the crashes but it's not a lot. Adjusting flags accordingly.
The crash volume is very low, maybe it was fixed somewhere else or the signature changed. Also a geckoview bug.
Priority: P1 → P3
Whiteboard: [geckoview]
Component: General → DOM
Product: Firefox for Android → Core
Version: Firefox 61 → unspecified

There were 22398 reports in Fennec 61.0, 47 in Fennec 64.0, and only 1 in 66.0a1.

This crash signature is reported on all platforms, but 90% of the reports are from Android.

Whiteboard: [geckoview] → [geckoview:p3]
Component: DOM → DOM: Core & HTML

I stumbled over this bug by accident, and if I query https://crash-stats.mozilla.org/search/?signature=%3DAllocInfo%3A%3AGet%3CT%3E&date=%3E%3D2018-12-04T00%3A00%3A00.000Z&date=%3C2019-12-03T14%3A58%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=moz_crash_reason&_columns=build_id&_columns=platform_pretty_version#crash-reports, I see crashes in all our products on all platforms.

All of them (well, the ones I clicked on) appear to fail more or less here: https://searchfox.org/mozilla-central/rev/732a4acab44c9ed4199611c15be0ab63c08e7ca2/memory/build/mozjemalloc.cpp#3181

Thus it does not look like a DOM-specific failure to me. And it would be helpful to know what the intent of the MOZ_DIAGNOSTIC_ASSERT(ret != 0); is: does it check for a potential error in the mozjemalloc implementation itself, or does it indicate misuse (like calling this with an already-freed pointer or similar)?

In the latter case we should probably analyze all stack traces and find the source of the dangling pointer case by case. Phew.

Flags: needinfo?(erahm)

I also had a look and the signature is too generic to be meaningful; we should add it to Socorro's prefix list so that we can tell different stacks apart. Some crashes have common stacks but a bunch of others don't, so this could just be coming from faulty hardware. It happens often with over-generic crash signatures such as this one.

(In reply to Gabriele Svelto [:gsvelto] from comment #61)

> I also had a look and the signature is too generic to be meaningful; we should add it to Socorro's prefix list so that we can tell different stacks apart. Some crashes have common stacks but a bunch of others don't, so this could just be coming from faulty hardware. It happens often with over-generic crash signatures such as this one.

I would want to start from the assumption that the MOZ_DIAGNOSTIC_ASSERT(ret != 0); was put there on purpose and wants to tell us something. And if it indicates misuse: as we frequently see other UAF issues, it would be helpful to have some (potentially 1000, judging by the number of search results) hints at potential UAFs through this assert. But I am speculating.

(In reply to Jens Stutte [:jstutte] from comment #62)

> I would want to start from the assumption that the MOZ_DIAGNOSTIC_ASSERT(ret != 0); was put there on purpose and wants to tell us something. And if it indicates misuse: as we frequently see other UAF issues, it would be helpful to have some (potentially 1000, judging by the number of search results) hints at potential UAFs through this assert. But I am speculating.

Yeah, I'm not saying this isn't valid; it's possible that there are valid stacks hiding in there. In particular, this stack comes up a few times:

0 	mozglue.dll 	AllocInfo::Get<1>(void const*) 	memory/build/mozjemalloc.cpp:3047
1 	mozglue.dll 	moz_malloc_size_of 	memory/mozalloc/mozalloc.cpp:128
2 	xul.dll 	unsigned __int64 mozilla::webgl::ShaderValidatorResults::SizeOfIncludingThis(*) 	dom/canvas/WebGLShaderValidator.cpp:485
3 	xul.dll 	mozilla::WebGLShader::SizeOfIncludingThis(unsigned __int64 (*)(void const*)) 	dom/canvas/WebGLShader.cpp:302
4 	xul.dll 	static __int64 mozilla::WebGLMemoryTracker::GetShaderSize() 	dom/canvas/WebGLMemoryTracker.cpp:121
5 	xul.dll 	nsresult mozilla::WebGLMemoryTracker::CollectReports(class nsIHandleReportCallback*, class nsISupports*, bool) 	dom/canvas/WebGLMemoryTracker.cpp:61
6 	xul.dll 	nsresult mozilla::detail::RunnableFunction<`lambda at z:/task_1564955640/build/src/xpcom/base/nsMemoryReporterManager.cpp:1863:7'>::Run() 	xpcom/threads/nsThreadUtils.h:564
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1224
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:486
9 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:88
10 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:308

This is also the only stack associated with a crash on a machine that has ECC memory and is thus not susceptible to memory errors.

For other crashes with weird stacks and older versions, it's possible, given the low volume, that these are bit flips in memory being caught by the assertion, since none of the reports come from machines with ECC memory. We're loading the mapbits data from memory here; since the assertion checks against 0 (after masking), it's possible for stuck bits or bit flips to trigger it.
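As a toy illustration of the bit-flip theory: the bit layout below is invented and is not the real mozjemalloc chunk-map encoding, but it shows how a single flipped or stuck bit can turn a valid entry into one whose masked value is 0, which is exactly what the assertion rejects.

```cpp
// Toy illustration only; the flag/size split is invented.
#include <cstdint>
#include <cstdio>

// Pretend the low bits are flags and the upper bits encode the (non-zero)
// size of an allocated page run.
constexpr uint32_t kFlagMask = 0x0000000F;

uint32_t MaskedSize(uint32_t mapbits) {
  return mapbits & ~kFlagMask;  // the assert effectively requires this != 0
}

int main() {
  uint32_t good = 0x00001001;            // allocated, size bits set
  uint32_t flipped = good ^ 0x00001000;  // one bit flip clears the size bits
  std::printf("good=%#x -> %#x, flipped=%#x -> %#x\n",
              (unsigned)good, (unsigned)MaskedSize(good),
              (unsigned)flipped, (unsigned)MaskedSize(flipped));
  // With faulty (non-ECC) memory, the second case would trip
  // MOZ_DIAGNOSTIC_ASSERT(ret != 0) even for a perfectly valid pointer.
  return 0;
}
```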

Depends on: 1600951

I filed bug 1600951 to disambiguate the crashes; once it lands, the crash volume here will go to zero as Socorro generates new, more specific signatures. It should be sufficient to triage some crashes with a signature starting with AllocInfo::Get<T> to find the new, more relevant signature.

> Does it check for a potential error in the mozjemalloc implementation itself, or does it indicate misuse (like calling this with an already-freed pointer or similar)?

The latter.

Thanks :glandium, clearing needinfo to Eric then.

:gsvelto: Thanks for demangling this! Can we easily apply this prefix filter to already existing reports from the last 6 months as well?

Flags: needinfo?(erahm) → needinfo?(gsvelto)

(In reply to Jens Stutte [:jstutte] from comment #66)

> :gsvelto: Thanks for demangling this! Can we easily apply this prefix filter to already existing reports from the last 6 months as well?

Yes, I think we can reprocess the crashes. I'll see if I can do it.

OK, I can reprocess all the crashes with this signature (and similar ones) but I'll have to wait until bug 1600951 reaches production.

Flags: needinfo?(gsvelto)

Looks like we'll also need bug 1601223.

AllocInfo::Get<T> is now in the prefix list, time to reprocess crashes.

... and it's done. I reprocessed all the crashes from the past 6 months so the actual results might take a while to show up on crash-stats, but tomorrow we can probably start looking for new signatures.

Reprocessing the crashes paid off handsomely: the vast majority of the resulting signatures affect Firefox & Thunderbird version 60 and older, so we don't care about those. The stack I put in comment 63 now has its own signature and looks like a real problem affecting both Firefox 72 & 73. I'm closing this bug since this signature no longer exists, and opening a new one for that.

Closing this; we can follow up on the new signature in bug 1603041.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID