Closed Bug 1468541 Opened 6 years ago Closed 5 years ago

Crash in AllocInfo::Get<T>

Categories

(Core :: DOM: Core & HTML, defect, P3)

ARM
Android
defect

Tracking


RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox60 --- unaffected
firefox61 + fixed
firefox62 - wontfix
firefox63 - verified disabled
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [geckoview:p3])

Crash Data

This bug was filed from the Socorro interface and is report bp-4df0ccf6-c30e-4cc2-9191-c3fa80180613.

Top 10 frames of crashing thread:

0  libmozglue.so  AllocInfo::Get<true>  memory/build/mozjemalloc.cpp:3234
1  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
2  libxul.so  nsAttrAndChildArray::SizeOfExcludingThis  dom/base/nsAttrAndChildArray.cpp:962
3  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
4  libxul.so  mozilla::dom::FragmentOrElement::AddSizeOfExcludingThis  dom/base/FragmentOrElement.cpp:2363
5  libxul.so  mozilla::dom::Element::AddSizeOfExcludingThis  dom/base/Element.cpp:4312
6  libxul.so  mozilla::dom::Element::SetCustomElementDefinition  dom/base/Element.cpp:4303
7  libxul.so  AddSizeOfNodeTree  dom/base/nsDocument.cpp:11822
8  libxul.so  mozilla::dom::ExplicitChildIterator::GetNextChild  dom/base/ChildIterator.cpp:91
9  libxul.so  AddSizeOfNodeTree  dom/base/nsDocument.cpp:11850

This is a crash signature on Fennec builds with Snapdragon 820/821 processors that started to show up in larger numbers in 61.0b12 & b13.
This is the top crash in b13, with 875 crashes across 311 installs.
Keywords: topcrash
Lots of custom element stuff on the stack. I'm worried that this is another version of the random Snapdragon 820/821 Stylo crashes we've been seeing previously, but the volume of this one is scary high :(. And b15 is also showing as affected. Emilio, anything in the b11->b13 pushlog range here that looks possibly-related? https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FENNEC_61_0b11_RELEASE&tochange=FENNEC_61_0b13_RELEASE
Flags: needinfo?(emilio)
No, plus custom elements are nightly-only: https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/modules/libpref/init/all.js#1331 I don't know what's going on with those processors, or with those stacks...
Flags: needinfo?(emilio)
Hey Olli, looks like custom elements might be turned on in 61 beta, which merges to release next week. Can you take a look?
Flags: needinfo?(bugs)
For release drivers, we should block on figuring out what's going on here.
The stack is totally bogus. Looking at https://crash-stats.mozilla.com/report/index/4df0ccf6-c30e-4cc2-9191-c3fa80180613, at least the #7 -> #6 transition is broken, and so is #2 -> #1.
Stuart do we have snapdragon 820/821 processors in our inventory? (Maybe SV can recreate?)
Flags: needinfo?(sphilp)
Adding a ni on Ioana to see if she can help. I know that they have Pixels, which have Snapdragon 821.
Flags: needinfo?(ioana.chiorean)
There is a note in today's Channel meeting regarding this: we tried to reproduce bug 1468541 but none of us succeeded. Devices used: Pixel, Samsung S7 and Tab, S8 Plus, HTC 10, LG G5.
Flags: needinfo?(ioana.chiorean)
On our side (SV inventory) we tried on Pixel, Samsung S7, Galaxy Tab, S8 Plus, HTC 10, LG G5, and OnePlus 3, and we were not able to reproduce at all. We tried several scenarios from the comments and links but none turned out successful. // I had the tab open and got a mid-air collision; commenting here too, as I said in the call. @Kevin - can you try with the S8 you got? Low chance, but still worth trying.
Flags: needinfo?(sphilp)
Flags: needinfo?(kbrosnan)
The 61 builds pushed to the release population are affected by this crash as well.
This is currently the #2 top crash of the RC build too.
This is the #1 topcrash on Fennec 61 now and I think it needs to block wider rollout in the Play Store. This needs an owner ASAP.
Flags: needinfo?(snorp)
Flags: needinfo?(sdaswani)
Priority: -- → P1
Vlad and Co. can you look at this ASAP?
Flags: needinfo?(sdaswani) → needinfo?(vlad.baicu)
See Also: → 1470925
Top affected devices:
* OnePlus 3 & 3T
* Google Pixel & Pixel XL
* LG G6 (H870, H871, and H872 models)
* Xiaomi Mi 5s
* Samsung Galaxy Tab S3
The stacks do not make much sense, but I do think it's SVG related. The root frame always seems to be mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() which then calls nsDocument::DocAddSizeOfExcludingThis(). The stacks seem to go south from here, as we're obviously not going to be calling things like mozilla::dom::Element::SetCustomElementDefinition(). I seem to recall a similar weirdness where a loop got miscompiled, but it was years ago. I think gcp found it? Anyway, as this seems to be DOM code, maybe someone on Andrew's team should own it.
Flags: needinfo?(snorp)
Flags: needinfo?(overholt)
Flags: needinfo?(gpascutto)
I tried to repro this on my Pixel (which is an affected device) by going to some of the URLs that show up in crash-stats and had no luck. I imagine any page with an SVG will work, but Wikipedia article pages show up several times. The logo is an SVG.
I guess we're probably using _msize() somewhere in DocAddSizeOfExcludingThis(), so it makes sense that we could end up in AllocInfo::Get<true>(). Glandium, do you have any idea what could be going on here?
Flags: needinfo?(mh+mozilla)
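For readers following along, here is a minimal sketch of the call shape under discussion. This is not the actual mozjemalloc source: the struct layout and the "my" prefix are invented, and the real AllocInfo::Get<Validate>() consults the allocator's chunk metadata. The point is only that malloc_usable_size builds an AllocInfo on its own stack frame and has Get<true>() fill it in through a pointer, which matches the "mov r0, sp" visible in the disassembly a couple of comments further down.

```cpp
// Hedged sketch only -- simplified from the behaviour described in this bug,
// not copied from memory/build/mozjemalloc.cpp.
#include <cstddef>

struct AllocInfo {
  size_t mSize = 0;  // usable size of the allocation, 0 if not one of ours

  // The real Get<Validate>() looks aPtr up in the allocator's chunk map /
  // huge-allocation tree; this stub only shows the calling shape.  The
  // hidden first argument is `this`, which the caller allocates on its own
  // stack -- a bogus destination pointer here is what the crash shows.
  template <bool Validate>
  static AllocInfo Get(const void* aPtr) {
    AllocInfo info;
    (void)aPtr;  // metadata lookup elided in this sketch
    return info;
  }
};

size_t my_malloc_usable_size(const void* aPtr) {
  // Get<true>() writes its result into a stack-allocated AllocInfo.
  return AllocInfo::Get<true>(aPtr).mSize;
}
```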
The crash address (0xd23d3829) is *in* libmozglue.so: "base_addr": "0xd2385000", "end_addr": "0xd24da000"
Specifically, the crash address is in the read/executable part of libmozglue.so. The crashing code says:

4d1ee: e9c4 0600 strd r0, r6, [r4]

and r4 is 0xd23d3829, so this is actively trying to write to a non-writable part of memory.

Looking at the disassembly for the function shows that r4 comes directly from r0, which is the first argument to the function. So whatever is calling this is giving a bogus argument in the first place. The function itself starts at 0x4d1b8, is .hidden, and there's only one caller in libmozglue.so: malloc_usable_size, which feeds r0 with... sp. That is consistent with the C++ code (the first argument to AllocInfo::Get being `this`, and in that case, it's allocated on the stack). That is however not consistent with the value of sp in the crash report. The value of lr is also bogus.

Now, an interesting fact about 0xd23d3829 is that it is actually a valid function pointer, to moz_malloc_size_of... Looking at a few crash reports other than the one from comment 0 shows they all have the same characteristics:
- r4 containing the address of moz_malloc_size_of (as well as r8)
- lr = pc + 1, which is bogus
- sp clearly not what it's supposed to be considering r4

Because lr is bogus, stack traces are bogus too, which is why the second frame makes no sense. They're all found by stack scanning. Considering this is all supposed to come from malloc_usable_size, it would make sense that the AddSizeOfExcludingThis frames, or at least some of them, are accurate. That doesn't explain how we get to the crashing address with r4 holding a value that is not sp + 32.

Copy/pasting the relevant assembly:

0004ae12 <malloc_usable_size@@Base>:
 4ae12: f000 b800 b.w 4ae16 <malloc_usable_size@@Base+0x4>
 4ae16: b5e0 push {r5, r6, r7, lr}
 4ae18: 4601 mov r1, r0
 4ae1a: 4668 mov r0, sp
 4ae1c: f002 f9cc bl 4d1b8 <AllocInfo AllocInfo::Get<true>(void const*)>
 4ae20: 9800 ldr r0, [sp, #0]
 4ae22: b002 add sp, #8
 4ae24: bd80 pop {r7, pc}

0004d1b8: <AllocInfo AllocInfo::Get<true>(void const*)>:
 4d1b8: b570 push {r4, r5, r6, lr}
 4d1ba: b088 sub sp, #32
 4d1bc: 4604 mov r4, r0
 4d1be: 481b ldr r0, [pc, #108] ; (4d22c <AllocInfo AllocInfo::Get<true>(void const*)+0x74>)
 4d1c0: 4478 add r0, pc
 4d1c2: 6bc0 ldr r0, [r0, #60] ; 0x3c
 4d1c4: f3bf 8f5b dmb ish
 4d1c8: b1a0 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1ca: 460e mov r6, r1
 4d1cc: f36f 0613 bfc r6, #0, #20
 4d1d0: b186 cbz r6, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1d2: 4817 ldr r0, [pc, #92] ; (4d230 <AllocInfo AllocInfo::Get<true>(void const*)+0x78>)
 4d1d4: 0d0a lsrs r2, r1, #20
 4d1d6: 4478 add r0, pc
 4d1d8: 6d40 ldr r0, [r0, #84] ; 0x54
 4d1da: eb00 0082 add.w r0, r0, r2, lsl #2
 4d1de: b148 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1e0: 6800 ldr r0, [r0, #0]
 4d1e2: b138 cbz r0, 4d1f4 <AllocInfo AllocInfo::Get<true>(void const*)+0x3c>
 4d1e4: 428e cmp r6, r1
 4d1e6: d00a beq.n 4d1fe <AllocInfo AllocInfo::Get<true>(void const*)+0x46>
 4d1e8: 4608 mov r0, r1
 4d1ea: f7ff ff43 bl 4d074 <arena_salloc(void const*)>
 4d1ee: e9c4 0600 strd r0, r6, [r4]
 4d1f2: e002 b.n 4d1fa <AllocInfo AllocInfo::Get<true>(void const*)+0x42>
 4d1f4: 2000 movs r0, #0
 4d1f6: 6020 str r0, [r4, #0]
 4d1f8: 6060 str r0, [r4, #4]
 4d1fa: b008 add sp, #32
 4d1fc: bd70 pop {r4, r5, r6, pc}
 4d1fe: 4d0d ldr r5, [pc, #52] ; (4d234 <AllocInfo AllocInfo::Get<true>(void const*)+0x7c>)
 4d200: 9605 str r6, [sp, #20]
 4d202: 447d add r5, pc
 4d204: 4628 mov r0, r5
 4d206: f7f0 ed74 blx 3dcf0 <pthread_mutex_lock@plt>
 4d20a: 1d28 adds r0, r5, #4
 4d20c: a901 add r1, sp, #4
 4d20e: f001 f9c4 bl 4e59a <RedBlackTree<extent_node_t, ExtentTreeTrait>::Search(RedBlackTree<extent_node_t, ExtentTreeTrait>::TreeNode)>
 4d212: b110 cbz r0, 4d21a <AllocInfo AllocInfo::Get<true>(void const*)+0x62>
 4d214: 6941 ldr r1, [r0, #20]
 4d216: 6060 str r0, [r4, #4]
 4d218: e001 b.n 4d21e <AllocInfo AllocInfo::Get<true>(void const*)+0x66>
 4d21a: 2100 movs r1, #0
 4d21c: 6061 str r1, [r4, #4]
 4d21e: 4806 ldr r0, [pc, #24] ; (4d238 <AllocInfo AllocInfo::Get<true>(void const*)+0x80>)
 4d220: 6021 str r1, [r4, #0]
 4d222: 4478 add r0, pc
 4d224: f7f0 ed6a blx 3dcfc <pthread_mutex_unlock@plt>
 4d228: e7e7 b.n 4d1fa <AllocInfo AllocInfo::Get<true>(void const*)+0x42>
 4d22a: bf00 nop
 4d22c: 768c strb r4, [r1, #26]
 4d22e: 0010 movs r0, r2
 4d230: 7676 strb r6, [r6, #25]
 4d232: 0010 movs r0, r2
 4d234: 764a strb r2, [r1, #25]
 4d236: 0010 movs r0, r2
 4d238: 762a strb r2, [r5, #24]
 4d23a: 0010 movs r0, r2
Flags: needinfo?(mh+mozilla)
> I seem to recall a similar weirdness where a loop got miscompiled, but it was years ago. I think gcp found it?

I think I know what bug you mean: https://bugzilla.mozilla.org/show_bug.cgi?format=default&id=1323277
We initially thought this was a core bug, but it seems to have been caused by a bad system zlib being shipped with the chipset.
Flags: needinfo?(gpascutto)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #18)
> The stacks do not make much sense, but I do think it's SVG related.

Not sure if it's relevant, but in the pushlog to Fennec 61.0b12 (where the crash signature first showed up), there's bug 1457078, which in particular touched an SVG file for Android. https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FENNEC_61_0b11_RELEASE&tochange=FENNEC_61_0b12_RELEASE&full=1
We're missing full stacks, and stacks look rather broken. Do we even know the mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() call is right? Why would we call that method so often?
(In reply to Olli Pettay [:smaug] from comment #25)
> We're missing full stacks, and stacks look rather broken. Do we even know
> the mozilla::image::VectorImage::SizeOfSourceWithComputedFallback() call is
> right? Why would we call that method so often?

We automatically call the memory reporter periodically for telemetry purposes.
Flags: needinfo?(vlad.baicu)
Does anyone know why we might get such very short stacks? Normal crash reports on Android do look reasonable - nsThread::ProcessNextEvent somewhere close to the starting point of the stack. Are the crash reports all bogus/broken?
(In reply to Olli Pettay [:smaug] from comment #27)
> Does anyone know why we might get such very short stacks? Normal crash reports
> on Android do look reasonable - nsThread::ProcessNextEvent somewhere close
> to the starting point of the stack.
> Are the crash reports all bogus/broken?

The stack walker is using stack scanning to find the other frames. Since stack scanning is based on heuristics it can fail, especially if the crash caused garbage to be dumped on the stack.
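To illustrate what "stack scanning" means here, a rough sketch of the heuristic follows. This is not Breakpad's actual code; the module bookkeeping is invented. The idea is simply that the walker reads raw words off the stack and keeps anything that looks like a return address into an executable module.

```cpp
// Rough sketch of a stack-scanning heuristic, for illustration only; this is
// not Breakpad's implementation and the module/validation logic is invented.
#include <cstddef>
#include <cstdint>
#include <vector>

struct ModuleRange {
  uintptr_t start;
  uintptr_t end;  // [start, end) covers the module's executable pages
};

// Returns candidate return addresses found by scanning raw stack words.
// Garbage on the stack (e.g. stale function pointers such as
// moz_malloc_size_of) produces bogus frames, which is why the reported
// stacks in this bug make little sense.
std::vector<uintptr_t> ScanStackForFrames(const uintptr_t* stackWords,
                                          size_t wordCount,
                                          const std::vector<ModuleRange>& code) {
  std::vector<uintptr_t> frames;
  for (size_t i = 0; i < wordCount; ++i) {
    uintptr_t value = stackWords[i];
    for (const ModuleRange& m : code) {
      if (value >= m.start && value < m.end) {
        frames.push_back(value);  // looks like a return address; keep it
        break;
      }
    }
  }
  return frames;
}
```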
Whiteboard: [geckoview:klar:p1]
Assigning to Jim to coordinate investigation
Assignee: nobody → nchen
I don't have any of the affected devices. The Galaxy S8 is not on the list of affected devices.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(overholt)
I was just able to reproduce this on my Pixel: https://crash-stats.mozilla.com/report/index/f7b8788b-c648-4195-b625-dd1bb0180711. It happened when the phone was idle, after I installed AdBlock Plus.
OK, I think the fact that it crashed while idle backs up the claim in comment #26 that maybe we're doing a memory measurement for telemetry. Frank, do you know how often this happens? Is there a way we can trigger an immediate collection to try to reproduce this bug?
Flags: needinfo?(fbertsch)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #32)
> OK, I think the fact that it crashed while idle backs up the claim in
> comment #26 that maybe we're doing a memory measurement for telemetry.
> Frank, do you know how often this happens? Is there a way we can trigger an
> immediate collection to try to reproduce this bug?

Redirecting to Alessio, who will be better able to answer this question.
Flags: needinfo?(fbertsch) → needinfo?(alessio.placitelli)
If it's using the regular memory reporting infrastructure it should be possible to reproduce this by requesting a report from about:memory.
(In reply to Gabriele Svelto [:gsvelto] (PTO until 30/07) from comment #34)
> If it's using the regular memory reporting infrastructure it should be
> possible to reproduce this by requesting a report from about:memory.

I did try that, but I haven't been able to get a crash report that way yet.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #32)
> OK, I think the fact that it crashed while idle backs up the claim in
> comment #26 that maybe we're doing a memory measurement for telemetry.
> Frank, do you know how often this happens? Is there a way we can trigger an
> immediate collection to try to reproduce this bug?

Telemetry gathers memory information at most once per minute [1]. Gathering the memory happens when the "cycle-collector-begin" topic is observed and the operation happens on the main thread idle queue [3].

It should be possible to manually trigger this from a scratchpad that can access browser code by:

> let TelemetrySession = ChromeUtils.import("resource://gre/modules/TelemetrySession.jsm", {});
> TelemetrySession.observe("", "cycle-collector-begin");

Alternatively, one could simply refresh the about:telemetry page, which will end up calling the memory gathering function [4]. However, this won't happen in the idle queue on the main thread.

[1] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#59,1679
[2] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#85
[3] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/components/telemetry/TelemetrySession.jsm#1686
[4] - https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/toolkit/content/aboutTelemetry.js#382
Flags: needinfo?(alessio.placitelli)
I tried refreshing the about:telemetry page, but that didn't result in a crash. I did try pasting the Telemetry code in the console but I was getting errors. So far since yesterday's 2 crashes, I haven't had any additional crashes on the Google Pixel.
Since the memory reporters run on the main thread, this could be a race that results from the memory reporter obtaining a reference to an off-main-thread object without sufficient locking/strong references being obtained, resulting in a use-after-free calling into the sizing code because the object has been deallocated on the non-main-thread.

For example, the imgReport memory reporter's call-stack seems to look like:
- imgMemoryReporter::CollectReports https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/imgLoader.cpp#81
- RecordCounterForRequest, which obtains a RefPtr<image::Image> https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/imgLoader.cpp#459
- ImageMemoryCounter::ImageMemoryCounter takes a bare Image* (which is covered, but...) https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/Image.cpp#21
- Image::CollectSizeOfSurfaces is invoked inside there https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/Image.cpp#44

And then there are a bunch of those implementations, as we can see via: https://searchfox.org/mozilla-central/search?q=symbol:_ZNK7mozilla5image5Image21CollectSizeOfSurfacesER8nsTArrayINS0_20SurfaceMemoryCounterEEPFmPKvE&redirect=false

But in particular:
- RasterImage::CollectSizeOfSurfaces conditionally calls mFrameAnimator->CollectSizeOfCompositingSurfaces https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/RasterImage.cpp#705
- But mFrameAnimator is nullable in case of error (in DoError), which means that maybe mFrameAnimator was not null when the main thread started to do stuff, but it got freed out from under the reporter. https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/image/RasterImage.cpp#1615
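To make the suspected pattern concrete, here is a hedged sketch of the kind of race described above. The class and member names are invented stand-ins, not the real imagelib code.

```cpp
// Hypothetical illustration of the suspected race; these types do not
// correspond to the real RasterImage/FrameAnimator classes.
#include <cstddef>

using MallocSizeOf = size_t (*)(const void*);

struct Animator {
  size_t CollectSizeOfSurfaces(MallocSizeOf aMallocSizeOf) const {
    // The real code would hand each surface allocation to aMallocSizeOf;
    // here we just size the animator object itself.
    return aMallocSizeOf(this);
  }
};

struct Image {
  Animator* mAnimator = nullptr;  // cleared/deleted in an error path

  size_t SizeOfSurfaces(MallocSizeOf aMallocSizeOf) const {
    // If another thread deleted mAnimator between this null check and the
    // call below, the reporter would walk freed memory -- the
    // use-after-free pattern this comment is suspicious of.
    if (mAnimator) {
      return mAnimator->CollectSizeOfSurfaces(aMallocSizeOf);
    }
    return 0;
  }
};
```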
Let me point you back to comment 22. Don't forget the crash happens because the code is trying to write to a memory address that points to a function in libmozglue, while the address is supposed to point to the stack! And, considering the normal flow of the function, that's actually impossible!
Right. To clarify, I'm suggesting that things go off the rails in the manner you described thanks to CollectSizeOfCompositingSurfaces or something like it running with a 'this' whose contents are no longer what it thinks they are due to a cross-thread race. The race is important because it explains why this doesn't happen 100% of the time and creates a scenario where heap memory may change between accesses. I don't have a direct explanation for the stack pointer getting corrupted, but it's worth pointing out that the image memory reporters pass around a `MallocSizeOf` function pointer in order to call into the moz_malloc_size_of/moz_malloc_usable_size/malloc_usable_size_impl functions. Since it's an explicit argument that's passed around a lot, it may get spilled to the stack in many places, which means that some prior cause of stack corruption could more easily snowball into what we're seeing. That said, the small amount of ARM disassembly I've looked at does not look like it likes to overwrite the stack pointer versus adding/subtracting, so... I still have no idea how that would work. (And unfortunately radare2 doesn't like breakpad's linux-extended minidump format, so my brief attempt to see if the minidump's raw stack data had anything illuminating in it didn't work out. Ted's https://github.com/luser/rust-minidump does work, but looking at raw hex data isn't super illuminating on its own either.)
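For context on the function pointer mentioned above, here is a hedged sketch of how a MallocSizeOf callback gets threaded through a sizing call chain. The typedef follows the general shape used in the tree, but the Node type is invented; the point is that the same function-pointer value gets spilled into many stack frames, so stale copies of it are easy for a stack scanner to misread.

```cpp
// Hedged illustration only; Node is an invented type, not a Gecko class.
#include <cstddef>

using MallocSizeOf = size_t (*)(const void*);

struct Node {
  Node* mFirstChild = nullptr;
  Node* mNextSibling = nullptr;

  size_t AddSizeOfExcludingThis(MallocSizeOf aMallocSizeOf) const {
    size_t total = 0;
    for (const Node* child = mFirstChild; child; child = child->mNextSibling) {
      // aMallocSizeOf is passed (and typically spilled to the stack) at
      // every level of the recursion.
      total += aMallocSizeOf(child);
      total += child->AddSizeOfExcludingThis(aMallocSizeOf);
    }
    return total;
  }
};
```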
FWIW, there's a tool to convert minidumps to core files in the breakpad repository.
That being said, the pointer we try to write to is contained in r4, which is directly assigned from r0, which is the first argument to the function, and the caller sets it directly from sp (as in, the stack pointer, not a value on the stack), so there's theoretically no room for r4 to contain anything other than that pointer to the stack. Now, what's printed in the crash report is only as good as what's available in memory when breakpad reads it, and maybe the problem we have here and in other recent crashes is that those devices have some bug that trashes the register state breakpad has access to. So let's assume that r4 is just plain bogus. Let's assume for a second that the stack pointer is right, and that, per the code, that's also the actual value of r4. Then crashing on this instruction still makes no sense, because the stack address is writable per the minidump.
Susheel: Advice on this thread?
Flags: needinfo?(sdaswani)
I just wanted to call out that this signature has generated over 30K crashes in a 30-day period. If anyone has suggestions for other things to try in terms of STR or investigation, they would be greatly appreciated.
Marion, no advice; this seems like a Gecko 'platform' issue, and not one my Softvision team is best placed to handle (the current assignee seems like the right person).
Flags: needinfo?(sdaswani)
I filed bug 1478707 on my suspicions from comment 38 in the graphics component to try and get graphics eyes on it and potentially a fix.
We're also waiting on some leads at Qualcomm since this appears to happen on specific Snapdragon chips. There is an analysis of the crash at https://gist.github.com/darchons/c64fc6460b1d3613b0369865dfdc28ea
Is https://gist.github.com/darchons/c64fc6460b1d3613b0369865dfdc28ea#file-gistfile1-txt-L70 speculation from the stack contents or did you manage somehow to actually step through the crash happening and that's what happened?
I should have mentioned it's my speculation from looking at the crash dump.
Discussed offline with asuth about the possible RasterImage::mFrameAnimator concerns, but mFrameAnimator should never be freed off the main thread, so I'm not convinced it could be causing the problem.

I've been looking at this crash report: https://crash-stats.mozilla.com/report/index/67d3ab01-baca-4322-a5f9-2fba50180728#tab-details which is the only non-Android one. It might provide a different perspective / a more reliable stack trace.

VectorImage::SizeOfSourceWithComputedFallback gets a pointer to the SVGDocument (an indirect subclass of nsDocument). It seems to get lost in the weeds iterating over the nodes in that document. I think the pointer we use should be valid because it should be refcounted all the way from VectorImage to the SVGDocument pointer, and it checks for nulls. I note that it does call nsIDocument::AsSVGDocument(), which is a simple static_cast (maybe we somehow got the wrong document?!).

- VectorImage has a refcounted pointer to SVGDocumentWrapper (mSVGDocumentWrapper)
- SVGDocumentWrapper has a refcounted pointer to nsIContentViewer (mViewer)
- nsDocumentViewer seems to be the only impl of nsIContentViewer, and it has a refcounted pointer to nsIDocument (mDocument)

Scanning the Android reports, there doesn't seem to be any consistency as to the type of node the report suggests we failed on, just that we are iterating over the nodes until we find a "bad" one (assuming the document itself is fine...).
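A hedged outline of the ownership chain described above, using std::shared_ptr as a stand-in for the refcounted pointers (RefPtr/nsCOMPtr); these are not the real class definitions, only the "each link keeps the next alive" shape being argued about.

```cpp
// Illustration only: std::shared_ptr stands in for Gecko's refcounting, and
// the structs are stripped down to the members discussed in this comment.
#include <memory>

struct SVGDocument {};                 // indirect subclass of nsDocument
struct nsDocumentViewer {              // only impl of nsIContentViewer
  std::shared_ptr<SVGDocument> mDocument;
};
struct SVGDocumentWrapper {
  std::shared_ptr<nsDocumentViewer> mViewer;
};
struct VectorImage {
  std::shared_ptr<SVGDocumentWrapper> mSVGDocumentWrapper;

  // SizeOfSourceWithComputedFallback walks this chain; if every link is
  // held like this, the document pointer itself should stay valid while
  // the reporter iterates over its nodes.
  SVGDocument* GetDocumentForSizing() const {
    if (mSVGDocumentWrapper && mSVGDocumentWrapper->mViewer) {
      return mSVGDocumentWrapper->mViewer->mDocument.get();
    }
    return nullptr;
  }
};
```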
Again, also not Android, but similar reports that may also provide useful information from 62.0b:
https://crash-stats.mozilla.com/report/index/f71bd73b-6a96-4d24-b07b-fc63b0180620
https://crash-stats.mozilla.com/report/index/f84314c7-3e72-4732-a1c8-b306e0180721

It is also worth noting there was a similar report from as early as 58: https://crash-stats.mozilla.com/report/index/c8733391-e53e-4d18-bb90-732ca0180208

Unfortunately the volume isn't high enough for non-Android for me to put too much faith in these isolated reports.
If we look at the crash volumes for release/beta, the beta reports dropped off quite a bit from their peak at the end of June. This presumably lines up with the release of 61 in that time frame, so the beta users got moved to 62. Is it possible this was fixed or otherwise we landed something that blunted the effect in 62?
Hey Jim, any updates here from Qualcomm?
Flags: needinfo?(nchen)
Not yet. We're reaching out to other contacts.
Flags: needinfo?(nchen)
Whiteboard: [geckoview:klar:p1]
Noting crash volume on 62 beta is still very low while we are seeing 30K+ crashes per week on release.
This crash appears to be gone entirely in 61.0.2 which shipped last week (and had no related changes).
Assignee: nchen → nobody
1 crash on 63 since we shipped, a few crashes on pre-release, ESR has most of the crashes but it's not a lot. Adjusting flags accordingly.
The crash volume is very low, maybe it was fixed somewhere else or the signature changed. Also a geckoview bug.
Priority: P1 → P3
Whiteboard: [geckoview]
Component: General → DOM
Product: Firefox for Android → Core
Version: Firefox 61 → unspecified

There were 22398 reports in Fennec 61.0, 47 in Fennec 64.0, and only 1 in 66.0a1.

This crash signature is reported on all platforms, but 90% of the reports are from Android.

Whiteboard: [geckoview] → [geckoview:p3]
Component: DOM → DOM: Core & HTML

I stumbled over this bug by accident, and if I query https://crash-stats.mozilla.org/search/?signature=%3DAllocInfo%3A%3AGet%3CT%3E&date=%3E%3D2018-12-04T00%3A00%3A00.000Z&date=%3C2019-12-03T14%3A58%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=moz_crash_reason&_columns=build_id&_columns=platform_pretty_version#crash-reports, I see crashes in all our products on all platforms.

All of them (well, the ones I clicked on) appear to fail more or less here: https://searchfox.org/mozilla-central/rev/732a4acab44c9ed4199611c15be0ab63c08e7ca2/memory/build/mozjemalloc.cpp#3181

Thus it does not look like a DOM-specific failure to me. And it would be helpful to know what the intent of the MOZ_DIAGNOSTIC_ASSERT(ret != 0); is: does it check for a potential error in the mozjemalloc implementation itself, or does it indicate misuse (like calling this with an already-freed pointer or similar)?

In the latter case we should probably analyze all stack traces and find the source of the dangling pointer case by case. Phew.

Flags: needinfo?(erahm)

I also had a look and the signature is too generic to be meaningful; we should add it to Socorro's prefix list so that we can tell different stacks apart. Some crashes have common stacks but a bunch of others don't, so this could just be coming from faulty hardware. It happens often with over-generic crash signatures such as this one.

(In reply to Gabriele Svelto [:gsvelto] from comment #61)

> I also had a look and the signature is too generic to be meaningful; we should add it to Socorro's prefix list so that we can tell different stacks apart. Some crashes have common stacks but a bunch of others don't, so this could just be coming from faulty hardware. It happens often with over-generic crash signatures such as this one.

I would want to start from the assumption that the MOZ_DIAGNOSTIC_ASSERT(ret != 0); was put there on purpose and wants to tell us something. And if it indicates misuse: as we frequently see other UAF issues, it would be helpful to have some (potentially 1000, judging by the number of search results) hints at potential UAFs through this assert. But I am speculating.

(In reply to Jens Stutte [:jstutte] from comment #62)

> I would want to start from the assumption that the MOZ_DIAGNOSTIC_ASSERT(ret != 0); was put there on purpose and wants to tell us something. And if it indicates misuse: as we frequently see other UAF issues, it would be helpful to have some (potentially 1000, judging by the number of search results) hints at potential UAFs through this assert. But I am speculating.

Yeah, I'm not saying this isn't valid; it's possible that there are valid stacks hiding in there. In particular, this stack comes up a few times:

0 	mozglue.dll 	AllocInfo::Get<1>(void const*) 	memory/build/mozjemalloc.cpp:3047
1 	mozglue.dll 	moz_malloc_size_of 	memory/mozalloc/mozalloc.cpp:128
2 	xul.dll 	unsigned __int64 mozilla::webgl::ShaderValidatorResults::SizeOfIncludingThis(*) 	dom/canvas/WebGLShaderValidator.cpp:485
3 	xul.dll 	mozilla::WebGLShader::SizeOfIncludingThis(unsigned __int64 (*)(void const*)) 	dom/canvas/WebGLShader.cpp:302
4 	xul.dll 	static __int64 mozilla::WebGLMemoryTracker::GetShaderSize() 	dom/canvas/WebGLMemoryTracker.cpp:121
5 	xul.dll 	nsresult mozilla::WebGLMemoryTracker::CollectReports(class nsIHandleReportCallback*, class nsISupports*, bool) 	dom/canvas/WebGLMemoryTracker.cpp:61
6 	xul.dll 	nsresult mozilla::detail::RunnableFunction<`lambda at z:/task_1564955640/build/src/xpcom/base/nsMemoryReporterManager.cpp:1863:7'>::Run() 	xpcom/threads/nsThreadUtils.h:564
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1224
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:486
9 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:88
10 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:308

This is also the only stack associated with a crash on a machine that has ECC memory and is thus not susceptible to memory errors.

For other crashes with weird stacks and older versions, it's possible, given the low volume, that these are bit flips in memory being caught by the assertion, since none of the reports come from machines with ECC memory. We're loading the mapbits data from memory here; since the assertion checks against 0 (after masking), it's possible for stuck bits or bit flips to trigger it.
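As a toy illustration of the bit-flip theory: the bit layout below is invented and is not the real mozjemalloc chunk-map encoding, but it shows how a single flipped or stuck bit can turn a valid entry into one whose masked value is 0, which is exactly what the assertion rejects.

```cpp
// Toy illustration only; the flag/size split is invented.
#include <cstdint>
#include <cstdio>

// Pretend the low bits are flags and the upper bits encode the (non-zero)
// size of an allocated page run.
constexpr uint32_t kFlagMask = 0x0000000F;

uint32_t MaskedSize(uint32_t mapbits) {
  return mapbits & ~kFlagMask;  // the assert effectively requires this != 0
}

int main() {
  uint32_t good = 0x00001001;            // allocated, size bits set
  uint32_t flipped = good ^ 0x00001000;  // one bit flip clears the size bits
  std::printf("good=%#x -> %#x, flipped=%#x -> %#x\n",
              (unsigned)good, (unsigned)MaskedSize(good),
              (unsigned)flipped, (unsigned)MaskedSize(flipped));
  // With faulty (non-ECC) memory, the second case would trip
  // MOZ_DIAGNOSTIC_ASSERT(ret != 0) even for a perfectly valid pointer.
  return 0;
}
```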

Depends on: 1600951

I filed bug 1600951 to disambiguate the crashes; once it lands, the crash volume here will go to zero as Socorro generates new, more specific signatures. It should be sufficient to triage some crashes with a signature starting with AllocInfo::Get<T> to find the new, more relevant signature.

> Does it check for a potential error in the mozjemalloc implementation itself, or does it indicate misuse (like calling this with an already-freed pointer or similar)?

The latter.

Thanks :glandium, clearing needinfo to Eric then.

:gsvelto: Thanks for demangling this! Can we easily apply this prefix filter to already existing reports from the last 6 months as well?

Flags: needinfo?(erahm) → needinfo?(gsvelto)

(In reply to Jens Stutte [:jstutte] from comment #66)

> :gsvelto: Thanks for demangling this! Can we easily apply this prefix filter to already existing reports from the last 6 months as well?

Yes, I think we can reprocess the crashes. I'll see if I can do it.

OK, I can reprocess all the crashes with this signature (and similar ones) but I'll have to wait until bug 1600951 reaches production.

Flags: needinfo?(gsvelto)

Looks like we'll also need bug 1601223.

AllocInfo::Get<T> is now in the prefix list, time to reprocess crashes.

... and it's done. I reprocessed all the crashes from the past 6 months so the actual results might take a while to show up on crash-stats, but tomorrow we can probably start looking for new signatures.

Reprocessing the crashes paid off handsomely: the vast majority of the resulting signatures affect Firefox & Thunderbird version 60 and older, so we don't care about those. The stack I put in comment 63 now has its own signature and looks like a real problem affecting both Firefox 72 & 73. I'm closing this bug since this signature no longer exists, and opening a new one for that.

Closing this; we can follow up on the new signature in bug 1603041.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID