<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 1

•

8 years ago

How frequent is this crash? I suspect this is a duplicate of bug 1198710, just with the new signature.

Flags: needinfo?(matt.woodrow)

Astley Chen (inactive)

Comment 2

•

8 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #1) > How frequent is this crash? > > I suspect this is a duplicate of bug 1198710, just with the new signature. A bit higher than bug 1198710 within the period of 57 beta. It seems the display item address returned from RemoveBottm() could be invalid or contained invalid child items. Do you have any idea how to progress the investigation?

Comment 3

•

8 years ago

[Tracking Requested - why for this release]: UAF sec-critical regression in 57 Crashes are all wildptr and/or clear UAFs. Worse, they're READ, WRITE and EXEC crashes. This may be a dup of bug 1198710 with a new signature -- that shows similar problems and probably should be marked as sec.

Group: core-security

tracking-firefox57: --- → ?

Flags: needinfo?(milan)

Keywords: csectype-uaf, csectype-wildptr, sec-critical

Marcia Knous [:marcia]

Comment 4

•

8 years ago

Tracking 57+ based on Randell's assessment in Comment 3. There are a few crashes in 58 as well. The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in 57 as well, but also has crashes dating back several releases. Although this isn't high volume in beta, would be good to understand more about why these appear more severe in 57.

tracking-firefox57: ? → +

Comment 5

•

8 years ago

(In reply to Marcia Knous [:marcia - use ni] from comment #4) > The other signature nsDisplayList::DeleteAll in Bug 1198710 has crashes in > 57 as well, but also has crashes dating back several releases. > > Although this isn't high volume in beta, would be good to understand more > about why these appear more severe in 57. The other ones on bug 1198710 are sec crashes too, just with a different signature. Might be different bugs, but I'm betting on the same bug. We probably should mark that one as sec as well, though it's been open a long time NI dveditz to decide on the other bug

Flags: needinfo?(dveditz)

Comment 6

•

8 years ago

It looks like we have a display item in the list that contains garbage memory. Trying to call Destroy() on it crashes. Display items are all allocated in an arena, which we free once we've finished painting (and after this call). If we had already free'd this display item, I'd still expect the memory contents to be valid since we don't do any sort of recycling of the arena memory, and the dtor shouldn't be writing garbage into our vtable pointer. A few of the crashes (though not many) even look like they have the poison value (0xE5) written into them. I can't think of anywhere where we'd write the poison value into the still-valid arena. So it seems like either have corruption, something else is clobbering the contents of this display item. Possibly an invalid static_cast? That doesn't explain where the poison value comes from though. Another possibility is that the display item was allocated into a different arena, one that has been actually released to the OS at this point (and then can be clobbered/poisoned by another user of the same memory). I can't see how this would happen though. We definitely have cases where we can recurse into nsLayoutUtils::PaintFrame which creates a new builder/arena, but this doesn't pass through the display list from the outer caller. There would have to be some way of a display item, display list, or display list builder crossing this boundary, and then us adding an item to the wrong list. I did a quick audit and can't find anywhere that we cache any of these items in a way that the wrong could access them, but it's possible I missed something. Not sure how to make further progress without STR.

Comment 7

•

8 years ago

Attached patch assert-display-item (obsolete) — Details — Splinter Review

We could try this, it should at least narrow the options down. Might not be ideal for performance, so I think we'd want to back this out once we get results (ideally in Nightly, but Beta at the latest). Nathan, would you be ok with adding something like this? I could add lots of comments telling people they aren't allowed to use the new functions except for this bug, but we also don't want to draw too much attention.

Assignee: nobody → matt.woodrow

Attachment #8917655 - Flags: feedback?

Updated

•

8 years ago

Attachment #8917655 - Flags: feedback? → feedback?(nfroyd)

Updated

•

8 years ago

Group: core-security → layout-core-security

Comment 8

•

8 years ago

Could we do some ifdefs to restrict to nightly or to early-beta-and-earlier? And/or add some code to check arena-it-was-allocated-from when elements are added? Or before releasing the list?

Flags: needinfo?(matt.woodrow)

Comment 9

•

8 years ago

On the other bug, I see this crash: https://crash-stats.mozilla.com/report/index/551eb5bf-6b3b-4e58-a5d1-187e20171007 That's an EXEC crash with an e5e5 address in 57b7 (and there are quite a few wildptr EXEC crashes there). I'm going to mark it as sec also. I strongly suspect these two signatures are part of the same bug, but will leave separate for now.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 10

•

8 years ago

Comment on attachment 8917655 [details] [diff] [review] assert-display-item Review of attachment 8917655 [details] [diff] [review]: ----------------------------------------------------------------- I think we could do something like this, sure. Definitely says that something fishy is going on with this code, though. =/ I like jesup's idea about trying to restrict this to nightly/early beta. ::: xpcom/ds/ArenaAllocator.h @@ +178,5 @@ > + > + bool DebugContains(void* aPtr) > + { > + return aPtr >= reinterpret_cast<void*>(header.offset) && > + aPtr < reinterpret_cast<void*>(header.tail); This should be written as: uintptr_t p = reinterpret_cast<uintptr_t>(aPtr); return p >= header.offset && p < header.tail; due to rules in the standard about comparing pointers: aPtr doesn't necessarily point into the same object bounded by header.{offset,tail}, so it can't be compared to them.

Attachment #8917655 - Flags: feedback?(nfroyd) → feedback+

Comment 11

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #10) > Comment on attachment 8917655 [details] [diff] [review] > assert-display-item > ... > > I like jesup's idea about trying to restrict this to nightly/early beta. MOZ_DIAGNOSTIC_ASSERT instead of MOZ_RELEASE_ASSERT? It'd hit nightly and dev edition, but not beta and release.

Flags: needinfo?(milan)

Comment 12

•

8 years ago

(In reply to Randell Jesup [:jesup] from comment #8) > And/or add some code to check arena-it-was-allocated-from when elements are > added? Or before releasing the list? Checking the arena when elements are added is what we really want I think, but it's considerably bigger patch to write, so I thought this would be a good start to see if we're on the right track. I can bite the bullet and write the bigger patch if you think it's worth it. (In reply to Nathan Froyd [:froydnj] from comment #10) > I think we could do something like this, sure. Definitely says that > something fishy is going on with this code, though. =/ It does :( I'd want to remove this again once we figure out exactly what that is.

Flags: needinfo?(matt.woodrow)

Comment 13

•

8 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #12) > Checking the arena when elements are added is what we really want I think, > but it's considerably bigger patch to write, so I thought this would be a > good start to see if we're on the right track. > > I can bite the bullet and write the bigger patch if you think it's worth it. I'm far from the final arbiter here, but I think it is, since this is a LONG-standing sec-critical (see bug 1198710).

Comment hidden (me-too)

Comment 15

•

8 years ago

Attached patch assert-display-item — Details — Splinter Review

Attachment #8917655 - Attachment is obsolete: true

Attachment #8918686 - Flags: review?(nfroyd)

Attachment #8918686 - Flags: review?(mstange)

Markus Stange [:mstange]

Comment 16

•

8 years ago

Comment on attachment 8918686 [details] [diff] [review] assert-display-item Review of attachment 8918686 [details] [diff] [review]: ----------------------------------------------------------------- r+ on everything outside ArenaAllocator.h

Attachment #8918686 - Flags: review?(mstange) → review+

Comment 17

•

8 years ago

Comment on attachment 8918686 [details] [diff] [review] assert-display-item Review of attachment 8918686 [details] [diff] [review]: ----------------------------------------------------------------- r=me on the ArenaAllocator.h part, but we should talk about what the right test below is. ::: xpcom/ds/ArenaAllocator.h @@ +183,5 @@ > +#ifdef MOZ_DIAGNOSTIC_ASSERT_ENABLED > + bool DebugContains(void* aPtr) > + { > + return aPtr >= (this + 1) && > + reinterpret_cast<uintptr_t>(aPtr) < header.tail; Wait, what? Oh, I see, because `this` is actually embedded in the arena, and testing: header.offset <= aPtr && aPtr < header.tail is not the correct test, because we'd be checking to see whether the pointer lives in the *unallocated* part of the arena. So is the right test something like (this + 1) <= aPtr && aPtr < header.offset since that would be checking for things actually allocated? Or do we want to use the test written here? Either way, I think we want to write the first part of the test as: reinterpret_cast<uintptr_t>(aPtr) >= reinterpret_cast<uintptr_t>(this + 1) for the aforementioned pointer comparison issues.

Attachment #8918686 - Flags: review?(nfroyd) → review+

Comment 18

•

8 years ago

Comment on attachment 8918686 [details] [diff] [review] assert-display-item [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily, we still haven't figured out what the issue is. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports). Which older supported branches are affected by this flaw? All. Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? No, but this code hasn't changed much, should be easy. How likely is this patch to cause regressions; how much testing does it need? Unlikely, just adding some new assertions.

Attachment #8918686 - Flags: sec-approval?

Comment 19

•

8 years ago

Comment on attachment 8918686 [details] [diff] [review] assert-display-item sec-approval+ for trunk. Is there any point in backporting this to beta if we're just adding assertions?

Flags: needinfo?(matt.woodrow)

Attachment #8918686 - Flags: sec-approval? → sec-approval+

Joel Maher ( :jmaher ) (UTC -8) (PTO back normal Nov 17)

Comment 20

•

8 years ago

(In reply to Al Billings [:abillings] from comment #19) > Comment on attachment 8918686 [details] [diff] [review] > assert-display-item > > sec-approval+ for trunk. > Is there any point in backporting this to beta if we're just adding > assertions? Beta just gets a much higher crash volume, so it'll make it quicker to track down. Looks like this happens a few times a week on Nightly though, so maybe there's no point.

Flags: needinfo?(matt.woodrow)

Updated

•

8 years ago

Depends on: 1409411

Comment 21

•

8 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #20) > (In reply to Al Billings [:abillings] from comment #19) > > Comment on attachment 8918686 [details] [diff] [review] > > assert-display-item > > > > sec-approval+ for trunk. > > Is there any point in backporting this to beta if we're just adding > > assertions? > > Beta just gets a much higher crash volume, so it'll make it quicker to track > down. Looks like this happens a few times a week on Nightly though, so maybe > there's no point. Are you sure? An assertion crash is a LOT better than a UAF

Flags: needinfo?(matt.woodrow)

Comment 22

•

8 years ago

(In reply to Randell Jesup [:jesup] from comment #21) > (In reply to Matt Woodrow (:mattwoodrow) from comment #20) > > (In reply to Al Billings [:abillings] from comment #19) > > > Comment on attachment 8918686 [details] [diff] [review] > > > assert-display-item > > > > > > sec-approval+ for trunk. > > > Is there any point in backporting this to beta if we're just adding > > > assertions? > > > > Beta just gets a much higher crash volume, so it'll make it quicker to track > > down. Looks like this happens a few times a week on Nightly though, so maybe > > there's no point. > > Are you sure? An assertion crash is a LOT better than a UAF Well the changes are restricted to development builds since it's both a guess and a performance regression. Protecting our beta users might be worth it, but we really just need to figure out the real issue ASAP and get it fixed so we can ship something to release.

Flags: needinfo?(matt.woodrow)

https://hg.mozilla.org/mozilla-central/rev/940a0c98e46c https://hg.mozilla.org/mozilla-central/rev/2c77560532c9

Comment 23

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox58: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla58

Updated

•

8 years ago

Group: layout-core-security → core-security-release

Comment 24

•

8 years ago

Calling this 57:wontfix given comment 22. Feel free to set it back to affected and nominate for Beta approval if I've misunderstood, though.

status-firefox57: affected → wontfix

Updated

•

8 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Comment 25

•

8 years ago

Crash reports: bp-fec67bcc-475e-442a-9a1e-2912d0171028 bp-e6d6de36-221f-4b01-a759-e9e780171029 These are really weird, we're hitting the assertion when tearing down the list, or when sorting the list (which removes and then adds items from the same list). At no point do we ever hit an assertion when actually adding a new item to the list. I did another audit to try find locations where we might edit the list that aren't covered by assertions and couldn't find any. It's possible that this is just a corruption issue.

Comment 26

•

8 years ago

Attached patch more-assertions — Details — Splinter Review

Trying some extra logging to figure out exactly what the broken pointer is (if it's close to arena memory, or entirely unrelated), and also trying to see if we can catch the 'mAbove' linked list pointer being corrupt.

Attachment #8924049 - Flags: review?(nfroyd)

Comment 27

•

8 years ago

Comment on attachment 8924049 [details] [diff] [review] more-assertions Review of attachment 8924049 [details] [diff] [review]: ----------------------------------------------------------------- r=me, but you need to ask for data review: https://wiki.mozilla.org/Firefox/Data_Collection I think this falls into the "technical data" category, so review should be pretty painless. ::: layout/painting/nsDisplayList.h @@ +1823,5 @@ > // destructors. > protected: > nsDisplayItemLink() : mAbove(nullptr) {} > nsDisplayItemLink(const nsDisplayItemLink&) : mAbove(nullptr) {} > + uint64_t mSentinel = 0xDEADBEEFDEADBEEF; You may want to guard this with MOZ_DIAGNOSTIC_ASSERT_ENABLED, just in case we don't back this out when diagnostic asserts are expiring? ::: xpcom/ds/ArenaAllocator.h @@ +134,5 @@ > } > + std::stringstream log; > + log << "Failed to find pointer " << aPtr << " within arena blocks: "; > + for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) { > + log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), "; Dumping the pointer that we're looking for and the addresses of the arena in which we looked for it, so we can potentially tell why we missed it...OK, makes sense. @@ +136,5 @@ > + log << "Failed to find pointer " << aPtr << " within arena blocks: "; > + for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) { > + log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), "; > + } > + MOZ_CRASH_UNSAFE_OOL(log.str().c_str()); Usages of this macro need a data-review to ensure we're not sending private data.

Attachment #8924049 - Flags: review?(nfroyd) → review+

Andrew McCreight [:mccr8]

Comment 28

•

8 years ago

Comment on attachment 8924049 [details] [diff] [review] more-assertions Review of attachment 8924049 [details] [diff] [review]: ----------------------------------------------------------------- (Sorry, I should have just done this as part of my review, redirecting your review request) Requesting data-review for sending runtime-defined strings with MOZ_CRASH_UNSAFE_OOL: we're sending information about memory addresses of internal data structures. We believe this falls into Category 1 ("technical data"). This data is being collected to understand crashes we're seeing in the wild better and is being collected for nightly builds and early beta builds.

Attachment #8924049 - Flags: review?(francois)

Updated

•

8 years ago

status-firefox58: fixed → affected

François Marier [:francois]

Comment 29

•

8 years ago

Comment on attachment 8924049 [details] [diff] [review] more-assertions Review of attachment 8924049 [details] [diff] [review]: ----------------------------------------------------------------- datareview+ based on the description from comment 28.

Attachment #8924049 - Flags: review?(francois) → review+

Comment 30

•

8 years ago

Comment on attachment 8924049 [details] [diff] [review] more-assertions [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily, we still haven't figured out what the issue is. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? It makes the outcome fairly obvious, but doesn't really help finding the cause (until we get crash reports). Which older supported branches are affected by this flaw? All. Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? No, but this code hasn't changed much, should be easy. How likely is this patch to cause regressions; how much testing does it need? Unlikely, just adding some new assertions.

Attachment #8924049 - Flags: sec-approval?

Comment 31

•

8 years ago

Comment on attachment 8924049 [details] [diff] [review] more-assertions sec-approval+. Land away.

Attachment #8924049 - Flags: sec-approval? → sec-approval+

Updated

•

8 years ago

Keywords: checkin-needed

https://hg.mozilla.org/integration/mozilla-inbound/rev/19955c48f09c13e78ec50e7cf29ade5382e43c66

Comment 32

•

8 years ago

Keywords: checkin-needed → leave-open

Target Milestone: mozilla58 → ---

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 33

•

8 years ago

https://hg.mozilla.org/mozilla-central/rev/19955c48f09c

Comment 34

•

8 years ago

Crash report with the newer assertions: https://crash-stats.mozilla.com/report/index/be49f893-9f70-4d7e-a265-d3a3e0171105 It's failing the mSentinel check, that's pretty bad. The only explanation I can think of is heap corruption/buffer overflow. Someone is overwriting memory belonging to a display item. I'm not sure what to do next to track down where this is coming from, will think more about it.

Comment 35

•

8 years ago

Crash report from the other assertion, showing the arena blocks: https://crash-stats.mozilla.com/report/index/25293941-d437-484f-8195-6a5b60171108

Comment 36

•

8 years ago

Nathan: Looks like one of the arena chunks there has 0 for the 'offset' value. That's pretty bizarre. The arena is 8192, so the failing pointer is probably in that chunk, if it had the expected end value. I can't see any way we could get a 0 there though, excluding heap corruption.

Flags: needinfo?(nfroyd)

Comment 37

•

8 years ago

The 'offset' value is part of the header, which is in the same allocation as the actual arena memory that we give out. So this could be roughly the same as the other crash report, something is corrupting memory within our arena blocks. It looks like the corrupt one doesn't have another block right before it, so it would seem possible that we've got a buffer overflow from something else overflowing into an arena block. This is a fairly common crash (more than I think would be expected for generic heap corruption), so the overflowing memory is likely something else we allocate during painting.

Comment 38

•

8 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #36) > Nathan: Looks like one of the arena chunks there has 0 for the 'offset' > value. That's pretty bizarre. > > The arena is 8192, so the failing pointer is probably in that chunk, if it > had the expected end value. > > I can't see any way we could get a 0 there though, excluding heap corruption. That *is* pretty weird! ArenaAllocator::mHead is initialized with offset=0, tail=0, but no other ArenaChunk is. And mHead isn't even included in the list of chunks; it's just there as a placeholder. The only ways I can think of a zero getting there are: 1. Some kind of weird race. I can't really construct a plausible race for this, though. 2. Overflow in ArenaChunk::Allocate: header.offset += aSize; That'd have to be a *very* specific aSize, though; comment 35's crash, for instance, would require an allocation of ~2.5GB or so. I guess that could arise from allocation sizes overflowing before coming into ArenaAllocator...we could add some overflow checks in ArenaAllocator to determine whether that's what's happening. But IIUC, we can't get an overflowing size there, since we're coming from nsPresArena's pooled allocator, and we don't allocate arbitrary (i.e. calculated) sizes there...do we? We could try inserting (several!) CorruptionCanary members (from mfbt/Poison.h) into ArenaHeader and checking them at various points. That would provide solid evidence of heap corruption, but wouldn't move us much closer to figuring out what was going on. :( Maybe we should try printing the tail member, too? Checking for overflow would have the virtue--if the overflow check(s) ever failed--of telling us exactly where things were going wrong, rather than "somebody scribbled over this memory at some point in the past".

Flags: needinfo?(nfroyd)

Comment 39

•

8 years ago

I really can't see how it could be overflow, shouldn't the size check prevent us from allocating a size that would go past the tail pointer? If we did overflow, then I think further allocation attempts would succeed, and we'd never move on to a new chunk. As you said, we also don't do arbitrary allocations from the display list builder, so I can't see how we'd ever get something big enough. I'll have a go at the canary option.

Comment 40

•

8 years ago

Attached patch canary-check — Details — Splinter Review

Attachment #8927693 - Flags: review?(nfroyd)

Jet Villegas (inactive)

Updated

•

7 years ago

Blocks: 1415496

Comment 41

•

7 years ago

Comment on attachment 8927693 [details] [diff] [review] canary-check Review of attachment 8927693 [details] [diff] [review]: ----------------------------------------------------------------- I'm really hesitant to remove the logging in favor of the canary checking. I'd prefer to keep both, if possible. WDYT? ::: xpcom/ds/ArenaAllocator.h @@ -134,5 @@ > } > - std::stringstream log; > - log << "Failed to find pointer " << aPtr << " within arena blocks: "; > - for (ArenaChunk* arena = mHead.next; arena; arena = arena->next) { > - log << "(" << reinterpret_cast<uintptr_t>(arena + 1) << ", " << arena->header.offset << "), "; I haven't looked through the crash reports to tell if there have been multiple reports with a 0 offset...do we really want to lose the logging here and/or do we want to print header.tail into this to try and understand better what's going on? Couldn't we, say, keep the logging and canary.Check() in the DebugContains() loop above? We'd crash on the check if there was heap corruption, but we'd still get useful information from the logging if there *wasn't* heap corruption. Just looking at the crash report summaries, we're still hitting read/exec breakpoints on sometimes-freed memory with the signatures from this bug and not hitting the crash here...maybe several different things are being conflated.

Attachment #8927693 - Flags: review?(nfroyd)

Comment 42

•

7 years ago

The primary issue is that this logging is really expensive, we've got quite a few open regression bugs because of it. Retained display lists landed, and showed a regression on the display list mutation talos test, just because this logging adds overhead to display list operations (we show a 30% win without the logging). I realize that the logging is restricted to dev builds, but it's interfering with our ability to track performance regressions accurately, and that's not great. I do think you're right that there's more than one way that this can crash, I'm just not sure this logging is helping us narrow it down enough to justify the cost.

Comment 43

•

7 years ago

Hm, that's fair. But is the canary going to be much cheaper? The string logging is only activated if there's ever a problem, so we don't care about how expensive that is. The cost the benchmarks are detecting is really the loop + DebugContains check. The DebugContains check is two comparisons, whereas the canary check is a single comparison. So that will be a little cheaper, but we're still paying a bunch of cache misses to wander through memory, which I'd bet is the performance problem. We could make the DebugContains check a single unsigned comparison (I think) *and* do the Canary check and it wouldn't be much more expensive than what we're doing now... Have you measured the cost of the current checks vs. canary checking? Another alternative would be to make the individual arenas bigger, which would waste some memory, but give us fewer things to loop over when we're doing the debug checks.

Comment 44

•

7 years ago

The big difference is the canary check (as currently written) only happens 3 times during display list building, whereas DebugContains runs per-display-item (can easily be 20k times per-paint).

Comment 45

•

7 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #44) > The big difference is the canary check (as currently written) only happens 3 > times during display list building, whereas DebugContains runs > per-display-item (can easily be 20k times per-paint). Aha, I totally missed that. My mistake. I'm still not excited about re-doing data review if we have to add annotations for the crash, but maybe the canary check will reveal new information. Worth a shot.

Comment 46

•

7 years ago

Comment on attachment 8927693 [details] [diff] [review] canary-check Review of attachment 8927693 [details] [diff] [review]: ----------------------------------------------------------------- Let's try this, then.

Attachment #8927693 - Flags: review+

Comment 47

•

7 years ago

Comment on attachment 8927693 [details] [diff] [review] canary-check [Security approval request comment] How easily could an exploit be constructed based on the patch? Same as above, more attempts to try track down the cause of this bug.

Attachment #8927693 - Flags: sec-approval?

Comment 48

•

7 years ago

sec-approval+ for trunk.

status-firefox59: --- → affected

tracking-firefox59: --- → +

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Attachment #8927693 - Flags: sec-approval? → sec-approval+

Comment 49

•

7 years ago

https://hg.mozilla.org/mozilla-central/rev/99b6831627b1

Astley Chen (inactive)

Updated

•

7 years ago

Blocks: 1411988

Matt Wobensmith [:mwobensmith][:matt:]

Comment 50

•

7 years ago

Hi Matt, did your latest patch shed any light on the situation?

Flags: needinfo?(matt.woodrow)

Comment 51

•

7 years ago

Looks like there's one crash so far [1], and it's crashing at the first check point [2]. That suggests that the bug is in display list building itself, not FrameLayerBuilder. Need to think of something to narrow it down further. [1] https://crash-stats.mozilla.com/report/index/3ee5702a-db44-4411-805c-fa1840171128 [2] https://searchfox.org/mozilla-central/source/layout/base/nsLayoutUtils.cpp#3922

Updated

•

7 years ago

tracking-firefox58: --- → +

Comment 52

•

7 years ago

Comment on attachment 8927693 [details] [diff] [review] canary-check Approval Request Comment [Feature/Bug causing the regression]: Unknown. [User impact if declined]: Needed as part of trying to debug 1421345. [Is this code covered by automated tests?]: No. [Has the fix been verified in Nightly?]: No. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: No [Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue. [String changes made/needed]: None.

Flags: needinfo?(matt.woodrow)

Attachment #8927693 - Flags: approval-mozilla-beta?

Gerry Chang [:gchang]

Comment 53

•

7 years ago

Comment on attachment 8927693 [details] [diff] [review] canary-check After discussing with dev in mail thread, we can live with this and let's fix this in 59.

Attachment #8927693 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Comment 54

•

7 years ago

Still seeing exploitable looking crashes in 59 beta. Matt, anything else to try here?

Flags: needinfo?(matt.woodrow)

Comment 55

•

7 years ago

Attached patch debug-dl-canary — Details — Splinter Review

There's some more interesting crash reports here: https://crash-stats.mozilla.com/signature/?moz_crash_reason=~Canary%20check%20failed&signature=nsPresArena%3A%3AAllocate&date=%3E%3D2018-01-07T03%3A26%3A09.000Z&date=%3C2018-02-07T03%3A26%3A09.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports That's a canary check every time we try to allocate something new from the arena. I was hoping the stacks would have something in common that would point to the issue, but I haven't found it yet. nsFlexContainerFrame shows up fairly frequently (more than I'd expect), but auditing the display list code there hasn't given me any ideas. Adding a canary check each time we enter/exit a BuildDisplayList variant should hopefully narrow this down even further.

Flags: needinfo?(matt.woodrow)

Attachment #8948548 - Flags: review?(nfroyd)

Updated

•

7 years ago

status-firefox60: --- → ?

tracking-firefox60: --- → +

Comment 56

•

7 years ago

Comment on attachment 8948548 [details] [diff] [review] debug-dl-canary Review of attachment 8948548 [details] [diff] [review]: ----------------------------------------------------------------- rs=me More diagnostics is more better.

Attachment #8948548 - Flags: review?(nfroyd) → review+

Comment 57

•

7 years ago

Comment on attachment 8948548 [details] [diff] [review] debug-dl-canary [Security approval request comment] Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it. Approval Request Comment [Feature/Bug causing the regression]: Unknown. [User impact if declined]: Needed as part of trying to debug 1421345. [Is this code covered by automated tests?]: No. [Has the fix been verified in Nightly?]: No. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: No [Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue. [String changes made/needed]: None.

Attachment #8948548 - Flags: sec-approval?

Attachment #8948548 - Flags: approval-mozilla-beta?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 58

•

7 years ago

Comment on attachment 8948548 [details] [diff] [review] debug-dl-canary approvals given!

Attachment #8948548 - Flags: sec-approval?

Attachment #8948548 - Flags: sec-approval+

Attachment #8948548 - Flags: approval-mozilla-beta?

Attachment #8948548 - Flags: approval-mozilla-beta+

Comment 59

•

7 years ago

diagnostics patch: https://hg.mozilla.org/releases/mozilla-beta/rev/01d3f0695fc4947d9a086d01f8d105816e94a3ce

Andrei Vaida [:avaida]

Comment 60

•

7 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #57) > [Is this code covered by automated tests?]: No. > [Has the fix been verified in Nightly?]: No. > [Needs manual test from QE? If yes, steps to reproduce]: No. Setting qe-verify- based on Matt's assessment on manual testing needs.

Flags: qe-verify-

Updated

•

7 years ago

status-firefox60: ? → affected

Comment 61

•

7 years ago

This keeps getting more confusing. There's a bunch of crashes with the latest diagnostic patch, and this one is particularly interesting: https://crash-stats.mozilla.com/report/index/e5aad7dd-483c-4653-ada5-d55600180214 The caller of nsTextFrame::BuildDisplayList called nsDisplayListBuilder::Check() right before we called BuildDisplayList. I checked the minidump and isSelected is still Nothing(), so we didn't take the branch there, and basically didn't execute any code before creating the display item and crashing. That really sounds like someone else is corrupting our memory and we just crash at random times. As with bug 1359569, these crashes are all with D2DEnabled:false, which almost certainly is meaningful, but I have no idea how not having d2d means that another thread corrupts our memory.

Julien Cristau [:jcristau]

Comment 62

•

7 years ago

I also went through all the other threads in the crash report, nothing seems to be running relevant code.

Updated

•

7 years ago

status-firefox58: affected → wontfix

Comment 63

•

7 years ago

I can leave this tracked for 59 for now, but it sounds like this may not be actionable before the 59 release.

Updated

•

7 years ago

status-firefox59: affected → wontfix

status-firefox61: --- → affected

Frederik Braun [:freddy]

Comment 64

•

7 years ago

Matt, I understand this is a tricky issue, but maybe there are some newer crashes with your diagnostic patch applied that might be more conclusive? Would you please take a look?

Flags: needinfo?(matt.woodrow)

Comment 65

•

7 years ago

(In reply to Frederik Braun [:freddyb] from comment #64) > Matt, I understand this is a tricky issue, but maybe there are some newer > crashes with your diagnostic patch applied that might be more conclusive? > Would you please take a look? Thanks for reminding me to look at this. Most of the previous diagnostic path was calls to nsDisplayListBuilder::Check, which is behind a MOZ_DIAGNOSTIC_ASSERT ifdef. My understanding is that this won't do anything in beta builds (though it should for dev edition?). All the crashes I see from beta look to be true beta (based on telemetry environment settings:update:channel:"beta"), so I think the conclusion in comment 61 is incorrect. I'm going to write some patches to try correct this, and hopefully get more interesting result. I still see a strong correlation between the crashes and text, flexbox and tables. No progress on figuring out why that might be though.

Flags: needinfo?(matt.woodrow)

Comment 66

•

7 years ago

Attached patch debug-dl-canary v2 — Details — Splinter Review

Fixed patch, one that should actually do something useful on beta. This is identical to the previous patch, but without the MOZ_DIAGNOSTIC_ASSERT_ENABLED, so it should be ok to carry the review and approval.

Attachment #8969134 - Flags: review+

Comment 67

•

7 years ago

Comment on attachment 8969134 [details] [diff] [review] debug-dl-canary v2 Same patch as before, just without the ifdef that made it not work on beta. [Security approval request comment] Yet another diagnostic patch to try track this down. Not sure how hard it is to exploit since we haven't been able to figure out how to trigger it. Approval Request Comment [Feature/Bug causing the regression]: Unknown. [User impact if declined]: Needed as part of trying to debug 1421345. [Is this code covered by automated tests?]: No. [Has the fix been verified in Nightly?]: No. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: No [Why is the change risky/not risky?]: Just a diagnostic assert to try track down a possible corruption issue. [String changes made/needed]: None.

Attachment #8969134 - Flags: sec-approval?

Attachment #8969134 - Flags: approval-mozilla-beta?

Comment 68

•

7 years ago

Comment on attachment 8969134 [details] [diff] [review] debug-dl-canary v2 Approvals given

Attachment #8969134 - Flags: sec-approval?

Attachment #8969134 - Flags: sec-approval+

Attachment #8969134 - Flags: approval-mozilla-beta?

Attachment #8969134 - Flags: approval-mozilla-beta+

https://hg.mozilla.org/integration/mozilla-inbound/rev/9ff54d6e39b0e215df9cf3fa251d0f1c5ba0d811

Comment 69

•

7 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/b9d595e7b90c

Comment 70

•

7 years ago

uplift

tracking-firefox61: --- → +

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 71

•

7 years ago

https://hg.mozilla.org/mozilla-central/rev/9ff54d6e39b0

Comment 72

•

7 years ago

Few reports starting to show up with the new checks: https://crash-stats.mozilla.com/report/index/b0e0ea6e-8201-42ad-a19c-502c60180423 https://crash-stats.mozilla.com/report/index/651cea1f-0055-48fe-9a47-071390180422 ScrollFrameHelper::BuildDisplayList looks very suspicious at this point, still can't find anything obviously wrong with it though.

Comment 73

•

7 years ago

Attached patch more-checks — Details — Splinter Review

Same as before, just adding some more checks specific to the code that looks close.

Attachment #8970400 - Flags: sec-approval?

Attachment #8970400 - Flags: review?(mstange)

Attachment #8970400 - Flags: approval-mozilla-beta?

Comment 74

•

7 years ago

Al, I'm ok with this for beta 16 if you want to give it sec-approval.

Flags: needinfo?(abillings)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 75

•

7 years ago

Comment on attachment 8970400 [details] [diff] [review] more-checks sec-approval+ given and I marked Beta as well.

Flags: needinfo?(abillings)

Attachment #8970400 - Flags: sec-approval?

Attachment #8970400 - Flags: sec-approval+

Attachment #8970400 - Flags: approval-mozilla-beta?

Attachment #8970400 - Flags: approval-mozilla-beta+

Markus Stange [:mstange]

Updated

•

7 years ago

Attachment #8970400 - Flags: review?(mstange) → review+

Comment 76

•

7 years ago

uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/bda9a66dc358

guigs [:guigs] PST (please needinfo me!)

Comment 77

•

7 years ago

Hi there, I have a user experiencing the this crash if you need more information. [Apologies, I don't mean to spam the bug] this is the thread we are working with the user in.https://support.mozilla.org/en-US/questions/1216050 I will ask them to try beta.

Updated

•

7 years ago

Blocks: RDLbugs

Updated

•

7 years ago

No longer blocks: RDLbugs

Julien Cristau [:jcristau]

Comment 78

•

7 years ago

(In reply to guigs [guigs] PST (please needinfo me!) from comment #77) > Hi there, I have a user experiencing the this crash if you need more > information. [Apologies, I don't mean to spam the bug] this is the thread we > are working with the user > in.https://support.mozilla.org/en-US/questions/1216050 > > I will ask them to try beta. That's really interesting! I did not expect this to be consistently reproducible. Any chance you could try get them to test out a Nightly ASAN build? The latest one is available here: https://queue.taskcluster.net/v1/task/TByH83HhRCunzxqwqq774A/runs/0/artifacts/public/build/target.zip

Flags: needinfo?(rmcguigan)

Comment 79

•

7 years ago

This is wontfix for 60 at this point.

status-firefox60: affected → wontfix

status-firefox62: --- → affected

status-firefox-esr60: --- → affected

tracking-firefox62: --- → +

tracking-firefox-esr60: --- → ?

guigs [:guigs] PST (please needinfo me!)

Comment 80

•

7 years ago

I went through all the relevant release-60 crash reports that I could find and tried to analyze them as best I could: https://pastebin.mozilla.org/9085739 Some of the callstack information is confusing in crash-stats thanks to inlined code, and sometimes we seem to miss the BuildDisplayListForChild stack frame that calls into CheckCanary. The BuildDisplayListForChild line numbers in the pastebin are obtained by looking at the minidump, looking at stack memory, pulling the return address and then looking up what code that corresponds to. Mats, as I said in bug 1436505, this seems like it might actually be similar. I'd really appreciate if you could have a look over this when you get time. Thanks!

Flags: needinfo?(mats)

Updated

•

7 years ago

Flags: needinfo?(rmcguigan)

Mats Palmgren (inactive)

Comment 81

•

7 years ago

(replied in bug 1436505)

Flags: needinfo?(mats)

Comment 82

•

7 years ago

While I'm still very-much interested in fixing this in 61 if at all possible, I don't think tracking it is going to help at this point. Please don't hesitate to nominate a patch for uplift should the root cause be found and fixed in time.

status-firefox61: affected → fix-optional

tracking-firefox61: + → -

Comment 83

•

7 years ago

I'd also happily take a fix here in 62 beta. Note, bug 1198710, bug 1406727, bug 1436505, and bug 1467526 may all be related.

status-firefox63: --- → ?

tracking-firefox62: + → -

Updated

•

7 years ago

status-firefox61: fix-optional → wontfix

status-firefox62: affected → fix-optional

status-firefox63: ? → affected

Updated

•

7 years ago

Keywords: stalled

Updated

•

7 years ago

status-firefox62: fix-optional → wontfix

status-firefox63: affected → fix-optional

status-firefox64: --- → fix-optional

status-firefox-esr60: affected → fix-optional

tracking-firefox-esr60: ? → ---

Pascal Chevrel:pascalc

Updated

•

6 years ago

status-firefox63: fix-optional → wontfix

status-firefox64: fix-optional → wontfix

Updated

•

6 years ago

status-firefox65: --- → wontfix

status-firefox66: --- → fix-optional

status-firefox67: --- → ?

Updated

•

6 years ago

Keywords: sec-critical → sec-high

Updated

•

6 years ago

status-firefox66: fix-optional → wontfix

status-firefox67: ? → wontfix

status-firefox68: --- → fix-optional

status-firefox69: --- → fix-optional

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

status-firefox68: fix-optional → wontfix

status-firefox69: fix-optional → wontfix

status-firefox-esr60: fix-optional → wontfix

Comment 84

•

5 years ago

The leave-open keyword is there and there is no activity for 6 months.
:mattwoodrow, maybe it's time to close this bug?

Flags: needinfo?(matt.woodrow)

BugBot [:suhaib / :marco/ :calixte]

Comment 85

•

5 years ago

The crash rate is super low, but I guess we can leave this open, since it still happens sometimes.

Flags: needinfo?(matt.woodrow)

Comment 86

•

3 years ago

The bug assignee didn't login in Bugzilla in the last 7 months and this bug has priority 'P2'/severity 'critical'.
:miko, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: matt.woodrow → nobody

Flags: needinfo?(mikokm)

Miko Mynttinen

Comment 87

•

3 years ago

We have had a total of 10 crashes in the last six months, last crashing version is 91.6.0esr.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 3 years ago

Flags: needinfo?(mikokm)

Resolution: --- → WORKSFORME

Miko Mynttinen

Updated

•

3 years ago

Keywords: leave-open

BugBot [:suhaib / :marco/ :calixte]

Comment 88

•

3 years ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled