Crash in MergeState::HasMatchingItemInOldList

ASSIGNED
Assigned to

Status

()

defect
P2
critical
ASSIGNED
a year ago
a month ago

People

(Reporter: mattwoodrow, Assigned: hiro)

Tracking

(Blocks 1 bug, {crash, regression})

Trunk
Unspecified
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox-esr60 unaffected, firefox60 unaffected, firefox61- wontfix, firefox62- wontfix, firefox63- wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 wontfix, firefox67 wontfix, firefox68 affected)

Details

(crash signature)

Attachments

(2 attachments)

Reporter

Description

a year ago
+++ This bug was initially created as a clone of Bug #1462497 +++

Bug 1262497 fixed most causes of this bug, and prevented it from crashing in release bugs.

This bug tracks the remaining crashes.
(In reply to [:philipp] from bug 1462497 comment #41)
> the signature is still present in 61.0b10 devedition builds - this query
> is/will be listing the moz_crash_reason fields:
> https://bit.ly/2JnOLAp
Priority: -- → P2
Reporter

Updated

a year ago
Depends on: 1466954
Duplicate of this bug: 1468616
OS: Windows 10 → All
Duplicate of this bug: 1468182
This is the top non-shutdown hang crash on the 6-13 Windows Nightly with 3.52% of all crashes (which is 30 crashes). Three of the URLs for the crashes are YouTube, and one is Google Docs.
I believe these crashes are hitting diagnostic asserts (given that only aurora channel builds appear to be affected on Beta), so calling this fix-optional for 61.
Assignee

Updated

11 months ago
See Also: → 1469220
I might have run into what look like more instances of this bug while triaging nightly builds. I found at least three reports with a similar MOZ_CRASH() reason but with a useless signature. This one for example:

https://crash-stats.mozilla.com/report/index/66a891bd-45fb-4194-ba70-d45b10180618

We might be undercounting the crashes.
Reporter

Updated

11 months ago
Depends on: 1469746
Reporter

Comment 7

11 months ago
Filed bug 1469746 for the specific case of this reported in webcompat bug 17317.

(In reply to Gabriele Svelto [:gsvelto] from comment #6)
> I might have run into what look like more instances of this bug while
> triaging nightly builds. I found at least three reports with a similar
> MOZ_CRASH() reason but with a useless signature. This one for example:
> 
> https://crash-stats.mozilla.com/report/index/66a891bd-45fb-4194-ba70-
> d45b10180618
> 
> We might be undercounting the crashes.

Missing symbols seems like a generic problem that would cause us to undercount crashes for all issues. It's not obvious why symbols missing would be specific to this crash. Do you know if we have other reported cases of this?

That said, if we could link the bug report to the assert message instead of the symbol, we'd be less likely to have this issue.
(In reply to Matt Woodrow (:mattwoodrow) from comment #7)
> Missing symbols seems like a generic problem that would cause us to
> undercount crashes for all issues. It's not obvious why symbols missing
> would be specific to this crash. Do you know if we have other reported cases
> of this?

Not really, but I wanted to point this out because there were 6 instances of this for a single nightly build which is a lot.

> That said, if we could link the bug report to the assert message instead of
> the symbol, we'd be less likely to have this issue.

Yeah.
Crashed while making a youtube video fullscreen.
Adding 63 as affected.
This is the top non-shutdown hang crash in the 6-27 Windows Nightly, with 19 crashes. URLs are various YouTube videos, plus http://www.pedalsandeffects.com/blog/2018/4/10/rainger-fx-reverb-x  (which has an embedded YouTube video).
Reporter

Updated

11 months ago
Depends on: 1472053
Reporter

Comment 12

11 months ago
(In reply to Andrew McCreight [:mccr8] from comment #11)
> This is the top non-shutdown hang crash in the 6-27 Windows Nightly, with 19
> crashes. URLs are various YouTube videos, plus
> http://www.pedalsandeffects.com/blog/2018/4/10/rainger-fx-reverb-x  (which
> has an embedded YouTube video).

It looks to be a really complex interaction with our fullscreen transition and an animation finishing in the middle of it.

I can reproduce it very very occasionally, but not in rr.

I've filed bug 1472053 for fixing what we think the issue is.
Assignee

Updated

11 months ago
Depends on: 1472076
Similar to comment 11, this is the #1 non-shutdown hang crash in the Windows
nightly of 20180628130527, with 30+ different installations recording 43
crashes.

Note that, despite "almost impossible to reproduce" comments above, one of
the installations (2018-06-28 18:46:34) reports 5 crashes, and in total there
are 4 installations reporting a crash more than once.  Given that, per
comments in bug 1472053, this sounds like some kind of race condition,
I wonder if it would be easier to reproduce on a very slow machine.

The other Windows nightly of the day, 20180628220051, reports 31 crashes
including 3 installations that crashed more than once.
Assignee

Comment 14

11 months ago
We think we already know what happens there on the race condition.  Bug 1472053 is an approach to fix the race, and bug 1472076 is another approach for the race.  Probably we can land bug 1472076 within a few days.
Assignee

Comment 15

11 months ago
After bug 1472076 landed, there are still some crashes unfortunately, e.g;

https://crash-stats.mozilla.com/report/index/5a7153f7-4e0f-447b-9374-203a20180703

The buildid 20180703100028 also includes bug 1467688, so there are still something flaw we are not aware of.
Reporter

Comment 16

11 months ago
(In reply to Hiroyuki Ikezoe (:hiro) from comment #15)
> After bug 1472076 landed, there are still some crashes unfortunately, e.g;
> 
> https://crash-stats.mozilla.com/report/index/5a7153f7-4e0f-447b-9374-
> 203a20180703
> 
> The buildid 20180703100028 also includes bug 1467688, so there are still
> something flaw we are not aware of.

Very interesting that this is also from the WillPaintWindow callstack, not nsRefreshDriver::Tick.

Is it possible that there's another way for the refresh driver time to change without us notifying our observers? Maybe we switched timers?
Assignee

Comment 17

11 months ago
Another possibility that changes the returned value of AnimationEffect::IsCurrent (or IsInEffect) I can think of is that GetNavigationTiming call [1] in DocumentTimlineGetCurrentTimeStamp if we don't have refresh driver yet. 

[1] https://searchfox.org/mozilla-central/rev/403038737ba75af3842ba6b43b6e2fb47eb06609/dom/animation/DocumentTimeline.cpp#96

We should probably fix this too.
Assignee

Updated

11 months ago
Depends on: 1473172
Assignee

Comment 18

11 months ago
Still happens from WillPaintWindow;

bp-235580ba-cb0f-4ad5-8dcd-8df3c0180705

And on normal tick;

bp-e2ee2b95-118f-481d-90e8-c65230180705
Assignee

Comment 19

11 months ago
I've been trying to reproduce the crash on youtube for a few hours but no luck. :/
[Tracking Requested - why for this release]:
Pretty high volume of crashes (701) and the signature 'MergeState::HasMatchingItemInOldList' is ranked #2 in content top-crashers.
Keywords: topcrash
Reporter

Updated

11 months ago
Blocks: 1472525
Assignee

Comment 21

11 months ago
Today, I did setup my old laptop (core2duo machine), and am trying to reproduce the crash, but no luck so far.  Do I need to login there?  I am going to try with logged-in state.
Assignee

Comment 22

11 months ago
I thought, finally, I can reproduce this crash, but that was a difference crashe. :/  Filed bug 1474231 for that.
Assignee

Comment 23

11 months ago
Brian suggested me that modifying nsRefreshDriver to make time changes happens between tick and paint artificially.  Attaching patch advances the refresh driver timestamp ALWAYS before paint.  Though even with this patch I couldn't reproduce the crash on youtube.com, some test cases caused the crash on a try with this patch, e.g test_bug1354633_media_error.html [1].  This looks pretty much the same as the crash on youtube.com.  In this test case, the problematic element is 'scrubberStack' div [2], and the animation in question seems this controlBar animation [3] (note that this is a script animation, and 'scrubberStack' doesn't have any animations).

Tomorrow I am going to keep investigating in further detail, but I am wondering it is still possible that we use an advanced refresh driver time for paint process?  We've already fixed two possibilities (bug 1472076 and bug 1473172), and I can't find any other suspicious points so far unfortunately.

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=187305801&repo=try&lineNumber=3413
[2] https://hg.mozilla.org/mozilla-central/file/a675c5d7eb76/toolkit/content/widgets/videocontrols.xml#l46
[3] https://hg.mozilla.org/mozilla-central/file/a675c5d7eb76/toolkit/content/widgets/videocontrols.xml#l1109
Assignee

Comment 24

11 months ago
I think I could replicate the crash on youtube.com with below modification.  I don't yet understand what's going on there, but anyway I take this.

diff --git a/layout/painting/nsDisplayList.cpp b/layout/painting/nsDisplayList.cpp
--- a/layout/painting/nsDisplayList.cpp
+++ b/layout/painting/nsDisplayList.cpp
@@ -2770,7 +2770,9 @@ already_AddRefed<LayerManager> nsDisplay
   aBuilder->SetIsCompositingCheap(temp);
 
   if (document && widgetTransaction) {
-    TriggerPendingAnimations(document, layerManager->GetAnimationReadyTime());
+    TimeStamp now = layerManager->GetAnimationReadyTime() +
+                      TimeDuration::FromMilliseconds(100);
+    TriggerPendingAnimations(document, now);// layerManager->GetAnimationReadyTime());
   }
Assignee: nobody → hikezoe
Status: NEW → ASSIGNED
Reporter

Comment 25

11 months ago
(In reply to Hiroyuki Ikezoe (:hiro) from comment #23)
> Created attachment 8990945 [details]
> Advance refresh driver time before paint
> 
> Brian suggested me that modifying nsRefreshDriver to make time changes
> happens between tick and paint artificially.  Attaching patch advances the
> refresh driver timestamp ALWAYS before paint.  Though even with this patch I
> couldn't reproduce the crash on youtube.com, some test cases caused the
> crash on a try with this patch, e.g test_bug1354633_media_error.html [1]. 
> This looks pretty much the same as the crash on youtube.com.  In this test
> case, the problematic element is 'scrubberStack' div [2], and the animation
> in question seems this controlBar animation [3] (note that this is a script
> animation, and 'scrubberStack' doesn't have any animations).

This is super confusing! If the changed element doesn't have an animation, how does changing the refresh driver tick time affect it?
Assignee

Comment 26

11 months ago
(In reply to Matt Woodrow (:mattwoodrow) from comment #25)
> (In reply to Hiroyuki Ikezoe (:hiro) from comment #23)
> > Created attachment 8990945 [details]
> > Advance refresh driver time before paint
> > 
> > Brian suggested me that modifying nsRefreshDriver to make time changes
> > happens between tick and paint artificially.  Attaching patch advances the
> > refresh driver timestamp ALWAYS before paint.  Though even with this patch I
> > couldn't reproduce the crash on youtube.com, some test cases caused the
> > crash on a try with this patch, e.g test_bug1354633_media_error.html [1]. 
> > This looks pretty much the same as the crash on youtube.com.  In this test
> > case, the problematic element is 'scrubberStack' div [2], and the animation
> > in question seems this controlBar animation [3] (note that this is a script
> > animation, and 'scrubberStack' doesn't have any animations).
> 
> This is super confusing! If the changed element doesn't have an animation,
> how does changing the refresh driver tick time affect it?

In this case, the 'scrubberStack' is a descendant of the controlBar animation, so if the animating frame  doesn't call MarkNeedsDisplayItemRebuild(), the crash happens.  That's said, after some more digging into fullscreen stuff, I realized that nsRefreshDriver::Freeze/Thaw is called only for chrome window [1], so it shouldn't affect content window, which means the crash on youtube.com is not fixed by either bug 1472076 or bug 1473172.  They fixed the cases once we enable layout.display-list.retain.chrome. :)

[1] https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/dom/base/nsGlobalWindowOuter.cpp#4338
Assignee

Comment 27

11 months ago
(In reply to Hiroyuki Ikezoe (:hiro) from comment #26)
> In this case, the 'scrubberStack' is a descendant of the controlBar
> animation, so if the animating frame  doesn't call
> MarkNeedsDisplayItemRebuild(), the crash happens.  That's said, after some
> more digging into fullscreen stuff, I realized that
> nsRefreshDriver::Freeze/Thaw is called only for chrome window [1], so it
> shouldn't affect content window, which means the crash on youtube.com is not
> fixed by either bug 1472076 or bug 1473172.  They fixed the cases once we
> enable layout.display-list.retain.chrome. :)

To be clear, fullscreen stuff is definitly involved with the crash on youtube.com, but I believe the refresh driver timer changes by Freeze/Thaw is not a cause of the crash.
Assignee

Updated

11 months ago
See Also: → 1475155
Assignee

Updated

11 months ago
Depends on: 1475155
See Also: 1475155
Assignee

Updated

10 months ago
Depends on: 1475769
Assignee

Updated

10 months ago
No longer depends on: 1475155
Assignee

Comment 28

10 months ago
Note that bug 1475769 landed in buildid: 20180715220110.
Assignee

Comment 29

10 months ago
Bug 1475769 fixed the crash happened by WillPaint.  There are still three reports that caused by normal nsRefreshDriver::Tick.  The only one suspicious place for that I can think of is AddRefreshObserver call [1] in nsRefreshDriver::IsWaitingForPaint().  I can't make it work (i.e how to cause this crash) in my brain yet, though I hope the case is pretty rare, I am going to figure out a bit more.

[1] https://hg.mozilla.org/mozilla-central/file/2ed1506d1dc7/layout/base/nsRefreshDriver.cpp#l2265
Looks like the big issue here is fixed by bug 1475769, which is headed for beta 62 as well. So, I'll untrack this, and if you do fix the left over issues and want to uplift that fix to beta, please don't hesitate to nominate for uplift again. Thanks!
Assignee

Comment 31

10 months ago
I found a crash report [1] commenting the crash happened on https://theconversation.com/crispr-cas9-gene-editing-scissors-are-less-accurate-than-we-thought-but-there-are-fixes-100007 , and I can reproduce the crash locally. :)  I am going to debug it.

[1] https://crash-stats.mozilla.com/report/index/458861f0-2288-443b-84ce-6183c0180717
(In reply to Hiroyuki Ikezoe (:hiro) from comment #31)
> I found a crash report [1] commenting the crash happened on
> https://theconversation.com/crispr-cas9-gene-editing-scissors-are-less-
> accurate-than-we-thought-but-there-are-fixes-100007 , and I can reproduce
> the crash locally. :)  I am going to debug it.
> 
> [1]
> https://crash-stats.mozilla.com/report/index/458861f0-2288-443b-84ce-
> 6183c0180717

Hiro, were you able to reproduce and debug the crash? Thanks
Flags: needinfo?(hikezoe)
Assignee

Comment 33

10 months ago
Yes, I can still reproduce the crash on m-c rev: dd386b5b9fa7f5cd6dc4bbbfa0503b3eb2969af5.  And I was debugging it two weeks ago but haven't finished yet.
Flags: needinfo?(hikezoe)

Updated

10 months ago
Blocks: 1476971
Reporter

Updated

10 months ago
Crash Signature: [@ MergeState::HasMatchingItemInOldList] [@ nsDisplayItem::GetOldListIndex ] → [@ MergeState::HasMatchingItemInOldList] [@ nsDisplayItem::GetOldListIndex ] [@ bool nsDisplayItem::GetOldListIndex ]
Reporter

Updated

10 months ago
Duplicate of this bug: 1481377
Any progress on this? I think the debug on #c33 might address concerns on bug 1476971 as well.
Flags: needinfo?(hikezoe)
Assignee

Comment 36

9 months ago
Nothing particular.  I haven't taken time for this for a while, unfortunately.
Flags: needinfo?(hikezoe)
Marking as fix-optional for 63 as it only affects the aurora channel.
Duplicate of this bug: 1476971

Comment 39

8 months ago
Posted file opt asan stack
bughunter can reproduce on nightly 64 but not release 62 nor beta 63 on https://www.apple.com/in/ipad-pro/ and various other language localizations on Linux. Opt asan gives

==2987==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000001 (pc 0x555d5790d87a bp 0x7ffdaed9a690 sp 0x7ffdaed9a520 T0)
==2987==The signal is caused by a WRITE memory access.
==2987==Hint: address points to the zero page.
    #0 0x555d5790d879 in MOZ_CrashPrintf /builds/worker/workspace/build/src/mfbt/Assertions.cpp:55:5
    #1 0x7fea7ee49762 in GetOldListIndex /builds/worker/workspace/build/src/layout/painting/nsDisplayList.h:3171:7
    #2 0x7fea7ee49762 in MergeState::HasMatchingItemInOldList(nsDisplayItem*, Index<OldListUnits>*) /builds/worker/workspace/build/src/layout/painting/RetainedDisplayListBuilder.cpp:449
Very low-volume crash, so I'm marking this fix-optional.

Updating flags as this is still visible from 66-68, albeit in fairly low volume.

You need to log in before you can comment on or make changes to this bug.