Closed Bug 1251150 Opened 8 years ago Closed 5 years ago

top crash in DecrementVisibleCount

Categories

(Core :: Layout: Images, Video, and HTML Frames, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s - ---
firefox47 --- fixed
firefox48 --- wontfix
firefox-esr45 --- wontfix

People

(Reporter: ting, Assigned: tnikkel)

Details

(Keywords: crash)

Crash Data

Attachments

(5 files)

This bug was filed from the Socorro interface and is 
report bp-1d890811-8a45-4174-a1c5-a26542160218.
=============================================================
This is the #28 top crash for e10s and doesn't show up in non-e10s.
Timothy, this is in UpdateImageVisibility code which you seem to have worked on -- any chance you could diagnose this and/or come up with some diagnostics we could add to test theories about how/why we might be crashing here?

The crash is a null deref in DecrementVisibleCount, at this line:
>    iter.Get()->GetKey()->DecrementVisibleCount(aNonvisibleAction);

Here, 'iter' is traversing a nsTHashtable<nsRefPtrHashKey<nsIImageLoadingContent>>, which holds the swapped contents of mVisibleImages.

I think the fact that we null-crash here implies that we somehow ended up with a null pointer inside of that hash table...

But, I don't immediately see how that happens, because every insertion seems to either null-check the inserted thing, or do something else right around the insertion that would make us crash earlier if we were inserting a null pointer.
Component: Layout → Layout: Images
Flags: needinfo?(tnikkel)
Summary: crash in DecrementVisibleCount → top crash in DecrementVisibleCount
(In reply to Ting-Yu Chou [:ting] from comment #0)
> This is the #28 top crash for e10s and doesn't show up in non-e10s.

I take it by "doesn't show up" you mean it's not a *topcrash* for non-e10s, right?

Because I *do* see some crashes with this signature that seem likely to be non-e10s, e.g. this one in Firefox 43.0.4: bp-0dc7387d-d13f-4369-9257-90f262160130
...and this one on Firefox 44:
bp-c1565ae8-2b60-4617-8e19-c34b02160130

(I'm not 100% sure how to tell, but I'm assuming that these crash reports are from non-e10s builds, since they're on release channel.)
Flags: needinfo?(janus926)
(In reply to Daniel Holbert [:dholbert] from comment #2)
> (In reply to Ting-Yu Chou [:ting] from comment #0)
> > This is the #28 top crash for e10s and doesn't show up in non-e10s.
> 
> I take it by "doesn't show up" you mean it's not a *topcrash* for non-e10s,
> right?
> 

I am sorry, you're right. If you don't mind, could you tell me how did you find the same crashes, I am still new to crash-stats.

I looked around, and found there's a "Product" in "Summary" which list the versions, is this where you found the other crashes for non-e10s? Thanks.
Flags: needinfo?(janus926)
(In reply to Ting-Yu Chou [:ting] from comment #3)
> I am sorry, you're right. If you don't mind, could you tell me how did you
> find the same crashes, I am still new to crash-stats.

I clicked "More Reports", near the top of your crash report linked in comment 0, on the "Signature" row. Then I clicked the "Reports" tab.  This brings up a table showing a bunch of recent crash reports with this signature, & their version numbers.
One more note: when we crash, we're iterating over a nsTHashtable "oldVisibleImages" that we've *just* populated via a call to SwapElements:
> 5643   // Remove the entries of the mVisibleImages hashtable and put them in
> 5644   // oldVisibleImages.
> 5645   nsTHashtable< nsRefPtrHashKey<nsIImageLoadingContent> > oldVisibleImages;
> 5646   mVisibleImages.SwapElements(oldVisibleImages);
http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.cpp?rev=c8be68703225#5645

Note that the nsTHashtable::SwapElements method was only added last year, for this specific call site, in bug 1120198.

It seems possible that nsTHashtable::SwapElements might do the wrong thing in some edge case, and that's involved here... (e.g. maybe due to some weirdness in PLDHashtable's move copy-constructor, or one of its member vars' copy constructors methods...)
Also: multiple crash reports mention flash, and 2 mention that the user manually killed plugin-container (e.g. bp-096bdf89-3952-4fb3-9d85-4cded2160202) for flash, and that this crash happened right after that.
I'll look into nsTHashtable::SwapElements().
Assignee: nobody → janus926
Checked PLDHashTable(PLDHashTable&&) and PLDHashTable::operator=(PLDHashTable&&), but I didn't see anything suspicious, the member vars of PLDHashTable are primitives.
It is GetKey() returning 0x0 from disassembly:

  iter.Get()->GetKey()->DecrementVisibleCount(aNonvisibleAction);

nsRefPtrHashKey does not have an interface to change |mKey|, also nsPreshShell does not call mVisibleImages.PutEntry() with a null pointer. Not sure how could |mKey| be 0.
I've traced following functions, but still didn't see anything could alter nsRefPtrHashKey.mKey:

  PLDHashTable::Clear(),
  PLDHashTable::Add(),
  PLDHashTable::RemoveEntry()

Also tried to reproduce the issue locally by visiting:

  www.cracked.com,
  www.foxnews.com,
  zyngagames.com/play/farmville-two

but no luck yet.

Furthermore, there are few crashes not accessing 0x0 but some other addresses, it could be actually a memory corruption.
The only thought I have is maybe we are in a bad state before this and only crashing here.

So how do we detect that? We can start by checking if the presshell is in a good state. Things we can check: mIsZombie, mIsDestroying, mIsReflowing, mPaintingIsFrozen, mFrozen, mForwardingContainer, mIsNeverPainting, mHaveShutDown, mIsDocumentGone, nsContentUtils::IsSafeToRunScript(). We can use gfxCriticalNote to annotate the crash report just before we crash.

How long will we have to long diagnostic code in before we get results? Is nightly enough?
Attached patch add annotationsSplinter Review
Attachment #8728793 - Flags: review?(dholbert)
Attachment #8728793 - Flags: review?(dholbert) → review+
Keywords: leave-open
Are you confident that nothing can mutate mVisibleImages during enumeration of it, e.g., inside of ClearVisibleImagesList or PresShell::UpdateImageLockingState.  Though it's not clear to me how that could lead to corruption of mVisibleImages that doesn't show up until a later enumeration.
From minidumps:

  b2662f4a-a670-4930-b7ca-f6b7c2160310,
  1d890811-8a45-4174-a1c5-a26542160218

mVisiblieImages.mTable.mOps has something interesting:

  hashKey    = nsTHashtable<nsPtrHashKey<mozilla::embedding::PPrintSettingsDialogParent>>::s_HashKey
  matchEntry = nsTHashtable<nsPtrHashKey<mozilla::FrameLayerBuildier::DisplayItemData>>::s_MatchEntry
  moveEntry  = 0x0
  clearEntry = 0x0
  initEntry  = 0x0

Wonder if there's anything related to this comment:

  https://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTHashtable.h#355-358
(In reply to Ting-Yu Chou [:ting] from comment #15)
> From minidumps:
> 
>   b2662f4a-a670-4930-b7ca-f6b7c2160310,
>   1d890811-8a45-4174-a1c5-a26542160218

Correction: it's from the latter one, the former one has different hashKey and matchEntry but also has 0x0 for moveEntry, clearEntry and initEntry.
(In reply to David Baron [:dbaron] ⌚️UTC+8 from comment #14)
> Are you confident that nothing can mutate mVisibleImages during enumeration
> of it, e.g., inside of ClearVisibleImagesList or
> PresShell::UpdateImageLockingState.  Though it's not clear to me how that
> could lead to corruption of mVisibleImages that doesn't show up until a
> later enumeration.

Good find. I was thinking that because RebuildImageVisibility iterates over a copy of the hashtable we'd be safe from that.

I went through everything that can happen via ClearVisibleImagesList and UpdateImageLockingState with a fine tooth comb. The only possibility I found is discarding an image sends a image notification observer notification. I can only find test code that observes the discard notification in our tree.

I'll wait until we get a crash with the new annotations (should take < 1 week based on the current crashes). Then I'll try making Discard send notifications async and/or making a copy of the hashtable to iterate over.
(In reply to Ting-Yu Chou [:ting] from comment #15)
> mVisiblieImages.mTable.mOps has something interesting:
> 
>   initEntry  = 0x0

When it's null, PLDHashTable::Add() skips initializing mKey, but still has mKeyHash set.

https://dxr.mozilla.org/mozilla-central/source/xpcom/glue/PLDHashTable.cpp#581-584
First crash in a build with the annotations
https://crash-stats.mozilla.com/report/index/0c494a56-eec1-445b-a48f-f8c5d2160314
I don't see any gfxCriticalNote's anywhere, but I've never seen them in a crash report before so I'm not sure if I'm looking in the right places.
IIRC, they show up in "App Notes".  I don't see any of your messages in the raw dump
either so I think none of the conditions you tested were true.

> Then I'll try making Discard send notifications async and/or
> making a copy of the hashtable to iterate over.

It would be nice to know how this can happen before that.
Perhaps it would be worth to check for some kind of reentrancy?
e.g. in ClearVisibleImagesList, set a flag on the shell before the call to
DecrementVisibleCount and clear it afterwards.  Then add something like this:

if (mTheNewFlag)
  gfxCriticalNote << "mVisibleImages is busy! " << __FILE__ << ":" << __LINE__;

at the start of every method that can possibly change mVisibleImages.

Also, just in case, we should probably spray all nsPresShell methods with:
MOZ_ASSERT(NS_IsMainThread());
(In reply to Mats Palmgren (:mats) from comment #21)
> IIRC, they show up in "App Notes".  I don't see any of your messages in the
> raw dump
> either so I think none of the conditions you tested were true.

Thanks.

> > Then I'll try making Discard send notifications async and/or
> > making a copy of the hashtable to iterate over.
> 
> It would be nice to know how this can happen before that.
> Perhaps it would be worth to check for some kind of reentrancy?
> e.g. in ClearVisibleImagesList, set a flag on the shell before the call to
> DecrementVisibleCount and clear it afterwards.  Then add something like this:
> 
> if (mTheNewFlag)
>   gfxCriticalNote << "mVisibleImages is busy! " << __FILE__ << ":" <<
> __LINE__;
> 
> at the start of every method that can possibly change mVisibleImages.

Yeah, I agree. I'll write something like that.
This will confirm if we are re-entering via discard. And give us a little information on how.
Attachment #8730443 - Flags: review?(mats)
Comment on attachment 8730443 [details] [diff] [review]
add crash annotations for re-entrancy

LGTM.  Perhaps add "<< GetCurrentThreadId()" too if that's printable
on all platforms?
Attachment #8730443 - Flags: review?(mats) → review+
I couldn't see anything anywhere near layout that used GetCurrentThreadId() so I didn't want to pull in some headers. I just used NS_IsMainThread and used that to record on of off main thread in the note.
No crashes with the new annotations yet but while browsing I found two interesting crashes

https://crash-stats.mozilla.com/report/index/e2913662-1186-471f-93b6-c78732160318
https://crash-stats.mozilla.com/report/index/45a035f2-d070-458e-be38-0b2972160318

Look like probably from the same users, and startup crashes. The stacks seem reasonable and both go from nsXULWindow::ShowModal() and then into NS_ProcessNextEvent(), and that processes something which leads to image visibility. The crash addresses are not near null and the crash reason is EXCEPTION_PRIV_INSTRUCTION, so seems like it's a different way to crash here.
Unassign myself because it is not e10s specific crash.
Assignee: janus926 → nobody
No crashes on nightly in over a week now, might have to uplift to aurora to get info.
Sounds like a good idea to me.
Comment on attachment 8728793 [details] [diff] [review]
add annotations

We are trying to diagnose this crash by adding crash annotations, but the frequency isn't high enough on nightly, so we want more coverage with aurora.

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: none, only adds crash annotations
[String/UUID change made/needed]: none
Attachment #8728793 - Flags: approval-mozilla-aurora?
Comment on attachment 8730443 [details] [diff] [review]
add crash annotations for re-entrancy

We are trying to diagnose this crash by adding crash annotations, but the frequency isn't high enough on nightly, so we want more coverage with aurora.

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: none, only adds crash annotations
[String/UUID change made/needed]: none
Attachment #8730443 - Flags: approval-mozilla-aurora?
QA Contact: tnikkel
Assignee: nobody → tnikkel
QA Contact: tnikkel
Finally a crash in a build with our annotations
https://crash-stats.mozilla.com/report/index/c494c575-f13d-49d7-a880-c9d6e2160324

Nothing in the app notes.

The current re-entrance checks only catch if we re-enter via a call to DecrementVisibleCount. I'll try to craft a patch that catches any kind of re-entrance.
Comment on attachment 8730443 [details] [diff] [review]
add crash annotations for re-entrancy

Patch for improving crash diagnostic information, Aurora47+
Attachment #8730443 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I wanted to check to make sure the notes would show up in a crash report, so I made try build with this crash report

https://crash-stats.mozilla.com/report/index/2ed9082e-8d1d-40f3-bc83-aa26a2160327

I found that you have to click the metadata tab, and there is a field specifically for gfx critical errors. The error did not show up in the raw dump.
Attachment #8728793 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8735721 [details] [diff] [review]
crash annotations for re-entrancy from somewhere besides DecrementVisibleCount

I fixed clearing the bool for the early returns in DoUpdateApproximateFrameVisibility in my local version of the patch.
Comment on attachment 8735721 [details] [diff] [review]
crash annotations for re-entrancy from somewhere besides DecrementVisibleCount

r=mats

Can we also add an annotation + setting flags in 
RebuildApproximateFrameVisibilityDisplayList ?

It's (only) called from nsSubDocumentFrame on its subdoc shell:
http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsSubDocumentFrame.cpp#576

It seems bad if mInFrameVisibilityUpdate is true already in that
shell for that call.
Attachment #8735721 - Flags: review?(mats) → review+
(In reply to Mats Palmgren (:mats) from comment #41)
> Comment on attachment 8735721 [details] [diff] [review]
> crash annotations for re-entrancy from somewhere besides
> DecrementVisibleCount
> 
> r=mats
> 
> Can we also add an annotation + setting flags in 
> RebuildApproximateFrameVisibilityDisplayList ?
> 
> It's (only) called from nsSubDocumentFrame on its subdoc shell:
> http://mxr.mozilla.org/mozilla-central/source/layout/generic/
> nsSubDocumentFrame.cpp#576
> 
> It seems bad if mInFrameVisibilityUpdate is true already in that
> shell for that call.

It would be bad, but the display list image visibility code should be completely inactive. The line you link to, for example, only happens if |aBuilder->IsForFrameVisibility()|, which is when the display list builder was created with mode |FRAME_VISIBILITY|, which is only done here

http://mxr.mozilla.org/mozilla-central/source/layout/base/nsPresShell.cpp#5959

which is inside an ifdef that is not true. (The reason it is in the tree is to be able to use it to compare to the frame tree walker code to make sure it match the display list results.)
Comment on attachment 8735721 [details] [diff] [review]
crash annotations for re-entrancy from somewhere besides DecrementVisibleCount

Crash volume still low on nightly (name of crashing function changed, I'm searching both old and new, and searching for the notes we are adding to reports in all signatures).

Request uplift to aurora because not enough crash volume to see crashes of this on nightly.

Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: none, only adds crash annotations
[String/UUID change made/needed]: none
Attachment #8735721 - Flags: approval-mozilla-aurora?
Comment on attachment 8735721 [details] [diff] [review]
crash annotations for re-entrancy from somewhere besides DecrementVisibleCount

Needed to better diagnose crashes, Aurora47+
Attachment #8735721 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Needs rebasing for Aurora.
Flags: needinfo?(ryanvm) → needinfo?(wkocher)
Two crashes in builds with the latest annotations
https://crash-stats.mozilla.com/report/index/ab6ab05e-c7b2-48ec-930b-db1182160408
https://crash-stats.mozilla.com/report/index/e69f2492-149a-45d9-832a-b54a72160408
no graphics critical notes in these. I also search every crash report for the landed graphics critical notes and find no hits. But I think this search might not work as it does not find my test crash from comment 38.

So I think the only lead we have left is that the image visibility function may be called off main thread?
Yeah, unless we missed something else. :-)
Perhaps we should just add MOZ_RELEASE_ASSERT(NS_IsMainThread()) in all
pres shell methods?

Maybe we should also wallpaper the problem by adding a local hashtable and do
a SwapElements in PresShell::ClearVisibleImagesList?
That's the only place where we pass mVisibleImages directly to DecrementVisibleCount
AFAICT.  If we don't see any crashes at all after that then we know we missed
something in our analysis.  And if so, we probably want to start from scratch anyway
since the code has changed significantly on trunk.
Fwiw, looking at the crash incidents from comment 50, it appears these installations have
a large number of plugins.  Most are click-to-play of course, and it may be an unrelated
coincidence.

Also, looking at the URLs, I see a large number of these:
wyciwyg://725/https://zynga2-a.akamaihd.net/zap/zac/1/lib/xd_injection_frame.html#...
which looks strange.  I have no clue what that thing is.
(The initial number after // and the part after the # (a different number) varies.)
First crash since the big changes on nightly
https://crash-stats.mozilla.com/report/index/7aaa446b-90c3-4cd8-bdff-02a442160413

We're accessing an nsIFrame now, the hashtable no longer holds a reference obviously, but nsFrame::DestroyFrom removes the frame from the hashtable. The crash address is no longer null, it's the seemingly valid 0x106172c3c8. Not poison, but the memory at that location could be poison, but in this case seems like it's not a valid address we can access.

(In reply to Mats Palmgren (:mats) from comment #52)
> Also, looking at the URLs, I see a large number of these:
> wyciwyg://725/https://zynga2-a.akamaihd.net/zap/zac/1/lib/xd_injection_frame.
> html#...
> which looks strange.  I have no clue what that thing is.
> (The initial number after // and the part after the # (a different number)
> varies.)

Interesting. wyciwyg = what you cache is what you get, sounds like it is used to cache documents created via document.write().
Yeah, that crash looks ... worrying.  Given that destroyed frames are retained in the arena
it means its shell is deleted, otherwise we wouldn't crash on that line.
Boris, do you know what these strange URLs come from:
wyciwyg://725/https://zynga2-a.akamaihd.net/zap/zac/1/lib/xd_injection_frame.html#...
and under what circumstances they would turn up as the URL
in crash data?  (just want to know if they give some clue
about what we're doing when the crash occurs)
Flags: needinfo?(bzbarsky)
(In reply to Mats Palmgren (:mats) from comment #54)
> Yeah, that crash looks ... worrying.  Given that destroyed frames are
> retained in the arena
> it means its shell is deleted, otherwise we wouldn't crash on that line.

When a shell is destroyed it clears it's list of images and revokes the image visibility event. So the only way this could happen is if a frame gets put into the hashtable of the wrong presshell.
> Boris, do you know what these strange URLs come from:

Sure.  They come from documents created via document.open()/document.write()/document.close().  The part after "wyciwyg://NNNN/" indicates what the document URI was before it was blown away via open().  The NNNN is just a global counter.
Flags: needinfo?(bzbarsky)
These crash annotations won't be providing any more benefit, time to back them out. Aurora first cause it needs approval.

Approval Request Comment
[Feature/regressing bug #]: no regression
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: none
[String/UUID change made/needed]: none

These are crash annotations that we added to try to diagnose a crash. They can't provide anymore info, let's remove them.
Attachment #8743561 - Flags: approval-mozilla-aurora?
Comment on attachment 8743561 [details] [diff] [review]
backout the annotations from aurora

Backing out annotations that are not needed anymore, Aurora47+
Attachment #8743561 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Hi Wes, Carsten, we are using the same bug for backout. Just thought I'd ni you guys as I wasn't sure how to set the status-fx47 flag.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Crash volume for signature 'DecrementVisibleCount':
 - nightly (version 50): 0 crashes from 2016-06-06.
 - aurora  (version 49): 0 crashes from 2016-06-07.
 - beta    (version 48): 0 crashes from 2016-06-06.
 - release (version 47): 1430 crashes from 2016-05-31.
 - esr     (version 45): 277 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       0       0       0       0       0       0       0
 - aurora        0       0       0       0       0       0       0
 - beta          0       0       0       0       0       0       0
 - release     173     183     182     180     185     204     216
 - esr          37      21      17      22      24      32      23

Affected platforms: Windows, Linux
Product: Core → Core Graveyard
Product: Core Graveyard → Core

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.