Closed Bug 1064107 Opened 5 years ago Closed 5 years ago

crash in RtlEnterCriticalSection | LockImpl::Lock()

Categories

(Core :: Graphics: Layers, defect, critical)

34 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla35
Tracking Status
firefox32 --- unaffected
firefox33 + verified
firefox34 + verified
firefox35 + verified

People

(Reporter: lizzard, Assigned: nical)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-3b3598a4-6210-4841-b9cb-74b742140905.
=============================================================

#1 topcrash, 373/1476 crashes in the last 3 days for Firefox 34.0a2 (Aurora)
It looks like a startup crash on Windows platforms.  Also crashing Nightly and Beta.

While the signature has shown up at a low level since 2012, it spiked dramatically with the 2014083103 build. 

Crashing thread:

0 	ntdll.dll 	RtlEnterCriticalSection 	
1 	xul.dll 	LockImpl::Lock() 	ipc/chromium/src/base/lock_impl_win.cc
2 	xul.dll 	MessageLoop::PostTask_Helper(tracked_objects::Location const&, Task*, int, bool) 	ipc/chromium/src/base/message_loop.cc
3 	xul.dll 	MessageLoop::PostTask(tracked_objects::Location const&, Task*) 	ipc/chromium/src/base/message_loop.cc
4 	xul.dll 	mozilla::layers::CompositorParent::CompositorParent(nsIWidget*, bool, int, int) 	gfx/layers/ipc/CompositorParent.cpp
5 	xul.dll 	nsBaseWidget::NewCompositorParent(int, int) 	widget/xpwidgets/nsBaseWidget.cpp
6 	xul.dll 	nsBaseWidget::CreateCompositor(int, int) 	widget/xpwidgets/nsBaseWidget.cpp
7 	xul.dll 	nsBaseWidget::CreateCompositor() 	widget/xpwidgets/nsBaseWidget.cpp
8 	xul.dll 	nsWindow::GetLayerManager(mozilla::layers::PLayerTransactionChild*, mozilla::layers::LayersBackend, nsIWidget::LayerManagerPersistence, bool*) 	widget/windows/nsWindow.cpp
9 	mozjs.dll 	js::EmptyShape::getInitialShape(js::ExclusiveContext*, js::Class const*, js::TaggedProto, JSObject*, JSObject*, unsigned int, unsigned int) 	js/src/vm/Shape.cpp
10 	xul.dll 	mozilla::dom::CanvasRenderingContext2D::DrawImage(mozilla::dom::HTMLImageElementOrHTMLCanvasElementOrHTMLVideoElement const&, double, double, double, double, double, double, double, double, unsigned char, mozilla::ErrorResult&) 	dom/canvas/CanvasRenderingContext2D.cpp
Component: General → Graphics: Layers
Usually RtlEnterCriticalSection means an uninitialized CS, but the code seems to initialize correctly. At first glance I can't tell what's causing this.

Liz, can you post a regression range for the spike on the 31st?
Flags: needinfo?(lhenry)
This probably started when 34 was still on nightly, we see the signature on 35 nightly as well.

That said, the data on https://crash-analysis.mozilla.com/rkaiser/2014-09-07/2014-09-07.firefox.nightly.explosiveness.html makes it looks very much like 2014-08-31 is the start of this. And looking at the nightly channel with the start date set to a number of days before 8/32 on https://crash-stats.mozilla.com/report/list?signature=RtlEnterCriticalSection%20|%20LockImpl%3A%3ALock%28%29#tab-graph also points to the nightly build of the 31st to be the first to have the issue.
Here is a not-very-narrow window for now...
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=eed9fe35a00d&tochange=1db35d2c9a2f
Flags: needinfo?(lhenry)
CompositorLoop() is null here: http://hg.mozilla.org/mozilla-central/file/152ef25e89ae/gfx/layers/ipc/CompositorParent.cpp#l234

Bas is your FIXME still true?

I don't see what in the pushlog would cause this, though...
Flags: needinfo?(bas)
(In reply to David Major [:dmajor] from comment #4)
> CompositorLoop() is null here:
> http://hg.mozilla.org/mozilla-central/file/152ef25e89ae/gfx/layers/ipc/
> CompositorParent.cpp#l234
> 
> Bas is your FIXME still true?
> 
> I don't see what in the pushlog would cause this, though...

I believe it was Nical that wrote this particular code.. fwiw, I believe it is, but Nical, can you tell me?
Flags: needinfo?(bas) → needinfo?(nical.bugzilla)
This comment means "we are sending a pointer (this) to another task and the reason we know it will still be alive is because whatever can delete the instance will be created after in the thread that receives the task". It is still true AFAICT. It's not related to the event loop being null, though.
Flags: needinfo?(nical.bugzilla)
This is the #1 topcrash on Firefox 34.0a2 with 1511/10042 crashes in the last 7 days. 
It looks like it is most often a startup crash on Windows 7. 
More reports: https://crash-stats.mozilla.com/report/list?signature=RtlEnterCriticalSection+|+LockImpl%3A%3ALock%28%29&range_value=7&range_unit=days&date=2014-09-15#tab-reports
> It's not related to the event loop being null, though.

Who would be the best person to investigate a null CompositorLoop()?
Flags: needinfo?(nical.bugzilla)
(In reply to David Major [:dmajor] from comment #8)
> > It's not related to the event loop being null, though.
> 
> Who would be the best person to investigate a null CompositorLoop()?

bjacob, maybe? it looks like some code is trying to create a layer manager in a process where all of the OMTC stuff hasn't been initialized (yet?), that is CompositorParent::StartUp() hasn't been called (the startup happens in gfxPlatform::Startup so if we are too early, then we are really really early).
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bjacob)
I'm not current on what changed with OMTC-on-desktop (this being on 34.0a2 is a strong indication that it's OMTC related, and I didn't work on that), but the interesting thing here is how there is Canvas2D stuff in the call stack. It seems as if this startup crash occurred when a user started Firefox directly on a page using Canvas2D, and that tried to talk to the compositor before it got started up.

Bas/Nical, I think that we might have to update the Canvas2D code to wait for the compositor to start up before doing things that interact directly with it.
Flags: needinfo?(bjacob)
Setting a ni? for comment #10 as we really need some progress here.
Flags: needinfo?(bas)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #11)
> Setting a ni? for comment #10 as we really need some progress here.

I've been unable to reproduce this so far, it doesn't make a lot of sense that we could create a canvas -before- we initialize the platform. But possible some add-on code or something could do this? It probably won't be hard to construct a work-around here that will create minimal artifacts.

One question to assist in diagnosing this, am I correct in understanding from earlier comments this is -not- happening in Beta?
Flags: needinfo?(bas)
(In reply to Benoit Jacob [:bjacob] from comment #10)
> but the interesting thing here is how there is Canvas2D stuff in the call
> stack. It seems as if this startup crash occurred when a user started
> Firefox directly on a page using Canvas2D, and that tried to talk to the
> compositor before it got started up.

Even before gfxPlatform::Startup() ? This is when (IIUC) we instantiate the compositor thread.
(cf: gfxPlatform::Startup -> gfxPlatform::InitLayersIPC -> CompositorParent::StartUp -> new CompositorThreadHolder).
(In reply to Bas Schouten (:bas.schouten) from comment #12)
> One question to assist in diagnosing this, am I correct in understanding
> from earlier comments this is -not- happening in Beta?

Yes. This only happens on 34 Aurora and on 35 Nightly.
Duplicate of this bug: 1061744
Updating the tracking flags wrt comment #14
This is the #2 topcrasher for Firefox 34.0a2 with 1133/8332 crashes in the last 7 days. 

It looks like a pretty common problem and one that we probably need to fix if OMTC is on for 34 as it moves toward release.
Bas, I understand we can't reproduce this locally yet, but can we at least add some logging to help us?
Assignee: nobody → bas
RtlFreeHeap | LockImpl::Lock() reports have similar stacks. I am not sure if this is the same issue but it's showing up as the #8 top crash in Aurora right now, and maybe it's related.
Crash Signature: [@ RtlEnterCriticalSection | LockImpl::Lock()] → [@ RtlEnterCriticalSection | LockImpl::Lock()] @ RtlFreeHeap | LockImpl::Lock() ]
Crash Signature: [@ RtlEnterCriticalSection | LockImpl::Lock()] @ RtlFreeHeap | LockImpl::Lock() ] → [@ RtlEnterCriticalSection | LockImpl::Lock()] [@ RtlFreeHeap | LockImpl::Lock() ]
Actually, 33 is affected, but at much lover volume, and it's so late in the cycle, so let's mark it wontfix.
These sigs are now the #1 and #7 topcrashers in Aurora. My worry is that it will affect many users as 34 goes into Beta. This is just an update on the crash volume; I know that doesn't particularly help diagnose the problem.
:nical, can we wallpaper this and see if it helps?
Flags: needinfo?(nical.bugzilla)
I find it slightly worrying that we are executing js and canvas2d stuff before gfxPlatform::Init, but it looks like the latter gets initialized lazily so maybe we got really unlucky and never called gfxPlatform::GetPlatform() (which triggers Init) by the time we tried to create the compositor.
This patch should fix the issue if this is the case.
Assignee: bas → nical.bugzilla
Attachment #8502317 - Flags: review?(bas)
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8502317 [details] [diff] [review]
all GetPlatform at the beginning of CreateCompositor to make sure that gfxPlatfrom gets initialized

Review of attachment 8502317 [details] [diff] [review]:
-----------------------------------------------------------------

It can't hurt I suppose...
Attachment #8502317 - Flags: review?(bas) → review+
https://hg.mozilla.org/mozilla-central/rev/5c6980f9caff
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Oh, awesome! I'll keep an eye on it on 35 and maybe it'll be good to uplift.
QA Contact: lhenry
This didn't appear to affect anything, at least not obviously, in Nightly. This crash signature wasn't showing up as a topcrasher in 35 though, at least not for the past month. There were only a few crashes reported for 35.  None in 35 past 10-6. 

Let's try uplifting it to 34. Lawrence, what do you think?
Flags: needinfo?(lmandel)
(In reply to Liz Henry :lizzard from comment #29)
> This didn't appear to affect anything, at least not obviously, in Nightly.

Well, we didn't have any crashes in builds after this landed on Nightly, the last crash happened in the 10/10 build, which was just before this was merged to m-c. That said, we only had a total of 8 of those crashes in all October on Nightly 35, whereas it's a top crash on Aurora 34. We really need to uplift to know if the fix worked, and I'd hope we can do this before we actually go to build with b1 today, ideally.
And for documentation, the query I used as a basis for the comment above is here: https://crash-stats.mozilla.com/search/?signature=~LockImpl%3A%3ALock%28%29&release_channel=nightly&date=%3E2014-10-01&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=build_id&_columns=platform (click on the build id facet and then on its build id colum to sort by build IDs)
Nicolas can you nominate this for uplift to 34? Thanks!  I guess I could but I we usually leave it to the developer or reviewer.
Comment on attachment 8502317 [details] [diff] [review]
all GetPlatform at the beginning of CreateCompositor to make sure that gfxPlatfrom gets initialized

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: Crashes
[Describe test coverage new/current, TBPL]:
[Risks and why]: None. The patch is trivial. In most cases it's a no-op, in the other cases we are already crashing anyway, so there is no risk of making things worse.
[String/UUID change made/needed]:
Attachment #8502317 - Flags: approval-mozilla-beta?
Attachment #8502317 - Flags: approval-mozilla-aurora?
Attachment #8502317 - Flags: approval-mozilla-beta?
Attachment #8502317 - Flags: approval-mozilla-beta+
Attachment #8502317 - Flags: approval-mozilla-aurora?
Attachment #8502317 - Flags: approval-mozilla-aurora+
Comment on attachment 8502317 [details] [diff] [review]
all GetPlatform at the beginning of CreateCompositor to make sure that gfxPlatfrom gets initialized

Aurora has already landed
Attachment #8502317 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora-
Flags: needinfo?(lmandel)
Flags: qe-verify+
Looking in Socorro ([1] & [2]) for crashes with the two signatures over the past 14 days (since the fix landed in 35 on October 10) I see the following data:

RtlFreeHeap | LockImpl::Lock()
-	Firefox 34 Beta 1 and newer – 3 crashes in 34 Beta 1
-	Firefox 35 Aurora (builds after October 11) – NO crashes

RtlEnterCriticalSection | LockImpl::Lock()
-	Firefox 34 Beta 1 and newer – 399 crashes in 34 Beta 1, 26 crashes in 34 Beta 2
-	Firefox 35 Aurora (builds after October 11) – 6 crashes
-	Firefox 36 Nightly (builds after October 11) – 4 crashes

[1] https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=14&signature=RtlFreeHeap+%7C+LockImpl%3A%3ALock%28%29#tab-reports
[2] https://crash-stats.mozilla.com/report/list?product=Firefox&query_type=contains&range_unit=days&process_type=any&hang_type=any&signature=RtlEnterCriticalSection+%7C+LockImpl%3A%3ALock%28%29&date=2014-10-24+12%3A00%3A00&range_value=14#tab-sigsummary
Comment on attachment 8502317 [details] [diff] [review]
all GetPlatform at the beginning of CreateCompositor to make sure that gfxPlatfrom gets initialized

Let's take it for 33.0.3...

Nical, Bas, you confirm that you didn't see any regression from this?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bas)
Attachment #8502317 - Flags: approval-mozilla-release+
We need in the GECKO331_2014103013_RELBRANCH relbranch too :) thanks again
No regression.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bas)
Also landed on GECKO330_2014101104_RELBRANCH for 33.0.3:
https://hg.mozilla.org/releases/mozilla-release/rev/691739025fac
Looking in Socorro ([1] & [2]) for crashes with the two signatures over the past 28 days I see the following data:

RtlFreeHeap | LockImpl::Lock()
- Firefox 33.1 & 33.0.3 or later - 1 crash in 33.0.3
- Firefox 34 Beta 1 and newer – 7 crashes
- Firefox 35 Aurora – NO crashes
- Firefox 36 Nightly – NO crashes

RtlEnterCriticalSection | LockImpl::Lock()
- Firefox 33.1 & 33.0.3 or later - NO crashes
- Firefox 34 Beta 1 and newer – 397 crashes in 34 Beta 1, 22 crashes in 34 Beta 2, NO crashes in Beta 3 or later
- Firefox 35 Aurora – 5 crashes
- Firefox 36 Nightly – 3 crashes

Comparing to results from 3 weeks ago (comment 36), and considering the low number of occurences, I think it's safe to call this fixed.

[1] - https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=RtlFreeHeap+%7C+LockImpl%3A%3ALock%28%29
[2] - https://crash-stats.mozilla.com/report/list?product=Firefox&query_type=contains&range_unit=days&process_type=any&hang_type=any&signature=RtlEnterCriticalSection+%7C+LockImpl%3A%3ALock%28%29&date=2014-10-24+12%3A00%3A00&range_value=28
You need to log in before you can comment on or make changes to this bug.