Closed Bug 1261321 Opened 4 years ago Closed 4 years ago

crash in mozilla::layers::CompositableClient::IsConnected

Categories

(Core :: Graphics: Layers, defect, critical)

45 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 --- wontfix
firefox46 + fixed
firefox47 + fixed
firefox48 + fixed
firefox-esr45 46+ fixed

People

(Reporter: philipp, Assigned: nical)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 1 obsolete file)

[Tracking Requested - why for this release]:
newly introduced crash in 45

This bug was filed from the Socorro interface and is 
report bp-61743e6e-3cd2-4e8e-b8ac-a9adc2160320.
=============================================================
Crashing Thread (20)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::layers::CompositableClient::IsConnected() 	gfx/layers/client/CompositableClient.cpp
1 	xul.dll 	mozilla::layers::CompositableClient::Destroy() 	gfx/layers/client/CompositableClient.cpp
2 	xul.dll 	mozilla::layers::ImageBridgeShutdownStep1 	gfx/layers/ipc/ImageBridgeChild.cpp
3 	xul.dll 	RunnableFunction<void (*)(mozilla::layers::ImageClient*, mozilla::layers::PImageContainerChild*), mozilla::Tuple<mozilla::layers::ImageClient*, mozilla::layers::PImageContainerChild*> >::Run() 	ipc/chromium/src/base/task.h
4 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
5 	xul.dll 	base::MessagePumpDefault::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_default.cc
6 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
7 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
8 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc
9 	xul.dll 	mozilla::LazyIdleThread::LazyIdleThread(unsigned int, nsACString_internal const&, mozilla::LazyIdleThread::ShutdownMethod, nsIObserver*) 	xpcom/threads/LazyIdleThread.cpp
10 		@0x164ef86f 	
11 	ntdll.dll 	__RtlUserThreadStart 	
12 	ntdll.dll 	_RtlUserThreadStart

this is a new crash signature in firefox 45 builds and upwards which has
apparently been introduced by the work in bug 121976.
it is mid- to low-level in volume - on 45.0.1 it is #83 & making up 0.15% of
all crashes.
Shutdown bug. Apparently, ImageClient::FromIPDLActor returned nullptr in ImageBridgeShutdownStep1, which is somewhat unexpected. A simple null-check will be enough to fix the crash, although I'd like to understand how we got in this state.
Assignee: nobody → nical.bugzilla
Let's get a simple patch "just stop the crash" ready and into review while we're figuring out the cause?
This is blocking a wrong bug - missing a digit.  Do we know the actual bug number?
Flags: needinfo?(madperson)
sorry, that was bug 1219761 - got the wrong number from https://hg.mozilla.org/mozilla-central/rev/932318a4e150
Blocks: 1219761
No longer blocks: 121976
Flags: needinfo?(madperson)
(In reply to Milan Sreckovic [:milan] from comment #2)
> Let's get a simple patch "just stop the crash" ready and into review while
> we're figuring out the cause?

I am working on a fix that should fix this in a reliable way. If we just null-check we'll run into a similar problem as bug 1241875. It's easier to fix this race condition for CompoaitableClient than TextureClient so I should have a patch ready tomorrow.
The fix is getting a bit more involved than I'd like to but there are several issues at play and fixing one mostly makes us fall on the others.

The root of the problem, from a high level perspective, is that we have written the gfx ipdl actors with the idea that each actors is fully owned by one object (for example CompositableClient fully owns CompositableChild) and that object's destruction drives the actor's destruction. This is mostly true, except when IPDL shuts down and forces the destruction of the actor.
Orchestrating this all becomes a bit complicated when threads get involved.

CompositableClient and TextureClient have the same problems (Well, TextureClient's case is more complex but I've tried to leave that out for a followup patch as much as possible), so it makes sense to fix this at the ChildActor<T> level (TextureChild and COmpositableChild both inherit from it).

This patch does a bunch of things:
- rename DestroyIPDLActor -> DeallocIPDLActor to avoid confusion with the Destroy() methods which involve doing the destruction handshake and sending messages, while DeallocIPDLActor is the hook that gets called when IPDL wants to delete the actor (at this point we can't send messages).
- expose "ReleaseActor" rather than "Destroy" to make it clear that after this is called the TextureClient/CompositableClient should NOT keep their pointer to the actor (which may just have been deleted).
- track what happens and in which order for the following things:
  - ***Client is destroyed and calls ReleaseActor
  - IPDL shuts down (normal or abnormal shutdown)
  - the ***Child actor's destruction handshake is complete
And make sure that the actor is deleted at the right time, ideally when the destruction handshake finishes, if and only if the ***Client has already released its reference to the the ***Child actor (this is where most of the complexity is).

When all the pieces come together other crashes like bug 1241875 should get fixed as well.
Attached patch Null checkSplinter Review
Getting the more-correct-but-complicated approach to work is proving harder than I expected, so in the mean-time, let's just null-check to avoid some of the crashes. I'll pursue the other approach in another bug.
Attachment #8738185 - Attachment is obsolete: true
Attachment #8738626 - Flags: review?(jnicol)
Comment on attachment 8738626 [details] [diff] [review]
Null check

Review of attachment 8738626 [details] [diff] [review]:
-----------------------------------------------------------------

Do we want a warning when compositable is null?
Attachment #8738626 - Flags: review?(jnicol) → review+
(In reply to Jamie Nicol [:jnicol] from comment #8)
> Do we want a warning when compositable is null?

This is going to change soon-ish, and there will be better places to report what puts us in this state. So for now let's just paper over the null dereference.
[Tracking Requested - why for this release]:
newly introduced crash in 45
https://hg.mozilla.org/mozilla-central/rev/6271ce7ff2aa
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Nicolas, could you fill the uplift request for this crash? The volume is quite important. Thanks
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8738626 [details] [diff] [review]
Null check

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: crashes
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: just the simples null-check. Absolutely no risk
[String/UUID change made/needed]: none
Flags: needinfo?(nical.bugzilla)
Attachment #8738626 - Flags: approval-mozilla-beta?
Attachment #8738626 - Flags: approval-mozilla-aurora?
Comment on attachment 8738626 [details] [diff] [review]
Null check

[Triage Comment]

Crash fix, recent regression, let's take this fix for 46. 
Please uplift to aurora, beta, m-r.
Attachment #8738626 - Flags: approval-mozilla-release+
Attachment #8738626 - Flags: approval-mozilla-beta?
Attachment #8738626 - Flags: approval-mozilla-beta+
Attachment #8738626 - Flags: approval-mozilla-aurora?
Attachment #8738626 - Flags: approval-mozilla-aurora+
This would land in an RC2 for 46, later this week. Too late for rc1
Attachment #8738626 - Flags: approval-mozilla-esr45+
You need to log in before you can comment on or make changes to this bug.