shutdownhangs in mozilla::layers::CompositorThreadHolder::Shutdown

RESOLVED FIXED in Firefox 56

Status

()

Core
Graphics: Layers
P3
critical
RESOLVED FIXED
10 months ago
7 months ago

People

(Reporter: philipp, Assigned: aosmond)

Tracking

(Depends on: 1 bug, Blocks: 1 bug, {crash, regression, topcrash})

56 Branch
mozilla57
All
Windows
crash, regression, topcrash
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox55 unaffected, firefox56blocking fixed, firefox57 fixed)

Details

(Whiteboard: [gfx-noted], crash signature)

Attachments

(6 attachments, 4 obsolete attachments)

2.19 KB, patch
nical
: review+
Details | Diff | Splinter Review
2.33 KB, patch
Details | Diff | Splinter Review
95.63 KB, text/plain
Details
2.36 KB, patch
nical
: review+
jerry
: feedback+
Details | Diff | Splinter Review
19.57 KB, patch
Details | Diff | Splinter Review
9.54 KB, patch
aosmond
: review+
Details | Diff | Splinter Review
(Reporter)

Description

10 months ago
[Tracking Requested - why for this release]:
in the first hours of the 56.0b1 rollout this issue is manifesting itself as one of the main crash issues (13% of all browser crashes)

This bug was filed from the Socorro interface and is 
report bp-c8561832-aae4-44ce-ac47-61a290170810.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	ZwWaitForKeyedEvent 	
2 	ntdll.dll 	RtlSleepConditionVariableCS 	
3 	kernel32.dll 	SleepConditionVariableCS 	
4 	mozglue.dll 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
5 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:68
6 	xul.dll 	nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsEventQueue.cpp:76
7 	xul.dll 	nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:839
8 	xul.dll 	nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) 	xpcom/threads/nsThread.cpp:1297
9 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1380
10 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:480
11 	xul.dll 	mozilla::layers::CompositorThreadHolder::Shutdown() 	gfx/layers/ipc/CompositorThread.cpp:137
12 	xul.dll 	gfxPlatform::ShutdownLayersIPC() 	gfx/thebes/gfxPlatform.cpp:1039
13 	xul.dll 	mozilla::ShutdownXPCOM(nsIServiceManager*) 	xpcom/build/XPCOMInit.cpp:887
14 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1464
15 	xul.dll 	mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) 	obj-firefox/dist/include/mozilla/UniquePtr.h:345
16 	xul.dll 	XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4800
17 	xul.dll 	XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:4867
18 	xul.dll 	mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) 	toolkit/xre/Bootstrap.cpp:45
19 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
20 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
21 	kernel32.dll 	BaseThreadInitThunk 	
22 	ntdll.dll 	__RtlUserThreadStart 	
23 	ntdll.dll 	_RtlUserThreadStart

these shutdownhang crashes on windows with mozilla::layers::CompositorThreadHolder::Shutdown are regressing in volume in the 56 cycle. the first affected build on nightly seems to have been 56.0a1 20170615030208 - pushlog to the day before: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b266a8d8fd595b84a7d6218d7b8c6b7af0b5027c&tochange=035c25bef7b5e4175006e63eff10c61c2eef73f1

the associated crash signatures are generic shutdownhang signature lumping together a bunch of different issues - this is the query i've used to filter for the crashes described in this report:
https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Milan, this is shaping up to be a top crash in beta 56, can you find someone to investigate? Thanks.
tracking-firefox56: ? → +
Flags: needinfo?(milan)
Let's see if Andrew can make heads or tails out of this.
Flags: needinfo?(milan) → needinfo?(aosmond)
Threads 19 and 20 are image decoder threads...
Assignee: nobody → hshih
Blocks: 1388995

Comment 4

10 months ago
I checked several crash reports and all of them had the following configuration.

features":{"compositor":"basic","gpuProcess":{"status":"unavailable"},"advancedLayers":{"status":"disabled"},"d3d11":{"status":"failed"},"d2d":{"status":"blacklisted","version":"1.1"}}},"isWow64":false},

"settings":{"blocklistEnabled":true,"e10sEnabled":true,"e10sMultiProcesses":4,"e10sCohort":"multiBucket4",
And all crashes occurred at win7.
(Assignee)

Comment 6

9 months ago
Pretty sure I understand what went wrong here -- I believe I broke it in part 5 of bug 1365927. Patch coming up.
Assignee: hshih → aosmond
Blocks: 1365927
Status: NEW → ASSIGNED
Flags: needinfo?(aosmond)
Crash Signature: [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip; → [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip;
(Assignee)

Comment 7

9 months ago
Created attachment 8897011 [details] [diff] [review]
Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfa51eb247a5fb9e345abe3218b814f2b5a3d502
(Assignee)

Updated

9 months ago
Attachment #8897011 - Flags: review?(dvander)
Comment on attachment 8897011 [details] [diff] [review]
Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1

Review of attachment 8897011 [details] [diff] [review]:
-----------------------------------------------------------------

:dvander on PTO, let's get a quicker review
Attachment #8897011 - Flags: review?(dvander) → review?(nical.bugzilla)

Updated

9 months ago
Attachment #8897011 - Flags: review?(nical.bugzilla) → review+

Comment 9

9 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd4bf41ca47d
Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy. r=nical

Comment 10

9 months ago
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a9f759fd05fd
Backed out changeset dd4bf41ca47d for bustage
(Assignee)

Comment 11

9 months ago
Created attachment 8898062 [details] [diff] [review]
Release the compositor thread after CompositorManagerParent IPDL reference is released., v2

Let's try this again...

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=802c95d19b25ca1373006fe3cf14eb86e55c7da9

Did a full try run, none of the failures seem related to this change. I broke up the freeing of the thread and the freeing of CompositorManagerParent. It appears the failures were caused by trying to free shared memory after the CompositorManagerParent was freed. So we need to keep it around for just as long as we do now, but ensure the compositor thread itself can go free when the IPDL is shutdown.
Attachment #8897011 - Attachment is obsolete: true
Attachment #8898062 - Flags: review?(nical.bugzilla)

Updated

9 months ago
Attachment #8898062 - Flags: review?(nical.bugzilla) → review+

Comment 12

9 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c289ff85f94b
Release the compositor thread after CompositorManagerParent IPDL reference is released. r=nical

Comment 13

9 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b
Status: ASSIGNED → RESOLVED
Last Resolved: 9 months ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57

Comment 15

9 months ago
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #14)
> (In reply to Wes Kocher (:KWierso) from comment #13)
> > https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b
> 
> This patch will have appeared in 08-18 nightly build.  Reduced weekend usage
> makes it is unclear whether nightly crash rate dropped. But judging from two
> queries for the most common nightly signatures containing
> RtlSleepConditionVariableCS, the crash definitely is not gone
> 
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai
> tForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditio
> nVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-
> 20T11%3A52%3A37.
> 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum
> ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-
> date&page=1#reports
> 
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai
> tForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVari
> ableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-
> 20T11%3A52%3A37.
> 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum
> ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-
> date&page=1#reports

From the first query, I still saw crash from 20170820 nightly build.
Jerry, could you help to check the fix was include the build or not?
If yes, then we need to re-open this issue and check it.
c00f79dc-97ac-4604-a5c2-ddbaa0170820 	2017-08-20 11:16:43 	Firefox 	57.0a1 	20170819100442
Flags: needinfo?(hshih)
The version of "Firefox 	57.0a1 	20170819100442":
https://download-origin.cdn.mozilla.net/pub/firefox/nightly/2017/08/2017-08-19-10-04-42-mozilla-central/firefox-57.0a1.en-US.win64.txt

It already contains the fix is this bug.

hg log --follow -r :4f4487cc2d30d988742109868dcf21c4113f12f5 | grep c289ff85f94b
=>
changeset:   421990:c289ff85f94b

So, reopen this bug.
Status: RESOLVED → REOPENED
Flags: needinfo?(hshih)
Resolution: FIXED → ---
There are a lot of different crash signatures here and only a few crashes on nightly. We could still uplift this fix to beta (for beta 6) if you think it will mitigate the issue for beta users. What do you think?
Flags: needinfo?(aosmond)
Unfortunately this missed the beta 5 build today. We could uplift the current patch for beta 6 on Thursday though.
shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS signature was the top ranked browser crash in Beta 4.
tracking-firefox56: + → blocking
Created attachment 8900253 [details] [diff] [review]
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released.

MozReview-Commit-ID: 3eVziNDH59f
Comment on attachment 8900253 [details] [diff] [review]
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released.

:aosmond is in PTO, so I request this beta uplift. And I will still investigate the remaining crash with the same signature like: https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c16

---

Approval Request Comment

[Feature/Bug causing the regression]:
bug 1365927
Please check https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c6

[User impact if declined]:
We might hit the shutdown timeout.

[Is this code covered by automated tests?]:
Already pass the try in Nightly.
https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c11

[Has the fix been verified in Nightly?]:
https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c18
We still see the same crash in nightly, but the number is fewer.

[Needs manual test from QE? If yes, steps to reproduce]: 
none

[List of other uplifts needed for the feature/fix]:
none

[Is the change risky?]:
no.

[Why is the change risky/not risky?]:
The "mozilla::layers::CompositorThreadHolder::Shutdown()" hang is because that CompositorThreadHolder::Shutdown() is waiting for the CompositorThreadHolder obj released. And this patch try to release the CompositorThreadHolder obj ref-counting during shutdown.

[String changes made/needed]:
none
Attachment #8900253 - Flags: approval-mozilla-beta?
Comment on attachment 8900253 [details] [diff] [review]
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released.

Review of attachment 8900253 [details] [diff] [review]:
-----------------------------------------------------------------

Should fix a beta top crash or at least mitigate the crash rate. Aiming for beta 6 here.
Attachment #8900253 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
n-i to myself to check the crash rate on Friday after beta 6 is released.
Flags: needinfo?(lhenry)

Comment 24

9 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/b08da0b32226
status-firefox56: affected → fixed
I just ran a quick query, and it looks as if this is still happening in B6: http://bit.ly/2vDeOs1
It may be happening less  with beta 6. Most 56 users are now on beta 6 (as of Monday morning). So we'll see from today's crash rate. 
Jerry or Andrew, can you look at the signatures coming in for 56 beta 6?
Flags: needinfo?(hshih)
I'm checking for this issue now.
Flags: needinfo?(hshih)
(Assignee)

Comment 28

9 months ago
I'm looking into this again as well, although I no longer have a straightforward explanation as to why this is still happening. I only see reports for when the compositor thread is in the UI process -- this may just be an artifact of the reporting however because I see no "shutdownhang" crashes for the GPU process. Linux and Mac do not appear to be affected as they have relatively few shutdownhang reports in the last month, and none of them (that I checked) matched the profile -- this is interesting because Linux and Mac don't have the GPU process enabled at all so all users could be affected if it reproduced there (instead of just the subset of Windows users which disable the GPU process).
Currently, the crash are all at win7. And there is no gpu process in these crash reports.
I try to add some log in the AddRef()/Release() for compositorThreadHolder object, but I don't see the non-matched addRef/Release pair for compositorThreadHolder.
checking...
(Assignee)

Comment 30

9 months ago
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #29)
> Currently, the crash are all at win7. And there is no gpu process in these
> crash reports.
> I try to add some log in the AddRef()/Release() for compositorThreadHolder
> object, but I don't see the non-matched addRef/Release pair for
> compositorThreadHolder.
> checking...

Are you sure it is tied to the Windows version? I see reports on Windows 8.1, Windows 10.
Hi Andrew,
Here is my query string.
https://crash-stats.mozilla.com/signature/?_sort=-date&platform=Windows&version=56.0a1&version=57.0a1&version=56.0b&signature=shutdownhang%20%7C%20ZwWaitForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&proto_signature=%7Emozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown

And all crashes are at win7.

And here is another query string.
"Most" of crashes doesn't use gpu process.
https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable%22%7D&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

We use some tools to open the firefox and close it. Then, I saw one shutdown hang log. I will turn on the addRef/Release call log for compositorThreadHolder.
(Assignee)

Comment 33

9 months ago
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #31)
>
> And here is another query string.
> "Most" of crashes doesn't use gpu process.
> https://crash-stats.mozilla.com/search/
> ?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdo
> wn&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable
> %22%7D&version=56.0a1&version=57.0a1&version=56.
> 0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-
> 31T13%3A50%3A00.000Z&_sort=-
> date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_
> facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platfo
> rm_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cp
> u_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=s
> ignature&_columns=product&_columns=version&_columns=build_id&_columns=platfor
> m#facet-signature
> 

Hmm, the crash reports are all for the main process, as the only way we hit this particular code path (gfxPlatform::ShutdownLayersIPC) is for shutting down the compositor thread in the main process. Looking for the alternative path for the GPU process (mozilla::gfx::GPUParent::ActorDestroy) turns up nothing.

However tweaking your query to search for Linux and Mac OS X did turn up a handful of recent results on Linux, e.g.

https://crash-stats.mozilla.com/report/index/39a8a8cb-cf19-4544-8087-0f94b0170821
https://crash-stats.mozilla.com/report/index/d1c54c34-f447-4eda-8526-94c190170827
https://crash-stats.mozilla.com/report/index/cefb6e21-7254-4f16-816e-b99bc0170816
https://crash-stats.mozilla.com/report/index/dc37eb9c-7a4d-47d2-9d3f-376510170815
https://crash-stats.mozilla.com/report/index/27d63e0f-f9f1-4e0e-bb67-65f080170815

Another variant of Linux is it blocking on acquiring the MessageChannel mutex:

https://crash-stats.mozilla.com/report/index/b95ba2b6-2cd8-4537-852f-c00bb0170826

I suspect unrelated, but notable in case it offers additional clues.

It was also observed on Mac OS X, but last seen on build 20170728100358 for 56.0a1, e.g.:

https://crash-stats.mozilla.com/report/index/61fce4e1-5465-47f8-8cbe-4771c0170812

If I can glean anything from these, I will note it on the bug.

> We use some tools to open the firefox and close it. Then, I saw one shutdown
> hang log. I will turn on the addRef/Release call log for
> compositorThreadHolder.

Great news -- if you are able to reproduce again with the extra logging, I look forward to seeing the results :D.
(Assignee)

Comment 34

9 months ago
It appears the Linux crashes only started in build 20170814100258. It has bug 1383742 attached to the signature, which was found to be introduced by bug 1351148. Bug 1351148 relanded on central on 2017-08-13. The volume is very low on Linux 57, and it wasn't promoted to 56, so it cannot explain our 56 crashes. It also does not appear to have had an impact on the Windows 57 crash rate.
(Assignee)

Comment 35

9 months ago
The first build containing the changes in bug 1365927 was 20170615030208 (https://hg.mozilla.org/mozilla-central/rev/035c25bef7b5e4175006e63eff10c61c2eef73f1, confirmed as present in the build via https://hg.mozilla.org/mozilla-central/log/035c25bef7b5e4175006e63eff10c61c2eef73f1/gfx/layers/ipc/CompositorManagerChild.cpp).

The first build that we see the crash signature however is the *next day*'s nightly, 20170616030207. There appear to be lots of telemetry results / crash reports in general for both builds, so unless 20170615030208 got pulled early (I do see the top crash volume (IPCError) got cut in half between 20170615030208 and 20170616030207), it seems it is in the realm of possibility the change causing the crash was not bug 1365927 but something that came the next day.

Pushlog between the two builds:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=035c25bef7b5e4175006e63eff10c61c2eef73f1&tochange=fe809f57bf2287bb937c3422ed03a63740b3448b
(Assignee)

Comment 36

9 months ago
Milan confirmed that there was nothing especially bad about 20170615030208 (e.g. such that we pulled it earlier and thus maybe didn't have enough users on it to hit the crash). It was also on a Thursday which has "normal" usage patterns.

Jerry, if your own testing with the AddRef/Release logging doesn't turn up anything useful, could you throw build 20170615030208 into the tool you are using to reproduce? If it doesn't happen there, but it *does* happen on 20170616030207, I think we need to hook it up to mozregression to figure out what the offending change is. The tinderbox builds should still be available for us to narrow this down without having to rebuild ourselves.
Flags: needinfo?(aosmond) → needinfo?(hshih)
(Assignee)

Comment 37

9 months ago
My bad, hg itself contains the actual build information. It seems it was in a day earlier than I thought:

first release with	
  nightly linux32 b266a8d8fd59 / 56.0a1 / 20170614100332 / files
  nightly linux64 b266a8d8fd59 / 56.0a1 / 20170614100332 / files
  nightly mac b266a8d8fd59 / 56.0a1 / 20170614030206 / files
  nightly win32 b266a8d8fd59 / 56.0a1 / 20170614030206 / files
  nightly win64 b266a8d8fd59 / 56.0a1 / 20170614030206 / files

last release without	
  nightly linux32 5293e5f89358 / 56.0a1 / 20170613100218 / files
  nightly linux64 5293e5f89358 / 56.0a1 / 20170613100218 / files
  nightly mac 5293e5f89358 / 56.0a1 / 20170613030203 / files
  nightly win32 5293e5f89358 / 56.0a1 / 20170613030203 / files
  nightly win64 5293e5f89358 / 56.0a1 / 20170613030203 / files 

This is even more suggestive that it was something new in the 20170616 build.
(Reporter)

Comment 38

9 months ago
(In reply to Andrew Osmond [:aosmond] from comment #35)
> The first build that we see the crash signature however is the *next day*'s
> nightly, 20170616030207.

i'm not sure if this is correct - in crash stats i see those crashes already popping up starting with build 20170615030208 like bp-0d5df06e-bc3a-4230-ade5-47ebb0170626
Created attachment 8904211 [details]
CompositorThreadHolder ref-count log
After adding the ref-count log, I only got one crash log at win7. Here is the log: attachment 8904211 [details].

I add debug message in CompositorThreadHolder::Shutdown(), but I don't see the debug message for this function.
Another interesting thing is that I saw some logs like "[Child 263496] WARNING: '!mMainThread', file c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line 320". It means that the main thread is already deleted.
But the main thread deletion is called after CompositorThreadHolder::Shutdown(). So, I add some additional log in "nsThreadManager::GetMainThread()" for more information.
I have another crash log which show the gc module try to post some tasks after the main thread deleted. I am not sure it's related to this problem. Still investigating.
This is still high volume in 56.0b8
status-firefox56: fixed → affected
status-firefox57: fixed → affected
(Assignee)

Comment 43

9 months ago
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #40)
> After adding the ref-count log, I only got one crash log at win7. Here is
> the log: attachment 8904211 [details].
> 
> I add debug message in CompositorThreadHolder::Shutdown(), but I don't see
> the debug message for this function.
> Another interesting thing is that I saw some logs like "[Child 263496]
> WARNING: '!mMainThread', file
> c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line
> 320". It means that the main thread is already deleted.
> But the main thread deletion is called after
> CompositorThreadHolder::Shutdown(). So, I add some additional log in
> "nsThreadManager::GetMainThread()" for more information.

Looks like the thread which the shutdown hang is reported on is not the main thread (since it doesn't match the thread addref and release was called on). From that I can only deduce the content process CompositorManagerParent never gets deallocated. From the crash reports I have looked at, the compositor thread is never blocked and just waiting for events to come in. This would suggest then that the close request from the content processes never came. I checked a number of crash reports and couldn't find any content process crash reports around the same time as the thread shutdown hang.

I wonder if the content process is blocking in CompositorBridgeChild::ShutDown. Maybe CompositorBridgeChild::ActorDestroy got called on its own somehow, so we cleared mCanSend which prevents CompositorBridgeChild::Destroy from running, which in turn is the only way CompositorBridgeChild::AfterDestroy gets called, which in turn is how sCompositorBridge is cleared which is the spin condition in CompositorBridgeChild::ShutDown? And the parent process crashes first on the shutdown hang so we never get to see the content process shutdown hang? I can prepare something to fix this, but it is just more speculation from me ;).
(Assignee)

Comment 44

9 months ago
Created attachment 8904681 [details] [diff] [review]
CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1

As I said, total speculation. Jerry, can you take a look at the patch and give your thoughts? And maybe apply this patch to build running in that special test harness? My thought process is moving entirely to a badly behaving content process now over something gone wrong in the parent process, but I could be completely off base here...

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9257f97a9fdb64b05c98af94133bae0c11d3b70e
Attachment #8904681 - Flags: feedback?(hshih)
I am not sure I catch your idea. But the patch was applied for the shutdown test. Hope there will have more info for this issue. Let me think more for your description.
I set a infinite loop in the CompositorBridgeChild::ShutDown(), then we will have the same shutdown hang problem as this bug. Since the parent side are waiting for the CompositorBridgeParent::ActorDestroy(), but the child side never sends that.
The interesting thing is that parent process has a shutdown hang watchdog, but the content process doesn't.

I setup 2 laptop for the the original beta code and attachment 8904681 [details] [diff] [review], but there is no more crash occurred from my testing environment. I can't tell the attachment 8904681 [details] [diff] [review] could fix the problem but at least the sCompositorBridge could be always released.
Flags: needinfo?(hshih)
Attachment #8904681 - Flags: feedback?(hshih) → feedback+
(Assignee)

Updated

9 months ago
Attachment #8904681 - Attachment description: [WIP] CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1 → CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1
Attachment #8904681 - Flags: review?(nical.bugzilla)
(Assignee)

Comment 47

9 months ago
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #45)
> I am not sure I catch your idea. But the patch was applied for the shutdown
> test. Hope there will have more info for this issue. Let me think more for
> your description.

I should probably explain on the bug itself.

When CompositorManagerChild::Shutdown is called, it in turn delegates to CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close and thus starting the process required to release the compositor thread reference in the parent process.

CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and then spin the main thread until the static reference to itself is released. Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs, and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the main thread) to delete the IPDL object and clear said static reference. Then ShutDown stops spinning, and we call Close.

If however CompositorBridgeChild::ActorDestroy already got called (somehow -- I did not think too hard on why...), then Destroy does nothing. And we will loop waiting for the static reference to be cleared (which will never happen, since it only happens in AfterDestroy). Which in turn means Close never gets called, and the compositor thread references linger.
(Assignee)

Updated

9 months ago
Priority: -- → P3
Whiteboard: [gfx-noted]

Updated

9 months ago
Attachment #8904681 - Flags: review?(nical.bugzilla) → review+
(In reply to Andrew Osmond [:aosmond] from comment #47)
> (In reply to Jerry Shih[:jerry] (UTC+8) from comment #45)
> > I am not sure I catch your idea. But the patch was applied for the shutdown
> > test. Hope there will have more info for this issue. Let me think more for
> > your description.
> 
> I should probably explain on the bug itself.
> 
> When CompositorManagerChild::Shutdown is called, it in turn delegates to
> CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close
> and thus starting the process required to release the compositor thread
> reference in the parent process.
> 
> CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and
> then spin the main thread until the static reference to itself is released.
> Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs,
> and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the
> main thread) to delete the IPDL object and clear said static reference. Then
> ShutDown stops spinning, and we call Close.
> 
> If however CompositorBridgeChild::ActorDestroy already got called (somehow
> -- I did not think too **** why...), then Destroy does nothing. And we
> will loop waiting for the static reference to be cleared (which will never
> happen, since it only happens in AfterDestroy). Which in turn means Close
> never gets called, and the compositor thread references linger.

It's more clear.

The parent process has a shutdown hang watchdog, but the content process doesn't.
I think we should also have a hang monitor at the child side to prevent the problem again. In this case, the child side might still be spinning but we can't know.

Comment 49

9 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/df9d3661abb6
CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning. r=nical
https://hg.mozilla.org/mozilla-central/rev/df9d3661abb6
Status: REOPENED → RESOLVED
Last Resolved: 9 months ago9 months ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
For release build, the content process will go through ProcessChild::QuickExit(). Then, the content process will not call xpcom_shutdown. So, it will also not go though CompositorBridgeChild::ShutDown.
I'm not sure the attachment #8904681 [details] [diff] [review] could fix the problem. But let's see how it will be.
Flags: needinfo?(aosmond)
(Assignee)

Comment 52

9 months ago
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #51)
> For release build, the content process will go through
> ProcessChild::QuickExit(). Then, the content process will not call
> xpcom_shutdown. So, it will also not go though
> CompositorBridgeChild::ShutDown.
> I'm not sure the attachment #8904681 [details] [diff] [review] [diff] [review] could fix the
> problem. But let's see how it will be.

Ugh, I wonder if is this related to reproducing in other ways? Maybe I should follow the exact configuration for a nightly build. I am still seeing reports (4 thus far) on 20170907220212 which contains the new fix.

https://crash-stats.mozilla.com/report/index/609a2aab-8fa9-4af5-ba11-315210170908
https://crash-stats.mozilla.com/report/index/5edb45fa-1e37-4e69-9b09-658420170908

At this point, I need to think about what information would be useful to debug and land something which dumps it to the GFX log. Maybe a list of what CompositorManagerParents are still alive at shutdown...
(Assignee)

Comment 53

9 months ago
It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by the CompositorSession) that should still happen on shutdown (e.g. nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before my changes this would release the CompositorBridgeParent and its thread before the quick exit. While it still releases the CompositorBridgeParent, the CompositorManagerParent remains up. Maybe we need to do more in these places.

It doesn't make sense though because a quick exit should error out the ProcessLink on the other side...surely. But it is a behavioural change, so I guess it is an option.
(Assignee)

Comment 54

9 months ago
The signature morphed as a result of some other changes (probably the quantum DOM scheduler).
Status: RESOLVED → REOPENED
Crash Signature: [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip; → [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip;
Resolution: FIXED → ---
Created attachment 8906563 [details] [diff] [review]
AddRef/Release log for CompositorThreadHolder.

Here is the customize AddRef/Release impl for CompositorThreadHolder. It will use
gfxCriticalNote to dump the stack for AddRef/Release call.
But I can't get the complete log from the gfxCriticalNote. Some logs seems to be
overwrote by the newer log.

Comment 56

8 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1
Explicitly shutdown the CompositorManagerChild during ContentChild::ActorDestroy. r=me
(Assignee)

Comment 57

8 months ago
(In reply to Pulsebot from comment #56)
> Pushed by aosmond@gmail.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1
> Explicitly shutdown the CompositorManagerChild during
> ContentChild::ActorDestroy. r=me

I will back this out after we get a few builds with it and observe the effect on crash rates. Fingers crossed.
https://hg.mozilla.org/mozilla-central/rev/e66c8d65d0f1
Status: REOPENED → RESOLVED
Last Resolved: 9 months ago8 months ago
Resolution: --- → FIXED
(In reply to Andrew Osmond [:aosmond] from comment #53)
> It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by
> the CompositorSession) that should still happen on shutdown (e.g.
> nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before
> my changes this would release the CompositorBridgeParent and its thread

Yes, but it happens at parent process. If the parent is blocked at the CompositorThreadHolder::Shutdown, it should already have ran the nsBaseWidget::DestroyCompositor().

> before the quick exit. While it still releases the CompositorBridgeParent,
> the CompositorManagerParent remains up. Maybe we need to do more in these
> places.
> 
> It doesn't make sense though because a quick exit should error out the
> ProcessLink on the other side...surely. But it is a behavioural change, so I
> guess it is an option.

The parent actor should be deleted by ActorDestory(). So, I'm still checking for the remaining holder for the compositorThread.
https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/PCompositorManagerParent.cpp#234
Duplicate of this bug: 1379619
If this is fixed in nightly can you request uplift to beta? Or is this still a diagnostic patch?
Flags: needinfo?(lhenry)
(Reporter)

Comment 62

8 months ago
unfortunately i still see those crashes in today's nightly build 20170912100139 which contains the patch:
http://bit.ly/2f3G29h
(Assignee)

Comment 63

8 months ago
Yes I agree. Back to the drawing board.
Status: RESOLVED → REOPENED
Flags: needinfo?(aosmond)
Resolution: FIXED → ---
status-firefox57: fixed → affected
Created attachment 8907556 [details] [diff] [review]
show the ref-counting for CompositorThreadHolder in XPCOM shutdown.

This patch use CompositorThreadHolderRef to record the different types of CompositorThreadHolder object holder.
Then, we could dump the log for the remaining holder type during the XPCOM shutdown.

MozReview-Commit-ID: GfUtMD11Ajt
Attachment #8907556 - Flags: feedback?(nical.bugzilla)
Attachment #8907556 - Flags: feedback?(aosmond)
Attachment #8906563 - Attachment is obsolete: true
There are 53 crashes in nightly 57 for signature "shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-" and 86 for "shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-".
Crash Signature: [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip; → [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip;

Comment 66

8 months ago
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcbb4575e482
Backed out changeset e66c8d65d0f1 because it did not reduce the crash volume.
(Assignee)

Comment 67

8 months ago
Created attachment 8907693 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v1

So two things in the pipeline from my end right now. One, bug 1399453 has landed on inbound which should annotate the crash reports with the current compositor thread references -- when a shutdown hang is tripped, we should be able to just search for "gfxCompositorThread" to find them. Hopefully that makes it into the next build this evening/tomorrow morning. This may help with diagnosing the problem.

In parallel, I had a discussion with dvander about the problem. We agreed that in an ideal world, the parent process should be the one initiating the IPDL objects closing, rather than the child processes. I figured this was more risky than it probably is, because as dvander pointed out, the child processes already need to handle the GPU process going away at a random point. This is sort of a backup plan, assuming the annotations don't tell us anything new but hopefully a sound one if the try is clean.

try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1
try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c
(Assignee)

Comment 68

8 months ago
(In reply to Andrew Osmond [:aosmond] from comment #67)
> Created attachment 8907693 [details] [diff] [review]
> Force CompositorManagerParent to close before shutting down the compositor
> thread., v1
> 
> try (central):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1
> try (beta):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c

My bad:

try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc8e6365ad4398e155ebbdeb67412e808a90157c
(Assignee)

Comment 69

8 months ago
Created attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

Fix that leak check (grr).

try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=e27af77640ce1b1e6bef88a27f31d74cc5e0b4fb
try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=316a24034803386ae55d6c315200c671f0f0061a
Attachment #8907693 - Attachment is obsolete: true
(Assignee)

Comment 70

8 months ago
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

The crash report annotations show that the CompositorManagerChild/Parent for the content processes are still alive. So we were on the right track with the previous patches, but for some reason, they aren't helping. The content process should quick exit and let the OS cleanup the IPC resources but that doesn't seem to be happening either.

As said in comment 67, we actually would prefer to move towards shutting down in the parent process anyways, rather than rely upon orderly shutdown in the content processes. This should be safe since they need to handle the GPU process dying at any time in theory. So this patch should move us in the right direction and hopefully fix the hang once and for all.
Attachment #8907841 - Flags: review?(dvander)
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

Review of attachment 8907841 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorManagerParent.cpp
@@ +133,5 @@
>    AddRef();
> +
> +  StaticMutexAutoLock lock(sMutex);
> +  if (!sActiveActors) {
> +    sActiveActors = new nsTArray<CompositorManagerParent*>();

Arguably we should create the array at startup, and refuse to create new CompositorManagers if this array doesn't exist, since otherwise they might not be severed later. But it's okay as-is.
Attachment #8907841 - Flags: review?(dvander) → review+

Comment 72

8 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d0893f1b6ce
Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander
https://hg.mozilla.org/mozilla-central/rev/7d0893f1b6ce
Status: REOPENED → RESOLVED
Last Resolved: 8 months ago8 months ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
(Assignee)

Comment 74

8 months ago
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1365927
[User impact if declined]: May experience shutdown hangs on Windows.
[Is this code covered by automated tests?]: The code is run when it shuts down the test harness in general.
[Has the fix been verified in Nightly?]: Yes. The shutdown hang crash reports have stopped coming in.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It is self contained within one module and easily auditable. While the CompositorManager and subprotocols could be closed earlier (in theory) than we have historically tested, the content processes already support the GPU process going away unexpectedly (e.g. it crashes and later recovers), which is all this emulates from the content process perspective. There should not be any state information lost as a result of the earlier close.
[String changes made/needed]: None.
Attachment #8907841 - Flags: approval-mozilla-beta?
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

56 is now on m-r.  See comment 74.
Attachment #8907841 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2

Looks like this should fix a topcrash blocking the 56 release. Please uplift for the RC build.
Attachment #8907841 - Flags: approval-mozilla-release? → approval-mozilla-release+
Depends on: 1400116
status-firefox55: --- → unaffected
status-firefox-esr52: --- → unaffected
I had to back this out from mozilla-release because it broke debug builds (and I confirmed on Try that this patch was the culprit). Basically, Windows and Linux builds would appear to finish their checktests and then just hang until eventually timing out. Also, OSX debug xpcshell would similarly appear to just hang and end up timing out during chrome/test/unit_ipc/test_resolve_uris_ipc.js.
https://hg.mozilla.org/releases/mozilla-release/rev/28c16d0c9746f7e5f4a5d01c21bd3fede6c6e1a5

https://treeherder.mozilla.org/logviewer.html#?job_id=131483091&repo=mozilla-release
https://treeherder.mozilla.org/logviewer.html#?job_id=131477593&repo=mozilla-release
status-firefox56: fixed → affected
Flags: needinfo?(aosmond)
(Assignee)

Comment 79

8 months ago
Ugh, I should have posted my own trys against mozilla-release instead of mozilla-beta at the time of posted I take it.
Your Try push from comment 69 against Beta shows the same problem.
And no, you did the right thing at the time. 56 went from m-b to m-r on the 14th, so your push against Beta on the 13th was fine.
(Assignee)

Comment 82

8 months ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #81)
> And no, you did the right thing at the time. 56 went from m-b to m-r on the
> 14th, so your push against Beta on the 13th was fine.

Ugh -- I guess I didn't think it to be related. Is this happening on central as well (to your knowledge)? I know about the leakcheck failure, which is in my queue to look at. But that seems less serious than this.
It's not happening on m-c. I'm quite certain it would have been backed out already had it been :)
Backed out for leaking in mda2 on Linux x64 asan (bug 1374856) and because the patch was only applied on central:

https://hg.mozilla.org/mozilla-central/rev/877d536658e644bd0d1662d608cc2ca3a9fa8d9f
Status: RESOLVED → REOPENED
status-firefox57: fixed → affected
Resolution: FIXED → ---
Target Milestone: mozilla57 → ---
(Assignee)

Comment 86

8 months ago
Created attachment 8909466 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v3 [carries r=dvander,me]

I fixed the asan/leakcheck issue, so this should be landable on mozilla-central:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=223084d2affbc22de128e0bdfb3bef4a4d61116a

And I fixed the 56 branch debug vs release issue by not shutting down in the parent in debug builds. I believe the proper fix will be to always shutdown in the parent in both debug and release but we are already past the 11th hour...
Attachment #8907841 - Attachment is obsolete: true
Flags: needinfo?(aosmond)
Attachment #8909466 - Flags: review+

Comment 88

8 months ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4dd86bbce649
Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander,me
https://hg.mozilla.org/mozilla-central/rev/4dd86bbce649
Status: REOPENED → RESOLVED
Last Resolved: 8 months ago8 months ago
status-firefox57: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment on attachment 8907556 [details] [diff] [review]
show the ref-counting for CompositorThreadHolder in XPCOM shutdown.

I am late to the game sorry.
Attachment #8907556 - Flags: feedback?(nical.bugzilla)
Attachment #8907556 - Flags: feedback?(aosmond)

Updated

8 months ago
Duplicate of this bug: 1402016

Updated

8 months ago
Duplicate of this bug: 1364798

Updated

8 months ago
Blocks: 1402037

Updated

8 months ago
Crash Signature: [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip; → [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip;

Comment 94

8 months ago
Bug 1379619 is specific to Mac OS X.
Whereas Bug 1389021 is specific to Windows OS.

These bugs were marked as duplicates of each other. But with different operating systems being affected, and different crash signatures, should they be left separate instead? Maybe linked via "See Also"?

Or should the problems with both bugs and operating systems be merged? Changing the OS on Bug 1389021 from Windows to All.

Comment 95

8 months ago
(In reply to Andrew Osmond [:aosmond] from comment #74)
> Comment on attachment 8907841 [details] [diff] [review]
> Force CompositorManagerParent to close before shutting down the compositor
> thread., v2
> 
> No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it.
> 
> Approval Request Comment
> [Feature/Bug causing the regression]: Bug 1365927


Are these crash reports relevant? They appeared in Nightly recently.

00e502a7-1bc6-4b77-a3f2-2627e0170919 	2017-09-19 15:35:04 	Firefox 	57.0a1 	20170918220054 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 	2017-09-19 15:12:59
2a12f86b-5f6c-48db-ad9c-c588d0170917 	2017-09-17 00:37:59 	Thunderbird 	57.0a1 	20170914030208 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 	2017-09-16 15:52:23
3ad05691-0c7c-4c20-b645-127a50170916 	2017-09-16 15:06:09 	Firefox 	57.0a1 	20170914100122 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 	2017-09-15 04:31:26
7c040dc6-f960-42dc-af21-cffef0170915 	2017-09-15 13:29:15 	Firefox 	57.0a1 	20170914220209 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 	2017-09-15 09:16:34
(Assignee)

Comment 96

8 months ago
(In reply to skywalker333 from comment #95)
> Are these crash reports relevant? They appeared in Nightly recently.
> 
> 00e502a7-1bc6-4b77-a3f2-2627e0170919 	2017-09-19 15:35:04 	Firefox 	57.0a1 
> 20170918220054 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> 2017-09-19 15:12:59
> 2a12f86b-5f6c-48db-ad9c-c588d0170917 	2017-09-17 00:37:59 	Thunderbird 
> 57.0a1 	20170914030208 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0
> 2017-09-16 15:52:23
> 3ad05691-0c7c-4c20-b645-127a50170916 	2017-09-16 15:06:09 	Firefox 	57.0a1 
> 20170914100122 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> 2017-09-15 04:31:26
> 7c040dc6-f960-42dc-af21-cffef0170915 	2017-09-15 13:29:15 	Firefox 	57.0a1 
> 20170914220209 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> 2017-09-15 09:16:34

They are unrelated crashes, and should get a new bug. For the crash tracked by this bug, you should see CompositorThreadHolder::Shutdown in the crashing thread stack trace. Sadly the crash signature can't be made more specific (afaik).

Updated

8 months ago
See Also: → bug 1379619

Comment 97

8 months ago
(In reply to Andrew Osmond [:aosmond] from comment #96)
> (In reply to skywalker333 from comment #95)
> > Are these crash reports relevant? They appeared in Nightly recently.
> > 
> > 00e502a7-1bc6-4b77-a3f2-2627e0170919 	2017-09-19 15:35:04 	Firefox 	57.0a1 
> > 20170918220054 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> > 2017-09-19 15:12:59
> > 2a12f86b-5f6c-48db-ad9c-c588d0170917 	2017-09-17 00:37:59 	Thunderbird 
> > 57.0a1 	20170914030208 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0
> > 2017-09-16 15:52:23
> > 3ad05691-0c7c-4c20-b645-127a50170916 	2017-09-16 15:06:09 	Firefox 	57.0a1 
> > 20170914100122 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> > 2017-09-15 04:31:26
> > 7c040dc6-f960-42dc-af21-cffef0170915 	2017-09-15 13:29:15 	Firefox 	57.0a1 
> > 20170914220209 	Mac OS X 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 	0x0 
> > 2017-09-15 09:16:34
> 
> They are unrelated crashes, and should get a new bug. For the crash tracked
> by this bug, you should see CompositorThreadHolder::Shutdown in the crashing
> thread stack trace. Sadly the crash signature can't be made more specific
> (afaik).

So then these two bugs are not duplicates and should be kept separate.

[Bug 1379619 is specific to Mac OS X.
Whereas Bug 1389021 is specific to Windows OS.]

Updated

8 months ago
Crash Signature: [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip; → [@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] [@ shutdownhang | NtWaitForKeyedEvent | RtlSleepC&hellip;
Duplicate of this bug: 1379619
Duplicate of this bug: 1379619
(In reply to Andrew Osmond [:aosmond] from comment #74)
> [Is this code covered by automated tests?]: The code is run when it shuts
> down the test harness in general.
> [Has the fix been verified in Nightly?]: Yes. The shutdown hang crash
> reports have stopped coming in.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that it has automated coverage.
Flags: qe-verify-
Depends on: 1402592

Updated

8 months ago
No longer blocks: 1402037
Depends on: 1402037
(Assignee)

Comment 102

8 months ago
(In reply to Marco Castelluccio [:marco] from comment #101)
> Are we sure the bug is actually fixed? It's possible the signature just
> shifted a bit.
> E.g. look at
> https://crash-stats.mozilla.com/search/
> ?proto_signature=~CompositorThreadHolder%3A%3AShutdown&product=Firefox&date=%
> 3E%3D2017-09-23T10%3A57%3A41.000Z&date=%3C2017-09-26T10%3A57%3A41.
> 000Z&_sort=-
> date&_facets=signature&_facets=proto_signature&_facets=content_sandbox_level&
> _facets=content_sandbox_capable&_facets=content_sandbox_enabled&_facets=conte
> nt_sandbox_capabilities&_columns=date&_columns=signature&_columns=product&_co
> lumns=version&_columns=build_id&_columns=platform#facet-proto_signature.

Yes. It went away from 56. I think this is fallout from bug 1401668. Maybe I should back it out on central to confirm now that bug 1398070 will have made us more forgiving on a late free, and to make a real fix less urgent.
(Assignee)

Updated

7 months ago
Depends on: 1413011
You need to log in before you can comment on or make changes to this bug.