Closed Bug 1314543 Opened 3 years ago Closed 3 years ago

Crash in mozilla::gfx::FeatureChange::AssertSanity | "unexpected type tag"

Categories

(Core :: Graphics, defect, critical)

All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- fixed
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: kanru, Assigned: vliu)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-5abc044b-91fe-4bba-a8f8-71a002161102.
=============================================================

9 crashes since build 20161031030202

Looks like GPU process related.
Flags: needinfo?(dvander)
I can't easily see how this assertion is possible. There's only two possible values for the tag (null_t, or not null_t), and we do a null_t check before unwrapping the union.

Keeping this on my radar - will check back in a week to see if we have more reports.
Not seeing any more crashes this week.
Flags: needinfo?(dvander)
Whiteboard: [gfx-noted]
it's still around around and accounting for about 0.4% of browser crashes in early data for 53.0b1.
Flags: needinfo?(dvander)
Keywords: regression
OS: Windows 10 → Windows
Hardware: Unspecified → All
Vincent, please take a look at this bug. I saw around 62 crashes on FF53, 3 crashes on FF55.

https://crash-stats.mozilla.com/report/index/9f90899b-8912-47cf-b4cf-81a822170311

|[0][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476) |[1][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476) |[2][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476) |[3][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476) |[4][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476) |[5][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=100476)
Assignee: nobody → vliu
(In reply to Peter Chang[:pchang] from comment #4)
> Vincent, please take a look at this bug. I saw around 62 crashes on FF53, 3
> crashes on FF55.
> 
> https://crash-stats.mozilla.com/report/index/9f90899b-8912-47cf-b4cf-
> 81a822170311
> 

From the mini-dump, the crash happens at [1]. From [2], AssertSanity() passed TFeatureFailure to it, meaning it expected mType in [1] was also TFeatureFailure, but apparently the mismatch happens.

[1]: https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/_ipdlheaders/mozilla/gfx/GraphicsMessages.h#476
[2]:https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/_ipdlheaders/mozilla/gfx/GraphicsMessages.h#549

More observing frame 4 in mini-dump, GPUDeviceData was gotten from [3], the type would be T_None(0). In [4], it seems that if the mType was Tnull_t, gecko won't go into get_FeatureFailure()

d3d11Compositing_.mType = T_None(0)
oglCompositing_.mType = T_None(0)
gpuDevice_.mType =  T_None(0)

[3]: https://dxr.mozilla.org/mozilla-central/source/gfx/ipc/GPUChild.cpp#83
[4]: https://dxr.mozilla.org/mozilla-central/source/gfx/config/gfxConfig.cpp#264


(In reply to David Anderson [:dvander] from comment #1)
> I can't easily see how this assertion is possible. There's only two possible
> values for the tag (null_t, or not null_t), and we do a null_t check before
> unwrapping the union.
> 
> Keeping this on my radar - will check back in a week to see if we have more
> reports.

Hi David,
Is it possible gecko got the wrong type in [3]? Does it should return Tnull_t?
I also did an experiment. I tried to kill GPU process manually after firefox was launched. in [3] in Comment 5, mType is Tnull_t. I think that's why I think gecko should return Tnull_t in [3]. Or maybe I was wrong because I am not familiar with this part.
If GPU process was killed at the moment when sync ipc call GetDeviceStatus had sent in child side in [5] but not yet to return normally, mType would keep T_None and then hit the same crash point in [6]. I think it would be the possible reason to cause this crash.

[5]: https://dxr.mozilla.org/mozilla-central/source/gfx/ipc/GPUChild.cpp#83
[6]: https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/_ipdlheaders/mozilla/gfx/GraphicsMessages.h#476
Hi David,

Based on Comment 7, a patch was attached intends to make early return if IPC call fail. Can you please have a review? We may observe if this patch can drop the crash rate.
Attachment #8847037 - Flags: review?(dvander)
Comment on attachment 8847037 [details] [diff] [review]
0001-Bug-1314543-Make-a-early-return-when-SendGetDeviceSt.patch

Review of attachment 8847037 [details] [diff] [review]:
-----------------------------------------------------------------

Great catch! It looks like callers of EnsureGPUReady don't expect it to fail either. It would be good to fix that now that the function is fallible.

Could you make it have a bool return, and make callers early-return on failure? This would replace patterns like:

> EnsureGPUReady();
>
> if (!mGPUChild) {
>   return...

With:

> if (!EnsureGPUReady()) {
>   return...
Attachment #8847037 - Flags: review?(dvander) → feedback+
Hi David,

Thanks for your good suggestion on Comment 9. Could you please have a review to see if it meets your expectation?
Attachment #8847037 - Attachment is obsolete: true
Attachment #8847479 - Flags: review?(dvander)
Comment on attachment 8847479 [details] [diff] [review]
0001-Bug-1314543-Make-EnsureGPUReady-returns-bool-to-make.patch

Review of attachment 8847479 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thanks!
Attachment #8847479 - Flags: review?(dvander) → review+
Pushed by vliu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6780188f669
Make EnsureGPUReady() returns bool to make sure GPU process is readay. r=dvander
vicent told me that this not realted to failures we see bug https://hg.mozilla.org/integration/mozilla-inbound/rev/a6780188f66986608a220efc6095e93daaa9e45e indicate frequent test failures in windows 8 xpc test that are perma failing , so backing this out because this change might have caused this
Flags: needinfo?(vliu)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/62cd40511419
Backed out changeset a6780188f669 for suspicion this cause perma windows 7 xpcshell failures
(In reply to Carsten Book [:Tomcat] from comment #14)
> vicent told me that this not realted to failures we see bug
> https://hg.mozilla.org/integration/mozilla-inbound/rev/
> a6780188f66986608a220efc6095e93daaa9e45e indicate frequent test failures in
> windows 8 xpc test that are perma failing , so backing this out because this
> change might have caused this

Thanks for the help. I'll look into my patch to figure it out.
Flags: needinfo?(vliu)
Hi David,

It seem that my previous r+ed patch caused xpcshell failure. After looked into my patch, the code modification in GPUProcessManager::CreateContentCompositorBridge() by me may change the original behavior. Here to attach my patch. In this patch. There were some changes in GPUProcessManager::CreateContentCompositorBridge() from previous one, others were the same. Could you please have a review? Thanks.
Attachment #8847479 - Attachment is obsolete: true
Attachment #8849008 - Flags: review?(dvander)
Here is the xpcshell test try result based on the patch on Comment 17

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc6684a2d9de525e18a823d89d69b67ef9696aac
Attachment #8849008 - Flags: review?(dvander) → review+
Pushed by vliu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/50907f304e24
Make EnsureGPUReady() returns bool to make sure GPU process is readay. r=dvander
https://hg.mozilla.org/mozilla-central/rev/50907f304e24
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
I believe this needs an Aurora approval request but not Beta?
Whoops, I think we need this for 53 as well.
david, want to request uplift on this for aurora and beta?
Flags: needinfo?(dvander)
Comment on attachment 8849008 [details] [diff] [review]
0001-Bug-1314543-Make-EnsureGPUReady-returns-bool-to-make.patch

Approval Request Comment
[Feature/Bug causing the regression]: GPU process
[User impact if declined]: potential UI crashes if the GPU process fails to start
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: n/a
[Is the change risky?]: no
[Why is the change risky/not risky?]: This simply ensures that IPC is working before we decide that the GPU process has started successfully. In most cases it is working, and when it isn't, we had a chance of crashing before this patch.
[String changes made/needed]:
Flags: needinfo?(dvander)
Attachment #8849008 - Flags: approval-mozilla-beta?
Attachment #8849008 - Flags: approval-mozilla-aurora?
Comment on attachment 8849008 [details] [diff] [review]
0001-Bug-1314543-Make-EnsureGPUReady-returns-bool-to-make.patch

Prevents a browser crash, let's uplift for beta 10.
Attachment #8849008 - Flags: approval-mozilla-beta?
Attachment #8849008 - Flags: approval-mozilla-beta+
Attachment #8849008 - Flags: approval-mozilla-aurora?
Attachment #8849008 - Flags: approval-mozilla-aurora+
Clear the ni? flag since David has done with this. Thanks David.
Flags: needinfo?(vliu)
You need to log in before you can comment on or make changes to this bug.