1330693 - Crash in mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished

Assignee

Description

•

7 years ago

This bug was filed from the Socorro interface and is 
report bp-460ce510-1953-43fd-9be1-11a732170112.
=============================================================

Our double-completion checks in JobFinish() now trigger the release assertions in nsTArray.  We should change this to make an explicit IsEmpty() check.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 1

•

7 years ago

Actually, I think this is more about calling JobFinished() on a nullptr mFinalCallback object.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 2

•

7 years ago

Attached patch Try to handle SW job double-completion better. r=ehsan — Details — Splinter Review

Ehsan, given the crash stack in this bug I think we still have some trouble with double-completion of SW jobs.  This patch improves our protections by checking further up the completion call chain.  In this case I think the crash is from calling JobFinished() on a nullptr mFinalCallback for some reason.

I also added more diagnostic assertions to try to narrow down the problem.

Attachment #8826243 - Flags: review?(ehsan)

(no longer active)

Comment 3

•

7 years ago

Hmm.  So there are only two crashes with this signature so far.  It's obvious that this isn't SafeElementAt failing.  But I also think it's clear this isn't a nullptr dereference crash, given that the crash address in comment 0 is 0x1612004 and the crash address in bp-aac6a324-b920-4a7c-8cef-4fe292170111 is 0xfffffffff35a5fb0 (nullptr deref crashes typically occur at an address close to 0, for example, accessing a member with offset N would crash on address N if |this| is nullptr.)  More specifically, since ServiceWorkerJobQueue::Callback::JobFinished is virtual, if mFinalCallback was 0 then the deref of the |this| pointer to get to the vtable would crash at address 0.  So I think there is something else going on here.

(no longer active)

Comment 4

•

7 years ago

Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Review of attachment 8826243 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerJob.cpp
@@ +215,5 @@
>  
>    mState = State::Finished;
>  
> +  MOZ_DIAGNOSTIC_ASSERT(mFinalCallback);
> +  if (mFinalCallback) {

As I said before, I'm pretty sure this isn't going to solve the crash, but still the patch may uncover the bug we're chasing here.

FWIW I suggest looking at the crash dumps for this, let me know if you want me to download them for you.  But since we only have two of these crashes, it's up to you how much time you want to spend on debugging it...  :-)

Attachment #8826243 - Flags: review?(ehsan) → review+

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 5

•

7 years ago

In the last 24 hours we've gotten 3 more reports that suggest nullptr of mFinalCallback:

https://crash-stats.mozilla.com/report/index/571939d5-08eb-4fa2-b654-459832170113

Is our nullptr de-ref working as expected in release?

Flags: needinfo?(ehsan)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 6

•

7 years ago

Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Approval Request Comment
[Feature/Bug causing the regression]: Service workers
[User impact if declined]: Low frequency crashes
[Is this code covered by automated tests?]: Its heavily tested, but our automation does not trigger this problem.
[Has the fix been verified in Nightly?]: It has not yet merged to nightly, but I'd like to flag now so the patch might make it into the beta RC next week.
[Needs manual test from QE? If yes, steps to reproduce]:  None
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Minimal risk
[Why is the change risky/not risky?]: This patch adds some additional state and nullptr checks.  It has a very low risk of regression.
[String changes made/needed]: None

Note, I don't think this patch is very speculative.  See the crash stack in comment 5 which is pretty clearly a nullptr de-ref on FF51b that is fixed by this patch.

Attachment #8826243 - Flags: approval-mozilla-beta?

Attachment #8826243 - Flags: approval-mozilla-aurora?

Liz Henry (:lizzard) (relman/hg->git project)

Comment 7

•

7 years ago

Ben are you going to land this soon so that it might get into m-c? The sheriffs do often handle some merges over the weekend. I will be checking back on uplifts tomorrow.

status-firefox51: --- → affected

status-firefox52: --- → affected

status-firefox53: --- → affected

Flags: needinfo?(bkelly)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 8

•

7 years ago

Gerry, just letting you know you might want this for tomorrow's RC build.

Flags: needinfo?(gchang)

Gerry Chang [:gchang]

Comment 9

•

7 years ago

Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Fix a crash. Beta51+ & Aurora52+. Should be in 51 RC.

Flags: needinfo?(gchang)

Attachment #8826243 - Flags: approval-mozilla-beta?

Attachment #8826243 - Flags: approval-mozilla-beta+

Attachment #8826243 - Flags: approval-mozilla-aurora?

Attachment #8826243 - Flags: approval-mozilla-aurora+

Carsten Book [:Tomcat]

Comment 10

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/f987327e5c36

status-firefox52: affected → fixed

Carsten Book [:Tomcat]

Comment 11

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/15467610e733

status-firefox51: affected → fixed

Carsten Book [:Tomcat]

Comment 12

•

7 years ago

ben, i landed this on gchangs request on beta and aurora. Does this need to land on trunk too ? (i guess so from the status flags)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 13

•

7 years ago

Sorry, I thought I landed this.  I think maybe the tree was closed when I tried to push and forgot to try again.  Landing now.

Flags: needinfo?(bkelly)

Pulsebot

Comment 14

•

7 years ago

Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b84a435cd2a
Try to handle SW job double-completion better. r=ehsan

Ben Kelly [:bkelly, not reviewing]

Assignee

Updated

•

7 years ago

Crash Signature: [@ mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished] → [@ mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished] [@ mozilla::dom::workers::ServiceWorkerJob::Finish ]

Phil Ringnalda (:philor)

Comment 15

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9b84a435cd2a

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox53: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla53

(no longer active)

Comment 16

•

7 years ago

(In reply to Ben Kelly [:bkelly] from comment #5)
> In the last 24 hours we've gotten 3 more reports that suggest nullptr of
> mFinalCallback:
> 
> https://crash-stats.mozilla.com/report/index/571939d5-08eb-4fa2-b654-
> 459832170113

Yeah this is definitely a clear case of a null deref.

> Is our nullptr de-ref working as expected in release?

What do you mean?

Flags: needinfo?(ehsan) → needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 17

•

7 years ago

I mean, are you sure things like PGO, other optimizations, or some kind of runtime config could be allowing us to traverse that nullptr vtable on windows release?  I feel like I see a lot of crash reports where it seems we have made it through a nullptr read.

Flags: needinfo?(bkelly)

(no longer active)

Comment 18

•

7 years ago

If the vtable pointer is null and we do try to make a virtual call, we have to dereference that address.  Barring that, there are only two cases that I can think of where something like what you explain can happen.

1. Somebody having mapped a readable page at address 0.  Technically we don't protect against that, but if some malicious code does that then null pointer derefs will return what's at the address.  This value isn't probably specific to the call site though, and will be some constant, and interpreting that as a vtable can have arbitrary results.  But the chances of you seeing something on the stack past that point that makes sense are slim.

2. The call being devirtualized by the optimizer, if the compiler can figure out the concrete type of the object at hand.  Here, the compiler can't know anything about mFinalCallback since it's set based on what the caller passed to Start(), so there's no way for it to devirtualize.  Just to be certain, I checked the code in Nightly and there's definitely a virtual call there:

  if (mFinalCallback) {
    mFinalCallback->JobFinished(this, aRv);
66B52860  mov         eax,dword ptr [ecx]  
  if (mFinalCallback) {
    mFinalCallback->JobFinished(this, aRv);
66B52862  push        ebx  
66B52863  push        esi  
66B52864  call        dword ptr [eax]

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 19

•

7 years ago

(In reply to :Ehsan Akhgari from comment #18)
> 2. The call being devirtualized by the optimizer, if the compiler can figure
> out the concrete type of the object at hand.  Here, the compiler can't know
> anything about mFinalCallback since it's set based on what the caller passed
> to Start(), so there's no way for it to devirtualize.  Just to be certain, I
> checked the code in Nightly and there's definitely a virtual call there:
> 
>   if (mFinalCallback) {
>     mFinalCallback->JobFinished(this, aRv);
> 66B52860  mov         eax,dword ptr [ecx]  
>   if (mFinalCallback) {
>     mFinalCallback->JobFinished(this, aRv);
> 66B52862  push        ebx  
> 66B52863  push        esi  
> 66B52864  call        dword ptr [eax]

Is this a windows PGO build?  Because mFinalCallback will only ever have a single concrete type set for it;  ServiceWorkerJobQueue::Callback.  It does seem optimizable to me.

(no longer active)

Comment 20

•

7 years ago

(In reply to Ben Kelly [:bkelly] from comment #19)
> (In reply to :Ehsan Akhgari from comment #18)
> > 2. The call being devirtualized by the optimizer, if the compiler can figure
> > out the concrete type of the object at hand.  Here, the compiler can't know
> > anything about mFinalCallback since it's set based on what the caller passed
> > to Start(), so there's no way for it to devirtualize.  Just to be certain, I
> > checked the code in Nightly and there's definitely a virtual call there:
> > 
> >   if (mFinalCallback) {
> >     mFinalCallback->JobFinished(this, aRv);
> > 66B52860  mov         eax,dword ptr [ecx]  
> >   if (mFinalCallback) {
> >     mFinalCallback->JobFinished(this, aRv);
> > 66B52862  push        ebx  
> > 66B52863  push        esi  
> > 66B52864  call        dword ptr [eax]
> 
> Is this a windows PGO build?

Yes, that's what Nightly is.

> Because mFinalCallback will only ever have a
> single concrete type set for it;  ServiceWorkerJobQueue::Callback.  It does
> seem optimizable to me.

Well, like I showed before, the code isn't optimized to devirtualize this call.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 21

•

7 years ago

(In reply to :Ehsan Akhgari from comment #20)
> Yes, that's what Nightly is.

You didn't tell me what platform you were looking at.  I thought you used a mac, so I wasn't sure if you were looking at the windows visual studio produced binary.

(no longer active)

Comment 22

•

7 years ago

(In reply to Ben Kelly [:bkelly] from comment #21)
> (In reply to :Ehsan Akhgari from comment #20)
> > Yes, that's what Nightly is.
> 
> You didn't tell me what platform you were looking at.  I thought you used a
> mac, so I wasn't sure if you were looking at the windows visual studio
> produced binary.

Oh no sorry, I was talking about Windows (since the crash happens on Windows it seems).