If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Crash in mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished

RESOLVED FIXED in Firefox 51

Status

()

Core
DOM: Service Workers
--
critical
RESOLVED FIXED
9 months ago
8 months ago

People

(Reporter: bkelly, Assigned: bkelly)

Tracking

(Blocks: 1 bug, {crash})

unspecified
mozilla53
x86
Windows 10
crash
Points:
---

Firefox Tracking Flags

(firefox51 fixed, firefox52 fixed, firefox53 fixed)

Details

(crash signature)

Attachments

(1 attachment)

(Assignee)

Description

9 months ago
This bug was filed from the Socorro interface and is 
report bp-460ce510-1953-43fd-9be1-11a732170112.
=============================================================

Our double-completion checks in JobFinish() now trigger the release assertions in nsTArray.  We should change this to make an explicit IsEmpty() check.
(Assignee)

Comment 1

9 months ago
Actually, I think this is more about calling JobFinished() on a nullptr mFinalCallback object.
(Assignee)

Comment 2

9 months ago
Created attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Ehsan, given the crash stack in this bug I think we still have some trouble with double-completion of SW jobs.  This patch improves our protections by checking further up the completion call chain.  In this case I think the crash is from calling JobFinished() on a nullptr mFinalCallback for some reason.

I also added more diagnostic assertions to try to narrow down the problem.
Attachment #8826243 - Flags: review?(ehsan)
Hmm.  So there are only two crashes with this signature so far.  It's obvious that this isn't SafeElementAt failing.  But I also think it's clear this isn't a nullptr dereference crash, given that the crash address in comment 0 is 0x1612004 and the crash address in bp-aac6a324-b920-4a7c-8cef-4fe292170111 is 0xfffffffff35a5fb0 (nullptr deref crashes typically occur at an address close to 0, for example, accessing a member with offset N would crash on address N if |this| is nullptr.)  More specifically, since ServiceWorkerJobQueue::Callback::JobFinished is virtual, if mFinalCallback was 0 then the deref of the |this| pointer to get to the vtable would crash at address 0.  So I think there is something else going on here.
Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Review of attachment 8826243 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerJob.cpp
@@ +215,5 @@
>  
>    mState = State::Finished;
>  
> +  MOZ_DIAGNOSTIC_ASSERT(mFinalCallback);
> +  if (mFinalCallback) {

As I said before, I'm pretty sure this isn't going to solve the crash, but still the patch may uncover the bug we're chasing here.

FWIW I suggest looking at the crash dumps for this, let me know if you want me to download them for you.  But since we only have two of these crashes, it's up to you how much time you want to spend on debugging it...  :-)
Attachment #8826243 - Flags: review?(ehsan) → review+
(Assignee)

Comment 5

9 months ago
In the last 24 hours we've gotten 3 more reports that suggest nullptr of mFinalCallback:

https://crash-stats.mozilla.com/report/index/571939d5-08eb-4fa2-b654-459832170113

Is our nullptr de-ref working as expected in release?
Flags: needinfo?(ehsan)
(Assignee)

Comment 6

9 months ago
Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Approval Request Comment
[Feature/Bug causing the regression]: Service workers
[User impact if declined]: Low frequency crashes
[Is this code covered by automated tests?]: Its heavily tested, but our automation does not trigger this problem.
[Has the fix been verified in Nightly?]: It has not yet merged to nightly, but I'd like to flag now so the patch might make it into the beta RC next week.
[Needs manual test from QE? If yes, steps to reproduce]:  None
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Minimal risk
[Why is the change risky/not risky?]: This patch adds some additional state and nullptr checks.  It has a very low risk of regression.
[String changes made/needed]: None

Note, I don't think this patch is very speculative.  See the crash stack in comment 5 which is pretty clearly a nullptr de-ref on FF51b that is fixed by this patch.
Attachment #8826243 - Flags: approval-mozilla-beta?
Attachment #8826243 - Flags: approval-mozilla-aurora?
Ben are you going to land this soon so that it might get into m-c? The sheriffs do often handle some merges over the weekend. I will be checking back on uplifts tomorrow.
status-firefox51: --- → affected
status-firefox52: --- → affected
status-firefox53: --- → affected
Flags: needinfo?(bkelly)
Gerry, just letting you know you might want this for tomorrow's RC build.
Flags: needinfo?(gchang)

Comment 9

8 months ago
Comment on attachment 8826243 [details] [diff] [review]
Try to handle SW job double-completion better. r=ehsan

Fix a crash. Beta51+ & Aurora52+. Should be in 51 RC.
Flags: needinfo?(gchang)
Attachment #8826243 - Flags: approval-mozilla-beta?
Attachment #8826243 - Flags: approval-mozilla-beta+
Attachment #8826243 - Flags: approval-mozilla-aurora?
Attachment #8826243 - Flags: approval-mozilla-aurora+

Comment 10

8 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/f987327e5c36
status-firefox52: affected → fixed

Comment 11

8 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/15467610e733
status-firefox51: affected → fixed
ben, i landed this on gchangs request on beta and aurora. Does this need to land on trunk too ? (i guess so from the status flags)
(Assignee)

Comment 13

8 months ago
Sorry, I thought I landed this.  I think maybe the tree was closed when I tried to push and forgot to try again.  Landing now.
Flags: needinfo?(bkelly)

Comment 14

8 months ago
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b84a435cd2a
Try to handle SW job double-completion better. r=ehsan
(Assignee)

Updated

8 months ago
Crash Signature: [@ mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished] → [@ mozilla::dom::workers::ServiceWorkerJobQueue::JobFinished] [@ mozilla::dom::workers::ServiceWorkerJob::Finish ]

Comment 15

8 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9b84a435cd2a
Status: ASSIGNED → RESOLVED
Last Resolved: 8 months ago
status-firefox53: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
(In reply to Ben Kelly [:bkelly] from comment #5)
> In the last 24 hours we've gotten 3 more reports that suggest nullptr of
> mFinalCallback:
> 
> https://crash-stats.mozilla.com/report/index/571939d5-08eb-4fa2-b654-
> 459832170113

Yeah this is definitely a clear case of a null deref.

> Is our nullptr de-ref working as expected in release?

What do you mean?
Flags: needinfo?(ehsan) → needinfo?(bkelly)
(Assignee)

Comment 17

8 months ago
I mean, are you sure things like PGO, other optimizations, or some kind of runtime config could be allowing us to traverse that nullptr vtable on windows release?  I feel like I see a lot of crash reports where it seems we have made it through a nullptr read.
Flags: needinfo?(bkelly)
If the vtable pointer is null and we do try to make a virtual call, we have to dereference that address.  Barring that, there are only two cases that I can think of where something like what you explain can happen.

1. Somebody having mapped a readable page at address 0.  Technically we don't protect against that, but if some malicious code does that then null pointer derefs will return what's at the address.  This value isn't probably specific to the call site though, and will be some constant, and interpreting that as a vtable can have arbitrary results.  But the chances of you seeing something on the stack past that point that makes sense are slim.

2. The call being devirtualized by the optimizer, if the compiler can figure out the concrete type of the object at hand.  Here, the compiler can't know anything about mFinalCallback since it's set based on what the caller passed to Start(), so there's no way for it to devirtualize.  Just to be certain, I checked the code in Nightly and there's definitely a virtual call there:

  if (mFinalCallback) {
    mFinalCallback->JobFinished(this, aRv);
66B52860  mov         eax,dword ptr [ecx]  
  if (mFinalCallback) {
    mFinalCallback->JobFinished(this, aRv);
66B52862  push        ebx  
66B52863  push        esi  
66B52864  call        dword ptr [eax]
(Assignee)

Comment 19

8 months ago
(In reply to :Ehsan Akhgari from comment #18)
> 2. The call being devirtualized by the optimizer, if the compiler can figure
> out the concrete type of the object at hand.  Here, the compiler can't know
> anything about mFinalCallback since it's set based on what the caller passed
> to Start(), so there's no way for it to devirtualize.  Just to be certain, I
> checked the code in Nightly and there's definitely a virtual call there:
> 
>   if (mFinalCallback) {
>     mFinalCallback->JobFinished(this, aRv);
> 66B52860  mov         eax,dword ptr [ecx]  
>   if (mFinalCallback) {
>     mFinalCallback->JobFinished(this, aRv);
> 66B52862  push        ebx  
> 66B52863  push        esi  
> 66B52864  call        dword ptr [eax]

Is this a windows PGO build?  Because mFinalCallback will only ever have a single concrete type set for it;  ServiceWorkerJobQueue::Callback.  It does seem optimizable to me.
(In reply to Ben Kelly [:bkelly] from comment #19)
> (In reply to :Ehsan Akhgari from comment #18)
> > 2. The call being devirtualized by the optimizer, if the compiler can figure
> > out the concrete type of the object at hand.  Here, the compiler can't know
> > anything about mFinalCallback since it's set based on what the caller passed
> > to Start(), so there's no way for it to devirtualize.  Just to be certain, I
> > checked the code in Nightly and there's definitely a virtual call there:
> > 
> >   if (mFinalCallback) {
> >     mFinalCallback->JobFinished(this, aRv);
> > 66B52860  mov         eax,dword ptr [ecx]  
> >   if (mFinalCallback) {
> >     mFinalCallback->JobFinished(this, aRv);
> > 66B52862  push        ebx  
> > 66B52863  push        esi  
> > 66B52864  call        dword ptr [eax]
> 
> Is this a windows PGO build?

Yes, that's what Nightly is.

> Because mFinalCallback will only ever have a
> single concrete type set for it;  ServiceWorkerJobQueue::Callback.  It does
> seem optimizable to me.

Well, like I showed before, the code isn't optimized to devirtualize this call.
(Assignee)

Comment 21

8 months ago
(In reply to :Ehsan Akhgari from comment #20)
> Yes, that's what Nightly is.

You didn't tell me what platform you were looking at.  I thought you used a mac, so I wasn't sure if you were looking at the windows visual studio produced binary.
(In reply to Ben Kelly [:bkelly] from comment #21)
> (In reply to :Ehsan Akhgari from comment #20)
> > Yes, that's what Nightly is.
> 
> You didn't tell me what platform you were looking at.  I thought you used a
> mac, so I wasn't sure if you were looking at the windows visual studio
> produced binary.

Oh no sorry, I was talking about Windows (since the crash happens on Windows it seems).
You need to log in before you can comment on or make changes to this bug.