Closed Bug 1381748 Opened 7 years ago Closed 7 years ago

Intermittent test_quit_restart.py TestServerQuitApplication.test_attempt_quit | application crashed [@ RefPtr<mozilla::dom::FetchBody<mozilla::dom::Response> >::operator->] after assertion failure: mRawPtr != nullptr

Categories

(Core :: DOM: Service Workers, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: baku)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(6 files)

Btw. there is an assertion:

Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315
[task 2017-07-16T15:30:27.407601Z] 15:30:27     INFO -  #01: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x19efa9c]
[task 2017-07-16T15:30:27.409850Z] 15:30:27     INFO -  #02: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x19f236a]
[task 2017-07-16T15:30:27.409918Z] 15:30:27     INFO -  #03: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x19f278a]
[task 2017-07-16T15:30:27.410567Z] 15:30:27     INFO -  #04: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x548ca5]
[task 2017-07-16T15:30:27.410642Z] 15:30:27     INFO -  #05: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x544663]
[task 2017-07-16T15:30:27.410723Z] 15:30:27     INFO -  #06: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x55f03c]
[task 2017-07-16T15:30:27.411567Z] 15:30:27     INFO -  #07: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x319e5a4]
[task 2017-07-16T15:30:27.412064Z] 15:30:27     INFO -  #08: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x319e5dd]
[task 2017-07-16T15:30:27.412778Z] 15:30:27     INFO -  #09: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x31a5962]
[task 2017-07-16T15:30:27.413453Z] 15:30:27     INFO -  #10: ???[/home/worker/workspace/build/application/firefox/libxul.so +0x31a5bd6]
[task 2017-07-16T15:30:27.414110Z] 15:30:27     INFO -  #11: ???[/home/worker/workspace/build/application/firefox/firefox +0x55fd]
[task 2017-07-16T15:30:27.414508Z] 15:30:27     INFO -  #12: ???[/home/worker/workspace/build/application/firefox/firefox +0x4e06]
[task 2017-07-16T15:30
Flags: needinfo?(amarchesini)
Summary: Intermittent test_quit_restart.py TestServerQuitApplication.test_attempt_quit | application crashed [@ RefPtr<mozilla::dom::FetchBody<mozilla::dom::Response> >::operator->] → Intermittent test_quit_restart.py TestServerQuitApplication.test_attempt_quit | application crashed [@ RefPtr<mozilla::dom::FetchBody<mozilla::dom::Response> >::operator->] after assertion failure: mRawPtr != nullptr
This has been fixed by bug 1380604.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(amarchesini)
Resolution: --- → DUPLICATE
No, it's not. The patch for bug 1380604 got merged to autoland on July 13th:
https://hg.mozilla.org/integration/autoland/rev/530b1eb0a548

But this failure is still present:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mn&bugfiler&fromchange=0ee162368b58416628490ee3cd97e3f7f6bf9b77&selectedJob=115465007
Status: RESOLVED → REOPENED
Flags: needinfo?(amarchesini)
Resolution: DUPLICATE → ---
 03:01:04     INFO -  1500433264831	Marionette	DEBUG	Received observer notification "xpcom-shutdown"
[task 2017-07-19T03:01:04.841559Z] 03:01:04     INFO -  1500433264831	Marionette	DEBUG	Received observer notification "xpcom-shutdown"
[task 2017-07-19T03:01:04.870654Z] 03:01:04     INFO -  [Parent 3444] WARNING: 'NS_FAILED(aStatus)', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 558
[task 2017-07-19T03:01:04.870752Z] 03:01:04     INFO -  [Parent 3444] WARNING: 'rv.Failed()', file /home/worker/workspace/build/src/dom/fetch/FetchConsumer.cpp, line 690

It's the same null dereference. Looks like the sync runnable dispatch at http://searchfox.org/mozilla-central/source/dom/fetch/FetchConsumer.cpp#689 can still fail during shutdown.
The issue here is that, on shutdown, the worker decides to do not create a sync event loop; we don't cleanup the main-thread part of FetchConsumer; FetchConsumer crashes using a null mBody pointer.

This first patch removes mBody from FetchConsumer. Note the real fix is in patch 3, but this is nice to have: only thread-safe objects are touched on main-thread if main-thread is not the target thread.
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
Attachment #8887910 - Flags: review?(bkelly)
Completely unrelated patch. I can move it to a separate bug if you prefer.
Attachment #8887911 - Flags: review?(bkelly)
Note that this patch is based on your work on control runnables. I don't use any control runnables!

The issue here is that, if the creation of the sync event loop fails, FetchConsumer releases resources also when on the main-thread they will be used, eventually.

The new approach doesn't release resources when the WorkerHolder is notified. But instead, it dispatches and event on the main-thread in order to start the shutting down.

I have identified 3 scenarios:

1. the beginning of the consuming is not started yet. In this case, AutoFailConsumeBody will end up calling ContinueConsumeBody() in BeginConsumeBodyMainThread().

2. the reading is running. The pump is canceled. OnStreamComplete() will be called. And from here we end up calling ContinueConsumeBody().

3. the reading is completed. ContinueConsumeBody() has been scheduled and it will run eventually.

In all these scenarios, the pump is nullified on the main-thread because, if it is created, we always run OnStreamComplete() - except for scenario 1.

ContinueConsumeBody() is called when the pump is always null and we can release resources. Nothing should run after it.
Attachment #8887916 - Flags: review?(bkelly)
Priority: -- → P1
Comment on attachment 8887910 [details] [diff] [review]
part 1 - Remove mBody

Redirecting to Catalin since I believe he was looking at this code recently and I may not have time to do this before next week.
Attachment #8887910 - Flags: review?(bkelly) → review?(catalin.badea392)
Attachment #8887911 - Flags: review?(bkelly) → review?(catalin.badea392)
Attachment #8887916 - Flags: review?(bkelly) → review?(catalin.badea392)
(In reply to Andrea Marchesini [:baku] from comment #7)
> Created attachment 8887916 [details] [diff] [review]
> part 3 - shutdown
> 
> Note that this patch is based on your work on control runnables. I don't use
> any control runnables!

Does this mean it depends on bug 1379243?  I was planning to land that friday after I got back home, but I suppose we could try to land sooner if needed.  Note, I need to remove that printf_stderr(), so I don't think we can just do a checkin-needed with the currently uploaded patches.
Comment on attachment 8887910 [details] [diff] [review]
part 1 - Remove mBody

Review of attachment 8887910 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8887910 - Flags: review?(catalin.badea392) → review+
Attachment #8887911 - Flags: review?(catalin.badea392) → review+
Comment on attachment 8887916 [details] [diff] [review]
part 3 - shutdown

Review of attachment 8887916 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Andrea Marchesini [:baku] from comment #7)
> Created attachment 8887916 [details] [diff] [review]
> part 3 - shutdown
> 
> Note that this patch is based on your work on control runnables. I don't use
> any control runnables!
AutoFailConsumeBody uses control runnables. Could you please explain what's the dependency
on the other bug?

::: dom/fetch/FetchConsumer.cpp
@@ +454,5 @@
>  FetchBodyConsumer<Derived>::BeginConsumeBodyMainThread()
>  {
>    AssertIsOnMainThread();
>  
> +  AutoFailConsumeBody<Derived> autoReject(this);

Is this guaranteed to succeed in dispatching the runnable by the worker holder? I'm thinking of the following case:
1. queue BeginConsumeBodyMainThread()
<long backlog on the main thread>
2. Worker gets terminated

3. BeginConsumeBodyMainThread gets executed. Can we still dispatch a runnable to release objects at this point?

@@ +661,2 @@
>  
> +    mMainThreadEventTarget->Dispatch(r.forget(), NS_DISPATCH_NORMAL);

Can this fail during browser shutdown?

The logs suggest that the previous sync runnable dispatch r->Dispatch(Killing, rv) can still fail. That's either because the worker is in  >=Killing state or because the main thread dispatch failed. But since we have a WorkerHolder preventing the shutdown, we shouldn't reach the Killing state, right?
(In reply to Ben Kelly [PTO, back July 24][:bkelly] from comment #9)
> > Note that this patch is based on your work on control runnables. I don't use
> > any control runnables!
> 
> Does this mean it depends on bug 1379243?

FWIW, I pushed bug 1379243 to inbound this morning.
Summary: Intermittent test_quit_restart.py TestServerQuitApplication.test_attempt_quit | application crashed [@ RefPtr<mozilla::dom::FetchBody<mozilla::dom::Response> >::operator->] after assertion failure: mRawPtr != nullptr → Intermittent test_quit_restart.py TestServerQuitApplication.test_attempt_quit | application crashed [@ RefPtr<mozilla::dom::FetchBody<mozilla::dom::Response> >::operator->] after assertion failure: mRawPtr != nullptr
Sorry I didn't see your comment.

> AutoFailConsumeBody uses control runnables. Could you please explain what's
> the dependency
> on the other bug?

bug 1379243 is about 'converting' normal runnables to ControlRunnables when a worker is shutting down. This helps to do the shutting down.

> <long backlog on the main thread>
> 2. Worker gets terminated
> 
> 3. BeginConsumeBodyMainThread gets executed. Can we still dispatch a
> runnable to release objects at this point?

The holder is not released, right? So this means that the it's still possible to dispatch runnables.
Using bug 1379243, the normal runnable will be converted to a control one and it should be dispatched.

> The logs suggest that the previous sync runnable dispatch
> r->Dispatch(Killing, rv) can still fail. That's either because the worker is
> in  >=Killing state or because the main thread dispatch failed. But since we
> have a WorkerHolder preventing the shutdown, we shouldn't reach the Killing
> state, right?

Worker fails in the creation of the sync event loop because it is shutting down.
But then we still release the WorkerHolder (see the RAII class), and the body. When BeginConsumeBodyMainThread runs, mBody is null, and we crash.

Here we don't create any sync event loop. We just keep the worker alive and we wait to receive a runnable (maybe a Control one).
Attachment #8887916 - Flags: review?(catalin.badea392) → review+
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2a8ceea666e9
Cleanup FetchConsumer workflow - part 1 - no mBody, r=catalinb
https://hg.mozilla.org/integration/mozilla-inbound/rev/067897f212ac
Cleanup FetchConsumer workflow - part 2 - cleanup RegisterWorkerHolder, r=catalinb
https://hg.mozilla.org/integration/mozilla-inbound/rev/da550c5c845f
Cleanup FetchConsumer workflow - part 3 - shutdown workflow, r=catalinb
sorry had to back this out for frequent failures on android like https://treeherder.mozilla.org/logviewer.html#?job_id=117908955&repo=mozilla-inbound starting with this push here
Flags: needinfo?(amarchesini)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/19f1250e37cd
Backed out changeset da550c5c845f 
https://hg.mozilla.org/integration/mozilla-inbound/rev/299ae6eb5ec1
Backed out changeset 067897f212ac 
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa20618aedb5
Backed out changeset 2a8ceea666e9 for frequent android failures in test_postMessage_override.html
That looks like an unrelated test failure.
It looks so, but somehow it exists only when those patches are in the tree.
Do we have an update here? This crash is still happening a lot.
Cătălin and Ben, is there someone else who could pick this up until Andrea is back?
Flags: needinfo?(catalin.badea392)
Flags: needinfo?(bkelly)
Andrew, see question in comment 24.
Flags: needinfo?(bkelly) → needinfo?(overholt)
(In reply to Henrik Skupin (:whimboo) from comment #24)
> Cătălin and Ben, is there someone else who could pick this up until Andrea
> is back?

Same answer as the other bug. Is this super-urgent?
Flags: needinfo?(overholt) → needinfo?(hskupin)
We are close to get warnings from OF but given comment 4 it should be a null pointer crash too. So depending the trend for the next days, we should be able to wait.
Flags: needinfo?(hskupin)
Andrea, do you have an update for us now that you are back? Since lately it seems to be a beta only failure. Thanks.
Flags: needinfo?(catalin.badea392)
Flags: needinfo?(amarchesini)
Flags: needinfo?(amarchesini)
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4206a9271660
Cleanup FetchConsumer workflow - part 1 - no mBody, r=catalinb
https://hg.mozilla.org/integration/mozilla-inbound/rev/30aeba7deb29
Cleanup FetchConsumer workflow - part 2 - cleanup RegisterWorkerHolder, r=catalinb
https://hg.mozilla.org/integration/mozilla-inbound/rev/c309a6360b2f
Cleanup FetchConsumer workflow - part 3 - shutdown workflow, r=catalinb
We'll eventually want to backport this to Beta, but probably not until early next week after it's had some bake time.
Is this ready for a Beta approval request? At this point, it wouldn't ship in a Beta until b11 next week.
Flags: needinfo?(amarchesini)
Attached patch m-b - part 1Splinter Review
Flags: needinfo?(amarchesini)
Attached patch m-b - part 2Splinter Review
Attached patch m-b - part 3Splinter Review
Attachment #8905995 - Attachment is patch: true
Did you mean to request approval on these too? :)
Flags: needinfo?(amarchesini)
Comment on attachment 8905995 [details] [diff] [review]
m-b - part 1

Approval Request Comment
[Feature/Bug causing the regression]: Fetch + workers
[User impact if declined]: a crash when fetch is running in a worker that is terminating. Race condition.
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: no but we don't have crashes on try. So I guess we can consider it verified.
[Needs manual test from QE? If yes, steps to reproduce]: none 
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: maybe...
[Why is the change risky/not risky?]: this is a rewrite of the canceling of the fetch-body consumption. It's a lot of code, but it's green on try, it fixes crashes, and it's important to have it on beta.
[String changes made/needed]: none
Flags: needinfo?(amarchesini)
Attachment #8905995 - Flags: approval-mozilla-beta?
Attachment #8905996 - Flags: approval-mozilla-beta?
Attachment #8905997 - Flags: approval-mozilla-beta?
Comment on attachment 8905995 [details] [diff] [review]
m-b - part 1

Big change but this should fix a lot of test failures/crashes.
Try builds look good so let's uplift it for beta 11 today.
Attachment #8905995 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8905996 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8905997 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
No longer depends on: 1375749
You need to log in before you can comment on or make changes to this bug.