Closed Bug 1274818 Opened 3 years ago Closed 3 years ago

Assertion failure: !mOnStartRequestCalled (We should not call OsStartRequest twice), panning on vfrmap.com

Categories

(Core :: Networking: HTTP, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox49 --- affected
firefox50 --- fixed

People

(Reporter: dbaron, Assigned: mayhemer)

References

()

Details

(Keywords: assertion, crash, Whiteboard: [necko-active])

Attachments

(4 files, 1 obsolete file)

Steps to reproduce:
 1. load http://vfrmap.com/
 2. Zoom in to the San Francisco bay area, zoomed in enough that you can't see San Francisco and San Jose at the same time (i.e., to where the map switches to a higher resolution version)
 3. pan around a bit

Actual results:
Crash due to fatal assertion failure:
Assertion failure: !mOnStartRequestCalled (We should not call OsStartRequest twice), at netwerk/protocol/http/nsHttpChannel.cpp:1030

I reproduced this crash twice just now (the second time in a slightly newer build than the first -- the second was in a debug build from mozilla-central that's about 24 hours old).
Attached file gdb session
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Whiteboard: [necko-active]
I was trying to reproduce this but no luck. To figure out what is going on I will need to know when OnStartRequest was called first time. It is probably a timing issue therefore I can no reproduce it.

Can you try to reproduce this with logging? hopefully it will save enough data.
Flags: needinfo?(dbaron)
What sort of logging?
Flags: needinfo?(dbaron)
Have you been able to reproduce this?
Thanks.
Flags: needinfo?(dbaron)
Attached file nspr.logs.tar.bz2
After the fatal assertion, I attached to both parent and child in gdb and called PR_LogFlush() to flush out the unwritten parts, but that only had an effect in the parent.
Flags: needinfo?(dbaron)
Thanks a lot for the log. 

We check cache entry, it is not complete, so we decide to read/write concurrently (https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#3478). We start reading from cache, e.g. we call OnStart, OnData, and then we figured out that we do not have complete data so we set up a new transaction:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#6364

After this the channel gets canceled and we end up calling OnStart again.

2016-07-07 13:56:08.578322 UTC - [Main Thread]: V/nsHttp nsHttpChannel::OnStopRequest [this=52250b0 request=5053300 status=0]
2016-07-07 13:56:08.578331 UTC - [Main Thread]: V/nsHttp   concurrent cache entry write has been interrupted
2016-07-07 13:56:08.578350 UTC - [Main Thread]: V/nsHttp nsHttpChannel::SetupTransaction [this=52250b0]
2016-07-07 13:56:08.578424 UTC - [Main Thread]: D/nsHttp Creating nsHttpTransaction @648aac0
2016-07-07 13:56:08.578430 UTC - [Main Thread]: V/nsHttp nsHttpChannel 52250b0 created nsHttpTransaction 648aac0
2016-07-07 13:56:08.578439 UTC - [Main Thread]: D/nsHttp nsHttpTransaction::Init [this=648aac0 caps=21]
....

2016-07-07 13:56:08.584577 UTC - [Main Thread]: V/nsHttp nsHttpChannel::Cancel [this=52250b0 status=804b0002]
2016-07-07 13:56:08.584584 UTC - [Main Thread]: V/nsHttp nsHttpConnectionMgr::CancelTransaction [trans=648aac0 reason=804b0002]
2016-07-07 13:56:08.584619 UTC - [Main Thread]: D/nsSocketTransport STS dispatch [3fc30c0]
2016-07-07 13:56:08.584638 UTC - [Main Thread]: D/nsSocketTransport PollableEvent::Signal
2016-07-07 13:56:08.584666 UTC - [Main Thread]: D/nsSocketTransport PollableEvent::Signal PR_Write 1
2016-07-07 13:56:08.584676 UTC - [Main Thread]: D/nsStreamPump nsInputStreamPump::Cancel [this=3b86030 status=804b0002]
...
2016-07-07 13:56:08.591960 UTC - [Main Thread]: V/nsHttp nsHttpChannel::OnStartRequest [this=52250b0 request=3b86030 status=804b0002]
2016-07-07 13:56:08.591964 UTC - [Main Thread]: V/nsHttp nsHttpChannel::ProcessFallback [this=52250b0]
2016-07-07 13:56:08.591967 UTC - [Main Thread]: V/nsHttp   choosing not to fallback [0,,0]
2016-07-07 13:56:08.591970 UTC - [Main Thread]: V/nsHttp   calling mListener->OnStartRequest
Attached patch bug_1274818_v1.patch (obsolete) — Splinter Review
Attachment #8769763 - Flags: review?(honzab.moz)
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #6)
> Created attachment 8768790 [details]
> nspr.logs.tar.bz2
> 
> After the fatal assertion, I attached to both parent and child in gdb and
> called PR_LogFlush() to flush out the unwritten parts, but that only had an
> effect in the parent.

It probably no longer works.  We have switched to a different logging code few months ago (resides in XPCOM now).  'sync' module is suggested when hunting for crashes.
Comment on attachment 8769763 [details] [diff] [review]
bug_1274818_v1.patch

Review of attachment 8769763 [details] [diff] [review]:
-----------------------------------------------------------------

good catch.  I somehow thought there were a logic preventing this.  I wrote a test that is capable of reproducing the problem.  But the patch doesn't fix it correctly.  I will take the bug and fix it.

Dragana, thanks for the analyzes!
Attachment #8769763 - Flags: review?(honzab.moz)
Assignee: dd.mozilla → honzab.moz
Dragana, this is mostly similar to your patch, with omission of bypass of OnStartRequest.  I left (a slightly modified) bypass of CallOnStartRequest only, which may be triggered during the network read phase of an interrupted cache concurrent read in following cases:

- response is somehow broken (unexpected range in a response header) etc
- channel is canceled during that phase (what David experienced)

There is one more check added for a 206 response code to the concurrent range net load.  If it's not 206, we must not continue processing the response, since it would just lead to a broken content delivery!

I also added tests to exercise (hopefully) all the new code paths.

(no try: try is CLOSED! Reason: Buildbot DB issues (bug 1286942))
Attachment #8769763 - Attachment is obsolete: true
Attachment #8771507 - Flags: review?(dd.mozilla)
Comment on attachment 8771507 [details] [diff] [review]
v2 (bypass CallOnStartRequest for the second time)

Review of attachment 8771507 [details] [diff] [review]:
-----------------------------------------------------------------

thanks.

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +1049,5 @@
>          if (NS_FAILED(rv))
>              return rv;
>      } else {
>          NS_WARNING("OnStartRequest skipped because of null listener");
> +        mOnStartRequestCalled = true;

why don't you move this out of if?
Attachment #8771507 - Flags: review?(dd.mozilla) → review+
(In reply to Dragana Damjanovic [:dragana] from comment #13)
> Comment on attachment 8771507 [details] [diff] [review]
> v2 (bypass CallOnStartRequest for the second time)
> 
> Review of attachment 8771507 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> thanks.
> 
> ::: netwerk/protocol/http/nsHttpChannel.cpp
> @@ +1049,5 @@
> >          if (NS_FAILED(rv))
> >              return rv;
> >      } else {
> >          NS_WARNING("OnStartRequest skipped because of null listener");
> > +        mOnStartRequestCalled = true;
> 
> why don't you move this out of if?

Because I want this flag be set to true even for case when OnStartRequest returns a failure code but not sooner than it's actually been called.  Hence it's here (duplicated).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2337a0557fdccf1e6f0a9cd81e125705f4e07bb

The "browser_webapi_install.js | Wrong error message for invalid download url - Got false, expected true" perma fail is coming from the base commit (fixed with m-c/0fbdcd21fad7)
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c0503b9ac96a
Early return from nsHttpChannel::CallOnStartRequest on its second call, r=dragana
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c0503b9ac96a
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
You need to log in before you can comment on or make changes to this bug.