Intermittent extensions/cookie/test/browser_test_favicon.js | application crashed [@ nsPipe::AdvanceReadCursor(nsPipeReadState&, unsigned int)] after Assertion failure: aReadState.mAvailable >= aBytesRead

RESOLVED FIXED in Firefox 57

Status

()

defect
P5
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: intermittent-bug-filer, Assigned: bkelly)

Tracking

({crash, intermittent-failure})

unspecified
mozilla57
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 fixed)

Details

(crash signature)

Attachments

(2 attachments, 4 obsolete attachments)

I looked a bit at this and I'm trying to figure out how this assertion could be firing.  My best guess is that two different threads are trying to read the same pipe at the same time.

Maybe this is related to our new efforts to run OnDataAvailable off-main-thread?  Is it possible that we dispatch more than one ODA to a thread pool like StreamTransportService at the same time?

Dragana, what do you think?
Flags: needinfo?(dd.mozilla)
As a short-term mitigation, perhaps we should abort the stream if this assertion does not hold.
(In reply to Ben Kelly [:bkelly] from comment #1)
> I looked a bit at this and I'm trying to figure out how this assertion could
> be firing.  My best guess is that two different threads are trying to read
> the same pipe at the same time.
> 
> Maybe this is related to our new efforts to run OnDataAvailable
> off-main-thread?  Is it possible that we dispatch more than one ODA to a
> thread pool like StreamTransportService at the same time?
> 
> Dragana, what do you think?

I think we are only dispatching a new ODA after the previous us done.

Asking sc, he worked on the off-main-thread.
Flags: needinfo?(dd.mozilla) → needinfo?(schien)
From the backtrace, this is HTTP off-main-thread in chrome process (nsHttpChannel), so it is not using the content process HTTP OMT mechanism I introduced recently.

As @dragana said, new ODA is run after previous ODA callback is done. In addition, this case is ODA for imgRequest and Gecko only use single "ImageIO" thread to read the IO stream. It's not likely to hit the scenario @bkelly mentioned in comment #1.
Flags: needinfo?(schien)
I also noticed that assertion happens after "TEST-OK | extensions/cookie/test/browser_test_favicon.js | took 460ms".
https://treeherder.mozilla.org/logviewer.html#?job_id=128990395&repo=mozilla-inbound&lineNumber=28566

Could this indicate that the pipe is closed before the nsInputStreamReadyEvent is executed?
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #4)
> As @dragana said, new ODA is run after previous ODA callback is done. In
> addition, this case is ODA for imgRequest and Gecko only use single
> "ImageIO" thread to read the IO stream. It's not likely to hit the scenario
> @bkelly mentioned in comment #1.

Ok, thanks for verifying.  I was a bit scared thinking about that possibility.  I'll take a look at the pipe code a bit more here.
(Assignee)

Updated

2 years ago
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
(Assignee)

Updated

2 years ago
Component: Networking: Cookies → XPCOM
I'm going to add some assertions here to try to narrow down where the state is going bad.  We should leave this bug open in the meantime.
Keywords: leave-open
Hmm, or maybe we should just get rid of mAvailable.  Its duplicate data and can be computed from `(mReadLimit - mReadCursor)`.  If there is a bug in this book-keeping, lets just delete the book-keeping.
(Assignee)

Updated

2 years ago
Keywords: leave-open
(In reply to Ben Kelly [:bkelly] from comment #8)
> Hmm, or maybe we should just get rid of mAvailable.  Its duplicate data and
> can be computed from `(mReadLimit - mReadCursor)`.  If there is a bug in
> this book-keeping, lets just delete the book-keeping.

Actually, we can't do this because mAvailable is the total length of the pipe, not just the current segment length.

I think I might see the problem, though.

1. When we close the stream we call DrainInputStream() which sets mAvailable to 0.
2. We then call AdvanceReadSegment() as far as we to release the buffer memory. Usually this leaves us with mReadCursor == mReadLimit, but its not guaranteed.
3. When a Read() happens we use mReadCursor and mReadLimit to determine the length of data available.  If they are not equal then we can get a read longer than mAvailable thanks to step 1 setting it to zero.

I think the solution here is to null out mReadCursor and mReadLimit when we drain the stream.  We only drain when the stream is closed, so we pretty much want to avoid any further reads.
Comment on attachment 8905956 [details] [diff] [review]
P1 Convert existing nsPipe3.cpp MOZ_ASSERT's to use MOZ_DIAGNOSTIC_ASSERT. r=froydnj

This patch converts existing MOZ_ASSERT's to MOZ_DIAGNOSTIC_ASSERT instead.  As far as I could tell all of these were relatively cheap checks.  I think having some opt-build assertion checking is important here because pipe is used so frequently throughout the browser.
Attachment #8905956 - Flags: review?(nfroyd)
Comment on attachment 8905958 [details] [diff] [review]
P2 Convert nsPipe3.cpp NS_ASSERTION() calls to hard assertions. r=froydnj

This patch converts "soft" NS_ASSERTION() calls into "hard" MOZ_DIAGNOSTIC_ASSERT and MOZ_ASSERT calls.  I think it should be easier to do this now that we are not exposing pipe streams directly to legacy addons.
Attachment #8905958 - Flags: review?(nfroyd)
Comment on attachment 8905992 [details] [diff] [review]
P3 Clear nsPipeInputStream mReadCursoe and mReadLimit in DrainInputStream() to keep state consistent with mAvailable. r=froydnj

This patch fixes a potential cause for the crash in this bug.  The crash is triggering because we are advancing the read cursor by a number larger than mAvailable.  The original amount to read and move the cursor is derived from the current segment's (mReadLimit - mReadCursor).  So ultimately this bug is about mAvailable somehow getting set to a value inconsistent with those other pointers.

I believe the cause is that we were forcing mAvailable to zero when the stream was closed, but we did not necessarily clear the mReadLimit and mReadCursor.  The result is a low probability of an additional read call seeing inconsistent state.

The solution is to clear mReadLimit and mReadCursor when force mAvailable to zero.

There is another low-probably problem here in that someone writing to the pipe could then update the state again after this.  The next patch addresses that.
Attachment #8905992 - Flags: review?(nfroyd)
Comment on attachment 8905993 [details] [diff] [review]
P4 Delay removing nsPipeInputStream from nsPipe list until DrainInputStream() actually runs. r=froydnj

> There is another low-probably problem here in that someone writing to the
> pipe could then update the state again after this.  The next patch addresses
> that.

Sorry, this is not the lingering problem.  This can't happen because when the stream is closed we remove it from nsPipe::mInputList.  Since its not in the list it won't get updated any more.

The actual, problem, though is sometimes we don't call DrainInputStream immediately when its closed.  We wait until the current active read completes and then we drain it.

This creates a situation where an input stream is not in nsPipe::mInputList, but does hold active pointers to the central buffer.  There is some risk that a writer or pipe will release the segment being pointed to by the input stream leaving this pointers corrupt.

To remedy this situation I propose we only remove the stream of mInputList once DrainInputStream() actually runs successfully.  If its deferred then we delay removing the stream as well.  This ensures its state continues to be updated while its undrained.
Attachment #8905993 - Flags: review?(nfroyd)
(Assignee)

Updated

2 years ago
Blocks: 1395308
Comment hidden (Intermittent Failures Robot)
Comment on attachment 8905956 [details] [diff] [review]
P1 Convert existing nsPipe3.cpp MOZ_ASSERT's to use MOZ_DIAGNOSTIC_ASSERT. r=froydnj

Review of attachment 8905956 [details] [diff] [review]:
-----------------------------------------------------------------

Please hold off on committing this part, at least, until after 57 goes to beta, so if something does go wrong, people don't start panic'ing.
Attachment #8905956 - Flags: review?(nfroyd) → review+
Comment on attachment 8905958 [details] [diff] [review]
P2 Convert nsPipe3.cpp NS_ASSERTION() calls to hard assertions. r=froydnj

Review of attachment 8905958 [details] [diff] [review]:
-----------------------------------------------------------------

Same comments about 57 moving to beta apply here, too.

::: xpcom/io/nsPipe3.cpp
@@ +691,5 @@
>  
>  void
>  nsPipe::AdvanceReadCursor(nsPipeReadState& aReadState, uint32_t aBytesRead)
>  {
> +  MOZ_DIAGNOSTIC_ASSERT(aBytesRead);

Nit: aBytesRead > 0, please.

@@ +898,5 @@
>  
>  void
>  nsPipe::AdvanceWriteCursor(uint32_t aBytesWritten)
>  {
> +  MOZ_DIAGNOSTIC_ASSERT(aBytesWritten);

Nit: aBytesWritten > 0, please.
Attachment #8905958 - Flags: review?(nfroyd) → review+
Attachment #8905992 - Flags: review?(nfroyd) → review+
Comment on attachment 8905993 [details] [diff] [review]
P4 Delay removing nsPipeInputStream from nsPipe list until DrainInputStream() actually runs. r=froydnj

Review of attachment 8905993 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense to me.  Is it possible to write a test for this?

::: xpcom/io/nsPipe3.cpp
@@ +1015,5 @@
>  
>      bool needNotify = false;
>  
> +    // OnInputException() can drain the stream and remove it from
> +    // mInputList.  So iterate over a temp list instead.

Does something continue to hold the input stream alive for us if it is removed from the list?
Attachment #8905993 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #21)

> Makes sense to me.  Is it possible to write a test for this?

I think it would be tricky to reliably trigger.  You would have to schedule separate threads to try to perform the operations at the same time and get the locking to fall just right.
(In reply to Nathan Froyd [:froydnj] from comment #19)
> Please hold off on committing this part, at least, until after 57 goes to
> beta, so if something does go wrong, people don't start panic'ing.

I'll move P1 and P2 to a separate bug.  I'd like to get the other fixes in for FF57.
(Assignee)

Updated

2 years ago
Blocks: 1398942
(In reply to Nathan Froyd [:froydnj] from comment #21)
> ::: xpcom/io/nsPipe3.cpp
> @@ +1015,5 @@
> >  
> >      bool needNotify = false;
> >  
> > +    // OnInputException() can drain the stream and remove it from
> > +    // mInputList.  So iterate over a temp list instead.
> 
> Does something continue to hold the input stream alive for us if it is
> removed from the list?

Maybe yes, maybe no.  Depends on if something external is holding the nsIInputStream.
Inbound is closed, so lets do another try build with just these two patches while we wait:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cc7ed824a0fb47fa166e9547982fa37dc459bb7

Comment 28

2 years ago
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/59bb8f38be8d
P1 Clear nsPipeInputStream mReadCursoe and mReadLimit in DrainInputStream() to keep state consistent with mAvailable. r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/3e638bc2c4f3
P2 Delay removing nsPipeInputStream from nsPipe list until DrainInputStream() actually runs. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/59bb8f38be8d
https://hg.mozilla.org/mozilla-central/rev/3e638bc2c4f3
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.