Closed Bug 1354455 Opened 7 years ago Closed 7 years ago

Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Performance Impact high
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: schien)

References

Details

(Keywords: assertion, intermittent-failure, Whiteboard: [necko-active][PBg-HTTP-M3])

Attachments

(1 file)

Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at /home/worker/workspace/build/src/netwerk/ipc/ChannelEventQueue.cpp:91
00:02:35     INFO - PID 1644 | #01: 2017-04-07 00:02:35.221 atos[1659:9302] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #02: 0x005eed9c (in XUL) + 44
00:02:35     INFO - PID 1644 | #03: 0x0012784c (in XUL) + 1596
00:02:35     INFO - PID 1644 | #04: 0x00125dff (in XUL) + 47
00:02:35     INFO - PID 1644 | #05: 0x006caec2 (in XUL) + 370
00:02:35     INFO - PID 1644 | #06: 0x0067daac (in XUL) + 60
00:02:35     INFO - PID 1644 | #07: 2017-04-07 00:02:35.442 atos[1661:9320] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #08: 2017-04-07 00:02:35.481 atos[1663:9334] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #09: _pthread_body (in libsystem_pthread.dylib) + 131
Flags: needinfo?(schien)
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | /html/semantics/document-metadata/the-base-element/base_href_specified.html | expected OK → Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91
Might have the same root cause as bug 1351450, try reproducing it with http log.
Flags: needinfo?(schien)
Assignee: nobody → schien
Whiteboard: [necko-active][PBg-HTTP-M3]
Summary: Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91 → Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91
One possible execution sequence will trigger this assertion, which reveals a mutli-thread bug.

[on thread A]
// during event flushing, mFlushing == true
ChannelEventQueue::Suspend
ChannelEventQueue::PrependEvent
ChannelEventQueue::Resume (dispatch CompleteResume to thread B)
// previous event finished, try poll next event
      UniquePtr<ChannelEvent> event(TakeEvent());
      if (!event) {
        break;
      }
[on thread B]
ChannelEventQueue::CompleteResume
  mSuspended = false;
ChannelEventQueue::MaybeFlushQueue
  // consider no need to trigger flush because mFlushing == true.
[on thread A]
  mFlushing = false;
  ASSERT because there is still event in queue (at least one because of PrependEvent) but no following flush will be triggered.
The root cause is that there is a lock gap between |TakeEvent()| and |mFlush = false|, which will allows CompleteResume to be executed while queue flush is about to be finished.
Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.

https://reviewboard.mozilla.org/r/137062/#review140632

::: netwerk/ipc/ChannelEventQueue.cpp
(Diff revision 1)
>  
> -  bool needResumeOnOtherThread = false;
> -  {
> -    // Don't allow event enqueued during flush to make sure all events
> -    // are run.
> -    ReentrantMonitorAutoEnter monitor(mRunningMonitor);

remove mRunningMonitor completely

::: netwerk/ipc/ChannelEventQueue.cpp:94
(Diff revision 1)
> +      }
> +      break;
>      }
> +
> +    event->Run();
>    }

please identify the while() { } closing '}' with "// while (true)" comment.
Attachment #8865388 - Flags: review?(honzab.moz) → review+
Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.

https://reviewboard.mozilla.org/r/137062/#review140632

> remove mRunningMonitor completely

mRunningMonitor is to ensure no two ChannelEvent can run at the same time. For example, OnStopRequest::Run() is invoked on main thread while ODA::Run() is not finished on parser thread. This is for addressing your comment on bug 1320744 Comment #59.

> please identify the while() { } closing '}' with "// while (true)" comment.

done.
see my reply about mRunningMonitor on comment #10.
Flags: needinfo?(honzab.moz)
Pushed by schien@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c19a3a38e2d7
avoid CompleteResume while FlushQueue is ran on other thread. r=mayhemer
https://hg.mozilla.org/mozilla-central/rev/c19a3a38e2d7
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #11)
> see my reply about mRunningMonitor on comment #10.

Ah.. it's still used in the header.  I was just hoping you finally found a solution w/o this rather expensive monitor.  I though that the mechanism as is would preserve sequencing just using the one lock we have and flags.

To be honest, I think there are still issues in this code.  When you need a lock _and_ a reentrant monitor, then something is probably wrong.  The lock is open too often and I honestly believe there is a chance that MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail.  I don't see a single line that would atomically protect entering FlushQueue on two threads while the winning thread has already set mFlushing.

Maybe I'm wrong, it's just that my guts are telling me this code is not right.

But go ahead, land the patch and we'll see.
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #15)
> (In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment
> #11)
> > see my reply about mRunningMonitor on comment #10.
> 
> Ah.. it's still used in the header.  I was just hoping you finally found a
> solution w/o this rather expensive monitor.  I though that the mechanism as
> is would preserve sequencing just using the one lock we have and flags.
> 
> To be honest, I think there are still issues in this code.  When you need a
> lock _and_ a reentrant monitor, then something is probably wrong.  The lock
> is open too often and I honestly believe there is a chance that
> MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail.  I
> don't see a single line that would atomically protect entering FlushQueue on
> two threads while the winning thread has already set mFlushing.
> 
> Maybe I'm wrong, it's just that my guts are telling me this code is not
> right.
> 
> But go ahead, land the patch and we'll see.

Yeah agreed that the ChannelEventQueue is not bullet-proof right now. mRunningMonitor might be able to be replaced by an atomic variable since it is now only used for protecting multi access of RunOrEnqueue. FlushQueue is not well protected against consecutive groups of Suspend/PrependEvent/Resume operations, if the prepended events have different event target. We definitely need to define a clear queuing strategy against multi-thread access and reentrance.

Bug 1365149 is filed for following up these issues and improvements.
Whiteboard: [necko-active][PBg-HTTP-M3] → [necko-active][PBg-HTTP-M3][qf:p1]
Performance Impact: --- → P1
Whiteboard: [necko-active][PBg-HTTP-M3][qf:p1] → [necko-active][PBg-HTTP-M3]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: