Closed
Bug 1354455
Opened 8 years ago
Closed 8 years ago
Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91
Categories
(Core :: Networking, defect)
Core
Networking
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | unaffected |
firefox55 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: schien)
References
Details
(Keywords: assertion, intermittent-failure, Whiteboard: [necko-active][PBg-HTTP-M3])
Attachments
(1 file)
Comment 1•8 years ago
|
||
Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at /home/worker/workspace/build/src/netwerk/ipc/ChannelEventQueue.cpp:91
00:02:35 INFO - PID 1644 | #01: 2017-04-07 00:02:35.221 atos[1659:9302] Metadata.framework [Error]: couldn't get the client port
00:02:35 INFO - PID 1644 | #02: 0x005eed9c (in XUL) + 44
00:02:35 INFO - PID 1644 | #03: 0x0012784c (in XUL) + 1596
00:02:35 INFO - PID 1644 | #04: 0x00125dff (in XUL) + 47
00:02:35 INFO - PID 1644 | #05: 0x006caec2 (in XUL) + 370
00:02:35 INFO - PID 1644 | #06: 0x0067daac (in XUL) + 60
00:02:35 INFO - PID 1644 | #07: 2017-04-07 00:02:35.442 atos[1661:9320] Metadata.framework [Error]: couldn't get the client port
00:02:35 INFO - PID 1644 | #08: 2017-04-07 00:02:35.481 atos[1663:9334] Metadata.framework [Error]: couldn't get the client port
00:02:35 INFO - PID 1644 | #09: _pthread_body (in libsystem_pthread.dylib) + 131
Flags: needinfo?(schien)
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | /html/semantics/document-metadata/the-base-element/base_href_specified.html | expected OK → Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91
Assignee | ||
Comment 2•8 years ago
|
||
Might have the same root cause as bug 1351450, try reproducing it with http log.
Flags: needinfo?(schien)
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → schien
Whiteboard: [necko-active][PBg-HTTP-M3]
Updated•8 years ago
|
Summary: Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91 → Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91
Assignee | ||
Comment 6•8 years ago
|
||
One possible execution sequence will trigger this assertion, which reveals a mutli-thread bug.
[on thread A]
// during event flushing, mFlushing == true
ChannelEventQueue::Suspend
ChannelEventQueue::PrependEvent
ChannelEventQueue::Resume (dispatch CompleteResume to thread B)
// previous event finished, try poll next event
UniquePtr<ChannelEvent> event(TakeEvent());
if (!event) {
break;
}
[on thread B]
ChannelEventQueue::CompleteResume
mSuspended = false;
ChannelEventQueue::MaybeFlushQueue
// consider no need to trigger flush because mFlushing == true.
[on thread A]
mFlushing = false;
ASSERT because there is still event in queue (at least one because of PrependEvent) but no following flush will be triggered.
Assignee | ||
Comment 7•8 years ago
|
||
The root cause is that there is a lock gap between |TakeEvent()| and |mFlush = false|, which will allows CompleteResume to be executed while queue flush is about to be finished.
Comment hidden (mozreview-request) |
Comment 9•8 years ago
|
||
mozreview-review |
Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.
https://reviewboard.mozilla.org/r/137062/#review140632
::: netwerk/ipc/ChannelEventQueue.cpp
(Diff revision 1)
>
> - bool needResumeOnOtherThread = false;
> - {
> - // Don't allow event enqueued during flush to make sure all events
> - // are run.
> - ReentrantMonitorAutoEnter monitor(mRunningMonitor);
remove mRunningMonitor completely
::: netwerk/ipc/ChannelEventQueue.cpp:94
(Diff revision 1)
> + }
> + break;
> }
> +
> + event->Run();
> }
please identify the while() { } closing '}' with "// while (true)" comment.
Attachment #8865388 -
Flags: review?(honzab.moz) → review+
Assignee | ||
Comment 10•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.
https://reviewboard.mozilla.org/r/137062/#review140632
> remove mRunningMonitor completely
mRunningMonitor is to ensure no two ChannelEvent can run at the same time. For example, OnStopRequest::Run() is invoked on main thread while ODA::Run() is not finished on parser thread. This is for addressing your comment on bug 1320744 Comment #59.
> please identify the while() { } closing '}' with "// while (true)" comment.
done.
Assignee | ||
Comment 11•8 years ago
|
||
see my reply about mRunningMonitor on comment #10.
Flags: needinfo?(honzab.moz)
Comment hidden (mozreview-request) |
Comment 13•8 years ago
|
||
Pushed by schien@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c19a3a38e2d7
avoid CompleteResume while FlushQueue is ran on other thread. r=mayhemer
Comment 14•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 15•8 years ago
|
||
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #11)
> see my reply about mRunningMonitor on comment #10.
Ah.. it's still used in the header. I was just hoping you finally found a solution w/o this rather expensive monitor. I though that the mechanism as is would preserve sequencing just using the one lock we have and flags.
To be honest, I think there are still issues in this code. When you need a lock _and_ a reentrant monitor, then something is probably wrong. The lock is open too often and I honestly believe there is a chance that MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail. I don't see a single line that would atomically protect entering FlushQueue on two threads while the winning thread has already set mFlushing.
Maybe I'm wrong, it's just that my guts are telling me this code is not right.
But go ahead, land the patch and we'll see.
Flags: needinfo?(honzab.moz)
Updated•8 years ago
|
status-firefox53:
--- → unaffected
status-firefox54:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Keywords: assertion
Assignee | ||
Comment 16•8 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #15)
> (In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment
> #11)
> > see my reply about mRunningMonitor on comment #10.
>
> Ah.. it's still used in the header. I was just hoping you finally found a
> solution w/o this rather expensive monitor. I though that the mechanism as
> is would preserve sequencing just using the one lock we have and flags.
>
> To be honest, I think there are still issues in this code. When you need a
> lock _and_ a reentrant monitor, then something is probably wrong. The lock
> is open too often and I honestly believe there is a chance that
> MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail. I
> don't see a single line that would atomically protect entering FlushQueue on
> two threads while the winning thread has already set mFlushing.
>
> Maybe I'm wrong, it's just that my guts are telling me this code is not
> right.
>
> But go ahead, land the patch and we'll see.
Yeah agreed that the ChannelEventQueue is not bullet-proof right now. mRunningMonitor might be able to be replaced by an atomic variable since it is now only used for protecting multi access of RunOrEnqueue. FlushQueue is not well protected against consecutive groups of Suspend/PrependEvent/Resume operations, if the prepended events have different event target. We definitely need to define a clear queuing strategy against multi-thread access and reentrance.
Bug 1365149 is filed for following up these issues and improvements.
Updated•7 years ago
|
Whiteboard: [necko-active][PBg-HTTP-M3] → [necko-active][PBg-HTTP-M3][qf:p1]
Updated•3 years ago
|
Performance Impact: --- → P1
Whiteboard: [necko-active][PBg-HTTP-M3][qf:p1] → [necko-active][PBg-HTTP-M3]
You need to log in
before you can comment on or make changes to this bug.
Description
•