Closed Bug 1354455 Opened 7 years ago Closed 7 years ago

Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla55

Project Flags:

Performance Impact

high

Tracking Flags:

Tracking

Status

firefox-esr52

---

unaffected

firefox53

---

unaffected

firefox54

---

unaffected

firefox55

---

fixed

People

(Reporter: intermittent-bug-filer, Assigned: schien)

References

Details

(Keywords: assertion, intermittent-failure, Whiteboard: [necko-active][PBg-HTTP-M3])

Attachments

(1 file)

Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread. 7 years ago Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) 59 bytes, text/x-review-board-request	mayhemer : review+	Details

Treeherder Bug Filer

Reporter

Description

•

7 years ago

treeherder

Filed by: cbook [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=89461691&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-macosx64-debug/autoland_yosemite_r7-debug_test-web-platform-tests-e10s-10-bm134-tests1-macosx-build301.txt.gz

Carsten Book [:Tomcat]

Comment 1

•

7 years ago

Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at /home/worker/workspace/build/src/netwerk/ipc/ChannelEventQueue.cpp:91
00:02:35     INFO - PID 1644 | #01: 2017-04-07 00:02:35.221 atos[1659:9302] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #02: 0x005eed9c (in XUL) + 44
00:02:35     INFO - PID 1644 | #03: 0x0012784c (in XUL) + 1596
00:02:35     INFO - PID 1644 | #04: 0x00125dff (in XUL) + 47
00:02:35     INFO - PID 1644 | #05: 0x006caec2 (in XUL) + 370
00:02:35     INFO - PID 1644 | #06: 0x0067daac (in XUL) + 60
00:02:35     INFO - PID 1644 | #07: 2017-04-07 00:02:35.442 atos[1661:9320] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #08: 2017-04-07 00:02:35.481 atos[1663:9334] Metadata.framework [Error]: couldn't get the client port
00:02:35     INFO - PID 1644 | #09: _pthread_body (in libsystem_pthread.dylib) + 131

Flags: needinfo?(schien)

Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | /html/semantics/document-metadata/the-base-element/base_href_specified.html | expected OK → Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 2

•

7 years ago

Might have the same root cause as bug 1351450, try reproducing it with http log.

Flags: needinfo?(schien)

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Updated

•

7 years ago

Assignee: nobody → schien

Whiteboard: [necko-active][PBg-HTTP-M3]

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Summary: Intermittent timeout base_href_specified.html | expected OK after 00:02:35 INFO - PID 1644 | Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91 → Intermittent Assertion failure: mEventQueue.IsEmpty() || (needResumeOnOtherThread || mSuspended || !!mForcedCount), at ChannelEventQueue.cpp:91

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

7 years ago

Blocks: 1351450

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

7 years ago

Blocks: 1361926

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 6

•

7 years ago

One possible execution sequence will trigger this assertion, which reveals a mutli-thread bug.

[on thread A]
// during event flushing, mFlushing == true
ChannelEventQueue::Suspend
ChannelEventQueue::PrependEvent
ChannelEventQueue::Resume (dispatch CompleteResume to thread B)
// previous event finished, try poll next event
      UniquePtr<ChannelEvent> event(TakeEvent());
      if (!event) {
        break;
      }
[on thread B]
ChannelEventQueue::CompleteResume
  mSuspended = false;
ChannelEventQueue::MaybeFlushQueue
  // consider no need to trigger flush because mFlushing == true.
[on thread A]
  mFlushing = false;
  ASSERT because there is still event in queue (at least one because of PrependEvent) but no following flush will be triggered.

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 7

•

7 years ago

The root cause is that there is a lock gap between |TakeEvent()| and |mFlush = false|, which will allows CompleteResume to be executed while queue flush is about to be finished.

Comment hidden (mozreview-request)

If HttpChannelChild::Cancel is triggered off main thread while flushing channel event queue,
CompleteResume might be executed while the flush is about to be finished. In this case, queue resumption
will not be able to trigger the second queue flush because the previous one is not finished yet.
Therefore, the HttpChannelChild::Cancel will be sitted in the queue without executing.

Review commit: https://reviewboard.mozilla.org/r/137062/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/137062/

Honza Bambas (:mayhemer)

Comment 9

•

7 years ago

mozreview-review

Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.

https://reviewboard.mozilla.org/r/137062/#review140632

::: netwerk/ipc/ChannelEventQueue.cpp
(Diff revision 1)
>  
> -  bool needResumeOnOtherThread = false;
> -  {
> -    // Don't allow event enqueued during flush to make sure all events
> -    // are run.
> -    ReentrantMonitorAutoEnter monitor(mRunningMonitor);

remove mRunningMonitor completely

::: netwerk/ipc/ChannelEventQueue.cpp:94
(Diff revision 1)
> +      }
> +      break;
>      }
> +
> +    event->Run();
>    }

please identify the while() { } closing '}' with "// while (true)" comment.

Attachment #8865388 - Flags: review?(honzab.moz) → review+

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 10

•

7 years ago

mozreview-review-reply

Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.

https://reviewboard.mozilla.org/r/137062/#review140632

> remove mRunningMonitor completely

mRunningMonitor is to ensure no two ChannelEvent can run at the same time. For example, OnStopRequest::Run() is invoked on main thread while ODA::Run() is not finished on parser thread. This is for addressing your comment on bug 1320744 Comment #59.

> please identify the while() { } closing '}' with "// while (true)" comment.

done.

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 11

•

7 years ago

see my reply about mRunningMonitor on comment #10.

Flags: needinfo?(honzab.moz)

Comment hidden (mozreview-request)

Comment on attachment 8865388 [details]
Bug 1354455 - avoid CompleteResume while FlushQueue is ran on other thread.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/137062/diff/1-2/

Pulsebot

Comment 13

•

7 years ago

Pushed by schien@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c19a3a38e2d7
avoid CompleteResume while FlushQueue is ran on other thread. r=mayhemer

Carsten Book [:Tomcat]

Comment 14

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c19a3a38e2d7

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Honza Bambas (:mayhemer)

Comment 15

•

7 years ago

(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #11)
> see my reply about mRunningMonitor on comment #10.

Ah.. it's still used in the header.  I was just hoping you finally found a solution w/o this rather expensive monitor.  I though that the mechanism as is would preserve sequencing just using the one lock we have and flags.

To be honest, I think there are still issues in this code.  When you need a lock _and_ a reentrant monitor, then something is probably wrong.  The lock is open too often and I honestly believe there is a chance that MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail.  I don't see a single line that would atomically protect entering FlushQueue on two threads while the winning thread has already set mFlushing.

Maybe I'm wrong, it's just that my guts are telling me this code is not right.

But go ahead, land the patch and we'll see.

Flags: needinfo?(honzab.moz)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox53: --- → unaffected

status-firefox54: --- → unaffected

status-firefox-esr52: --- → unaffected

Keywords: assertion

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 16

•

7 years ago

(In reply to Honza Bambas (:mayhemer) from comment #15)
> (In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment
> #11)
> > see my reply about mRunningMonitor on comment #10.
> 
> Ah.. it's still used in the header.  I was just hoping you finally found a
> solution w/o this rather expensive monitor.  I though that the mechanism as
> is would preserve sequencing just using the one lock we have and flags.
> 
> To be honest, I think there are still issues in this code.  When you need a
> lock _and_ a reentrant monitor, then something is probably wrong.  The lock
> is open too often and I honestly believe there is a chance that
> MOZ_ASSERT(!mFlushing); in ChannelEventQueue::FlushQueue() could fail.  I
> don't see a single line that would atomically protect entering FlushQueue on
> two threads while the winning thread has already set mFlushing.
> 
> Maybe I'm wrong, it's just that my guts are telling me this code is not
> right.
> 
> But go ahead, land the patch and we'll see.

Yeah agreed that the ChannelEventQueue is not bullet-proof right now. mRunningMonitor might be able to be replaced by an atomic variable since it is now only used for protecting multi access of RunOrEnqueue. FlushQueue is not well protected against consecutive groups of Suspend/PrependEvent/Resume operations, if the prepended events have different event target. We definitely need to define a clear queuing strategy against multi-thread access and reentrance.

Bug 1365149 is filed for following up these issues and improvements.

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

7 years ago

Blocks: 1363609

Jean Gong :jgong

Updated

•

7 years ago

Whiteboard: [necko-active][PBg-HTTP-M3] → [necko-active][PBg-HTTP-M3][qf:p1]

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

2 years ago

Performance Impact: --- → P1

Whiteboard: [necko-active][PBg-HTTP-M3][qf:p1] → [necko-active][PBg-HTTP-M3]

You need to log in before you can comment on or make changes to this bug.