Intermittent toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html | Correctly intercepted page content - got "Start Middle\n", expected "Start Middle\n End"
Categories
(WebExtensions :: General, defect, P3)
Tracking
(firefox-esr115 wontfix, firefox-esr128 wontfix, firefox138 wontfix, firefox139 wontfix, firefox140 fixed)
People
(Reporter: intermittent-bug-filer, Assigned: robwu)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: intermittent-failure, regression, Whiteboard: [addons-jira])
Attachments
(3 files)
Filed by: btara [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=296919334&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Zt4Z1hxxQYK5xIH0SmYNyw/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Zt4Z1hxxQYK5xIH0SmYNyw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[task 2020-04-09T09:24:34.911Z] 09:24:34 INFO - TEST-START | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html
[task 2020-04-09T09:24:34.996Z] 09:24:34 INFO - GECKO(2265) | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-04-09T09:24:35.010Z] 09:24:35 INFO - TEST-INFO | started process screencapture
[task 2020-04-09T09:24:35.177Z] 09:24:35 INFO - TEST-INFO | screencapture: exit 0
[task 2020-04-09T09:24:35.177Z] 09:24:35 INFO - Buffered messages logged at 09:24:34
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - add_task | Entering test
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - Extension loaded
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - Extension loaded
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - Buffered messages finished
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html | Correctly intercepted page content - got "Start Middle\n", expected "Start Middle\n End"
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - SimpleTest.is@SimpleTest/SimpleTest.js:383:14
[task 2020-04-09T09:24:35.178Z] 09:24:35 INFO - @toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html:82:14
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - async*nextTick/<@SimpleTest/SimpleTest.js:2090:34
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - nextTick@SimpleTest/SimpleTest.js:2115:11
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - setTimeout handler*SimpleTest_setTimeoutShim@SimpleTest/SimpleTest.js:788:41
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - add_task@SimpleTest/SimpleTest.js:2045:17
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - @toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html:18:9
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - add_task | Leaving test
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - GECKO(2265) | MEMORY STAT | vsize 6723MB | residentFast 92MB | heapAllocated 12MB
[task 2020-04-09T09:24:35.179Z] 09:24:35 INFO - TEST-OK | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html | took 118ms
Updated•5 years ago
|
Comment 1•5 years ago
|
||
Comment 2•5 years ago
|
||
It started to pop up on Windows too.
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=296958731&repo=autoland&lineNumber=9742
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 5•5 years ago
|
||
Comment 6•5 years ago
|
||
The test run in comment 5 is to see if this is a timing issue in the test itself.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 9•5 years ago
|
||
The failure appeared: https://treeherder.mozilla.org/#/jobs?repo=try&revision=27de46f16eff3f8d03eba9fc39e5db12a69068e3&selectedJob=297251812&group_state=expanded
Shane, should we disable the test for the time being?
Thank you.
Comment 10•5 years ago
|
||
Pernosco link: https://pernos.co/debug/igUdXYLFWmZBXz2g0FTaAw/index.html
I'll try to debug this tomorrow.
Comment 11•5 years ago
|
||
It looks like the issue happens when the second extension runs first.
The second extension is emitting data in two chunks (one with the page content, one with the extra appended data).
The first extension handles ondata once, prepending its content and then disconnecting. It doesn't wait for the second ondata to forward that through.
I think the first extension should do its prepending in onstart, directly copy all data in ondata, and then disconnect in onstop.
Comment 12•5 years ago
|
||
Actually, that seems like it should behave fine, since the first extension is disconnecting.
It looks like we get to StreamFilterParent::DoSendData with aData=" End" (the second chunk of data from second extension being received by the filter parent for first extension). At this point mState==Disconnecting, so we don't call SendData, and just drop the data.
It seems like a pretty bad problem if disconnecting can race with later packets of the response and cause them to be dropped.
Comment 13•5 years ago
|
||
I'm not entirely sure how to fix this.
We could post the data back to the IOThread to call BufferData with it, but it's possible in the mean time for another OnDataAvaible call to have arrived, seen mState==Disconnecting and also called BufferData ahead of us (and thus in the wrong order).
I suspect we need to route all data through the actor thread (which is the only thread where mState is non-racy) so that we can resolve these ordering races.
Any thoughts kmag?
Comment hidden (Intermittent Failures Robot) |
Comment 15•5 years ago
|
||
Updated•5 years ago
|
Comment 16•5 years ago
|
||
Updated•5 years ago
|
Comment 17•5 years ago
|
||
Given that this seems like a high probability of data loss or corruption and that it is not so uncommon to have a couple extensions doing filters, I think this bug needs to be a priority. I'm not certain how best to handle this issue, but it's probably also going to be related to bug 1629299.
Comment 18•5 years ago
|
||
(In reply to Shane Caraveo (:mixedpuppy) from comment #17)
Given that this seems like a high probability of data loss or corruption and that it is not so uncommon to have a couple extensions doing filters, I think this bug needs to be a priority. I'm not certain how best to handle this issue, but it's probably also going to be related to bug 1629299.
I think it could happen even with a single extension, it just needs a network response delivered in multiple OnDataAvailable calls, and an extension that uses disconnect() from the not-last one.
As for fixing this, we have mBufferedData and the two-stage+round-trip disconnect to prevent races due to the asynchronous connection between StreamFilterParent/Child. Unfortunately, we also have a second async step from IO thread to actor thread, which isn't handled.
One solution could be to extend the round-trip disconnect synchronization to go through the IO thread (and also the main thread somehow).
Another would be to unconditionally forward OnDataAvailable to the actor thread, and only make decisions on that thread. This might add some more overhead, but seems simpler to get right.
Either way, I think we want to more clearly mark which thread each member variable is valid on, and try to racy accesses.
Comment 19•5 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 22•5 years ago
|
||
I think we need to check for the Disconnecting state in the DoSendData callback, and buffer the data rather than dropping it if we are disconnecting.
Comment 23•5 years ago
|
||
@CosminS There were a few race fixes a month or so after this test was disabled, can we reenable the test for a bit and see if we get the intermittent back?
Comment 24•5 years ago
|
||
Comment 25•5 years ago
|
||
Just pushed a testonly change and will circle back next week to see if the failures are still around.
Updated•5 years ago
|
Comment 27•5 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment 29•5 years ago
|
||
Shane, there are 22 failures since the test was re-enabled, most on linux1804-64-asan: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-10-21&endday=2020-10-28&tree=trunk&bug=1628642
Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=320049341&repo=autoland&lineNumber=4756
[task 2020-10-28T19:06:25.301Z] 19:06:25 INFO - TEST-START | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html
[task 2020-10-28T19:06:25.760Z] 19:06:25 INFO - GECKO(3196) | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-10-28T19:06:25.788Z] 19:06:25 INFO - TEST-INFO | started process screentopng
[task 2020-10-28T19:06:26.108Z] 19:06:26 INFO - TEST-INFO | screentopng: exit 0
[task 2020-10-28T19:06:26.109Z] 19:06:26 INFO - Buffered messages logged at 19:06:25
[task 2020-10-28T19:06:26.109Z] 19:06:26 INFO - add_task | Entering test
[task 2020-10-28T19:06:26.109Z] 19:06:26 INFO - Extension loaded
[task 2020-10-28T19:06:26.109Z] 19:06:26 INFO - Extension loaded
[task 2020-10-28T19:06:26.109Z] 19:06:26 INFO - Buffered messages finished
[task 2020-10-28T19:06:26.110Z] 19:06:26 INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html | Correctly intercepted page content - got "Start Middle\n", expected "Start Middle\n End"
[task 2020-10-28T19:06:26.110Z] 19:06:26 INFO - SimpleTest.is@SimpleTest/SimpleTest.js:500:14
[task 2020-10-28T19:06:26.110Z] 19:06:26 INFO - @toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html:82:14
[task 2020-10-28T19:06:26.111Z] 19:06:26 INFO - add_task | Leaving test
[task 2020-10-28T19:06:26.112Z] 19:06:26 INFO - GECKO(3196) | MEMORY STAT | vsize 20974599MB | residentFast 861MB
[task 2020-10-28T19:06:26.112Z] 19:06:26 INFO - TEST-OK | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_multiple.html | took 580ms
[task 2020-10-28T19:06:26.113Z] 19:06:26 INFO - TEST-START | toolkit/components/extensions/test/mochitest/test_ext_streamfilter_processswitch.html
Could you have a look over them? Thank you.
Comment hidden (Intermittent Failures Robot) |
Comment 31•5 years ago
|
||
Ok, this will require some more investigation. I'm wondering if disabling it on linux and win7 (or maybe make it osx-only) would be sufficient to bring the numbers down low enough to be acceptable, that way at least the test is running.
Comment 32•5 years ago
|
||
Updated•5 years ago
|
Comment 33•5 years ago
|
||
Most failures are on opt builds and linux so made a disable patch for that to bring the failures down.
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2020-10-08&endday=2020-11-07&tree=trunk&bug=1628642
Comment hidden (Intermittent Failures Robot) |
Comment 35•5 years ago
|
||
Comment 36•5 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 79•4 months ago
|
||
I was looking into this, and found why this is happening. In short, data is dropped due to a race condition.
I added some logging (C++ side patch in bug 1765991, plus enabling logging in the test itself with https://hg-edge.mozilla.org/try/rev/4b27c4f993461b5e2ac0a0fd5547d5cbbb231c5c). The relevant log extract is below. I'll attach the patch which has more explanations.
HTML5 Parser]: D/StreamFilterParent StreamFilterParent::OnDataAvailable @p1 -- "End" data
Socket Thread]: D/StreamFilterParent StreamFilterParent::Destroy @p2
Socket Thread]: D/StreamFilterParent StreamFilterParent::RecvWrite @p1 mState=2 -- "Start "
HTML5 Parser]: D/StreamFilterParent StreamFilterParent::Write @p1 mState=2 -- "Start"
Socket Thread]: D/StreamFilterParent StreamFilterParent::RecvWrite @p1 mState=2 -- "Middle"
Socket Thread]: D/StreamFilterParent StreamFilterParent::RecvDisconnect @p1 mState=2
Socket Thread]: D/StreamFilterParent StreamFilterParent::DoSendData @p1 mState=5 -- "End" mState != 2 - dropped!
Assignee | ||
Comment 80•4 months ago
|
||
When state is Disconnecting, OnDataAvailable (on the IO thread) should
buffer data. Otherwise it dispatches DoSendData to the actor thread,
which supposedly sends data to the StreamFilterChild.
But because the Disconnecting state is set from the actor, it is
possible for a race condition to happen: If the Disconnecting state is
entered between the dispatch and the actual running of DoSendData(),
then before this patch the data would be dropped.
This patch fixes the issue by putting the data back at the front of the
queue, where it logically should have been.
Comment 82•4 months ago
|
||
The severity field for this bug is set to S4
. However, the following bug duplicate has higher severity:
- Bug 1644621: S3
:robwu, could you consider increasing the severity of this bug to S3
?
For more information, please visit BugBot documentation.
Assignee | ||
Updated•4 months ago
|
Assignee | ||
Updated•4 months ago
|
Comment 83•3 months ago
|
||
Comment 84•3 months ago
|
||
bugherder |
Updated•3 months ago
|
Description
•