Closed Bug 1392841 Opened 7 years ago Closed 6 years ago

Enable RCWN

Categories

(Core :: Networking: Cache, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox55 --- unaffected
firefox56 --- unaffected
firefox57 --- unaffected
firefox58 --- unaffected
firefox59 --- fixed

People

(Reporter: michal, Assigned: michal)

References

(Depends on 2 open bugs, Blocks 2 open bugs)

Details

(Whiteboard: [necko-active])

Attachments

(2 files)

We should enable RCWN on Nightly permanently. We didn't experience crashes during second shield study (bug 1381816).

Telemetry from 2017/07/27 to 2017/07/31 shows that average time saved by racing was 5.82s while the median was 150ms. We can estimate the added bandwidth usage by racing as (NETWORK_RACE_CACHE_BANDWIDTH_RACE_NETWORK_WIN * cache_hit_rate + NETWORK_RACE_CACHE_BANDWIDTH_RACE_CACHE_WIN) which is around 0.53%.
Blocks: RCWN
Attached patch patchSplinter Review
Attachment #8900147 - Flags: review?(valentin.gosu)
Attachment #8900147 - Flags: review?(valentin.gosu) → review+
Push which ran many failing tests: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=d97a8d7abdd65ef67038a6b139914b296f736a1c&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

XPCshell failure: https://treeherder.mozilla.org/logviewer.html#?job_id=125259167&repo=mozilla-inbound

[task 2017-08-23T18:03:00.949496Z] 18:03:00     INFO -  "CONSOLE_MESSAGE: (info) 1503511379414	addons.xpi	DEBUG	Download started for http://localhost:40510/broken.xpi to file /tmp/tmp-u6l.xpi"
[task 2017-08-23T18:03:00.954611Z] 18:03:00     INFO -  "CONSOLE_MESSAGE: (info) 1503511379416	addons.xpi	DEBUG	Download of http://localhost:40510/broken.xpi completed."
[task 2017-08-23T18:03:00.967501Z] 18:03:00     INFO -  "CONSOLE_MESSAGE: (info) 1503511379417	addons.xpi	WARN	Download of http://localhost:40510/broken.xpi failed: 404 Not Found"
[task 2017-08-23T18:03:00.969380Z] 18:03:00     INFO -  "CONSOLE_MESSAGE: (info) 1503511379418	addons.xpi	DEBUG	downloadFailed: removing temp file for http://localhost:40510/broken.xpi"
[task 2017-08-23T18:03:00.971235Z] 18:03:00     INFO -  "CONSOLE_MESSAGE: (info) 1503511379419	addons.xpi	DEBUG	removeTemporaryFile: http://localhost:40510/broken.xpi removing temp file /tmp/tmp-u6l.xpi"
[task 2017-08-23T18:03:00.972757Z] 18:03:00     INFO -  PID 13027 | 1503511379514	addons.xpi	DEBUG	Download started for http://localhost:40510/broken.xpi to file /tmp/tmp-379.xpi
[task 2017-08-23T18:03:00.974535Z] 18:03:00     INFO -  PID 13027 | 1503511379517	addons.xpi	DEBUG	Download of http://localhost:40510/broken.xpi completed.
[task 2017-08-23T18:03:00.976249Z] 18:03:00     INFO -  PID 13027 | 1503511379517	addons.xpi	WARN	Download of http://localhost:40510/broken.xpi failed: 404 Not Found
[task 2017-08-23T18:03:00.978254Z] 18:03:00  WARNING -  TEST-UNEXPECTED-FAIL | xpcshell-unpack.ini:toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js | onDownloadFailed - [onDownloadFailed : 118] false == true
[task 2017-08-23T18:03:00.985487Z] 18:03:00     INFO -  /home/worker/workspace/build/tests/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js:onDownloadFailed:118
[task 2017-08-23T18:03:00.985859Z] 18:03:00     INFO -  resource://gre/modules/AddonManager.jsm:callInstallListeners:1640
[task 2017-08-23T18:03:00.987281Z] 18:03:00     INFO -  resource://gre/modules/AddonManager.jsm:callInstallListeners:3160
[task 2017-08-23T18:03:00.987728Z] 18:03:00     INFO -  resource://gre/modules/addons/XPIInstall.jsm:downloadFailed:2515
[task 2017-08-23T18:03:00.989306Z] 18:03:00     INFO -  resource://gre/modules/addons/XPIInstall.jsm:onStopRequest:2491
[task 2017-08-23T18:03:00.989645Z] 18:03:00     INFO -  /home/worker/workspace/build/tests/xpcshell/head.js:_do_main:221
[task 2017-08-23T18:03:00.990964Z] 18:03:00     INFO -  /home/worker/workspace/build/tests/xpcshell/head.js:_execute_test:544
[task 2017-08-23T18:03:00.991237Z] 18:03:00     INFO -  -e:null:1
[task 2017-08-23T18:03:00.991504Z] 18:03:00     INFO -  exiting test

Failure log chrome: https://treeherder.mozilla.org/logviewer.html#?job_id=125259675&repo=mozilla-inbound
[task 2017-08-23T17:59:22.139102Z] 17:59:22     INFO - TEST-START | dom/xul/test/test_bug1290965.xul
[task 2017-08-23T17:59:22.227824Z] 17:59:22     INFO - GECKO(3358) | ++DOMWINDOW == 51 (0x7f98e572b000) [pid = 3358] [serial = 53] [outer = 0x7f98e018f000]
[task 2017-08-23T17:59:22.376994Z] 17:59:22     INFO - GECKO(3358) | [Parent 3358] ###!!! ASSERTION: mListener is null: 'Not Reached', file /home/worker/workspace/build/src/netwerk/protocol/http/nsHttpChannel.cpp, line 7126
[task 2017-08-23T17:59:22.378812Z] 17:59:22     INFO - GECKO(3358) | #01: nsInputStreamPump::OnStateStart [netwerk/base/nsInputStreamPump.cpp:536]
[task 2017-08-23T17:59:22.380274Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.382976Z] 17:59:22     INFO - GECKO(3358) | #02: nsInputStreamPump::OnInputStreamReady [netwerk/base/nsInputStreamPump.cpp:440]
[task 2017-08-23T17:59:22.384455Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.385932Z] 17:59:22     INFO - GECKO(3358) | #03: nsInputStreamReadyEvent::Run [xpcom/io/nsStreamUtils.cpp:99]
[task 2017-08-23T17:59:22.387685Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.389879Z] 17:59:22     INFO - GECKO(3358) | #04: nsThread::ProcessNextEvent [mfbt/Maybe.h:445]
[task 2017-08-23T17:59:22.390881Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.392519Z] 17:59:22     INFO - GECKO(3358) | #05: NS_ProcessNextEvent [xpcom/threads/nsThreadUtils.cpp:521]
[task 2017-08-23T17:59:22.393877Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.395165Z] 17:59:22     INFO - GECKO(3358) | #06: mozilla::ipc::MessagePump::Run [ipc/glue/MessagePump.cpp:98]
[task 2017-08-23T17:59:22.396863Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.398106Z] 17:59:22     INFO - GECKO(3358) | #07: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:327]
[task 2017-08-23T17:59:22.399764Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.401327Z] 17:59:22     INFO - GECKO(3358) | #08: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:298]
[task 2017-08-23T17:59:22.404294Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.405379Z] 17:59:22     INFO - GECKO(3358) | #09: nsBaseAppShell::Run [widget/nsBaseAppShell.cpp:160]
[task 2017-08-23T17:59:22.406226Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.407240Z] 17:59:22     INFO - GECKO(3358) | #10: nsAppStartup::Run [toolkit/components/startup/nsAppStartup.cpp:289]
[task 2017-08-23T17:59:22.408640Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.409660Z] 17:59:22     INFO - GECKO(3358) | #11: XREMain::XRE_mainRun [toolkit/xre/nsAppRunner.cpp:4647]
[task 2017-08-23T17:59:22.412159Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.413253Z] 17:59:22     INFO - GECKO(3358) | #12: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4810]
[task 2017-08-23T17:59:22.414086Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.415260Z] 17:59:22     INFO - GECKO(3358) | #13: XRE_main [toolkit/xre/nsAppRunner.cpp:4905]
[task 2017-08-23T17:59:22.416099Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.419344Z] 17:59:22     INFO - GECKO(3358) | #14: do_main [browser/app/nsBrowserApp.cpp:236]
[task 2017-08-23T17:59:22.420202Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.421649Z] 17:59:22     INFO - GECKO(3358) | #15: main [browser/app/nsBrowserApp.cpp:311]
[task 2017-08-23T17:59:22.422524Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.423512Z] 17:59:22     INFO - GECKO(3358) | #16: libc.so.6 + 0x20830
[task 2017-08-23T17:59:22.427038Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.428096Z] 17:59:22     INFO - GECKO(3358) | #17: _start
[task 2017-08-23T17:59:22.429383Z] 17:59:22     INFO - 
[task 2017-08-23T17:59:22.430390Z] 17:59:22     INFO - GECKO(3358) | MEMORY STAT | vsize 2042MB | residentFast 250MB | heapAllocated 100MB
[task 2017-08-23T17:59:22.527846Z] 17:59:22     INFO - TEST-OK | dom/xul/test/test_bug1290965.xul | took 385ms
[task 2017-08-23T17:59:22.612519Z] 17:59:22     INFO - GECKO(3358) | ++DOMWINDOW == 52 (0x7f98ec811800) [pid = 3358] [serial = 54] [outer = 0x7f98e018f000]
[task 2017-08-23T17:59:22.670892Z] 17:59:22     INFO - GECKO(3358) | ++DOCSHELL 0x7fad2da1b800 == 1 [pid = 3460] [id = {d647c191-9278-4b5a-a6d2-687b6ee453fa}]
[task 2017-08-23T17:59:22.748548Z] 17:59:22     INFO - TEST-UNEXPECTED-FAIL | dom/xul/test/test_bug1290965.xul | assertion count 1 is more than expected 0 assertions
Flags: needinfo?(michal.novotny)
Backout by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/30ec306c75d9
Backed out changeset d4fba7cb9d4a for failing xpcshell tests, e.g.test_backgroundupdate.js and assertions in chrome tests, e.g. test_bug1290965.xul. r=backout on a CLOSED TREE
Depends on: 1384510
Should we try turning it on again?
Try push looks mostly good after fixing bug 1395121 and bug 1384478. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=a45a0747f38d31c9d3fa4c80bb1802ed85e86c5c.

I don't understand the wpt8 failure (css3-text-line-break-baspglwj-022.html). I'll do another try push and try to reproduce this locally to verify this has nothing to do with RCWN.

There is also this crash: https://treeherder.mozilla.org/logviewer.html#?job_id=127498610&repo=try&lineNumber=5285, but this crash signature is also in crash-stats and it's not from RCWN experiment days, so I assume this is not RCWN related.
Flags: needinfo?(michal.novotny)
This is already known intermittent failure, why do you think it is caused by this push?

https://bugzilla.mozilla.org/show_bug.cgi?id=1238222
Flags: needinfo?(michal.novotny) → needinfo?(wkocher)
Before the backout, these perf regressions were noticed and also one big improvement:

== Change summary for alert #9249 (as of September 06 2017 12:53 UTC) ==

Regressions:

 25%  tp6_facebook summary windows7-32 pgo e10s     245.79 -> 306.62
 22%  tp6_facebook summary windows10-64 pgo e10s    259.58 -> 315.79
 13%  tp6_facebook summary windows10-64 opt e10s    331.31 -> 374.21

Improvements:

 27%  sessionrestore_many_windows windows10-64 pgo e10s     3,637.25 -> 2,643.33

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9249
And these perf changes appeared after the backout:

== Change summary for alert #9270 (as of September 06 2017 23:48 UTC) ==

Regressions:

 38%  sessionrestore_many_windows windows10-64 pgo e10s     2,643.62 -> 3,641.56

Improvements:

 18%  tp6_facebook summary windows7-32 opt e10s     376.42 -> 310.25
 16%  tp6_facebook summary windows10-64 opt 1_thread e10s389.46 -> 325.46
 15%  tp6_facebook summary windows10-64 opt e10s    376.75 -> 320.29

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9270
Can we run this locally to try to find the cause of regression?
Flags: needinfo?(ionut.goldan)
Yes, you should be able to run this locally using | mach talos-tests --suite tp6-e10s |.
Flags: needinfo?(ionut.goldan)
(In reply to Michal Novotny (:michal) from comment #10)
> This is already known intermittent failure, why do you think it is caused by
> this push?
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=1238222

The intermittent failure has a different number of differing pixels, and https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&selectedJob=128936629&group_state=expanded suggests that your patch started this failure and the failure stopped with the backout of this patch.
Flags: needinfo?(wkocher)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #15)
> Yes, you should be able to run this locally using | mach talos-tests --suite
> tp6-e10s |.

I managed to run the test locally, and indeed I observed a regression.

I ran the following:

./mach talos-test --setpref network.http.rcwn.enabled=true -a tp6_facebook
run-tests - Wall time: 421s; CPU: 23.0%; Read bytes: 19769344; Write bytes: 156924416; Read time: 17756; Write time: 863206

./mach talos-test -a tp6_facebook
run-tests - Wall time: 403s; CPU: 22.0%; Read bytes: 21012480; Write bytes: 134531072; Read time: 31607; Write time: 739350

Figuring out why this happens seems a bit tricky though.
(In reply to Wes Kocher (:KWierso) from comment #16)
> (In reply to Michal Novotny (:michal) from comment #10)
> > This is already known intermittent failure, why do you think it is caused by
> > this push?
> > 
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1238222
> 
> The intermittent failure has a different number of differing pixels, and
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e
> 3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&s
> electedJob=128936629&group_state=expanded suggests that your patch started
> this failure and the failure stopped with the backout of this patch.

The number of differing pixels I think is caused by the fact that we are failing different subtests of font-display-1.html
The reference URIs are different in the rcwn failure vs bug 1238222. But as mentioned in bug 1238222 comment 1, that test is racy.
(In reply to Wes Kocher (:KWierso) from comment #16)
> (In reply to Michal Novotny (:michal) from comment #10)
> > This is already known intermittent failure, why do you think it is caused by
> > this push?
> > 
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1238222
> 
> The intermittent failure has a different number of differing pixels, and
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-
> searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e
> 3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&s
> electedJob=128936629&group_state=expanded suggests that your patch started
> this failure and the failure stopped with the backout of this patch.

That's simply not true. Have you checked failures in bug 1238222? E.g. failures from comment 46:

5376 differing pixels:
  https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=126338148&lineNumber=13598
  https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127055651&lineNumber=18222

5997 differing pixels:
  https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=126836952&lineNumber=2787
  https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127547058&lineNumber=2786

None of the reports have RCWN enabled. And if you have a look at the pictures (see attached archive), you can see that the problem is the same and it differs only in used font. Btw, I have run reftests locally 50 times (with RCWN disabled) and test layout/reftests/font-face/font-display-1.html failed in all cases.
Flags: needinfo?(wkocher)
Depends on: 1398847
Backed out for failing mochitests, e.g. dom/tests/mochitest/ajax/offline/test_bug474696.html, on Windows 7 without e10s:

https://hg.mozilla.org/integration/mozilla-inbound/rev/c0f7ab91792c957f05d47aacd25892f56dbaff10

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5a73ec3a19704bab348fa834733981163ffd4a55&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=130577296&repo=mozilla-inbound

848 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out.
873 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out.
921 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected event: error

etc.
Flags: needinfo?(michal.novotny)
Please also take a look at this crash which hit the push after this one:

https://treeherder.mozilla.org/logviewer.html#?job_id=130590813&repo=mozilla-inbound

02:12:12     INFO -  380 INFO TEST-START | browser/base/content/test/general/browser_gZipOfflineChild.js
02:12:12     INFO -  GECKO(3228) | ++DOCSHELL 000000ED06A32000 == 2 [pid = 360] [id = {f4210046-0a0c-49e1-b5dc-de187e74d8a6}]
02:12:12     INFO -  GECKO(3228) | ++DOMWINDOW == 12 (000000ED0703B800) [pid = 360] [serial = 387] [outer = 0000000000000000]
02:12:12     INFO -  GECKO(3228) | [Child 2092] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297
02:12:12     INFO -  GECKO(3228) | ++DOMWINDOW == 13 (000000ED09F18000) [pid = 360] [serial = 388] [outer = 000000ED0703B800]
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297
02:12:12     INFO -  GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file z:/build/build/src/netwerk/base/nsChannelClassifier.cpp, line 342
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 53 (000000E133664000) [pid = 2092] [serial = 231] [outer = 0000000000000000] [url = http://example.com/browser/browser/base/content/test/general/dummy_page.html]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 52 (000000E1377E7800) [pid = 2092] [serial = 253] [outer = 0000000000000000] [url = http://example.org/browser/browser/base/content/test/general/dummy_page.html]
02:12:12     INFO -  GECKO(3228) | ++DOMWINDOW == 14 (000000ED0AEAB800) [pid = 360] [serial = 389] [outer = 000000ED0703B800]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 51 (000000E133666000) [pid = 2092] [serial = 250] [outer = 0000000000000000] [url = http://example.com/browser/browser/base/content/test/general/dummy_page.html]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 50 (000000E137669800) [pid = 2092] [serial = 247] [outer = 0000000000000000] [url = http://example.org/browser/browser/base/content/test/general/dummy_page.html]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 49 (000000E13365B000) [pid = 2092] [serial = 258] [outer = 0000000000000000] [url = about:blank]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 48 (000000E13737D000) [pid = 2092] [serial = 256] [outer = 0000000000000000] [url = about:blank]
02:12:12     INFO -  GECKO(3228) | --DOMWINDOW == 47 (000000E133780000) [pid = 2092] [serial = 260] [outer = 0000000000000000] [url = about:blank]
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413
02:12:12     INFO -  GECKO(3228) | ++DOCSHELL 000000ED0B857000 == 3 [pid = 360] [id = {c25721b5-360e-4f1d-a2d9-0326335ff5e6}]
02:12:12     INFO -  GECKO(3228) | ++DOMWINDOW == 15 (000000ED0B85C000) [pid = 360] [serial = 390] [outer = 0000000000000000]
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297
02:12:12     INFO -  GECKO(3228) | ++DOMWINDOW == 16 (000000ED0AE17000) [pid = 360] [serial = 391] [outer = 000000ED0B85C000]
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: Could not get disk status from nsIDiskSpaceWatcher: file z:/build/build/src/uriloader/prefetch/nsOfflineCacheUpdateService.cpp, line 283
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413
02:12:12     INFO -  GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297
02:12:13     INFO -  GECKO(3228) | [Parent 3228] WARNING: Suboptimal indexes for the SQL statement 0x2bcc4fc840 (http://mzl.la/1FuID0j).: file z:/build/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
02:12:13     INFO -  GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file z:/build/build/src/netwerk/protocol/http/nsHttpChannel.cpp, line 4103
02:12:13     INFO -  GECKO(3228) | [Parent 3228] WARNING: cache check failed: file z:/build/build/src/netwerk/cache2/OldWrappers.cpp, line 878
02:12:13     INFO -  GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file z:/build/build/src/toolkit/components/mediasniffer/nsMediaSniffer.cpp, line 155
02:12:13     INFO -  GECKO(3228) | Assertion failure: mResponseHead (No response head in ODA!!), at z:/build/build/src/netwerk/protocol/http/nsHttpChannel.cpp:7701
02:12:13     INFO -  GECKO(3228) | #01: nsInputStreamPump::OnStateTransfer() [netwerk/base/nsInputStreamPump.cpp:618]
02:12:13     INFO -  GECKO(3228) | #02: nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) [netwerk/base/nsInputStreamPump.cpp:444]
02:12:13     INFO -  GECKO(3228) | #03: nsInputStreamReadyEvent::Run() [xpcom/io/nsStreamUtils.cpp:99]
02:12:13     INFO -  GECKO(3228) | #04: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:1040]
02:12:13     INFO -  GECKO(3228) | #05: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/threads/nsThreadUtils.cpp:521]
02:12:13     INFO -  GECKO(3228) | #06: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:97]
02:12:13     INFO -  GECKO(3228) | #07: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:320]
02:12:13     INFO -  GECKO(3228) | #08: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:300]
02:12:13     INFO -  GECKO(3228) | #09: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:160]
02:12:13     INFO -  GECKO(3228) | #10: nsAppShell::Run() [widget/windows/nsAppShell.cpp:232]
02:12:13     INFO -  GECKO(3228) | #11: nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:288]
02:12:13     INFO -  GECKO(3228) | #12: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4653]
02:12:13     INFO -  GECKO(3228) | #13: XREMain::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4817]
02:12:13     INFO -  GECKO(3228) | #14: XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4912]
02:12:13     INFO -  GECKO(3228) | #15: do_main [browser/app/nsBrowserApp.cpp:237]
02:12:13     INFO -  GECKO(3228) | #16: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:311]
02:12:13     INFO -  GECKO(3228) | #17: wmain [toolkit/xre/nsWindowsWMain.cpp:118]
02:12:13     INFO -  GECKO(3228) | #18: __scrt_common_main_seh [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253]
02:12:13     INFO -  GECKO(3228) | #19: KERNEL32.DLL + 0x167e
02:12:13     INFO -  GECKO(3228) | #20: ntdll.dll + 0x1c3f1
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #21)
> 848 INFO TEST-UNEXPECTED-FAIL |
> dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out.
> 873 INFO TEST-UNEXPECTED-FAIL |
> dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out.
> 921 INFO TEST-UNEXPECTED-FAIL |
> dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected
> event: error

Looks like RCWN should be disabled for all mochitests in dom/tests/mochitest/ajax/offline but for some reason they were not covered by bug 1373198. They simply expect the content to be loaded from the cache.
Flags: needinfo?(michal.novotny)
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
When this landed temporarily, we noticed this improvement on Windows 10:

== Change summary for alert #9418 (as of September 13 2017 08:16 UTC) ==

Improvements:

 27%  sessionrestore_many_windows windows10-64 pgo e10s     3,602.50 -> 2,615.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9418
The backout brought back the initial baseline:

== Change summary for alert #9423 (as of September 13 2017 12:45 UTC) ==

Regressions:

 39%  sessionrestore_many_windows windows10-64 pgo e10s     2,615.17 -> 3,639.58

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9423
Depends on: 1399802
(In reply to Michal Novotny (:michal) from comment #24)
> (In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on
> intermittent or backout) from comment #21)
> > 848 INFO TEST-UNEXPECTED-FAIL |
> > dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out.
> > 873 INFO TEST-UNEXPECTED-FAIL |
> > dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out.
> > 921 INFO TEST-UNEXPECTED-FAIL |
> > dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected
> > event: error
> 
> Looks like RCWN should be disabled for all mochitests in
> dom/tests/mochitest/ajax/offline but for some reason they were not covered
> by bug 1373198. They simply expect the content to be loaded from the cache.

Bug 1399802 will fix the failures.
https://hg.mozilla.org/mozilla-central/rev/bf572b74ec5f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Backed out for frequently failing reftest layout/reftests/font-face/font-display-1.html on Linux debug and on suspicion for causing leak in bug 1399117:

https://hg.mozilla.org/mozilla-central/rev/5b39f8dcc5cd6a38fe99564a9cfc90eedfad1639

See spike in failure on Treherder: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=67672b4fecc49740a30cdd2f61412044f134b585&filter-searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&tochange=7d0893f1b6ce8494c9d73a0e7fe04ef3cf104b98&selectedJob=131335846
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: FIXED → ---
Target Milestone: mozilla57 → ---
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #31)
> Backed out for frequently failing reftest
> layout/reftests/font-face/font-display-1.html on Linux debug

See my comment #19. RCWN is not causing the problem. It might cause the spike, but it doesn't change the fact that the test is badly written.

> and on suspicion for causing leak in bug 1399117:

Again, it's not causing it, because it was reported on revision 485c8d248bfc4acc97f92cdc130210d7388bca32, where RCWN is disabled.
Flags: needinfo?(michal.novotny)
(In reply to Michal Novotny (:michal) from comment #32)
> (In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on
> intermittent or backout) from comment #31)
> > Backed out for frequently failing reftest
> > layout/reftests/font-face/font-display-1.html on Linux debug
> 
> See my comment #19. RCWN is not causing the problem. It might cause the
> spike, but it doesn't change the fact that the test is badly written.
Best way forward here is to file a bug for the issue and needinfo :bradwerth. Either the issue has to be fixed or the test disabled in that bug, then this one can land.
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #33)
> > See my comment #19. RCWN is not causing the problem. It might cause the
> > spike, but it doesn't change the fact that the test is badly written.
> Best way forward here is to file a bug for the issue and needinfo
> :bradwerth. Either the issue has to be fixed or the test disabled in that
> bug, then this one can land.

Thanks for info. And what to do with the leak? It is the same problem. It happens without this patch too, but more often with it.
Flags: needinfo?(aryx.bugmail)
Depends on: 1400386
> Thanks for info. And what to do with the leak? It is the same problem. It
> happens without this patch too, but more often with it.
It's still high frequency after that backout, so unrelated and doesn't block the relanding.
Flags: needinfo?(aryx.bugmail)
Before this was backed out, these improvements were noticed:

== Change summary for alert #9451 (as of September 14 2017 12:40 UTC) ==

Improvements:

 27%  sessionrestore_many_windows windows10-64 pgo e10s     3,609.23 -> 2,624.92
 27%  sessionrestore_many_windows windows7-32 pgo e10s      3,547.75 -> 2,598.50

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9451
The last backout, as noted in comment 31, brought the equivalent perf metrics back:

== Change summary for alert #9492 (as of September 15 2017 19:34 UTC) ==

Regressions:

 37%  sessionrestore_many_windows windows7-32 pgo e10s     2,562.67 -> 3,500.33
 34%  sessionrestore_many_windows windows10-64 pgo e10s    2,593.00 -> 3,486.00

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9492
Hi Michal,

It seems like this has caused a regression on the Quantum Reference Hardware on Speedometer, around 5% on x64 and 2.5% on x86, please see the links below respectively (please look for the node on the graph with revision b577f883dc898d203b74f7b02345a0a7dc3d32c0):

https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856
https://arewefastyet.com/#machine=37&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856

In contrast, the faster Windows machine that also runs these tests shows no regression on this changeset:

https://arewefastyet.com/#machine=35&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856
https://arewefastyet.com/#machine=17&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856

This may suggest that the machines have different disk/network speeds which makes one prone to a regression.

Since Speedometer V2 performance on the Quantum Reference Hardware is a release criteria goal, and we're very late in the cycle, I may very reluctantly request you to back this out for now while you investigate the regression.  Sorry to be a pain, and thanks for understanding.  :-(
Flags: needinfo?(michal.novotny)
Backout by mnovotny@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8ec345b71a6e
Backed out changeset from bug 1392841 due to speedometer regression. r=backout
Also was the intention to only enable this on Nightly?  The patch was shipping the feature as far as I could tell.  :-)
Yes, the intent is to ship.

After talking to Selena, overholt, and ryanVM about this, we decided that the gold standard here is really a PGO build.  So we want to see the numbers for the existing code on 32/64 bit windows PGO builds, both with RCWN on/off--if there's no regression for PGO, we're good to go here.  Otherwise, we have to debug what's going on (I know little about Speedindex other than that it is a simple TODO web app.  We're not sure, but our impression is that the only networking it should do it loading the resources for the test, i.e. we don't think it does XHR or anything like that while the test is running.  Which makes the regression here kind of weird.)

I've got my folks testing this on our reference Acer laptops, and we should have numbers soon.
Flags: needinfo?(michal.novotny)
Yet again, when this relanded:

== Change summary for alert #9536 (as of September 19 2017 07:57 UTC) ==

Improvements:

 29%  sessionrestore_many_windows windows10-64 pgo e10s     3,469.92 -> 2,481.00
 24%  sessionrestore_many_windows windows7-32 pgo e10s      3,488.25 -> 2,663.58

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9536
Then after the backout:

== Change summary for alert #9545 (as of September 19 2017 16:37 UTC) ==

Regressions:

 40%  sessionrestore_many_windows windows10-64 pgo e10s     2,484.83 -> 3,484.75
 17%  sessionrestore_many_windows windows7-32 pgo e10s      2,987.71 -> 3,501.33

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9545
Depends on: 1401483
(In reply to Jason Duell [:jduell] (needinfo me) from comment #44)
> Yes, the intent is to ship.
> 
> After talking to Selena, overholt, and ryanVM about this, we decided that
> the gold standard here is really a PGO build.  So we want to see the numbers
> for the existing code on 32/64 bit windows PGO builds, both with RCWN
> on/off--if there's no regression for PGO, we're good to go here.  Otherwise,
> we have to debug what's going on (I know little about Speedindex other than
> that it is a simple TODO web app.  We're not sure, but our impression is
> that the only networking it should do it loading the resources for the test,
> i.e. we don't think it does XHR or anything like that while the test is
> running.  Which makes the regression here kind of weird.)

Speedometer doesn't use XHR as far as I know, but it does load *lots* of web pages, so if turning on the RCWN pref causes some extra code to be run that wouldn't run without it that could explain the regression.  In my experience the Acer hardware has been good at detecting small regressions but it has been pretty terrible for profiling purposes unfortunately, since the sampling rate can't be adjusted to less than 1ms (which is 2ms on Windows effectively when using the Gecko profiler) so if the code is all platform independent, profiling Speedometer with and without the pref on Linux or Mac and comparing the profiles to see what shows up additionally when the pref is on may be a good place to start.  The way I usually profile Speedometer is I set the sampling frequency to 0.1ms and the buffer size to 450MB and I run Speedometer and let it run ~150 or so of its subtests and then I capture a profile (the benchmark loops through its subtests so by that time you have captured several runs of the subtests).  You can use https://mozilla.github.io/arewefastyet-speedometer/2.0/ if you want the same copy of the test that AWFY uses.  Hope this helps!
Summary: Enable RCWN on Nightly → Enable RCWN
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #47)
> captured several runs of the subtests).  You can use
> https://mozilla.github.io/arewefastyet-speedometer/2.0/ if you want the same
> copy of the test that AWFY uses.  Hope this helps!

The set of tests seems to be different from what is run when I follow https://github.com/mozilla/arewefastyet/wiki/Slave#benchmark-locally. So which one is correct? And how to run multiple tests when we want to have some average number? Should be all individual tests executed separately with a fresh profile?
Flags: needinfo?(ehsan)
Ehsan, is there possibility to run speedometer test with a custom patch on the official servers? So far we couldn't reproduce the regression locally on the reference HW.
:bc (Bob Clary) says the answer is no, but he's going to experiment with using try builds to run Speedometer.
Flags: needinfo?(ehsan)
Punting on landing for 57 today, as we've run into bug 1401741, which causes devtools to mark some RCWN channels as never completed, which seems bad enough to block landing for now.

We gathered a lot of perf numbers and I'll have an analysis of them soon.
Priority: P1 → P2
Hi Valentin,
Can you please tell me if 55 and 56 versions are unaffected?
Flags: needinfo?(valentin.gosu)
(In reply to Valentina Claudia Ona from comment #52)
> Hi Valentin,
> Can you please tell me if 55 and 56 versions are unaffected?

RCWN is not enabled on any version of Firefox.
Flags: needinfo?(valentin.gosu)
I asked you because I want to set tracking flags.
Flags: needinfo?(valentin.gosu)
We missed the train for 57 and we definitely will not be shipping on 55 or 56 - so we don't need to track it.
Flags: needinfo?(valentin.gosu)
FWIW in bug 1400095, a Speedometer regression was noticed the previous time this had landed (comment 29)
So we've looked some into Speedometer performance (with PGO builds).

We've gathered two sets of data, and the spreadsheet with our numbers is here:

   https://docs.google.com/spreadsheets/d/11FqwBvw0QyGXclGWB5sJ9rwGNnjN9txAG02hbdjWE2g/edit#gid=0

1) 5 different Acer reference laptops, on which we manually ran 5-10 Speedometer runs with RCWN on/off (and 32/64 bit) using the remote standard Speedometer webserver.  Aggregating all the results, we do see a statistically significant improvement for RCWN on 32 bit (+2.18%), and a perf hit on 64 bit (-0.9%).  However, if we exclude the noisiest laptop (PC1), which allegedly had update software running during some of the tests, the effects are less (+.59% and -.54%) and are no longer statistically significant.  If we use only the 3 laptops that had "stable" numbers (all std deviations < .5), then the effects drop further (.25%, -.13%) and are a total wash statistically.
   
I'm not enough of a stats person to know if it's kosher to disregard the data from the noisier laptops.  We did that sort of thing a lot when I worked in scientific computing...

2) Michal ran more data points using our automated tools (so I presume these used a localhost server).  We have a lot more data points here (73 runs for each config), yet oddly enough we see absolutely no significant differences in performance.  But the Speedometer numbers are also oddly high (61-62, vs mid-40s for our other runs), so I want to know more about that before making any conclusions. (Update from Michal: "It's the standard reference notebook. If you use framework https://github.com/mozilla/arewefastyet.git and do './execute.py -b remote.speedometer', it shows results ~61. If you do it manually on page https://mozilla.github.io/arewefastyet-speedometer/2.0/ it shows results around 42.'  Ehsan, is that something you've seen before?  Are there differences besides the location of the web server that you know of? Which method is arewefastyet.com using?)

3) Just as for our automated Speedometer results, we don't see any significant differences with RCWN on workstations--only on our reference hardware.

We can start digging into gecko profiles, necko logs, etc.  But before we do I'd like to have some idea of the goalposts here.  

In general, we have never considered tests that run with a localhost web server (and usually via an HTTP proxy--do we know if we use a proxy here too?) to be useful tests of necko performance, since it's not a realistic environment for networking.  In particular, let me note that running a localhost server (as I assume we do in our Speedometer results from bot runs) is a recipe for artificially poor performance with RCWN.  If RCWN decides to race with cache, that means that 1) the local disk is probably already under high load, and 2) we'll fire up a request to the web server, which will then add more I/O trying to access the same data (but in a different file).  So there's artificial downside to RCWN in this environment. (Another way of putting it is that RCWN is based on gambling that a different I/O channel may be faster, and so worth some overhead to try.  In this environment the 'different I/O channel' winds up being the same--the local disk--so we only add overhead).

That said, I understand that 1) Speedometer is a benchmark that may have market relevance that make it unavoidable (could we turn off RCWN for it? Or maybe it's as simple as teaching RCWN to not race for connections that resolve to localhost?); 2) We do seem to see these weird results when we use a remote server, too, so we need to look into that.  I think our next step is to get gecko/necko logs on a machine and see if we see anything interesting (maybe honza's log tracer can help?).

I personally have no idea why we'd be speeding things up on 32 bits, but slowing them down on 64.  Have we seen that pattern elsewhere and learned from it?
Flags: needinfo?(ehsan)
(In reply to Jason Duell [:jduell] (needinfo me) from comment #58)
> 1) 5 different Acer reference laptops, on which we manually ran 5-10
> Speedometer runs with RCWN on/off (and 32/64 bit) using the remote standard
> Speedometer webserver.  Aggregating all the results, we do see a
> statistically significant improvement for RCWN on 32 bit (+2.18%), and a
> perf hit on 64 bit (-0.9%).  However, if we exclude the noisiest laptop
> (PC1), which allegedly had update software running during some of the tests,
> the effects are less (+.59% and -.54%) and are no longer statistically
> significant.  If we use only the 3 laptops that had "stable" numbers (all
> std deviations < .5), then the effects drop further (.25%, -.13%) and are a
> total wash statistically.
>    
> I'm not enough of a stats person to know if it's kosher to disregard the
> data from the noisier laptops.  We did that sort of thing a lot when I
> worked in scientific computing...

I don't know what to say.  This has landed twice on AWFY and it has shown regressions both times (comment 57 and comment 39) and the regressions are clear from the graphs of runs over each commit in both cases.  Has there been a code change in between the latest landing and these measurements that is expected to have removed the regressions?  If not, I wouldn't expect the regression to go away on its own.  Has anyone tried to profile a Speedometer run with and without RCWN to see what extra code runs and get an understanding of where the regression comes from?

Also, as I noted on IRC last week, even though the PGO builds are what we distribute to the users and are what "matters" in that sense, you're making your job too difficult by using them for comparisons instead of non-PGO builds.  For example, around comment 39 we were seeing a 5% regression on x64 on non-PGO builds and around a 1% regression on PGO builds.  Those builds therefore may be easier for profiling purposes since they will make the differences in performance more pronounced.  (FTR I personally do all of my Speedometer profiling on non-PGO builds myself, the results transfer very well in my experience.)

> 2) Michal ran more data points using our automated tools (so I presume these
> used a localhost server).  We have a lot more data points here (73 runs for
> each config), yet oddly enough we see absolutely no significant differences
> in performance.  But the Speedometer numbers are also oddly high (61-62, vs
> mid-40s for our other runs), so I want to know more about that before making
> any conclusions. (Update from Michal: "It's the standard reference notebook.
> If you use framework https://github.com/mozilla/arewefastyet.git and do
> './execute.py -b remote.speedometer', it shows results ~61. If you do it
> manually on page https://mozilla.github.io/arewefastyet-speedometer/2.0/ it
> shows results around 42.'  Ehsan, is that something you've seen before?  Are
> there differences besides the location of the web server that you know of?
> Which method is arewefastyet.com using?)

If you are getting numbers in the range of sixties, you are using an old version of Speedometer before they switched to using geometric mean to report the test score (https://bugs.webkit.org/show_bug.cgi?id=172968).  It is entirely possible for a regression that shows up when using geometric mean to not show up when using arithmetic mean, so please disregard test runs with results in the sixties range on the reference hardware.  When in doubt, use https://mozilla.github.io/arewefastyet-speedometer/2.0/, that is the latest version of Speedometer that AWFY uses.  BTW, Speedometer benchmark doesn't care which webserver it is served off of, as it doesn't measure any of the network time.

> 3) Just as for our automated Speedometer results, we don't see any
> significant differences with RCWN on workstations--only on our reference
> hardware.

Yes, AWFY also didn't show a regression on a faster machine either...

> We can start digging into gecko profiles, necko logs, etc.  But before we do
> I'd like to have some idea of the goalposts here.  

Yes, I think the right next step here is for someone to profile the test runs to see what extra code runs with RCWN turned on, and why it takes so long.  And for that profiling to happen on a *non-PGO* build if you are looking for an easy path to an answer.  :-)

> In general, we have never considered tests that run with a localhost web
> server (and usually via an HTTP proxy--do we know if we use a proxy here
> too?)

I don't know anything about the network setup on AWFY, you should ask Bob if you're interested.

> to be useful tests of necko performance, since it's not a realistic
> environment for networking.

Yes, and that is the point here.  RCWN is regressing the performance on non-networking parts of the browser.  :-)

Let me describe, because it may not be obvious what Speedometer measures.  Speedometer loads some web pages in iframes, waits for the load to finish, and then starts measuring.  Each measurement works like this:

  var t1 = new Date();
  runSomeCode();
  var t2 = new Date();
  syncTime = (t2 - t1);
  var t3 = new Date();
  setTimeout(function() {
    var t4 = new Date();
    asyncTime = (t4 - t3);
  }, 0);

syncTime and asyncTime are reported for each subset.  runSomeCode() is the benchmark code that Speedometer runs which is some synchronous JS operation.  So basically the benchmark runs some JS, measures how long that takes, then schedules a timeout and measures how long it takes for the timeout handler to run (as a proxy to see how long it takes for any async tasks that the benchmarked code has scheduled to run) and reports that as asyncTime, and then moves on to run the next iframe.

This regression means that RCWN is either increasing something in the syncTimes or in the asyncTimes.  It's hard to imagine how something may have changed in what runs in syncTimes because it's all code triggered by web page JS, and I'm assuming RCWN doesn't have anything triggered by JS/DOM or such.  But more likely it is scheduling some runnables to run on the event queue, which happen to run inside the period of time measured by asyncTime, and those runnables are taking a long time.  The goal here is to find out the nature of the additional code running and figure out what we can do about it.

> In particular, let me note that running a
> localhost server (as I assume we do in our Speedometer results from bot
> runs) is a recipe for artificially poor performance with RCWN.  If RCWN
> decides to race with cache, that means that 1) the local disk is probably
> already under high load, and 2) we'll fire up a request to the web server,
> which will then add more I/O trying to access the same data (but in a
> different file).  So there's artificial downside to RCWN in this
> environment. (Another way of putting it is that RCWN is based on gambling
> that a different I/O channel may be faster, and so worth some overhead to
> try.  In this environment the 'different I/O channel' winds up being the
> same--the local disk--so we only add overhead).

Because Speedometer doesn't measure page load time, it's unlikely that changes like this to cause an impact to what the test measures.

> That said, I understand that 1) Speedometer is a benchmark that may have
> market relevance that make it unavoidable (could we turn off RCWN for it? Or
> maybe it's as simple as teaching RCWN to not race for connections that
> resolve to localhost?);

That may be a good idea because of the problem you mentioned above, but I'm not sure if it would help with Speedometer itself.

> 2) We do seem to see these weird results when we use
> a remote server, too, so we need to look into that.  I think our next step
> is to get gecko/necko logs on a machine and see if we see anything
> interesting (maybe honza's log tracer can help?).

I can't imagine how logs could help because I don't know what we're looking for yet, hopefully my description above is helpful in thinking about what to look for when digging into this.  I think a sampling profiler is sufficient to give answers here.  BTW unless the code here is Windows specific you should be able to do the profiling on any platform. (On Windows it's impossible currently to get a meaningful sampling frequency below 2ms which is way too high for useful profiling.  In my experience profiling on the reference hardware is pretty difficult.)

> I personally have no idea why we'd be speeding things up on 32 bits, but
> slowing them down on 64.  Have we seen that pattern elsewhere and learned
> from it?

No, I can't imagine why this would be to be honest.  I suspect once we find out the underlying cause of the regression it would be evident...
Flags: needinfo?(ehsan)
Depends on: 1401741
So, the profiling effort hasn't been going so great:
I tried profiling with a 1ms interval, and set the limit to 900 Mb. The test ended up taking somewhere around an hour, and then the profile could not be opened (after a few hours of loading screen I gave up).
I tried reducing the testcase, and only profiled the first 20 of 460 tests of the speedometer, with a 0.1 ms interval.

[no-rcwn] https://perfht.ml/2xQZzBQ
[with-rcwn] https://perfht.ml/2xS18zo

My profiling skills aren't the best, but I don't see much of a difference between the two captures. Any help would be very welcome.
Flags: needinfo?(mconley)
(In reply to Valentin Gosu [:valentin] from comment #60)
> So, the profiling effort hasn't been going so great:
> I tried profiling with a 1ms interval, and set the limit to 900 Mb. The test
> ended up taking somewhere around an hour, and then the profile could not be
> opened (after a few hours of loading screen I gave up).

This looks strange. I used the profiler on the reference HW with non-pgo build and the profiler overhead is negligible. The whole test finishes in about 200 seconds.
(In reply to Valentin Gosu [:valentin] from comment #60)
> 
> [no-rcwn] https://perfht.ml/2xQZzBQ
> [with-rcwn] https://perfht.ml/2xS18zo
> 
> My profiling skills aren't the best, but I don't see much of a difference
> between the two captures. Any help would be very welcome.

I've started to poke at these profiles. One thing I've noticed is that you have at least one add-on enabled - I think it's ABP. For this sort of work, I find it's often best if we can reduce the number of things that are running to the bare minimum to lower the number of red herrings in the profile.

Would it be possible to gather the profiles again, but this time, with _only_ the Gecko Profiler add-on enabled?
Flags: needinfo?(mconley) → needinfo?(valentin.gosu)
Sorry, I realize you're all kinda chomping at the bit here, but I was traveling for MozFest and only just got back.

I've been looking at the profiles in comment 63 today, and one thing sticks out to me: the profiles with RCWN enabled have 3 content processes, whereas the RCWN disabled ones have 2 content processes. Why?

What happens if you restrict the number of content processes to just 1 via dom.ipc.processCount? Does RCWN still regression Speedometer then?
Flags: needinfo?(michal.novotny)
Since yesterday we merged, we should consider enabling this in Nightly and letting it ride the release train.
Our profiling didn't reveal the cause for the perceived performance regression, and that may it is within the test's error range.
We did trigger multiple runs for the tests and the regression delta went way down (between 0% and 0.44%). See [1]

We also looked at subtests. Some show a regression, while others an improvement. This differs a lot between runs, leading us to believe that this is caused by the change in the timing of cache IO operations on the background thread.

Telemetry indicates RCWN can reduce the time to load a channel by a median of 143 milliseconds [2] when the network request wins the race 7.33% of all loads [3]. We need to enable it to verify if telemetry shows an improvement in overall page load time as well.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7a767bfcdd084f80c3e80b6b8cf401f4715c11b8&newProject=try&newRevision=d17ec5fedeaaedafdf58a7fe3251496041e64ea3&framework=1
[2] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-31&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_SAVED_TIME&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-06-12&table=0&trim=1&use_submission_date=0
[3] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-08-01&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_USAGE_2&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-09&table=0&trim=1&use_submission_date=0
I support landing this and seeing if we still see the Speedometer regressions.  Keep a close eye on the Speedometer results (esp the tests in comment 39).
https://hg.mozilla.org/mozilla-central/rev/64b1b02e184b
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Is this landing the source of the 10% regression in Speedometer - https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1505336371&end=1511198137? 

That feels like too big a deficit to land. Who can take a look at this?
(In reply to Naveed Ihsanullah [:naveed] from comment #69)
> Is this landing the source of the 10% regression in Speedometer -
> https://arewefastyet.com/#machine=36&view=single&suite=speedometer-
> misc&subtest=score&start=1505336371&end=1511198137? 

Where do you see 10% regression? I see this in the graph:

suite: speedometer-misc 2.0-r221659
score: 39
delta: -0.9 (2.2% worse)
source: Firefox (Ion PGO)
revs: a77c628829b389ed6ac608eadc88b09eb5115bef to 3a0aac55195fd50ca4b1b41be450bfbaaafc191c
tested: Nov 17 to Nov 19, 2017
Flags: needinfo?(michal.novotny) → needinfo?(nihsanullah)
2.2% is probably too big.  And from the looks of it, some other patches that landed in the day or so before we enabled RCWN caused some Speedometer regressions, so it's all a bit murky right now.  We should probably back out, wait to see if there are other patches that need backing out, and when the Speedometer numbers have settled down we can try again.
Could someone explain a bit how the RCWN code works?  AFAIK speedometer tries not to measure load times.  So it seems perhaps some extra CPU is being used outside of normal network load operations.  All it takes is a few extra runnables on the main thread at the wrong time to regress the benchmark.  Does this make any sense with how RCWN is supposed to work?

Also, is it possible the RCWN algorithm does some unnecessary work when network and disk speeds are nearly identical?  Does it do more racing because it can't make up its mind which is faster?
(In reply to Ben Kelly [:bkelly] from comment #72)
> Could someone explain a bit how the RCWN code works?  AFAIK speedometer
> tries not to measure load times.  So it seems perhaps some extra CPU is
> being used outside of normal network load operations.  All it takes is a few
> extra runnables on the main thread at the wrong time to regress the
> benchmark.  Does this make any sense with how RCWN is supposed to work?

When RCWN is enabled, http channel waits for the cache entry for a limited time (between 0 and 500ms) and then issues the network request. If we get the response from the cache first, we cancel the network transaction. If the network is faster, we ignore the cache entry. It can happen, that we run few more runnables with RCWN than without it.

> Also, is it possible the RCWN algorithm does some unnecessary work when
> network and disk speeds are nearly identical?  Does it do more racing
> because it can't make up its mind which is faster?

Normally, when network wins we ignore the cache entry in OnCacheEntryCheck. In rare cases, the network wins after OnCacheEntryCheck and before OnCacheEntryAvailable is called and then we do a bit more stuff.

Anyway, AFAICS the numbers are now back in normal without backing RCWN out, right?

https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1509771214&end=1511817810
Wow, yes.  The numbers do look much better today.  I believe that change is probably due to bug 1418389.

Its unclear if it would have come down even farther with RCWN backed out or not.  For example, around Oct 30 on PGO we were ~41/42 and so far today we are back to ~40/41.

I wonder if the RCWN algorithm can hold the channel and its various listeners alive for longer which puts more stress on GC/CC.

Honestly I'd be somewhat in favor of just disabling RCWN on speedometer instead of backing it out.  My reasoning:

1. Speedometer is not representative of a real network and is not designed to test network stuff.
2. Speedometer is also a stress test and almost overly sensitive at times.  Its great for catching DOM/js issues, but we shouldn't let it stop us from doing smart things in our network stack.
3. The regression, if still there, is relatively small.  It doesn't show up on the faster i7-4700HQ at all.
4. If disabling does not show an improvement here then we can just re-enable again.  Otherwise we can investigate further.
5. Real users can benefit from RCWN in the meantime.  (I assume you have metrics showing wins on real devices.)

That's just my gut feeling on this one.  I guess its up to Naveed and Jason, though.
OK, sounds like our plan should be

1) disable RCWN temporarily and see if Speedometer changes.  If it doesn't change significantly, re-enable and we're done.

2) If it still does affect Speedometer, look into either
   A) See if changing the race delay fixes things (IIRC we've been planning to experiment with upping the delay anyway)
   B) See if we can disable RCWN for Speedometer (is that a matter of changing about:config prefs in the test somehow?)

Let's try #A first if it's not too hard to do.
Flags: needinfo?(michal.novotny)
Bob can you help Michal and Jason flip a pref in the AWFY speedometer settings?  (Assuming everyone agrees with this approach.)
Flags: needinfo?(bob)
sure > bug 1421650
Flags: needinfo?(bob)
See Also: → 1421650
(In reply to Bob Clary [:bc:] from comment #77)
> sure > bug 1421650

Thanks!

Jason, I guess you meant to do this test on speedometer only and not on m-c, right?
Flags: needinfo?(michal.novotny)
Sure, if we can test Speedometer w/o landing in m-c, that's great.
And... we're keeping the pref enabled.  RCWN is finally landed, enabled, and done \o/
Depends on: 1430698
Flags: needinfo?(nihsanullah)
Depends on: 1494405
You need to log in before you can comment on or make changes to this bug.