1392841 - Enable RCWN

Assignee

Description

•

8 years ago

We should enable RCWN on Nightly permanently. We didn't experience crashes during second shield study (bug 1381816). Telemetry from 2017/07/27 to 2017/07/31 shows that average time saved by racing was 5.82s while the median was 150ms. We can estimate the added bandwidth usage by racing as (NETWORK_RACE_CACHE_BANDWIDTH_RACE_NETWORK_WIN * cache_hit_rate + NETWORK_RACE_CACHE_BANDWIDTH_RACE_CACHE_WIN) which is around 0.53%.

Michal Novotny [:michal]

Assignee

Updated

•

8 years ago

Blocks: RCWN

Michal Novotny [:michal]

Assignee

Comment 1

•

8 years ago

Attached patch patch — Details — Splinter Review

Attachment #8900147 - Flags: review?(valentin.gosu)

Valentin Gosu [:valentin]

Updated

•

8 years ago

Attachment #8900147 - Flags: review?(valentin.gosu) → review+

Pulsebot

Comment 2

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d4fba7cb9d4a Enable RCWN on Nightly, r=valentin

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 3

•

8 years ago

Push which ran many failing tests: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=d97a8d7abdd65ef67038a6b139914b296f736a1c&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable XPCshell failure: https://treeherder.mozilla.org/logviewer.html#?job_id=125259167&repo=mozilla-inbound [task 2017-08-23T18:03:00.949496Z] 18:03:00 INFO - "CONSOLE_MESSAGE: (info) 1503511379414 addons.xpi DEBUG Download started for http://localhost:40510/broken.xpi to file /tmp/tmp-u6l.xpi" [task 2017-08-23T18:03:00.954611Z] 18:03:00 INFO - "CONSOLE_MESSAGE: (info) 1503511379416 addons.xpi DEBUG Download of http://localhost:40510/broken.xpi completed." [task 2017-08-23T18:03:00.967501Z] 18:03:00 INFO - "CONSOLE_MESSAGE: (info) 1503511379417 addons.xpi WARN Download of http://localhost:40510/broken.xpi failed: 404 Not Found" [task 2017-08-23T18:03:00.969380Z] 18:03:00 INFO - "CONSOLE_MESSAGE: (info) 1503511379418 addons.xpi DEBUG downloadFailed: removing temp file for http://localhost:40510/broken.xpi" [task 2017-08-23T18:03:00.971235Z] 18:03:00 INFO - "CONSOLE_MESSAGE: (info) 1503511379419 addons.xpi DEBUG removeTemporaryFile: http://localhost:40510/broken.xpi removing temp file /tmp/tmp-u6l.xpi" [task 2017-08-23T18:03:00.972757Z] 18:03:00 INFO - PID 13027 | 1503511379514 addons.xpi DEBUG Download started for http://localhost:40510/broken.xpi to file /tmp/tmp-379.xpi [task 2017-08-23T18:03:00.974535Z] 18:03:00 INFO - PID 13027 | 1503511379517 addons.xpi DEBUG Download of http://localhost:40510/broken.xpi completed. [task 2017-08-23T18:03:00.976249Z] 18:03:00 INFO - PID 13027 | 1503511379517 addons.xpi WARN Download of http://localhost:40510/broken.xpi failed: 404 Not Found [task 2017-08-23T18:03:00.978254Z] 18:03:00 WARNING - TEST-UNEXPECTED-FAIL | xpcshell-unpack.ini:toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js | onDownloadFailed - [onDownloadFailed : 118] false == true [task 2017-08-23T18:03:00.985487Z] 18:03:00 INFO - /home/worker/workspace/build/tests/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js:onDownloadFailed:118 [task 2017-08-23T18:03:00.985859Z] 18:03:00 INFO - resource://gre/modules/AddonManager.jsm:callInstallListeners:1640 [task 2017-08-23T18:03:00.987281Z] 18:03:00 INFO - resource://gre/modules/AddonManager.jsm:callInstallListeners:3160 [task 2017-08-23T18:03:00.987728Z] 18:03:00 INFO - resource://gre/modules/addons/XPIInstall.jsm:downloadFailed:2515 [task 2017-08-23T18:03:00.989306Z] 18:03:00 INFO - resource://gre/modules/addons/XPIInstall.jsm:onStopRequest:2491 [task 2017-08-23T18:03:00.989645Z] 18:03:00 INFO - /home/worker/workspace/build/tests/xpcshell/head.js:_do_main:221 [task 2017-08-23T18:03:00.990964Z] 18:03:00 INFO - /home/worker/workspace/build/tests/xpcshell/head.js:_execute_test:544 [task 2017-08-23T18:03:00.991237Z] 18:03:00 INFO - -e:null:1 [task 2017-08-23T18:03:00.991504Z] 18:03:00 INFO - exiting test Failure log chrome: https://treeherder.mozilla.org/logviewer.html#?job_id=125259675&repo=mozilla-inbound [task 2017-08-23T17:59:22.139102Z] 17:59:22 INFO - TEST-START | dom/xul/test/test_bug1290965.xul [task 2017-08-23T17:59:22.227824Z] 17:59:22 INFO - GECKO(3358) | ++DOMWINDOW == 51 (0x7f98e572b000) [pid = 3358] [serial = 53] [outer = 0x7f98e018f000] [task 2017-08-23T17:59:22.376994Z] 17:59:22 INFO - GECKO(3358) | [Parent 3358] ###!!! ASSERTION: mListener is null: 'Not Reached', file /home/worker/workspace/build/src/netwerk/protocol/http/nsHttpChannel.cpp, line 7126 [task 2017-08-23T17:59:22.378812Z] 17:59:22 INFO - GECKO(3358) | #01: nsInputStreamPump::OnStateStart [netwerk/base/nsInputStreamPump.cpp:536] [task 2017-08-23T17:59:22.380274Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.382976Z] 17:59:22 INFO - GECKO(3358) | #02: nsInputStreamPump::OnInputStreamReady [netwerk/base/nsInputStreamPump.cpp:440] [task 2017-08-23T17:59:22.384455Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.385932Z] 17:59:22 INFO - GECKO(3358) | #03: nsInputStreamReadyEvent::Run [xpcom/io/nsStreamUtils.cpp:99] [task 2017-08-23T17:59:22.387685Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.389879Z] 17:59:22 INFO - GECKO(3358) | #04: nsThread::ProcessNextEvent [mfbt/Maybe.h:445] [task 2017-08-23T17:59:22.390881Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.392519Z] 17:59:22 INFO - GECKO(3358) | #05: NS_ProcessNextEvent [xpcom/threads/nsThreadUtils.cpp:521] [task 2017-08-23T17:59:22.393877Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.395165Z] 17:59:22 INFO - GECKO(3358) | #06: mozilla::ipc::MessagePump::Run [ipc/glue/MessagePump.cpp:98] [task 2017-08-23T17:59:22.396863Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.398106Z] 17:59:22 INFO - GECKO(3358) | #07: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:327] [task 2017-08-23T17:59:22.399764Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.401327Z] 17:59:22 INFO - GECKO(3358) | #08: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:298] [task 2017-08-23T17:59:22.404294Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.405379Z] 17:59:22 INFO - GECKO(3358) | #09: nsBaseAppShell::Run [widget/nsBaseAppShell.cpp:160] [task 2017-08-23T17:59:22.406226Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.407240Z] 17:59:22 INFO - GECKO(3358) | #10: nsAppStartup::Run [toolkit/components/startup/nsAppStartup.cpp:289] [task 2017-08-23T17:59:22.408640Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.409660Z] 17:59:22 INFO - GECKO(3358) | #11: XREMain::XRE_mainRun [toolkit/xre/nsAppRunner.cpp:4647] [task 2017-08-23T17:59:22.412159Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.413253Z] 17:59:22 INFO - GECKO(3358) | #12: XREMain::XRE_main [toolkit/xre/nsAppRunner.cpp:4810] [task 2017-08-23T17:59:22.414086Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.415260Z] 17:59:22 INFO - GECKO(3358) | #13: XRE_main [toolkit/xre/nsAppRunner.cpp:4905] [task 2017-08-23T17:59:22.416099Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.419344Z] 17:59:22 INFO - GECKO(3358) | #14: do_main [browser/app/nsBrowserApp.cpp:236] [task 2017-08-23T17:59:22.420202Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.421649Z] 17:59:22 INFO - GECKO(3358) | #15: main [browser/app/nsBrowserApp.cpp:311] [task 2017-08-23T17:59:22.422524Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.423512Z] 17:59:22 INFO - GECKO(3358) | #16: libc.so.6 + 0x20830 [task 2017-08-23T17:59:22.427038Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.428096Z] 17:59:22 INFO - GECKO(3358) | #17: _start [task 2017-08-23T17:59:22.429383Z] 17:59:22 INFO - [task 2017-08-23T17:59:22.430390Z] 17:59:22 INFO - GECKO(3358) | MEMORY STAT | vsize 2042MB | residentFast 250MB | heapAllocated 100MB [task 2017-08-23T17:59:22.527846Z] 17:59:22 INFO - TEST-OK | dom/xul/test/test_bug1290965.xul | took 385ms [task 2017-08-23T17:59:22.612519Z] 17:59:22 INFO - GECKO(3358) | ++DOMWINDOW == 52 (0x7f98ec811800) [pid = 3358] [serial = 54] [outer = 0x7f98e018f000] [task 2017-08-23T17:59:22.670892Z] 17:59:22 INFO - GECKO(3358) | ++DOCSHELL 0x7fad2da1b800 == 1 [pid = 3460] [id = {d647c191-9278-4b5a-a6d2-687b6ee453fa}] [task 2017-08-23T17:59:22.748548Z] 17:59:22 INFO - TEST-UNEXPECTED-FAIL | dom/xul/test/test_bug1290965.xul | assertion count 1 is more than expected 0 assertions

Flags: needinfo?(michal.novotny)

Pulsebot

Comment 4

•

8 years ago

Backout by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/mozilla-inbound/rev/30ec306c75d9 Backed out changeset d4fba7cb9d4a for failing xpcshell tests, e.g.test_backgroundupdate.js and assertions in chrome tests, e.g. test_bug1290965.xul. r=backout on a CLOSED TREE

Junior [inactive]

Updated

•

8 years ago

Depends on: 1384510

Valentin Gosu [:valentin]

Comment 5

•

8 years ago

Should we try turning it on again?

Michal Novotny [:michal]

Assignee

Comment 6

•

8 years ago

Try push looks mostly good after fixing bug 1395121 and bug 1384478. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=a45a0747f38d31c9d3fa4c80bb1802ed85e86c5c. I don't understand the wpt8 failure (css3-text-line-break-baspglwj-022.html). I'll do another try push and try to reproduce this locally to verify this has nothing to do with RCWN. There is also this crash: https://treeherder.mozilla.org/logviewer.html#?job_id=127498610&repo=try&lineNumber=5285, but this crash signature is also in crash-stats and it's not from RCWN experiment days, so I assume this is not RCWN related.

Flags: needinfo?(michal.novotny)

Pulsebot

Comment 7

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/b7726e9751ca Enable RCWN on Nightly, r=valentin

Michal Novotny [:michal]

Assignee

Comment 8

•

8 years ago

E10s web-platform tests look good now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=07645d1d5624a360c5e9fcf2ddf343c91a96663a&selectedJob=128865924

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 9

•

8 years ago

This appears to have caused reftest failures like https://treeherder.mozilla.org/logviewer.html#?job_id=128904305&repo=mozilla-inbound Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/149d14865d30855ae7ac9a3cc55e76638ee138f3

Flags: needinfo?(michal.novotny)

Michal Novotny [:michal]

Assignee

Comment 10

•

8 years ago

This is already known intermittent failure, why do you think it is caused by this push? https://bugzilla.mozilla.org/show_bug.cgi?id=1238222

Flags: needinfo?(michal.novotny) → needinfo?(wkocher)

Michal Novotny [:michal]

Assignee

Comment 11

•

8 years ago

See also https://bugzilla.mozilla.org/show_bug.cgi?id=1392144

Ionuț Goldan [:igoldan]

Comment 12

•

8 years ago

Before the backout, these perf regressions were noticed and also one big improvement: == Change summary for alert #9249 (as of September 06 2017 12:53 UTC) == Regressions: 25% tp6_facebook summary windows7-32 pgo e10s 245.79 -> 306.62 22% tp6_facebook summary windows10-64 pgo e10s 259.58 -> 315.79 13% tp6_facebook summary windows10-64 opt e10s 331.31 -> 374.21 Improvements: 27% sessionrestore_many_windows windows10-64 pgo e10s 3,637.25 -> 2,643.33 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9249

Ionuț Goldan [:igoldan]

Comment 13

•

8 years ago

And these perf changes appeared after the backout: == Change summary for alert #9270 (as of September 06 2017 23:48 UTC) == Regressions: 38% sessionrestore_many_windows windows10-64 pgo e10s 2,643.62 -> 3,641.56 Improvements: 18% tp6_facebook summary windows7-32 opt e10s 376.42 -> 310.25 16% tp6_facebook summary windows10-64 opt 1_thread e10s389.46 -> 325.46 15% tp6_facebook summary windows10-64 opt e10s 376.75 -> 320.29 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9270

Michal Novotny [:michal]

Assignee

Comment 14

•

8 years ago

Can we run this locally to try to find the cause of regression?

Flags: needinfo?(ionut.goldan)

Ionuț Goldan [:igoldan]

Comment 15

•

8 years ago

Yes, you should be able to run this locally using | mach talos-tests --suite tp6-e10s |.

Flags: needinfo?(ionut.goldan)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 16

•

8 years ago

(In reply to Michal Novotny (:michal) from comment #10) > This is already known intermittent failure, why do you think it is caused by > this push? > > https://bugzilla.mozilla.org/show_bug.cgi?id=1238222 The intermittent failure has a different number of differing pixels, and https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&selectedJob=128936629&group_state=expanded suggests that your patch started this failure and the failure stopped with the backout of this patch.

Flags: needinfo?(wkocher)

Valentin Gosu [:valentin]

Comment 17

•

8 years ago

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #15) > Yes, you should be able to run this locally using | mach talos-tests --suite > tp6-e10s |. I managed to run the test locally, and indeed I observed a regression. I ran the following: ./mach talos-test --setpref network.http.rcwn.enabled=true -a tp6_facebook run-tests - Wall time: 421s; CPU: 23.0%; Read bytes: 19769344; Write bytes: 156924416; Read time: 17756; Write time: 863206 ./mach talos-test -a tp6_facebook run-tests - Wall time: 403s; CPU: 22.0%; Read bytes: 21012480; Write bytes: 134531072; Read time: 31607; Write time: 739350 Figuring out why this happens seems a bit tricky though.

Valentin Gosu [:valentin]

Comment 18

•

8 years ago

(In reply to Wes Kocher (:KWierso) from comment #16) > (In reply to Michal Novotny (:michal) from comment #10) > > This is already known intermittent failure, why do you think it is caused by > > this push? > > > > https://bugzilla.mozilla.org/show_bug.cgi?id=1238222 > > The intermittent failure has a different number of differing pixels, and > https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter- > searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e > 3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&s > electedJob=128936629&group_state=expanded suggests that your patch started > this failure and the failure stopped with the backout of this patch. The number of differing pixels I think is caused by the fact that we are failing different subtests of font-display-1.html The reference URIs are different in the rcwn failure vs bug 1238222. But as mentioned in bug 1238222 comment 1, that test is racy.

Michal Novotny [:michal]

Assignee

Comment 19

•

8 years ago

Attached file test and reference images of the failures — Details

(In reply to Wes Kocher (:KWierso) from comment #16) > (In reply to Michal Novotny (:michal) from comment #10) > > This is already known intermittent failure, why do you think it is caused by > > this push? > > > > https://bugzilla.mozilla.org/show_bug.cgi?id=1238222 > > The intermittent failure has a different number of differing pixels, and > https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter- > searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&fromchange=f4e16883efd9d7e > 3465dc1a8fd9f642e17940352&tochange=75b91420fff7c1ded50cc8dd9396590bc0172588&s > electedJob=128936629&group_state=expanded suggests that your patch started > this failure and the failure stopped with the backout of this patch. That's simply not true. Have you checked failures in bug 1238222? E.g. failures from comment 46: 5376 differing pixels: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=126338148&lineNumber=13598 https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127055651&lineNumber=18222 5997 differing pixels: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=126836952&lineNumber=2787 https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127547058&lineNumber=2786 None of the reports have RCWN enabled. And if you have a look at the pictures (see attached archive), you can see that the problem is the same and it differs only in used font. Btw, I have run reftests locally 50 times (with RCWN disabled) and test layout/reftests/font-face/font-display-1.html failed in all cases.

Flags: needinfo?(wkocher)

Michal Novotny [:michal]

Assignee

Updated

•

8 years ago

Depends on: 1398847

Pulsebot

Comment 20

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/5a73ec3a1970 Enable RCWN on Nightly, r=valentin

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 21

•

8 years ago

Backed out for failing mochitests, e.g. dom/tests/mochitest/ajax/offline/test_bug474696.html, on Windows 7 without e10s: https://hg.mozilla.org/integration/mozilla-inbound/rev/c0f7ab91792c957f05d47aacd25892f56dbaff10 Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5a73ec3a19704bab348fa834733981163ffd4a55&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=130577296&repo=mozilla-inbound 848 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out. 873 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out. 921 INFO TEST-UNEXPECTED-FAIL | dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected event: error etc.

Flags: needinfo?(michal.novotny)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 23

•

8 years ago

Please also take a look at this crash which hit the push after this one: https://treeherder.mozilla.org/logviewer.html#?job_id=130590813&repo=mozilla-inbound 02:12:12 INFO - 380 INFO TEST-START | browser/base/content/test/general/browser_gZipOfflineChild.js 02:12:12 INFO - GECKO(3228) | ++DOCSHELL 000000ED06A32000 == 2 [pid = 360] [id = {f4210046-0a0c-49e1-b5dc-de187e74d8a6}] 02:12:12 INFO - GECKO(3228) | ++DOMWINDOW == 12 (000000ED0703B800) [pid = 360] [serial = 387] [outer = 0000000000000000] 02:12:12 INFO - GECKO(3228) | [Child 2092] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297 02:12:12 INFO - GECKO(3228) | ++DOMWINDOW == 13 (000000ED09F18000) [pid = 360] [serial = 388] [outer = 000000ED0703B800] 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297 02:12:12 INFO - GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file z:/build/build/src/netwerk/base/nsChannelClassifier.cpp, line 342 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 53 (000000E133664000) [pid = 2092] [serial = 231] [outer = 0000000000000000] [url = http://example.com/browser/browser/base/content/test/general/dummy_page.html] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 52 (000000E1377E7800) [pid = 2092] [serial = 253] [outer = 0000000000000000] [url = http://example.org/browser/browser/base/content/test/general/dummy_page.html] 02:12:12 INFO - GECKO(3228) | ++DOMWINDOW == 14 (000000ED0AEAB800) [pid = 360] [serial = 389] [outer = 000000ED0703B800] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 51 (000000E133666000) [pid = 2092] [serial = 250] [outer = 0000000000000000] [url = http://example.com/browser/browser/base/content/test/general/dummy_page.html] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 50 (000000E137669800) [pid = 2092] [serial = 247] [outer = 0000000000000000] [url = http://example.org/browser/browser/base/content/test/general/dummy_page.html] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 49 (000000E13365B000) [pid = 2092] [serial = 258] [outer = 0000000000000000] [url = about:blank] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 48 (000000E13737D000) [pid = 2092] [serial = 256] [outer = 0000000000000000] [url = about:blank] 02:12:12 INFO - GECKO(3228) | --DOMWINDOW == 47 (000000E133780000) [pid = 2092] [serial = 260] [outer = 0000000000000000] [url = about:blank] 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413 02:12:12 INFO - GECKO(3228) | ++DOCSHELL 000000ED0B857000 == 3 [pid = 360] [id = {c25721b5-360e-4f1d-a2d9-0326335ff5e6}] 02:12:12 INFO - GECKO(3228) | ++DOMWINDOW == 15 (000000ED0B85C000) [pid = 360] [serial = 390] [outer = 0000000000000000] 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297 02:12:12 INFO - GECKO(3228) | ++DOMWINDOW == 16 (000000ED0AE17000) [pid = 360] [serial = 391] [outer = 000000ED0B85C000] 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: Could not get disk status from nsIDiskSpaceWatcher: file z:/build/build/src/uriloader/prefetch/nsOfflineCacheUpdateService.cpp, line 283 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: Web Components not supported yet: file z:/build/build/src/dom/base/nsDocument.cpp, line 6413 02:12:12 INFO - GECKO(3228) | [Child 360] WARNING: stylo: ServoStyleSets cannot respond to document state changes yet (only matters for chrome documents). See bug 1290285.: file z:/build/build/src/layout/base/PresShell.cpp, line 4297 02:12:13 INFO - GECKO(3228) | [Parent 3228] WARNING: Suboptimal indexes for the SQL statement 0x2bcc4fc840 (http://mzl.la/1FuID0j).: file z:/build/build/src/storage/mozStoragePrivateHelpers.cpp, line 114 02:12:13 INFO - GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file z:/build/build/src/netwerk/protocol/http/nsHttpChannel.cpp, line 4103 02:12:13 INFO - GECKO(3228) | [Parent 3228] WARNING: cache check failed: file z:/build/build/src/netwerk/cache2/OldWrappers.cpp, line 878 02:12:13 INFO - GECKO(3228) | [Parent 3228] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file z:/build/build/src/toolkit/components/mediasniffer/nsMediaSniffer.cpp, line 155 02:12:13 INFO - GECKO(3228) | Assertion failure: mResponseHead (No response head in ODA!!), at z:/build/build/src/netwerk/protocol/http/nsHttpChannel.cpp:7701 02:12:13 INFO - GECKO(3228) | #01: nsInputStreamPump::OnStateTransfer() [netwerk/base/nsInputStreamPump.cpp:618] 02:12:13 INFO - GECKO(3228) | #02: nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) [netwerk/base/nsInputStreamPump.cpp:444] 02:12:13 INFO - GECKO(3228) | #03: nsInputStreamReadyEvent::Run() [xpcom/io/nsStreamUtils.cpp:99] 02:12:13 INFO - GECKO(3228) | #04: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:1040] 02:12:13 INFO - GECKO(3228) | #05: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/threads/nsThreadUtils.cpp:521] 02:12:13 INFO - GECKO(3228) | #06: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:97] 02:12:13 INFO - GECKO(3228) | #07: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:320] 02:12:13 INFO - GECKO(3228) | #08: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:300] 02:12:13 INFO - GECKO(3228) | #09: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:160] 02:12:13 INFO - GECKO(3228) | #10: nsAppShell::Run() [widget/windows/nsAppShell.cpp:232] 02:12:13 INFO - GECKO(3228) | #11: nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:288] 02:12:13 INFO - GECKO(3228) | #12: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4653] 02:12:13 INFO - GECKO(3228) | #13: XREMain::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4817] 02:12:13 INFO - GECKO(3228) | #14: XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4912] 02:12:13 INFO - GECKO(3228) | #15: do_main [browser/app/nsBrowserApp.cpp:237] 02:12:13 INFO - GECKO(3228) | #16: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:311] 02:12:13 INFO - GECKO(3228) | #17: wmain [toolkit/xre/nsWindowsWMain.cpp:118] 02:12:13 INFO - GECKO(3228) | #18: __scrt_common_main_seh [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253] 02:12:13 INFO - GECKO(3228) | #19: KERNEL32.DLL + 0x167e 02:12:13 INFO - GECKO(3228) | #20: ntdll.dll + 0x1c3f1

Michal Novotny [:michal]

Assignee

Comment 24

•

8 years ago

(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #21) > 848 INFO TEST-UNEXPECTED-FAIL | > dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out. > 873 INFO TEST-UNEXPECTED-FAIL | > dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out. > 921 INFO TEST-UNEXPECTED-FAIL | > dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected > event: error Looks like RCWN should be disabled for all mochitests in dom/tests/mochitest/ajax/offline but for some reason they were not covered by bug 1373198. They simply expect the content to be loaded from the cache.

Flags: needinfo?(michal.novotny)

Firefox Bug Husbandry Bot

Comment 25

•

8 years ago

Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258

Priority: -- → P1

Ionuț Goldan [:igoldan]

Comment 26

•

8 years ago

When this landed temporarily, we noticed this improvement on Windows 10: == Change summary for alert #9418 (as of September 13 2017 08:16 UTC) == Improvements: 27% sessionrestore_many_windows windows10-64 pgo e10s 3,602.50 -> 2,615.42 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9418

Ionuț Goldan [:igoldan]

Comment 27

•

8 years ago

The backout brought back the initial baseline: == Change summary for alert #9423 (as of September 13 2017 12:45 UTC) == Regressions: 39% sessionrestore_many_windows windows10-64 pgo e10s 2,615.17 -> 3,639.58 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9423

Michal Novotny [:michal]

Assignee

Updated

•

8 years ago

Depends on: 1399802

Michal Novotny [:michal]

Assignee

Comment 28

•

8 years ago

(In reply to Michal Novotny (:michal) from comment #24) > (In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on > intermittent or backout) from comment #21) > > 848 INFO TEST-UNEXPECTED-FAIL | > > dom/tests/mochitest/ajax/offline/test_bug474696.html | Test timed out. > > 873 INFO TEST-UNEXPECTED-FAIL | > > dom/tests/mochitest/ajax/offline/test_fallback.html | Test timed out. > > 921 INFO TEST-UNEXPECTED-FAIL | > > dom/tests/mochitest/ajax/offline/test_simpleManifest.html | Unexpected > > event: error > > Looks like RCWN should be disabled for all mochitests in > dom/tests/mochitest/ajax/offline but for some reason they were not covered > by bug 1373198. They simply expect the content to be loaded from the cache. Bug 1399802 will fix the failures.

Pulsebot

Comment 29

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/bf572b74ec5f Enable RCWN on Nightly, r=valentin

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 30

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bf572b74ec5f

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 31

•

8 years ago

Backed out for frequently failing reftest layout/reftests/font-face/font-display-1.html on Linux debug and on suspicion for causing leak in bug 1399117: https://hg.mozilla.org/mozilla-central/rev/5b39f8dcc5cd6a38fe99564a9cfc90eedfad1639 See spike in failure on Treherder: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=67672b4fecc49740a30cdd2f61412044f134b585&filter-searchStr=88d460f1480ac18d3b49e1c4d9d4f8ea3422f4f6&tochange=7d0893f1b6ce8494c9d73a0e7fe04ef3cf104b98&selectedJob=131335846

Status: RESOLVED → REOPENED

status-firefox57: fixed → ---

Flags: needinfo?(michal.novotny)

Resolution: FIXED → ---

Target Milestone: mozilla57 → ---

Michal Novotny [:michal]

Assignee

Comment 32

•

8 years ago

(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #31) > Backed out for frequently failing reftest > layout/reftests/font-face/font-display-1.html on Linux debug See my comment #19. RCWN is not causing the problem. It might cause the spike, but it doesn't change the fact that the test is badly written. > and on suspicion for causing leak in bug 1399117: Again, it's not causing it, because it was reported on revision 485c8d248bfc4acc97f92cdc130210d7388bca32, where RCWN is disabled.

Flags: needinfo?(michal.novotny)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 33

•

8 years ago

(In reply to Michal Novotny (:michal) from comment #32) > (In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on > intermittent or backout) from comment #31) > > Backed out for frequently failing reftest > > layout/reftests/font-face/font-display-1.html on Linux debug > > See my comment #19. RCWN is not causing the problem. It might cause the > spike, but it doesn't change the fact that the test is badly written. Best way forward here is to file a bug for the issue and needinfo :bradwerth. Either the issue has to be fixed or the test disabled in that bug, then this one can land.

Michal Novotny [:michal]

Assignee

Comment 34

•

8 years ago

(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #33) > > See my comment #19. RCWN is not causing the problem. It might cause the > > spike, but it doesn't change the fact that the test is badly written. > Best way forward here is to file a bug for the issue and needinfo > :bradwerth. Either the issue has to be fixed or the test disabled in that > bug, then this one can land. Thanks for info. And what to do with the leak? It is the same problem. It happens without this patch too, but more often with it.

Flags: needinfo?(aryx.bugmail)

Michal Novotny [:michal]

Assignee

Updated

•

8 years ago

Depends on: 1400386

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 35

•

8 years ago

> Thanks for info. And what to do with the leak? It is the same problem. It > happens without this patch too, but more often with it. It's still high frequency after that backout, so unrelated and doesn't block the relanding.

Flags: needinfo?(aryx.bugmail)

Ionuț Goldan [:igoldan]

Comment 36

•

8 years ago

Before this was backed out, these improvements were noticed: == Change summary for alert #9451 (as of September 14 2017 12:40 UTC) == Improvements: 27% sessionrestore_many_windows windows10-64 pgo e10s 3,609.23 -> 2,624.92 27% sessionrestore_many_windows windows7-32 pgo e10s 3,547.75 -> 2,598.50 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9451

Ionuț Goldan [:igoldan]

Comment 37

•

8 years ago

The last backout, as noted in comment 31, brought the equivalent perf metrics back: == Change summary for alert #9492 (as of September 15 2017 19:34 UTC) == Regressions: 37% sessionrestore_many_windows windows7-32 pgo e10s 2,562.67 -> 3,500.33 34% sessionrestore_many_windows windows10-64 pgo e10s 2,593.00 -> 3,486.00 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9492

Pulsebot

Comment 38

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c1df5b677b8a Enable RCWN on Nightly, r=valentin

(no longer active)

Comment 39

•

8 years ago

Hi Michal, It seems like this has caused a regression on the Quantum Reference Hardware on Speedometer, around 5% on x64 and 2.5% on x86, please see the links below respectively (please look for the node on the graph with revision b577f883dc898d203b74f7b02345a0a7dc3d32c0): https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856 https://arewefastyet.com/#machine=37&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856 In contrast, the faster Windows machine that also runs these tests shows no regression on this changeset: https://arewefastyet.com/#machine=35&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856 https://arewefastyet.com/#machine=17&view=single&suite=speedometer-misc&subtest=score&start=1505761901&end=1505831856 This may suggest that the machines have different disk/network speeds which makes one prone to a regression. Since Speedometer V2 performance on the Quantum Reference Hardware is a release criteria goal, and we're very late in the cycle, I may very reluctantly request you to back this out for now while you investigate the regression. Sorry to be a pain, and thanks for understanding. :-(

Blocks: Speedometer_V2

Flags: needinfo?(michal.novotny)

Pulsebot

Comment 40

•

8 years ago

Backout by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/8ec345b71a6e Backed out changeset from bug 1392841 due to speedometer regression. r=backout

Ryan VanderMeulen [:RyanVM]

Comment 41

•

8 years ago

Merge of backout: https://hg.mozilla.org/mozilla-central/rev/8ec345b71a6e

Flags: needinfo?(wkocher)

(no longer active)

Comment 42

•

8 years ago

FWIW the backout caused the regression to go away, for example: <https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1505767575&end=1505859938>

(no longer active)

Comment 43

•

8 years ago

Also was the intention to only enable this on Nightly? The patch was shipping the feature as far as I could tell. :-)

Jason Duell

Comment 44

•

8 years ago

Yes, the intent is to ship. After talking to Selena, overholt, and ryanVM about this, we decided that the gold standard here is really a PGO build. So we want to see the numbers for the existing code on 32/64 bit windows PGO builds, both with RCWN on/off--if there's no regression for PGO, we're good to go here. Otherwise, we have to debug what's going on (I know little about Speedindex other than that it is a simple TODO web app. We're not sure, but our impression is that the only networking it should do it loading the resources for the test, i.e. we don't think it does XHR or anything like that while the test is running. Which makes the regression here kind of weird.) I've got my folks testing this on our reference Acer laptops, and we should have numbers soon.

Flags: needinfo?(michal.novotny)

Ionuț Goldan [:igoldan]

Comment 45

•

8 years ago

Yet again, when this relanded: == Change summary for alert #9536 (as of September 19 2017 07:57 UTC) == Improvements: 29% sessionrestore_many_windows windows10-64 pgo e10s 3,469.92 -> 2,481.00 24% sessionrestore_many_windows windows7-32 pgo e10s 3,488.25 -> 2,663.58 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9536

Ionuț Goldan [:igoldan]

Comment 46

•

8 years ago

Then after the backout: == Change summary for alert #9545 (as of September 19 2017 16:37 UTC) == Regressions: 40% sessionrestore_many_windows windows10-64 pgo e10s 2,484.83 -> 3,484.75 17% sessionrestore_many_windows windows7-32 pgo e10s 2,987.71 -> 3,501.33 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9545

Ionuț Goldan [:igoldan]

Updated

•

8 years ago

Depends on: 1401483

(no longer active)

Comment 47

•

8 years ago

(In reply to Jason Duell [:jduell] (needinfo me) from comment #44) > Yes, the intent is to ship. > > After talking to Selena, overholt, and ryanVM about this, we decided that > the gold standard here is really a PGO build. So we want to see the numbers > for the existing code on 32/64 bit windows PGO builds, both with RCWN > on/off--if there's no regression for PGO, we're good to go here. Otherwise, > we have to debug what's going on (I know little about Speedindex other than > that it is a simple TODO web app. We're not sure, but our impression is > that the only networking it should do it loading the resources for the test, > i.e. we don't think it does XHR or anything like that while the test is > running. Which makes the regression here kind of weird.) Speedometer doesn't use XHR as far as I know, but it does load *lots* of web pages, so if turning on the RCWN pref causes some extra code to be run that wouldn't run without it that could explain the regression. In my experience the Acer hardware has been good at detecting small regressions but it has been pretty terrible for profiling purposes unfortunately, since the sampling rate can't be adjusted to less than 1ms (which is 2ms on Windows effectively when using the Gecko profiler) so if the code is all platform independent, profiling Speedometer with and without the pref on Linux or Mac and comparing the profiles to see what shows up additionally when the pref is on may be a good place to start. The way I usually profile Speedometer is I set the sampling frequency to 0.1ms and the buffer size to 450MB and I run Speedometer and let it run ~150 or so of its subtests and then I capture a profile (the benchmark loops through its subtests so by that time you have captured several runs of the subtests). You can use https://mozilla.github.io/arewefastyet-speedometer/2.0/ if you want the same copy of the test that AWFY uses. Hope this helps!

Summary: Enable RCWN on Nightly → Enable RCWN

Michal Novotny [:michal]

Assignee

Comment 48

•

8 years ago

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #47) > captured several runs of the subtests). You can use > https://mozilla.github.io/arewefastyet-speedometer/2.0/ if you want the same > copy of the test that AWFY uses. Hope this helps! The set of tests seems to be different from what is run when I follow https://github.com/mozilla/arewefastyet/wiki/Slave#benchmark-locally. So which one is correct? And how to run multiple tests when we want to have some average number? Should be all individual tests executed separately with a fresh profile?

Flags: needinfo?(ehsan)

Michal Novotny [:michal]

Assignee

Comment 49

•

8 years ago

Ehsan, is there possibility to run speedometer test with a custom patch on the official servers? So far we couldn't reproduce the regression locally on the reference HW.

Jason Duell

Comment 50

•

8 years ago

:bc (Bob Clary) says the answer is no, but he's going to experiment with using try builds to run Speedometer.

Flags: needinfo?(ehsan)

Jason Duell

Comment 51

•

8 years ago

Punting on landing for 57 today, as we've run into bug 1401741, which causes devtools to mark some RCWN channels as never completed, which seems bad enough to block landing for now. We gathered a lot of perf numbers and I'll have an analysis of them soon.

Priority: P1 → P2

Valentina Claudia Ona

Comment 52

•

8 years ago

Hi Valentin, Can you please tell me if 55 and 56 versions are unaffected?

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin]

Comment 53

•

8 years ago

(In reply to Valentina Claudia Ona from comment #52) > Hi Valentin, > Can you please tell me if 55 and 56 versions are unaffected? RCWN is not enabled on any version of Firefox.

Flags: needinfo?(valentin.gosu)

Valentina Claudia Ona

Comment 54

•

8 years ago

I asked you because I want to set tracking flags.

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin]

Comment 55

•

8 years ago

We missed the train for 57 and we definitely will not be shipping on 55 or 56 - so we don't need to track it.

Flags: needinfo?(valentin.gosu)

Valentina Claudia Ona

Updated

•

8 years ago

status-firefox55: --- → unaffected

status-firefox56: --- → unaffected

status-firefox57: --- → affected

(no longer active)

Comment 57

•

8 years ago

FWIW in bug 1400095, a Speedometer regression was noticed the previous time this had landed (comment 29)

Jason Duell

Comment 58

•

8 years ago

So we've looked some into Speedometer performance (with PGO builds). We've gathered two sets of data, and the spreadsheet with our numbers is here: https://docs.google.com/spreadsheets/d/11FqwBvw0QyGXclGWB5sJ9rwGNnjN9txAG02hbdjWE2g/edit#gid=0 1) 5 different Acer reference laptops, on which we manually ran 5-10 Speedometer runs with RCWN on/off (and 32/64 bit) using the remote standard Speedometer webserver. Aggregating all the results, we do see a statistically significant improvement for RCWN on 32 bit (+2.18%), and a perf hit on 64 bit (-0.9%). However, if we exclude the noisiest laptop (PC1), which allegedly had update software running during some of the tests, the effects are less (+.59% and -.54%) and are no longer statistically significant. If we use only the 3 laptops that had "stable" numbers (all std deviations < .5), then the effects drop further (.25%, -.13%) and are a total wash statistically. I'm not enough of a stats person to know if it's kosher to disregard the data from the noisier laptops. We did that sort of thing a lot when I worked in scientific computing... 2) Michal ran more data points using our automated tools (so I presume these used a localhost server). We have a lot more data points here (73 runs for each config), yet oddly enough we see absolutely no significant differences in performance. But the Speedometer numbers are also oddly high (61-62, vs mid-40s for our other runs), so I want to know more about that before making any conclusions. (Update from Michal: "It's the standard reference notebook. If you use framework https://github.com/mozilla/arewefastyet.git and do './execute.py -b remote.speedometer', it shows results ~61. If you do it manually on page https://mozilla.github.io/arewefastyet-speedometer/2.0/ it shows results around 42.' Ehsan, is that something you've seen before? Are there differences besides the location of the web server that you know of? Which method is arewefastyet.com using?) 3) Just as for our automated Speedometer results, we don't see any significant differences with RCWN on workstations--only on our reference hardware. We can start digging into gecko profiles, necko logs, etc. But before we do I'd like to have some idea of the goalposts here. In general, we have never considered tests that run with a localhost web server (and usually via an HTTP proxy--do we know if we use a proxy here too?) to be useful tests of necko performance, since it's not a realistic environment for networking. In particular, let me note that running a localhost server (as I assume we do in our Speedometer results from bot runs) is a recipe for artificially poor performance with RCWN. If RCWN decides to race with cache, that means that 1) the local disk is probably already under high load, and 2) we'll fire up a request to the web server, which will then add more I/O trying to access the same data (but in a different file). So there's artificial downside to RCWN in this environment. (Another way of putting it is that RCWN is based on gambling that a different I/O channel may be faster, and so worth some overhead to try. In this environment the 'different I/O channel' winds up being the same--the local disk--so we only add overhead). That said, I understand that 1) Speedometer is a benchmark that may have market relevance that make it unavoidable (could we turn off RCWN for it? Or maybe it's as simple as teaching RCWN to not race for connections that resolve to localhost?); 2) We do seem to see these weird results when we use a remote server, too, so we need to look into that. I think our next step is to get gecko/necko logs on a machine and see if we see anything interesting (maybe honza's log tracer can help?). I personally have no idea why we'd be speeding things up on 32 bits, but slowing them down on 64. Have we seen that pattern elsewhere and learned from it?

Flags: needinfo?(ehsan)

(no longer active)

Comment 59

•

8 years ago

(In reply to Jason Duell [:jduell] (needinfo me) from comment #58) > 1) 5 different Acer reference laptops, on which we manually ran 5-10 > Speedometer runs with RCWN on/off (and 32/64 bit) using the remote standard > Speedometer webserver. Aggregating all the results, we do see a > statistically significant improvement for RCWN on 32 bit (+2.18%), and a > perf hit on 64 bit (-0.9%). However, if we exclude the noisiest laptop > (PC1), which allegedly had update software running during some of the tests, > the effects are less (+.59% and -.54%) and are no longer statistically > significant. If we use only the 3 laptops that had "stable" numbers (all > std deviations < .5), then the effects drop further (.25%, -.13%) and are a > total wash statistically. > > I'm not enough of a stats person to know if it's kosher to disregard the > data from the noisier laptops. We did that sort of thing a lot when I > worked in scientific computing... I don't know what to say. This has landed twice on AWFY and it has shown regressions both times (comment 57 and comment 39) and the regressions are clear from the graphs of runs over each commit in both cases. Has there been a code change in between the latest landing and these measurements that is expected to have removed the regressions? If not, I wouldn't expect the regression to go away on its own. Has anyone tried to profile a Speedometer run with and without RCWN to see what extra code runs and get an understanding of where the regression comes from? Also, as I noted on IRC last week, even though the PGO builds are what we distribute to the users and are what "matters" in that sense, you're making your job too difficult by using them for comparisons instead of non-PGO builds. For example, around comment 39 we were seeing a 5% regression on x64 on non-PGO builds and around a 1% regression on PGO builds. Those builds therefore may be easier for profiling purposes since they will make the differences in performance more pronounced. (FTR I personally do all of my Speedometer profiling on non-PGO builds myself, the results transfer very well in my experience.) > 2) Michal ran more data points using our automated tools (so I presume these > used a localhost server). We have a lot more data points here (73 runs for > each config), yet oddly enough we see absolutely no significant differences > in performance. But the Speedometer numbers are also oddly high (61-62, vs > mid-40s for our other runs), so I want to know more about that before making > any conclusions. (Update from Michal: "It's the standard reference notebook. > If you use framework https://github.com/mozilla/arewefastyet.git and do > './execute.py -b remote.speedometer', it shows results ~61. If you do it > manually on page https://mozilla.github.io/arewefastyet-speedometer/2.0/ it > shows results around 42.' Ehsan, is that something you've seen before? Are > there differences besides the location of the web server that you know of? > Which method is arewefastyet.com using?) If you are getting numbers in the range of sixties, you are using an old version of Speedometer before they switched to using geometric mean to report the test score (https://bugs.webkit.org/show_bug.cgi?id=172968). It is entirely possible for a regression that shows up when using geometric mean to not show up when using arithmetic mean, so please disregard test runs with results in the sixties range on the reference hardware. When in doubt, use https://mozilla.github.io/arewefastyet-speedometer/2.0/, that is the latest version of Speedometer that AWFY uses. BTW, Speedometer benchmark doesn't care which webserver it is served off of, as it doesn't measure any of the network time. > 3) Just as for our automated Speedometer results, we don't see any > significant differences with RCWN on workstations--only on our reference > hardware. Yes, AWFY also didn't show a regression on a faster machine either... > We can start digging into gecko profiles, necko logs, etc. But before we do > I'd like to have some idea of the goalposts here. Yes, I think the right next step here is for someone to profile the test runs to see what extra code runs with RCWN turned on, and why it takes so long. And for that profiling to happen on a *non-PGO* build if you are looking for an easy path to an answer. :-) > In general, we have never considered tests that run with a localhost web > server (and usually via an HTTP proxy--do we know if we use a proxy here > too?) I don't know anything about the network setup on AWFY, you should ask Bob if you're interested. > to be useful tests of necko performance, since it's not a realistic > environment for networking. Yes, and that is the point here. RCWN is regressing the performance on non-networking parts of the browser. :-) Let me describe, because it may not be obvious what Speedometer measures. Speedometer loads some web pages in iframes, waits for the load to finish, and then starts measuring. Each measurement works like this: var t1 = new Date(); runSomeCode(); var t2 = new Date(); syncTime = (t2 - t1); var t3 = new Date(); setTimeout(function() { var t4 = new Date(); asyncTime = (t4 - t3); }, 0); syncTime and asyncTime are reported for each subset. runSomeCode() is the benchmark code that Speedometer runs which is some synchronous JS operation. So basically the benchmark runs some JS, measures how long that takes, then schedules a timeout and measures how long it takes for the timeout handler to run (as a proxy to see how long it takes for any async tasks that the benchmarked code has scheduled to run) and reports that as asyncTime, and then moves on to run the next iframe. This regression means that RCWN is either increasing something in the syncTimes or in the asyncTimes. It's hard to imagine how something may have changed in what runs in syncTimes because it's all code triggered by web page JS, and I'm assuming RCWN doesn't have anything triggered by JS/DOM or such. But more likely it is scheduling some runnables to run on the event queue, which happen to run inside the period of time measured by asyncTime, and those runnables are taking a long time. The goal here is to find out the nature of the additional code running and figure out what we can do about it. > In particular, let me note that running a > localhost server (as I assume we do in our Speedometer results from bot > runs) is a recipe for artificially poor performance with RCWN. If RCWN > decides to race with cache, that means that 1) the local disk is probably > already under high load, and 2) we'll fire up a request to the web server, > which will then add more I/O trying to access the same data (but in a > different file). So there's artificial downside to RCWN in this > environment. (Another way of putting it is that RCWN is based on gambling > that a different I/O channel may be faster, and so worth some overhead to > try. In this environment the 'different I/O channel' winds up being the > same--the local disk--so we only add overhead). Because Speedometer doesn't measure page load time, it's unlikely that changes like this to cause an impact to what the test measures. > That said, I understand that 1) Speedometer is a benchmark that may have > market relevance that make it unavoidable (could we turn off RCWN for it? Or > maybe it's as simple as teaching RCWN to not race for connections that > resolve to localhost?); That may be a good idea because of the problem you mentioned above, but I'm not sure if it would help with Speedometer itself. > 2) We do seem to see these weird results when we use > a remote server, too, so we need to look into that. I think our next step > is to get gecko/necko logs on a machine and see if we see anything > interesting (maybe honza's log tracer can help?). I can't imagine how logs could help because I don't know what we're looking for yet, hopefully my description above is helpful in thinking about what to look for when digging into this. I think a sampling profiler is sufficient to give answers here. BTW unless the code here is Windows specific you should be able to do the profiling on any platform. (On Windows it's impossible currently to get a meaningful sampling frequency below 2ms which is way too high for useful profiling. In my experience profiling on the reference hardware is pretty difficult.) > I personally have no idea why we'd be speeding things up on 32 bits, but > slowing them down on 64. Have we seen that pattern elsewhere and learned > from it? No, I can't imagine why this would be to be honest. I suspect once we find out the underlying cause of the regression it would be evident...

Flags: needinfo?(ehsan)

Michal Novotny [:michal]

Assignee

Updated

•

8 years ago

Depends on: 1401741

Valentin Gosu [:valentin]

Comment 60

•

8 years ago

So, the profiling effort hasn't been going so great: I tried profiling with a 1ms interval, and set the limit to 900 Mb. The test ended up taking somewhere around an hour, and then the profile could not be opened (after a few hours of loading screen I gave up). I tried reducing the testcase, and only profiled the first 20 of 460 tests of the speedometer, with a 0.1 ms interval. [no-rcwn] https://perfht.ml/2xQZzBQ [with-rcwn] https://perfht.ml/2xS18zo My profiling skills aren't the best, but I don't see much of a difference between the two captures. Any help would be very welcome.

Mike Conley (:mconley) (:⚙️)

Updated

•

8 years ago

Flags: needinfo?(mconley)

Michal Novotny [:michal]

Assignee

Comment 61

•

8 years ago

(In reply to Valentin Gosu [:valentin] from comment #60) > So, the profiling effort hasn't been going so great: > I tried profiling with a 1ms interval, and set the limit to 900 Mb. The test > ended up taking somewhere around an hour, and then the profile could not be > opened (after a few hours of loading screen I gave up). This looks strange. I used the profiler on the reference HW with non-pgo build and the profiler overhead is negligible. The whole test finishes in about 200 seconds.

Mike Conley (:mconley) (:⚙️)

Comment 62

•

8 years ago

(In reply to Valentin Gosu [:valentin] from comment #60) > > [no-rcwn] https://perfht.ml/2xQZzBQ > [with-rcwn] https://perfht.ml/2xS18zo > > My profiling skills aren't the best, but I don't see much of a difference > between the two captures. Any help would be very welcome. I've started to poke at these profiles. One thing I've noticed is that you have at least one add-on enabled - I think it's ABP. For this sort of work, I find it's often best if we can reduce the number of things that are running to the bare minimum to lower the number of red herrings in the profile. Would it be possible to gather the profiles again, but this time, with _only_ the Gecko Profiler add-on enabled?

Flags: needinfo?(mconley) → needinfo?(valentin.gosu)

Michal Novotny [:michal]

Assignee

Comment 63

•

8 years ago

I made following profiles on the reference HW with non-pgo build of revision b577f883dc898d203b74f7b02345a0a7dc3d32c0 (the one mentioned in comment #39). rcwn enabled: https://perf-html.io/public/f26441c356bfa324c555ff0ac6af7cee83552756 https://perf-html.io/public/ba55ac18b277c03306fa85230c61820d038b7dcf https://perf-html.io/public/fcf9db5dc1d4f8c5039e85d02f33334873346e1c rcwn disabled: https://perf-html.io/public/8b0505f940911ba611d2b1793b9be13f0a3ee386 https://perf-html.io/public/f99474cfc7200f6f8112c3d675ca75d9c3c36552 https://perf-html.io/public/2cc98bddab33a3c0b4d1dbf6985d687c7351162f

Flags: needinfo?(valentin.gosu)

Mike Conley (:mconley) (:⚙️)

Comment 64

•

8 years ago

Sorry, I realize you're all kinda chomping at the bit here, but I was traveling for MozFest and only just got back. I've been looking at the profiles in comment 63 today, and one thing sticks out to me: the profiles with RCWN enabled have 3 content processes, whereas the RCWN disabled ones have 2 content processes. Why? What happens if you restrict the number of content processes to just 1 via dom.ipc.processCount? Does RCWN still regression Speedometer then?

Flags: needinfo?(michal.novotny)

Valentin Gosu [:valentin]

Comment 65

•

8 years ago

Since yesterday we merged, we should consider enabling this in Nightly and letting it ride the release train. Our profiling didn't reveal the cause for the perceived performance regression, and that may it is within the test's error range. We did trigger multiple runs for the tests and the regression delta went way down (between 0% and 0.44%). See [1] We also looked at subtests. Some show a regression, while others an improvement. This differs a lot between runs, leading us to believe that this is caused by the change in the timing of cache IO operations on the background thread. Telemetry indicates RCWN can reduce the time to load a channel by a median of 143 milliseconds [2] when the network request wins the race 7.33% of all loads [3]. We need to enable it to verify if telemetry shows an improvement in overall page load time as well. [1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7a767bfcdd084f80c3e80b6b8cf401f4715c11b8&newProject=try&newRevision=d17ec5fedeaaedafdf58a7fe3251496041e64ea3&framework=1 [2] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-31&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_SAVED_TIME&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-06-12&table=0&trim=1&use_submission_date=0 [3] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-08-01&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=NETWORK_RACE_CACHE_WITH_NETWORK_USAGE_2&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-09&table=0&trim=1&use_submission_date=0

Jason Duell

Comment 66

•

8 years ago

I support landing this and seeing if we still see the Speedometer regressions. Keep a close eye on the Speedometer results (esp the tests in comment 39).

Pulsebot

Comment 67

•

8 years ago

Pushed by mnovotny@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/64b1b02e184b Enable RCWN on Nightly, r=valentin

Raul Gurzau (:RaulG)

Comment 68

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/64b1b02e184b

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

status-firefox59: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla59

Virtual_ManPL [:Virtual] 🇵🇱 - (please needinfo? me - so I will see your comment/reply/question/etc.)

Updated

•

8 years ago

Depends on: 1418752

Naveed Ihsanullah [:naveed]

Comment 69

•

8 years ago

Is this landing the source of the 10% regression in Speedometer - https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1505336371&end=1511198137? That feels like too big a deficit to land. Who can take a look at this?

Michal Novotny [:michal]

Assignee

Comment 70

•

8 years ago

(In reply to Naveed Ihsanullah [:naveed] from comment #69) > Is this landing the source of the 10% regression in Speedometer - > https://arewefastyet.com/#machine=36&view=single&suite=speedometer- > misc&subtest=score&start=1505336371&end=1511198137? Where do you see 10% regression? I see this in the graph: suite: speedometer-misc 2.0-r221659 score: 39 delta: -0.9 (2.2% worse) source: Firefox (Ion PGO) revs: a77c628829b389ed6ac608eadc88b09eb5115bef to 3a0aac55195fd50ca4b1b41be450bfbaaafc191c tested: Nov 17 to Nov 19, 2017

Flags: needinfo?(michal.novotny) → needinfo?(nihsanullah)

Jason Duell

Comment 71

•

8 years ago

2.2% is probably too big. And from the looks of it, some other patches that landed in the day or so before we enabled RCWN caused some Speedometer regressions, so it's all a bit murky right now. We should probably back out, wait to see if there are other patches that need backing out, and when the Speedometer numbers have settled down we can try again.

Ben Kelly [:bkelly, not reviewing]

Comment 72

•

8 years ago

Could someone explain a bit how the RCWN code works? AFAIK speedometer tries not to measure load times. So it seems perhaps some extra CPU is being used outside of normal network load operations. All it takes is a few extra runnables on the main thread at the wrong time to regress the benchmark. Does this make any sense with how RCWN is supposed to work? Also, is it possible the RCWN algorithm does some unnecessary work when network and disk speeds are nearly identical? Does it do more racing because it can't make up its mind which is faster?

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

status-firefox57: affected → unaffected

status-firefox58: --- → unaffected

Michal Novotny [:michal]

Assignee

Comment 73

•

8 years ago

(In reply to Ben Kelly [:bkelly] from comment #72) > Could someone explain a bit how the RCWN code works? AFAIK speedometer > tries not to measure load times. So it seems perhaps some extra CPU is > being used outside of normal network load operations. All it takes is a few > extra runnables on the main thread at the wrong time to regress the > benchmark. Does this make any sense with how RCWN is supposed to work? When RCWN is enabled, http channel waits for the cache entry for a limited time (between 0 and 500ms) and then issues the network request. If we get the response from the cache first, we cancel the network transaction. If the network is faster, we ignore the cache entry. It can happen, that we run few more runnables with RCWN than without it. > Also, is it possible the RCWN algorithm does some unnecessary work when > network and disk speeds are nearly identical? Does it do more racing > because it can't make up its mind which is faster? Normally, when network wins we ignore the cache entry in OnCacheEntryCheck. In rare cases, the network wins after OnCacheEntryCheck and before OnCacheEntryAvailable is called and then we do a bit more stuff. Anyway, AFAICS the numbers are now back in normal without backing RCWN out, right? https://arewefastyet.com/#machine=36&view=single&suite=speedometer-misc&subtest=score&start=1509771214&end=1511817810

Ben Kelly [:bkelly, not reviewing]

Comment 74

•

8 years ago

Wow, yes. The numbers do look much better today. I believe that change is probably due to bug 1418389. Its unclear if it would have come down even farther with RCWN backed out or not. For example, around Oct 30 on PGO we were ~41/42 and so far today we are back to ~40/41. I wonder if the RCWN algorithm can hold the channel and its various listeners alive for longer which puts more stress on GC/CC. Honestly I'd be somewhat in favor of just disabling RCWN on speedometer instead of backing it out. My reasoning: 1. Speedometer is not representative of a real network and is not designed to test network stuff. 2. Speedometer is also a stress test and almost overly sensitive at times. Its great for catching DOM/js issues, but we shouldn't let it stop us from doing smart things in our network stack. 3. The regression, if still there, is relatively small. It doesn't show up on the faster i7-4700HQ at all. 4. If disabling does not show an improvement here then we can just re-enable again. Otherwise we can investigate further. 5. Real users can benefit from RCWN in the meantime. (I assume you have metrics showing wins on real devices.) That's just my gut feeling on this one. I guess its up to Naveed and Jason, though.

Jason Duell

Comment 75

•

8 years ago

OK, sounds like our plan should be 1) disable RCWN temporarily and see if Speedometer changes. If it doesn't change significantly, re-enable and we're done. 2) If it still does affect Speedometer, look into either A) See if changing the race delay fixes things (IIRC we've been planning to experiment with upping the delay anyway) B) See if we can disable RCWN for Speedometer (is that a matter of changing about:config prefs in the test somehow?) Let's try #A first if it's not too hard to do.

Flags: needinfo?(michal.novotny)

Ben Kelly [:bkelly, not reviewing]

Comment 76

•

8 years ago

Bob can you help Michal and Jason flip a pref in the AWFY speedometer settings? (Assuming everyone agrees with this approach.)

Flags: needinfo?(bob)

Bob Clary [:bc] (inactive)

Comment 77

•

8 years ago

sure > bug 1421650

Flags: needinfo?(bob)

Comment 78

•

8 years ago

(In reply to Bob Clary [:bc:] from comment #77) > sure > bug 1421650 Thanks! Jason, I guess you meant to do this test on speedometer only and not on m-c, right?

Flags: needinfo?(michal.novotny)

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

8 years ago

Blocks: 1052467

Jason Duell

Comment 79

•

8 years ago

Sure, if we can test Speedometer w/o landing in m-c, that's great.

Jason Duell

Comment 80

•

8 years ago

And... we're keeping the pref enabled. RCWN is finally landed, enabled, and done \o/

magicp

Updated

•

8 years ago

Depends on: 1430698

Michal Novotny [:michal]

Assignee

Updated

•

7 years ago

Flags: needinfo?(nihsanullah)

Honza Bambas (:mayhemer)

Updated

•

7 years ago

Depends on: 1494405

Valentin Gosu [:valentin]

Updated

•

3 years ago

Regressions: 1814961

patch 8 years ago Michal Novotny [:michal] 1.09 KB, patch	valentin : review+	Details \| Diff \| Splinter Review
test and reference images of the failures 8 years ago Michal Novotny [:michal] 9.86 KB, application/x-xz		Details