Closed Bug 1244259 Opened 8 years ago Closed 8 years ago

60% xperf non main file io regression (now bi-modal) on fx-team (v.47) Jan 28, from push d59fd186550c

Categories

(Core :: Networking, defect)

Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Tracking Status: firefox47 --- fixed

People

(Reporter: jmaher, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(2 files)

This is a fun bug: for e10s xperf, we have a newly bi-modal counter:
https://treeherder.mozilla.org/perf.html#/graphs?series=[fx-team,2f3af3833d55ff371ecf01c41aeee1939ef3a782,1]&highlightedRevisions=d59fd186550c

To run this on try:
try: -b o -p win32 -u none -t xperf-e10s

This is a test where we don't need to retrigger often; it is not measuring time, but file I/O.

I don't mind bisecting this on try server.  I suspect looking at the raw data from xperf will shed light on what is happening.
:Gijs, any thoughts on this?  I am happy to help collect more data, just let me know a direction to go here as you might have more insight into this.
Flags: needinfo?(gijskruitbosch+bugs)
No idea. Never had to deal with this test before, and no idea why the code change would cause file IO changes... if anything we should have become more restrictive about loading things.

Is there some way we can get a list of what the failing (ie high io, right?) builds are accessing? Like, given that it's bimodal, is there a way to figure out the difference between the two cases? Also, are the low points just pgo builds or something? Did this only happen on Windows or also elsewhere? (do we run e10s xperf elsewhere?)
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(jmaher)
xperf is only run on Windows 7, so I am not sure about the other platforms.  I will leave this needinfo around for me to get the list of files between high/low.  For reference, the bi-modality is on the same build type (this link is for non-pgo).
Also, and I don't mean to be all deflective, but if it's bimodal, how sure are you that it's this push? I'm also not sure how the bimodality jibes with "no need to retrigger often" :-)
I did retrigger this push and the previous push a few times to ensure the bi-modality started with this push.  As you can tell from the graph, this is not a noisy test :)
Flags: needinfo?(jmaher)
Thanks for clarifying. I'll look at the logs in a bit to see if I can distill something from that, but it's late Friday night here, so it might need to wait until Monday.

(In reply to Joel Maher (:jmaher) from comment #3)
> I will leave this needinfo around for me to get the list of files between
> high/low.

Doing that. I'm a little lost here... Thanks for your help, Joel! :-)
Flags: needinfo?(jmaher)
(In reply to :Gijs Kruitbosch from comment #6)
> Thanks for clarifying. I'll look at the logs in a bit to see if I can
> distill something from that, but it's late Friday night here, so it might
> need to wait until Monday.
> 
> (In reply to Joel Maher (:jmaher) from comment #3)
> > I will leave this needinfo around for me to get the list of files between
> > high/low.
> 
> Doing that. I'm a little lost here... Thanks for your help, Joel! :-)

FWIW, I couldn't see anything in the logs... so a list of files would be really helpful here.
I have some info here.

The simple answer is: in the case where we have more file IO bytes, these are the files it is coming from (total bytes per file):
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.cache	44
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.sbstore	232
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.cache	44
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.sbstore	232
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.cache	44
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.sbstore	232
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.cache	76
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.sbstore	272
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.cache	44
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.sbstore	232
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.cache	44
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.pset	65536
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.sbstore	232


Here are the common files between the old and new:
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\addons.json	24
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions.json	5382
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\memory.js	63046
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\MozillaFileLogger.js	62601
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\pageloader.js	37569
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\pageloader.xul	62852
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\Profiler.js	60408
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\quit.js	62192
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\extensions\pageloader@mozilla.org\chrome\report.js	61219
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\frequencyCap.json	2
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\search.json.mozlz4	25312
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\sessionCheckpoints.json	288
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\sessionCheckpoints.json.tmp	53
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\sessionstore.js	596
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\times.json	50
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\webapps\webapps-1.json	4098
C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\webapps\webapps.json	65534


:gijs, do you have thoughts on why the safebrowsing/ dir is being accessed now intermittently?
Flags: needinfo?(jmaher) → needinfo?(gijskruitbosch+bugs)
Thanks, Joel!

(In reply to Joel Maher (:jmaher) from comment #8)
> :gijs, do you have thoughts on why the safebrowsing/ dir is being accessed
> now intermittently?

Not really... did the runtime of the test change (ni for this; I tried to distill that info from the log but found it hard), and are we simply hitting some timer that causes us to try to update our databases or something? Do we know what kind of IO this is (read/write)?

When you got this info, did you get a list of network pages hit, too? If there's a difference there that might help explain the safebrowsing link?

Just to be 100% certain, is 100% of the extra IO coming from the safebrowsing pages?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(jmaher)
100% of the IO is coming from the safebrowsing pages.  The runtime for the overall job seems to have changed by ~4 minutes.  Maybe that gives a safebrowsing timer the chance to fire.

I can figure out the network IO; I'm not sure if I will get a full list of URL requests.

the process time for the test is (verified with 2 different data points for each high and low):
* high fileIO: 1:45
* low  fileIO: 1:35

So that indicates something in the slower case is taking 10 extra seconds.  A reasonable theory is that the extra 10 seconds allow safe browsing work to be done in the profile.

* Why are we getting 10 seconds longer intermittently?
* Are we having issues setting up the profile or running/closing the warmup run?
* could there be other machine interactions causing this 10 seconds of browser runtime?

We can most likely rule out machine-specific issues, but it is something to look at.

Any other thoughts :Gijs?
Flags: needinfo?(jmaher)
François, any idea why we would suddenly be doing much more safebrowsing IO - some of the time - after landing tightened security checks for opening pages with nested protocols (esp. view-source)? Is there a way to get more logging output from safebrowsing in a way that dumps into logfiles for talos try runs, so we can see what it's doing?
Flags: needinfo?(francois)
(In reply to Joel Maher (:jmaher) from comment #8)
> The simple answer is: in the case where we have more file IO bytes, these are the files it is coming from (total bytes per file):
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.cache	44
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-forbid-simple.sbstore	232
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.cache	44
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-malware-simple.sbstore	232
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.cache	44
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-phish-simple.sbstore	232
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.cache	76
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-track-simple.sbstore	272
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.cache	44
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-trackwhite-simple.sbstore	232
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.cache	44
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.pset	65536
> C:\Users\cltbld\AppData\Local\Temp\<tmp>\profile\safebrowsing\test-unwanted-simple.sbstore	232

Are these numbers showing file size? Because the .pset files seem rather large to me; here's what they are on my Linux Nightly profile:

-rw-r--r-- 1 francois francois      44 Feb  2 14:59 test-forbid-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-forbid-simple.pset
-rw-r--r-- 1 francois francois     232 Feb  2 14:59 test-forbid-simple.sbstore
-rw-r--r-- 1 francois francois      44 Feb  2 14:59 test-malware-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-malware-simple.pset
-rw-r--r-- 1 francois francois     232 Feb  2 14:59 test-malware-simple.sbstore
-rw-r--r-- 1 francois francois      44 Feb  2 14:59 test-phish-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-phish-simple.pset
-rw-r--r-- 1 francois francois     232 Feb  2 14:59 test-phish-simple.sbstore
-rw-r--r-- 1 francois francois      76 Feb  2 14:59 test-track-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-track-simple.pset
-rw-r--r-- 1 francois francois     272 Feb  2 14:59 test-track-simple.sbstore
-rw-r--r-- 1 francois francois      44 Feb  2 14:59 test-trackwhite-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-trackwhite-simple.pset
-rw-r--r-- 1 francois francois     232 Feb  2 14:59 test-trackwhite-simple.sbstore
-rw-r--r-- 1 francois francois      44 Feb  2 14:59 test-unwanted-simple.cache
-rw-r--r-- 1 francois francois      16 Feb  2 14:59 test-unwanted-simple.pset
-rw-r--r-- 1 francois francois     232 Feb  2 14:59 test-unwanted-simple.sbstore

(In reply to Joel Maher (:jmaher) from comment #10)
> 100% of the IO is coming from the safebrowsing pages.  The runtime seems to
> have changed for the overall job by ~4 minutes.  Maybe that is a timer for
> safebrowsing to take advantage.

Maybe that intermittent delay is the 0-5 min delay (before the first Safe Browsing list update) that got added in https://hg.mozilla.org/mozilla-central/rev/988fdc8043e0 ?

(In reply to :Gijs Kruitbosch from comment #11)
> Is there a way to get more
> logging output from safebrowsing in a way that dumps into logfiles for talos
> try runs, so we can see what it's doing?

The PR_LOG modules for Safe Browsing are listed here: https://wiki.mozilla.org/Security/Safe_Browsing#QA
Flags: needinfo?(francois)
The 64K for the .pset files is the total bytes of IO, so this covers read and write.  In this case we are reading the files and they are seen as 64K.  Here are all the *.pset file accesses from the xperf file:
['FileIoRead', '781778', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0x85ba4a58', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-forbid-simple.pset', '8 | 11 | 18']
['FileIoRead', '782174', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0xad600f80', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-malware-simple.pset', '8 | 11 | 18']
['FileIoRead', '782567', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0x888f9f80', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-phish-simple.pset', '8 | 11 | 18']
['FileIoRead', '782946', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0x8a13d6d8', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-track-simple.pset', '8 | 11 | 18']
['FileIoRead', '783313', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0xb06e0808', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-trackwhite-simple.pset', '8 | 11 | 18']
['FileIoRead', '783679', 'firefox.exe (3852)', '3328', 'Unknown (  -1)', '-1', '3', '0xb068dd78', '0x883e9f80', '0x0000000000', '0x00010000', '0x00040900', '0x00000000', 'Low', 'C:\\Users\\cltbld\\AppData\\Local\\Temp\\tmp8bax06\\profile\\safebrowsing\\test-unwanted-simple.pset', '8 | 11 | 18']
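
For reference, a minimal sketch of how one could total the IO bytes per file from rows like these (the column indexes are guesses read off the sample rows above, not a documented schema: field 10, e.g. 0x00010000 = 65536, looks like the IO size, and field 14 is the path):

// Sum IO bytes per file from parsed xperf rows.
function totalBytesPerFile(rows) {
  var totals = {};
  rows.forEach(function (row) {
    var bytes = parseInt(row[10], 16); // assumed: IO size field, in hex
    var path = row[14];                // assumed: file path field
    totals[path] = (totals[path] || 0) + bytes;
  });
  return totals;
}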


I have no idea what the file sizes are in the profile on the test machine; I could do a try push if you wish and get the info from that.  Looking at my local profile, I see 32K+:
-rw-r--r-- 1 jmaher jmaher 322774 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-badbinurl-shavar.pset
-rw-r--r-- 1 jmaher jmaher 298046 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-malware-shavar.pset
-rw-r--r-- 1 jmaher jmaher 326400 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-phish-shavar.pset
-rw-r--r-- 1 jmaher jmaher 178784 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-unwanted-shavar.pset
-rw-r--r-- 1 jmaher jmaher     16 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/test-malware-simple.pset
-rw-r--r-- 1 jmaher jmaher     16 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/test-phish-simple.pset
-rw-r--r-- 1 jmaher jmaher     16 Apr 27  2015 /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/test-unwanted-simple.pset
(In reply to :Gijs Kruitbosch from comment #11)
> François, any idea why we would suddenly be doing much more safebrowsing IO
...
>Are these numbers showing filesize? Because the .pset files seem rather large 
>to me, here's what they are on my Linux Nightly profile:

I strongly suspect this is related to bug 1211090, which made .pset file operations
use buffered IO with a 64k buffer.

There are fallocate changes in the same patch, but they should (famous last words)
not have made any difference.

Does BufferedOutputStream do something funny that causes it to actually create more IO
for files < buffer size? Like write out the buffer and truncate or so?
(In reply to Joel Maher (:jmaher) from comment #13)
> the 64K for the .pset files is the total bytes IO- so this is read/write. 
> In this case we are reading the files and they are seen as 64K.  Here are
> all the *.pset file accesses from the xperf file:

Need to dig into our BufferedStream impl!

> I have no idea what the file sizes are in the profile on the test machine, I
> could do a try push if you wish and get the info from that.  Looking at my
> local profile, I see 32K+:
> -rw-r--r-- 1 jmaher jmaher 322774 Apr 27  2015
> /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-
> badbinurl-shavar.pset
> -rw-r--r-- 1 jmaher jmaher 298046 Apr 27  2015
> /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-
> malware-shavar.pset
> -rw-r--r-- 1 jmaher jmaher 326400 Apr 27  2015
> /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-phish-
> shavar.pset
> -rw-r--r-- 1 jmaher jmaher 178784 Apr 27  2015
> /home/jmaher/.mozilla/firefox-trunk/9ipvparl.default/safebrowsing/goog-
> unwanted-shavar.pset

The goog-* files are updated from the "real" server on the network. On a fresh
profile and in (non-networked) testing conditions, you will have the small <1k
files as Francois showed.
Pushed:

remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=570390421da4

Trying some more avenues for investigation: because this is restricted to e10s as well... Joel, do we know if the same safebrowsing IO is happening in the non-e10s case already (before bug 1172165)?

I wonder if this is all just a freakish "you made this take slightly longer and now occasionally we hit safebrowsing updates making it take significantly (10s) longer", in which case I suspect that even just NSPR logging might turn this from intermittent/bimodal to permanently-taking-longer, which then defeats the point of the logging to figure out what the difference here is. I'm almost hoping there is a more deterministic cause...
Flags: needinfo?(jmaher)
the non-e10s results do not seem to be affected at all, but do keep in mind that non-e10s has been bimodal throughout history.

If this does become unimodal on the high side, then we have much more evidence that this is a timing issue.
Flags: needinfo?(jmaher)
(In reply to François Marier [:francois] from comment #12)
> (In reply to Joel Maher (:jmaher) from comment #8)
> (In reply to Joel Maher (:jmaher) from comment #10)
> > 100% of the IO is coming from the safebrowsing pages.  The runtime seems to
> > have changed for the overall job by ~4 minutes.  Maybe that is a timer for
> > safebrowsing to take advantage.
> 
> Maybe that intermittent delay is the 0-5 min delay (before the first Safe
> Browsing list update) that got added in
> https://hg.mozilla.org/mozilla-central/rev/988fdc8043e0 ?

If this is the issue, we can make that a configurable pref and set it in tests.
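
Something like this in the update scheduling code, say (a sketch only; the pref name is made up, and the base delay stays as it is today):

Components.utils.import("resource://gre/modules/Services.jsm");

// Hypothetical pref so tests can zero out the 0-5 min fuzz.
const kFuzzPref = "browser.safebrowsing.initialUpdateDelayFuzz";
var fuzzMax = 5 * 60 * 1000; // current hardcoded upper bound
try {
  fuzzMax = Services.prefs.getIntPref(kFuzzPref); // talos would set this to 0
} catch (e) {
  // pref not set; keep the default fuzz
}
var initialUpdateDelay = 0; // stands in for the existing base delay
initialUpdateDelay += Math.floor(Math.random() * fuzzMax);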
(In reply to Gian-Carlo Pascutto [:gcp] from comment #18)
> (In reply to François Marier [:francois] from comment #12)
> ...
> > Maybe that intermittent delay is the 0-5 min delay (before the first Safe
> > Browsing list update) that got added in
> > https://hg.mozilla.org/mozilla-central/rev/988fdc8043e0 ?
> 
> If this is the issue, we can make that a configurable pref and set it in
> tests.


> +  // Add a fuzz of 0-5 minutes.
> +  initialUpdateDelay += Math.floor(Math.random() * (5 * 60 * 1000));

Do we really have a random delay of between zero and 5 minutes before safe browsing does things?

We should definitely not have it random in tests, but OTOH we should also be able to measure its effect on performance somehow.

IMO, for all "normal" performance tests we should have it completely disabled, but have one or few tests where it's happening while we're measuring performance, e.g. maybe run tsvgx and time it such that the update happens right in the middle of the test.
Resubmitted and now ignoring non-whitelisted IO:

remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=bbdba21e60ea

That said, it seems like the IO from the NSPR logging itself might make the distinction between the high/low IO states harder to see...
(In reply to Avi Halachmi (:avih) from comment #19)
> Do we really have a random delay between zero to 5 minutes before safe
> browsing does things?

Before it *updates*, or at least tries to update from the network.


> We should definitely not have it random in tests, but OTOH we should also be
> able to measure its effect on performance somehow.
> 
> IMO, for all "normal" performance tests we should have it completely
> disabled, 

"It completely disabled" -> I asssume you mean updates, not the feature itself, there's no point in disabling the local lookups especially as we do them on all URLs.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #21)
> "It completely disabled" -> I asssume you mean updates, not the feature
> itself, there's no point in disabling the local lookups especially as we do
> them on all URLs.

Indeed. Disable the thing which happens after a random duration, which happens to be the update.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bbdba21e60ea&selectedJob=16268129 is a job that has a "good" IO nonmain_startup_fileio number, but from the safebrowsing nspr log, I see no difference with any of the other jobs which all had 'bad' IO numbers (after removing timestamps, the unique localhost port tp5 uses, and then sorting, only requests made via JS that include unique random numbers / timestamps are different remain, and the diff is quite small). Francois, can you check if I'm missing something? Or, based on https://hg.mozilla.org/try/rev/bbdba21e60ea, did I miss part of the safebrowsing things that could be logged? Is there anything that would cause IO to those files that doesn't have accompanying NSPR logs within the modules I enabled? Can you check if the file that got uploaded actually has the content process info as well, so that we're not missing the relevant bits? (NSPR logs in e10s can be a bit finnicky like that, I believe)
Flags: needinfo?(francois)
>Is there anything that would cause IO to those files that doesn't have accompanying NSPR logs within the modules I enabled?

Yes, you're missing the "UrlClassifierPrefixSet" module which is what writes out the .pset files.
Flags: needinfo?(francois)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> >Is there anything that would cause IO to those files that doesn't have accompanying NSPR logs within the modules I enabled?
> 
> Yes, you're missing the "UrlClassifierPrefixSet" module which is what writes
> out the .pset files.

Ah! I updated https://wiki.mozilla.org/Security/Safe_Browsing#QA.

Pushing again, then:

remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=4492f4094f79


Though given the fileset, the fact that I missed that module, and that there is no difference in any of the other modules, it seems plausible enough that this module is what is causing the extra writes. Is this the same thing that's writing on a random timeout?
Flags: needinfo?(gpascutto)
It's one of the things (the one with the odd 64K size inconsistency). The other (https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/HashStore.cpp?from=HashStore.cpp) is covered by the "UrlClassifierDbService" log, so it's not immediately clear to me that you would get a difference in one if the other doesn't change.
Flags: needinfo?(gpascutto)
Uh... so this is confusing.

After 60-odd retriggers... here's a "low" IO log:

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/4a51f28ee7b553046d486ad9f336e761b914539ad7e479238dc07d132c5914b218cc723339518247fcd24c86154dbbba3a59a2ff3c4776abcda648a537b3f890

(from https://treeherder.mozilla.org/#/jobs?repo=try&revision=4492f4094f79&selectedJob=16351185 ; tp5n_nonmain_startup_fileio: 556243)


And this is an arbitrary "high" IO log:

(from https://treeherder.mozilla.org/#/jobs?repo=try&revision=4492f4094f79&selectedJob=16351173 ; tp5n_nonmain_startup_fileio: 948878 )

The former (so low IO) has:

[16659750]: GetTables: test-forbid-simple;a:1
test-malware-simple;a:1
test-phish-simple;a:1
test-track-simple;a:1-2
test-trackwhite-simple;a:1
test-unwanted-simple;a:1
1404[919140]: nsUrlClassifierStreamUpdater init [this=11a9f150]
1404[919140]: FetchUpdate: http://127.0.0.1/safebrowsing-dummy/update
1404[919140]: (pre) Fetching update from http://127.0.0.1/safebrowsing-dummy/update
3632[16659750]: nsUrlClassifierDBServiceWorker::BeginUpdate [goog-phish-shavar,goog-malware-shavar,goog-unwanted-shavar,goog-badbinurl-shavar,goog-downloadwhite-digest256,mozstd-track-digest256,mozstd-trackwhite-digest256]
1404[919140]: (post) Fetching update from http://127.0.0.1/safebrowsing-dummy/update
1404[919140]: nsUrlClassifierStreamUpdater::OnStartRequest (status=NS_OK, uri=http://127.0.0.1/safebrowsing-dummy/update, this=11a9f150)
1404[919140]: nsUrlClassifierStreamUpdater::OnStartRequest failed (404)
1404[919140]: nsUrlClassifierStreamUpdater::Download error [this=11a9f150]
1404[919140]: OnStopRequest (status 80004004, beganStream false, this=11a9f150)
1404[919140]: OnStopRequest::Finishing update [this=11a9f150]
3632[16659750]: nsUrlClassifierDBServiceWorker::ApplyUpdate()
3632[16659750]: Backup before update.
3632[16659750]: Applying 0 table updates.
3632[16659750]: Active table: test-forbid-simple
3632[16659750]: Active table: test-malware-simple
3632[16659750]: Active table: test-phish-simple
3632[16659750]: Active table: test-track-simple
3632[16659750]: Active table: test-trackwhite-simple
3632[16659750]: Active table: test-unwanted-simple
3632[16659750]: Cleaning up backups.
3632[16659750]: Done applying updates.
3632[16659750]: update took 566ms
3632[16659750]: Notifying success: 1000
1404[919140]: nsUrlClassifierStreamUpdater::UpdateSuccess [this=11a9f150]
1404[919140]: nsUrlClassifierStreamUpdater::DownloadDone [this=11a9f150]
1404[919140]: nsUrlClassifierStreamUpdater::UpdateSuccess skipping callback [this=11a9f150]
1404[919140]: stream updater: calling into fetch next request
1404[919140]: No more requests, returning
1404[919140]: shutting down db service
1404[919140]: joining background thread
3632[16659750]: nsUrlClassifierDBServiceWorker::CancelUpdate
3632[16659750]: No UpdateObserver, nothing to cancel
3632[16659750]: urlclassifier db closed
1404[919140]: shutting down db service



The latter has a same-looking hunk:

GetTables: test-forbid-simple;a:1
test-malware-simple;a:1
test-phish-simple;a:1
test-track-simple;a:1-2
test-trackwhite-simple;a:1
test-unwanted-simple;a:1
3992[a19140]: nsUrlClassifierStreamUpdater init [this=11e9f470]
3992[a19140]: FetchUpdate: http://127.0.0.1/safebrowsing-dummy/update
3992[a19140]: (pre) Fetching update from http://127.0.0.1/safebrowsing-dummy/update
3992[a19140]: (post) Fetching update from http://127.0.0.1/safebrowsing-dummy/update
3092[16559750]: nsUrlClassifierDBServiceWorker::BeginUpdate [goog-phish-shavar,goog-malware-shavar,goog-unwanted-shavar,goog-badbinurl-shavar,goog-downloadwhite-digest256,mozstd-track-digest256,mozstd-trackwhite-digest256]
3992[a19140]: nsUrlClassifierStreamUpdater::OnStartRequest (status=NS_OK, uri=http://127.0.0.1/safebrowsing-dummy/update, this=11e9f470)
3992[a19140]: nsUrlClassifierStreamUpdater::OnStartRequest failed (404)
3992[a19140]: nsUrlClassifierStreamUpdater::Download error [this=11e9f470]
3992[a19140]: OnStopRequest (status 80004004, beganStream false, this=11e9f470)
3992[a19140]: OnStopRequest::Finishing update [this=11e9f470]
3092[16559750]: nsUrlClassifierDBServiceWorker::ApplyUpdate()
3092[16559750]: Backup before update.
3092[16559750]: Applying 0 table updates.
3092[16559750]: Active table: test-forbid-simple
3092[16559750]: Active table: test-malware-simple
3092[16559750]: Active table: test-phish-simple
3092[16559750]: Active table: test-track-simple
3092[16559750]: Active table: test-trackwhite-simple
3092[16559750]: Active table: test-unwanted-simple
3092[16559750]: Cleaning up backups.
3092[16559750]: Done applying updates.
3092[16559750]: update took 112ms
3092[16559750]: Notifying success: 1000
3992[a19140]: nsUrlClassifierStreamUpdater::UpdateSuccess [this=11e9f470]
3992[a19140]: nsUrlClassifierStreamUpdater::DownloadDone [this=11e9f470]
3992[a19140]: nsUrlClassifierStreamUpdater::UpdateSuccess skipping callback [this=11e9f470]
3992[a19140]: stream updater: calling into fetch next request
3992[a19140]: No more requests, returning
3992[a19140]: shutting down db service
3992[a19140]: joining background thread
3092[16559750]: nsUrlClassifierDBServiceWorker::CancelUpdate
3092[16559750]: No UpdateObserver, nothing to cancel
3092[16559750]: urlclassifier db closed
3992[a19140]: shutting down db service

(though note the difference in time!)


But other high IO ones like

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4492f4094f79&selectedJob=16351164
and
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4492f4094f79&selectedJob=16327514

just have:

 shutting down db service
1340[a19140]: joining background thread
1296[17759750]: nsUrlClassifierDBServiceWorker::CancelUpdate
1296[17759750]: No UpdateObserver, nothing to cancel
1296[17759750]: urlclassifier db closed
1340[a19140]: shutting down db service

with no results for a grep for "ApplyUpdate()"



Is it possible that some of this IO is happening so early it's not showing up in the log? Is there other IO not covered by this log? :-\
Flags: needinfo?(gpascutto)
There is a ton of logging missing. I tried with the same NSPR_LOG_MODULES and I get what I expect, for example you should see things like the following. The first and last example both cause IO.

-1749027072[7f37984f3a00]: Active table: test-phish-simple
-1749027072[7f37984f3a00]: 0 completions present
-1749027072[7f37984f3a00]: ReadCompletions
-1749027072[7f37984f3a00]: Loading PrefixSet
-1749027072[7f37984f3a00]: stored PrefixSet exists, loading from disk
-1749027072[7f37984f3a00]: Loading PrefixSet successful

and

-1749027072[7f37984f3a00]: nsUrlClassifierDBServiceWorker::BeginUpdate [test-malware-simple,test-phish-simple,test-unwanted-simple,test-track-simple,test-trackwhite-simple,test-forbid-simple]
-1749027072[7f37984f3a00]: nsUrlClassifierDBServiceWorker::BeginStream
-1749027072[7f37984f3a00]: Processing n:1000
-1749027072[7f37984f3a00]: Processing i:test-malware-simple
-1749027072[7f37984f3a00]: Processing ad:1
-1749027072[7f37984f3a00]: Processing a:1:32:40
-1749027072[7f37984f3a00]: Handling a 40-byte simple chunk
-1749027072[7f37984f3a00]: Processing 

and

1749027072[7f37984f3a00]: nsUrlClassifierDBServiceWorker::ApplyUpdate()
-1749027072[7f37984f3a00]: Backup before update.
-1749027072[7f37984f3a00]: Applying 6 table updates.
-1749027072[7f37984f3a00]: Classifier::ApplyTableUpdates(test-malware-simple)
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: InflateReadTArray: 8 in 0 out
-1749027072[7f37984f3a00]: Applied update to table test-malware-simple:
-1749027072[7f37984f3a00]:   1 add chunks
-1749027072[7f37984f3a00]:   0 add prefixes
-1749027072[7f37984f3a00]:   1 add completions
-1749027072[7f37984f3a00]:   0 sub chunks
-1749027072[7f37984f3a00]:   0 sub prefixes
-1749027072[7f37984f3a00]:   0 sub completions
-1749027072[7f37984f3a00]:   1 add expirations
-1749027072[7f37984f3a00]:   0 sub expirations
-1749027072[7f37984f3a00]: Remote update, updating freshness
-1749027072[7f37984f3a00]: Contains Completes, keeping cache.
-1749027072[7f37984f3a00]: Applied 1 update(s) to test-malware-simple.
-1749027072[7f37984f3a00]: All databases seem to have a consistent sort order.
-1749027072[7f37984f3a00]: Removed 0 dead SubPrefix entries.
-1749027072[7f37984f3a00]: All databases seem to have a consistent sort order.
-1749027072[7f37984f3a00]: Table test-malware-simple now has:
-1749027072[7f37984f3a00]:   1 add chunks
-1749027072[7f37984f3a00]:   0 add prefixes
-1749027072[7f37984f3a00]:   1 add completions
-1749027072[7f37984f3a00]:   0 sub chunks
-1749027072[7f37984f3a00]:   0 sub prefixes
-1749027072[7f37984f3a00]:   0 sub completions
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: DeflateWriteTArray: 0 in 8 out
-1749027072[7f37984f3a00]: SB tree done, size = 80 bytes
-1749027072[7f37984f3a00]: Completion: E542995747A01561DDF2450C2B604438B5A4AC821A619A20A1CAC34403B0B971
-1749027072[7f37984f3a00]: Writing 1 completions
-1749027072[7f37984f3a00]: Saving PrefixSet successful
-1749027072[7f37984f3a00]: Successfully updated test-malware-simple
Flags: needinfo?(gpascutto)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c982b3e3b38b - without nsChannelService to see if that gets us more of the rest of the log.

 https://treeherder.mozilla.org/#/jobs?repo=try&revision=a8bff7e8dba7 - without addMozEntries adding test entries, because they aren't in any of the logs, but they should be getting added, so maybe they're responsible for the extra IO? Kind of a shot in the dark, but hey.
(In reply to :Gijs Kruitbosch from comment #29)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=c982b3e3b38b -
> without nsChannelService to see if that gets us more of the rest of the log.
> 
>  https://treeherder.mozilla.org/#/jobs?repo=try&revision=a8bff7e8dba7 -
> without addMozEntries adding test entries, because they aren't in any of the
> logs, but they should be getting added, so maybe they're responsible for the
> extra IO? Kind of a shot in the dark, but hey.

So it definitely looks like adding the test entries is what is causing the extra IO here. What I don't understand is why that wouldn't happen in some circumstances and/or why my patch in bug 1172165 would have caused those circumstances (whatever they are) to suddenly happen "almost all, but not quite all" of the time.

The only explanation I have is that we call SafeBrowsing.init (which adds these entries) on a 2 second timer in gBrowser's init.

What I don't get is how we wouldn't run that 2 second timer while the browser is up. That test takes a lot longer to run than just 2 seconds. It would kind of make sense if the IO simply happened later and therefore missed the 'startup' cutoff, but then it doesn't make sense that the logs are so different (compare http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/5a06aac22cd71f0b59295dbb9b9d7d640a0f357b1618f4cfda3696b33027379911ad4e8f240bba768c82f59d14e9f283a9c808ab7586d74f6c6bd0fc7037fabf (bad) and http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/7b3b5f018c74cb61cd5493d17286c91c03ee3da399d01da217ed5fdd77d5018fd1361fafb8ad706df3d71bf5172fda53359c05726ef5ed4a111c1ac67bd7a577 (good)).

I'm also confused as to why the 'good' logs have so little in them in the first place. :-\

Joel/gcp, do either of you have more ideas about what could be going on here?
Flags: needinfo?(jmaher)
Flags: needinfo?(gpascutto)
Maybe we can experiment with changing the 2 second timer to validate that there is truth in the timer theory.  Is there a way to change this via preferences in the test?  I would be curious whether this same thing affects ts_paint and session restore.
Flags: needinfo?(jmaher)
(In reply to :Gijs Kruitbosch from comment #30)
> I'm also confused as to why the 'good' logs have so little in them in the
> first place. :-\
> 

They don't seem to run the init function (the one on the 2 second delay).
Flags: needinfo?(gpascutto)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #32)
> (In reply to :Gijs Kruitbosch from comment #30)
> > I'm also confused as to why the 'good' logs have so little in them in the
> > first place. :-\
> > 
> 
> They don't seem to run the init function (the one on the 2 second delay).

How do you determine this? Because I commented out the addMozEntries stuff, but I assumed that there would be other code that would eventually cause some safebrowsing code to run. Was I wrong, ie is there nothing else that would trigger logs besides adding the moz entries? (see the patch: https://hg.mozilla.org/try/rev/a8bff7e8dba7#l4.12 )

(In reply to Joel Maher (:jmaher) from comment #31)
> maybe we can experiment with changing the 2 second timer to validate there
> is truth in timer theory.  Is there a way to change this via preferences in
> the test?

No, the 2 second timeout is hardcoded.

>  I would be curious of this same thing affects ts_paint and
> session restore.

It looks from the comments like the whole reason this is on a few-second timeout is so it won't affect startup performance.

It still does not make sense to me how we would ever not run a 2-second timeout during a several-minute Firefox run. If this is really it, this IO should always have been happening during xperf, both before and after my patch. Yes, timeouts can be inaccurate, but the main thread shouldn't be blocked all the time during the xperf run (AFAICT we load pages sequentially?) and we shouldn't be inaccurate by several minutes...
Flags: needinfo?(gpascutto)
(In reply to :Gijs Kruitbosch from comment #33)
> (In reply to Gian-Carlo Pascutto [:gcp] from comment #32)
> > (In reply to :Gijs Kruitbosch from comment #30)
> > > I'm also confused as to why the 'good' logs have so little in them in the
> > > first place. :-\
> > > 
> > 
> > They don't seem to run the init function (the one on the 2 second delay).
> 
> How do you determine this? Because I commented out the addMozEntries stuff,

OK, because of that it simply didn't run the initial update, which is what should be causing the IO.

> but I assumed that there would be other code that would eventually cause
> some safebrowsing code to run. Was I wrong, ie is there nothing else that
> would trigger logs besides adding the moz entries? (see the patch:
> https://hg.mozilla.org/try/rev/a8bff7e8dba7#l4.12 )

There is code that triggers logs, but it would be an empty update that doesn't actually cause disk IO.
Flags: needinfo?(gpascutto)
OK, I *think* I've finally figured out what's going on here. Let's look at https://treeherder.mozilla.org/#/jobs?repo=try&revision=c982b3e3b38b , where I hadn't commented out the addMozEntries thing, so everything is running "as normal" (but the logging is better than previous pushes). I retriggered 30 times and now there are a few (3 or 4 or something) samples with low IO, and a lot with 'high' IO.

Here's a log from a 'low' IO sample:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c982b3e3b38b&selectedJob=16514478

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/56b8841d5ff6b7aac6c318ff65568c3ca542551b8c50958d657fd5d3e23cf67f3f7a89d58f5ac6ffe4eb1bd49c48fc65fb28c2dfc91245f60ca88cb7edebcdd3

And here's a 'high' IO sample:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c982b3e3b38b&selectedJob=16392401

http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/5a06aac22cd71f0b59295dbb9b9d7d640a0f357b1618f4cfda3696b33027379911ad4e8f240bba768c82f59d14e9f283a9c808ab7586d74f6c6bd0fc7037fabf

From about the BeginUpdate onwards they're pretty similar (though not identical), but all the 'high' IO logs start out with lots of:

4080[15dbf600]: 1 completions present
4080[15dbf600]: ReadCompletions
4080[15dbf600]: Read 1 completions
4080[15dbf600]: Loading PrefixSet
4080[15dbf600]: stored PrefixSet exists, loading from disk
4080[15dbf600]: stored PrefixSet is empty!

For a number of tables.

So here's my hypothesis:

1) we 'normally' write this stuff out on the initial, non-measured run
2) if the 2-second timeout fires on that initial run, and finishes writing all this stuff, we hit 'low' IO on the measured run, because we don't need to write out the identical update to the test tables. We always have to (try to) read the tables, no matter whether the "prep" run wrote them or not.
3) if the 2-second timeout *doesn't* fire on that initial run, and so we don't write stuff at that time, we hit 'high' IO when we do a measured run, because now we need to write out all those updates to the test tables.

I don't know exactly how talos does its "prep" run for this test, and what it waits for, so there are two possible explanations here:

a) with the patch, somehow we hit whatever condition talos waits for "sooner", and so we more often don't get around to writing out the safebrowsing tables
b) with the patch, somehow the browser is "busier"/slower and so the timeout doesn't hit "on time" (assuming there's a timed cut-off for the run) or whatever, and so we don't get around to writing out the safebrowsing tables.

For the purposes of this test, both of these can be fixed by making talos wait "longer" before it shuts down the prep run, and/or somehow manually ensuring safebrowsing's stuff runs on it and not on the measured run. That might (might!) also fix the pre-existing bimodality on the non-e10s version of this talos test.

For safebrowsing, looking into the write overhead here (see comment 15 et al) might be interesting. Or not. Not sure what the tradeoffs are and how this impacts "real" safebrowsing updates/writes compared to the test tables (which only get written once in the lifetime of a profile, AIUI?).

Depending on how talos shuts down its prep run, it might be possible to decide which of (a) or (b) is happening and/or if there's something we can do about that (though, if there was no change in ts_paint or tpaint from the landing of bug 1172165 then it seems unlikely there is a serious perf issue that needs addressing...)

Joel and Gian-Carlo, does this make sense at all?
Flags: needinfo?(jmaher)
Flags: needinfo?(gpascutto)
The warmup run is as simple as loading this page:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/getInfo.html

We could easily adjust that page to wait a few seconds or for certain events before completing.  If it is that simple, I vote for fixing that page!
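
Something as simple as this in getInfo.html might do it (just a sketch; the 10 seconds is a guess at "long enough for the 2s SafeBrowsing timer plus its off-main-thread writes"):

// Hold the warmup run open long enough for post-startup work (like the
// SafeBrowsing init that fires ~2s after gBrowser init) to finish its IO.
window.addEventListener("load", function () {
  setTimeout(function () {
    goQuitApplication(); // existing helper from talos' quit.js
  }, 10 * 1000);
}, false);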
Flags: needinfo?(jmaher)
From the logs, it's the other way around: 

The first log, the low IO one, finds no existing tables, doesn't read anything in (!), and writes out fresh test tables.

The second log, the high IO one, finds the existing tables, reads them in (that's where the extra IO must come from), and writes out test tables (which is now a transactional write with moving and copying things around, even more IO).

The test tables are always written just after startup. I know that isn't ideal, but they're supposed to be very minimal (a few writes of <300 bytes each), they're off-main-thread, they happen after startup, and this avoids difficult and error-prone migration scenarios.

So there's 2 issues:

a) The initial update sometimes and sometimes not being in the measurement.

b) The inconsistency between the actual file size and the IO size being measured in comment 8, specifically:
\test-forbid-simple.cache	44     (ok)
\test-forbid-simple.pset	65536  (should be 16 bytes!)
\test-forbid-simple.sbstore	232    (ok)
Flags: needinfo?(gpascutto)
Based on comment #37 and discussion with gcp on IRC, I've pushed this to try:

remote:   https://hg.mozilla.org/try/pushloghtml?changeset=b567c6bca596

to verify that reducing the buffer size for the stream reduces the IO numbers here. Considering the size of real (not test) safebrowsing tables, we likely can't ship such a change in buffer size, but it would narrow down the cause of the overly large IO for these files, which are supposed to be tiny. We can investigate fixing that in a separate bug against XPCOM.

I'd still be curious if there's a way we can avoid re-writing these files altogether.


In the short term, for the talos test, we would indeed need to run the prep run for longer, which will pin the IO numbers at the 'high' end. That isn't ideal, but pinning them at the 'low' end would be unrealistic: it would require finding a way to avoid writing the files in the prep run, which seems complex and unnecessary.
(In reply to :Gijs Kruitbosch from comment #38)
> Based on comment #37 and discussion with gcp on IRC, I've pushed this to try:
> 
> remote:   https://hg.mozilla.org/try/pushloghtml?changeset=b567c6bca596
> 
> to verify that if we reduce the buffer size for the stream, that reduces the
> IO numbers here.

It does, looking at both the e10s and non-e10s runs: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b567c6bca596&selectedJob=16515687

gcp, can you file a followup bug to investigate how to fix that in either safebrowsing code or XPCOM?
Flags: needinfo?(gpascutto)
Depends on: 1246969
Filed bug 1246969 to track the weirdness with reported IO.
Flags: needinfo?(gpascutto)
(In reply to :Gijs Kruitbosch from comment #39)
> An attempt at a fix: remote:  
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee4b35e6e70d

This still has some "low" numbers, so forcing the initialization of safebrowsing here doesn't seem enough, presumably because we quit so soon afterwards that the OMT IO doesn't actually complete, or something?

:gcp, is there a better way to ensure that that happens? Callback or observer notification of some kind that we could use to verify that the data has been written?
Flags: needinfo?(gpascutto)
AFAIK there's no notification that the/an update completed.

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/content/listmanager.js#432

We do update some prefs, so if you can listen to a pref change that might work?
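
Something along these lines, maybe (a sketch; which exact pref branch the listmanager writes after an update is an assumption you'd want to verify against listmanager.js):

Components.utils.import("resource://gre/modules/Services.jsm");

// Resolve once any pref under the assumed branch changes, which the
// listmanager does when it records update times after an update.
function waitForUpdatePrefChange(callback) {
  var kBranch = "browser.safebrowsing.provider."; // assumed branch
  var observer = {
    observe: function (subject, topic, data) {
      Services.prefs.removeObserver(kBranch, observer);
      callback(data); // data is the name of the pref that changed
    },
  };
  Services.prefs.addObserver(kBranch, observer, false);
}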
Flags: needinfo?(gpascutto)
I have a patch that modifies nsUrlClassifierPrefixSet to be smart about the size of the buffers it uses to read/write its files. That will reduce the (apparent) 64k IOs to 16-byte IOs, without needing a fix for bug 1246969. According to Gijs on IRC, that should put the relevant IO below the noise in the existing test.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #44)
> I have a patch that modifies nsUrlClassifierPrefixSet to be smart about the
> size of the buffers it uses to read/write its files. That will reduce the
> (apparent) 64k IOs to 16 byte IOs, without needing a fix for bug 1246969.
> According to Gijs on IRC, that should put the relevant IO below the noise in
> the existing test.

So what is the goal of such a change as far as users are concerned?

Quick reminder: we only care about the user experience and factors which the user would care about. If there's an apparent regression in our tests but we're quite certain the user would not care, then we should just set our new benchmark to the new level of values the tests produce.

It's also OK to make changes specifically for the tests, preferably in a way which doesn't "mask off" changes in numbers, but rather just makes sure they're still stable and valid.
>So what is the goal of such change as far as users are concerned?

Reduced memory usage. It's a real optimization; I would've just fixed bug 1246969 otherwise.
Don't allocate IO buffers that are larger than the actual file size. Aside from being a small memory optimization, this avoids the issue with the IO being overstated in our performance tests.
Attachment #8718386 - Flags: review?(nfroyd)
Attachment #8718386 - Flags: review?(nfroyd) → review+
(In reply to :Gijs Kruitbosch from comment #51)
> remote:  
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=49a5d9a1718c

Looks like this is broken in e10s because all of it tries to do stuff in a sandboxed content process. Sigh.
Attachment #8718755 - Flags: review?(jmaher)
Comment on attachment 8718755 [details]
MozReview Request: Bug 1244259 - wait for SafeBrowsing to have added MozEntries in initial runs of talos perf testing, r?jmaher,gcp

https://reviewboard.mozilla.org/r/34739/#review31363

As this is failing on e10s, let's figure out what to do.  I really like how we can wait for safe browsing.

::: testing/talos/talos/tests/quit.js:71
(Diff revision 1)
> -function goQuitApplication()
> +function goQuitApplication(waitForSafeBrowsing, callback)

I don't see callback ever getting called.
https://reviewboard.mozilla.org/r/34739/#review31363

> I don't see callback ever getting called.

It is in the safebrowsing case; I forgot the other one. But really, it's unnecessary anyway: because goQuitApplication closes all the windows, having the window itself call window.close() is superfluous. I have a new version of the patch ready that just does away with it entirely.
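
The new version is roughly shaped like this (a sketch, not the exact patch; addMozEntriesFinishedPromise is the hook the patch adds to SafeBrowsing.jsm, but treat the details here as approximate):

// Optionally block quitting on SafeBrowsing having finished writing its
// test (moz) entries, so the prep run always pays that IO cost.
function goQuitApplication(waitForSafeBrowsing) {
  function doQuit() {
    // ... existing logic: close all windows, then quit the app ...
  }
  if (waitForSafeBrowsing) {
    var SafeBrowsing = Components.utils
      .import("resource://gre/modules/SafeBrowsing.jsm", {}).SafeBrowsing;
    SafeBrowsing.init(); // idempotent; kicks off addMozEntries if needed
    // Quit whether the entries got written or adding them failed.
    SafeBrowsing.addMozEntriesFinishedPromise.then(doQuit, doQuit);
  } else {
    doQuit();
  }
}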
Attachment #8718755 - Flags: review?(jmaher)
Comment on attachment 8718755 [details]
MozReview Request: Bug 1244259 - wait for SafeBrowsing to have added MozEntries in initial runs of talos perf testing, r?jmaher,gcp

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/34739/diff/1-2/
Not a big fan of the duplication, but enough yaks have been shaved for this bug, so I don't really want to address it here.

remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a24ff50d54a


(This worked locally with ./mach talos-test --activeTests tpaint --e10s; I can't run xperf on my Mac, apparently.)
Attachment #8718755 - Flags: review?(jmaher) → review+
Comment on attachment 8718755 [details]
MozReview Request: Bug 1244259 - wait for SafeBrowsing to have added MozEntries in initial runs of talos perf testing, r?jmaher,gcp

https://reviewboard.mozilla.org/r/34739/#review31369

This seems to be what we are looking for.  Of course testing on try might find some corner case in a config or a platform.
Comment on attachment 8718755 [details]
MozReview Request: Bug 1244259 - wait for SafeBrowsing to have added MozEntries in initial runs of talos perf testing, r?jmaher,gcp

https://reviewboard.mozilla.org/r/34739/#review31405
Attachment #8718755 - Flags: review?(gpascutto) → review+
(In reply to :Gijs Kruitbosch from comment #56)
> Not a big fan of the duplication, but enough yaks have been shaved for this
> bug, so I don't really want to address it here.
> 
> remote:  
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a24ff50d54a

This looks green, and after retriggers it seems to be effective in forcing both e10s and non-e10s to the 'high' version of these numbers. The fix from gcp that landed on inbound will take care of bringing that number back down.
Since getInfo.html also runs with other tests, I wonder if it would help reduce bimodal results for other tests too.

Can anyone assess how long this could have caused inconsistent/bimodal behavior, assuming it also affects other tests? Days? Weeks? Months? Or did it start with this regression bug?
(In reply to Avi Halachmi (:avih) from comment #61)
> Since getInfo.html runs also with other tests, I wonder if it would help
> with reducing bimodal results with other tests too.

"Maybe".

Note that I fixed the quit.js used in getInfo.html, but it seems the pageloader has another copy of it? That isn't quite the same for reasons I don't understand? Might be worth a followup bug if someone who knows more about talos thinks so.

> Can anyone assess for how long this could have caused inconsistent//bimodal
> behavior assuming it also affects other tests? days? weeks? months? Or did
> it start with this regression bug?

It did not start with the regressing bug; the regressing bug tipped xperf e10s over whatever finicky edge is at play here. I wouldn't be surprised if xperf e10s had intermittently been bimodal before now, and our 12-run averaging just meant that very low-frequency intermittent peaks in the data weren't being noticed.

I don't know how long this would have been going on, though IIRC gcp suggested earlier that it might have started with bug 1211090 and/or bug 1175562. Perhaps we can pin it down to either of those with data.
Looks like we're done here. Thanks everyone! :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Fixed in 47 since m-c was 47 back in February.