Closed Bug 1436297 Opened 6 years ago Closed 6 years ago

100% I/O usage with Firefox 52 ESR (safebrowsing)

Categories

(Toolkit :: Safe Browsing, defect, P1)

52 Branch
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr52 60+ fixed
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 --- unaffected

People

(Reporter: luca76, Assigned: francois)

References

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0
Build ID: 20180130211649

Steps to reproduce:

Nothing in particular, this issue appears at random every days on multiple computers.

Using Firefox 52 ESR


Actual results:

100% I/O disk activity


Expected results:

Further investigation with found that the SafeBrowsing component writes too much on the HDD file "mozilla in userprofile \ safebrowsing \ goog-phish-shavar-1.sbstore". Performance monitor shows Firefox is writing this file continously at 7 MB/sec for minutes... why?
Component: Untriaged → Safe Browsing
Product: Firefox → Toolkit
Priority: -- → P5
A lot of people have been complaining about performances issue in esr lately.
I wonder if this is could not be the source of the problem. We had the same issue with android recently (bug 1420641)

Ryan, we might want to do something about that for the next esr release in case there is an easy fix.
Flags: needinfo?(ryanvm)
See Also: → 1420641
ESR52 is on Safebrowsing v2 still I believe. Francois, any thoughts on what we might be able to do to help mitigate this?
Flags: needinfo?(ryanvm) → needinfo?(francois)
I'm going to try to see if I can reproduce this.
Assignee: nobody → francois
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(francois)
Priority: P5 → P1
@sylvestre Do you have any number on how many users are impacted? and how bad the performance are impacted?
I don't know the number but a lot of people have been complaining on the enterprise ML over the last one or two months. I am not sure we have telemetry on that (even if we did, as many enterprise users have disabled telemetry).
It is bad enough so that we want to act on it... Sysadmin have been putting in place workaround, etc.
Francois, any update on this bug?
I left my ESR52 profile running overnight and ended up with the following phishing list:

-rw-r--r-- 1 francois francois 2.9M Mar  6 10:17 goog-phish-shavar.pset
-rw-r--r-- 1 francois francois 4.9M Mar  6 10:17 goog-phish-shavar.sbstore

With a fully up-to-date database, I was able to get this amount of startup I/O (iotop -ao):

  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                                                                                                                                                                        
18945 be/4 francois      0.00 B     11.52 M  0.00 %  0.04 % ./firefox --no-remote [URL Classifier]

whereas on Nightly 60, I don't even get an entry for the URL Classifier thread. It seems unlikely that this 12MB of writes is responsible for the problems described in comment 0.

I also took a profile with the Gecko Profiler and these options:

MOZ_PROFILER_STARTUP=1
MOZ_PROFILER_STARTUP_FILTERS="GeckoMain,URL Classifier,Classifier Update"
MOZ_PROFILER_STARTUP_FEATURES="mainthreadio"

The results are here: https://perfht.ml/2FvnmKI

and I couldn't see anything odd in there I/O-wise. I wonder if that's because the Gecko Profiler in 52 doesn't cover the URL Classifier thread.

Sylvestre, do you think that Soft Vision would be able to help in creating a test case to reproduce this problem?
Flags: needinfo?(sledru)
Andrei can probably help finding someone.


Luca, the reporter might be able to help also.
Flags: needinfo?(sledru)
Flags: needinfo?(luca76)
Flags: needinfo?(andrei.vaida)
(In reply to Sylvestre Ledru [:sylvestre] from comment #8)
> Andrei can probably help finding someone.

This may be harder to crack. We'll try to reproduce it on our Ubuntu and Fedora test machines, but because this involves Safe Browsing v2, we'll need to leave the machines running overnight. We may only have test results sometime next week.
How can I test this issue on Windows? (other then the profiling extension)
I went through all of the commits to the URL Classifier between 52 and now, filtering out anything that wasn't done by members of the Safe Browsing team:

    $ hg log --user tnguyen --user hchang --user francois --user dlee --user gcp -r "ancestor(tip,FIREFOX_RELEASE_52_BASE)"..tip toolkit/components/url-classifier --template '{node|short} | {date|isodatesec} | {author|user}: {desc|strip|firstline}\n' | wc -l
    174

Of those 174 commits, I shortlisted these 5 commits:

> 366897   3d717c8eaa76   2017-01-19 17:48 +0800   hchang
>   Bug 1332218 - Add boundary checks for array access in nsUrlClassifierPrefixSet::GetPrefixesNative. r=francois.

Might be useful since it adds a check for corrupt databases.

> 394492   ffdb24073101   2017-04-18 17:00 +0800   tnguyen
>   Bug 1353853 - Cache preferences when doing channel classify r=francois
> 410107   f95be8a71102   2017-07-04 16:36 +0800   tnguyen
>   Bug 1377559 - Should store value of preference browser.safebrowsing.debug to reuse r=francois

Might be useful, but it doesn't appear to be related to disk I/O.

> 396372   54ab48204b9e   2017-04-24 17:51 +0800   tnguyen
>   Bug 1336904 - Add timeouts for Safe Browsing updates r=francois

Might be useful since it puts a limit on how long updates can take.

This will also need the updated timeout values from bug 1381745.

> 402515   8f2b4efc5f0d   2017-05-19 13:17 +0800   tnguyen
>   Bug 1361699 - Add buffer when writing hashstore to file r=gcp,mcmanus

Looks highly relevant since it has to do with lots of small writes to a network-attached disk.

That last one sounds especially worthwhile since it's relatively small: https://hg.mozilla.org/mozilla-central/rev/8f2b4efc5f0d

Ryan, do you think we should uplift now and hope that it resolves the issues?
Flags: needinfo?(ryanvm)
The new update threading scheme would likely also help here since it almost eliminates all possible deadlocking during database updates, but it's a lot of work and would probably be far too risky for ESR.

At a minimum, we'd need all of these commits:

381523   2c1fc455e81e   2017-02-21 14:18 +0800   hchang
  Bug 1338970 - Introduce 2-step DB update scheme. (background update then swap in) r=francois,gcp
382732   313bfccbbd9f   2017-02-25 00:22 +0800   hchang
  Bug 1342397 - Properly reset stuff when failing to create LookupCache for update. r=gcp
384381   a1dfb38c49b0   2017-02-22 17:25 +0800   hchang
  Bug 1339760 - Split update process to background/foreground and run background on update thread **synchronously**. r=francois,gcp
385776   a26b45336e56   2017-03-13 21:16 +0800   hchang
  Bug 1346757 - Change the downloadError callback timing. r=francois
386089   ffb8aa89b8c1   2017-03-10 00:47 +0800   hchang
  Bug 1345922 - Avoid concurrent update and take the failed beginUpdate into account. r=francois
391336   baaee2deb3ac   2017-04-06 07:07 +0800   hchang
  Bug 1339050 - Asynchronously apply safebrowsing DB update. r=francois,gcp
Luca, if we created a test build with these patches included, would you be easily able to test it to see if it helps?
Flags: needinfo?(ryanvm)
I managed to test this issue on two machine Ubuntu 14.04 x64 and Ubuntu 16.04 x64, same way as Francois did, I left my ESR52 profile running overnight and ended up with the following phishing list:
                        goog-phish-shavar.pset 3,0 MB
                        goog-phish-shavar.sbstore 5,2 MB

There were no IO> 100% entries in terminal for the Firefox command, only:

TID  PRIO  USER       DISK READ   DISK WRITE  SWAPIN   IO>    COMMAND   
14004 be/4 timeazso      0.00 B    416.00 K  0.00 %  0.00 % ./firefox -no-remote -p [localStorage DB]

Gecko Profiler result: https://perfht.ml/2DotQpz

I contacted the reporter, he told me that the problem is on Windows system, so I will test on windows too.
Flags: needinfo?(andrei.vaida)
Francois, want to try spinning a Try build of those uplifts for now to see how things look? Note that you'll want to include --buildbot in your Try syntax to make sure you get legacy platforms, i.e. |try: -b do -p all -u all -t none --buildbot|.
Flags: needinfo?(francois)
On Ubuntu 16.04 x64 sometimes the CPU usages was 100%, see attachment Bug 1436297.png.
On Windows 8.1 x64 and Windows 10 x64 I can not reproduced the issue.
Attached image Bug1436297.png
Here is the attachment, I am not sure if the 100% CPU is related to this issue.
Here's a Try build with 8f2b4efc5f0d grafted on top of the esr52 branch:

remote: View the pushlog for these changes here:
remote:   https://hg.mozilla.org/try/pushloghtml?changeset=b9dab7ede75888a47882ef313a78d85b8515118d
remote: 
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9dab7ede75888a47882ef313a78d85b8515118d

The patch applies cleanly as-is.
Flags: needinfo?(francois)
See Also: → 1432097
I received positive feedback via email on the Try builds. I think we can go ahead with the approval request on bug 1361699 :)
Flags: needinfo?(francois)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #20)
> I received positive feedback via email on the Try builds. I think we can go
> ahead with the approval request on bug 1361699 :)

https://bugzilla.mozilla.org/show_bug.cgi?id=1361699#c24
Flags: needinfo?(francois)
Calling this fixed now that bug 1361699 has been uplifted. We can reopen the bug if safebrowsing still shows up as a significant source of slowdowns in next month's 52.8.0 release.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: