Closed Bug 1381745 Opened 7 years ago Closed 7 years ago

Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers

Categories

(Toolkit :: Safe Browsing, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: tnguyen)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:timing], #sbv4-m8)

Attachments

(1 file, 1 obsolete file)

Hm, wasn't that download successful?

[task 2017-07-18T06:44:12.059816Z] 06:44:12     INFO -  listmanager: 06:44:12 GMT+0000 (UTC): update success for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto from https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?$ct=application/x-protobuf&key=[trimmed-google-api-key]&$httpMethod=POST: 1818
[task 2017-07-18T06:44:12.060661Z] 06:44:12     INFO -  listmanager: 06:44:12 GMT+0000 (UTC): Waiting 30.3min
[task 2017-07-18T06:44:12.061399Z] 06:44:12     INFO -  listmanager: 06:44:12 GMT+0000 (UTC): Setting last update of google4 to 1500360252055
[task 2017-07-18T06:44:12.062278Z] 06:44:12     INFO -  listmanager: 06:44:12 GMT+0000 (UTC): Setting next update of google4 to 1500362070055 (30.3min from now)

Maybe it would be helpful for debugging to print the list of downloaded files in case of a test failure.
Yes, you are right. We downloaded data but somehow did not save data to disk
I don't know what's going wrong but seems we could not update at the moment. I tried to remove google4 folder (or safebrowsing folder), going to about:url-classifier, click update google4. I could see "update success" but google4 is still empty.
I will look closer tomorrow.
Francois, do we have any change in server side?
Flags: needinfo?(tnguyen) → needinfo?(francois)
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #3)
> Yes, you are right. We downloaded data but somehow did not save data to disk
> I don't know what's going wrong but seems we could not update at the moment.
> I tried to remove google4 folder (or safebrowsing folder), going to
> about:url-classifier, click update google4. I could see "update success" but
> google4 is still empty.
> I will look closer tomorrow.
> Francois, do we have any change in server side?

And the log is similar to the failure log.
That is : update success for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto from 
But the folder is still empty
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #1)
> As it look like the fix on bug 1374268 didn't hold long. I can see tons of
> failures of the same type on autoland now. It all started about 11 hours
> ago. Maybe we hit a limit again and delivery gets delayed?
> 
It is possible, i tried to update manually in about:url-classifier and failed too
this has 157 failures today:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381745&startday=2017-07-10&endday=2017-07-18&tree=all

we need to fix or disable this by end of day given the confusion this will cause on the trees.
Whiteboard: [stockwell needswork]
Assignee: nobody → francois
Status: NEW → ASSIGNED
Flags: needinfo?(francois)
Priority: -- → P1
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #3)
> Yes, you are right. We downloaded data but somehow did not save data to disk
> I don't know what's going wrong but seems we could not update at the moment.
> I tried to remove google4 folder (or safebrowsing folder), going to
> about:url-classifier, click update google4. I could see "update success" but
> google4 is still empty.
> I will look closer tomorrow.

I have done the same in a fresh profile (today's Nightly on Linux) and it worked fine.
Assuming that updates aren't actually broken (as per comment 8), I traced through the code and it does look like nextupdatetime will only be updated after we have successfully processed the update.

In case I missed something, I tried adding a sleep(30) after nextupdatetime is set and before we look for the files on disk in case we are dealing with a slow disk that takes a bit longer to write the files to disk: https://hg.mozilla.org/try/rev/3fb190a131b7e1cb019bc8399ee264a314afadaa

Running 20 Try attempts with that: https://treeherder.mozilla.org/#/jobs?repo=try&revision=268caf9814073a8a42dbec352f713e54ddb15740
(In reply to François Marier [:francois] from comment #9)
> Running 20 Try attempts with that:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=268caf9814073a8a42dbec352f713e54ddb15740

All of the tests passed, I'm not sure if that means anything, but it might be worth giving that a try before we disable the test.
I have just tried to update and it works fine now.
I suspect there's an unexpected problem with update data yesterday.
What I did is running
./mach firefox-ui-functional testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py --disable-e10s --gecko-log ~/Logs/gecko.log

in local machine and it failed 100% rate.
Then tried to update google4 manually in about:url-classifier and failed too.
It works fine both this morning although I did not change anything
I tried to retrigger the test from autoland multiple times (randomly pick someone's push)
See 
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=115441130
and seems it is working this morning although we did not push anything to server from yesterday
Edit link
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=da7429cfbfa50fc74cc8c0f8751005e15e69756b&selectedJob=115441257
Running ~40 times in various platforms but only 1 failure but it's not related to this bug (we could not download google list in that case)
I suspect update server was down yesterday, Francois, do you have any idea that I missed?
Flags: needinfo?(francois)
Comment on attachment 8887746 [details]
Bug 1381745 - Wait for slow disks to write Safe Browsing lists to disk.

https://reviewboard.mozilla.org/r/158664/#review163924

::: testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py:107
(Diff revision 1)
>              Wait(self.marionette, timeout=90).until(
>                  check_downloaded, message='Not all safebrowsing files have been downloaded')
>          finally:
> +            # Wait for slow disks to finish writing the tables to disk
> +            time.sleep(30)
> +

The "update sucess" is notified (nextupdate pref change) after saving data to disk, I don't think that makes visible change
http://searchfox.org/mozilla-central/rev/a83a4b68974aecaaacdf25145420e0fe97b7aa22/toolkit/components/url-classifier/Classifier.cpp#762,871
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #16)
> I suspect update server was down yesterday, Francois, do you have any idea
> that I missed?

I've asked Google if they are aware of an outage on the V4 servers yesterday.
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #18)
> (In reply to Thomas Nguyen[:tnguyen] ni plz from comment #16)
> > I suspect update server was down yesterday, Francois, do you have any idea
> > that I missed?
> 
> I've asked Google if they are aware of an outage on the V4 servers yesterday.

You were right Thomas. They have confirmed that they had an outage yesterday and that they pushed out a fix today.

So let's wait for the OrangeFactor Robot to confirm that all is good and close this bug without merging my hack.
Everything looks fine now on autoland for the last 3.5 hours:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler

Francois, I wonder how we want to deal with those failures in the future. It's great to catch those problems with an external service but on the other side puts our test under pressure by sheriffs regarding the need to star failures.
Summary: Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] → Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Comment on attachment 8887746 [details]
Bug 1381745 - Wait for slow disks to write Safe Browsing lists to disk.

https://reviewboard.mozilla.org/r/158664/#review163962

It's not needed! \o/
Attachment #8887746 - Flags: review?(hskupin)
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #20)
> Everything looks fine now on autoland for the last 3.5 hours:
> https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-
> searchStr=Firefox%20UI&bugfiler

\o/

> Francois, I wonder how we want to deal with those failures in the future.
> It's great to catch those problems with an external service but on the other
> side puts our test under pressure by sheriffs regarding the need to star
> failures.

Yeah that's a good point. Looking at the reliability of the V2 servers, I suspect it won't be a problem in practice.

The V4 service is relatively new and Chrome isn't using it on release yet either. Once that's the case, I expect it will be as reliable as the old servers.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Great. So lets leave it as is and let us spot bustage on Google's side too. Marking as WFM given that it's nothing we fixed on our end.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → WORKSFORME
It looks like that we still have this problem on and off:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&selectedJob=115571748
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #24)
> It looks like that we still have this problem on and off:
> 
> https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-
> searchStr=Firefox%20UI&bugfiler&selectedJob=115571748

It seems the download did not finish in the failure. It was still downloading or server timeout
I could see in telemetry 
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-13&keys=google4!mozilla!__none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIMEOUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-13&table=0&trim=1&use_submission_date=0

downloading timeout (>60 secs) could happen
We could try bumping the timeouts to:

    pref("urlclassifier.update.response_timeout_ms", 30000);
    pref("urlclassifier.update.timeout_ms", 90000);

and see how that impacts the test on Try.

Looking at Telemetry, we still have about 1% of requests that timeout on the server response:

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-18&keys=google4!mozilla!__none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIMEOUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-16&table=1&trim=1&use_submission_date=0
(In reply to François Marier [:francois] from comment #26)
> We could try bumping the timeouts to:
> 
>     pref("urlclassifier.update.response_timeout_ms", 30000);
>     pref("urlclassifier.update.timeout_ms", 90000);

Thomas, do you want to prepare a patch for this and get it reviewed by Henry? (I'm on PTO Thursday/Friday)
Flags: needinfo?(tnguyen)
(In reply to François Marier [:francois] from comment #26)
> We could try bumping the timeouts to:
> 
>     pref("urlclassifier.update.response_timeout_ms", 30000);
>     pref("urlclassifier.update.timeout_ms", 90000);
> 
> and see how that impacts the test on Try.
> 
> Looking at Telemetry, we still have about 1% of requests that timeout on the
> server response:
> 
> https://telemetry.mozilla.org/new-pipeline/dist.html#!
> cumulative=0&end_date=2017-07-18&keys=google4!mozilla!
> __none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIME
> OUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_ke
> ys=submissions&start_date=2017-07-16&table=1&trim=1&use_submission_date=0

Perhaps, we could do that in another bug. I am not sure if it has an impact on this bug, but let's see.
Flags: needinfo?(tnguyen)
Comment on attachment 8888177 [details]
Bug 1381745 - Increase server and download timeout when updating SB lists

https://reviewboard.mozilla.org/r/159104/#review164506

How about increase timeout even more in the test case? I mean we can leave the change as it
but modify the test case to have even longer timeout?
Attachment #8888177 - Flags: review?(hchang)
That's right. I would like to file another bug to solve comment 26 because, when the timer fires, the test case will fail too. It may not help us here
We will suffer this kind of intermittent if server does not respond. Or we have to notify to let browser knows when updating completesw
Comment on attachment 8888177 [details]
Bug 1381745 - Increase server and download timeout when updating SB lists

https://reviewboard.mozilla.org/r/159104/#review164536

::: testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py:101
(Diff revision 2)
>          def check_downloaded(_):
>              return reduce(lambda state, pref: state and int(self.marionette.get_pref(pref)) != 1,
>                            self.prefs_provider_update_time.keys(), True)
>  
>          try:
> -            Wait(self.marionette, timeout=90).until(
> +            Wait(self.marionette, timeout=170).until(

I mean the same prefs as you changed in all.js but with longer values in the test case. Does it make sense to you?
Attachment #8888177 - Flags: review?(hchang)
Oh, sorry, I missed your point.
There's a story that listmanager updating may be kicked off before or after marionette test (or marionette test will be restarted with another profile).
Have to find a way to set marionette client pref before the test starts. Probably there are maybe some pref.js override but I don't find it.
Bug 1376591 is going to be landed and SafeBrowsing Init will be called after sessionstore-windows-restored (which triggers marionette). And then it should be fine
Whiteboard: [stockwell needswork] → [stockwell needswork], sbv4-m8
Whiteboard: [stockwell needswork], sbv4-m8 → [stockwell needswork], #sbv4-m8
The failure rate reduces a lot but the intermittent still occurs. I see a couple of reasons:
- Server responds 400 error code ( I will do a bit more in bug 1381718 for further investigation)  
- Server response timeout after 30s
- Still downloading and the test is timed out.

Reverted the patch, with longer timeout (server timeout, downloading timeout and test timeout)
Comment on attachment 8888177 [details]
Bug 1381745 - Increase server and download timeout when updating SB lists

https://reviewboard.mozilla.org/r/159104/#review165586

Thanks Thomas :) Sorry for misleading you earlier on.
Attachment #8888177 - Flags: review?(hchang) → review+
Attachment #8887746 - Attachment is obsolete: true
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ffc435839f360d7d51c55e18fe4365fdef39f8d&selectedJob=116804135
I would like to leave it open but lower priority until we could find any clue on 400 error code response
Keywords: checkin-needed
Keywords: checkin-needed
Priority: P1 → P2
Assignee: francois → tnguyen
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/acde5b9b7b25
Increase server and download timeout when updating SB lists r=hchang
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/acde5b9b7b25
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1384024
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers → Permafail on OS X 10-10 opt test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Perhaps we should still keep it as resolved fixed. The failures only occur only OS X 10-10 opt, bug 1384024 will be the follow-up
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Summary: Permafail on OS X 10-10 opt test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers → Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Whiteboard: [stockwell needswork], #sbv4-m8 → [stockwell fixed:timing], #sbv4-m8
(In reply to OrangeFactor Robot from comment #46)
> 6 failures in 1008 pushes (0.006 failures/push) were associated with this
> bug in the last 7 days.   

[snip]

> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1381745&startday=2017-07-24&endday=2017-07-30&tree=all

All of these occured on 24 July. Nothing since.
You need to log in before you can comment on or make changes to this bug.