Closed
Bug 1381745
Opened 7 years ago
Closed 7 years ago
Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Categories
(Toolkit :: Safe Browsing, defect, P2)
Toolkit
Safe Browsing
Tracking
()
RESOLVED
FIXED
mozilla56
Tracking | Status | |
---|---|---|
firefox56 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: tnguyen)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:timing], #sbv4-m8)
Attachments
(1 file, 1 obsolete file)
Filed by: hskupin [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=115111633&repo=autoland https://queue.taskcluster.net/v1/task/LBqM8iniT1Sd2NeIhMlcRg/runs/0/artifacts/public/logs/live_backing.log
Comment 1•7 years ago
|
||
As it look like the fix on bug 1374268 didn't hold long. I can see tons of failures of the same type on autoland now. It all started about 11 hours ago. Maybe we hit a limit again and delivery gets delayed? Thomas can you please have a look? https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=firefox%20ui%20remote&bugfiler&fromchange=9033647b9a51fdd82d1049a30f952e2fbb554b97&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&selectedJob=115111575
Comment 2•7 years ago
|
||
Hm, wasn't that download successful? [task 2017-07-18T06:44:12.059816Z] 06:44:12 INFO - listmanager: 06:44:12 GMT+0000 (UTC): update success for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto from https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?$ct=application/x-protobuf&key=[trimmed-google-api-key]&$httpMethod=POST: 1818 [task 2017-07-18T06:44:12.060661Z] 06:44:12 INFO - listmanager: 06:44:12 GMT+0000 (UTC): Waiting 30.3min [task 2017-07-18T06:44:12.061399Z] 06:44:12 INFO - listmanager: 06:44:12 GMT+0000 (UTC): Setting last update of google4 to 1500360252055 [task 2017-07-18T06:44:12.062278Z] 06:44:12 INFO - listmanager: 06:44:12 GMT+0000 (UTC): Setting next update of google4 to 1500362070055 (30.3min from now) Maybe it would be helpful for debugging to print the list of downloaded files in case of a test failure.
Assignee | ||
Comment 3•7 years ago
|
||
Yes, you are right. We downloaded data but somehow did not save data to disk I don't know what's going wrong but seems we could not update at the moment. I tried to remove google4 folder (or safebrowsing folder), going to about:url-classifier, click update google4. I could see "update success" but google4 is still empty. I will look closer tomorrow. Francois, do we have any change in server side?
Flags: needinfo?(tnguyen) → needinfo?(francois)
Assignee | ||
Comment 4•7 years ago
|
||
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #3) > Yes, you are right. We downloaded data but somehow did not save data to disk > I don't know what's going wrong but seems we could not update at the moment. > I tried to remove google4 folder (or safebrowsing folder), going to > about:url-classifier, click update google4. I could see "update success" but > google4 is still empty. > I will look closer tomorrow. > Francois, do we have any change in server side? And the log is similar to the failure log. That is : update success for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto from But the folder is still empty
Assignee | ||
Comment 5•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #1) > As it look like the fix on bug 1374268 didn't hold long. I can see tons of > failures of the same type on autoland now. It all started about 11 hours > ago. Maybe we hit a limit again and delivery gets delayed? > It is possible, i tried to update manually in about:url-classifier and failed too
Comment 6•7 years ago
|
||
this has 157 failures today: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381745&startday=2017-07-10&endday=2017-07-18&tree=all we need to fix or disable this by end of day given the confusion this will cause on the trees.
Whiteboard: [stockwell needswork]
Updated•7 years ago
|
Assignee: nobody → francois
Status: NEW → ASSIGNED
Flags: needinfo?(francois)
Priority: -- → P1
Comment 8•7 years ago
|
||
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #3) > Yes, you are right. We downloaded data but somehow did not save data to disk > I don't know what's going wrong but seems we could not update at the moment. > I tried to remove google4 folder (or safebrowsing folder), going to > about:url-classifier, click update google4. I could see "update success" but > google4 is still empty. > I will look closer tomorrow. I have done the same in a fresh profile (today's Nightly on Linux) and it worked fine.
Comment 9•7 years ago
|
||
Assuming that updates aren't actually broken (as per comment 8), I traced through the code and it does look like nextupdatetime will only be updated after we have successfully processed the update. In case I missed something, I tried adding a sleep(30) after nextupdatetime is set and before we look for the files on disk in case we are dealing with a slow disk that takes a bit longer to write the files to disk: https://hg.mozilla.org/try/rev/3fb190a131b7e1cb019bc8399ee264a314afadaa Running 20 Try attempts with that: https://treeherder.mozilla.org/#/jobs?repo=try&revision=268caf9814073a8a42dbec352f713e54ddb15740
Comment hidden (Intermittent Failures Robot) |
Comment 11•7 years ago
|
||
(In reply to François Marier [:francois] from comment #9) > Running 20 Try attempts with that: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=268caf9814073a8a42dbec352f713e54ddb15740 All of the tests passed, I'm not sure if that means anything, but it might be worth giving that a try before we disable the test.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 13•7 years ago
|
||
I have just tried to update and it works fine now. I suspect there's an unexpected problem with update data yesterday.
Assignee | ||
Comment 14•7 years ago
|
||
What I did is running ./mach firefox-ui-functional testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py --disable-e10s --gecko-log ~/Logs/gecko.log in local machine and it failed 100% rate. Then tried to update google4 manually in about:url-classifier and failed too. It works fine both this morning although I did not change anything
Assignee | ||
Comment 15•7 years ago
|
||
I tried to retrigger the test from autoland multiple times (randomly pick someone's push) See https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=115441130 and seems it is working this morning although we did not push anything to server from yesterday
Assignee | ||
Comment 16•7 years ago
|
||
Edit link https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=da7429cfbfa50fc74cc8c0f8751005e15e69756b&selectedJob=115441257 Running ~40 times in various platforms but only 1 failure but it's not related to this bug (we could not download google list in that case) I suspect update server was down yesterday, Francois, do you have any idea that I missed?
Flags: needinfo?(francois)
Assignee | ||
Comment 17•7 years ago
|
||
mozreview-review |
Comment on attachment 8887746 [details] Bug 1381745 - Wait for slow disks to write Safe Browsing lists to disk. https://reviewboard.mozilla.org/r/158664/#review163924 ::: testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py:107 (Diff revision 1) > Wait(self.marionette, timeout=90).until( > check_downloaded, message='Not all safebrowsing files have been downloaded') > finally: > + # Wait for slow disks to finish writing the tables to disk > + time.sleep(30) > + The "update sucess" is notified (nextupdate pref change) after saving data to disk, I don't think that makes visible change http://searchfox.org/mozilla-central/rev/a83a4b68974aecaaacdf25145420e0fe97b7aa22/toolkit/components/url-classifier/Classifier.cpp#762,871
Comment 18•7 years ago
|
||
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #16) > I suspect update server was down yesterday, Francois, do you have any idea > that I missed? I've asked Google if they are aware of an outage on the V4 servers yesterday.
Flags: needinfo?(francois)
Comment 19•7 years ago
|
||
(In reply to François Marier [:francois] from comment #18) > (In reply to Thomas Nguyen[:tnguyen] ni plz from comment #16) > > I suspect update server was down yesterday, Francois, do you have any idea > > that I missed? > > I've asked Google if they are aware of an outage on the V4 servers yesterday. You were right Thomas. They have confirmed that they had an outage yesterday and that they pushed out a fix today. So let's wait for the OrangeFactor Robot to confirm that all is good and close this bug without merging my hack.
Comment 20•7 years ago
|
||
Everything looks fine now on autoland for the last 3.5 hours: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler Francois, I wonder how we want to deal with those failures in the future. It's great to catch those problems with an external service but on the other side puts our test under pressure by sheriffs regarding the need to star failures.
Summary: Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] → Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Comment 21•7 years ago
|
||
mozreview-review |
Comment on attachment 8887746 [details] Bug 1381745 - Wait for slow disks to write Safe Browsing lists to disk. https://reviewboard.mozilla.org/r/158664/#review163962 It's not needed! \o/
Attachment #8887746 -
Flags: review?(hskupin)
Comment 22•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #20) > Everything looks fine now on autoland for the last 3.5 hours: > https://treeherder.mozilla.org/#/jobs?repo=autoland&filter- > searchStr=Firefox%20UI&bugfiler \o/ > Francois, I wonder how we want to deal with those failures in the future. > It's great to catch those problems with an external service but on the other > side puts our test under pressure by sheriffs regarding the need to star > failures. Yeah that's a good point. Looking at the reliability of the V2 servers, I suspect it won't be a problem in practice. The V4 service is relatively new and Chrome isn't using it on release yet either. Once that's the case, I expect it will be as reliable as the old servers.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 23•7 years ago
|
||
Great. So lets leave it as is and let us spot bustage on Google's side too. Marking as WFM given that it's nothing we fixed on our end.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•7 years ago
|
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → WORKSFORME
Comment 24•7 years ago
|
||
It looks like that we still have this problem on and off: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&selectedJob=115571748
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Assignee | ||
Comment 25•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #24) > It looks like that we still have this problem on and off: > > https://treeherder.mozilla.org/#/jobs?repo=autoland&filter- > searchStr=Firefox%20UI&bugfiler&selectedJob=115571748 It seems the download did not finish in the failure. It was still downloading or server timeout I could see in telemetry https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-13&keys=google4!mozilla!__none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIMEOUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-13&table=0&trim=1&use_submission_date=0 downloading timeout (>60 secs) could happen
Comment 26•7 years ago
|
||
We could try bumping the timeouts to: pref("urlclassifier.update.response_timeout_ms", 30000); pref("urlclassifier.update.timeout_ms", 90000); and see how that impacts the test on Try. Looking at Telemetry, we still have about 1% of requests that timeout on the server response: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-07-18&keys=google4!mozilla!__none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIMEOUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-07-16&table=1&trim=1&use_submission_date=0
Comment 27•7 years ago
|
||
(In reply to François Marier [:francois] from comment #26) > We could try bumping the timeouts to: > > pref("urlclassifier.update.response_timeout_ms", 30000); > pref("urlclassifier.update.timeout_ms", 90000); Thomas, do you want to prepare a patch for this and get it reviewed by Henry? (I'm on PTO Thursday/Friday)
Flags: needinfo?(tnguyen)
Comment hidden (mozreview-request) |
Assignee | ||
Comment 29•7 years ago
|
||
(In reply to François Marier [:francois] from comment #26) > We could try bumping the timeouts to: > > pref("urlclassifier.update.response_timeout_ms", 30000); > pref("urlclassifier.update.timeout_ms", 90000); > > and see how that impacts the test on Try. > > Looking at Telemetry, we still have about 1% of requests that timeout on the > server response: > > https://telemetry.mozilla.org/new-pipeline/dist.html#! > cumulative=0&end_date=2017-07-18&keys=google4!mozilla! > __none__&max_channel_version=nightly%252F56&measure=URLCLASSIFIER_UPDATE_TIME > OUT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_ke > ys=submissions&start_date=2017-07-16&table=1&trim=1&use_submission_date=0 Perhaps, we could do that in another bug. I am not sure if it has an impact on this bug, but let's see.
Flags: needinfo?(tnguyen)
Comment 30•7 years ago
|
||
mozreview-review |
Comment on attachment 8888177 [details] Bug 1381745 - Increase server and download timeout when updating SB lists https://reviewboard.mozilla.org/r/159104/#review164506 How about increase timeout even more in the test case? I mean we can leave the change as it but modify the test case to have even longer timeout?
Attachment #8888177 -
Flags: review?(hchang)
Assignee | ||
Comment 31•7 years ago
|
||
That's right. I would like to file another bug to solve comment 26 because, when the timer fires, the test case will fail too. It may not help us here We will suffer this kind of intermittent if server does not respond. Or we have to notify to let browser knows when updating completesw
Comment hidden (mozreview-request) |
Comment 33•7 years ago
|
||
mozreview-review |
Comment on attachment 8888177 [details] Bug 1381745 - Increase server and download timeout when updating SB lists https://reviewboard.mozilla.org/r/159104/#review164536 ::: testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py:101 (Diff revision 2) > def check_downloaded(_): > return reduce(lambda state, pref: state and int(self.marionette.get_pref(pref)) != 1, > self.prefs_provider_update_time.keys(), True) > > try: > - Wait(self.marionette, timeout=90).until( > + Wait(self.marionette, timeout=170).until( I mean the same prefs as you changed in all.js but with longer values in the test case. Does it make sense to you?
Attachment #8888177 -
Flags: review?(hchang)
Assignee | ||
Comment 34•7 years ago
|
||
Oh, sorry, I missed your point. There's a story that listmanager updating may be kicked off before or after marionette test (or marionette test will be restarted with another profile). Have to find a way to set marionette client pref before the test starts. Probably there are maybe some pref.js override but I don't find it.
Assignee | ||
Comment 35•7 years ago
|
||
Bug 1376591 is going to be landed and SafeBrowsing Init will be called after sessionstore-windows-restored (which triggers marionette). And then it should be fine
Comment hidden (mozreview-request) |
Assignee | ||
Updated•7 years ago
|
Whiteboard: [stockwell needswork] → [stockwell needswork], sbv4-m8
Assignee | ||
Updated•7 years ago
|
Whiteboard: [stockwell needswork], sbv4-m8 → [stockwell needswork], #sbv4-m8
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 38•7 years ago
|
||
The failure rate reduces a lot but the intermittent still occurs. I see a couple of reasons: - Server responds 400 error code ( I will do a bit more in bug 1381718 for further investigation) - Server response timeout after 30s - Still downloading and the test is timed out. Reverted the patch, with longer timeout (server timeout, downloading timeout and test timeout)
Comment hidden (mozreview-request) |
Comment 40•7 years ago
|
||
mozreview-review |
Comment on attachment 8888177 [details] Bug 1381745 - Increase server and download timeout when updating SB lists https://reviewboard.mozilla.org/r/159104/#review165586 Thanks Thomas :) Sorry for misleading you earlier on.
Attachment #8888177 -
Flags: review?(hchang) → review+
Assignee | ||
Updated•7 years ago
|
Attachment #8887746 -
Attachment is obsolete: true
Comment hidden (typo) |
Assignee | ||
Comment 42•7 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ffc435839f360d7d51c55e18fe4365fdef39f8d&selectedJob=116804135 I would like to leave it open but lower priority until we could find any clue on 400 error code response
Assignee | ||
Updated•7 years ago
|
Keywords: checkin-needed
Assignee | ||
Updated•7 years ago
|
Keywords: checkin-needed
Assignee | ||
Updated•7 years ago
|
Priority: P1 → P2
Assignee | ||
Updated•7 years ago
|
Assignee: francois → tnguyen
Comment 43•7 years ago
|
||
Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/acde5b9b7b25 Increase server and download timeout when updating SB lists r=hchang
Keywords: checkin-needed
Comment 44•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/acde5b9b7b25
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Assignee | ||
Updated•7 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers → Permafail on OS X 10-10 opt test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Assignee | ||
Comment 45•7 years ago
|
||
Perhaps we should still keep it as resolved fixed. The failures only occur only OS X 10-10 opt, bug 1384024 will be the follow-up
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•7 years ago
|
Summary: Permafail on OS X 10-10 opt test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers → Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] due to outage of Google SafeBrowsing version 4 servers
Updated•7 years ago
|
Whiteboard: [stockwell needswork], #sbv4-m8 → [stockwell fixed:timing], #sbv4-m8
Comment hidden (Intermittent Failures Robot) |
Comment 47•7 years ago
|
||
(In reply to OrangeFactor Robot from comment #46) > 6 failures in 1008 pushes (0.006 failures/push) were associated with this > bug in the last 7 days. [snip] > For more details, see: > https://brasstacks.mozilla.com/orangefactor/ > ?display=Bug&bugid=1381745&startday=2017-07-24&endday=2017-07-30&tree=all All of these occured on 24 July. Nothing since.
You need to log in
before you can comment on or make changes to this bug.
Description
•