Closed Bug 1374268 Opened 7 years ago Closed 7 years ago

Permafail test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in []

Categories

(Firefox Build System :: General, defect)

52 Branch
defect
Not set
normal

Tracking

(firefox-esr52 unaffected, firefox54 unaffected, firefox55 unaffected, firefox56 fixed)

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- unaffected
firefox56 --- fixed

People

(Reporter: whimboo, Assigned: tnguyen)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed:timing])

Attachments

(1 file)

It looks like that artifact builds do not correctl package the safebrowsing Google API key and as such are failing on try:

Here the failing try build: https://treeherder.mozilla.org/logviewer.html#?job_id=107700226&repo=try&lineNumber=12845

(In reply to François Marier [:francois] from bug 1371923 comment #27)
> > Does that problem also explain the test failures on try for my push?
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=7c858c0d4319&selectedJob=107700226
> 
> That's a different problem. The update for the V4 lists fails with a 400:
> 
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=107700226&repo=try&lineNumber=12845
> 
> Was that just a normal Try push? Try should have the correct API key set...
The full try build which I requested a couple minutes later based on the same changeset didn't cause test failures:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9bb2ae032322c8579023a43f1e89bd7320d55b2e
See Also: → 1370753
Francois, any preference in what to fix first and make the other bug dependent?
Flags: needinfo?(francois)
Component: Firefox UI Tests → Build Config
Product: Testing → Core
QA Contact: hskupin
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Francois, any preference in what to fix first and make the other bug
> dependent?

I think we should ideally fix artifact builds to include the API key.
Flags: needinfo?(francois)
Btw. I can see the same failures with a local debug artifact build (opt not tested). So this is not only try.
Summary: Permafail test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] for artifact builds on try → Permafail test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] for artifact builds
This failed even on autoland now. Francois can you please have a look?

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&selectedJob=112477683
Flags: needinfo?(francois)
Whiteboard: [stockwell needswork]
Summary: Permafail test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in [] for artifact builds → Permafail test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: 'goog-badbinurl-proto.pset' not found in []
Francois, this test failure has to be fixed. Otherwise the test might get disabled soon. Can you please have a look at it?
Francois, this is a very frequent failure, so I am eager to see it fixed or disabled soon. Please let us know your plan.
I took a quick look and it doesn't appear to be the usual suspect.

Thomas, can you have a look too to see if there's anything obvious I missed?
Flags: needinfo?(francois) → needinfo?(tnguyen)
Btw. this started to raise a lot on July 2nd, and since then more and more failures are getting reported:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1374268&startday=2017-06-23&endday=2017-07-10&tree=all

Did anything land lately on integration branches which could have caused this regression? The first report on an integration branch beside try and a single failure on cedar was via comment 7 which is 4 days ago.
I am taking a look at this bug, and the followings are the reason:
- Marionette is initialized when the `sessionstore-windows-restored` observer notification is received. At this time we also setup the initial browser Marionette, load framescript and do a lot of stuffs.
In our test, we prepare setting preference.
browser.safebrowsing.provider.google.nextupdatetime = 1
browser.safebrowsing.provider.google4.nextupdatetime = 1

- Unfortunately, listmanager would be initialized earlier than sessionstore-windows-restored. At that time, listmanager will choose a random number to download file (0-1 min)
http://searchfox.org/mozilla-central/rev/5dadcbe55b4ddd1e448c06c77390ff6483aa009b/toolkit/components/url-classifier/content/listmanager.js#210
-> that is how the failure occurs
You will see the failure has been grown recently, presumably because we are trying to save the time of initializing SafeBrowsing and listmanager (such as Bug 1297614)
Flags: needinfo?(tnguyen)
And I see this failure occurs not only in opt build, but also in debug with lower rate. I guess there are differences in startup timing between debug and opt build
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #19)
> And I see this failure occurs not only in opt build, but also in debug with
> lower rate. I guess there are differences in startup timing between debug
> and opt build

Yes, that is the case. There is a longer delay for debug builds than for opt builds.

(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #18)
> You will see the failure has been grown recently, presumably because we are
> trying to save the time of initializing SafeBrowsing and listmanager (such
> as Bug 1297614)

Thanks. I will just put it onto the dependency list so we finally have a good set of changes which caused this.
Blocks: 1297614
Keywords: regression
I am not sure if that is really a regression :), because the reason I mentioned in comment 18
Just updated a quick fix, Francois, do you have any idea?
I tried to look at listmanager log and I found in all of my log, for example 


[task 2017-07-11T12:50:44.684945Z] 12:50:44     INFO -  listmanager: 12:50:44 GMT+0000 (UTC): Initializing update checker for https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?$ct=application/x-protobuf&key=[trimmed-google-api-key]&$httpMethod=POST provided by google4
[task 2017-07-11T12:50:44.685713Z] 12:50:44     INFO -  listmanager: 12:50:44 GMT+0000 (UTC): Next update 0.8299min from now
[task 2017-07-11T12:50:44.686991Z] 12:50:44     INFO -  listmanager: 12:50:44 GMT+0000 (UTC): Initializing update checker for https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=56.0a&pver=2.2 provided by mozilla
[task 2017-07-11T12:50:44.687855Z] 12:50:44     INFO -  listmanager: 12:50:44 GMT+0000 (UTC): Next update at 1
[task 2017-07-11T12:50:44.688750Z] 12:50:44     INFO -  listmanager: 12:50:44 GMT+0000 (UTC): Next update 0min from now
[task 2017-07-11T12:50:44.689650Z] 12:50:44     INFO -  1499777444615	Marionette	TRACE	2 -> [0,23,"getContext",{}]
[task 2017-07-11T12:50:44.690829Z] 12:50:44     INFO -  1499777444616	Marionette	TRACE	2 <- [1,23,null,{"value":"chrome"}]

You will see 0.8299min is the random number for google4 update and obviously what we set 
browser.safebrowsing.provider.google4.nextupdatetime = 1
did not take any effect.
Plz see 
http://searchfox.org/mozilla-central/rev/5dadcbe55b4ddd1e448c06c77390ff6483aa009b/testing/marionette/client/marionette_driver/marionette.py#1001

we are trying to enforce nonexistent pref, as the result we delete_session, restart then receive sessionstore-windows-restored

[task 2017-07-11T12:50:40.355589Z] 12:50:40     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -no-remote -marionette -profile /home/worker/workspace/build/tmp53fF_5.mozrunner-1499777440
[task 2017-07-11T12:50:41.151107Z] 12:50:41     INFO -  1499777441142	Marionette	DEBUG	Received observer notification "profile-after-change"
[task 2017-07-11T12:50:41.228771Z] 12:50:41     INFO -  1499777441221	Marionette	DEBUG	Received observer notification "command-line-startup"
[task 2017-07-11T12:50:41.228877Z] 12:50:41     INFO -  1499777441222	Marionette	INFO	Enabled via --marionette
[task 2017-07-11T12:50:42.708570Z] 12:50:42     INFO -  GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2017-07-11T12:50:42.979547Z] 12:50:42     INFO -  1499777442976	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
[task 2017-07-11T12:50:42.987183Z] 12:50:42     INFO -  1499777442980	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"

The patch I posted, simply adding the browser.safebrowsing.provider.google4.nextupdatetime to startup profile to avoid this from being nonexistent.
well, I see
[task 2017-07-11T14:38:42.480453Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): Next update at 1
[task 2017-07-11T14:38:42.480750Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): Next update 0min from now
[task 2017-07-11T14:38:42.481314Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): Initializing update checker for https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=56.0a&pver=2.2 provided by mozilla
[task 2017-07-11T14:38:42.481570Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): Next update at 1
[task 2017-07-11T14:38:42.481859Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): Next update 0min from now

==> google4.nextupdatetime == 1, google4 is going to be updated now,

But some failures
[task 2017-07-11T14:38:42.613936Z] 14:38:42     INFO -  1499783922460	Marionette	TRACE	2 -> [0,30,"setContext",{"value":"chrome"}]
[task 2017-07-11T14:38:42.614068Z] 14:38:42     INFO -  1499783922467	Marionette	TRACE	2 <- [1,30,null,{}]
[task 2017-07-11T14:38:42.615670Z] 14:38:42     INFO -  listmanager: 14:38:42 GMT+0000 (UTC): download error for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto: 400

Google responded 400, I remember that we have filed a bug for 400 error code somewhere, Francois, do you have any idea about that?
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #25)
> Google responded 400, I remember that we have filed a bug for 400 error code
> somewhere, Francois, do you have any idea about that?
Is that because I ran 20 tests simultaneously, then we sent 20 requests nearly at the same time? Or GAPI problem?
Comment on attachment 8885169 [details]
Bug 1374268 - Increase waiting time for all safebrowsing files being downloaded

https://reviewboard.mozilla.org/r/156030/#review161368

We can't force an immediate update on startup because that would be violating the spec which says it has to be [done within a random amount of time between 0 and 1 minutes after startup](https://developers.google.com/safe-browsing/v4/request-frequency#update-requests).
Attachment #8885169 - Flags: review?(francois) → review-
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #27)
> (In reply to Thomas Nguyen[:tnguyen] ni plz from comment #25)
> > Google responded 400, I remember that we have filed a bug for 400 error code
> > somewhere, Francois, do you have any idea about that?
> Is that because I ran 20 tests simultaneously, then we sent 20 requests
> nearly at the same time? Or GAPI problem?

That's quite possible. They do have an anti-DoS frontend in front of the Safe Browsing servers we talk to.
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #18)
> I am taking a look at this bug, and the followings are the reason:
> - Marionette is initialized when the `sessionstore-windows-restored`
> observer notification is received. At this time we also setup the initial
> browser Marionette, load framescript and do a lot of stuffs.
> In our test, we prepare setting preference.
> browser.safebrowsing.provider.google.nextupdatetime = 1
> browser.safebrowsing.provider.google4.nextupdatetime = 1
> - Unfortunately, listmanager would be initialized earlier than
> sessionstore-windows-restored. At that time, listmanager will choose a
> random number to download file (0-1 min)

Henrik, is there a way in Marionette to set prefs before we even start Firefox?

We could set browser.safebrowsing.provider.*.nextupdatetime = 1 in there.
Flags: needinfo?(hskupin)
Henrik, fwiw, I wonder if we can do:
- Increase waiting time 60 secs -> 70 secs (basically the first update random time is not exactly 0-1 min, it is 3 -> 63 secs).
http://searchfox.org/mozilla-central/rev/31311070d9860b24fe4a7a36976c14b328c16208/testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py#101

- I don't know if we can change enforce_gecko_pref to set_prefs, 
http://searchfox.org/mozilla-central/rev/31311070d9860b24fe4a7a36976c14b328c16208/testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py#83
because enforce_gecko_prefs may delete session, restart instance in the case the pref is nonexistent (seems that the pref we want will not work). Or we can find somewhere to put browser.safebrowsing.provider.*.nextupdatetime = 1 earlier
(In reply to François Marier [:francois] from comment #30)
> Henrik, is there a way in Marionette to set prefs before we even start
> Firefox?
> 
> We could set browser.safebrowsing.provider.*.nextupdatetime = 1 in there.

Not easily from within a test for the first Firefox instance. But I thought that this works with `enforce_gecko_pref` which forces a restart of the application. If that doesn't work (as Thomas pointed out above), we could do a:

```
self.marionette.quit(in_app=True)
self.marionette.instance.profile.required_prefs.append(...)
self.marionette.start_session()
```

It would be a bit of cheating and we have to ensure that the pref is getting reset after the test. Beside that I still feel we have to improve our preference handling with Marionette for those cases we need a restart. This is still logged as bug 1260502. For tracking purposes I will add this bug as dependency.
Depends on: 1260502
Flags: needinfo?(hskupin)
That is blooming.
I took some random logs and all are the same, we picked a random time to update google 4 which may be too large

The safe and hot fix, for now, is increasing waiting time (indeed we must do that because waiting for 60 secs is not enough), for example

19:12:46     INFO -  listmanager: 19:12:45 GMT-0700 (PDT): Initializing update checker for https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?$ct=application/x-protobuf&key=[trimmed-google-api-key]&$httpMethod=POST provided by google4
19:12:46     INFO -  listmanager: 19:12:45 GMT-0700 (PDT): Next update 1.0487min from now
Comment on attachment 8885169 [details]
Bug 1374268 - Increase waiting time for all safebrowsing files being downloaded

https://reviewboard.mozilla.org/r/156030/#review162764

Sounds good. Let's do 90 seconds if we can, to make sure that updates do get through.
Attachment #8885169 - Flags: review?(francois) → review+
There're some failures but it's not related to this bug. See bug 1379906
(In reply to Thomas Nguyen[:tnguyen] ni plz from comment #44)
> There're some failures but it's not related to this bug. See bug 1379906

No, that's something different but still unrelated to this bug. The ASAN failures here are happening because of a busted Firefox binary which cannot find some DLLs.
Blocks: 1282056
Blocks: 1309764
Assignee: nobody → tnguyen
Status: NEW → ASSIGNED
Keywords: checkin-needed
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9395f52d60ac
Increase waiting time for all safebrowsing files being downloaded r=francois
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/9395f52d60ac
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
New occurrence after the merge. Should this still be considered fixed?
Flags: needinfo?(tnguyen)
Where did it fail? I don't see anything on autoland nor mozilla-central.
Flags: needinfo?(aryx.bugmail)
I see. Thomas should have a look at. Generally the failure rate dropped nearly completely, so I don't think that a backout would be wise. We should better file a new bug for the remaining case. But Thomas should decide.
(In reply to Henrik Skupin (:whimboo) [away 07/21 - 07/31] from comment #51)
> Generally the failure rate dropped nearly completely, so I don't think that a backout
> would be wise. We should better file a new bug for the remaining case.

Whether or not it is sufficient (it doesn't appear so), the longer timeout is now necessary. So I'd say we should keep that commit and then figure out what else needs to be done to fix this fully.

(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #50)
> https://treeherder.mozilla.org/logviewer.html#?job_id=114846528&repo=mozilla-
> inbound

From the log, I see an mkdir error:

07:00:57    ERROR -  Traceback (most recent call last):
07:00:57     INFO -    File "/tools/tooltool.py", line 669, in fetch_files
07:00:57     INFO -      os.makedirs(cache_folder, 0700)
07:00:57     INFO -    File "/Users/cltbld/tasks/task_1500298502/build/venv/lib/python2.7/os.py", line 150, in makedirs
07:00:57     INFO -      makedirs(head, mode)
07:00:57     INFO -    File "/Users/cltbld/tasks/task_1500298502/build/venv/lib/python2.7/os.py", line 157, in makedirs
07:00:57     INFO -      mkdir(name, mode)
07:00:57     INFO -  OSError: [Errno 45] Operation not supported: '/home/worker'

which looks like an unrelated test harness failure.
Flags: needinfo?(tnguyen)
(In reply to François Marier [:francois] from comment #52)
> which looks like an unrelated test harness failure.

Yes, that's not related.
Blocks: 1381745
I still see many (100+) failures today, I suspect we will end up disabling this and other tests if we don't get a fix real soon.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → DUPLICATE
We landed a fix on this bug which fixed the issue for half a day before something else popped up. So we should still keep it as resolved fixed. Bug 1381745 will be the follow-up to whatever has changed a couple of hours later.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell needswork] → [stockwell fixed:timing]
Bigger update timeouts have just landed on bug 1381745 which we suspect might help with this. We are now working in bug 1381718 on collecting better debugging info when we run into a 400 from the server.
(In reply to OrangeFactor Robot from comment #61)
> 34 failures in 1008 pushes (0.034 failures/push) were associated with this
> bug in the last 7 days.   

[snip]
 
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1374268&startday=2017-07-24&endday=2017-07-30&tree=all

These Mac failures look like they are due to an unrelated bug:

https://treeherder.mozilla.org/logviewer.html#?repo=try&job_id=119331200&lineNumber=655
https://treeherder.mozilla.org/logviewer.html#?repo=try&job_id=119331205&lineNumber=655
https://treeherder.mozilla.org/logviewer.html#?repo=try&job_id=118362820&lineNumber=655

These Linux and Windows errors:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=118651103&lineNumber=17295
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=118618990&lineNumber=17057
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=118634814&lineNumber=19761
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=118625002&lineNumber=19558

contains an actual update error:

update error for goog-phish-proto,goog-malware-proto,goog-unwanted-proto,goog-badbinurl-proto,goog-downloadwhite-proto from https://safebrowsing.googleapis.com/v4/threatListUpdates:fetch?$ct=application/x-protobuf&key=[trimmed-google-api-key]&$httpMethod=POST: 2152398850

and that number corresponds to NS_BINDING_ABORTED = 0x804B0002, which is returned when we cancel the update:

https://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#768

most likely because of an unexpected network connection termination:

https://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/toolkit/components/url-classifier/nsUrlClassifierStreamUpdater.cpp#867

So ignoring the Mac failures, it looks like we might have 3-4 failures a day due to network conditions.
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: