Intermittent test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: Items in the first set but not the second:

RESOLVED WORKSFORME

Status

RESOLVED WORKSFORME
2 years ago
a year ago

People

(Reporter: intermittent-bug-filer, Unassigned)

Tracking

({intermittent-failure})

Version 3
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Missing files for this test:

ASAN build:

[task 2016-10-07T00:44:16.152799Z] 00:44:16     INFO - 'googpub-phish-shavar.pset'
[task 2016-10-07T00:44:16.152930Z] 00:44:16     INFO - 'googpub-phish-shavar.sbstore'
[task 2016-10-07T00:44:16.152996Z] 00:44:16     INFO - 'goog-malware-shavar.pset'
[task 2016-10-07T00:44:16.152846Z] 00:44:16     INFO - 'goog-malware-shavar.sbstore'
[task 2016-10-07T00:44:16.152964Z] 00:44:16     INFO - 'goog-unwanted-shavar.pset'
[task 2016-10-07T00:44:16.152891Z] 00:44:16     INFO - 'goog-unwanted-shavar.sbstore'

As far as I remember we should not wait for the pub version from Google in such a case. Francois, any idea why the public version is listed in the preference? Maybe the other two are somewhat related?


Debug build:

[task 2016-10-07T00:45:10.432265Z] 00:45:10     INFO - 'goog-phish-shavar.sbstore'
[task 2016-10-07T00:45:10.433069Z] 00:45:10     INFO - 'goog-malware-shavar.sbstore'
[task 2016-10-07T00:45:10.433985Z] 00:45:10     INFO - 'goog-unwanted-shavar.sbstore'
[task 2016-10-07T00:45:10.434817Z] 00:45:10     INFO - 'goog-unwanted-shavar.pset'
[task 2016-10-07T00:45:10.435790Z] 00:45:10     INFO - 'goog-phish-shavar.pset'
[task 2016-10-07T00:45:10.436578Z] 00:45:10     INFO - 'goog-malware-shavar.pset


In both cases we were waiting 60 seconds. Maybe we need more time for slowly running builds like debug and especially ASAN?
Flags: needinfo?(francois)
(In reply to Henrik Skupin (:whimboo) from comment #1)
> As far as I remember we should not wait for the pub version from Google in
> such a case. Francois, any idea why the public version is listed in the
> preference? Maybe the other two are somewhat related?

Is it possible that ASAN builds, unlike debug ones, are not considered official (as in #ifdef MOZILLA_OFFICIAL)?
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #2)
> Is it possible that ASAN builds, unlike debug ones, are not considered
> official (as in #ifdef MOZILLA_OFFICIAL)?

Well, whatever happens here we retrieve the filenames from the appropriate preferences. If there is something wrong here wouldn't it be a build config issue? See bug 1296614 comment 15 as you gave on the other bug about the correct file name.
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Well, whatever happens here we retrieve the filenames from the appropriate
> preferences. If there is something wrong here wouldn't it be a build config
> issue? See bug 1296614 comment 15 as you gave on the other bug about the
> correct file name.

Yes, I think the test is right and the ASAN builds are simply configured differently.

As far as I'm concerned they are both failing in the same way: they are missing the main lists (malware, phishing and unwanted).
Please keep in mind that all files as needed have been downloaded in this case. If that wouldn't have happened a failure like on bug 1309764 is seen.

The check for a completed download which we perform is:

https://dxr.mozilla.org/mozilla-central/source/testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py?q=TestSafeBrowsingInitialDownload&redirect_type=single#73

It means that `browser.safebrowsing.provider.google.nextupdatetime` has been modified, and as you told it means the download has been completed.

For me it means that we get the wrong files served, or modify this preference too early while the download has not been finished yet.
(In reply to Henrik Skupin (:whimboo) from comment #5)
> It means that `browser.safebrowsing.provider.google.nextupdatetime` has been
> modified, and as you told it means the download has been completed.
> 
> For me it means that we get the wrong files served, or modify this
> preference too early while the download has not been finished yet.

We have landed some changes recently that cause the prefs to be re-read and possibly an update to trigger whenever the safebrowsing-related features are toggled.

I'm wondering whether we should add a browser restart:

1. set the *.enabled prefs first
2. set the *.nextupdatetime prefs second
3. restart Firefox
4. wait for nextupdate != 1
5. check that the expected files are present

or even add another browser restart between steps 1 and 2. That way, any updates already happening while we flip prefs would finish or get cancelled and we'd only wait for the last update after the second restart.
I would suggest we wait for having debug output available. Then we can see which improvements have to be made to the test. I'm going to file a bug shortly for that.
Depends on: 1310128
Francois, I think we indeed face a problem here with the update because it seems to gets pushed out to the future and which I assume will update the timer preference:

06:31:56     INFO -   update: https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2 tablelist: base-track-digest256,mozstd-trackwhite-digest256,mozplugin-block-digest256
06:31:56     INFO -  listmanager: 06:31:56 GMT-0700 (PDT): pending update, queued request until later
06:32:54     INFO - TEST-UNEXPECTED-FAIL | test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: Items in the first set but not the second:
Flags: needinfo?(francois)
46 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-aurora: 43
* mozilla-central: 3

Platform breakdown:
* windows8-64: 8
* windows7-64: 8
* osx-10-10: 5
* windows7-32: 4
* osx-10-9: 4
* osx-10-11: 4
* linux64: 4
* windowsxp: 3
* windows8-32: 3
* linux32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2016-10-19&endday=2016-10-19&tree=all
89 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 75
* mozilla-central: 11
* autoland: 2
* mozilla-inbound: 1

Platform breakdown:
* windows7-64: 13
* linux64: 12
* windows8-64: 11
* osx-10-11: 11
* osx-10-9: 10
* windows7-32: 8
* osx-10-10: 8
* windows8-32: 7
* windowsxp: 6
* linux32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2016-10-17&endday=2016-10-23&tree=all
This is what I saw in the log (https://firefox-ui-tests.s3.amazonaws.com/23950294-3879-4d23-8b5c-f6f3f4082649/log_info.log):


06:31:56     INFO -  listmanager: 06:31:56 GMT-0700 (PDT): update request: {
06:31:56     INFO -    "tableList": "base-track-digest256,mozstd-trackwhite-digest256,mozplugin-block-digest256",
06:31:56     INFO -    "tableNames": {
06:31:56     INFO -      "base-track-digest256": true,
06:31:56     INFO -      "mozstd-trackwhite-digest256": true,
06:31:56     INFO -      "mozplugin-block-digest256": true
06:31:56     INFO -    },
06:31:56     INFO -    "requestPayload": "base-track-digest256;\nmozstd-trackwhite-digest256;\nmozplugin-block-digest256;\n",
06:31:56     INFO -    "isPostRequest": true
06:31:56     INFO -  }
06:31:56     INFO -  listmanager: 06:31:56 GMT-0700 (PDT): makeUpdateRequestForEntry_: requestPayload base-track-digest256;
06:31:56     INFO -  mozstd-trackwhite-digest256;
06:31:56     INFO -  mozplugin-block-digest256;
06:31:56     INFO -   update: https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2 tablelist: base-track-digest256,mozstd-trackwhite-digest256,mozplugin-block-digest256
06:31:56     INFO -  listmanager: 06:31:56 GMT-0700 (PDT): pending update, queued request until later
06:32:54     INFO - TEST-UNEXPECTED-FAIL | test_safe_browsing_initial_download.py TestSafeBrowsingInitialDownload.test_safe_browsing_initial_download | AssertionError: Items in the first set but not the second:


This is what we should see (and do see when running things manually in a new profile):


listmanager: 14:00:31 GMT-0400 (EDT): update request: {
  "tableList": "mozstd-trackwhite-digest256,mozplugin-block-digest256",
  "tableNames": {},
  "requestPayload": "mozstd-trackwhite-digest256;a:1472162826\nmozplugin-block-digest256;a:1471849627\n",
  "isPostRequest": true
}

listmanager: 14:00:31 GMT-0400 (EDT): makeUpdateRequestForEntry_: requestPayload mozstd-trackwhite-digest256;a:1472162826
mozplugin-block-digest256;a:1471849627
 update: https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2 tablelist: mozstd-trackwhite-digest256,mozplugin-block-digest256

listmanager: 14:00:31 GMT-0400 (EDT): pending update, queued request until later
listmanager: 14:00:33 GMT-0400 (EDT): update success for goog-phish-shavar,goog-malware-shavar,goog-unwanted-shavar,goog-badbinurl-shavar from https://safebrowsing.google.com/safebrowsing/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2&key=[trimmed-google-api-key]: 1894

listmanager: 14:00:33 GMT-0400 (EDT): Waiting 1894000ms
listmanager: 14:00:33 GMT-0400 (EDT): Setting last update of google to 1477418433917
listmanager: 14:00:33 GMT-0400 (EDT): Setting next update of google to 1477420327917 (1894000ms from now)
listmanager: 14:00:34 GMT-0400 (EDT): update success for mozstd-trackwhite-digest256,mozplugin-block-digest256 from https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2: 3600

listmanager: 14:00:34 GMT-0400 (EDT): Waiting 3600000ms
listmanager: 14:00:34 GMT-0400 (EDT): Setting last update of mozilla to 1477418434409
listmanager: 14:00:34 GMT-0400 (EDT): Setting next update of mozilla to 1477422034409 (3600000ms from now)


Basically, nothing happens on the test machine between 6:31:56 and 6:32:54. I would expect to see the update from Google come back or timeout, but it's possible the network timeout is > 60 seconds and the test times out first.

Are we confident that the test machine can make outbound network connections?

It doesn't look like it. Another thing to try if you can reproduce this reliably would be to increase the timeout to 5 minutes and see if the update timesout and returns an error that we can see in the logs.
Flags: needinfo?(francois)
See Also: → bug 1282056
(In reply to François Marier [:francois] from comment #12)
> Basically, nothing happens on the test machine between 6:31:56 and 6:32:54.
> I would expect to see the update from Google come back or timeout, but it's
> possible the network timeout is > 60 seconds and the test times out first.
> 
> Are we confident that the test machine can make outbound network connections?

Yes, this is all fine. Sometimes it fluctuates if there are networking issues but generally we do not have any problems with outbound connections. Mind explaining me why we defer to download the files from Google at a later time? I cannot see anything in the log which tells about issues with downloading. Also seeing 1894000ms will definitely make the test fail because currently we are waiting 60s for a complete download of all files.

> It doesn't look like it. Another thing to try if you can reproduce this
> reliably would be to increase the timeout to 5 minutes and see if the update
> timesout and returns an error that we can see in the logs.

Does it mean we do not see an error within 60s? So why do we defer?

Additionally to all the above, I feel that waiting for the 'nextupdatetime' preference was a wrong decision, and we really have to change that to wait for the correct amount of files we expect (as retrieved from the preferences). That way we would not mark the test as failed when updates are getting deferred.
Flags: needinfo?(francois)
(In reply to Henrik Skupin (:whimboo) from comment #13)
> Mind explaining me why we defer to download the files from Google at a later
> time?

We only pull from one server at a time. Until the update with shavar.services.mozilla.com is done (or times out), we won't attempt to update with Google.

> Additionally to all the above, I feel that waiting for the 'nextupdatetime'
> preference was a wrong decision, and we really have to change that to wait
> for the correct amount of files we expect (as retrieved from the
> preferences). That way we would not mark the test as failed when updates are
> getting deferred.

Are you waiting for both browser.safebrowsing.provider.google.nextupdatetime and browser.safebrowsing.provider.mozilla.nextupdatetime or just one of them?

The nexupdatetime is be set after we hear back from the server:

listmanager: 14:00:33 GMT-0400 (EDT): update success for goog-phish-shavar,goog-malware-shavar,goog-unwanted-shavar,goog-badbinurl-shavar from https://safebrowsing.google.com/safebrowsing/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2&key=[trimmed-google-api-key]: 1894

listmanager: 14:00:33 GMT-0400 (EDT): Waiting 1894000ms
listmanager: 14:00:33 GMT-0400 (EDT): Setting last update of google to 1477418433917
listmanager: 14:00:33 GMT-0400 (EDT): Setting next update of google to 1477420327917 (1894000ms from now)

What exactly are you looking at? A change in value or any writes to that pref (e.g. overwriting the "1" with the same value of "1")?
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #14)
> We only pull from one server at a time. Until the update with
> shavar.services.mozilla.com is done (or times out), we won't attempt to
> update with Google.

I see. So it means at this time we update the 'nextupdatetime' preference for Google and set it to a time in the future? If that is the case we indeed have a problem with our test. 

> Are you waiting for both browser.safebrowsing.provider.google.nextupdatetime
> and browser.safebrowsing.provider.mozilla.nextupdatetime or just one of them?

We are waiting for both to be not '1'. But that means whenever we defer the download from Google and then finish up the download from our servers both preferences have a value of not 1. And as such we continue the test and fail.

> What exactly are you looking at? A change in value or any writes to that
> pref (e.g. overwriting the "1" with the same value of "1")?

Our check is here:

https://dxr.mozilla.org/mozilla-aurora/source/testing/firefox-ui/tests/functional/security/test_safe_browsing_initial_download.py#73-75

We start with True (basically assume that all files have been downloaded) and then reduce the value for each 'nextupdatetime' preference for a value != 1. As mentioned above if we defer the download of Google and update that preference, we will escape from that Wait loop and then fail in the finally block in comparing the files. 

So I see two options:

1) We change the check_downloaded callback to check the number of files having downloaded. Then we would ignore the update preference. Drawback here is that we still have the additional deferred download for Google and somehow would have to reset this pref again - maybe by watching the preference for changes.

2) We do separate downloads for each list and not at once. We might need an additional restart but at least then we would not have the issue that the Google download gets pushed out. 

Personally I would prefer option 2 if my assumptions from above are correct. Please let me know about which strategy we want to do.

Btw. did we always defer the download of Google files or is it something which has been changed for Firefox 50? I just wonder about the spike for current Beta.
Flags: needinfo?(francois)
(In reply to Henrik Skupin (:whimboo) from comment #15)
> (In reply to François Marier [:francois] from comment #14)
> > We only pull from one server at a time. Until the update with
> > shavar.services.mozilla.com is done (or times out), we won't attempt to
> > update with Google.
> 
> I see. So it means at this time we update the 'nextupdatetime' preference
> for Google and set it to a time in the future? If that is the case we indeed
> have a problem with our test. 

When we queue up an update because there's already one running, we don't update the nextupdatetime pref. See the trace from comment 12:

listmanager: 14:00:31 GMT-0400 (EDT): makeUpdateRequestForEntry_: requestPayload mozstd-trackwhite-digest256
...
listmanager: 14:00:31 GMT-0400 (EDT): pending update, queued request until later
listmanager: 14:00:33 GMT-0400 (EDT): update success for goog-phish-shavar,goog-malware-shavar,goog-unwanted-shavar,goog-badbinurl-shavar from https://safebrowsing.google.com/safebrowsing/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2&key=[trimmed-google-api-key]: 1894
...
listmanager: 14:00:33 GMT-0400 (EDT): Setting next update of google to 1477420327917 (1894000ms from now)
...
listmanager: 14:00:34 GMT-0400 (EDT): update success for mozstd-trackwhite-digest256,mozplugin-block-digest256 from https://shavar.services.mozilla.com/downloads?client=navclient-auto-ffox&appver=52.0a&pver=2.2: 3600
...
listmanager: 14:00:34 GMT-0400 (EDT): Setting next update of mozilla to 1477422034409 (3600000ms from now)

It goes like this:

1. We request both updates (the Google request is not shown in this trace).
2. The Mozilla update is queued.
3. The Google update returns.
4. We update Google's nextupdatetime.
5. We kick off the Mozilla update (not shown in trace).
6. The Mozilla update returns.
7. We update Mozilla's nextupdatetime.

> > Are you waiting for both browser.safebrowsing.provider.google.nextupdatetime
> > and browser.safebrowsing.provider.mozilla.nextupdatetime or just one of them?
> 
> We are waiting for both to be not '1'. But that means whenever we defer the
> download from Google and then finish up the download from our servers both
> preferences have a value of not 1. And as such we continue the test and fail.

That's odd because that's not what I'm seeing from the trace. When we defer/queue an update, we don't touch the nextupdatetime.

Maybe there's a race condition here where the update has finished but the files aren't yet flushed to disk. Should we add a 10 sec delay after the prefs have been set to 1?

> 1) We change the check_downloaded callback to check the number of files
> having downloaded. Then we would ignore the update preference. Drawback here
> is that we still have the additional deferred download for Google and
> somehow would have to reset this pref again - maybe by watching the
> preference for changes.

If we are dealing with a race condition, maybe the following would work:

1. wait for the prefs
2. wait a few seconds
3. check for the correct number of files
4. if there are files missing, goto 3

> 2) We do separate downloads for each list and not at once. We might need an
> additional restart but at least then we would not have the issue that the
> Google download gets pushed out. 

It might actually be a good idea to split this in two. This test requires the network so it's bound to be flakier than normal and splitting it in two might reduce the intermittent failures that we can't control.

> Btw. did we always defer the download of Google files or is it something
> which has been changed for Firefox 50? I just wonder about the spike for
> current Beta.

We've always deferred updates if another one is already happening. I'm not sure why this test started to fail now.
Flags: needinfo?(francois)
5 failures in 722 pushes (0.007 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* autoland: 4
* try: 1

Platform breakdown:
* linux32: 3
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2017-01-09&endday=2017-01-15&tree=all
7 failures in 690 pushes (0.01 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 3
* mozilla-aurora: 2
* graphics: 1
* autoland: 1

Platform breakdown:
* linux32: 4
* linux64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2017-01-16&endday=2017-01-22&tree=all
16 failures in 749 pushes (0.021 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 11
* autoland: 5

Platform breakdown:
* linux64: 10
* linux32: 5
* windows2012-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2017-01-23&endday=2017-01-29&tree=all
11 failures in 733 pushes (0.015 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-central: 7
* mozilla-inbound: 3
* graphics: 1

Platform breakdown:
* linux64: 5
* windows8-64: 2
* windows8-32: 1
* windows7-64: 1
* windows7-32: 1
* osx-10-9: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308368&startday=2017-01-30&endday=2017-02-05&tree=all
Francois is working on improvements for this test on bug 1333303. It should fix a couple of all those issues. Lets see.
Depends on: 1333303
Status: NEW → RESOLVED
Last Resolved: a year ago
Keywords: bulk-close-intermittents
Resolution: --- → INCOMPLETE
The test actually got changed so this failure signature is no longer present. Marking as WFM.
Resolution: INCOMPLETE → WORKSFORME
Keywords: bulk-close-intermittents
You need to log in before you can comment on or make changes to this bug.