Closed Bug 1018161 Opened 6 years ago Closed 5 years ago

Test failure 'profile/safebrowsing/goog-malware-* exists' in /restartTests/testSafeBrowsing_initialDownload/test2.js

Categories

(Mozilla QA Graveyard :: Mozmill Tests, defect, P1)

Tracking

(firefox31 wontfix, firefox32 disabled, firefox33 fixed, firefox34 fixed, firefox35 fixed, firefox36 fixed, firefox-esr24 wontfix, firefox-esr31 unaffected)

RESOLVED FIXED

People

(Reporter: andrei, Unassigned)

Details

(Whiteboard: [mozmill-test-failure][sprint])

Attachments

(3 files, 3 obsolete files)

The test that landed in bug 967568 has been failing frequently ever since it landed.
We'll need to investigate these failures.

I would like to skip the tests for now until we find out why these are failing at such an alarming rate.

Mihaela, can you prepare a skip patch, please?
Attached patch skip patch v1 (obsolete) — Splinter Review
Attachment #8431530 - Flags: review?(andrei.eftimie)
Attached patch skip patch v1.1 (obsolete) — Splinter Review
A small fix
Attachment #8431530 - Attachment is obsolete: true
Attachment #8431530 - Flags: review?(andrei.eftimie)
Attachment #8431540 - Flags: review?(andrei.eftimie)
Comment on attachment 8431540 [details] [diff] [review]
skip patch v1.1

Review of attachment 8431540 [details] [diff] [review]:
-----------------------------------------------------------------

Disabled:
http://hg.mozilla.org/qa/mozmill-tests/rev/76af1b4a9e74 (default)
http://hg.mozilla.org/qa/mozmill-tests/rev/da86b3cb08ef (mozilla-aurora)
Attachment #8431540 - Flags: review?(andrei.eftimie) → review+
Test is now disabled on both Aurora and Nightly, we'll have to investigate why it fails.
Not sure why All is listed as the platform for this bug. This is clearly Windows-only. Also, it's not only the *shavar.cache file but all of the goog-malware-* files, as seen in the reports. Are those the files which should not exist on Windows? If that is the case, why haven't we seen this when the test was created? I feel it's a Firefox issue we should investigate quickly.
OS: All → Windows 7
Priority: P2 → P1
Summary: Test failure 'profile/safebrowsing/goog-malware-shavar.cache exists' in /restartTests/testSafeBrowsing_initialDownload/test2.js → Test failure 'profile/safebrowsing/goog-malware-* exists' in /restartTests/testSafeBrowsing_initialDownload/test2.js
Whiteboard: [mozmill-test-failure] → [mozmill-test-failure][mozmill-test-skipped]
Also, those failures go back a couple of days. Why wasn't this bug filed earlier, when it first appeared? Especially given that amount of failures. We have 4 days of no action here! :(
(In reply to Henrik Skupin (:whimboo) from comment #5)
> Not sure why all platforms is listed for this bug. This is clearly Windows
> only.
How could this _clearly_ be a Windows-only problem when it fails across platforms?

The failure indicates that the specified files are not present in the tested profile (when they should be).
We wait for up to 10 seconds _for each_ file, so a timeout problem seems unlikely.
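The per-file wait described here can be modelled as a generic polling loop. This is an illustrative sketch with an injected fake clock so the logic is self-contained, not the actual mozmill waitFor implementation:

```javascript
// Poll a predicate until it returns true or the timeout elapses.
// The clock is injected so the loop can be exercised without real sleeps.
function waitFor(predicate, timeoutMs, intervalMs, clock) {
  const deadline = clock.now() + timeoutMs;
  while (!predicate()) {
    if (clock.now() >= deadline) {
      return false; // timed out, as in the "file exists" failures here
    }
    clock.sleep(intervalMs);
  }
  return true;
}

// Fake clock: time advances only when sleep() is called.
function makeFakeClock() {
  let t = 0;
  return { now: () => t, sleep: (ms) => { t += ms; } };
}
```

With a 10 s timeout per file, a file that only appears after the deadline makes the poll return false, which is exactly the failure shape reported.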
OS: Windows 7 → All
Oh, sorry. Looks like the other platforms only failed on the day you filed the bug; before that, only Windows was failing. It would be good to know how long we have to wait until the files are present. Can this be reproduced locally? It may be good to run this through a modified waitFor() call.
I'll take this to see what the problem is.
Assignee: mihaela.velimiroviciu → andrei.eftimie
Not able to reproduce the problem locally.

One thing: stopApplication(true) does not clear the files for me. This means that we _have_ the files already when we start the test (since we are using the same profile). I've tried the recently implemented sanitize calls, and that doesn't clean them up either.

We should add a step to test that the files are _not_ present when the test starts.

I'll test on a CI machine.
(In reply to Andrei Eftimie from comment #10)
> One thing: stopApplication(true) does not clear the files for me. This means
> that we _have_ the files already when we start the test (since we are using
> the same profile). I've tried the recently implemented sanitize calls, and
> that doesn't clean them up either.

That is exactly what I meant earlier to you last week, that mozprofile might not take care about the cache folder. Is that the case? If yes, we have to fix it there.
I ran 1,000 test runs on mm-win-8-32-4 last night against Aurora, out of which 477 failed with this problem. That's a 47% failure rate.

I'm now checking it from more angles (seeing if running the test itself in a loop exhibits the issue, checking the profile folder, and seeing if I can easily reproduce it on other systems, hopefully locally).
Wireshark/tcpdump may be useful to see whether the problem lies with the remote side. What if they start throttling the test machine's IP? I'm not saying that this is what causes the failure, but it's another possibility to keep in mind.
Valid tip Gian-Carlo, I'll look into that.
I've noticed that the `safebrowsing` folder gets moved to `safebrowsing-to_delete` and/or `safebrowsing-backup` during some test before this one (testAddons_InstallAddonWithoutEULA, which is ATM skipped).

This is handled here: http://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/Classifier.cpp

Wondering if there might be some kind of race condition. We reference the profile folder `ProfD` and append `safebrowsing` to it as an nsIFile object.

If the folder is renamed to `safebrowsing-to_delete` does our initial reference change as well?
This might explain our failures here.
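The suspected race can be modelled in a few lines: a reference that captures a path once keeps pointing at the original location after a rename on disk. The folder names mirror this bug; the helpers themselves are hypothetical, not the nsIFile API:

```javascript
// A "folder reference" that stores its path once at creation time,
// the way an nsIFile built from ProfD + "safebrowsing" would.
function makeFolderRef(profileDir, leaf) {
  const fullPath = profileDir + "/" + leaf; // captured once, never refreshed
  return { get path() { return fullPath; } };
}

// Simulated on-disk state: rename moves the entry to a new key.
function renameDir(disk, from, to) {
  disk[to] = disk[from];
  delete disk[from];
}
```

After renameDir moves `safebrowsing` to `safebrowsing-to_delete`, the reference still reports the original path, which no longer exists on the simulated disk. (Comment 16 later confirms nsIFile behaves this way.)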
Sounds reasonable. You can't keep the reference to the folder alive across a SafeBrowsing update.
The reference does not appear to change.
I've renamed the folder manually, and nsIFile still pointed to the right (original) folder.
This, plus the long wait time we use in the test, makes what I described in comment 15 seem unlikely.

I'll add some dump statements to see exactly which paths we test, and run it again in a loop on the CI machine where I reproduced it earlier (I can't reproduce it at all locally).
We are checking the correct path:
> path:c:\Users\mozauto\Desktop\1018161\workspace\profile\safebrowsing\goog-downloadwhite-digest256.pset
> ERROR | Test Failure | {
>   "fail": {
>     "functionName": "verifyFilesExistence/<",
>     "message": "c:\\Users\\mozauto\\Desktop\\1018161\\workspace\\profile\\safebrowsing\\goog-downloadwhite-digest256.pset exists",
>     "fileName": "file:///c:/Users/mozauto/Desktop/1018161/workspace/mozmill-tests/firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js",
>     "lineNumber": 62
>   }
> }

Trying to reduce this to (hopefully) a testcase.
Andrei, can you please check which of the files are present and which are not? That would also be of help.
No failures at all if I only run this test.
A previous test seems to be required.

> Andrei, can you please check which of the files are present and which are not?
> That would also be of help.
This is interesting.

While all (or most) of the rest were missing, I haven't seen any of these in the error messages:
> "goog-badbinurl-shavar.cache",
> "goog-badbinurl-shavar.pset",
> "goog-badbinurl-shavar.sbstore"
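Bucketing the expected files into present vs. missing from the failure messages could be done with a small helper. The file names are the ones quoted in this bug; the helper itself is illustrative only:

```javascript
// Partition an expected file list by whether each name appeared in an
// "exists" failure message (i.e. was reported missing from the profile).
function partitionFiles(expected, failedNames) {
  const failed = new Set(failedNames);
  return {
    missing: expected.filter((f) => failed.has(f)),
    present: expected.filter((f) => !failed.has(f)),
  };
}
```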
Unable to reproduce locally at all (with the same conditions as the CI machine).

If this is network related, maybe it's the proxy used on the CI machines doing some throttling.
(In reply to Andrei Eftimie from comment #21)
> If this is network related, maybe its the proxy used on CI machines doing
> some throttling.

Michael, do you have an idea if that could be happening due to squid?
Flags: needinfo?(mhenry)
The proxies are not configured to throttle and they don't appear to be under heavy load.

That said, if it were the proxies, logic would dictate it wouldn't just happen on Windows tests.

I'm not able to easily deduce from this bug what machine the connections are coming from and going to.  If I had that information I could check the logs to be sure.

From what I can read I don't think they are the problem, but I could be wrong, so will check and make sure.
Flags: needinfo?(mhenry)
Is this test still disabled? I'm about to make a pretty big safebrowsing change, it would be great to have confirmation that I didn't break everything.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #24)
> Is this test still disabled? I'm about to make a pretty big safebrowsing
> change, it would be great to have confirmation that I didn't break
> everything.

Unfortunately yes, the test is still disabled due to the large number of failures on the CI machines (which we can't reproduce locally).
We are swamped finishing our goals for this quarter and haven't had time to investigate these failures further.
Thanks Andrei. How can I run this test locally?
Monica, please check https://developer.mozilla.org/en-US/docs/Mozilla/QA/Mozmill_tests#The_test_repository

Then make sure you select the correctly named branch for the version of Firefox under test. You will also have to update the manifest file or, better, back out the skip patch attached to this bug. Then you can run the test by specifying the path to the test's manifest file.
Andrei, do you have an update for this P1 bug? It's lingering around for a while, and looks to be important to get investigated.
Flags: needinfo?(andrei.eftimie)
Cosmin will take over here.
Assignee: andrei.eftimie → cosmin.malutan
Flags: needinfo?(andrei.eftimie)
I couldn't reproduce this locally or on a CI node. I will look into it again tomorrow; if it's still unreproducible, I think we could re-enable it.
I ran the test alone, the restart directory, and complete testruns more than 50 times each, but it didn't reproduce. Node: mm-win-81-64.
Considering that this issue occurred only on CI nodes and never reproduced locally, I'm inclined to think it never affected the build itself but was an environment-related issue which is gone now.

Andreea or Andrei can one of you please backout the skip patch?
Flags: needinfo?(andrei.eftimie)
Flags: needinfo?(andreea.matei)
Backed out skip on nightly:
https://hg.mozilla.org/qa/mozmill-tests/rev/f17307d4e075 (default)

Cosmin, please monitor the results.
Flags: needinfo?(cosmin.malutan)
Flags: needinfo?(andrei.eftimie)
Flags: needinfo?(andreea.matei)
Andrei, it hasn't failed on nightly so far; maybe we can unskip it on aurora too?
All is fine on Aurora as well.
Transplanted unskip across branches:

https://hg.mozilla.org/qa/mozmill-tests/rev/e0942119a8f8 (mozilla-beta)
https://hg.mozilla.org/qa/mozmill-tests/rev/613325ea686a (mozilla-release)
https://hg.mozilla.org/qa/mozmill-tests/rev/83101002d6f9 (mozilla-esr31)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(cosmin.malutan)
Resolution: --- → FIXED
Whiteboard: [mozmill-test-failure][mozmill-test-skipped] → [mozmill-test-failure]
This still failed on esr31, but on no other branches. If it doesn't fail on the other branches, I'll need to find a regression range to see when it got fixed.
http://mozmill-daily.blargon7.com/#/remote/report/0d1d39eca2ce08240b0fc576872d38c8
http://mozmill-daily.blargon7.com/#/remote/report/0d1d39eca2ce08240b0fc576872d51eb
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Can we please get the latest landing backed-out again on esr31?
Attached patch skip.patchSplinter Review
I'm not going to re-add the old patch since that also had __force_skip__ statements.

Here is a new skip patch. It is intended only for ESR31, where this failure seems to persist.
Attachment #8431540 - Attachment is obsolete: true
Attachment #8486359 - Flags: review?(andreea.matei)
Comment on attachment 8486359 [details] [diff] [review]
skip.patch

Review of attachment 8486359 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Please land asap. Thanks
Attachment #8486359 - Flags: review?(andreea.matei) → review+
This is also failing a lot on release! I don't see any statement about testing this on beta, release, and esr31 in comment 35, so what was the claim that this can be backported based on? I'm going ahead and skipping this test again on beta and release. This needs much more careful testing before we reland it. Also, please make sure to give proper commit messages when backing out skip patches etc.; otherwise they are very hard to find in the pushlog.
(In reply to Henrik Skupin (:whimboo) from comment #41)
> This is also failing for release a lot! I do not see any statement for
> testing this on beta, release, and esr31 by comment 35. So on what was the
> statement based that this can be backported? I'm going ahead and really skip
> this test again on beta and releases. This needs way more closely testing
> before we shall reland this again.

I've been rerunning these tests on an scl3 machine.
I can indeed still reproduce the issue on 32 (release),
but it appears to be fixed on 33 (beta).

I will re-enable the test for 33 and try to find what fixed it.

> Also please make sure to give proper
> commit messages when backing out skip patches etc. Otherwise it is very hard
> to find in the pushlog.


Oh, I usually include at least the bug number. Seems I missed it in this case.
I've run more tests today with esr31, 32 and 33b.

Even with esr31 this failed intermittently.

Some notes from my testing today:
- ran tests on Win81
- I also tested a staging Win81 machine; it failed on this one as well
- using a custom folder instead of temp for the profile (i.e. --workspace or not) didn't seem to have any effect
- I was only able to reproduce this with a full testrun (running mozmill with --profile caused disconnects, and I was not able to properly run tests this way on these machines)

I will need to run a few more tests against beta to make sure that it is fixed there before I attempt to unskip it.
Assignee: cosmin.malutan → andrei.eftimie
Interestingly, this failed _once_ on Aurora 34, on OSX 10.6, en-US:
http://mozmill-daily.blargon7.com/#/remote/report/f11964362bbc85ebfbc5b36de77b7b83

Looks to be a different issue than we had before, which had a high failure rate and occurred mostly (all?) on Windows.

> /Users/mozauto/jenkins/workspace/mozilla-aurora_remote/data/profile/safebrowsing/goog-phish-shavar.cache exists
> /Users/mozauto/jenkins/workspace/mozilla-aurora_remote/data/profile/safebrowsing/goog-phish-shavar.pset exists
> /Users/mozauto/jenkins/workspace/mozilla-aurora_remote/data/profile/safebrowsing/goog-phish-shavar.sbstore exists

This denotes these files were not found in the profile where and when we expected them to be.
I've run this against beta on a CI machine and wasn't able to reproduce the failure at all.
I'm going to re-enable the test on Beta (it will remain disabled on Release and ESR31 where it still fails).

The failure mentioned in comment 45 is a different one, if that happens again we need to file a new bug for that.

Re-enabled:
https://hg.mozilla.org/qa/mozmill-tests/rev/3f927dc8d38b (mozilla-beta)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
I don't want to see this disabled for the next 6 release cycles on esr31. We have to check what fixed this problem, or at least debug it and find out what the problem is here.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Henrik Skupin (:whimboo) from comment #47)
> I don't want to see this disabled all the next 6 release cycles for esr31.
> We have to check what fixed this problem, or at least to debug and find out
> what's the problem here.

Indeed it would be nice to know that.
Even if we do find what fixed it, what are the chances the fix will be transplanted to ESR31?

We can still only reproduce this from within SCL3
How often does it reproduce? Have you created an HTTP log yet? If this is really, e.g., proxy related, installations in large companies might be affected too, and those mainly use our ESR releases. If you can reproduce it, try to check how long a wait is necessary until those files are present.
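The measurement asked for here, how long until the files show up rather than just pass/fail, could be sketched like this. The clock is injected (fake) so the sketch is self-contained; in the real test this would wrap the existing file-existence check:

```javascript
// Like a waitFor, but returns the elapsed time until the predicate first
// held, or -1 if it never held within maxMs. Useful to learn how large a
// timeout the test actually needs.
function timeUntil(predicate, maxMs, stepMs, clock) {
  const start = clock.now();
  while (clock.now() - start < maxMs) {
    if (predicate()) return clock.now() - start;
    clock.sleep(stepMs);
  }
  return -1; // never became true within maxMs
}
```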
I spent a couple of hours today trying to reproduce the issue on mm-win-81-32-3, with no luck.
There seem to be some ongoing issues, as I was getting a lot of disconnect failures.
I was only able to test about 4 builds in 2 hours.

I'll leave this issue in the backlog for now.

So what's left here is to try finding what fixed the issue (probably something between Firefox 32 and Firefox 33)
Assignee: andrei.eftimie → nobody
Priority: P1 → P2
You could utilize the Win 8.1 staging box for testing the builds. ondemand remote can run release and beta builds, so there's no need to work with an offline node; just let it run in the background.
(In reply to Henrik Skupin (:whimboo) from comment #51)
> You could utilize the Win 8.1 staging box for testing the builds. ondemand
> remote can run release and beta builds. So no need to work with an offline
> node, just let it run in the background.

I'll try this tomorrow. AFAIR I wasn't able previously to reproduce this on staging machines.
Whiteboard: [mozmill-test-failure] → [mozmill-test-failure][sprint]
(In reply to Andreea Matei [:AndreeaMatei] from comment #53)
> Failed once with Aurora zh-CN, Win 8:
> http://mozmill-daily.blargon7.com/#/remote/report/
> c1ae8473ee5b384b83bbfde33131885f
> 
> We might have to recheck this.

As said above, if we have new failures, they should be tracked in a different place. The issue in this bug has been fixed (except on ESR31).

We had an apparent outage on the 17th, and I filed bug 1085286 to track those.
Duplicate of this bug: 1085286
I've noticed this test failing repeatedly locally with current Nightly on OS X 10.9.5, while I've been working on bug 1076741. My most recent testrun report is here:

http://mozmill-crowd.blargon7.com/#/remote/report/24f9dc1b051c99c8247bed4e780b0642

I've just run this test again locally a number of times... 

* several times with TIMEOUT = 10000, and it always failed.

* twice with TIMEOUT = 11250, and it always failed.

* twice at each value, with TIMEOUT = 100000, 50000, 20000, 15000, and 12500, and it always passed.
Can we get this test fixed by bumping the timeouts as Barbara suggests in comment 56?
Flags: needinfo?(hskupin)
(In reply to Andrei Eftimie from comment #48)

> Indeed it would be nice to know that.
> Even if we do find what fixed it, what are the chances the fix will be
> transplanted to ESR31?

I suspect it was this:
https://bugzilla.mozilla.org/show_bug.cgi?id=1045163

Which shortens the initial update delay from several minutes to several seconds.

But it's not the kind of thing I'd want to uplift to ESR.
Given that no one has replied yet, I want to add that bumping the timeout for the waitFor call would be totally fine. I would be happy to bump it even to 30s, given that it is a remote activity.

Barbara, would you mind putting up a patch which fixes the problem for you? Thanks!
Flags: needinfo?(galgeek)
Yes, I'll submit a patch in a little bit.
Flags: needinfo?(galgeek)
Attached patch Increase TIMEOUT (obsolete) — Splinter Review
This patch increases TIMEOUT from 10000 to 20000.

It may still be helpful to increase it more, but this value enables the test to pass reliably for me on the current network.
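The effect of the bump reduces to a simple bound check: a run passes only if the files arrive before the timeout. A sketch, with an arrival window inferred from the local runs in comment 56 (failures at TIMEOUT <= 11250 ms, passes at >= 12500 ms); the helper name is made up:

```javascript
// The patch only raises a constant; whether a run passes reduces to
// whether the files' arrival time beats that bound.
const OLD_TIMEOUT = 10000; // value before the patch
const NEW_TIMEOUT = 20000; // value this patch proposes

function passesWithin(arrivalMs, timeoutMs) {
  return arrivalMs <= timeoutMs;
}
```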
Attachment #8532726 - Flags: review?(hskupin)
Assignee: nobody → galgeek
Flags: needinfo?(hskupin)
Comment on attachment 8532726 [details] [diff] [review]
Increase TIMEOUT

Review of attachment 8532726 [details] [diff] [review]:
-----------------------------------------------------------------

::: firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js
@@ +5,5 @@
>  "use strict";
>  
>  Cu.import("resource://gre/modules/Services.jsm");
>  
> +const TIMEOUT = 20000;

As mentioned in my last comment, please make this 30s, because that's the timeout we use by default for remote content to load.
Attachment #8532726 - Flags: review?(hskupin) → review-
I'm sorry for the confusion. This patch increases TIMEOUT to 30000.
Attachment #8532726 - Attachment is obsolete: true
Attachment #8534162 - Flags: review?(hskupin)
Attachment #8534162 - Flags: review?(hskupin) → review+
https://hg.mozilla.org/qa/mozmill-tests/rev/3771f86ae0c1 (default)

If nothing regresses, I will backport this patch to all the other branches later. Hopefully it will allow us to re-enable the test for the esr31 branch.

Thanks Barbara!
https://hg.mozilla.org/qa/mozmill-tests/rev/33b5365f2cbd (aurora)
https://hg.mozilla.org/qa/mozmill-tests/rev/0e4577925cd4 (beta)
https://hg.mozilla.org/qa/mozmill-tests/rev/8e91aceaea57 (release)
https://hg.mozilla.org/qa/mozmill-tests/rev/16688d6cd77f (esr31)

Barbara, would you mind checking with an ESR31 build whether you still see this failure on this branch or whether it is gone? If the latter is the case, we could unskip the test.
Flags: needinfo?(galgeek)
This test passes for me against ESR31.

Here's the command and its output:

$ mozmill -t firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js -b /Applications/FirefoxESR.app/
mozversion INFO | application_buildid: 20141125031119
mozversion INFO | application_changeset: f416e15cc2c5
mozversion INFO | application_display_name: Firefox
mozversion INFO | application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
mozversion INFO | application_name: Firefox
mozversion INFO | application_repository: https://hg.mozilla.org/releases/mozilla-esr31
mozversion INFO | application_vendor: Mozilla
mozversion INFO | application_version: 31.3.0
mozversion INFO | platform_buildid: 20141125031119
mozversion INFO | platform_changeset: f416e15cc2c5
mozversion INFO | platform_repository: https://hg.mozilla.org/releases/mozilla-esr31
mozversion INFO | platform_version: 31.3.0
TEST-START | /Users/bara/Dev/mozmill-tests/firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js | setupModule
TEST-START | /Users/bara/Dev/mozmill-tests/firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js | testSafeBrowsing_initialDownload
TEST-PASS | /Users/bara/Dev/mozmill-tests/firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js | testSafeBrowsing_initialDownload
TEST-END | /Users/bara/Dev/mozmill-tests/firefox/tests/remote/restartTests/testSafeBrowsing_initialDownload/test2.js | finished in 13210ms
RESULTS | Passed: 1
RESULTS | Failed: 0
RESULTS | Skipped: 0
Flags: needinfo?(galgeek)
Thanks, Barbara! Let's see if it sticks now. I backed out the skip via:

https://hg.mozilla.org/qa/mozmill-tests/rev/93801a2f6041 (esr31)

Let's hope we don't have to reopen it again.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
The failures reappeared.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Could someone please check on one of those boxes whether this is reproducible? It seems to fail all the time, so the chances are very good.
Given that it is failing constantly on esr31, we should really get this investigated on our machines. Andreea, who can take care of it?

I disabled the test again:
https://hg.mozilla.org/qa/mozmill-tests/rev/bcaf161e8686 (esr31)
Assignee: galgeek → nobody
Priority: P2 → P1
Whiteboard: [mozmill-test-failure][sprint] → [mozmill-test-failure][mozmill-test-skipped][sprint]
Flags: needinfo?(andreea.matei)
Please see comment 58, which appears to have been overlooked. Unless you're willing to extend the wait to several minutes, the ESR31 problem isn't fixable, because ESR31 has a much longer (random) delay before it starts the updates, and we'd prefer not to uplift that change.
Flags: needinfo?(hskupin)
Oh, I see. It was indeed overlooked by me. Thanks for pointing it out!

So it's indeed something we cannot cover on ESR31, and we might want to completely remove this test from that branch.
Flags: needinfo?(hskupin)
Created a new patch that removes testSafeBrowsing_initialDownload from remote/restartTests.
Attachment #8540060 - Flags: review?(mihaela.velimiroviciu)
Attachment #8540060 - Flags: review?(andreea.matei)
Comment on attachment 8540060 [details] [diff] [review]
removeTestSafeBrowsing

Review of attachment 8540060 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Daniela!
http://hg.mozilla.org/qa/mozmill-tests/rev/5b95d40da9aa (esr31)
Attachment #8540060 - Flags: review?(mihaela.velimiroviciu)
Attachment #8540060 - Flags: review?(andreea.matei)
Attachment #8540060 - Flags: review+
So I think we're good to close now.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Flags: needinfo?(andreea.matei)
Resolution: --- → FIXED
Thanks!
Whiteboard: [mozmill-test-failure][mozmill-test-skipped][sprint] → [mozmill-test-failure][sprint]
Product: Mozilla QA → Mozilla QA Graveyard