Closed Bug 909933 Opened 11 years ago Closed 10 years ago

Intermittent browser_thumbnails_update.js | thumbnail is still green - Got 255,0,0, expected 0,255,0

Categories

(Firefox :: Tabbed Browser, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 916497

People

(Reporter: RyanVM, Unassigned)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=27078953&tree=Mozilla-Inbound

Windows XP 32-bit mozilla-inbound debug test mochitest-browser-chrome on 2013-08-27 12:24:24 PDT for push 1dce2c72ea15
slave: t-xp32-ix-102

13:31:01     INFO -  TEST-PASS | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | c:\docume~1\cltbld~1.t-x\locals~1\temp\tmpahnfov\thumbnails\eae6197cdeceaf2b806a2119a1bf8ebe.png should exist
13:31:02     INFO -  ++DOMWINDOW == 4 (06685558) [serial = 4] [outer = 05457BE8]
13:31:02     INFO -  ++DOMWINDOW == 5 (0669D488) [serial = 5] [outer = 05457BE8]
13:31:02     INFO -  TEST-PASS | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | modified time should be >= now
13:31:02     INFO -  ++DOCSHELL 23A97628 == 44 [id = 6315]
13:31:02     INFO -  ++DOMWINDOW == 308 (158E79F8) [serial = 16140] [outer = 00000000]
13:31:02     INFO -  ++DOMWINDOW == 309 (1DA2E4F0) [serial = 16141] [outer = 158E79F8]
13:31:02     INFO -  [Parent 3068] WARNING: NS_ENSURE_TRUE(mMutable) failed: file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/netwerk/base/src/nsSimpleURI.cpp, line 266
13:31:02     INFO -  [Parent 3068] WARNING: NS_ENSURE_TRUE(mMutable) failed: file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/netwerk/base/src/nsSimpleURI.cpp, line 266
13:31:02     INFO -  ++DOMWINDOW == 310 (35E33B88) [serial = 16142] [outer = 158E79F8]
13:31:02     INFO -  --DOCSHELL 1EA17A40 == 43 [id = 6308]
13:31:02     INFO -  --DOCSHELL 1D93C350 == 42 [id = 6307]
13:31:02     INFO -  --DOCSHELL 245E4420 == 41 [id = 6306]
13:31:02     INFO -  --DOCSHELL 0BB3BAB0 == 40 [id = 6305]
13:31:02     INFO -  --DOCSHELL 1EBA7F58 == 39 [id = 6304]
13:31:02     INFO -  --DOCSHELL 20561520 == 38 [id = 6303]
13:31:02     INFO -  --DOCSHELL 0D444A00 == 37 [id = 6302]
13:31:02     INFO -  --DOCSHELL 150CC970 == 36 [id = 6301]
13:31:02     INFO -  --DOCSHELL 13210A70 == 35 [id = 6300]
13:31:02     INFO -  --DOCSHELL 14B7F8A0 == 34 [id = 6299]
13:31:02     INFO -  --DOCSHELL 36838F80 == 33 [id = 6298]
13:31:02     INFO -  --DOCSHELL 3772E4A0 == 32 [id = 6297]
13:31:02     INFO -  --DOCSHELL 14639098 == 31 [id = 6296]
13:31:02     INFO -  --DOCSHELL 137A85E0 == 30 [id = 6295]
13:31:02     INFO -  --DOCSHELL 21211FE8 == 29 [id = 6294]
13:31:02     INFO -  --DOMWINDOW == 309 (218F2E40) [serial = 16040] [outer = 00000000] [url = about:privatebrowsing]
13:31:02     INFO -  --DOMWINDOW == 308 (36FE3C40) [serial = 16047] [outer = 00000000] [url = chrome://global/content/mozilla.xhtml]
13:31:02     INFO -  --DOMWINDOW == 307 (23D07D90) [serial = 16039] [outer = 00000000] [url = about:blank]
13:31:02     INFO -  --DOCSHELL 230DB190 == 28 [id = 6293]
13:31:02     INFO -  --DOCSHELL 1E305C28 == 27 [id = 6292]
13:31:02     INFO -  --DOCSHELL 0B9FB100 == 26 [id = 6291]
13:31:02     INFO -  --DOCSHELL 37CC0B28 == 25 [id = 6290]
13:31:02     INFO -  --DOCSHELL 1FE8E1D8 == 24 [id = 6289]
13:31:02     INFO -  --DOCSHELL 128A3FB8 == 23 [id = 6288]
13:31:02     INFO -  --DOCSHELL 22B24160 == 22 [id = 6287]
13:31:02     INFO -  --DOMWINDOW == 306 (23F0F0D8) [serial = 16050] [outer = 00000000] [url = about:neterror?e=deniedPortAccess&u=http%3A//127.0.0.1%3A1/&c=UTF-8&d=This%20address%20uses%20a%20network%20port%20which%20is%20normally%20used%20for%20purposes%20other%20than%20Web%20browsing.%20Firefox%20has%20canceled%20the%20request%20for%20your%20protection.]
13:31:02     INFO -  --DOMWINDOW == 305 (1FAB95D8) [serial = 16030] [outer = 00000000] [url = about:blank]
13:31:02     INFO -  --DOMWINDOW == 304 (15B2DB60) [serial = 16044] [outer = 00000000] [url = http://mochi.test:8888/browser/toolkit/components/thumbnails/test/thumbnails_background.sjs?%7B%22setGreenCookie%22%3Atrue%7D]
13:31:02     INFO -  --DOMWINDOW == 303 (13ED0EC0) [serial = 16043] [outer = 00000000] [url = http://mochi.test:8888/browser/toolkit/components/thumbnails/test/thumbnails_background.sjs?%7B%22setGreenCookie%22%3Atrue%7D]
13:31:02     INFO -  [Parent 3068] WARNING: NS_ENSURE_TRUE(mMutable) failed: file e:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/netwerk/base/src/nsSimpleURI.cpp, line 266
13:31:02  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | thumbnail is still green - Got 255,0,0, expected 0,255,0
13:31:02     INFO -  Stack trace:
13:31:02     INFO -      JS frame :: chrome://mochikit/content/browser-test.js :: test_is :: line 560
13:31:02     INFO -      JS frame :: chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js :: errorResponseUpdateTest/</< :: line 105
13:31:02     INFO -      JS frame :: chrome://mochitests/content/browser/toolkit/components/thumbnails/test/head.js :: <TOP_LEVEL> :: line 162
13:31:02     INFO -      JS frame :: chrome://mochikit/content/browser-test.js :: <TOP_LEVEL> :: line 589
13:31:02     INFO -      native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0
13:31:02     INFO -  TEST-INFO | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | Running subtest goodResponseUpdateTest
The log in comment 0 shows that subtests are getting interleaved, so something bad's happening:

> 13:31:01     INFO -  TEST-PASS | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | c:\docume~1\cltbld~1.t-x\locals~1\temp\tmpahnfov\thumbnails\eae6197cdeceaf2b806a2119a1bf8ebe.png should exist
> 13:31:02     INFO -  TEST-PASS | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | modified time should be >= now
> 13:31:02  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_update.js | thumbnail is still green - Got 255,0,0, expected 0,255,0

Lines 1 and 3 are from capIfStaleErrorResponseUpdateTest, but line 2 is from capIfStaleGoodResponseUpdateTest.  The latter runs after the former.

Investigating...
Assignee: nobody → adw
Status: NEW → ASSIGNED
Oops, this test has been updated since comment 0, so I was comparing an old log to current code.  The logs after comment 0 show a similar but different error, and now I don't see what the problem could be.  A green thumbnail is on disk that's somehow overwritten by a red thumbnail, but that should not happen, because the bg service never overwrites any files, and the fg service doesn't overwrite when the response is not a 2xx.

All the failures are on Windows, though... I remember hearing over the years about flakiness on Windows due to files not being "completely" written when they should be, causing random test failures.  We're just relying on OS.File, so there ought to be similar OS.File intermittent Windows failures if that were the problem, but bug searches don't reveal any, open or closed.  A shot in the dark: PageThumbsStorage.writeData passes `flush: false` to OS.File.writeAtomic.  Maybe passing true would make a difference, but reading the writeAtomic implementation [1], I don't see how, and none of the writeAtomic tests pass `flush` anyway [2].

So I don't know what.  Maybe some logging, and changing the thumbnail colors per subtest to be sure that they aren't interfering with each other.

[1] http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_shared_front.jsm#383
[2] http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/tests/mochi/worker_test_osfile_front.js#184
(In reply to Drew Willcoxon :adw from comment #7)
> and the fg
> service doesn't overwrite when the response is not a 2xx.

And now it shouldn't overwrite one for 2 days, and the failure is the second capture of that URL IIRC.

> A shot in the dark: PageThumbsStorage.writeData passes `flush: false` to
> OS.File.writeAtomic.  Maybe passing true would make a difference, but
> reading the writeAtomic implementation [1], I don't see how, and none of the
> writeAtomic tests pass `flush` anyway [2].

writeAtomic taking that param looks like a copy-paste error to me :)  It closes the handle, so there's no need to flush (it makes sense for plain write() though)

> So I don't know what.  Maybe some logging, and changing the thumbnail colors
> per subtest to be sure that they aren't interfering with each other.

Yeah - I had a look and couldn't rationalize it either :)  That .sjs is fragile too - resetting the state second time around is a problem, so logging there would help too - pity it's not more regular, then you wouldn't have trouble hitting it on try :) IMO it wouldn't really hurt if we checked into trunk one or 2 logStringMessage (sp?) calls as a b/g capture started and completed.
This makes the test more rigorous.  It tests in a consistent manner all possible and valid combinations of capture method, response status, whether the file previously exists, and whether the file is stale.  It ensures page colors are as expected after page load, not only the colors of thumbnails that capture those pages.  It ensures thumbnail files don't exist before each subtest, and each subtest removes its file.  It generates a unique color each time the sjs loads.  It doesn't use timestamps anymore since page colors are sufficient.

I don't know if this will fix the failure.  Maybe it'll introduce others.  But I think it's sturdier than what we have now.  We can add logging later if it's not enough.
Attachment #809555 - Flags: review?(mhammond)
Comment on attachment 809555 [details] [diff] [review]
update.js rewrite

Review of attachment 809555 [details] [diff] [review]:
-----------------------------------------------------------------

IMO this change has made the test significantly harder to read/rationalize, with the main advantage being that if the tests somehow run in parallel it will be more obvious.  I'm very skeptical that this is the reason behind the oranges and if they do exist for other reasons, I don't think this is going to make it much easier to see why.

That said though, there are enough good changes here that I really can't object - but fingers-crossed this actually helps oranges and doesn't create new ones.  I'd like to suggest that this gets pushed to try and (say) 10 retriggers are done on each platform before checking in, just so we can be confident we aren't causing more problems than we are solving here.

::: toolkit/components/thumbnails/test/browser_thumbnails_update.js
@@ +109,5 @@
> +    let captureExpected = (capIfStale &&
> +                           ((responseOK && (!exists || stale)) ||
> +                            (!responseOK && !exists))) ||
> +                          (!capIfStale &&
> +                           (responseOK || !exists));

this is a very complex expression to wrap your head around.  I think I'd prefer to see it written as something like:

if (capIfStale)
    captureExpected = responseOK ? !exists || stale : !exists;
else
    captureExpected = responseOK || !exists;

as I can mentally decode that far easier than the original.

I also think we need some more info() call sprinkled here - if this test still has the same oranges as before, it seems it will be hard to trace the logic purely from the logs.
Attachment #809555 - Flags: review?(mhammond) → review+
https://tbpl.mozilla.org/?tree=Try&rev=30dc8b9b87b2

So far one failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=28429753&tree=Try

Again, a fresh thumbnail is on disk, but apparently the subsequent captureAndStoreIfStale overwrites it.  I'm really starting to suspect Windows file system flakiness.  I'll add some logging to PageThumbs and try again.
https://tbpl.mozilla.org/?tree=Try&rev=f2137d089490

One failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=28490547&tree=Try

This time, a thumbnail is on disk, we load a 401 page, call captureAndStore, which writes the new thumbnail with noOverwrite=true, but still the old thumbnail is overwritten with the errored thumbnail.  There's no writeData "The file exists" error like there is during test 3, when noOverwrite=true due to a 401 and the file exists.  Otherwise everything is working as expected during the test.

I'm going to have to add logging to OS.File now because this here is crazy.
Whiteboard: [triage]
Whiteboard: [triage]
Whiteboard: p=0
Seeing this more frequently with mochitest-bc running in chunks.
Flags: firefox-backlog?
Flags: firefox-backlog? → firefox-backlog+
No longer blocks: fxdesktopbacklog
Patch in bug 916497 touches code used by this test and the most recent occurrence is the day that landed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Assignee: adw → nobody
Flags: firefox-backlog+ → firefox-backlog-
Whiteboard: p=0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: