Of course it passes on Win debug so I never caught this on my Windows dev box.
This blocks the next merge, so I'll investigate. Worst case we'll have to disable this test for Windows opt for now.
I can't reproduce this with an opt build on my Windows box. FML. I kicked off a try build a few hours ago with some extra debugging output: http://tbpl.mozilla.org/?tree=Try&rev=979d3671b5a7. But I'm not confident I can actually fix the test based on that, so disabling it on Windows opt seems like the best alternative right now.
Try build finally completed. The log output is very puzzling:
> Reading 55 bytes from file: c:\docume~1\cltbld\locals~1\temp\tmpixy1df\weave\logs\1308070538123.log
> error is: 0
> data is: MESSAGE is: this WILL show up
> TEST-PASS | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | [null : 83] true == true
> TEST-UNEXPECTED-FAIL | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | -1 != -1 - See following stack:
So inputStream.available() is reporting the correct number (55 bytes), there's no read error, but the 'data' we read appears to be garbled somehow. We don't even get to print a newline before printing the expected MESSAGE which leads me to believe 'data' contains a \0 character.
Created attachment 539353 [details] [diff] [review]
disable test on win opt
Can't really figure this out right now, so I went ahead and disabled the test on Win opt for now. Turns out, this wasn't as easy as it sounds. Can't wait until xpcshell manifests step up here.
Leaving this bug open to track and hopefully sooner than later resolve the actual failure.
What looks like the same error is getting perma-orange on linux opt builds on the build-system tree since the landing of bug 680440, which effectively enables jemalloc in xpcshell. Before bug 680440, xpcshell wasn't using jemalloc on linux, which I think was a mistake.
So, this now happens on linux and linux64 opt, happened on m-c when b-s was merged, and I confirmed on try that it is triggered on windows when jemalloc is enabled, and doesn't happen when jemalloc is disabled. So this pretty much looks like an issue the appear when jemalloc is enabled.
Since the solution on windows was to disable the test, I'm tempted to disable the test entirely for now.
It feels like this bug which is about Win opt failures is being hijacked for a jemalloc related change. Windows is known to be erratic about file I/O which is why we've begrudgingly accepted disabling this test on Windows opt *only*.
If we're now seeing a failure based on other factors now, that's very unfortunate, but we should investigate why. At the very least it seems suspicious that a change unrelated to Sync and file I/O creates this random orange.
If there's no time for a detailed investigation because, uh, the tree is orange half the time, I get it. But then we should have a *separate* bug that tracks that and the full disabling of this test, rather than assuming this is the same issue that causes the Win opt failures.
Let me restate it:
- If I enable the test on windows, I see the orange, which was what this bug was.
- If I enable the test on windows, and disable jemalloc, the orange disappears
What changed on linux is that now jemalloc is used in xpcshell which was not the case before. jemalloc has been used in xpcshell in windows for a while. So the result of this change is:
- The test fails when jemalloc is enabled
- The test doesn't fail when jemalloc is disabled
I can't believe this is a coincidence.
It does look like a latent race condition that jemalloc makes much more likely.
When the test works:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036331339.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "1315036331339\tSync.Test.FileLog\t"..., 55) = 55
When the test fails:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036065010.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 55) = 55
(In reply to Mike Hommey [:glandium] from comment #10)
> Let me restate it:
> - If I enable the test on windows, I see the orange, which was what this bug
> - If I enable the test on windows, and disable jemalloc, the orange
> What changed on linux is that now jemalloc is used in xpcshell which was not
> the case before. jemalloc has been used in xpcshell in windows for a while.
> So the result of this change is:
> - The test fails when jemalloc is enabled
> - The test doesn't fail when jemalloc is disabled
> I can't believe this is a coincidence.
Thanks for the clarification, Mike. My apologies for my initial skepticism, but this didn't appear as clear cut to me from comment 6-8. Comment 12, however, matches my investigations from comment 3, so it very much looks like the same issue. Thanks for looking into that.
My initial suspicion is that we might be doing something fishy to the storage streams that we're logging to. I'll have a look at that in the morning. In the meantime I'd be ok with disabling the test if it turns out to be a high frequency orange on the tree. Please reference the disabling commit.
It is effectively perma orange on our test bots, but on a local build it's intermittent.
It only appeared on one push on m-c (a build-system merge), which was backed out for other reasons. So at the moment, this only happens on the build-system branch. We can keep it that way for now, I think, until we merge it again, but since the merge also breaks comm-central, we're going to wait a bit.
Note the test file name recently changed.
I get a reliable green if I increase STREAM_SEGMENT_SIZE in services/sync/modules/log4moz.js (I used 4096 * 8)
It is starting to smell like a bad bug in nsStorageStream's use of nsSegmentedBuffer.
FWIW, on the contrary, reducing STREAM_SEGMENT_SIZE makes it more reliably orange on my machine.
I found the problem, and why it only happens with jemalloc: It happens when nsStorageStream::Close is called before the corresponding nsStorageInputStream has been completely read from. In such situation, nsStorageStream::Close reallocates the last segment in the segment buffer. With the system allocator, the reallocated buffer gets the same address. With jemalloc, it changes. The problem is that then the corresponding nsStorageInputStream still contains a pointer to the old buffer segment. If the buffer has been reused for something else, it breaks.
This means nsStorageInputStream should store offsets in the segment buffers instead of pointers there.
Created attachment 558060 [details] [diff] [review]
Store offset inside segment instead of pointer for nsStorageInputStream
This uses offsets instead of pointers to store the current location in the StorageInputStream.
See comment 18 for a description of the problem being fixed.
Created attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows
Comment on attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows