Last Comment Bug 664090 - test_service_filelog.js/test_errorhandler_filelog.js failure when jemalloc is enabled
: test_service_filelog.js/test_errorhandler_filelog.js failure when jemalloc is...
Status: RESOLVED FIXED
fixed-in-bs
:
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla9
Assigned To: Mike Hommey [:glandium]
:
Mentors:
Depends on:
Blocks: 610832
  Show dependency treegraph
 
Reported: 2011-06-14 01:39 PDT by Philipp von Weitershausen [:philikon]
Modified: 2011-09-15 19:19 PDT (History)
2 users (show)
khuey: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
disable test on win opt (2.74 KB, patch)
2011-06-14 16:10 PDT, Philipp von Weitershausen [:philikon]
no flags Details | Diff | Review
Store offset inside segment instead of pointer for nsStorageInputStream (3.49 KB, patch)
2011-09-03 05:53 PDT, Mike Hommey [:glandium]
benjamin: review+
Details | Diff | Review
Enable test_errorhandler_filelog.js on windows (1.21 KB, patch)
2011-09-05 09:39 PDT, Mike Hommey [:glandium]
philipp: review+
Details | Diff | Review

Description Philipp von Weitershausen [:philikon] 2011-06-14 01:39:08 PDT
Of course it passes on Win debug so I never caught this on my Windows dev box.
Comment 1 Philipp von Weitershausen [:philikon] 2011-06-14 01:51:02 PDT
This blocks the next merge, so I'll investigate. Worst case we'll have to disable this test for Windows opt for now.
Comment 2 Philipp von Weitershausen [:philikon] 2011-06-14 09:43:14 PDT
I can't reproduce this with an opt build on my Windows box. FML. I kicked off a try build a few hours ago with some extra debugging output: http://tbpl.mozilla.org/?tree=Try&rev=979d3671b5a7. But I'm not confident I can actually fix the test based on that, so disabling it on Windows opt seems like the best alternative right now.
Comment 3 Philipp von Weitershausen [:philikon] 2011-06-14 10:21:24 PDT
Try build finally completed. The log output is very puzzling:

> Reading 55 bytes from file: c:\docume~1\cltbld\locals~1\temp\tmpixy1df\weave\logs\1308070538123.log
> error is: 0
> data is: MESSAGE is: this WILL show up
> 
> TEST-PASS | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | [null : 83] true == true
> 
> TEST-UNEXPECTED-FAIL | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | -1 != -1 - See following stack:

So inputStream.available() is reporting the correct number (55 bytes), there's no read error, but the 'data' we read appears to be garbled somehow. We don't even get to print a newline before printing the expected MESSAGE which leads me to believe 'data' contains a \0 character.
Comment 4 Philipp von Weitershausen [:philikon] 2011-06-14 16:10:32 PDT
Created attachment 539353 [details] [diff] [review]
disable test on win opt

Can't really figure this out right now, so I went ahead and disabled the test on Win opt for now. Turns out, this wasn't as easy as it sounds. Can't wait until xpcshell manifests step up here.

http://hg.mozilla.org/services/services-central/rev/efa8bb8b7a86
Comment 5 Philipp von Weitershausen [:philikon] 2011-06-15 02:37:52 PDT
http://hg.mozilla.org/mozilla-central/rev/efa8bb8b7a86

Leaving this bug open to track and hopefully sooner than later resolve the actual failure.
Comment 6 Mike Hommey [:glandium] 2011-09-02 09:45:50 PDT
What looks like the same error is getting perma-orange on linux opt builds on the build-system tree since the landing of bug 680440, which effectively enables jemalloc in xpcshell. Before bug 680440, xpcshell wasn't using jemalloc on linux, which I think was a mistake.
Comment 7 Mike Hommey [:glandium] 2011-09-02 22:50:18 PDT
So, this now happens on linux and linux64 opt, happened on m-c when b-s was merged, and I confirmed on try that it is triggered on windows when jemalloc is enabled, and doesn't happen when jemalloc is disabled. So this pretty much looks like an issue the appear when jemalloc is enabled.
Comment 8 Mike Hommey [:glandium] 2011-09-02 22:52:54 PDT
Since the solution on windows was to disable the test, I'm tempted to disable the test entirely for now.
Comment 9 Philipp von Weitershausen [:philikon] 2011-09-02 23:12:15 PDT
It feels like this bug which is about Win opt failures is being hijacked for a jemalloc related change. Windows is known to be erratic about file I/O which is why we've begrudgingly accepted disabling this test on Windows opt *only*.

If we're now seeing a failure based on other factors now, that's very unfortunate, but we should investigate why. At the very least it seems suspicious that a change unrelated to Sync and file I/O creates this random orange.

If there's no time for a detailed investigation because, uh, the tree is orange half the time, I get it. But then we should have a *separate* bug that tracks that and the full disabling of this test, rather than assuming this is the same issue that causes the Win opt failures.
Comment 10 Mike Hommey [:glandium] 2011-09-02 23:35:51 PDT
Let me restate it:
- If I enable the test on windows, I see the orange, which was what this bug was.
- If I enable the test on windows, and disable jemalloc, the orange disappears

What changed on linux is that now jemalloc is used in xpcshell which was not the case before. jemalloc has been used in xpcshell in windows for a while. So the result of this change is:
- The test fails when jemalloc is enabled
- The test doesn't fail when jemalloc is disabled

I can't believe this is a coincidence.
Comment 11 Mike Hommey [:glandium] 2011-09-03 00:47:39 PDT
It does look like a latent race condition that jemalloc makes much more likely.
Comment 12 Mike Hommey [:glandium] 2011-09-03 00:55:02 PDT
When the test works:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036331339.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "1315036331339\tSync.Test.FileLog\t"..., 55) = 55

When the test fails:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036065010.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 55) = 55
Comment 13 Philipp von Weitershausen [:philikon] 2011-09-03 01:57:35 PDT
(In reply to Mike Hommey [:glandium] from comment #10)
> Let me restate it:
> - If I enable the test on windows, I see the orange, which was what this bug
> was.
> - If I enable the test on windows, and disable jemalloc, the orange
> disappears
> 
> What changed on linux is that now jemalloc is used in xpcshell which was not
> the case before. jemalloc has been used in xpcshell in windows for a while.
> So the result of this change is:
> - The test fails when jemalloc is enabled
> - The test doesn't fail when jemalloc is disabled
> 
> I can't believe this is a coincidence.

Thanks for the clarification, Mike. My apologies for my initial skepticism, but this didn't appear as clear cut to me from comment 6-8. Comment 12, however, matches my investigations from comment 3, so it very much looks like the same issue. Thanks for looking into that.

My initial suspicion is that we might be doing something fishy to the storage streams that we're logging to. I'll have a look at that in the morning. In the meantime I'd be ok with disabling the test if it turns out to be a high frequency orange on the tree. Please reference the disabling commit.
Comment 14 Mike Hommey [:glandium] 2011-09-03 02:28:52 PDT
It is effectively perma orange on our test bots, but on a local build it's intermittent.
It only appeared on one push on m-c (a build-system merge), which was backed out for other reasons. So at the moment, this only happens on the build-system branch. We can keep it that way for now, I think, until we merge it again, but since the merge also breaks comm-central, we're going to wait a bit.
Comment 15 Mike Hommey [:glandium] 2011-09-03 02:29:59 PDT
Note the test file name recently changed.
Comment 16 Mike Hommey [:glandium] 2011-09-03 03:06:43 PDT
I get a reliable green if I increase STREAM_SEGMENT_SIZE in services/sync/modules/log4moz.js (I used 4096 * 8)

It is starting to smell like a bad bug in nsStorageStream's use of nsSegmentedBuffer.
Comment 17 Mike Hommey [:glandium] 2011-09-03 03:29:04 PDT
FWIW, on the contrary, reducing STREAM_SEGMENT_SIZE makes it more reliably orange on my machine.
Comment 18 Mike Hommey [:glandium] 2011-09-03 05:31:43 PDT
I found the problem, and why it only happens with jemalloc: It happens when nsStorageStream::Close is called before the corresponding nsStorageInputStream has been completely read from. In such situation, nsStorageStream::Close reallocates the last segment in the segment buffer. With the system allocator, the reallocated buffer gets the same address. With jemalloc, it changes. The problem is that then the corresponding nsStorageInputStream still contains a pointer to the old buffer segment. If the buffer has been reused for something else, it breaks.
This means nsStorageInputStream should store offsets in the segment buffers instead of pointers there.
Comment 19 Mike Hommey [:glandium] 2011-09-03 05:53:07 PDT
Created attachment 558060 [details] [diff] [review]
Store offset inside segment instead of pointer for nsStorageInputStream

This uses offsets instead of pointers to store the current location in the StorageInputStream.
See comment 18 for a description of the problem being fixed.
Comment 20 Mike Hommey [:glandium] 2011-09-05 09:39:39 PDT
Created attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows
Comment 21 Philipp von Weitershausen [:philikon] 2011-09-05 09:46:06 PDT
Comment on attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows

Yay!
Comment 23 Kyle Huey [:khuey] (khuey@mozilla.com) (Away until 6/13) 2011-09-15 19:19:53 PDT
https://hg.mozilla.org/mozilla-central/rev/4e0af3ebc87f
https://hg.mozilla.org/mozilla-central/rev/65c007c580a8

Note You need to log in before you can comment on or make changes to this bug.