test_service_filelog.js/test_errorhandler_filelog.js failure when jemalloc is enabled

RESOLVED FIXED in mozilla9

Status

()

Core
XPCOM
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: philikon, Assigned: glandium)

Tracking

unspecified
mozilla9
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: fixed-in-bs)

Attachments

(3 attachments)

Of course it passes on Win debug so I never caught this on my Windows dev box.
This blocks the next merge, so I'll investigate. Worst case we'll have to disable this test for Windows opt for now.
Assignee: nobody → philipp
I can't reproduce this with an opt build on my Windows box. FML. I kicked off a try build a few hours ago with some extra debugging output: http://tbpl.mozilla.org/?tree=Try&rev=979d3671b5a7. But I'm not confident I can actually fix the test based on that, so disabling it on Windows opt seems like the best alternative right now.
Try build finally completed. The log output is very puzzling:

> Reading 55 bytes from file: c:\docume~1\cltbld\locals~1\temp\tmpixy1df\weave\logs\1308070538123.log
> error is: 0
> data is: MESSAGE is: this WILL show up
> 
> TEST-PASS | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | [null : 83] true == true
> 
> TEST-UNEXPECTED-FAIL | c:/talos-slave/test/build/xpcshell/tests/services/sync/tests/unit/test_service_filelog.js | -1 != -1 - See following stack:

So inputStream.available() is reporting the correct number (55 bytes), there's no read error, but the 'data' we read appears to be garbled somehow. We don't even get to print a newline before printing the expected MESSAGE which leads me to believe 'data' contains a \0 character.
Created attachment 539353 [details] [diff] [review]
disable test on win opt

Can't really figure this out right now, so I went ahead and disabled the test on Win opt for now. Turns out, this wasn't as easy as it sounds. Can't wait until xpcshell manifests step up here.

http://hg.mozilla.org/services/services-central/rev/efa8bb8b7a86
http://hg.mozilla.org/mozilla-central/rev/efa8bb8b7a86

Leaving this bug open to track and hopefully sooner than later resolve the actual failure.
Target Milestone: --- → mozilla7
(Assignee)

Comment 6

6 years ago
What looks like the same error is getting perma-orange on linux opt builds on the build-system tree since the landing of bug 680440, which effectively enables jemalloc in xpcshell. Before bug 680440, xpcshell wasn't using jemalloc on linux, which I think was a mistake.
(Assignee)

Comment 7

6 years ago
So, this now happens on linux and linux64 opt, happened on m-c when b-s was merged, and I confirmed on try that it is triggered on windows when jemalloc is enabled, and doesn't happen when jemalloc is disabled. So this pretty much looks like an issue the appear when jemalloc is enabled.
OS: Windows 7 → All
Hardware: x86 → All
(Assignee)

Updated

6 years ago
Summary: Win opt failures in test_service_filelog.js → test_service_filelog.js failure when jemalloc is enabled
(Assignee)

Comment 8

6 years ago
Since the solution on windows was to disable the test, I'm tempted to disable the test entirely for now.
It feels like this bug which is about Win opt failures is being hijacked for a jemalloc related change. Windows is known to be erratic about file I/O which is why we've begrudgingly accepted disabling this test on Windows opt *only*.

If we're now seeing a failure based on other factors now, that's very unfortunate, but we should investigate why. At the very least it seems suspicious that a change unrelated to Sync and file I/O creates this random orange.

If there's no time for a detailed investigation because, uh, the tree is orange half the time, I get it. But then we should have a *separate* bug that tracks that and the full disabling of this test, rather than assuming this is the same issue that causes the Win opt failures.
OS: All → Windows 7
Hardware: All → x86
Summary: test_service_filelog.js failure when jemalloc is enabled → Win opt failures in test_service_filelog.js
(Assignee)

Comment 10

6 years ago
Let me restate it:
- If I enable the test on windows, I see the orange, which was what this bug was.
- If I enable the test on windows, and disable jemalloc, the orange disappears

What changed on linux is that now jemalloc is used in xpcshell which was not the case before. jemalloc has been used in xpcshell in windows for a while. So the result of this change is:
- The test fails when jemalloc is enabled
- The test doesn't fail when jemalloc is disabled

I can't believe this is a coincidence.
(Assignee)

Comment 11

6 years ago
It does look like a latent race condition that jemalloc makes much more likely.
(Assignee)

Comment 12

6 years ago
When the test works:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036331339.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "1315036331339\tSync.Test.FileLog\t"..., 55) = 55

When the test fails:
open("/tmp/xpcshell/xpcshellprofile/weave/logs/success-1315036065010.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 17
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 55) = 55
(In reply to Mike Hommey [:glandium] from comment #10)
> Let me restate it:
> - If I enable the test on windows, I see the orange, which was what this bug
> was.
> - If I enable the test on windows, and disable jemalloc, the orange
> disappears
> 
> What changed on linux is that now jemalloc is used in xpcshell which was not
> the case before. jemalloc has been used in xpcshell in windows for a while.
> So the result of this change is:
> - The test fails when jemalloc is enabled
> - The test doesn't fail when jemalloc is disabled
> 
> I can't believe this is a coincidence.

Thanks for the clarification, Mike. My apologies for my initial skepticism, but this didn't appear as clear cut to me from comment 6-8. Comment 12, however, matches my investigations from comment 3, so it very much looks like the same issue. Thanks for looking into that.

My initial suspicion is that we might be doing something fishy to the storage streams that we're logging to. I'll have a look at that in the morning. In the meantime I'd be ok with disabling the test if it turns out to be a high frequency orange on the tree. Please reference the disabling commit.
OS: Windows 7 → All
Hardware: x86 → All
Summary: Win opt failures in test_service_filelog.js → test_service_filelog.js failure when jemalloc is enabled
Target Milestone: mozilla7 → ---
(Assignee)

Comment 14

6 years ago
It is effectively perma orange on our test bots, but on a local build it's intermittent.
It only appeared on one push on m-c (a build-system merge), which was backed out for other reasons. So at the moment, this only happens on the build-system branch. We can keep it that way for now, I think, until we merge it again, but since the merge also breaks comm-central, we're going to wait a bit.
OS: All → Windows 7
Hardware: All → x86
Target Milestone: --- → mozilla7
(Assignee)

Comment 15

6 years ago
Note the test file name recently changed.
OS: Windows 7 → All
Hardware: x86 → All
Summary: test_service_filelog.js failure when jemalloc is enabled → test_service_filelog.js/test_errorhandler_filelog.js failure when jemalloc is enabled
Target Milestone: mozilla7 → ---
(Assignee)

Comment 16

6 years ago
I get a reliable green if I increase STREAM_SEGMENT_SIZE in services/sync/modules/log4moz.js (I used 4096 * 8)

It is starting to smell like a bad bug in nsStorageStream's use of nsSegmentedBuffer.
(Assignee)

Comment 17

6 years ago
FWIW, on the contrary, reducing STREAM_SEGMENT_SIZE makes it more reliably orange on my machine.
(Assignee)

Comment 18

6 years ago
I found the problem, and why it only happens with jemalloc: It happens when nsStorageStream::Close is called before the corresponding nsStorageInputStream has been completely read from. In such situation, nsStorageStream::Close reallocates the last segment in the segment buffer. With the system allocator, the reallocated buffer gets the same address. With jemalloc, it changes. The problem is that then the corresponding nsStorageInputStream still contains a pointer to the old buffer segment. If the buffer has been reused for something else, it breaks.
This means nsStorageInputStream should store offsets in the segment buffers instead of pointers there.
(Assignee)

Updated

6 years ago
Assignee: philipp → mh+mozilla
Component: Firefox Sync: Backend → XPCOM
Product: Mozilla Services → Core
QA Contact: sync-backend → xpcom
(Assignee)

Comment 19

6 years ago
Created attachment 558060 [details] [diff] [review]
Store offset inside segment instead of pointer for nsStorageInputStream

This uses offsets instead of pointers to store the current location in the StorageInputStream.
See comment 18 for a description of the problem being fixed.
Attachment #558060 - Flags: review?(benjamin)
(Assignee)

Comment 20

6 years ago
Created attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows
Attachment #558303 - Flags: review?(philipp)
Comment on attachment 558303 [details] [diff] [review]
Enable test_errorhandler_filelog.js on windows

Yay!
Attachment #558303 - Flags: review?(philipp) → review+
Attachment #558060 - Flags: review?(benjamin) → review+
(Assignee)

Comment 22

6 years ago
http://hg.mozilla.org/projects/build-system/rev/4e0af3ebc87f
http://hg.mozilla.org/projects/build-system/rev/65c007c580a8
Whiteboard: fixed-in-bs
https://hg.mozilla.org/mozilla-central/rev/4e0af3ebc87f
https://hg.mozilla.org/mozilla-central/rev/65c007c580a8
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
You need to log in before you can comment on or make changes to this bug.