Closed Bug 588048 Opened 11 years ago Closed 10 years ago

test_viewWrapper_virtualFolder.js intermittently hangs on Linux

Categories

(Thunderbird :: Mail Window Front End, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.3a1

People

(Reporter: standard8, Assigned: asuth)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

test_viewWrapper_virtualFolder.js is intermittently hanging on our Linux builders and causing xpcshell-tests to timeout.

Unfortunately there's nothing useful in the tinderbox log:
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1282049059.1282056227.11123.gz

TEST-INFO | /buildbot/linux-comm-central-check/build/objdir/mozilla/_tests/xpcshell/test_mailbase/unit/test_viewWrapper_virtualFolder.js | running test ...

command timed out: 300 seconds without output, killing pid 8158

I'll see if I can get some more info.
This is starting to get bad (frequent), any ideas on the best way to start logging/poking at this?
Can I shorten the asyncTestUtils seppuku delay to below the tinderbox timeout so we can get some log output?  I know you lowered it recently, but it's still too high, and the default success harness isn't giving us a lot to go on when the test does pass (for example, how long a successful test run took...).

In any event notes from investigating locally:

Getting a lot of the following when I run the test locally:
  WARNING: NS_ENSURE_SUCCESS(err, err) failed with result 0x80004005: file
  /home/visbrero/rev_control/hg/comm-central/mailnews/db/msgdb/
   src/nsMsgDatabase.cpp, line 5682

which is:
  nsresult err = GetSearchResultsTable(aSearchFolderUri, PR_FALSE, getter_AddRefs(table));
  NS_ENSURE_SUCCESS(err, err);

Is it possible we have a virtual folder performance regression or is that something that should not be an ensure success?

Running manually, it appears that the biggest time-sink in the function is "test_virtual_folder_multi_load_alotta_folders_simple_pred".  On my system when otherwise unloaded, it takes 10 seconds to run, which includes creating 16 folders with 256 messages spread amongst them.
(In reply to comment #2)
> 
> Getting a lot of the following when I run the test locally:
>   WARNING: NS_ENSURE_SUCCESS(err, err) failed with result 0x80004005: file
>   /home/visbrero/rev_control/hg/comm-central/mailnews/db/msgdb/
>    src/nsMsgDatabase.cpp, line 5682
> 
> which is:
>   nsresult err = GetSearchResultsTable(aSearchFolderUri, PR_FALSE,
> getter_AddRefs(table));
>   NS_ENSURE_SUCCESS(err, err);
> 
> Is it possible we have a virtual folder performance regression or is that
> something that should not be an ensure success?

That most likely just means that this is the first time we've run the saved search against the underlying folder, which is highly likely in a unit test. So yeah, it probably should fail silently.
I bumped the timeout duration down to 240s after discussion with Standard8.  It looks like the tests are taking forever to run.  Specifically, the realfolder test ended up timing out on this run here:

http://tinderbox.mozilla.org/showlog.cgi?tree=Thunderbird&errorparser=unittest&logfile=1286492503.1286506625.761.gz&buildtime=1286492503&buildname=Linux%20comm-central%20check&fulltext=1#err0

The machine in question for that run is momo-vm-linux-06.

The test is running slow, and a successful unit test run currently appears to take (picking random runs) *240 minutes* for the linux VMs, whereas the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94 minutes.

I would like to set forth the theory that the linux VMs suck and should be made to suck less or be replaced with something that sucks less.  Of course, the tests could perhaps just be running slow; we would want to know what the system load is like during the run (and ideally I/O wait stats too).
(In reply to comment #4)
> Given that the test is running slow and a successful unit test run currently
> appears to take (picking random runs) *240 minutes* for the linux VMs whereas
> the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94
> minutes.

Unfortunately that also includes compilation. I'm suspicious that our Linux VMs may be low on RAM and/or finding linking libxul hard following the recent switch.

Looking purely at the times for the xpcshell-test part of the run:

Linux between 13 - 24 minutes. Prob average about 20 minutes.
Mac Average 18 minutes
Windows Average 14 minutes

So although it's a little slower, it isn't that far off.

As far as I know, we're using the same VM setup as MoCo, although it would be good to confirm how close we actually are.

I'll also see if I can give this test a spin later and reproduce this hang or not.
(In reply to comment #5)
> (In reply to comment #4)
> > Given that the test is running slow and a successful unit test run currently
> > appears to take (picking random runs) *240 minutes* for the linux VMs whereas
> > the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94
> > minutes.
> 
> Unfortunately that also includes compilation. I'm suspicious that our Linux VMs
> may be low on RAM and/or finding linking libxul hard following the recent
> switch.

Not sure why memory requirements would be higher for 32bit vs 64bit builds, but yeah, we can see the impact libxul has had. Basically, our Linux 64 VMs with only 2G of RAM are now starting to swap when linking the final libxul.so, and once these guys get into swap-land, things get very slow.

John has been increasing RAM on these slaves to 3G to help with this.
All the Linux (32 and 64 bit) build slaves have been given 3GB RAM.
According to my shiny new systemtap-based tool, the test performs *on my box with my probes* 1676 fsyncs during its run, which are responsible for 79.5% of the wall time of the test (44 secs for me with probes).  JS stacks finger messageInjection.js:add_sets_to_folders, with individual fsyncs seeming to take either ~0.3 ms or ~35 ms.

A quick perusal of nsMsgLocalMailFolder::AddMessage (which is the likeliest directly responsible party given the js callstack correlations with messages generated) shows it opens and closes an nsMsgFileStream and that it explicitly calls Flush on it which calls PR_Sync.

The only obvious quickfix would be to cause the xpcshell working directory to be a ramdisk.

In terms of making the message generation less io-bound, various options and problems:

- Try and make AddMessage flush less/not-at-all.  I believe the method is used for real purposes that don't like data loss, so it would have to be optional.  I am unclear if PR_Close is always going to want to imply an fsync or what.

- Make messageInjection go back to just building up an mbox file and then adding that as a new folder.  I think test semantics all over the place assume we can add messages to existing folders.  We could probably have it add the new folders then move the messages to the target folders, but that would introduce semantic difficulties, especially in things like gloda that try and be clever.

- For tests that are generating a lot of messages but don't have tricky semantics, like this one, go back to the bulk mbox injection.

- Make nsMsgLocalMailFolder support some form of batch injection so the fsyncs are O(injection transactions) instead of O(messages).
Extra notes:

- The above explanation correlates well with the timed-out logs I have seen, where all of the test operations seem to take much longer than one would expect.  That makes sense given that we are talking about a VM, which we generally would not expect to have great I/O, let alone when it is fighting other VMs for I/O.

- This will basically affect every test using messageInjection.js, which is a lot of tests.  The DBViewWrapper tests have it worse though because they generate more messages than others, especially the 256 message one.
Things are only going to get worse when we switch to maildir, I would think, unless somehow the size of the file affects the fsync time. And yes, AddMessage is used for real purposes. I think PR_Close is going to want to imply an fsync, but I could be wrong.
The other possibility is to have an AddMessages() method that takes an array of strings representing messages and doesn't do an fsync or close until the end.
(In reply to comment #10)
> Things are only going to get worse when we switch to maildir, I would think,
> unless somehow the size of the file affects the fsync time. And yes, AddMessage
> is used for real purposes. I think PR_Close is going to want to imply an fsync,
> but I could be wrong.

I think the dominant factor would be the seek latency of the disk, which is why a ramdisk is an appealing option for unit tests where durability is not actually a concern.  (And may require multiple seeks in order to flush both the data and the file metadata, depending on how strict the filesystem is being about requiring you to run fsync on the directory and what not.)

I just looked more and both by inspection and gdb with a breakpoint on fsync it looks like nspr does not perform any implicit fsyncs (and man close explicitly states that fsyncs are not automatic.)

(In reply to comment #11)
> The other possibility is to have an AddMessages() method that takes an array of
> strings representing messages and doesn't do an fsync or close until the end.

Yeah, that's what I was thinking of in terms of batch injection.

Maildir could still potentially benefit from batching too if it waited until some granularity before issuing the flushes.  Also, if the writes weren't happening synchronously on the main thread, that would also be a big improvement.  The only thing that really needs to be blocking on the fsync is a destructive operation like telling the mail server to delete the mails because we assert that we've got them.
(In reply to comment #12)
> Maildir could still potentially benefit from batching too if it waited until
> some granularity before issuing the flushes.

By this I mean, if we have 8 new mails, we could write [0 1 2 3], fsync those in sequence, then write [4 5 6 7] and sync those.  It's possible the first fsync might take care of all of them making the other ones pretty quick.  This is an example where the sqlite storage would totally win :)
This seems to work.  It makes the test go faster and it passes the gloda unit tests (and they are a persnickety bunch!).
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Attachment #483870 - Flags: review?(bienvenu)
Comment on attachment 483870 [details] [diff] [review]
Add a batching method and make messageInjection use it.

looks good, thx!
Attachment #483870 - Flags: review?(bienvenu) → review+
I've pushed this into the closed tree:

http://hg.mozilla.org/comm-central/rev/76cd83285f89

I decided this was quite self-contained and would give us a chance to see the effect on our current frequent and permanent oranges.
So far today, this seems to have been holding fine. I may be speaking too early, but I'm going to tentatively close this as fixed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.3a1
Some of the linux 'U' test runs are down in the neighborhood of 25 minutes now!  I don't think that's because of this patch (the patch likely helps things in the longer running cases where the VMs are experiencing resource contention)... great job gozer/jhopkins on making the VMs much happier!  (Also anyone else who did anything to make things faster!)
Whiteboard: [orange]
Depends on: 876185