Closed Bug 588048 Opened 11 years ago Closed 10 years ago

test_viewWrapper_virtualFolder.js intermittently hangs on Linux

Categories

(Thunderbird :: Mail Window Front End, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.3a1

People

(Reporter: standard8, Assigned: asuth)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

test_viewWrapper_virtualFolder.js is intermittently hanging on our Linux builders and causing xpcshell-tests to timeout.

Unfortunately there's nothing useful in the tinderbox log:
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1282049059.1282056227.11123.gz

TEST-INFO | /buildbot/linux-comm-central-check/build/objdir/mozilla/_tests/xpcshell/test_mailbase/unit/test_viewWrapper_virtualFolder.js | running test ...

command timed out: 300 seconds without output, killing pid 8158

I'll see if I can get some more info.
This is starting to get bad (frequent), any ideas on the best way to start logging/poking at this?
Can I shorten the asyncTestUtils seppuku delay to below the tinderbox timeout so we can get some log output?  I know you lowered it recently, but it's still too high, and the default success harness isn't giving us a lot to go on when the test does pass (for example, how long a successful test run took...).

In any event notes from investigating locally:

Getting a lot of the following when I run the test locally:
  WARNING: NS_ENSURE_SUCCESS(err, err) failed with result 0x80004005: file
  /home/visbrero/rev_control/hg/comm-central/mailnews/db/msgdb/
   src/nsMsgDatabase.cpp, line 5682

which is:
  nsresult err = GetSearchResultsTable(aSearchFolderUri, PR_FALSE, getter_AddRefs(table));
  NS_ENSURE_SUCCESS(err, err);

Is it possible we have a virtual folder performance regression or is that something that should not be an ensure success?

Running manually, it appears that the biggest time-sink in the function is "test_virtual_folder_multi_load_alotta_folders_simple_pred".  On my system when otherwise unloaded, it takes 10 seconds to run, which includes creating 16 folders with 256 messages spread amongst them.
(In reply to comment #2)
> 
> Getting a lot of the following when I run the test locally:
>   WARNING: NS_ENSURE_SUCCESS(err, err) failed with result 0x80004005: file
>   /home/visbrero/rev_control/hg/comm-central/mailnews/db/msgdb/
>    src/nsMsgDatabase.cpp, line 5682
> 
> which is:
>   nsresult err = GetSearchResultsTable(aSearchFolderUri, PR_FALSE,
> getter_AddRefs(table));
>   NS_ENSURE_SUCCESS(err, err);
> 
> Is it possible we have a virtual folder performance regression or is that
> something that should not be an ensure success?

That most likely just means that this is the first time we've run the saved search against the underlying folder, which is highly likely in a unit test. So yeah, it probably should fail silently.
I bumped the timeout duration down to 240s after discussion with Standard8.  It looks like the tests are taking forever to run.  Specifically, the realfolder test ended up timing out on this run here:

http://tinderbox.mozilla.org/showlog.cgi?tree=Thunderbird&errorparser=unittest&logfile=1286492503.1286506625.761.gz&buildtime=1286492503&buildname=Linux%20comm-central%20check&fulltext=1#err0

The machine in question for that run is momo-vm-linux-06.

The test is running slow, and a successful unit test run currently appears to take (picking random runs) *240 minutes* for the linux VMs, whereas the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94 minutes.

I would like to set forth the theory that the linux VMs suck and should be made to suck less or be replaced with something that sucks less.  Of course, the tests could perhaps just be running slow; we would want to know what the system load is like during the run (and ideally I/O wait stats too).
(In reply to comment #4)
> Given that the test is running slow and a successful unit test run currently
> appears to take (picking random runs) *240 minutes* for the linux VMs whereas
> the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94
> minutes.

Unfortunately that also includes compilation. I'm suspicious that our Linux VMs may be low on RAM and/or finding linking libxul hard following the recent switch.

Looking purely at the times for the xpcshell-test part of the run:

Linux between 13 - 24 minutes. Prob average about 20 minutes.
Mac Average 18 minutes
Windows Average 14 minutes

So although it's a little slower, it isn't that far off.

As far as I know, we're using the same VM setup as MoCo, although it would be good to confirm how close we actually are.

I'll also see if I can give this test a spin later and reproduce this hang or not.
(In reply to comment #5)
> (In reply to comment #4)
> > Given that the test is running slow and a successful unit test run currently
> > appears to take (picking random runs) *240 minutes* for the linux VMs whereas
> > the OS X machines take 98 minutes *with our slow mozmill runs* and windows 94
> > minutes.
> 
> Unfortunately that also includes compilation. I'm suspicious that our Linux VMs
> may be low on RAM and/or finding linking libxul hard following the recent
> switch.

Not sure why memory requirements would be higher for 32bit vs 64bit builds, but yeah, we can see the impact libxul has had. Basically, our Linux 64 VMs with only 2G of RAM are now starting to swap when linking the final libxul.so, and once these guys get into swap-land, things get very slow.

John has been increasing RAM on these slaves to 3G to help with this.
All the Linux (32 and 64 bit) build slaves have been given 3GB RAM.
According to my shiny new systemtap-based tool, the test performs *on my box with my probes* 1676 fsyncs during its run, which are responsible for 79.5% of the wall time of the test (44 secs for me with probes).  JS stacks finger messageInjection.js:add_sets_to_folders, with individual fsyncs seeming to take either ~0.3 ms or ~35 ms.

A quick perusal of nsMsgLocalMailFolder::AddMessage (which is the likeliest directly responsible party given the js callstack correlations with messages generated) shows it opens and closes an nsMsgFileStream and that it explicitly calls Flush on it which calls PR_Sync.

The only obvious quickfix would be to cause the xpcshell working directory to be a ramdisk.

In terms of making the message generation less io-bound, various options and problems:

- Try and make AddMessage flush less/not-at-all.  I believe the method is used for real purposes that don't like data loss, so it would have to be optional.  I am unclear if PR_Close is always going to want to imply an fsync or what.

- Make messageInjection go back to just building up an mbox file and then adding that as a new folder.  I think test semantics all over the place assume we can add messages to existing folders.  We could probably have it add the new folders then move the messages to the target folders, but that would introduce semantic difficulties, especially in things like gloda that try and be clever.

- For tests that are generating a lot of messages but don't have tricky semantics, like this one, go back to the bulk mbox injection.

- Make nsMsgLocalMailFolder support some form of batch injection so the fsyncs are O(injection transactions) instead of O(messages).
Extra notes:

- The above explanation correlates well with the timed-out logs I have seen, where all of the test operations seem to take much longer than one would expect.  That makes sense given that we are talking about a VM, which we generally would not expect to have great I/O, let alone when it is fighting other VMs for I/O.

- This will basically affect every test using messageInjection.js, which is a lot of tests.  The DBViewWrapper tests have it worse though because they generate more messages than others, especially the 256 message one.
Things are only going to get worse when we switch to maildir, I would think, unless somehow the size of the file affects the fsync time. And yes, AddMessage is used for real purposes. I think PR_Close is going to want to imply an fsync, but I could be wrong.
The other possibility is to have an AddMessages() method that takes an array of strings representing messages and doesn't do an fsync or close until the end.
(In reply to comment #10)
> Things are only going to get worse when we switch to maildir, I would think,
> unless somehow the size of the file affects the fsync time. And yes, AddMessage
> is used for real purposes. I think PR_Close is going to want to imply an fsync,
> but I could be wrong.

I think the dominant factor would be the seek latency of the disk, which is why a ramdisk is an appealing option for unit tests where durability is not actually a concern.  (And may require multiple seeks in order to flush both the data and the file metadata, depending on how strict the filesystem is being about requiring you to run fsync on the directory and what not.)

I just looked more and both by inspection and gdb with a breakpoint on fsync it looks like nspr does not perform any implicit fsyncs (and man close explicitly states that fsyncs are not automatic.)

(In reply to comment #11)
> The other possibility is to have an AddMessages() method that takes an array of
> strings representing messages and doesn't do an fsync or close until the end.

Yeah, that's what I was thinking of in terms of batch injection.

Maildir could still potentially benefit from batching too if it waited until some granularity before issuing the flushes.  Also, if the writes weren't happening synchronously on the main thread, that would also be a big improvement.  The only thing that really needs to be blocking on the fsync is a destructive operation like telling the mail server to delete the mails because we assert that we've got them.
(In reply to comment #12)
> Maildir could still potentially benefit from batching too if it waited until
> some granularity before issuing the flushes.

By this I mean, if we have 8 new mails, we could write [0 1 2 3], fsync those in sequence, then write [4 5 6 7] and sync those.  It's possible the first fsync might take care of all of them making the other ones pretty quick.  This is an example where the sqlite storage would totally win :)
This seems to work.  It makes the test go faster and it passes the gloda unit tests (and they are a persnickety bunch!).
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Attachment #483870 - Flags: review?(bienvenu)
Comment on attachment 483870 [details] [diff] [review]
Add a batching method and make messageInjection use it.

looks good, thx!
Attachment #483870 - Flags: review?(bienvenu) → review+
I've pushed this into the closed tree:

http://hg.mozilla.org/comm-central/rev/76cd83285f89

I decided this was quite self-contained and would give us a chance to see the effect on our current frequent and permanent oranges.
So far today, this seems to have been holding fine. I may be speaking too early, but I'm going to tentatively close this as fixed.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.3a1
Some of the linux 'U' test runs are down in the neighborhood of 25 minutes now!  I don't think that's because of this patch (the patch likely helps things in the longer running cases where the VMs are experiencing resource contention)... great job gozer/jhopkins on making the VMs much happier!  (Also anyone else who did anything to make things faster!)
Whiteboard: [orange]
Depends on: 876185