Closed Bug 468050 Opened 13 years ago Closed 13 years ago

Linux gloda tests are frequently failing on tinderbox

Categories

(MailNews Core :: Database, defect, P1)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: standard8, Assigned: asuth)

References

Details

Attachments

(5 files)

Since we checked in bug 466800 we're seeing frequent failures on Linux in test_query_messages.js e.g. http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228438842.1228439080.7206.gz&fulltext=1

2008-12-04 17:19:47	gloda.indexer	INFO	Queue-ing folder for indexing: injecty1
2008-12-04 17:19:47	gloda.datastore	DEBUG	!! mapped 2 from mailbox://nobody@Local%20Folders/injecty1
2008-12-04 17:19:47	gloda.indexer	DEBUG	Entering folder: mailbox://nobody@Local%20Folders/injecty1
2008-12-04 17:19:47	gloda.indexer	ERROR	Problem entering folder: injecty1, skipping.
*** exiting
NEXT ERROR *** TEST-UNEXPECTED-FAIL | file:///buildbot/trunk-unix/build/objdir-tb/mozilla/dist/bin/modules/gloda/log4moz.js | 2008-12-04 17:19:47
Flags: blocking-thunderbird3+
Priority: -- → P1
This patch folds the error logging code on the next line into the prior error logging line.  Because of my logger magic that kills us the first time we hit a logged error, the second error logging never gets a chance to live, let alone to escape to the console.

This does not fix the problem, but will help me understand why things are unhappy.
Attachment #351670 - Flags: review?(bugzilla)
Whiteboard: [has diagnostic patch][needs review standard8]
Comment on attachment 351670 [details] [diff] [review]
[checked in] v1 get info on the exception that killed us

This looks good, more debug is always useful. r=me.

I've pushed this as well: http://hg.mozilla.org/comm-central/rev/33793220167f
Attachment #351670 - Attachment description: v1 get info on the exception that killed us → [checked in] v1 get info on the exception that killed us
Attachment #351670 - Flags: review?(bugzilla) → review+
So, that logic actually made things worse, as now the tests time out.  This patch causes gloda tests to abort themselves after a ridiculously long delay in a manner that gets us a log.  Ridiculously long is currently 3 minutes for each of the gloda tests.  This is, in fact, ridiculously long; it currently takes 6 seconds to run all the gloda tests on my box.  Even on a slow tinderbox with a head cold and a hangover, 3 minutes should be much more time than we need.  (I am leaving a lot of margin because we are trying to fix intermittent failures here, not make them more likely.)
Attachment #351722 - Flags: review?
Attachment #351722 - Flags: review? → review?(bienvenu)
Attachment #351722 - Flags: review?(bienvenu) → review+
Attachment #351722 - Attachment description: v1 make the gloda tests timeout after 3 minutes and die in a fashion that gets us the log → [checked in: comment #4] v1 make the gloda tests timeout after 3 minutes and die in a fashion that gets us the log
We got a good failure!  Hooray!
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228622442.1228622844.24107.gz

The error code returned 0x80550005 = NS_MSG_ERROR_FOLDER_SUMMARY_OUT_OF_DATE indicates that nsMsgLocalMailFolder::GetDatabaseWithReparse was called when m_parsingFolder was true.  Gloda's indexer.js was not expecting this return value and was exploding.  This was a bug in gloda (although it could be argued that GetDatabaseWithReparse should be returning NS_ERROR_NOT_INITIALIZED for consistency).  I also noticed that we were passing the folder as the aReparseUrlListener, which is not required (and filtered out by ParseFolder but not GetDatabaseWithReparse), so we just pass null for that now.

The curious thing is that m_parsingFolder should not have been true, since the point of the updateFolderAndNotify helper was only to call the callback when things are completely loaded.  I am having trouble mentally synthesizing a chain of events that would result in this happening.  The best explanation was that nsMsgLocalMailFolder::OnStopRunningUrl notified mReparseListener and that listener generated a FolderLoaded notification (at which time m_parsingFolder would still be true) so updateFolderAndNotify's immediate invocation would explain things.  Except that even when mReparseListener is the folder (which is half-guarded against, as previously noted), this still can't happen unless it were to somehow short-circuit to the super-class's implementation.  (m_parsingFolder is set to false immediately before calling into nsMsgDBFolder::OnStopRunningUrl who does the FolderLoaded notification.)

I've made changes to to mailTestUtils to have updateFolderAndNotify also handle the additional error code that GetDatabaseWithReparse can propagate back through updateFolder.  I've also introduced a 0ms-delay into the equation for the callback to attempt to address at least one class of problems, but I fear that there's another actor in the scenario that I just don't know about.  (Or the bowels of the parsing logic are just doing something unexpected in the potential situation where the timestamp of the mbox file is equal to the msf file, because they would assume the msf file should be more recent due to use of > rather than >=, or something.)
Attachment #351738 - Flags: review?(bienvenu)
Attachment #351738 - Flags: review?(bugzilla)
(Requesting both testing and database-related reviews given my general confusion.)
Status: NEW → ASSIGNED
Whiteboard: [has diagnostic patch][needs review standard8] → [has patch][needs review bienvenu][needs review standard8]
Comment on attachment 351738 [details] [diff] [review]
v1 have gloda handle another return code, make helper test code more resilient

I agree it seems a bit strange, but I have seen problems before where we've needed the timeout to allow us to let things finish properly before we go onto doing the next item. Lets give it a try with this.
Attachment #351738 - Flags: review?(bugzilla) → review+
Comment on attachment 351738 [details] [diff] [review]
v1 have gloda handle another return code, make helper test code more resilient

notification...

+      //  is going to send us a noitification when the reparse has completed.
Attachment #351738 - Flags: review?(bienvenu) → review+
(In reply to comment #8)
> (From update of attachment 351738 [details] [diff] [review])
> notification...
> 
> +      //  is going to send us a noitification when the reparse has completed.

Oh, a woise guy, eh?

Commited, with nit addressed: http://hg.mozilla.org/comm-central/rev/69df76ff98d3

I'll mark this fixed if the tests stop going orange after a while...
Whiteboard: [has patch][needs review bienvenu][needs review standard8] → [waiting for linux comm-central check to be clear]
It turns out that mailnews' error codes don't show up in Components.results.  Standard8 is logging a bug.
Attachment #352080 - Flags: review?(bugzilla)
Comment on attachment 352080 [details] [diff] [review]
[checked in: comment #12] v1 don't use non-existent error constants, invent our own!

Looks good. Lets see how this affects things.
Attachment #352080 - Flags: review?(bugzilla) → review+
Comment on attachment 352080 [details] [diff] [review]
[checked in: comment #12] v1 don't use non-existent error constants, invent our own!

http://hg.mozilla.org/comm-central/rev/b0c1cc7f3e97
Attachment #352080 - Attachment description: v1 don't use non-existent error constants, invent our own! → [checked in: comment #12] v1 don't use non-existent error constants, invent our own!
With the error code now working, we began using the pending folder logic, which apparently had rotted, presumably around the time when we changed to our current style of async processing and probably also around the time we stopped calling updateFolder like it was going out of fashion.
Attachment #352085 - Flags: review?(bugzilla)
Comment on attachment 352085 [details] [diff] [review]
[checked in] v1 gloda pending folder logic was bitrotted

r=me. Checked in: http://hg.mozilla.org/comm-central/rev/b0c1cc7f3e97
Attachment #352085 - Attachment description: v1 gloda pending folder logic was bitrotted → [checked in] v1 gloda pending folder logic was bitrotted
Attachment #352085 - Flags: review?(bugzilla) → review+
This seems to be fixed now :-)
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [waiting for linux comm-central check to be clear]
You need to log in before you can comment on or make changes to this bug.