intermittent oranges from unreported message summary load completion (likely fallout from realistic click generation from bug 640877)

RESOLVED WORKSFORME

Status

RESOLVED WORKSFORME
8 years ago
6 years ago

People

(Reporter: asuth, Assigned: asuth)

Tracking

({intermittent-failure})

Trunk
All
Linux
intermittent-failure
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

8 years ago
We are seeing a number of mozmill intermittent orange failures that appear to be the result of a summarization that generates an MD_onLoadStarted event but fails to generate an MD_onLoadCompleted event.  The summarization appears to have completed though, per the screenshot.  Presumably I missed a control flow path or our decorated helper is getting trashed.

This seems much more prevalent on linux, although I doubt it's a linux-specific problem.  I would guess that it's due to a timing issue where linux is the fastest because of its general superiority over all other platforms.
(Assignee)

Updated

8 years ago
Blocks: 651472
(Assignee)

Comment 1

8 years ago
Pushed an orange-debugging logging addition to trunk:
http://hg.mozilla.org/comm-central/rev/42c914e42783

The try run (net green) was:
http://hg.mozilla.org/try-comm-central/rev/21a53a6a60fe
(Assignee)

Comment 2

8 years ago
The additional debug output shows that we are netting a "MD_MultiMessageSummary_summarize" entry without a corresponding "MD_*Summary_onQueryCompleted" entry.  Screenshot still suggests the summarization ran to completion.

This indicates that the summarization process is being initiated but one of the following is happening:

- We are not making it all the way to the end of the summarization logic to trigger the call to Gloda.getMessageCollectionForHeaders.  (Yes, gloda is still called and the query will happen; the database should just be empty.)

- Something is going awry in gloda and we are never generating a callback.  Since the database query itself should keep us rooted until the call is triggered, it shouldn't be a GC problem.

The next step is probably to crank up the gloda logging level so it shows up in the fancy pants logs.  I have a patch for this I am going to ask bienvenu for review for since I can't rubber-stamp it.
(Assignee)

Comment 3

8 years ago
Created attachment 529594 [details] [diff] [review]
have mozmill tests crank up gloda testing without any I/O for fancy pants logging v1

Tested locally.  The idea is to make fancy pants logging be able to see gloda DEBUG logging events; fancy pants logging can actually already see WARN and ERROR, we just tend not to generate them.

I can't rubber-stamp this because there is non-test code involved and it touches the review framework a teeny bit by setting the pref.
Attachment #529594 - Flags: review?(dbienvenu)

Updated

8 years ago
Attachment #529594 - Flags: review?(dbienvenu) → review+
(Assignee)

Comment 4

8 years ago
Additional gloda logging pushed to trunk:
http://hg.mozilla.org/comm-central/rev/e3f5ac337ac5

I plan to leave this logging around, as I believe it will be broadly helpful to know what gloda is getting up-to (even when it generally should not be getting up to anything).
(Assignee)

Comment 5

8 years ago
Failures indicate that the gloda query is being dispatched which suggests that gloda is at fault.  (Technically, it happens prior to the dispatch...)

When the failure happens, it seems like they tend to happen for multiple separate mozmill runs.  Specifically, in one set of runs, the problem was observed in composition/test-forward-headers.js, junk-commands/test-junk-commands.js, and also in folder-display/*.  (linux x32 went nuts all over the place starting from test_basic_summarization, but linux x64 only had issues in test-opening-messages.js/test_open_messages_in_tabs.)

Although we use the same profile directory for each run, given that the rmtree should throw errors and this is linux (where we can blow away open files), it seems unlikely it's a problem with fighting another thunderbird instance over the database.

Things that seem possible:

- A gloda failure during startup is somehow making things bad, but we don't get to see the error message.  The problem with this theory is that we should see an exception or logged message when we later go to dispatch the query.  Also, we would expect this failure to affect all summarization attempts, whereas linux x64 demonstrates that sometimes it only happens to one thing.

- The gloda async thread is so "busy" that it is unable to return to us within 6 seconds.  This could be a combination of: backlog of events on the thread, mean scheduling, serious I/O backlogs.  For consideration, the main thread in one of the test_basic_summarization test took 447ms between the start of the test and when it got to the wait_for_message_display_completion that was no good.

- Some error is happening and XPCOM is eating it, possibly because an error handler actually has an error in it.

Possible next options that jump out at me:

- Try and dump all the error events that happen prior to the mozmill standup.  This might require a JS component in an extension be installed depending on whether the built-in error console is willing to share its events with us.

- More profuse gloda logging (when running under mozmill).

- Turn on that pref that makes XPCOM be more aggressive about logging JS errors.

- Try and gain some understanding of the load the system was under during the test run by having runtest.py dump the contents of /proc/loadavg on linux at the start and stop of a test.  Or if we have cactus logs for the VMs, be able to correlate them, etc.
(Assignee)

Updated

8 years ago
Depends on: 654775
(Assignee)

Comment 6

8 years ago
After consultation with jhopkins, it appears that the problem may be some of our VMs are using NFS storage that can be flakey.  He has filed bug 654775 to get those builders onto local storage.

The most obvious indicator may be this recent log from the 64bit linux VM:
http://arbpl.visophyte.org/?tree=Miramar&pushid=5899&log=42%3A1304527759.1304531372.23402.gz%3Afolder-display%2Ftest-summarization.js-test_summary_when_multiple_identities

There are delays of 2160ms and 3413ms between adding a message and the classification notification.  A move process also has a delay of 2246ms, and 1040ms in another place.

It's worth noting that such obvious difficulties were not exposed in most of the other logs I looked at; the ArbPL fancy pants logging does explicitly call out all delays between log messages greater than 200ms, so such things tend to jump out.  Which is to say, the NFS issue may not be the only problem, or the problem, etc.
Keywords: intermittent-failure
Whiteboard: [tb-orange]
Resolving WFM keyword:intermittent-failure bugs last modified >3 months ago, whose whiteboard contains none of:
{random,disabled,marked,fuzzy,todo,fails,failing,annotated,time-bomb,leave open}

There will inevitably be some false positives; for that (and the bugspam) I apologise. Filter on orangewfm.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.