Closed Bug 1864809 Opened 11 months ago Closed 11 months ago

[windows opt perma] comm/calendar/test/browser/browser_calDAV_discovery.js, browser_cardDAV_init.js | Uncaught exception in test bound testCalendarWithOnlyReadPriv - waiting for calendar list to appear - timed out after 100 tries

Categories

(Calendar :: General, defect, P5)

Tracking

(thunderbird_esr115 unaffected, thunderbird120 unaffected)

RESOLVED FIXED
121 Branch
Tracking Status
thunderbird_esr115 --- unaffected
thunderbird120 --- unaffected

People

(Reporter: intermittent-bug-filer, Assigned: darktrojan)

References

(Regression)

Details

(Keywords: intermittent-failure, intermittent-testcase, regression)

Attachments

(2 files)

Summary: Intermittent comm/calendar/test/browser/browser_calDAV_discovery.js | Uncaught exception in test bound testCalendarWithOnlyReadPriv - waiting for calendar list to appear - timed out after 100 tries → [windows opt perma] comm/calendar/test/browser/browser_calDAV_discovery.js | Uncaught exception in test bound testCalendarWithOnlyReadPriv - waiting for calendar list to appear - timed out after 100 tries

I don't know what bug 1836700 did that caused this, but it looks like it did. I backed it out on Try and the failures went away.

My suspicion, given the tests that are failing and when they're failing, is that nsServerSocket (as used by nsHttpServer) is breaking somehow.

These are the failing tests, on Windows only, and not every time but almost every time:

  • comm/calendar/test/browser/browser_calDAV_discovery.js
  • comm/mail/components/addrbook/test/browser/browser_cardDAV_init.js
Keywords: regression
Regressed by: 1836700
Summary: [windows opt perma] comm/calendar/test/browser/browser_calDAV_discovery.js | Uncaught exception in test bound testCalendarWithOnlyReadPriv - waiting for calendar list to appear - timed out after 100 tries → [windows opt perma] comm/calendar/test/browser/browser_calDAV_discovery.js, browser_cardDAV_init.js | Uncaught exception in test bound testCalendarWithOnlyReadPriv - waiting for calendar list to appear - timed out after 100 tries

(In reply to Geoff Lankow (:darktrojan) from comment #1)

I don't know what bug 1836700 did that caused this, but it looks like it did. I backed it out on Try and the failures went away.

Interesting!

My suspicion, given the tests that are failing and when they're failing, is that nsServerSocket (as used by nsHttpServer) is breaking somehow.

nsServerSocket appears to be inherently main-thread only since it's XPCOM-based (nsIServerSocket.idl), so if it's impacted, I would expect it to be because the other end of the connection it was talking to involved a worker which is either:

  • now being GCed more aggressively, breaking the connection earlier (or not performing some network activities)
  • not being GCed, causing a connection to remain alive indefinitely, or resulting in more network activities happening that were not happening previously.

These are the failing tests, on Windows only, and not every time but almost every time:

  • comm/calendar/test/browser/browser_calDAV_discovery.js
  • comm/mail/components/addrbook/test/browser/browser_cardDAV_init.js

Running these tests with MOZ_LOG=WorkerPrivate:5 should produce debug logging that clarifies which workers are running and when they start and stop, etc.

To get the MOZ_LOG in there on fresh try runs, it's possible to do things like ./mach try fuzzy --env "MOZ_LOG=WorkerPrivate:5". I believe it should theoretically also be possible to re-trigger a variation of an existing job with logging enabled as a "custom action..." as documented at https://wiki.mozilla.org/ReleaseEngineering/TryServer#Re-running_tasks_with_custom_parameters_from_Treeherder, for comm-central revisions like the one from comment 0.

That said, it seems like the "retrigger" action provided when I use "custom action..." on the task from comment 0 may lack the required glue to actually propagate environment variables. I tried creating some tasks using "environment" and "env" keys, but neither task seems to include my MOZ_LOG addition. In comparison, if I try "custom action..." on one of Eden's try tasks I get a "retrigger-custom" action which both has "environment" in the schema and has it pre-populated in the "payload". I am only offered "retrigger" on mozilla-central, just like on comm-central, so I presume this is an issue of tree configuration where I either lack access rights and/or the tree is more restrictive for security reasons.

Thanks for the input Andrew.

It's just dawned on me where the worker is and why there's something different about Windows. We roll our own DNS lookup code using ctypes and we use a worker to do it asynchronously. I still have no idea what's going wrong here though.

Before fixing the actual bug, let's tidy up this confusing mess of a file which is both a module and a worker.

Assignee: nobody → geoff
Status: NEW → ASSIGNED

The worker is getting garbage collected before it has a chance to return with results.

Depends on D193866

Thanks for the link! The logic that creates the worker isn't retaining a reference to the Worker, which makes the Worker potentially eligible for cycle collection, so if our changes introduced a bug (spoiler alert, yes), that would explain what is happening.

// Note: no reference to `worker` is kept once this call returns,
// so nothing stops it from being cycle collected while the request is pending.
const worker = new BasePromiseWorker(LOCATION);
return worker.post("execute", [
  Services.appinfo.OS,
  "lookup",
  [...arguments],
]);
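
(For illustration, a minimal sketch of what retaining the worker for the lifetime of the request could look like; the activeWorkers set and the lookup wrapper are hypothetical, not the actual patch:)

// Hypothetical sketch: hold a strong reference until the request settles.
const activeWorkers = new Set();

function lookup(...args) {
  const worker = new BasePromiseWorker(LOCATION);
  activeWorkers.add(worker);
  return worker
    .post("execute", [Services.appinfo.OS, "lookup", args])
    .finally(() => activeWorkers.delete(worker));
}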

I think that in the changes for bug 1836700, while dealing with all the more complicated edge cases, we lost track of the more straightforward Worker.postMessage case. The MessageEventRunnable dispatch will end up being dispatched directly against the underlying nsThread in WorkerThread::DispatchAnyThread. There's also potentially a coverage gap for the other calls to nsThread::Dispatch from WorkerThread::Dispatch.

In IsEligibleForCC where we check the main-thread targets we should also be checking !mThread->HasPendingEvents(). I think we should be okay in terms of locking discipline (we need to be holding the WorkerPrivate::mMutex to access mThread anyways and this should be the only case that mutex would overlap with the ThreadEventQueue::mLock that will be acquired by the ThreadEventQueue::HasPendingEvent call made by nsThread::HasPendingEvents).

Eden, does this make sense to you as the likely problem and the appropriate fix? If so, we probably want a new bug in Workers that this can depend on. Note: I know you're out tomorrow, but because of the soft freeze I also don't think we can land a fix tomorrow or that we would want to risk a backout either, so I think this makes sense to address on Monday.

Flags: needinfo?(echuang)

(In reply to Geoff Lankow (:darktrojan) from comment #5)

The worker is getting garbage collected before it has a chance to return with results.

This should definitely fix the problem you're experiencing, but we (DOM: Workers) did introduce a bug here, so this change should not be necessary. (And since I forgot to mention it before, thank you for cc-ing us; it's very much appreciated to have found the bug quickly this way!)

That said, it could be reasonable in general to not be creating a fresh worker for every request. Like, if you had a single worker global that you created and scheduled to tear down at the earlier of:

  1. After 60 seconds of inactivity, and reset the timer every time you give it a new request.
  2. Being notified via AsyncShutdown that Thunderbird is shutting down.

That would probably be significantly less resource intensive.
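
(A rough sketch of that pattern, assuming BasePromiseWorker plus the Timer and AsyncShutdown modules; gWorker, gIdleTimer, the 60-second value, and the shutdown phase are illustrative, not an actual implementation:)

// Rough sketch only. LOCATION is the worker script URL already used by DNS.sys.mjs.
import { BasePromiseWorker } from "resource://gre/modules/PromiseWorker.sys.mjs";
import { AsyncShutdown } from "resource://gre/modules/AsyncShutdown.sys.mjs";
import { setTimeout, clearTimeout } from "resource://gre/modules/Timer.sys.mjs";

const IDLE_TIMEOUT_MS = 60 * 1000;
let gWorker = null;
let gIdleTimer = null;

function dropWorker() {
  if (gIdleTimer) {
    clearTimeout(gIdleTimer);
    gIdleTimer = null;
  }
  // Dropping our only reference lets the worker go away once it is idle.
  // A real implementation would also avoid doing this while a request is in flight.
  gWorker = null;
}

// Tear the worker down when the application shuts down.
AsyncShutdown.profileBeforeChange.addBlocker("DNS worker teardown", dropWorker);

function getWorker() {
  if (!gWorker) {
    gWorker = new BasePromiseWorker(LOCATION);
  }
  // Reset the idle timer every time a new request comes in.
  if (gIdleTimer) {
    clearTimeout(gIdleTimer);
  }
  gIdleTimer = setTimeout(dropWorker, IDLE_TIMEOUT_MS);
  return gWorker;
}

function lookup(...args) {
  return getWorker().post("execute", [Services.appinfo.OS, "lookup", args]);
}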

I'm going to land the patches here to green up our tree. If I understood it correctly, the second patch should not be needed once the bug you discuss is fixed. We should back it out once that's done.

Target Milestone: --- → 121 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/c7fc921e4031
Split DNS.jsm into DNS.sys.mjs and dnsWorker.js. r=mkmelin
https://hg.mozilla.org/comm-central/rev/feb2d7c5b41c
Keep a reference to DNS workers while they are active. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED

I think bug 1864459 is the same issue. I will provide a patch according to Andrew's comment on bug 1864459.

Flags: needinfo?(echuang)
See Also: → 1864459

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #7)

That said, it could be reasonable in general to not be creating a fresh worker for every request. Like, if you had a single worker global that you created and scheduled to tear down at the earlier of:

  1. After 60 seconds of inactivity, and reset the timer every time you give it a new request.
  2. Being notified via AsyncShutdown that Thunderbird is shutting down.

That would probably be significantly less resource intensive.

I had the same thought, although this code runs very infrequently in normal use and doesn't do a lot, so I left it for another day.
