Closed Bug 1821250 Opened 1 year ago Closed 1 month ago

Debugger panel is blank on stackblitz example

Categories

(DevTools :: Debugger, defect, P2)

defect

Tracking

(firefox128 fixed)

RESOLVED FIXED
128 Branch
Tracking Status
firefox128 --- fixed

People

(Reporter: nchevobbe, Assigned: ochameau)

Details

Attachments

(5 files)

  1. Open https://stackblitz.com/edit/express-simple-8wjs4b?file=pages%2Findex.html,index.js
  2. Open the debugger

-> it looks like we wait indefinitely

It seems to be related to these workers:
data:application/javascript,%0Aonmessage%20%3D%20function%20(ev)%20%7B%0A%20%20%20%20let%20%5Bia%2C%20index%2C%20value%5D%20%3D%20ev.data%3B%0A%20%20%20%20ia%20%3D%20new%20Int32Array(ia.buffer)%3B%0A%20%20%20%20let%20result%20%3D%20Atomics.wait(ia%2C%20index%2C%20value)%3B%0A%20%20%20%20postMessage(result)%3B%0A%7D%3B%0A
whose implementation is:

onmessage = function (ev) {
    let [ia, index, value] = ev.data;
    ia = new Int32Array(ia.buffer);
    let result = Atomics.wait(ia, index, value);
    postMessage(result);
};

For some reason we aren't able to instantiate a WorkerTarget correctly for them and this end up making watchTargets RDP method to never resolve.

On these workers nsIWorkerDebugger.initialize fails without any exception/error:
https://searchfox.org/mozilla-central/rev/cd2121e7d83af1b421c95e8c923db70e692dab5f/devtools/server/connectors/worker-connector.js#36
This never executes the worker.js file... and there is nothing saying that the worker is defunct, nsIWorkerDebugger.isClosed is false.

I think in the past I've already seen this in some tests, unfortunately, I couldn't find any bug about this.

Interesting. I know that code and it's coming from wasm-bindgen and is used as a polyfill for Atomics.waitAsync which is not yet available in Firefox (https://bugzilla.mozilla.org/show_bug.cgi?id=1467846).

See here for the code that checks for waitAsync https://github.com/rustwasm/wasm-bindgen/blob/main/crates/futures/src/task/multithread.rs#L168

And here's the worker code https://github.com/rustwasm/wasm-bindgen/blob/main/crates/futures/src/task/worker.js

I have tested this again in Firefox 110.0.1 and for some reason I am seeing sources all of a sudden and it also stops at the debugger statement. Has something been fixed? I know that in the past I have seen this as well but it wasn't very deterministic and usually it didn't work.

I tested this in Nightly again and there I don't see any sources.

For me opening the debugger after the following page is fully loaded, the debugger is blank on 110 as well as nightly:
https://stackblitz.com/edit/express-simple-8wjs4b?file=pages%2Findex.html,index.js
If you open the debugger before the page is fully loaded, or on another URL, it will work fine.

Do you happen to know how/where this Worker is loaded?
I'm wondering if it is loaded as an ES Module. If so, this issue would be a duplicated of bug 1816933.

I don't think it's loaded as an ES Module. It's loaded here.

Severity: -- → S3
Priority: -- → P1

I tried to produce a reduced test case by using following hello world:
https://github.com/rustwasm/wasm-bindgen/tree/main/examples/without-a-bundler

And then trying to mimic this code:
https://github.com/rustwasm/wasm-bindgen/blob/main/crates/futures/src/task/wait_async_polyfill.rs#L55-L56

I do reproduce the same data: URL worker... but it works fine.

elmdominic, would you somehow be able to help me narrow down to a simplier test case?

I tried to reproduce the same issue as well but it's somewhat hard to reproduce outside of StackBlitz and I believe it may also have something to do with the large assets we are loading? I don't know but I couldn't find a similar test case that is much smaller and I don't know what's going on on the Firefox debugger side.

I finally found the precise culprit. See the latest attachment.

It relates to Atomics.wait being used from any worker (so it isn't specific to wasm/rust).
If any worker is being blocked by this method, WorkerDebugger.initialize won't work. It is most likely paused and may resume if wait method resolves.
But given that the broken page doesn't resume, the DevTools never completes its loading sequence on stays empty.
It seems to also impact the console.

I'm lowering the priority to P2 as it is an edge case and DevTools work fine when opening the page on a simpler page first.

We should at least ensure that DevTools isn't 100% blank when we are getting into this edgecase.
But it might be tricky to solve the underlying issue. The debugger code around workers isn't having any active maintainer.

Priority: P1 → P2

So the issue is that WorkerDebugger::Initialize dispatches a CompileDebuggerScriptRunnable which never runs because call to Atomics.wait from the worker JS code will lock the thread execution (AFAIU!).
I verified and the dispatch will correctly add a new entry in WorkerPrivate::mDebuggerQueue over there:
https://searchfox.org/mozilla-central/rev/f078cd02746b29652c134b144f0629d47e378166/dom/workers/WorkerPrivate.cpp#1694

Atomics.wait will hold on on this particular line:
https://searchfox.org/mozilla-central/rev/f078cd02746b29652c134b144f0629d47e378166/js/src/builtin/AtomicsObject.cpp#592

 FutexThread::WaitResult retval = cx->fx.wait(cx, lock.unique(), timeout);

Which will end up calling pthread_cond_wait from there:
https://searchfox.org/mozilla-central/rev/f078cd02746b29652c134b144f0629d47e378166/mozglue/misc/ConditionVariable_posix.cpp#102-106

void mozilla::detail::ConditionVariableImpl::wait(MutexImpl& lock) {
  pthread_cond_t* ptCond = &platformData()->ptCond;
  pthread_mutex_t* ptMutex = &lock.platformData()->ptMutex;

  int r = pthread_cond_wait(ptCond, ptMutex);

It isn't clear what it means for the various event loop methods...
It looks like Atomics.wait is being processed from a call to WorkerPrivate::RunCurrentSyncLoop
and its call to pthread_cond_wait seems to make us being blocked on the call to NS_ProcessNextEvent over here:
https://searchfox.org/mozilla-central/rev/f078cd02746b29652c134b144f0629d47e378166/dom/workers/WorkerPrivate.cpp#4372

 MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(thread, false));

It sounds like the way Atomics.wait blocks the thread execution disallow running anything in the Worker Thread, including debugger tasks...

Andrew, do you have any insight? I produced a simple test to reproduce all this in https://phabricator.services.mozilla.com/D173175.

Flags: needinfo?(bugmail)

Thank you for creating a test case!

By inspection, I think the general problem here is that only the JS interrupt mechanism can interrupt an atomic. We use the JS interrupt mechanism for worker control runnables, but nsIWorkerDebugger::PostMessage creates DebuggerMessageEventRunnable which is a WorkerDebuggerRunnable which does have some custom dispatch logic, but it does not involve the interrupt mechanism.

I talked with :bholley briefly about this, and I think it's probably fine for us to use the JS interrupt mechanism for this. We may need to make a minor adjustment to compensate for the comment about re-entry, but we have the necessary primitives.

I think the main question becomes how we want to use the interrupt mechanism for devtools, so needinfo back to :ochameau.

As noted, currently we inherently wait for the existing task to finish. Using the interrupt mechanism is not guaranteed to be instant, so there could be some latency there too. But I see a few options for how we use it:

  1. Always use the interrupt mechanism immediately when scheduling the debugger runnable, similar to what we do for control runnables. This means the debugger will run ASAP, but some of the time it will find itself in the middle of content JS when it probably didn't have to.
  2. Use a timer to provide a short grace period for the currently running task to finish, like 10ms. The idea would be that there isn't really a huge advantage for us to interrupting JS in the middle of execution, especially when there is so much async variability in the data flow prior to us deciding to trigger the interrupt.
    • An upside to the grace period is if it turns out we have bugs when we interrupt the JS, we'll do it less!
  3. Make nsIWorkerDebugger::PostMessage take an argument that is in control of whether we should interrupt and what the delay should be.
    • An additional option would be to specify that we should trigger an interrupt but, rather than running debugger runnables when the interrupt fires, we just throw an uncatchable exception that should close out the current task forcibly. This might be appealing in a wide variety of situations. For example, if the debugger interrupts an IndexedDB request callback when it's on the stack and the user is trying to debug their use of IndexedDB, they will have a very bad experience because that IndexedDB transaction will be permanently held open until the debugger yields control flow. Specifically, this would matter if it calls WorkerDebuggerGlobalScope::enterEventLoop which spins an event loop for the debugger where only debugger runnables and control runnables are processed so it can remain in control of the event loop and avoid content stuff from happening.
    • That said, we could also maybe let the debugger global know in the "message" event it receives or in something we expose on the global that it is interrupting content. It could then call a method we provide that will make it throw an uncatchable exception on the way out if it wanted.

I'm not exactly clear on what the debugger's mental model is in this situation. It's possible the debugger code always assumed it was interrupting JS?

Flags: needinfo?(bugmail) → needinfo?(poirot.alex)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #17)

Thank you for creating a test case!

By inspection, I think the general problem here is that only the JS interrupt mechanism can interrupt an atomic. We use the JS interrupt mechanism for worker control runnables, but nsIWorkerDebugger::PostMessage creates DebuggerMessageEventRunnable which is a WorkerDebuggerRunnable which does have some custom dispatch logic, but it does not involve the interrupt mechanism.

TBH, this is so far from my typical landscape that I don't really understand the difference.
As said recently, this worker debugging code was only ever managed by Eddy...

It looks like the main difference is that control runnables are using:

    if (JSContext* cx = mJSContext) {
      MOZ_ASSERT(mThread);
      JS_RequestInterruptCallback(cx);
    }

between adding task to the queue and calling mCondVar.Notify(). It looks like it follows the same principles, but we have JS interupt in addition?
Is JS_RequestInterrupCallback forcing spidermonkey to call the WorkerPrivate::InterruptCallback next time any JS code runs? Or does it force something more immediate, like forcing calling this InterrupCallback right away?

I talked with :bholley briefly about this, and I think it's probably fine for us to use the JS interrupt mechanism for this. We may need to make a minor adjustment to compensate for the comment about re-entry, but we have the necessary primitives.

It's not clear how we would prevent re-entry?
I imagine the issue is that the interrup callback will fire for debugger script code running?
Would we callback calls when it relates to the debugger global or something?

I think the main question becomes how we want to use the interrupt mechanism for devtools, so needinfo back to :ochameau.

Again... Eddy knew about all these requirements. I do have a good grasp on how we are using WorkerDebugger API and how we are using Spidermonkey Debugger API, but I have only sparse comprehension on actual C++ code, threads, ...

As noted, currently we inherently wait for the existing task to finish. Using the interrupt mechanism is not guaranteed to be instant, so there could be some latency there too. But I see a few options for how we use it:

  1. Always use the interrupt mechanism immediately when scheduling the debugger runnable, similar to what we do for control runnables. This means the debugger will run ASAP, but some of the time it will find itself in the middle of content JS when it probably didn't have to.

I imagine we have different requirements.

When we are using WorkerDebugger.initialize we expect to evaluate the debugger script before any code from the worker starts running.
So it may sounds like we want to run ASAP, but actually we are having the |setDebuggerReady` API which pauses worker execution
and so WorkerDebugger.initialize can run late...

Another usecase would be "pause on next interrupt", which happens when you click on the pause button in the debugger UI.
This will trigger this code:
https://searchfox.org/mozilla-central/rev/dd2fe65d792943365a03fa996cdf9766829575b6/devtools/server/actors/thread.js#1545-1551
It isn't clear to me if the code we are discussing about is involved when using the JS piece of code?
Is onEnterFrame going to be a runnable processed by WorkerRunnable/WorkerPrivate/WorkerDebuggerRunnable?

If not, if what we are discussing about is only about DebuggerMessageEventRunnable and CompileDebuggerScriptRunnable...
It means this is only about WorkerDebugger.initialize and WorkerDebugger.postMessage, then, I think it is fine having them run "someday".
AFAIK we don't expect them to run ASAP. We typically use them to pass breakpoints, step, execute some snippet from the console, and none of them expect to run ASAP.
When we have timing consideration, we actually control the execution flow, like the setDebuggerReady API, or when the ThreadActor uses the Debugger API and spinEventLoop to pause the content code.

  1. Use a timer to provide a short grace period for the currently running task to finish, like 10ms. The idea would be that there isn't really a huge advantage for us to interrupting JS in the middle of execution, especially when there is so much async variability in the data flow prior to us deciding to trigger the interrupt.
    • An upside to the grace period is if it turns out we have bugs when we interrupt the JS, we'll do it less!

In the main thread, when we receive a new RDP packet or JSWindowActor message, I imagine we execute the handler at the end of an event loop and not in middle of execution. WorkerDebugger.postMessage should probably follow the same principles.

  1. Make nsIWorkerDebugger::PostMessage take an argument that is in control of whether we should interrupt and what the delay should be.
    • An additional option would be to specify that we should trigger an interrupt but, rather than running debugger runnables when the interrupt fires, we just throw an uncatchable exception that should close out the current task forcibly. This might be appealing in a wide variety of situations. For example, if the debugger interrupts an IndexedDB request callback when it's on the stack and the user is trying to debug their use of IndexedDB, they will have a very bad experience because that IndexedDB transaction will be permanently held open until the debugger yields control flow. Specifically, this would matter if it calls WorkerDebuggerGlobalScope::enterEventLoop which spins an event loop for the debugger where only debugger runnables and control runnables are processed so it can remain in control of the event loop and avoid content stuff from happening.

You completely lost me here. It sounds like you are suggesting doing something we aren't doing in the main thread.
It sounds like a new feature to ease debugging IndexedDB transactions. AFAIK, in the main thread, if you are an opened IDB transaction... it will be kept opened when hitting a breakpoint.

  • That said, we could also maybe let the debugger global know in the "message" event it receives or in something we expose on the global that it is interrupting content. It could then call a method we provide that will make it throw an uncatchable exception on the way out if it wanted.

I'm not exactly clear on what the debugger's mental model is in this situation.

It's possible the debugger code always assumed it was interrupting JS?

For some features of the debugger, yes, we expect to interrupt in middle of JS code execution. But this is implemented via WorkerDebugger.enterNestedEventLoop and spidermonkey Debugger API. And this mostly rely on JS callback hosted in the debugger script.
Otherwise, we do not expect to interrupt JS for WorkerDebugger.initialize/postMessage. But... we don't want to block these two methods when some JS code blocks the execution synchronously!

Flags: needinfo?(poirot.alex)

Believe it or not... but I completely forgot that I actually investigated this issue already 2 years ago.
This bug is a duplicate of bug 1706703 comment 17.
Back then, I even did an in-depth analysis of devtools code:
https://docs.google.com/document/d/1jTMd-H_BwH9_QNUDxPse80vq884_hMvd234lvE5gqY8/edit#

Andrew, would you have some cycles to help me clarify comment 18? Or could I ping someone else about this?

Flags: needinfo?(bugmail)

It sounds like the most practical approach is to have all WorkerDebuggerRunnables backstopped by a timer that will trigger a JS interrupt and let the debugger run if the timeout elapses and we're not already under debugger control. This would cover both nsIWorkerDebugger::Initialize and nsIWorkerDebugger::PostMessage. This would strike a good compromise between ensuring we can interrupt blocking or misbehaving content code while not gratuitously interrupting stack frames when we can wait for the a task to complete.

I'd suggest we start with a timer delay in the 1-2 second range but that's an arbitrary choice based on the following factors:

  1. We want something long enough so that most reasonably sized tasks have an opportunity to complete. I think we would expect content code running on workers to try and follow the rule of thumb for sufficiently responsive being 100-150ms, which means it would be desirable to allow for an order of magnitude of slack.
  2. 1-2 seconds seems like a threshold which falls under acceptable jank but where reaching 5 seconds would feel significantly worse.
  3. It's desirable to choose a delay long enough that's also strictly longer than any JS debugger runnable that's not calling into content would take to do whatever it's doing. We will explicitly avoid re-entrancy for the debugger itself unless it has transitioned back into content, but there's a cost to the JS interrupts and so constantly interrupting the debugger isn't super helpful.

My tentative plan would be:

  • WorkerPrivate::DispatchDebuggerRunnable will start a timer whenever we queue a debugger runnable and we don't currently have a timer.
    • Under our current implementation, the timer event will be targeted at the worker's own control event queue via the method above. For now we clear the timer as soon as we pull a runnable off the queue to execute it. An assumption is that in the case where the debugger is calling into content and we want to interrupt that again, that would be the result of a newly initiated call to nsIWorkerDebugger::PostMessage in response to the user clicking a button, etc.
    • For our planned PWorkerDebugger mechanism we would introduce a specific WorkerDebuggerEventTarget which would implement the control runnable timer logic for the WorkerDebuggerChild actor.
  • The control runnable will call a new helper method on WorkerPrivate called DebuggerInterruptRequest which will set a new flag mDebuggerInterruptRequested which would live next to WorkerThreadAccessible::mFrozen.
  • Our InterruptCallback will honor the flag and drain the debugger event queue as long as GetCurrentRealmOrNull indicates that we're in the content global. This prevents us from interrupting the debugger's own logic unless it has called back into content. Because WorkerDebuggerGlobalScope.createSandbox creates a bunch of SimpleGlobalObjects, I think it ends up being more involved to figure out if we're in the debugger versus being in the sole content global. There may be other possible ways to accomplish this check.
    • It's up to the debug runnable to enterEventLoop if it wants to pause execution rather than yielding control back once the debugger runnable queue is drained.
  • Our other calls to process control runnables which would already normally process debug runnables will clear the flag.

I also try and address your comments inline below to help provide extra context:

(In reply to Alexandre Poirot [:ochameau] from comment #18)

between adding task to the queue and calling mCondVar.Notify(). It looks like it follows the same principles, but we have JS interupt in addition?
Is JS_RequestInterrupCallback forcing spidermonkey to call the WorkerPrivate::InterruptCallback next time any JS code runs? Or does it force something more immediate, like forcing calling this InterrupCallback right away?

It's inherently async because the target runtime has to check the flag rather than receiving something like a POSIX signal, but we should expect the time it takes for this to happen to be rather short.

I talked with :bholley briefly about this, and I think it's probably fine for us to use the JS interrupt mechanism for this. We may need to make a minor adjustment to compensate for the comment about re-entry, but we have the necessary primitives.

It's not clear how we would prevent re-entry?

In my most recent investigation[1] I believe I determined the comment I reference is largely moot for the current implementation which appears to properly clear the interrupt request when invoking the callback. But even if we did need to take an action to clear the interrupt, this is more of an implementation detail that we can easily compensate for. Apologies for making that seem like a concern to address.

I imagine the issue is that the interrup callback will fire for debugger script code running?

As per the above, this isn't an issue, but for context I should note that the interrupt can and will fire when the debugger's JS code is running in order to service WorkerControlRunnables during the normal course of operation, but that doesn't actually pose a problem. The logic I describe in my proposal at the top of this comment about only processsing debug runnables if we're currently in content will prevent (undesirable) re-entrancy. And in terms of normal control runnables, we definitely want to be process them because we need to shut down a worker when it's canceled.

I should note there is a potential DevTools desire to prevent content JS from terminating the worker via Worker.terminate while the debugger is actively debugging. Also potentially preventing self.close from taking an effect. But that would be implemented by adding an explicit lifetime state machine that would prevent the worker from transitioning out of Running rather than a case where we'd ignore the control runnables.

When we are using WorkerDebugger.initialize we expect to evaluate the debugger script before any code from the worker starts running.
So it may sounds like we want to run ASAP, but actually we are having the |setDebuggerReady` API which pauses worker execution
and so WorkerDebugger.initialize can run late...

Right, setDebuggerReady prevents us from needing to interrupt any content. So it's only a problem like in this bug where we're attaching the debugger after the worker has already started.

Another usecase would be "pause on next interrupt", which happens when you click on the pause button in the debugger UI.
This will trigger this code:
https://searchfox.org/mozilla-central/rev/dd2fe65d792943365a03fa996cdf9766829575b6/devtools/server/actors/thread.js#1545-1551
It isn't clear to me if the code we are discussing about is involved when using the JS piece of code?
Is onEnterFrame going to be a runnable processed by WorkerRunnable/WorkerPrivate/WorkerDebuggerRunnable?

thread.js seems to do its execution inside the worker; I think if we wanted to explicitly request interruption of running content on the worker it looks like it would need to happen in the front's interrupt method? My proposal here sidesteps the need to do that, but that might be a direction we would want to eventually go.

To do that, presumably the transport call to postMessage could inspect the packet for a special flag?

I might also be misunderstanding the moving parts here; my understanding of the actors/fronts is from a long time ago. (Although I should say I have some ideas for how searchfox could help make the relationships explicit, etc.!)

If not, if what we are discussing about is only about DebuggerMessageEventRunnable and CompileDebuggerScriptRunnable...
It means this is only about WorkerDebugger.initialize and WorkerDebugger.postMessage, then, I think it is fine having them run "someday".
AFAIK we don't expect them to run ASAP. We typically use them to pass breakpoints, step, execute some snippet from the console, and none of them expect to run ASAP.

(This helped inform my proposal up top, thanks!)

When we have timing consideration, we actually control the execution flow, like the setDebuggerReady API, or when the ThreadActor uses the Debugger API and spinEventLoop to pause the content code.

(ditto)

In the main thread, when we receive a new RDP packet or JSWindowActor message, I imagine we execute the handler at the end of an event loop and not in middle of execution. WorkerDebugger.postMessage should probably follow the same principles.

Yeah, this should be largely consistent with that unless we hit the timeout.

You completely lost me here. It sounds like you are suggesting doing something we aren't doing in the main thread.

Yeah, this would have been new. It sounds like we don't want this or its complexity.

It sounds like a new feature to ease debugging IndexedDB transactions. AFAIK, in the main thread, if you are an opened IDB transaction... it will be kept opened when hitting a breakpoint.

I just meant that as an example of the potential for violating user expectations if we were regularly interrupting content when we didn't have to. I think my proposal informed by your responses means that, for this example, we'd only end up interrupting a very long-running IDB transaction where the user would indeed want to see what content was doing in there.

For some features of the debugger, yes, we expect to interrupt in middle of JS code execution. But this is implemented via WorkerDebugger.enterNestedEventLoop and spidermonkey Debugger API. And this mostly rely on JS callback hosted in the debugger script.
Otherwise, we do not expect to interrupt JS for WorkerDebugger.initialize/postMessage. But... we don't want to block these two methods when some JS code blocks the execution synchronously!

We should be good with this proposal then.

Flags: needinfo?(bugmail)

Stating the obvious here, but comment 21 describes a nice plan to try to address this issue.
This bug is stuck on finding someone to implementing this plan.
DevTools team can try implementing it, but none of us are used to contribute C++ patches so it would take a bit and be planned versus all other ongoing tasks and goals.

I do have a promising patch based on comment 21. Let me try to be mentored on this.

Assignee: nobody → poirot.alex

Atomics.wait prevent WorkerDebuggerRunnable's from running by blocking the event loop.
While dispatching these runnables, also spawn a short timer which would interrupt the worker
and tentatively try to drain the debugger queue and resume any debugger runnable which may be blocked.

I rebased the test patch (https://phabricator.services.mozilla.com/D173175) to easily test my tentative patch (https://phabricator.services.mozilla.com/D194081)

The WorkerDebugger.initialize call is no longer blocked and the test now completes...
..but I get something totaly unrelated to throw when running the test:

$ ./mach mochitest --headless dom/workers/test/browser_WorkerDebugger_waiting.initialize.js
...
FAIL test left unexpected repeating timer dom::PeriodicGCTimerCallback (duration: 1000ms)

Any help would be much appreciated.

From comment 21, I ignored two things:

  • the WorkerDebuggerEventTarget, I don't quite follow this point.
  • the filtering by global, to use GetCurrentRealmOrNull to only drain when we aren't in the debugger global.
    I tried this, but it wasn't working:
      // Prevents interrupting the debugger's own logic unless it has called
      // back into content
      WorkerDebuggerGlobalScope* globalScope = DebuggerGlobalScope();
      if (globalScope) {
        JSObject* global = JS::CurrentGlobalOrNull(aCx);
        if (global &&
            // Isn't one of the many sandboxes
            (SimpleGlobalObject::SimpleGlobalType(global) !=
                 SimpleGlobalObject::GlobalType::WorkerDebuggerSandbox &&
             // nor the main debugger script
             global != globalScope->GetGlobalJSObject())) {
          DrainDebuggerEventQueue();
        }
      }

Thanks for taking this on!

(In reply to Alexandre Poirot [:ochameau] from comment #25)

The WorkerDebugger.initialize call is no longer blocked and the test now completes...
..but I get something totaly unrelated to throw when running the test:

$ ./mach mochitest --headless dom/workers/test/browser_WorkerDebugger_waiting.initialize.js
...
FAIL test left unexpected repeating timer dom::PeriodicGCTimerCallback (duration: 1000ms)

I believe that the Tester.getNewRepeatingTimers call is complaining about the worker's GC timer because it doesn't look like any action is being taken to terminate the worker (until the global that the Worker instance is torn down).

Your test probably wants to invoke worker.terminate() as part of the test and it could be reasonable to wait for the test to also use something like waitForUnregister to help ensure the worker actually went away.

From comment 21, I ignored two things:

  • the WorkerDebuggerEventTarget, I don't quite follow this point.

This is specific to an enhancement (https://bugzilla.mozilla.org/show_bug.cgi?id=1672493#c1) that has not happened yet.

  • the filtering by global, to use GetCurrentRealmOrNull to only drain when we aren't in the debugger global.

I'm not 100% on my exact thought, but I think my idea was to compare against the GetGlobalJSObject (like you've done) from WorkerPrivate::GlobalScope instead of WorkerPrivate::DebuggerGlobalScope since there should only ever be one worker global whereas there currently can be a bunch of debugger globals. So we infer we're in the debugger if we're not in content.

edit addition: I should note that it seems plausible that the check mechanism doesn't work like I was assuming, so it's potentially worthwhile trying to validate the idea by adding a printf and also creating a situation where you try and trigger an interrupt when the debugger is on the stack. This could be a reasonable test to add, but does fundamentally end up requiring a busy-loop which is not a great thing to do in a test.

Attachment #9324228 - Attachment description: Bug 1821250 - Demonstrate issues between Atomics.wait and WorkerDebugger::Initialize → Bug 1821250 - Test WorkerDebugger APIs against a worker blocked on Atomics.wait.

Thanks a lot for the prompt feedback!

worker.terminate fixed the test and the check against WorkerPrivate::GlobalScope seems to work great.

Now let's see what try thinks about such patch:
https://treeherder.mozilla.org/jobs?repo=try&revision=3cb5279547c46d86975e6c3cb3672674841a913f

Hum. I do struggle a bit with MOZ_CAN_RUN_SCRIPT.
I wanted to factorize the code draining the debugger event queue (DrainDebuggerEventQueue)... but it is called from a method with MOZ_CAN_RUN_SCRIPT and another one without. Is it fine to spread the MOZ_CAN_RUN_SCRIPT flag on all eventually nested dependencies?

(In reply to Alexandre Poirot [:ochameau] from comment #28)

Hum. I do struggle a bit with MOZ_CAN_RUN_SCRIPT.
I wanted to factorize the code draining the debugger event queue (DrainDebuggerEventQueue)... but it is called from a method with MOZ_CAN_RUN_SCRIPT and another one without. Is it fine to spread the MOZ_CAN_RUN_SCRIPT flag on all eventually nested dependencies?

Yes. It's definitely fine to add MOZ_CAN_RUN_SCRIPT to WorkerPrivate::InterruptCallback and its anonymous namespace friend of the same name. There are situations where it may be appropriate to add MOZ_CAN_RUN_SCRIPT_BOUNDARY as mentioned in the docs but in general it's usually appropriate to propagate MOZ_CAN_RUN_SCRIPT.

That said, you will probably also want to add jobs to the underlying phab try push for the hazard analysis to make sure that if there are any new problems the hazard analysis detects that we can make sure to address those at the same time.

This is the second time I hit MOZ_CAN_RUN_SCRIPT. It is still hard to investigate.

  • ./mach static-analysis check --outgoing doesn't report the error at all (It may be because it wasn't part of the modified files),
  • ./mach static-analysis check dom/workers/RuntimeService.cpp does report it, but the error is lost in a sea of warnings.

But let's ignore that, I'm now stuck on a trivial C++ issue, which I'm not 100% confident on how to address

dom/workers/RuntimeService.cpp:459:10: error: arguments must all be strong refs or caller's parameters when calling a function marked as MOZ_CAN_RUN_SCRIPT (including the implicit object argument).  'worker' is neither.
459 |   return worker->InterruptCallback(aCx);

Which relates to this code:

MOZ_CAN_RUN_SCRIPT bool InterruptCallback(JSContext* aCx) {
  WorkerPrivate* worker = GetWorkerPrivateFromContext(aCx);
  MOZ_ASSERT(worker);

  // Now is a good time to turn on profiling if it's pending.
  PROFILER_JS_INTERRUPT_CALLBACK();

  return worker->InterruptCallback(aCx);

This sounds trivial, but I'm clueless regarding how to address this error.
Moving to:

  RefPtr<WorkerPrivate> worker = GetWorkerPrivateFromContext(aCx);

please the static analysis, but crashes with:

Hit MOZ_CRASH(WorkerPrivate not thread-safe) at /mnt/desktop/gecko-dev/xpcom/base/nsISupportsImpl.cpp:43

In a brute-force approach, the following seems to no longer crash:

CheckedUnsafePtr<WorkerPrivate> worker = GetWorkerPrivateFromContext(aCx);

but... I'm getting the static anylisis error back :/

I think you want to annotate the call to worker->InterruptCallback(...) with MOZ_KnownLive so it looks like MOZ_KnownLive(worker)->InterruptCallback(...). Note that this is different from the MOZ_KNOWN_LIVE class member annotation.

Restating my understanding of the situation for context:

  • PerformDebuggerMicroTaskCheckpoint needs to call CycleCollectedJSContext::PerformDebuggerMicroTaskCheckpoint which is MOZ_CAN_RUN_SCRIPT so PerformDebuggerMicroTaskCheckpoint needs to be MOZ_CAN_RUN_SCRIPT or MOZ_CAN_RUN_SCRIPT_BOUNDARY.
  • WorkerPrivate::InterruptCallback is adding a call to PerformDebuggerMicroTaskCheckpoint so the same annotation rules apply unless we mark the method with a boundary.
  • WorkerPrivate is not threadsafe and can only have refcounts added and removed on the parent (owning) thread.
  • Uses of WorkerPrivate* on the worker thread are guaranteed by the (private) WorkerPrivate::mSelfRef which is guaranteed to be valid for as long as the thread is alive.
  • InterruptCallback is guaranteed to be invoked only on the worker thread and only while the JS runtime is active, which means we can categorically trust the returned WorkerPrivate* to be live and mark it as such.

Does this fix the problem? (Also, please needinfo me if it does not or you have other problems.)

Flags: needinfo?(poirot.alex)

Thanks a lot for this helpful last comment!

It looks like it works great now, without any warning.
I pushed to try with hazard jobs, it looks green. Was that the jobs you were looking for?
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=GkC6Ti_vQom1n55BNhbkJg.0&revision=61a5f82fb11d921f23bfbe9ea09ce5be36a767b1

Flags: needinfo?(poirot.alex)

(In reply to Alexandre Poirot [:ochameau] from comment #32)

Thanks a lot for this helpful last comment!

It looks like it works great now, without any warning.
I pushed to try with hazard jobs, it looks green. Was that the jobs you were looking for?
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=GkC6Ti_vQom1n55BNhbkJg.0&revision=61a5f82fb11d921f23bfbe9ea09ce5be36a767b1

adding needinfo for Andrew ^

Flags: needinfo?(bugmail)

Try looks green enough?
https://treeherder.mozilla.org/jobs?repo=try&revision=ac5aa834b9e43389a1264e6893ffc837ab20324d

I only have a small doubt about the following assertion:

[task 2024-02-13T10:02:26.201Z] 10:02:26     INFO - GECKO(1303) | 4463> Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/obj-build/dist/include/mozilla/RefPtr.h:343
[task 2024-02-13T10:02:26.204Z] 10:02:26     INFO -  Initializing stack-fixing for the first stack frame, this may take a while...
[task 2024-02-13T10:02:54.870Z] 10:02:54     INFO - GECKO(1303) | 4463> #01: mozilla::dom::(anonymous namespace)::DebugWorkerRefs::DebugWorkerRefs(RefPtr<mozilla::dom::ThreadSafeWorkerRef>&, std::string const&) [dom/xhr/XMLHttpRequestMainThread.cpp:209]
[task 2024-02-13T10:02:54.871Z] 10:02:54     INFO - GECKO(1303) | 4463> #02: mozilla::dom::XMLHttpRequestMainThread::OnStartRequest(nsIRequest*) [dom/xhr/XMLHttpRequestMainThread.cpp:1952]
[task 2024-02-13T10:02:54.873Z] 10:02:54     INFO - GECKO(1303) | 4463> #03: mozilla::net::nsStreamListenerWrapper::OnStartRequest(nsIRequest*) [netwerk/base/nsStreamListenerWrapper.h:44]
[task 2024-02-13T10:02:54.874Z] 10:02:54     INFO - GECKO(1303) | 4463> #04: nsJARChannel::OnStartRequest(nsIRequest*) [modules/libjar/nsJARChannel.cpp:1204]

https://treeherder.mozilla.org/logviewer?job_id=447063459&repo=try&lineNumber=9384

It is DevTools and Worker related. But this stack relates to XHR... which isn't quite the topic of this patch.

Flags: needinfo?(bugmail)
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/94145877428a
Test WorkerDebugger APIs against a worker blocked on Atomics.wait. r=asuth
https://hg.mozilla.org/integration/autoland/rev/d46945ada9ec
Use a timer on debugger control runnable to ensure running them even if worker uses Atomics.wait. r=asuth

Backed out for causing multiple failures on Linux 18.04 x64 WebRender tsan opt

  • Backout link
  • Push with failures
  • Failure Log
  • Failure line: SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) /builds/worker/checkouts/gecko/mozglue/misc/Mutex_posix.cpp:94:3 in mutexLock
Flags: needinfo?(poirot.alex)

I'm bit clueless as to how to address this cross-thread mutex dead lock.

But here is a quick summary of the report:

Cycle in lock order graph: M0 (0x7b0c0000b400) => M1 (0x7b2c0006fda0) => M2 (0x7b0c00122fd0) => M0
   Mutex M1 acquired here while holding mutex M0 in main thread:
     #4 BaseAutoLock /builds/worker/workspace/obj-build/dist/include/mozilla/Mutex.h:237:11 (libxul.so+0x77050d2)
     #5 IsOnCurrentThreadInfallible /builds/worker/checkouts/gecko/dom/workers/WorkerEventTarget.cpp:183:17 (libxul.so+0x77050d2)
     #6 mozilla::dom::WorkerEventTarget::IsOnCurrentThread(bool*) /builds/worker/checkouts/gecko/dom/workers/WorkerEventTarget.cpp:195:25 (libxul.so+0x77050d2)
     #7 TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp, unsigned int) /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:1141:34 (libxul.so+0x31db12e) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #8 TimerThreadWrapper::FindNextFireTimeForCurrentThread(mozilla::TimeStamp, unsigned int) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:120:25 (libxul.so+0x31f5f56) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)

This worker related code locks the same mutex as the code (MutexAutoLock lock(mMutex);) I'm modifying over there

Mutex M2 acquired here while holding mutex M1 in thread T16:
     #5 mozilla::dom::WorkerPrivate::DispatchControlRunnable(already_AddRefed<mozilla::dom::WorkerControlRunnable>) /builds/worker/checkouts/gecko/dom/workers/WorkerPrivate.cpp:1674:19 (libxul.so+0x7716f70)
     #6 mozilla::dom::WorkerControlRunnable::DispatchInternal() /builds/worker/checkouts/gecko/dom/workers/WorkerRunnable.cpp (libxul.so+0x7731e74) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #7 mozilla::dom::WorkerRunnable::Dispatch() /builds/worker/checkouts/gecko/dom/workers/WorkerRunnable.cpp:103:10 (libxul.so+0x771485c) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #8 mozilla::dom::WorkerEventTarget::Dispatch(already_AddRefed<nsIRunnable>, unsigned int) /builds/worker/checkouts/gecko/dom/workers/WorkerEventTarget.cpp:137:11 (libxul.so+0x7704c54) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)

DispatchControlRunnable involves a lock over there, with similar mutex MutexAutoLock lock(mMutex);. But I imagine mMutex isn't the same one as the previous case?

 Mutex M0 acquired here while holding mutex M2 in main thread:
     #5 BaseAutoLock /builds/worker/workspace/obj-build/dist/include/mozilla/Mutex.h:237:11 (libxul.so+0x31f5de7)
     #6 TimerThreadWrapper::RemoveTimer(nsTimerImpl*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:109:32 (libxul.so+0x31f5de7)
     #7 InitCommon /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:415:18 (libxul.so+0x31f7a9d) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #8 nsTimerImpl::InitHighResolutionWithNamedFuncCallback(void (*)(nsITimer*, void*), void*, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, char const*) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:447:10 (libxul.so+0x31f7a9d)
     #9 InitWithNamedFuncCallback /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:433:10 (libxul.so+0x31f7034) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #10 nsTimer::InitWithNamedFuncCallback(void (*)(nsITimer*, void*), void*, unsigned int, unsigned int, char const*) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.h:199:3 (libxul.so+0x31f7034)
     #11 mozilla::dom::WorkerPrivate::DispatchDebuggerRunnable(already_AddRefed<mozilla::dom::WorkerRunnable>) /builds/worker/checkouts/gecko/dom/workers/WorkerPrivate.cpp:1721:3 (libxul.so+0x77172ce) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
     #12 mozilla::dom::WorkerRunnable::DispatchInternal() /builds/worker/checkouts/gecko/dom/workers/WorkerRunnable.cpp:115:14 (libxul.so+0x7730714) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)  

This is my new code, mDebuggerInterruptTimer->InitWithNamedFuncCallback(...).
Which creates a new Timer and down the line involves this lock. But now this is really a distinct mutex StaticMutexAutoLock lock(sMutex);.

T16 Thread is a Timer thread, which seems to be related to startup cache and not the new timer added in this patch.

#13 nsTimer::InitWithNamedFuncCallback(void (*)(nsITimer*, void*), void*, unsigned int, unsigned int, char const*) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.h:199:3 (libxul.so+0x31f7034)
#14 ResetStartupWriteTimer /builds/worker/checkouts/gecko/startupcache/StartupCache.cpp:861:11 (libxul.so+0x9c19acd) (BuildId: ed310c840c6114b11dbbe542f22dc80cf558339c)
Flags: needinfo?(poirot.alex)

Andrew, would you have some free cycle to help me about the mutex dead lock? Or someone I could ping to help me?

Flags: needinfo?(bugmail)

I have specific suggestions on the PR derived from the analysis below which I'm including for completeness, but hopefully the PR notes make sense on their own.

Analysis Notes

The mutexes as discussed in https://treeherder.mozilla.org/logviewer?job_id=447301075&repo=autoland&lineNumber=4377

In terms of stacks from https://treeherder.mozilla.org/logviewer?job_id=447301075&repo=autoland&lineNumber=4377

  • M1 acquired while M0 on the main thread is TimerThread::FindNextFireTimeForCurrentThread
    is checking all the timers on the main thread if they want to run on the main thread.
    • We can't avoid this.
  • M2 acquired while holding M1 on the timer thread. DispatchControlRunnable needs
    to acquire the WorkerPrivate Mutex (M2) because of how that's structured right
    now.
    • It would be nice to avoid changing things around this, but presumably we
      could split WorkerPrivate::mMutex.
  • M0 acquired while holding M2 on the main thread in WorkerPrivate::DispatchDebuggerRunnable;
    this is us trying to create the timer while holding M2.
    • In terms of creation, because we only ever set mDebuggerInterruptTimer on
      the main thread, we can potentially drop the mutex while we set up the timer
      and then reacquire the mutex and swap the pointer/refcount into place.
      Because dropping the reference also potentially will involve M0 in the case
      where the timer hasn't actually fired, we will need to use the same idiom of
      swapping the pointer into a local so that we can drop the mutex lock before
      dropping the refcount.
Flags: needinfo?(bugmail)
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/db9e48be096b
Test WorkerDebugger APIs against a worker blocked on Atomics.wait. r=asuth
https://hg.mozilla.org/integration/autoland/rev/6fd4fff71bf6
Use a timer on debugger control runnable to ensure running them even if worker uses Atomics.wait. r=asuth
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 128 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: