Closed Bug 1918267 Opened 1 year ago Closed 1 year ago

DevTools can't be opened when any of the tabs is having an infinite loop

Categories

(DevTools :: General, defect, P2)

defect

Tracking

(firefox135 fixed)

RESOLVED FIXED
135 Branch
Tracking Status
firefox135 --- fixed

People

(Reporter: ochameau, Assigned: ochameau)

References

(Blocks 2 open bugs, Regressed 1 open bug)

Details

Attachments

(1 file)

STR:

  • open data:text/html,<script>while(true);</script> in a tab
  • open another tab and try to open DevTools

DevTools will stay blank and never initialize.

This is about the same code modified by bug 1898490.
The JS Process Actor of the frozen tab will have all its sendQuery methods to return a pending promise.
The JS Actor message/query is never received/processed by the DevToolsProcessChild JS Actor.
It isn't clear if we could know from the parent process side of things when a process is frozen or not responsive.
We could easily add a timeout for the watchTargets query, but that's more complicated for the addOrSetDataEntry methods as it would delay each data entry and the timeouts would accumulate. The DevTools would load but in tens of seconds.

We may switch from sendQuery to sendAsyncMessage for all the additional DOM Processes (i.e. all but the one related to the top level document),
but this is a pretty big invariant change as we expect these methods to only resolve once they have been processed in the target's threads.

Note that it may explain the blank and slow to open DevTools reports.
But I couldn't find a simple example beyond an infinite loop:

  • Infinite loop in a worker, or a worker paused by Atomics.wait doesn't block DevTools.
  • Another tab frozen because of a breakpoint, doesn't block other DevTools.
    May be some Web API allows to have a page or a process to be somewhat frozen? Or the engine sometime freeze some processes?
Severity: -- → S3
Priority: -- → P2

:whimboo found an easy way to trigger this:

Due to one of the content processes being blocked, DevTools can't be opened on any tab. Closing the viewsource tab doesn't fix the issue. Only killing the process manually helps to recover.

I filed bug 1926808 to get the case for view source investigated and fixed. So it will hopefully not be around that long.

See Also: → 1926808

Nika, DevTools is using JS Process Actor's to communicate data to all the content processes (we may debate about this requirement, but let's assume we do that now).
The ongoing challenge is that content processes may end up being stuck and sendQuery would not return, or slowly (like 10s+ in some real world scenario involving long CC).

I was wondering if JS Actor API should expose something to handle frozen/slow processes.
I came across the following code in browser custom element which uses BHR to avoid trying to send a query to a possibly hanging process. This appears to not be reliable enough for our usecases. BHR wouldn't necessarily flag a content process with an infinite loop (comment 0 STR)...

While I could come up with a wrapper on top of Actor API in order to ignore frozen processes via some custom time-based timeouts,
I was wondering if we could:

  • expose something on DOMProcess to know if a process is or is not responsive, which we could use similarly to canSend, to avoid trying to send a message.
  • have a timeout option in sendQuery.
  • anything which could help
    I imagine it may help a couple of frontend cobase to bail out instead of being blank, like we do.

About the frozen processes, we have a typical scenario (related to bug 1926530) where Element/Matrix cause 10s+ CC, see the following profile where devtools finally show up after the long CC in chat.mozilla.org process.

Thanks a lot for your insight!

Flags: needinfo?(nika)

I think the most likely solution here is to time out requests which take too long and move forwards if that's an acceptable approach. The easiest way to do that is probably to do something like this:

Promise.race([
    actor.sendQuery("..."),
    new Promise((resolve, reject) => setTimeout(() => reject(), 1000)),
]);

We could perhaps add a helper function like sendQueryWithTImeout(...), but I also wouldn't be too surprised if there's a JS utility somewhere to add a timeout to a promise.

I wasn't aware of the code using the process hang monitor to detect hung processes and avoid permitUnload calls, but it makes some sense. A thing to note is that this isn't using BHR (BHR is nightly/early beta only, so couldn't be used for something like this), but rather it's using the "a script is taking too long" process hang reporter component. I think all of the report tracking stuff is in the browser frontend code, so you'd need to call that directly through the frontend, and it wouldn't necessarily make sense to integrate into JS process actors.

Flags: needinfo?(nika)

Thanks a lot for your prompt reply!

I wasn't aware of the code using the process hang monitor to detect hung processes and avoid permitUnload calls, but it makes some sense. A thing to note is that this isn't using BHR (BHR is nightly/early beta only, so couldn't be used for something like this), but rather it's using the "a script is taking too long" process hang reporter component. I think all of the report tracking stuff is in the browser frontend code, so you'd need to call that directly through the frontend, and it wouldn't necessarily make sense to integrate into JS process actors.

Oh Right I mixed ProcessHangMonitor with BackgroundHangReport ;)
And yes, I wasn't considering involving neither of these two layers right into Process Actors.
I was rather wondering if something in IPC C++ API was available to better handle non-responsive processes.

Promise.race([
    actor.sendQuery("..."),
    new Promise((resolve, reject) => setTimeout(() => reject(), 1000)),
]);

We could perhaps add a helper function like sendQueryWithTImeout(...), but I also wouldn't be too surprised if there's a JS utility somewhere to add a timeout to a promise.

Yes, I could do something along these lines, with an additional stateful bit to immediately reject any subsequent call to sendQuery when the actor already timed out. And resume normal operation if the timed-out query finally resolved.

That's because DevTools end up emitting lots of queries and having a 1s timeout for each request would still significantly delay most operations.
The typical problematic queries are the followings:

Ultimately, we have to maintain a list of slow processes in order to temporarily ignore them, or at least not wait for their reply.

I'll try to craft the necessary workaround via an intermediate JS layer and we can see if that's reasonable.

DevTools currently broadcast its data to all processes (in order to reach all service workers, which may run in arbitrary processes).
We expect for the queries to be processed in the content process before replying back to the client/frontend in the related RDP request.

But, in some case, some content process may be stuck on a CC or slow page script.
When this happens, DevTools would be blank on opening or very slow to operate.

Let's try to avoid waiting for any query send to process which timed out on any past query.

Assignee: nobody → poirot.alex
Status: NEW → ASSIGNED
See Also: → 1928670

I'm still struggling with this C++ assertions, which is hard to debug as I don't know what is the faulty callsite, and can't reproduce locally:

[task 2024-11-27T10:26:36.009Z] 10:26:36     INFO - GECKO(1069) | [1069] Assertion failure: aValue.isObject(), at /builds/worker/checkouts/gecko/dom/bindings/BindingUtils.cpp:3418
[task 2024-11-27T10:26:36.014Z] 10:26:36     INFO -  Initializing stack-fixing for the first stack frame, this may take a while...
[task 2024-11-27T10:26:48.288Z] 10:26:48     INFO - GECKO(1069) | #01: mozilla::dom::AssertReturnTypeMatchesJitinfo(JSJitInfo const*, JS::Handle<JS::Value>) [dom/bindings/BindingUtils.cpp:3418]
[task 2024-11-27T10:26:48.289Z] 10:26:48     INFO - GECKO(1069) | #02: mozilla::dom::binding_detail::GenericGetter<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) [dom/bindings/BindingUtils.cpp:0]
[task 2024-11-27T10:26:48.289Z] 10:26:48     INFO - GECKO(1069) | #03: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) [js/src/vm/Interpreter.cpp:532]
[task 2024-11-27T10:26:48.290Z] 10:26:48     INFO - GECKO(1069) | #04: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) [js/src/vm/Interpreter.cpp:628]
[task 2024-11-27T10:26:48.290Z] 10:26:48     INFO - GECKO(1069) | #05: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) [js/src/vm/Interpreter.cpp:727]
[task 2024-11-27T10:26:48.291Z] 10:26:48     INFO - GECKO(1069) | #06: js::CallGetter(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:850]
[task 2024-11-27T10:26:48.292Z] 10:26:48     INFO - GECKO(1069) | #07: GetExistingProperty<(js::AllowGC)1>(JSContext*, js::MaybeRooted<JS::Value, (js::AllowGC)1>::HandleType, js::MaybeRooted<js::NativeObject*, (js::AllowGC)1>::HandleType, js::MaybeRooted<JS::PropertyKey, (js::AllowGC)1>::HandleType, js::PropertyInfoBase<unsigned int>, js::MaybeRooted<JS::Value, (js::AllowGC)1>::MutableHandleType) [js/src/vm/NativeObject.cpp:2177]

https://treeherder.mozilla.org/logviewer?job_id=484362876&repo=try&lineNumber=11789

I was able to fix a first occurence of this by waiting more correctly for async work done at the end of browser_dbg-breakpoints.js. Now it seems to reproduce again on browser_toolbox_meatball.js (and may be related to browser_toolbox_many_toggles.js).

It looks like it may be accessing the osPid on the JS Process Actor on timeout/resume (this.manager.osPid in the patch).

Let's see if try is fully green with this fix:
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=Y5pCeQs-QIGIX6jxvGkWQA.0&revision=1fa8e098761c05be73de41797ebfecd7292cecae

This issue was around manager getter which triggers this assertion anytime we access it after the JS Process Actor has been unregistered.
There must be something wrong in that getter, which may be correct on JSWindowActor, but not on JSProcessActor.

Pushed by apoirot@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e7660a1ea32a [devtools] Avoid waiting for replies of hanging processes in DevTools communications made to content processes. r=devtools-reviewers,jdescottes
Regressions: 1934545
Keywords: leave-open

Backed out for causing dt failures @ browser_target_command_browser_workers

TEST-UNEXPECTED-FAIL | devtools/shared/commands/target/tests/browser_target_command_browser_workers.js | retrieve the target for the worker -
Flags: needinfo?(poirot.alex)
Pushed by apoirot@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9c4cc3cd69f7 [devtools] Avoid waiting for replies of hanging processes in DevTools communications made to content processes. r=devtools-reviewers,jdescottes
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 135 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: