DevTools can't be opened when any of the tabs is having an infinite loop
Categories
(DevTools :: General, defect, P2)
Tracking
(firefox135 fixed)
| Tracking | Status | |
|---|---|---|
| firefox135 | --- | fixed |
People
(Reporter: ochameau, Assigned: ochameau)
References
(Blocks 2 open bugs, Regressed 1 open bug)
Details
Attachments
(1 file)
STR:
- open
data:text/html,<script>while(true);</script>in a tab - open another tab and try to open DevTools
DevTools will stay blank and never initialize.
This is about the same code modified by bug 1898490.
The JS Process Actor of the frozen tab will have all its sendQuery methods to return a pending promise.
The JS Actor message/query is never received/processed by the DevToolsProcessChild JS Actor.
It isn't clear if we could know from the parent process side of things when a process is frozen or not responsive.
We could easily add a timeout for the watchTargets query, but that's more complicated for the addOrSetDataEntry methods as it would delay each data entry and the timeouts would accumulate. The DevTools would load but in tens of seconds.
We may switch from sendQuery to sendAsyncMessage for all the additional DOM Processes (i.e. all but the one related to the top level document),
but this is a pretty big invariant change as we expect these methods to only resolve once they have been processed in the target's threads.
Note that it may explain the blank and slow to open DevTools reports.
But I couldn't find a simple example beyond an infinite loop:
- Infinite loop in a worker, or a worker paused by Atomics.wait doesn't block DevTools.
- Another tab frozen because of a breakpoint, doesn't block other DevTools.
May be some Web API allows to have a page or a process to be somewhat frozen? Or the engine sometime freeze some processes?
Updated•1 year ago
|
Comment 1•1 year ago
|
||
:whimboo found an easy way to trigger this:
- open https://html.spec.whatwg.org/
- use View Source
- (view source will load forever, seems to be blocked on reflows)
- try to open devtools
Due to one of the content processes being blocked, DevTools can't be opened on any tab. Closing the viewsource tab doesn't fix the issue. Only killing the process manually helps to recover.
Comment 2•1 year ago
|
||
I filed bug 1926808 to get the case for view source investigated and fixed. So it will hopefully not be around that long.
Updated•1 year ago
|
| Assignee | ||
Comment 3•1 year ago
|
||
Nika, DevTools is using JS Process Actor's to communicate data to all the content processes (we may debate about this requirement, but let's assume we do that now).
The ongoing challenge is that content processes may end up being stuck and sendQuery would not return, or slowly (like 10s+ in some real world scenario involving long CC).
I was wondering if JS Actor API should expose something to handle frozen/slow processes.
I came across the following code in browser custom element which uses BHR to avoid trying to send a query to a possibly hanging process. This appears to not be reliable enough for our usecases. BHR wouldn't necessarily flag a content process with an infinite loop (comment 0 STR)...
While I could come up with a wrapper on top of Actor API in order to ignore frozen processes via some custom time-based timeouts,
I was wondering if we could:
- expose something on DOMProcess to know if a process is or is not responsive, which we could use similarly to
canSend, to avoid trying to send a message. - have a timeout option in
sendQuery. - anything which could help
I imagine it may help a couple of frontend cobase to bail out instead of being blank, like we do.
About the frozen processes, we have a typical scenario (related to bug 1926530) where Element/Matrix cause 10s+ CC, see the following profile where devtools finally show up after the long CC in chat.mozilla.org process.
Thanks a lot for your insight!
Comment 4•1 year ago
|
||
I think the most likely solution here is to time out requests which take too long and move forwards if that's an acceptable approach. The easiest way to do that is probably to do something like this:
Promise.race([
actor.sendQuery("..."),
new Promise((resolve, reject) => setTimeout(() => reject(), 1000)),
]);
We could perhaps add a helper function like sendQueryWithTImeout(...), but I also wouldn't be too surprised if there's a JS utility somewhere to add a timeout to a promise.
I wasn't aware of the code using the process hang monitor to detect hung processes and avoid permitUnload calls, but it makes some sense. A thing to note is that this isn't using BHR (BHR is nightly/early beta only, so couldn't be used for something like this), but rather it's using the "a script is taking too long" process hang reporter component. I think all of the report tracking stuff is in the browser frontend code, so you'd need to call that directly through the frontend, and it wouldn't necessarily make sense to integrate into JS process actors.
| Assignee | ||
Comment 5•1 year ago
|
||
Thanks a lot for your prompt reply!
I wasn't aware of the code using the process hang monitor to detect hung processes and avoid permitUnload calls, but it makes some sense. A thing to note is that this isn't using BHR (BHR is nightly/early beta only, so couldn't be used for something like this), but rather it's using the "a script is taking too long" process hang reporter component. I think all of the report tracking stuff is in the browser frontend code, so you'd need to call that directly through the frontend, and it wouldn't necessarily make sense to integrate into JS process actors.
Oh Right I mixed ProcessHangMonitor with BackgroundHangReport ;)
And yes, I wasn't considering involving neither of these two layers right into Process Actors.
I was rather wondering if something in IPC C++ API was available to better handle non-responsive processes.
Promise.race([ actor.sendQuery("..."), new Promise((resolve, reject) => setTimeout(() => reject(), 1000)), ]);We could perhaps add a helper function like
sendQueryWithTImeout(...), but I also wouldn't be too surprised if there's a JS utility somewhere to add a timeout to a promise.
Yes, I could do something along these lines, with an additional stateful bit to immediately reject any subsequent call to sendQuery when the actor already timed out. And resume normal operation if the timed-out query finally resolved.
That's because DevTools end up emitting lots of queries and having a 1s timeout for each request would still significantly delay most operations.
The typical problematic queries are the followings:
- https://searchfox.org/mozilla-central/rev/044802a2108a5163bd5288ea18eb6a88234d45f0/devtools/server/actors/watcher.js#241,264
- https://searchfox.org/mozilla-central/rev/044802a2108a5163bd5288ea18eb6a88234d45f0/devtools/server/actors/watcher.js#554
- https://searchfox.org/mozilla-central/rev/044802a2108a5163bd5288ea18eb6a88234d45f0/devtools/server/actors/watcher.js#773
Ultimately, we have to maintain a list of slow processes in order to temporarily ignore them, or at least not wait for their reply.
I'll try to craft the necessary workaround via an intermediate JS layer and we can see if that's reasonable.
| Assignee | ||
Comment 6•1 year ago
|
||
DevTools currently broadcast its data to all processes (in order to reach all service workers, which may run in arbitrary processes).
We expect for the queries to be processed in the content process before replying back to the client/frontend in the related RDP request.
But, in some case, some content process may be stuck on a CC or slow page script.
When this happens, DevTools would be blank on opening or very slow to operate.
Let's try to avoid waiting for any query send to process which timed out on any past query.
Updated•1 year ago
|
| Assignee | ||
Comment 7•1 year ago
|
||
I'm still struggling with this C++ assertions, which is hard to debug as I don't know what is the faulty callsite, and can't reproduce locally:
[task 2024-11-27T10:26:36.009Z] 10:26:36 INFO - GECKO(1069) | [1069] Assertion failure: aValue.isObject(), at /builds/worker/checkouts/gecko/dom/bindings/BindingUtils.cpp:3418
[task 2024-11-27T10:26:36.014Z] 10:26:36 INFO - Initializing stack-fixing for the first stack frame, this may take a while...
[task 2024-11-27T10:26:48.288Z] 10:26:48 INFO - GECKO(1069) | #01: mozilla::dom::AssertReturnTypeMatchesJitinfo(JSJitInfo const*, JS::Handle<JS::Value>) [dom/bindings/BindingUtils.cpp:3418]
[task 2024-11-27T10:26:48.289Z] 10:26:48 INFO - GECKO(1069) | #02: mozilla::dom::binding_detail::GenericGetter<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) [dom/bindings/BindingUtils.cpp:0]
[task 2024-11-27T10:26:48.289Z] 10:26:48 INFO - GECKO(1069) | #03: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) [js/src/vm/Interpreter.cpp:532]
[task 2024-11-27T10:26:48.290Z] 10:26:48 INFO - GECKO(1069) | #04: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) [js/src/vm/Interpreter.cpp:628]
[task 2024-11-27T10:26:48.290Z] 10:26:48 INFO - GECKO(1069) | #05: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) [js/src/vm/Interpreter.cpp:727]
[task 2024-11-27T10:26:48.291Z] 10:26:48 INFO - GECKO(1069) | #06: js::CallGetter(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>) [js/src/vm/Interpreter.cpp:850]
[task 2024-11-27T10:26:48.292Z] 10:26:48 INFO - GECKO(1069) | #07: GetExistingProperty<(js::AllowGC)1>(JSContext*, js::MaybeRooted<JS::Value, (js::AllowGC)1>::HandleType, js::MaybeRooted<js::NativeObject*, (js::AllowGC)1>::HandleType, js::MaybeRooted<JS::PropertyKey, (js::AllowGC)1>::HandleType, js::PropertyInfoBase<unsigned int>, js::MaybeRooted<JS::Value, (js::AllowGC)1>::MutableHandleType) [js/src/vm/NativeObject.cpp:2177]
https://treeherder.mozilla.org/logviewer?job_id=484362876&repo=try&lineNumber=11789
I was able to fix a first occurence of this by waiting more correctly for async work done at the end of browser_dbg-breakpoints.js. Now it seems to reproduce again on browser_toolbox_meatball.js (and may be related to browser_toolbox_many_toggles.js).
| Assignee | ||
Comment 8•1 year ago
|
||
It looks like it may be accessing the osPid on the JS Process Actor on timeout/resume (this.manager.osPid in the patch).
Let's see if try is fully green with this fix:
https://treeherder.mozilla.org/jobs?repo=try&selectedTaskRun=Y5pCeQs-QIGIX6jxvGkWQA.0&revision=1fa8e098761c05be73de41797ebfecd7292cecae
| Assignee | ||
Comment 9•1 year ago
•
|
||
This issue was around manager getter which triggers this assertion anytime we access it after the JS Process Actor has been unregistered.
There must be something wrong in that getter, which may be correct on JSWindowActor, but not on JSProcessActor.
Comment 10•1 year ago
|
||
| Assignee | ||
Updated•1 year ago
|
Comment 11•1 year ago
|
||
Backed out for causing dt failures @ browser_target_command_browser_workers
- Backout link
- Push with failures
- Failure Log
- Failure line:
TEST-UNEXPECTED-FAIL | devtools/shared/commands/target/tests/browser_target_command_browser_workers.js | retrieve the target for the worker -
| Assignee | ||
Comment 12•1 year ago
|
||
With latest revision:
https://treeherder.mozilla.org/jobs?repo=try&revision=f788ea2d7f70f89fbcb953b0fc2086418d109c53
Comment 13•1 year ago
|
||
Comment 14•1 year ago
|
||
| bugherder | ||
| Assignee | ||
Updated•1 year ago
|
Description
•