1609201 - processDescriptorFront.getTarget() leaks windows

Brian Hackett [Laid off!]

Reporter

Description

•

5 years ago

Attached patch patch for STR — Details — Splinter Review

STR:

Update m-c to b59dd4db036e
Apply attached patch
Build m-c with --enable-debug
Run the browser_dbg-asm.js mochitest

Actual Result:

ERROR TEST-UNEXPECTED-FAIL | devtools/client/debugger/test/mochitest/browser_dbg-asm.js | leaked 2 window(s) until shutdown [url = http://example.com/browser/devtools/client/debugger/test/mochitest/examples/doc-asm.html]

Expected Result:

No leak

The attached patch turns on the windowless service workers pref and then disables almost all of its functionality. The leak will stop if the following line is commented out in root.js (see patch):

      const front = await processDescriptorFront.getTarget();

The windowless service workers pref requires that the debugger get targets for all content processes so that it can look for service workers in them. Getting these targets causes leaks on this test and several others, though the other ones are not as reproducible. It seems from these STR that this is a problem with the target front rather than the way in which the debugger is using it. The targets seem like they are all being destroyed when the toolbox itself is destroyed, but if I close the toolbox and tab in the test and then await for several seconds the leak still occurs.

Jan Honza Odvarko [:Honza] (always need-info? me)

Comment 1

•

5 years ago

@Jason, could you please look at this, could the problem be related to target front implementation?

Honza

Flags: needinfo?(jlaster)

Patrick Brosset <:pbro>

Updated

•

5 years ago

Priority: -- → P3

:Harald Kirschner :digitarald

Comment 2

•

5 years ago

Updated ni? → Honza mentioned today that Julian is helping with this (thanks a lot!)

Flags: needinfo?(jlaster) → needinfo?(jdescottes)

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Assignee: nobody → jdescottes

Status: NEW → ASSIGNED

Flags: needinfo?(jdescottes)

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 3

•

5 years ago

•

Edited

summary

I started investigating this and I have a few findings Tldr:

leak is linked to the content-process DebuggerServer that lives in the same process as the tab (where we already have a DebuggerServer)
leak is linked to a "scripts" map, held by BreakpointActors, which are never cleaned up

try pushes

I have two try pushes in progress:

call BreakpointActor destroy: https://treeherder.mozilla.org/#/jobs?repo=try&revision=43d5425cf262483d97ae80fa42e81308fdb456e8
call BreakpointActor destroy + enable windowless service workers: https://treeherder.mozilla.org/#/jobs?repo=try&revision=24fc8c9d1adf9d57b7562ea369cac81e2708417b

However if you look at the original patch that was backed out: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=47f8d76aa4fc9a21a03d7163cc0b4747a65adfe0 a lot of failures are timeouts, and don't seem linked to this leak. So let's wait to see what try says, but I think more work might be needed to enable the feature (probably should file other bugs).

investigation (detailed)

We are connecting to several content processes during this test (5 in my local tests). The content process that makes this test leak is the content process of the tab targeted by the toolbox. If I skip this specific content process and don't connect to it, there is no leak. The actorIDs of the descriptors are relatively stable across runs, so I simply filtered by processDescriptorFront.actorID to verify this theory.

In practice, we already have a DebuggerServer created in this process, because of Tab target of the toolbox. So when we connect to this process target, we create another DebuggerServer in the same process. The content-process.jsm script creates a dedicated loader, which means we really use separate DebuggerServers: https://searchfox.org/mozilla-central/source/devtools/server/startup/content-process.jsm#29-38

If we modify content-process.jsm to use the shared loader instead of creating a new one, the leak disappears as well. Finally, simply setting invisibleToDebugger to false is not enough to fix the leak.

More interesting I also looked at the test itself to understand which step was leaking precisely. Turns out the test only leaks when you add a breakpoint via await addBreakpoint(dbg, "asm.js", 7);. Diving deeper into what leaks in the breakpoint creation, it seems related to a map held by the BreakpointActor: https://searchfox.org/mozilla-central/source/devtools/server/actors/breakpoint.js#40. This map is supposed to be cleared when destroying the actor, but breakpoint actors are never destroyed (at least not in this test).

Modifying ThreadActor to destroy all breakpoint actors, fixes the leak for me locally. That being said, I don't understand why this turned into a leak here and not before. Even without enabling windowless service workers, the breakpoint actors are never destroyed. What is it about creating an additional DebuggerServer in the same process that makes this map leak? So far I don't know.

I also verified that all the DebuggerServers were correctly destroyed, couldn't spot anything wrong there.

Jason Laster [:jlast]

Comment 4

•

5 years ago

Thanks for the writeup and analysis.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 5

•

5 years ago

investigation update

After fixing the breakpoint map issue, there was still a leak in a test, reproducible locally.
Test is browser_dbg-pause-exceptions.js (https://searchfox.org/mozilla-central/source/devtools/client/debugger/test/mochitest/browser_dbg-pause-exceptions.js). Same overall observations, leaks only because we create a DebuggerServer in the same process as the content page, and only if we are not reusing the existing Loader.

For this one I was not able to find the root cause of the leak. However I could determine that the test only leaked when the debugger was pausing on exceptions and those exceptions were Error objects (as opposed to simple throw "some string";). So as a workaround I simply switched all errors to strings in the test page (https://hg.mozilla.org/try/rev/ea887d26cb048bbd34b7f5628bd01c28729c351a).

At that point, I don't have any test leaking locally in the debugger suite. However we still have timeouts & crashes on try, on Linux and macos (windows is green now). Example: https://treeherder.mozilla.org/#/jobs?repo=try&revision=84f1dff81f3d82689752065dd475f0f5702d8ea5

The timeouts seemed to revolve around two tests mostly:

browser_dbg-windowless-service-workers.js
browser_dbg-xhr-breakpoints.js
I tried skipping them on debug to see if it's only about those two tests, or if the issue shifts somewhere else.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ea462c2c930267e0ec2f4147ac5aa90e0f28d09
(no results yet, but the typical dt12 jobs that fail have already been running for longer than they usually do, so I'm guessing it will timeout all the same)

I am also checking if we could just stop using a dedicated loader for the content-process servers. At least to see if it fixes the timeouts, even though I'm not clear about the consequences this might have.
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=286321905&revision=c1d73818a1da9ca32c1cfbbcae88ea03bd97d643

Some other things I tried without any success:

set dom.ipc.processPrelaunch.enabled=false in the debugger test suite to make process creation more reliable
split debugger's browser.ini in two to balance the load

next steps

Unless the shared-loader for content-processes is somewhat successful, I think we'll need to discuss priorities for this. Getting this to a landable state will take time.
Also everything here is tightly linked to the way the debugger is fetching targets. Since this is a codepath we want to remove in favor of TargetList, if possible I suggest to wait until the following bugs have been addressed:

https://bugzilla.mozilla.org/show_bug.cgi?id=1594750
https://bugzilla.mozilla.org/show_bug.cgi?id=1572409
And also until the debugger's targets.js file has been updated to fully rely on target-list instead of implementing its own logic for fetching targets

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 6

•

5 years ago

Attached file Bug 1609201 - WIP/Feedback: Use shared loader for content process server (obsolete) — Details

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 7

•

5 years ago

Let's do another attempt skipping all windowless tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=80b384d452615e0ef98aef4f61516eb86e80b361

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 8

•

5 years ago

DAMP compare https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=3d42367b4c35a310bc1dece412d72b7f082ff3c9&originalSignature=1759151&newSignature=1759151&framework=12&originalRevision=824b1b0eb2e124e30da533608034a2da8eb56d0c

cold.jsdebugger.open.DAMP 30% slower
complicated.jsdebugger.close.DAMP 10% slower
complicated.jsdebugger.open.DAMP 16.6% slower
custom.jsdebugger.close.DAMP 8% slower
custom.jsdebugger.open.DAMP 21.5% slower
custom.jsdebugger.stepIn.DAMP 13.6% slower
simple.jsdebugger.close.DAMP 20% slower
simple.jsdebugger.open.DAMP 34% slower

This confirms the negative performance impact of enabling the feature as is.
Fixing the leak would most likely not address this regression.
One option to mitigate this is to retrieve service worker targets only from the current main target.
Limitations/challenges:

browsing context targets currently only return workers of type "dedicated"
we have no guarantee that the service worker will spawn in the same process as the page it controls

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 9

•

5 years ago

Leak investigation update

The previous investigations highlighted two leaks.
Both leaks can be fixed by code fixes which seem reasonable:

fix for BreakpointActors leak: https://hg.mozilla.org/try/rev/09b0f7c6e9990c2227617dcadf0e65b912c2cb7c
fix for _handledFrameExceptions leak: https://hg.mozilla.org/try/rev/8d9826f809d96938b06d4fd6e6ae68fc2225535c

After fixing those 2 leaks, there are no visible window leaks on try anymore. But as I said in a previous update, Try is still failing on linux debug and macos debug. It now fails because of timeouts, which on Linux occurs around 90 minutes of run for the debugger chunk.
For reference, on central right now, the same chunk takes around 40 minutes. Combining this information with the DAMP results from above, I think what's happening is that the debugger test suite is now so slow and heavy it times out. Maybe because of a leak (although it doesn't show up as a window leak and therefore doesn't fail the tests directly). Maybe for another reason.

builtin-modules leak

So I tried to understand why having 2 DebuggerServers loaded in the same content process was making the 2 leaks I mentioned visible. Because even though we have new servers, the maps or objects creating the leaks were always living in the "regular" DebuggerServer, not the additional server created for the sw debugging feature. So why would they now trigger a leak, just because we had another DebuggerServer sitting there.

From what I could bisect, it seems to be purely linked to the debugger Sandbox which DevTools spawn in builtin-modules.js (searchfox). As a reminder, a server created for content-process debugging uses a dedicated DevTools loader. And for each DevTools loader we load this builtin-modules file, and create a corresponding Sandbox. All that to say that simply loading builtin-modules for the content-process server is enough to make the regular server leak.

I tried to find a way to prevent the leak while still loading 2 copies of built-in modules. And it seems that adding a bit of proxying and cleanup to the way we handle the Sandbox can fix this. Patch at https://hg.mozilla.org/try/rev/d3858c0465b6d746b6c83c6f0b5286583e8c8629 (this is more a WIP/proof of concept than something that I really want to push).

With this patch, tests such as browser_dbg-pause-exceptions.js now pass without the _handledFrameExceptions leak fix. And tests such as browser_dbg-asm.js also pass without the BreakpointActors leak fix.

Why try to fix leaks for which we already had a solution? Since the setup made various tests start leaking for unrelated reasons, I think that fixing the root cause is important, rather than hoping that the leaks we found in mochitests are all of the leaks that can happen in this situation. And maybe this will, overall, fix a bigger leak that makes the debugger test suite so slow.

Reduced test case for the builtin-modules leak

I want to share the reduced test case I used to investigate builtin-modules, hoping someone more knowledgeable can understand why we have a leak here.

The stack to use for this reduced test case is https://hg.mozilla.org/try/rev/a23cab06da748e27331ebd87a06475efc18d32ba .
With this stack applied, you can reproduce the leak simply by using a debug build and running ./mach test browser_dbg-pause-exceptions.js.

This stack contains the pref flip for sw debugging and the fix for the BreakpointActors leak, but it lacks the fix for _handledFrameExceptions. Since we know that without this fix, browser_dbg-pause-exceptions.js will leak in debug mode, we can easily use that to reduce the test case. The good thing with this test is that it doesn't rely at all on the content-process servers that the debugger spawns when sw-debugging is enabled.

Thanks to this we can cripple those servers as much as we want until the leak disappears. The gist of the changes in the current patch is that the servers don't do anything in the content processes except creating a DevTools Loader with a custom flag isForContentProcessServer. When this flag is set, the Loader will create its base-loader and then will only require('builtin-modules-copy'), which is simply a really simplified version of builtin-modules.js

(function() {
  const { Cu } = require("chrome");
  const { Services } = require("resource://gre/modules/Services.jsm");

  const systemPrincipal = Services.scriptSecurityManager.getSystemPrincipal();
  const debuggerSandbox = Cu.Sandbox(systemPrincipal, {
    freshCompartment: true,
    wantGlobalProperties: [],
  });

  // In order to trigger the leak, we must export a method.
  // The closure it creates seems important here.
  exports.testFunction = function() {};

  // And the sandbox needs to be kept alive in a function scope.
  function foo() { debuggerSandbox; }
})();

And if you comment out exports.testFunction (or make it export something else than a function), or if you stop encapsulating debuggerSandbox in the foo method, the leak is gone.

This investigation lead me to a "fix" for the leak used in https://hg.mozilla.org/try/rev/d3858c0465b6d746b6c83c6f0b5286583e8c8629, but it would be nice if someone with a better understanding of sandboxes and compartments could investigate this.

Next steps

Despite having a workaround for the builtin-modules leak, the debugger test suite still times out on linux and macos on try, and the performance on DAMP is still 20/30% slower than without the feature.

Thinking only about try we could split the debugger test suite in smaller ini files, hoping that restarting Firefox a bit more often could fix the issue. I will try some more tricks to speed up the debugger attach code (basically reapplying what we recently did on target-list). But I think the bottomline is that the feature as it was designed is too slow to be shipped and needs to be rearchitected to spawn less servers.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 10

•

5 years ago

•

Edited

Some additional information. I tried to slightly change the logic in the debugger's target creation code. In the current implementation, we do three actions unconditionally:

create all worker targets
get all the service-worker-registrations
filter the worker targets that match a registration which controls the URL of the current top target

I changed that to only get all the service worker registrations, and then check if we have any sw registration for the current URL. If we do, we fetch worker targets and do the filtering, if we don't, we bail. This means that it should reduce the impact of the feature when debugging pages which don't have a sw. Only the cost of getting the registrations should still apply.

With this change, we have a few interesting results:
First, DAMP:

The regression is more or less cut in half. Since we don't exercise service workers in our DAMP tests, this means getting the sw registrations is already very costly.

Now on mochitests, we still get a timeout on the debugger test suite on debug platforms, and the suite still takes around 90 mn. But now it completes at least, it's no longer killed by harness. I was still expecting this to perform better with mochitests. Maybe fetching registrations is especially slow on debug. Normally there is only one debugger test which involves workers, and it runs at the end of the suite.

Finally, I also tried one of the suggestions mentioned in my last comment: split the ini in several files: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=2920cd51f9fa7ffde7f5c004386e5d3ce1b196f3&originalSignature=1759151&newSignature=1759151&framework=12&originalRevision=3d42367b4c35a310bc1dece412d72b7f082ff3c9 . And by doing that all tests passed. I would like to look more at the performance issue around listing registrations now.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 11

•

5 years ago

analysis

Yesterday I missed another codepath which is flagged behind the windowless service workers preference, because it was hidden in listProcessTargets :

async function listProcessTargets(args: Args) {
  const { currentTarget } = args;
  if (!attachAllTargets(currentTarget)) {
    if (currentTarget.url && features.windowlessServiceWorkers) {
      // Service workers associated with our target's origin need to pause until
      // we attach, regardless of which process they are running in.
      try {
        const origin = new URL(currentTarget.url).origin;
        const targets = await getAllProcessTargets(args);
        await Promise.all(
          targets.map(t => t.pauseMatchingServiceWorkers({ origin }))
        );
      } catch (e) {
        // currentTarget.url might not be a full URL, and old servers without
        // without pauseMatchingServiceWorkers will throw.
        // @backward-compatibility: remove in Firefox 75
      }
    }
    return [];
  }

  return getAllProcessTargets(args);
}

This is the logic that enables the "early" breakpoints for service workers. As we can see, this calls getAllProcessTargets which will call rootFront's listProcesses. This means that overall we are creating processes targets twice everytime we update the targets in the debugger. This explains why I just had a 50% perf improvement after my previous patch.

I verified that disabling this second codepath gets us back to normal performance.

So now we have a better picture of what is triggered by the windowless service workers preference:

list all workers
- list parent process workers
- list all processes -> create process targets -> list workers
list all service worker registrations
filter out workers that don't match registrations for the current URL
pause all service workers
- list all processes -> create process targets -> pauseMatchingServiceWorkers

So we are creating process targets twice here. And the bad news is that I can't apply the same trick as before to "pause all service workers". If we want to pause sw early (during installation phase for instance), we cannot wait until the service worker registration is available to call pauseMatchingServiceWorkers.

Next steps

We have a few options for our next steps (which are not exclusive, we can combine).

Option 1: Disable early service worker breakpoints
We could compromise and disable (or put behind a second pref) the "early breakpoints" in service workers. This means you wouldn't be able to break on the early phases of a worker's lifecycle (eg install).

If we don't want to compromise, we need to solve the "early breakpoints" differently or find ways to make it significantly faster.

Option 2: Reduce the number of servers
One of the challenges of service workers is that they can spawn in any process right now, not just in the process of the page they are controlling. If we could enforce this limitation, instead of creating as many servers as we have processes, we could just create one content-process server corresponding to the process of the debugged page. Maybe we could even reuse the server created for our current target?

Option 3: Reduce the number of calls (target-list?)
We could also look at how often this listProcessTargets is called. The debugger bundles all those operations in an updateThreads method which is called every time the list of processes changes, every time the list of workers for the current target changes and every time there is a "potential" worker change (coming from workers-listener and already listens for process list changes and worker list changes...). But this work would probably overlap with supporting workers in the target-list and switching the debugger to it.

Option 4: Wait for the new early events architecture
And finally there is the planned architecture change for fission/early events, recently described by :ochameau, which should completely change all the patterns used here.

In my opinion, the most interesting options to look at right now are options 1 and 2. Option 3 is not guaranteed to solve the performance issue, and we should really just use the target-list anyway. Option 4 is too far in the future I think.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 12

•

5 years ago

I tried to assert the potential impact of option 2 and option 3.

For option 2, I modified the debugger's code to only create a single content process server. I just picked the first process I could find, so it is completely incorrect, the goal was just to see the impact of creating a single content process server. DAMP still shows a non-negligeable regression (between 4% and 8% on debugger open tests). Mochitests managed to complete in a decent time.
patch at https://hg.mozilla.org/try/rev/a4fd12b468f17071ce4e75a01645a14a69db0947

For option 3, I introduced a custom target list implementation, only focused on process targets.
And I am using it to call pauseMatchingServiceWorkers. DAMP didn't report performance improvements and the mochitests still timed out
patch at https://hg.mozilla.org/try/rev/0453061c7ac68956370722b022ef588a067cf2fa

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 13

•

5 years ago

Hi Perry & Andrew!

We are trying to enable service worker debugging in the debugger, but we are hitting performance issues with the current design.
Right now, if I remember correctly, a service worker running for tabA can spawn in any content process. Because of that, we have to "install" DevTools in every content process in order to find the service workers that might control the page.

Some time ago we discussed about the possibility of forcing the service worker to spawn in the same process as the page it controls, and I think you mentioned that it would be the case with Fission anyway. Is it a limitation we could start enforcing before Fission? If we had this limitation, DevTools could "look" for service workers in a single content process, which would be less expensive.

I can already see some challenges with this approach. For instance, I see that without Fission, two tabs of the same domain can run in different processes. In that case I am not sure how we could enforce a service worker to be in the "same" process as the tab.

Broadening a bit the question: do you have other ideas to help us narrow the number of content processes where we install DevTools? Maybe the ServiceWorkerManager could help us identify the processes where service workers are spawning?

Thanks!

Flags: needinfo?(perry)

Flags: needinfo?(bugmail)

BugBot [:suhaib / :marco/ :calixte]

Comment 14

•

5 years ago

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Keywords: regression

Perry Jiang [:perry] [no longer employee, use ni?]

Comment 15

•

5 years ago

Yeah, there's still no guarantees for which process they're put in, and doing the same-process-as-page approach might get tricky as you mentioned. I'm not sure about the possibility of multiple processes under the same origin with fission though, in which case the problem is more or less the same. We can track the process ID that a worker is in via ServiceWorkerManager (just have to make sure that the worker is held alive, otherwise the PID may change after we decide to look in that process).

Flags: needinfo?(perry)

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 16

•

5 years ago

Thanks for the feedback! (clearing ni for asuth)
So having same-process-as-page is probably a no go at the moment.

We can track the process ID that a worker is in via ServiceWorkerManager
(just have to make sure that the worker is held alive, otherwise the PID may change after we decide to look in that process).

We should prototype a solution based on that. Basically the idea would be that before we start the servers and set the breakpoints, we would ask the server in which process we have service workers. The challenge is that we want to be able to break very early. I don't know if such a setup will allow us to set the breakpoints early enough, but we have to test.

Otherwise, as we were discussing option 1 with Harald (disable early breakpoints), we came up with another variant. Early breakpoints would be disabled by default, but if we see that there are service workers on the debugged page, we would enable early breakpoints. We still would not be able to break on the first ever install of a worker, but subsequent installs & updates should be debug-able.

Flags: needinfo?(bugmail)

Alexandre Poirot [:ochameau]

Comment 17

•

5 years ago

Attached file Bug 1609201 - Try to expose active service worker's process ID. (obsolete) — Details

Alexandre Poirot [:ochameau]

Comment 18

•

5 years ago

Comment on attachment 9131858 [details]
Bug 1609201 - Try to expose active service worker's process ID.

This patch is probably wrong, but tries to implement nsIServiceWorkerInfo.osPid which would help match the process into which the worker is running, by comparing it to Services.ppmm.getChildAt(0 <= i < ppmm.childCount).osPid.
This patch is probably wrong because it seems to always report the parent process PID, i.e. Services.ppmm.getChildAt(0).osPid.

Logan Smyth [:loganfsmyth]

Assignee

Comment 19

•

5 years ago

•

Edited

I got a bit sidetracked today because I made one discovery I wanted to explore. It turns out that we have been spawning the content process servers serially instead of in parallel. https://searchfox.org/mozilla-central/rev/070a000dd49aac4a26147e137efcd91a728d13b8/devtools/shared/fronts/root.js#212 We call .getTarget here and wait for it to finish before moving on to the next target.

From a brief test, it takes ~100ms for the content process itself to spawn and respond to a .getTarget request on a process where there wasn't previously a spawned server. When you combined that with the fact that we have 3 processes by default, we've been at a minimum spending 300ms doing something that should take 100ms.

Exploring the timing a bit more, using a shared loader as mentioned above seems to shave another 30ms off of that 100ms time in theory.

Here's some DAMP runs:

With just the feature toggle and the memory leak fix:

With feature toggle, memory leak fix, and a shared loader:

With feature toggle, memory leak fix, and parallel target loading

With feature toggle, memory leak fix, and parallel target loading and a shared loader:

It looks like this does bring down the percentages a bit from the highs that Julian was posting earlier, but it's still a 5-10% slowdown on the jsdebugger.open tests and even larger for jsdebugger.close. I have not experimented with how that plays into any of the other tweaks Julian was testing out.

Logan Smyth [:loganfsmyth]

Assignee

Comment 20

•

5 years ago

Here's one more test that would require more work, but I ran it to get a general sense of the effect, which does seem to be noticable.

With feature toggle, memory leak fix, and parallel target loading and a shared loader, and also if the content-process actor didn't eagerly load and create a ThreadActor and WebConsoleActor:

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 21

•

5 years ago

•

Edited

As discussed, started retriggers on:

As well as on the base central revision. This should give us more accurate results.

Edit: retriggers are completed, the compare links from Logan now have more data.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Depends on: 1621214

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Depends on: 1621217

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Depends on: 1621228

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Comment 22

•

5 years ago

What wonderful analysis you are doing here Julian and Logan!!

While this is great to better understand our existing codebase, I would like to call out the significant time we are spending in trying to optimize an architecture, which:

we know, will and has to change because of fission,
is known to be bad for performance, by design.

You may optimize Damp, but this most likely impacts Devtools more than what Damp reports. Damp doesn't stress service workers at all, nor does it highlights a Firefox with many content processes. For example, the usage of parallelizing the content target creation may be faster on damp, but may also still freeze the browser with an overload of data.

The leaks are also concerning (bug 1621214). We know we get rid of them with the submitted patches, but we have no clue why builtin-modules introduces such leak. We know that we stop leaking the content document, but we may as well still leak many other things that are not tracked by the mochitest leak detector. The detector only track DOM Documents/Windows. For example, we may still leak all modules, sandboxes, actors, pools,... One thing we know is that it relates to content process target actors, which is involved by this feature. Again, we get back to the original design of this feature, where we should not do much or anything at all in all the content processes.

I'm seeing a lot of time spent working around the worker situation and this listAllWorkers function. Here, in about:debugging, the application panel and also when working on fission.

I'd like to suggest going for option 4, mentioned in comment 11. Align with Fission refactorings.
It will surely delay this feature, but improve our engineering efficiency. We would address both fission and this feature by doing only fission work.
We would not have to tweak code that we know is going by away. We would prevent modifying the same code (debugger's events.js and commands.js, and/or RoofFront.listAllWorkers) for working around here and in parallel for fission.

This is really frustrating for me as I called out the day I saw this feature that this should be using the TargetList. And the fact that the Debugger doesn't use the TargetList blocks me on the fission's track.

Bug 1619622 aims at migrating to the TargetList for processes. But we would still need to migrate to the TargetList for workers (bug 1621338). Bug 1594754 aims at improving the TargetList around workers for the browser toolbox, but that isn't enough. Bug 1621337 aims at exposing service workers also for the content toolbox. This one last bug is about implementing an equivalent of listAllWorkers right into the TargetList. Bug 1594754 kind of did that, but for the browser toolbox, where it is slightly easier.

Once all these bugs are completed, this won't address the main performance issue. It will actually implement Option 3. The TargetList caches its targets and so would prevent duplicated calls to listWorkers, listProcesses, listRemoteFrame...
We would need to implement an equivalent of bug 1593937 for workers (i.e bug 1593940), so that we instantiate the WorkerTarget from the server as soon as the worker starts, instead of doing it on-demand by a frontend request (listWorkers/getWorker). Doing that only works if the frontend uses the TargetList as the TargetList is used to communicate the Target actor from the backend to the frontend. (Via target-available-form RDP event).
I did a diagram of how we would create the Frame Targets:
https://docs.google.com/drawings/d/1Ex1QHxljVugwaqLrh3BOfMueaV3l1kuWF5qnbqimFt4
Such setup would allow to register a very lightweight code in Gecko (JS XPCOM, JS Window Actor, ... not using the DevTools Loader), in order to listen in all processes with a very minimal overhead. This code would instantiate the Service Worker targets that matches the tab origin.

This may sound like a lot of work, but this is required for Fission and we know that it would implement this feature correctly.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Comment 23

•

5 years ago

Thanks for the detailed feedback!

I understand the doubts around this feature, as it starts spawning a lot more servers than usual. Compared to you, I think I am more willing to try and ship the feature if the metrics are satisfactory (without cheating said metrics of course). But I do get your points and your concerns.

When we met (Alex, Honza, myself) a few weeks ago, the conclusion was to try and improve the feature without overlapping with Fission refactors. I thought we reached an agreement on that.
On the other hand, the core problems with this feature are tied to its current design. And the only reasonable way to fix them is to wait for the new architecture.

The two goals seem to be at odds. Either the feature is too dangerous by design and we postpone it for a few releases. Or we try to reach a compromise and ship the feature.

I think we need to meet again and take a decision.

Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct)

Updated

•

5 years ago

Assignee: jdescottes → nobody

Status: ASSIGNED → NEW

Logan Smyth [:loganfsmyth]

Assignee

Comment 24

•

5 years ago

Attached file Bug 1609201 - Clean up Services.obs observers on devtools loader destroy. r=jdescottes! — Details

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → loganfsmyth

Status: NEW → ASSIGNED

Logan Smyth [:loganfsmyth]

Assignee

Comment 25

•

5 years ago

I think I'm fine with landing things if we can get them "fast enough" for the average case, but I also totally understand Alex's concern about making things more complicated to them have to redo them again in the future.

Either way, I 100% agree with Alex about the memory leaks being a big unknown. I didn't record it here along with my notes from yesterday, but at the meeting yesterday I mentioned that I had a few candidates in mind that could be causing the leak from my checks over the codebase, and it does indeed appear that those were the cause.

Here is a test run with only the feature toggle patch and my new leak-fixing patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a1a6716cb3ef87a99af63d0ff10ffc086aa6746

Pulsebot

Comment 26

•

5 years ago

Pushed by loganfsmyth@gmail.com: https://hg.mozilla.org/integration/autoland/rev/2786da95404d Clean up Services.obs observers on devtools loader destroy. r=jdescottes,ochameau

Razvan Maries

Comment 27

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/2786da95404d

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox76: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 76

Phabricator Automation

Updated

•

5 years ago

Attachment #9122888 - Attachment is obsolete: true

Phabricator Automation

Updated

•

5 years ago

Attachment #9131858 - Attachment is obsolete: true

patch for STR 5 years ago Brian Hackett [Laid off!] 2.99 KB, patch		Details \| Diff \| Splinter Review
Bug 1609201 - WIP/Feedback: Use shared loader for content process server 5 years ago Julian Descottes [:jdescottes] (PTO 20 oct -> 25 oct) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1609201 - Try to expose active service worker's process ID. 5 years ago Alexandre Poirot [:ochameau] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1609201 - Clean up Services.obs observers on devtools loader destroy. r=jdescottes! 5 years ago Logan Smyth [:loganfsmyth] 47 bytes, text/x-phabricator-request		Details \| Review