Closed Bug 1546945 Opened 2 years ago Closed 1 year ago

remote/test/browser/ browser-chrome tests cause leakcheck to fail

Categories

(Remote Protocol :: Agent, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: ato, Unassigned)

References

()

Details

(Keywords: leave-open, memory-leak)

Attachments

(3 obsolete files)

After I enabled the remote agent in the default Nightly build, we
started seeing leaks in the browser-chrome tests as seen here:

https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=debug%2Cbrowser-chrome&fromchange=b9a625eff9e301f700196de977f916e7f77a895f&tochange=afb947a236befa6eb6f23783405bd9c334aa4129

I will submit a new patch to unhook the remote agent from the default
build, until we have had time to resolve this. Backing out the
original patchset would be detrimental to investigating this
effectively on try.

Assignee: nobody → ato
Status: NEW → ASSIGNED
Depends on: 1533831
Priority: -- → P1

As detailed in https://bugzilla.mozilla.org/show_bug.cgi?id=1546945,
enabling the remote agent in the default Firefox Nightly build
(https://bugzilla.mozilla.org/show_bug.cgi?id=1533831) caused a
leakcheck in M-bc to fail.

This patch cuts the remote agent off from the build, by stopping
to imply --enable-cdp, until we have had time to fix the leak.

Pushed by atolfsen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/004d050a8ec4
remote: temporarily stop implying --enable-cdp; r=aryx
Keywords: leave-open
Blocks: 1540655

All browser-chrome jobs have been requested on debug for
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=004d050a8ec44987f44fd86c62097266b22b915e,
so it should confirm it was the remote agent that caused the leaks.

Because we don’t run all test jobs on all pushes, the working theory
is that the push I linked to in comment #0 saw the effect of the
remote agent M-bc jobs starting to leak.

As detailed in https://bugzilla.mozilla.org/show_bug.cgi?id=1546945,
there is a memory leak in the remote agent that causes it to
intermittently fail brwoser-chrome leakchecks.

It is possible the memory leak is related to holding onto a reference
of the XPConnect C++ object nsSocketTransportService, but this has
yet to be confirmed.

This patch disables the browser-chrome tests on debug builds,
where we run reference counting leakchecks, in order to get the
remote agent enabled in default Firefox Nightly builds.

Thanks-to: Alexandre Poirot <poirot.alex@gmail.com>

This patch negates the effect of 004d050a8ec4 (git 6dd9496ac0f9).

OK, I talked to ochameau and in order to make some meaningful
progress on enabling the remote agent in default Firefox Nightly
builds, we are going to disable the browser-chrome tests on debug
builds where reference counting leakchecks are run. I’ve submitted
the necessary patches for doing that as well as turning the remote
agent back on.

This bug is still relevant and now tracks re-enabling the M-bc
tests in remote/test/browser/*.js by first plugging the leak.

ochameau and I did some wild-goose-chase debugging on IRC, and we
suspect that we’re holding on to an XPConnect reference to a C++
instance of nsSocketTransportService. There are only a handful
of places we use sockets, which inculde for WebSocket client
connections, and for the HTTPD.

I narrowed it down to remote/test/browser/browser_cdp.js. On my
workstation it is failing the leakcheck consistently, which is
actually good news for debugging it.

This means I will revisit the attached patches and change them to
only ignore browser_cpd.js on debug.

Keywords: memory-leak
Summary: Remote agent browser-chrome tests causes leakcheck to fail intermittently → remote/test/browser/browser_cdp.js causes leakcheck to fail
Attachment #9060764 - Attachment description: bug 1546945: remote: disable browser-chrome tests on debug; → bug 1546945: remote: disable browser_cdp.js browser-chrome test on debug;
Pushed by atolfsen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f04e1e9c54aa
remote: disable browser_cdp.js browser-chrome test on debug; r=ochameau
https://hg.mozilla.org/integration/autoland/rev/868b1bf043dc
remote: re-enable remote agent in Firefox Nightly; r=ochameau

The remote agent has now been re-enabled, so relinquishing this
bug. We still need to fix this memory leak, but I don’t consider
it a top priority.

Status: ASSIGNED → NEW
Priority: P1 → P3
Assignee: ato → nobody
Duplicate of this bug: 1546850
Duplicate of this bug: 1547303

Over the weekend two other intermittent leaks were detected in the
remaining browser-chrome tests:

I’ve filed bug 1547679 to temporarily disable all browser-chrome
tests on debug until we’ve had time to investigate (this bug).

Depends on: 1547679
Summary: remote/test/browser/browser_cdp.js causes leakcheck to fail → remote/test/browser/ browser-chrome tests cause leakcheck to fail

I can help you at some point with investigating this leak, though I don't think I'll have a ton of time this week.

No longer blocks: 1540655

The failures in comment 19 are in accessible/tests/browser/events/. I don't know why this bug is being starred. There's another bug on file for it.

Attachment #9060696 - Attachment is obsolete: true
Attachment #9060764 - Attachment is obsolete: true
Attachment #9060765 - Attachment is obsolete: true

Removing the disabled tag per comment 20 (misclassification)

Whiteboard: [stockwell disable-recommended]

Misclassification, see above.

Whiteboard: [stockwell disable-recommended]
Whiteboard: [stockwell disable-recommended]

The leaks are being misclassified. In the two logs I was looking at, they were happening in toolkit/components/certviewer/tests/browser/

See also bug 1587814 for leaks as reported when running the puppeteer unit tests locally.

See Also: → 1587814

Note that any kind of failure in the test, or exception thrown by a CDP endpoint will cause the browser chrome tests to not correctly clean-up the tab and client instances. As such we also leak a lot of memory. And the related leak output is frustrating because it makes it harder to find the real issue.

I will file a new bug for that particular case, which will improve a lot our situation, and maybe is all needed here.

Bug 1589625 has been fixed, and as such I'm fairly sure the leaks should all be gone now. Andreas, can you remember what we have to do to finally close out this bug? With all those formerly landed patches on this and other bugs it's kinda hard to follow. Maybe it's only backing out your first patch?

Flags: needinfo?(ato)

(In reply to Intermittent Failures Robot from comment #45)

1 failures in 4182 pushes (0.0 failures/push) were associated with this bug in the last 7 days.

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1546945&startday=2019-10-21&endday=2019-10-27&tree=all

This is a wrong classification. Alexandru, can you please correct it?

Flags: needinfo?(malexandru)

Corrected, made a new bug for that issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1592568

Flags: needinfo?(malexandru)

Thanks for bringing this to my attention!

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #44)

Bug 1589625 has been fixed, and as such I'm fairly sure the leaks
should all be gone now. Andreas, can you remember what we have to
do to finally close out this bug? With all those formerly landed
patches on this and other bugs it's kinda hard to follow. Maybe
it's only backing out your first patch?

The leak that caused this bug to be filed was plugged by e2fbe8d063e0
in bug 1589625. After that there were a number of misclassifications
that made it hard to keep track of whether the leak problems had
been fixed.

With the patches for bug 1589625 also applied, I haven’t seen any
recent remote agent specific leaks.

As for what action needs to be taken here, I don’t think there is
any. The remote agent was re-enabled by default with 868b1bf043dc
which we could do because f04e1e9c54aa disabled browser_cdp.js
on debug. That leak was eventually plugged in 60dca41daf70 by making
sure all the content process domains had their destructors called.

This means the status quo is currently:

  1. All the tests are passing on debug configurations
  2. We are not seeing any remote agent specific intermittents
  3. The remote again remains enabled by default (on the Nightly channel)
Status: NEW → RESOLVED
Closed: 1 year ago
Flags: needinfo?(ato)
Resolution: --- → FIXED

Thanks Andreas!

Resolution: FIXED → WORKSFORME
You need to log in before you can comment on or make changes to this bug.