Closed Bug 1398563 Opened 7 years ago Closed 4 years ago

Intermittent leakcheck | tab process: 306380 bytes leaked (APZEventState, ActiveElementManager, AsyncLatencyLogger, BackstagePass, CSPService, ...)

Categories

(Core :: General, defect, P5)

defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, memory-leak, Whiteboard: [stockwell unknown])

There have been 34 failures in the last 7 days, and the failure has been occurring more frequently since October 24.
Most of the failures are on Windows 7 and Windows 10 x64, but there have also been some occurrences on Linux. All of them occurred on debug builds. The failures appear in the following test suites: mochitest-clipboard-e10s and mochitest-webgl-e10s.

Here's an example of a recent log:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=139828088&lineNumber=9906

And a snippet of the test error:
11:45:19     INFO -  TEST-INFO | leakcheck | tab process: leaked 13 xpc::CompartmentPrivate
11:45:19     INFO -  TEST-INFO | leakcheck | tab process: leaked 2 xpcJSWeakReference
11:45:19     INFO -  TEST-INFO | leakcheck | tab process: leaked 70 xptiInterfaceInfo
11:45:19    ERROR -  791 ERROR TEST-UNEXPECTED-FAIL | leakcheck | tab process: 930064 bytes leaked (APZEventState, ActiveElementManager, AsyncLatencyLogger, BackstagePass, CSPService, ...)
11:45:19     INFO -  runtests.py | Running tests: end.
11:45:20     INFO -  Buffered messages finished
11:45:20     INFO -  0 INFO TEST-START | Shutdown
11:45:20     INFO -  1 INFO Passed:  86258
11:45:20     INFO -  2 INFO Failed:  0
11:45:20     INFO -  3 INFO Todo:    0
11:45:20     INFO -  4 INFO Mode:    e10s
11:45:20     INFO -  5 INFO SimpleTest FINISHED
11:45:20     INFO -  Buffered messages finished
11:45:20     INFO -  SUITE-END | took 1225s
11:45:20     INFO - Return code: 0
11:45:20     INFO - TinderboxPrint: mochitest-mochitest-gl<br/>86258/0/0
11:45:20    ERROR - # TBPL FAILURE #
11:45:20  WARNING - setting return code to 2
11:45:20    ERROR - The mochitest suite: mochitest-gl ran with return status: FAILURE
11:45:20     INFO - Running post-action listener: _package_coverage_data
11:45:20     INFO - Running post-action listener: _resource_record_post_action


:selena, could you please take a look?
Flags: needinfo?(sdeckelmann)
Whiteboard: [stockwell needswork]
Milan -- I'm seeing "(APZEventState, ActiveElementManager, AsyncLatencyLogger, BackstagePass, CSPService, ...)" in most of these leak reports. Can you investigate?
Flags: needinfo?(sdeckelmann) → needinfo?(milan)
Keywords: mlk
Blocks: 933741
Flags: needinfo?(milan) → needinfo?(bugmail)
(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #8)
> Milan -- I'm seeing "(APZEventState, ActiveElementManager,
> AsyncLatencyLogger, BackstagePass, CSPService, ...)" in most of these leak
> reports. Can you investigate?

This is because the leaked things are listed in alphabetical order. I can fix this by renaming the classes to start with "Z" :)

I looked at a couple of the reports, and it does look like a legitimate leak where an entire TabChild and everything hanging off of it gets leaked. :mccr8, do we have a mechanism to debug these on try without local reproduction?
Flags: needinfo?(bugmail) → needinfo?(continuation)
Yeah, I need to come up with some mechanism to report a better error when we leak an nsGlobalWindow and everything that hangs off of it. The leakcheck output seen in Treeherder is useless in that case.

As noted in comment 2, these are happening in two different test suites, mochitest-webgl-e10s and mochitest-clipboard-e10s. I think there are two different leaks involved.

I have a very simple script ( https://github.com/amccreight/mochitest-logs/blob/master/plusplus.py ) that analyzes the ++DOMWINDOW and --DOMWINDOW lines in the log to figure out which windows aren't cleaned up. The output of the script is a number of lines that look like this:
[pid = 5984] [serial = 1]
You can then search the log for the ++DOMWINDOW lines matching those pairs that never get a corresponding --DOMWINDOW, and see where they fall in the log to figure out which test they happen during.
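For illustration, here is a minimal sketch of that kind of pairing analysis. This is a simplified, hypothetical version rather than the actual plusplus.py, and it assumes the usual debug-log lines of the form "++DOMWINDOW == N (0x...) [pid = 5984] [serial = 1] ...":

import re
import sys

# Matches both ++DOMWINDOW and --DOMWINDOW lines and captures the pid/serial pair.
pattern = re.compile(r'(\+\+|--)DOMWINDOW.*\[pid = (\d+)\].*\[serial = (\d+)\]')

def find_leaked_windows(log_path):
    live = {}  # (pid, serial) -> the ++DOMWINDOW line that created the window
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if not m:
                continue
            sign = m.group(1)
            key = (int(m.group(2)), int(m.group(3)))
            if sign == '++':
                live[key] = line.rstrip()
            else:
                # A matching --DOMWINDOW means this window was destroyed.
                live.pop(key, None)
    return live

if __name__ == '__main__':
    # Whatever is left in the dict never got a --DOMWINDOW, i.e. a leaked window.
    for pid, serial in sorted(find_leaked_windows(sys.argv[1])):
        print('[pid = %d] [serial = %d]' % (pid, serial))

Each pair it prints is a window that was created but never torn down; you can grep the raw log for that pid/serial to see which test was running at the time.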

I looked at two logs in each of the two test suites that are failing.

For the WebGL test suite, the leaked windows are being created during this test:
  dom/canvas/test/webgl-conf/generated/test_2_conformance2__misc__expando-loss-2.html

Expandos are a place where we have to deal with tricky memory management, so perhaps that is related to why it is leaking? This test has not changed recently, but some other tests were disabled in the directory recently. Comment 2 said that this started happening more frequently on Oct 24, which is when bug 1410306 landed. If we were timing out before then, maybe we wouldn't have gotten leak reports, or something.

For the clipboard test suite, the leaked windows are being created during this test:
  dom/events/test/test_bug1327798.html

This test was changed in bug 1199729, but that was in early September, so it may not be the cause of the increase on October 24.
Flags: needinfo?(continuation)
I'm guessing that bug 1371474 is the same thing, based on the test suites affected.
See Also: → 1371474
The WebGL failures seem to have stopped on Oct 26th, which is why the overall rate has dropped so much. I have no idea why that would be. Bug 1371474 shows a similar pattern.
While trying to land bug 1415692, I found that my patch there made this leak much more prevalent. The problem seems related to Places, as the only code change was calling a Places method during browser startup. Maybe that will help someone narrow down the exact cause.
See Also: → 1415692
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME

No failures since Oct 2019.

Status: REOPENED → RESOLVED
Closed: 6 years ago → 4 years ago
Resolution: --- → WORKSFORME