Closed Bug 1516629 Opened 3 years ago Closed 16 days ago

make leaks on shutdown easier to associate with a root cause


(Testing :: General, enhancement, P3)

Version 3


(Not tracked)



(Reporter: jmaher, Unassigned)



Currently, when leaks occur on shutdown, we have to do a lot of detective work to find the cause. For example, there are 25 leaks here:

But I had to open each log, wait for it to load, search for the failure, and search backwards for the last test that ran. Only then could I determine which directory of tests had been run (see bug 1513550).
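The manual steps above could be scripted roughly like this. It is only a sketch: it assumes each test start is logged as a `TEST-START | <test path>` line and that test paths use forward slashes. The function name is made up for illustration.

```python
import posixpath


def last_test_dir_before_leak(log_lines):
    """Return the directory of the last test started before the first
    leakcheck failure, or None if there is no such failure or test."""
    last_test = None
    for line in log_lines:
        if "TEST-START | " in line:
            # Assumed format: "... TEST-START | <test path>"
            last_test = line.split("TEST-START | ", 1)[1].strip()
        elif "TEST-UNEXPECTED-FAIL | leakcheck |" in line:
            return posixpath.dirname(last_test) if last_test else None
    return None
```

Tracking the most recent test while scanning forward gives the same answer as searching backwards from the failure, without needing the whole log in memory.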

This took me ~30 minutes. If the data were more readily available, I could have done it in <10 minutes, or sheriffs could even have done it automatically.

Here are the failure lines as seen in Treeherder:
1) TEST-UNEXPECTED-FAIL | leakcheck | tab 34596 bytes leaked (ChannelEventQueue, DOMEventTargetHelper, ListenerAndContextContainer, LoadInfo, Mutex, ...)
2) TEST-UNEXPECTED-FAIL | leakcheck | default missing output line for total leaks!
3) TEST-UNEXPECTED-FAIL | leakcheck | tab missing output line for total leaks!
4) [taskcluster:error] exit status 1

I am not sure we can realistically hide #4. #2 and #3 might be useful to keep around, although they are almost always ignored. Should we do something there?

The purpose of this bug is to solve #1. I would like to see something like:
TEST-UNEXPECTED-FAIL | infrastructure/server/wpt-server-websocket.sub.html | tab 34596 bytes leaked (ChannelEventQueue, DOMEventTargetHelper, ListenerAndContextContainer, LoadInfo, Mutex, ...)

That could be misleading, as the test itself didn't necessarily leak, so instead we should shorten it to the directory:
TEST-UNEXPECTED-FAIL | infrastructure/server | tab 34596 bytes leaked (ChannelEventQueue, DOMEventTargetHelper, ListenerAndContextContainer, LoadInfo, Mutex, ...)
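A minimal sketch of how that line could be built, assuming the harness knows the last test that ran. `leak_failure_line` and its parameters are hypothetical names, not existing harness code.

```python
import posixpath


def leak_failure_line(last_test, process, leaked_bytes, classes):
    """Build a leak failure line keyed on the test directory instead of
    the generic "leakcheck" label."""
    # Fall back to "leakcheck" when the test path has no directory part.
    test_dir = posixpath.dirname(last_test) or "leakcheck"
    return "TEST-UNEXPECTED-FAIL | %s | %s %d bytes leaked (%s, ...)" % (
        test_dir,
        process,
        leaked_bytes,
        ", ".join(classes),
    )
```

Called with `infrastructure/server/wpt-server-websocket.sub.html`, `tab`, `34596` and the five class names, this produces the directory-level line shown above.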

This would need to apply to web-platform-tests and all mochitests. I believe doing this will result in fewer duplicate leak bugs, more accurate sheriff annotations, and ideally an opportunity to bisect and find the root cause. At the very least, we can pinpoint the directory and allow XXXXX bytes to leak as a threshold before failing.
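The per-directory threshold idea could look something like this. It is a sketch only: the `thresholds` mapping and the function name are assumptions for illustration, and the byte values are placeholders.

```python
def leak_exceeds_threshold(test_dir, leaked_bytes, thresholds, default=0):
    """Return True if the leak should be reported as a failure.

    thresholds maps a test directory to the number of bytes it is
    allowed to leak; directories not listed use the default.
    """
    return leaked_bytes > thresholds.get(test_dir, default)
```

With a default of 0, any leak in a directory without an explicit allowance still fails, so only known-leaky directories are relaxed.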

Doing this would also adjust #2 and #3, which might give us more information about why we fail to write leak lines (for example, we fail to write them on unexpected crashes and in some timeout cases where we force a crash).

This might create more cases where many bugs are opened for the same issue (whenever a single leak cause affects a variety of test directories). That's problematic, especially because we prioritize intermittent failures based on each bug's failure frequency: having multiple bugs for the same issue may hide the severity of the issue.

On the other hand, associating a leak with a test directory gives us the opportunity to quickly assign problematic leaks to the test's triage owner, which might speed resolution.

The directory could just get dumped out as an INFO after the FAIL. Then you could easily see it by looking at the log viewer, but it wouldn't affect Treeherder matching.
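Roughly what that could look like, as a sketch. The exact wording of the INFO line is an assumption, and `leak_log_lines` is a hypothetical helper. The FAIL line stays exactly as it is today, so existing bug matching is unaffected.

```python
def leak_log_lines(process, leaked_bytes, classes, test_dir):
    """Return the unchanged FAIL line followed by an INFO line that
    names the directory of the last tests that ran."""
    fail = "TEST-UNEXPECTED-FAIL | leakcheck | %s %d bytes leaked (%s, ...)" % (
        process,
        leaked_bytes,
        ", ".join(classes),
    )
    # Hypothetical INFO format; only meant to be read in the log viewer.
    info = "TEST-INFO | leakcheck | %s leak seen after tests in %s" % (
        process,
        test_dir,
    )
    return [fail, info]
```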

Good point about losing common causes.

If we can get the directory to show up in the failure summary, that would be helpful. I would like to have quick ways to query which directories are "problematic" so we can find a subset of tests that contribute to the leak, etc.
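If the directory were in the failure summary, a query like that could be as simple as the sketch below. It assumes failure lines in the proposed `TEST-UNEXPECTED-FAIL | <dir> | ... bytes leaked ...` form; the function name is made up.

```python
from collections import Counter


def leaky_directories(failure_lines):
    """Count leak failures per directory, most frequent first."""
    counts = Counter()
    for line in failure_lines:
        parts = [p.strip() for p in line.split(" | ")]
        # Only count leak lines; skip e.g. "missing output line" failures.
        if (
            len(parts) >= 3
            and parts[0] == "TEST-UNEXPECTED-FAIL"
            and "bytes leaked" in parts[2]
        ):
            counts[parts[1]] += 1
    return counts.most_common()
```

The directories at the top of the result are the first candidates for narrowing down which subset of tests contributes to a leak.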
Priority: -- → P3
Closed: 16 days ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1715080

I'm hoping to be able to work on some of these improvements in the second half of 2021.
