Closed Bug 1271182 Opened 9 years ago Closed 8 years ago

Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught!, in various browser-chrome mochitests, one negative Runnable and sometimes one negative CancelableRunnable

Categories

(Core :: XPCOM, defect, P3)

defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox49 --- wontfix
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: cbook, Assigned: mccr8)

References

(Depends on 2 open bugs, )

Details

(Keywords: intermittent-failure)

Attachments

(4 files)

https://treeherder.mozilla.org/logviewer.html#?job_id=3843368&repo=mozilla-central#L3640 13:41:04 WARNING - TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught! mccr8, kyle: any idea what this error message mean :)
The classes are CancelableRunnable, Runnable, and nsTimerEvent. Something is wrong with the leak checking for these classes. Maybe this is a regression from bug 1266595?
See Also: → 1158234
Flags: needinfo?(khuey)
The usual cause of this is an unmarked copy constructor. But that doesn't appear to be what's going on here, because nsTimerEvent shouldn't be copy constructed. It's possible TimerEventAllocator does something crazy ...
Flags: needinfo?(khuey)
Summary: Intermittent leakcheck | default process: negative leaks caught! → Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught!
Summary: Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught! → Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught!, in various browser-chrome mochitests, one negative Runnable and sometimes one negative CancelableRunnable and nsTimerEvent
One other thought is that maybe we have a threadsafety bug in nsTraceRefcnt around the first object created for a class (for the initial object being created around the same time on two different threads).
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #6) > One other thought is that maybe we have a threadsafety bug in nsTraceRefcnt > around the first object created for a class (for the initial object being > created around the same time on two different threads). That's an interesting idea, but it does look like all of the calls to GetBloatEntry are inside AutoTraceLogLock. (It looks like calls to InitTraceLog could theoretically race because gInitialized is just a bool, but presumably that would have shown up in other ways, and hopefully we poke at it before threads are started.)
So Runnable and CancelableRunnable use NS_DECL_THREADSAFE_ISUPPORTS / NS_DECL_ISUPPORTS_INHERITED. nsTimerEvent then uses MOZ_COUNT_CTOR and MOZ_COUNT_DTOR. Those aren't really supposed to be used together, and I suspect there might be an issue with the interaction.
These are generally good practice for reference-counted objects; they catch cases where these operations are used by accident, breaking reference-counting. This doesn't show any existing problems, though. MozReview-Commit-ID: EvRkNCymOqT
Attachment #8756643 - Flags: review?(khuey)
I suspect this will fix the odd behavior we're seeing, although I haven't really thought through how very closely. MozReview-Commit-ID: qHgBRAc1PE
Attachment #8756644 - Flags: review?(khuey)
Comment on attachment 8756644 [details] [diff] [review] Remove MOZ_COUNT_[CD]TOR from nsTimerEvent, since it uses XPCOM reference counting macros Review of attachment 8756644 [details] [diff] [review]: ----------------------------------------------------------------- Not sure why you think this will fix it, but it can't hurt ...
Attachment #8756644 - Flags: review?(khuey) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd4cebf613e67bc4e04e94a5239879194819f4c0 Bug 1271182 - Add some deleted assignment operators and copy constructors. r=khuey https://hg.mozilla.org/integration/mozilla-inbound/rev/abafa95db1b11babb82ee2b32cd8c0bac784711c Bug 1271182 - Remove MOZ_COUNT_[CD]TOR from nsTimerEvent, since it uses XPCOM reference counting macros. r=khuey
OK, you were right -- that didn't fix it.
Assignee: dbaron → nobody
Keywords: leave-open
as a note, these failures seem to occur in the browser/components/contextualidentity/test/browser directory. Is it possible there is something in one of those tests?
:mconley, I see your name as author/reviewer for many of the test related changes in browser/components/contextualidentity/test/browser, can you help us figure out a solution here. Maybe figuring out what changed in early may?
Flags: needinfo?(mconley)
I'm afraid I was only a reviewer for one of the tests, and wasn't an author of any of them. Redirecting to baku, who I think did a good chunk of work here.
Flags: needinfo?(mconley) → needinfo?(amarchesini)
(In reply to Joel Maher (:jmaher) from comment #21) > :mconley, I see your name as author/reviewer for many of the test related > changes in browser/components/contextualidentity/test/browser, can you help > us figure out a solution here. Maybe figuring out what changed in early may? This is a regression from Kyle's event loop work. (Though likely it just exposed an existing problem.)
Flags: needinfo?(amarchesini)
Thanks for the info Andrew- as Kyle is on holiday, then I assume in london the week after, can we find someone else to look at his work or back it out?
Really, it would be better to figure out which test is failing and disable it than back it out. I'll try to figure out what the issue is here.
Assignee: nobody → continuation
It seems a bug triggered by how the tests are written. In contextualIdentities code there are no runnable involved. Plus, there is a CancelableRunnable, mainly used in workers: probably it's the ServiceWorker test.
dbaron's patch should at least fix this for TimerEvent, because it is invisible to the leak checker now. One tricky thing here when investigating this is that refcount logging is not threadsafe. eg thread 1 could do addref, thread 2 could do addref, then thread 1 could do a release, but we end up reporting to the refcount logger addref(to 1), release(to 1), addref(to 2). I can't think how that would affect the bloat logging, though.
Summary: Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught!, in various browser-chrome mochitests, one negative Runnable and sometimes one negative CancelableRunnable and nsTimerEvent → Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught!, in various browser-chrome mochitests, one negative Runnable and sometimes one negative CancelableRunnable
These failures occur in more than just the contextualidentity directory. They mostly seem to only leak Runnable, but sometimes also include CancelableRunnable. I assume this is related to something Kyle landed. He landed bug 1269268 and bug 1268313 on 5-5, which more or less lines up with when this was filed.
Depends on: 1279609
See Also: → 1273026
Do you have any other ideas here?
Flags: needinfo?(dbaron)
Depends on: 1158234
I found this log which has a more interesting negative leak: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=202329#L5421 141 |ConnectionPool::CloseConnectionRunnabl| 40 -40| 2 -1| 146 |ConnectionPool::IdleDatabaseInfo | 16 16| 18 1| 147 |ConnectionPool::IdleResource | 8 8| 23 1| 216 |DatabaseConnection::UpdateRefcountFunc| 168 168| 2 1| 608 |Runnable | 24 -24| 34193 -1| 674 |Statement | 120 2040| 237 17| 1403 |nsStringBuffer | 8 144| 292065 18| 1448 |nsTArray_base | 8 160| 2437944 20| There are some IDB runnables leaking. I looked them over and didn't see anything obviously weird about them, so I'm still stumped.
:khuey, we do know your code landing triggered this to be so frequent. Can you please own driving this to completion?
Flags: needinfo?(khuey)
I've been working on greening-up mochitest browser-chrome e10s on Windows 7 VM debug (AWS) and have come across this intermittent. I reduced it down to the following mochitest, hope this helps: browser/components/contextualidentity/test/browser/browser_aboutURLs.js ... 14:21:11 WARNING - TEST-UNEXPECTED-FAIL | leakcheck | default process: negative leaks caught! https://treeherder.mozilla.org/#/jobs?repo=try&revision=473ad42c6ab22cbad74b3538e999c452632ce6c5&selectedJob=24479423
Yoshi, I see you are the original author of browser_aboutURLs.js which landed right around the time this bug was filed. Can you help us fix this bug or disable the test if you don't have time to do it this month?
Flags: needinfo?(allstars.chh)
Comment on attachment 8774778 [details] Bug 1271182 - Temporarily disable browser_aboutURLS on win due to intermittent leak; https://reviewboard.mozilla.org/r/67208/#review64050 ::: browser/components/contextualidentity/test/browser/browser.ini:12 (Diff revision 1) > file_set_storages.html > serviceworker.html > worker.js > > [browser_aboutURLs.js] > +skip-if = os == "win" # intermittent negative leak bug 1271182 I believe this is problematic on all platforms, it might be good to do that. Also debug is the problem here, if we are going to disable, lets leave it running on opt. Possibly we can do windows/linux debug to greatly reduce this. I also ni'd Yoshi who wrote this test.
Attachment #8774778 - Flags: review?(jmaher) → review-
Per comment 32, this happens in more than just the contextualidentity directory. But maybe it is most frequent there, so this would at least reduce the orange rate...
I'll check this, but I have other bugs at hand right now, might don't have enough time to check this. Since Robert already has a patch for disable the test, I'll let him to disable the test first.
Flags: needinfo?(allstars.chh)
Comment on attachment 8774778 [details] Bug 1271182 - Temporarily disable browser_aboutURLS on win due to intermittent leak; Review request updated; see interdiff: https://reviewboard.mozilla.org/r/67208/diff/1-2/
Attachment #8774778 - Flags: review- → review?(jmaher)
Comment on attachment 8774778 [details] Bug 1271182 - Temporarily disable browser_aboutURLS on win due to intermittent leak; https://reviewboard.mozilla.org/r/67208/#review64468
Attachment #8774778 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher ) from comment #73) > Comment on attachment 8774778 [details] > Bug 1271182 - Temporarily disable browser_aboutURLS on win due to > intermittent leak; > > https://reviewboard.mozilla.org/r/67208/#review64468 Verified on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1d3d9dbfe17
Pushed by rwood@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/86ed11684f27 Temporarily disable browser_aboutURLS on win due to intermittent leak; r=jmaher
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
(In reply to Joel Maher ( :jmaher ) from comment #50) > :khuey, we do know your code landing triggered this to be so frequent. Can > you please own driving this to completion? I agree that I'm probably responsible for this, but I really have no idea what could be going wrong here, and this is my last week at Mozilla so I'm out of time to look into it. Sorry :'(
Flags: needinfo?(khuey)
Reopening, since this doesn't seem to be fixed, per recent comments. (Alternately, if anyone's really sure this bug should be closed and the new failures are something else, please file a new bug for the new failures so that people stop starring them as this one.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
mccr8, are you really looking into this or should we see if someone else has the cycles to?
Flags: needinfo?(continuation)
Target Milestone: mozilla50 → ---
(In reply to Ryan VanderMeulen [:RyanVM] from comment #106) > mccr8, are you really looking into this or should we see if someone else has > the cycles to? I'm out of ideas for what might cause this, or how to investigate what might cause this.
Assignee: continuation → nobody
Flags: needinfo?(continuation)
Andrew, can you please help find a new owner for this bug?
Flags: needinfo?(overholt)
Depends on: 1307961
I see Nathan hasn't posted in this bug yet. Maybe he has some ideas.
Flags: needinfo?(nfroyd)
I meant to imply that maybe he hadn't seen the bug, because this is in Core:General and he is not CCed.
I'm working on it.
The negative leaks now seem to often be CancelableRunnable, Runnable and IPC::Message. I don't remember the Message before. Figuring out something involving a Message might be more tractable than a Runnable.
Component: General → IPC
Well, now that a non-refcounted class is involved, I have some ideas for figuring out what might be going wrong.
Flags: needinfo?(overholt)
Assignee: nobody → continuation
By my count, in the last week there are 27 failures in bc5, 11 in e10s bc2, 9 in e10s bc4, 6 in bc4, 4 in bc6, 1 in bc1 and e10s bc3.
Note that specific chunk numbers can be immaterial since tests move around from chunk to chunk routinely due to the chunk-by-runtime logic the harness uses.
Potentially then we could figure out if there's a particular set of tests that are always or mostly involved when this fails.
Ah, right, I'd forgotten the test in question, browser_aboutURLs.js, is disabled on Linux and Windows, which is why the negative leak is not showing up on those platforms. My own testing indicates that test is a problem. So we could disable it on all platforms to green up the tree. I'll still look into what is actually happening here.
Whiteboard: [disabled on Windows and Linux]
Whiteboard: [disabled on Windows and Linux] → [disabled on Windows and Linux debug]
(In reply to Andrew McCreight [:mccr8] from comment #109) > I see Nathan hasn't posted in this bug yet. Maybe he has some ideas. I see that Andrew is doing a fine job with this bug. ;)
Flags: needinfo?(nfroyd)
Flags: needinfo?(dbaron)
Depends on: 1308652
Depends on: 1309042
Depends on: 1309051
What I basically ended up doing to track down this bug (ignoring the various more complex things I tried that did not pan out) was to modify GetSerialNumber in nsTraceRefcnt to assert if it did not find a serial number when aCreate is false (I'll add this in bug 1309051). This means that we are attempting to destroy an object that is not known to nsTraceRefcnt to be alive. I then ran with XPCOM_MEM_LOG_CLASSES=IPC::Message, which hit that assertion while running browser_aboutURLs.js (confirming earlier work to figure out which test was causing issues). I then added logging to browser_aboutURLs.js (I'll add this in bug 1309042) that printed out which about: page was being loaded, which indicated that it was asserting during the loading of about:bloat. I'd previously fixed an issue where something called into a helper method of the cycle collector while the CC was still running (bug 1074416), so I was immediately suspicious. I then saw that the code in DumpStatistics was ignoring logging while it was running, which seemed like it could be the issue.
Component: IPC → XPCOM
Windows was green when I did a try run.
MozReview-Commit-ID: 3x58vIBsj7D
Attachment #8799804 - Flags: review?(ryanvm)
Comment on attachment 8799804 [details] [diff] [review] Renable browser_aboutURLs.js on Windows. Review of attachment 8799804 [details] [diff] [review]: ----------------------------------------------------------------- ::: browser/components/contextualidentity/test/browser/browser.ini @@ +8,5 @@ > serviceworker.html > worker.js > > [browser_aboutURLs.js] > +skip-if = debug && os == "linux" # Assertion While I understand work is happening to re-enable this on Linux as well, we should really document this with a reference to the bug in question. Or continue to reference this bug.
Attachment #8799804 - Flags: review?(ryanvm) → review+
Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e836cda145ec Renable browser_aboutURLs.js on Windows. r=RyanVM
Keywords: leave-open
Fixed by bug 1308652. This has not recurred since that landed.
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Whiteboard: [disabled on Windows and Linux debug] → [disabled on Linux debug]
Comment on attachment 8799804 [details] [diff] [review] Renable browser_aboutURLs.js on Windows. > [browser_aboutURLs.js] >-skip-if = (debug && (os == "win" || os == "linux")) # intermittent negative leak bug 1271182 >+skip-if = debug && os == "linux" # Assertion Hi Andrew, Is the assertion you referring here is something like "Assertion failure: originAttrsLoadInfo.mUserContextId == originAttrsLoadContext.mUserContextId" ? If so, I am trying to fix this in bug 1308938, and I will remove the skip-if for linux debug here.
Flags: needinfo?(continuation)
(In reply to Yoshi Huang[:allstars.chh] from comment #128) > If so, I am trying to fix this in bug 1308938, and I will remove the skip-if for linux debug here. Yes, I know, I filed bug 1308938. :) Thanks.
Flags: needinfo?(continuation)
Depends on: 1308938
Andrew, I plan to uplift bug 1308938 to 51, do you think we can also uplift this bug to 51? Thanks
Flags: needinfo?(continuation)
That would be a good idea.
Flags: needinfo?(continuation)
Whiteboard: [disabled on Linux debug] → [disabled on Linux debug][checkin-needed-aurora][checkin-needed-beta]
(In reply to Yoshi Huang[:allstars.chh] (PTO 10.21 ~ 11.4) from comment #131) > Andrew, I plan to uplift bug 1308938 to 51, do you think we can also uplift > this bug to 51? Sure. It is a test only change so it doesn't need approval. The leak fix is in bug 1308652.
Whiteboard: [disabled on Linux debug][checkin-needed-aurora][checkin-needed-beta] → [disabled on Linux debug][checkin-needed-beta]
Whiteboard: [disabled on Linux debug][checkin-needed-beta]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: