Closed Bug 855331 Opened 7 years ago Closed 7 years ago

TransactionThreadPool doing sketchy things with mCompleteCallbacks array that can mutate

Categories

(Core :: DOM: IndexedDB, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla23
Tracking Status
firefox20 --- unaffected
firefox21 --- unaffected
firefox22 + wontfix
firefox23 + fixed
firefox-esr17 --- unaffected
b2g18 --- unaffected

People

(Reporter: bent.mozilla, Assigned: janv)

References

Details

(Keywords: regression, sec-critical, Whiteboard: [regression from bug 767944?][don't uplift yet, see comment 23][adv-main23+])

Attachments

(1 file)

See https://tbpl.mozilla.org/php/getParsedLog.php?id=21144311&tree=Mozilla-Inbound

TransactionThreadPool::FinishTransaction() calls MaybeFireCallback with a DatabasesCompleteCallback& reference that comes out of mCompleteCallbacks. It then calls Run() on the callback, which could do something like call WaitForAllDatabasesToComplete which appends an element to mCompleteCallbacks. If that causes reallocation then the DatabasesCompleteCallback& on our stack is now garbage.

Somehow the log says we're jumping to HasTransactionsForDatabase (line 472), but I can't see how that is happening. Maybe we're calling a virtual Run() on one of our callback pointers that is now garbage?
When we figure this out, we'd like to know how far back it goes.
bent says he's not working on bug 855276, which presumably means the same for this bug. Can we please get an owner? It's a frequent orange and s-s.
Kyle you lost our coin toss :) (Please find another owner if appropriate)
Assignee: nobody → khuey
Assignee: khuey → nobody
Keywords: regression
Whiteboard: work happening in bug 855276
Assignee: nobody → khuey
We suspect that this isn't actually a security issue, but we're not confident enough to open it up yet.  In particular, comment 0 is wrong, and every instance of this we've seen is a null deref.
Whiteboard: work happening in bug 855276
We're pretty convinced this is a security bug again ;-)

Still haven't managed to reproduce.
Keywords: sec-othersec-critical
I landed a patch to change from pass-by-reference to pass-by-value when calling this function and that didn't change anything, so this looks like actual heap corruption.
I'm going to assume this affects everything.  Adjust as needed.
Keywords: regression
I strongly suspect this is a regression for bug 767944.
Great, that would match up with the time this started happening.
Yes that's my primary piece of evidence ;-)
I think we should consider backing out Bug 767944.  I haven't managed to reproduce this locally :-/
Please give me a few days, I'll try to run IDB tests under valgrind.
(In reply to Jan Varga [:janv] from comment #12)
> Please give me a few days, I'll try to run IDB tests under valgrind.

Where do we stand on this? Are we considering a backout at this point? We're nearing the end of the Fx23 cycle and I don't think we want to bring this over to Aurora.
The centralized quota manager is already on Aurora, but it seems the crash doesn't happen there so often.
Could the crash somehow relate to the fact that dom/indexedDB tests now run in mochitest-3 ? They used to run in mochitest-2.
It's important to keep in mind that this patch cannot be simply backed out. It changed the directory structure of idb databases so a complete backout would effectively lose all idb data. If we decide to back this out we'll need to handle that migration as well.
I'm testing a patch here:
https://tbpl.mozilla.org/?tree=Try&rev=01b6bfd06fd7

It looks promising so far ...
16 * 40 mochitest-3 runs, no crashes in MaybeFireCallback()
Should we land it on m-c and see if it really helps ?
https://hg.mozilla.org/mozilla-central/rev/093f7b379757

Looking very good to me! Please request Aurora uplift ASAP :)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Ben, it looks to me that we were firing a callback for some other database instance and that database instance didn't exist anymore.
Not sure how can that happen, but that's the only explanation I have for now.
We really need to understand why this happened before we move on or uplift or anything else here. Otherwise we risk simply hiding the problem or (worse) moving it somewhere else.
Whiteboard: [regression from bug 767944?] → [regression from bug 767944?][don't uplift yet, see comment 23]
(In reply to ben turner [:bent] from comment #23)
> We really need to understand why this happened before we move on or uplift
> or anything else here. Otherwise we risk simply hiding the problem or
> (worse) moving it somewhere else.

So are we going to do that ...?

We need to do something on beta.
Jan, are you looking into this?
Flags: needinfo?(bent.mozilla)
(In reply to ben turner [:bent] from comment #25)
> Jan, are you looking into this?

well, I can take a look ...
do you think it could be something like the thing described in comment 22 ?
I haven't been able to come up with any explanation of why this patch seems to work. Since you came up with it I'm hoping you might have a better idea ;)
There hasn't been enough traction here to get into Beta 22, and comment 27 doesn't make me feel good about getting this into our final two betas.
This is a "kungFuDeathGrip" type patch. The worst risk of uplifting it would seem to be memory leaks (but those should show up as unexpected leaks in tbpl and don't seem to be).
Blocks: 767944
I don't know how to appeal the "wontfix" because there's no attachment on which to request approval :-( but that is what the previous comment was attempting to be.
(In reply to ben turner [:bent] from comment #23)
> We really need to understand why this happened before we move on or uplift
> or anything else here. Otherwise we risk simply hiding the problem or
> (worse) moving it somewhere else.

Whether we uplift to Fx22 or not the above is true since the patch is on its way to release regardless.
Can we back this out of trunk before we branch? There's no reason to think that this patch is what we actually want.
Can we just back this out from beta after the merge rather than inflicting the nearly once per push crash back on the rest of the tree? This was one of our most frequent oranges when we were hitting it and the rate at which this and associated bugs has moved along doesn't leave me brimming with confidence that the pain of a trunk backout will be short-lived.
Ok, Ben and I suspect that those crashes we used to have before the "noone knows why it works" patch could be caused by the issue described in the bug 878703 and that the centralized manager somehow made the crashes to happen more often.
I'm going to backout the "noone knows why it works" patch on the try server this weekend. I need to retrigger the tests like hundreds of times to be sure that the crashes don't happen anymore, that's why I'll do it on the weekend, when the load is lower.
no, bug 878703 seems to be unrelated
when I backed out the hacky patch, crashes reappeared:
https://tbpl.mozilla.org/php/getParsedLog.php?id=24186335&tree=Try
Tagging this since we seem to be shipping this in two weeks.
Whiteboard: [regression from bug 767944?][don't uplift yet, see comment 23] → [regression from bug 767944?][don't uplift yet, see comment 23][adv-main23+]
Group: core-security
You need to log in before you can comment on or make changes to this bug.