Closed Bug 764758 Opened 12 years ago Closed 12 years ago

Intermittent crash during test_xhtmlManifest.xhtml [@ nsOfflineCacheUpdateItem::OnDataAvailable]

Categories

(Core :: Networking: Cache, defect)

x86_64
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla19
Tracking Status
b2g18 --- fixed

People

(Reporter: emorley, Assigned: mayhemer)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(4 files)

Rev4 MacOSX Lion 10.7 mozilla-inbound opt test mochitests-3/5 on 2012-06-13 12:26:51 PDT for push 584ae96f6c95

slave: talos-r4-lion-075

https://tbpl.mozilla.org/php/getParsedLog.php?id=12632207&tree=Mozilla-Inbound

{
2441 INFO TEST-START | /tests/dom/tests/mochitest/ajax/offline/test_xhtmlManifest.xhtml
TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/ajax/offline/test_xhtmlManifest.xhtml | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:01:04.853570
INFO | automation.py | Reading PID log: /var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/tmp7Sfd_ypidlog
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1339613605/firefox-16.0a1.en-US.mac.crashreporter-symbols.zip
PROCESS-CRASH | /tests/dom/tests/mochitest/ajax/offline/test_xhtmlManifest.xhtml | application crashed (minidump found)
Crash dump filename: /var/folders/qd/srwd5f710sj0fcl9z464lkj00000gn/T/tmpeQN5Os/minidumps/BC32F20A-7DE5-469F-A027-84C92E87BD2E.dmp
Operating system: Mac OS X
                  10.7.2 11C74
CPU: amd64
     family 6 model 23 stepping 10
     2 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash address: 0xd0

Thread 0 (crashed)
 0  XUL!nsOfflineCacheUpdateItem::OnDataAvailable [nsOfflineCacheUpdate.cpp : 1882 + 0x0]
    rbx = 0x140fbb80   r12 = 0x0a6f1110   r13 = 0x00000000   r14 = 0x0000016a
    r15 = 0x00000000   rip = 0x01c48d5b   rsp = 0x5fbfba10   rbp = 0x00000000
    Found by: given as instruction pointer in context
 1  XUL!nsStreamListenerTee::OnDataAvailable [nsStreamListenerTee.cpp : 89 + 0x1a]
    rbx = 0x00000000   r12 = 0x0a6f1110   r13 = 0x00000000   r14 = 0x0000016a
    r15 = 0x00000000   rip = 0x010590f9   rsp = 0x5fbfba40   rbp = 0x0cd21580
    Found by: call frame info
 2  XUL!mozilla::net::nsHttpChannel::OnDataAvailable [nsHttpChannel.cpp : 5053 + 0x16]
    rbx = 0x08906800   r12 = 0x0000016a   r13 = 0x003d2000   r14 = 0x00000000
    r15 = 0x0a6f1110   rip = 0x010d5a5c   rsp = 0x5fbfbad0   rbp = 0x0000016a
    Found by: call frame info
 3  XUL!nsInputStreamPump::OnStateTransfer [nsInputStreamPump.cpp : 486 + 0x1f]
    rbx = 0x00000000   r12 = 0x003d2000   r13 = 0x5fbfbb4c   r14 = 0x5fbfbb30
    r15 = 0x5fbfbc4f   rip = 0x010385fb   rsp = 0x5fbfbb20   rbp = 0x16430b70
    Found by: call frame info
 4  XUL!nsInputStreamPump::OnInputStreamReady [nsInputStreamPump.cpp : 370 + 0x7]
    rbx = 0x16430b70   r12 = 0x16430b78   r13 = 0x00307da8   r14 = 0x5fbfbbe0
    r15 = 0x5fbfbc4f   rip = 0x01038a58   rsp = 0x5fbfbb80   rbp = 0x003d2000
    Found by: call frame info
 5  XUL!nsInputStreamReadyEvent::Run [nsStreamUtils.cpp : 82 + 0x5]
    rbx = 0x161678e0   r12 = 0x00000000   r13 = 0x00307da8   r14 = 0x5fbfbbe0
    r15 = 0x5fbfbc4f   rip = 0x0202b51c   rsp = 0x5fbfbba0   rbp = 0x8000ffff
    Found by: call frame info
 6  XUL!nsThread::ProcessNextEvent [nsThread.cpp : 624 + 0x5]
    rbx = 0x00307d80   r12 = 0x00000000   r13 = 0x00307da8   r14 = 0x5fbfbbe0
    r15 = 0x5fbfbc4f   rip = 0x020400b3   rsp = 0x5fbfbbb0   rbp = 0x8000ffff
    Found by: call frame info
 7  XUL!NS_ProcessPendingEvents_P [nsThreadUtils.cpp : 166 + 0xd]
    rbx = 0x00307d80   r12 = 0xe751dd1c   r13 = 0x00000014   r14 = 0x5fbfbc4f
    r15 = 0x00306d88   rip = 0x01fff82e   rsp = 0x5fbfbc40   rbp = 0x00000000
    Found by: call frame info
 8  XUL!nsBaseAppShell::NativeEventCallback [nsBaseAppShell.cpp : 97 + 0x13]
    rbx = 0x05b36480   r12 = 0x00307d80   r13 = 0x00000000   r14 = 0x05b0a458
    r15 = 0x00306d88   rip = 0x01e69f19   rsp = 0x5fbfbc80   rbp = 0x00000002
    Found by: call frame info
 9  XUL!nsAppShell::ProcessGeckoEvents [nsAppShell.mm : 402 + 0x7]
    rbx = 0x05b0a440   r12 = 0x003d2000   r13 = 0x74f1dae0   r14 = 0x05b0a458
    r15 = 0x00306d88   rip = 0x01e2e09f   rsp = 0x5fbfbcb0   rbp = 0x05b36480
}
Could it be that Thread 0 stack is not complete?  My idea is a re-entrance.  But the stack starts with CoreFoundation (in full log) that indicates it's the thread's entry point (though, I'm not a Gecko-on-MacOSX expert, however, some code in ProcessGeckoEvents suggests it may be re-entrant aware).
Assignee: nobody → jschoenick
Status: NEW → ASSIGNED
John, are you working on this?
(In reply to Honza Bambas (:mayhemer) from comment #20)
> John, are you working on this?

I spent some time looking at this, and made some unsuccessful attempts to reproduce this in a debugger. As you noted it's not clear how we could be re-entering, and I didn't see any obvious issues looking over the code path.

I'll look at it some more today, if you have any ideas let me know.
It's definitely a regression from one of these changesets:
http://hg.mozilla.org/mozilla-central/rev/10b04214089d
http://hg.mozilla.org/mozilla-central/rev/f2b089df69b3

Question is, which one.  Hard to say based on the low frequency of this crash.

I'd like to keep this on the radar, since it may be caused by some more serious issue hidden under, however that is not very likely (from my experience).

I personally would land an investigation patch that MOZ_RUNTIME_ABORT's (or what is the name) when mUpdate is null on the crashing place with some status message about the object.  It has to be smart.  I don't at the moment have an idea about the cause, but some mistake in the parallelization code is very likely.
Thank you for looking at this :-)
John, this is really something we need to take care about (we already had to, actually, since it really hides or creates more serious issue).  

What is your progress here?

If none, then I will take this and start working on this immediately.

Thanks.
Apologies, this ended up pretty low on my priority list, and I didn't get the sense that it was very critical from comment 22.
Assignee: jschoenick → honzab.moz
Attached patch v1Splinter Review
Jason, this review is quit urgent, please take a look soon, the patch is extremely simple.

https://tbpl.mozilla.org/?tree=Try&rev=49c69c40ec59


Cause of the crash, quit racy:
- important note here is that the first item in a manifest must lead to e.g. a 404 and have a lot (16+) of following, working, explicit items
- we post nsOfflineCacheUpdate::Run() up to the count of the parallel limit
- there may be a lot of them, up to 16
- the channel for the first item finishes with an error (like 404) before all those nsOfflineCacheUpdate::Run() gets processed
- nsOfflineCacheUpdate switches to STATE_FINISHED and cancels all items that are in progress (what are most of them)
- canceling an item means to a) cancel its http channel if there is such yet b) set its mState to UNINITIALIZED
- then, another nsOfflineCacheUpdate::Run() gets called and searches for an item that is in state UNINITIALIZED (most of them) ; [this is the bug actually, it should just immediately return when the update is already done]
- it finds it and opens another channel on it
- problem is that OnStopRequest on the item posts nsOfflineCacheUpdateItem::Run() [method of the item in this case, not the update] that may get triggered later, after the previous step
- call to nsOfflineCacheUpdateItem::Run() is already in the queue while we have opened a new (second) channel for which the same item is again a listener
- nsOfflineCacheUpdateItem::Run() gets fired, clears mUpdate
- the second channel now calls OnStopRerquest => Bang!
Attachment #670810 - Flags: review?(jduell.mcbugs)
An additional check that should prevent this issue when we changes something in the future.

This has to land only on m-c since, IMO dangerous for m-a+.
Attachment #670813 - Flags: review?(jduell.mcbugs)
Comment on attachment 670813 [details] [diff] [review]
v1 annexe for m-c only

The annex has its own try run: https://tbpl.mozilla.org/?tree=Try&rev=203090a833f7
Comment on attachment 670810 [details] [diff] [review]
v1

Review of attachment 670810 [details] [diff] [review]:
-----------------------------------------------------------------

::: uriloader/prefetch/nsOfflineCacheUpdate.cpp
@@ +1714,5 @@
>      NS_ASSERTION(mState == STATE_DOWNLOADING,
>                   "ProcessNextURI should only be called from the DOWNLOADING state");
>  
> +    if (mState != STATE_DOWNLOADING) {
> +        LOG(("  after completion, ignoring"));

Do you know that the only other state possible here is after completion?  If not, better to use "Unexpected: mState != STATE_DOWNLOADING.  Ignoring URI".
Attachment #670810 - Flags: review?(jduell.mcbugs) → review+
Comment on attachment 670813 [details] [diff] [review]
v1 annexe for m-c only

Review of attachment 670813 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good too.  Do you intend for this to bake on m-c for a while then land on aurora, or do you just want to let it ride the trains?  It will miss B2G v1 if it doesn't land on aurora.
Attachment #670813 - Flags: review?(jduell.mcbugs) → review+
(In reply to Jason Duell (:jduell) from comment #30)
> Do you know that the only other state possible here is after completion?  If
> not, better to use "Unexpected: mState != STATE_DOWNLOADING.  Ignoring URI".

I changed the log to "  should only be called from the DOWNLOADING state, ignoring", similar to what the assertion comment looks like.


(In reply to Jason Duell (:jduell) from comment #31)
> Looks good too.  Do you intend for this to bake on m-c for a while then land
> on aurora, or do you just want to let it ride the trains?  It will miss B2G
> v1 if it doesn't land on aurora.

I think it is OK to just land and leave this part on m-c.  It is not needed for the fix.


BTW, I am also able to modify an existing test [1] to reproduce the crash on OSX instantly [2].  I may produce a patch for it too...

[1] https://hg.mozilla.org/try/rev/e4ce2cbec104
[2] https://tbpl.mozilla.org/?tree=Try&rev=e4ce2cbec104
https://hg.mozilla.org/mozilla-central/rev/7e6fcfed1b74
https://hg.mozilla.org/mozilla-central/rev/14516fa301a3
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Whiteboard: [orange]
Attachment #670810 - Flags: approval-mozilla-b2g18?
Attachment #670813 - Flags: approval-mozilla-b2g18?
Comment on attachment 702488 [details] [diff] [review]
v1 annexe merged for b2g18

[Approval Request Comment]
Regression caused by (bug #): 744719
User impact if declined: potential null deref crash
Testing completed (on m-c, etc.): fix now on beta
Risk to taking this patch (and alternatives if risky): very low according the time we have been testing it on trains
Attachment #702488 - Attachment is patch: true
Attachment #702488 - Flags: approval-mozilla-release?
Comment on attachment 702487 [details] [diff] [review]
v1 merged for b2g18

[Approval Request Comment]
Same as the other part of the fix.
Attachment #702487 - Flags: approval-mozilla-release?
Attachment #702487 - Flags: approval-mozilla-b2g18?
Attachment #670810 - Flags: approval-mozilla-b2g18?
Attachment #670813 - Flags: approval-mozilla-b2g18?
Attachment #702488 - Flags: approval-mozilla-b2g18?
> [Approval Request Comment]

Please note that we believe this is causing bug 829637, a b2g18 test permaorange, which likely reflects broken appcache functionality.
Comment on attachment 702487 [details] [diff] [review]
v1 merged for b2g18

Approving for b2g18 branch, no need for this on the mozilla-release branch - we're already past the deadline for 18.0.1 landings.
Attachment #702487 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
Attachment #702487 - Flags: approval-mozilla-release? → approval-mozilla-release-
Attachment #702488 - Flags: approval-mozilla-release?
Attachment #702488 - Flags: approval-mozilla-release-
Attachment #702488 - Flags: approval-mozilla-b2g18?
Attachment #702488 - Flags: approval-mozilla-b2g18+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: