Closed Bug 687710 Opened 13 years ago Closed 13 years ago

Restoring session causes hang from nsSHEntry parent loop

Categories

(Firefox :: Session Restore, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
Firefox 10
Tracking Status
firefox8 + affected
firefox9 + fixed
firefox10 + fixed

People

(Reporter: sfink, Assigned: justin.lebar+bug)

References

Details

(Keywords: hang, verified-aurora, verified-beta, Whiteboard: [qa!][inbound])

Attachments

(6 files, 3 obsolete files)

For the last 2 or 3 nightlies, I've been getting some very, very slow responses, and now sometimes complete hangs.

For this latest nightly, I downloaded symbols and attached with gdb. It shows a pair of nsSHEntry objects that are parented to each other:

(gdb) p result->mURI.mRawPtr
$25 = (nsStandardURL *) 0x7f8bcf24d860 "http://www.google.com/" (22 more characters)
...walk through the loop, allowing it to do GetParent...
(gdb) p result->mURI.mRawPtr
$26 = (nsStandardURL *) 0x7f8bcfe73ba0 "http://arstechnica.com/tech-policy/news/2011/08/week-in-tech-how-fast-is-your-internet-edition.ars" (98 more characters)

(This is with jdm's archer-mozilla extensions partly applied, which is why I can see the strings.)

If I continue walking the loop, it toggles between those two.

It's now hanging fast enough that I'm having difficulty getting the about:buildconfig from it, but application.ini says:

Vendor=Mozilla
Name=Firefox
Version=9.0a1
BuildID=20110918030911
SourceRepository=http://hg.mozilla.org/mozilla-central
SourceStamp=06445f55f009
ID={ec8030f7-c20a-464f-9b0e-13a3a9e97384}

(gdb) bt
#0  GetRootSHEntry (aEntry=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10471
#1  0x00007f8c0a9885af in nsDocShell::SetHistoryEntry (this=0x7f8bf5c1d000, aPtr=0x7f8bf5c1d288, aEntry=0x7f8be1828780) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10488
#2  0x00007f8c0a98f8d1 in nsDocShell::InternalLoad (this=<value optimized out>, aURI=0x7f8bd6393080, aReferrer=0x7f8bd6393160, aOwner=<value optimized out>, aFlags=<value optimized out>, aWindowTarget=<value optimized out>, aTypeHint=0x7f8c0bd13d20 "", aPostData=0x0, aHeadersData=0x0, aLoadType=<value optimized out>, aSHEntry=0x7f8be1828780, aFirstParty=1, aDocShell=0x0, aRequest=0x0) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:8678
#3  0x00007f8c0a99cc00 in nsDocShell::LoadHistoryEntry (this=0x7f8bf5c1d000, aEntry=<value optimized out>, aLoadType=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10217
#4  0x00007f8c0a99158b in nsDocShell::LoadURI (this=0x7f8bf5c1d000, aURI=0x7f8bd6393080, aLoadInfo=<value optimized out>, aLoadFlags=<value optimized out>, aFirstParty=0) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:1364
#5  0x00007f8c0a9c047c in nsSHistory::InitiateLoad (this=<value optimized out>, aFrameEntry=<value optimized out>, aFrameDS=0x7f8bf5c1d100, aLoadType=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:1772
#6  0x00007f8c0a9c0d4d in nsSHistory::LoadEntry (this=0x7f8bf5c52160, aIndex=5, aLoadType=2, aHistCmd=3) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:1639
#7  0x00007f8c0a9bf958 in nsSHistory::ReloadCurrentEntry (this=0x7f8bf5c52160) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:925
#8  0x00007f8c0aca5aca in NS_InvokeByIndex_P (that=<value optimized out>, methodIndex=<value optimized out>, paramCount=<value optimized out>, params=<value optimized out>) at ../../../../../../../xpcom/reflect/xptcall/src/md/unix/xptcinvoke_x86_64_unix.cpp:195
#9  0x00007f8c0a902f23 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) () from /home/sfink/Downloads/firefox/libxul.so
#10 0x00007f8c0a90a3a0 in XPC_WN_CallMethod(JSContext*, unsigned int, unsigned long*) () from /home/sfink/Downloads/firefox/libxul.so
#11 0x00007f8c0b039d92 in CallCompiler::generateNativeStub() () from /home/sfink/Downloads/firefox/libxul.so
#12 0x00007f8c0b037d93 in js::mjit::ic::NativeCall(js::VMFrame&, js::mjit::ic::CallICInfo*) () from /home/sfink/Downloads/firefox/libxul.so
#13 0x00007f8bf63f9748 in ?? ()
#14 0x0000000500000000 in ?? ()
#15 0x0000000000000142 in ?? ()
#16 0x00007fffae0abf60 in ?? ()
#17 0x0000000000000000 in ?? ()
From the stack looks like session history, moving to Document Navigation
Component: Bookmarks & History → Document Navigation
Product: Firefox → Core
QA Contact: bookmarks → docshell
Later crashes are showing the same URL pair loop.

When I start, it's alive for a little while, then hangs. I think it may be when the delayed loading makes it to the relevant URLs.
If I go up a level in the stack, I think it tells me what URL it started with:

(gdb) up
#1  0x00007f64a30885af in nsDocShell::SetHistoryEntry (this=0x7f647b56e400, aPtr=0x7f647b56e688, aEntry=0x7f6467784780) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10488
(gdb) p aEntry->mURI.mRawPtr
$41 = (nsStandardURL *) 0x7f64677887c0 "https://blog.mozilla.com/sfink/wp-login.php?redirect_to=https%3A%2F%2Fblog.mozilla.com%2Fsfink%2Fwp-admin%2Fedit-comments.php&reauth=1" (134 more characters)

(that's going to do a redirect, though it probably isn't relevant).
Um.... having two shentries parented to each other is bad juju, obviously.  Any ideas on how to reproduce?
1. Sit in my chair. Open up my laptop.
2. Run firefox or firefox -safe-mode (it happens either way.)
3. Restore the previous session.
4. Wait for 3 minutes. (Almost exactly 3 minutes, it seems. Perhaps that's a hint.)

I am on 64-bit Linux (Fedora 14).

It may be tied to my saved session. I'd rather not attach my sessionstore.js to a public bug, but I can email it to anyone interested. It's 32MB uncompressed, and has about 252 tabs in 3 windows. I made it to 116 tabs open before it froze (measured by hitting reload on about:tabs every few seconds, which is one of Dietrich's extensions.)

About Tabs

    252 tabs across 3 groups in 3 windows
    116 tabs open
    231 unique addresses
    91 unique hosts
    1 empty tab
    127 https:
    112 http:
    6 file:
    1 ftp:
    5 about:
I'll try whittling down my sessionstore.js tomorrow to see if it's a particular URL or set of URLs. Any easy way to have it log the URL of each tab it restores?

I guess I should bisect it too.
This was "fun" to whittle down. The attached sessionstore.js causes the hang immediately on my machine. It's a session with exactly two tabs. There are children entries in there (for back/forward, I guess?), but every single one of them is a b.m.o. URL. I deleted all cookies, extData, etc. (I didn't try removing all of the children entries.)

Next I'll have to blog about my interactive command-line JSON navigator/whacker that I wrote to do this...
Some more detail:

There are two tabs, one for show_bug.cgi?id=578560 and show_bug.cgi?id=673451. Both have a single 'entries' entry, and each of those has 12 'children' entries, all of which are show_activity.cgi?id=(same). (So there are two sets of 12 children, with bug ids matching the parent tabs.)

What happens is that at one point there is an nsSHEntry tree with show_bug.cgi?id=578560 parented to show_bug.cgi?id=673451. Then we get an "show_bug.cgi?id=578560"->addChild(show_activity.cgi?id=578560). show_bug.cgi?id=673451 and show_activity.cgi?id=578560 have the same nsSHEntry.mShared field value, which seems very odd, and so after reparenting the two mShared fields form a loop.

If you don't mind me using the output of my json explorer:

> cat windows/0/tabs/*/entries/*/url
windows/0/tabs/0/entries/0/url = https://bugzilla.mozilla.org/show_bug.cgi?id=673451
windows/0/tabs/1/entries/0/url = https://bugzilla.mozilla.org/show_bug.cgi?id=578560
> cat windows/0/tabs/*/entries/*/children/*/url
windows/0/tabs/0/entries/0/children/0/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/1/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/2/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/3/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/4/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/5/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/6/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/7/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/8/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/9/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/10/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/11/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/1/entries/0/children/0/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/1/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/2/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/3/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/4/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/5/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/6/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/7/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/8/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/9/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/10/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/11/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560

Another odd thing I noticed is that two of the children's docShellIDs are the same, and match the docShellID of their tab (marked with "<---"):

> cd /windows/0/tabs/
0/
1/
/windows/0/tabs> cat */entries/0/docshellID 
0/entries/0/docshellID = 429
1/entries/0/docshellID = 439                  <---
/windows/0/tabs> cat */entries/0/children/*/docshellID 
0/entries/0/children/0/docshellID = 430
0/entries/0/children/1/docshellID = 243
0/entries/0/children/2/docshellID = 249
0/entries/0/children/3/docshellID = 239
0/entries/0/children/4/docshellID = 278
0/entries/0/children/5/docshellID = 354
0/entries/0/children/6/docshellID = 314
0/entries/0/children/7/docshellID = 226
0/entries/0/children/8/docshellID = 208
0/entries/0/children/9/docshellID = 175
0/entries/0/children/10/docshellID = 253
0/entries/0/children/11/docshellID = 247
1/entries/0/children/0/docshellID = 314
1/entries/0/children/1/docshellID = 226
1/entries/0/children/2/docshellID = 208
1/entries/0/children/3/docshellID = 175
1/entries/0/children/4/docshellID = 253
1/entries/0/children/5/docshellID = 247
1/entries/0/children/6/docshellID = 439       <---
1/entries/0/children/7/docshellID = 439       <---
1/entries/0/children/8/docshellID = 226
1/entries/0/children/9/docshellID = 178
1/entries/0/children/10/docshellID = 257
1/entries/0/children/11/docshellID = 251
Summary: Nightly hangs on Linux → Restoring session causes hang from nsSHEntry parent loop
Component: Document Navigation → Session Restore
Product: Core → Firefox
QA Contact: docshell → session.restore
Version: unspecified → Trunk
I think I'm reproducing the same bug here.
I've set firefox to only load the focused tab at startup.
When I click on a tab which is not already loaded, at some point firefox freeze with 100% CPU load.

gdb back trace give me this :

#0  GetRootSHEntry (aEntry=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10427
#1  0x00007fc5ab3ffa33 in nsDocShell::SetHistoryEntry (this=0x7fc58079d800, aPtr=0x7fc58079da88, 
    aEntry=0x7fc580d61280)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10444
#2  0x00007fc5ab40aecb in nsDocShell::InternalLoad (this=0x7fc58079d800, aURI=0x7fc5816a6320, 
    aReferrer=0x7fc5816a6400, aOwner=0x7fc56dfe5d30, aFlags=0, aWindowTarget=0x7fc500000000, 
    aTypeHint=0x7fc5ac72f930 "", aPostData=0x0, aHeadersData=0x0, aLoadType=4, aSHEntry=0x7fc580d61280, 
    aFirstParty=1, aDocShell=0x0, aRequest=0x0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:8681
#3  0x00007fc5ab40689c in nsDocShell::LoadHistoryEntry (this=0x7fc58079d800, aEntry=0x7fc580d61280, aLoadType=4)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10173
#4  0x00007fc5ab3f80a7 in nsDocShell::LoadURI (this=0x7fc58079d800, aURI=0x7fc5816a6320, 
    aLoadInfo=<value optimized out>, aLoadFlags=0, aFirstParty=0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:1364
#5  0x00007fc5ab42a06a in nsSHistory::InitiateLoad (this=<value optimized out>, aFrameEntry=<value optimized out>, 
    aFrameDS=0x7fc58079d900, aLoadType=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:1772
#6  0x00007fc5ab42a8a6 in nsSHistory::LoadEntry (this=0x7fc5807efca0, aIndex=1, aLoadType=2, aHistCmd=3)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:1639
#7  0x00007fc5ab4296f8 in nsSHistory::ReloadCurrentEntry (this=0x7fc5807efca0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:925
#8  0x00007fc5ab711a7d in NS_InvokeByIndex_P (that=<value optimized out>, methodIndex=<value optimized out>, 
    paramCount=0, params=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_x86_64_unix.cpp:195
#9  0x00007fc5ab379170 in Invoke (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:3147
#10 Call (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2370
#11 XPCWrappedNative::CallMethod (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2334
#12 0x00007fc5ab37c7da in XPC_WN_CallMethod (cx=0x7fc595a5c400, argc=0, vp=0x7fc5972637e8)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1629
#13 0x00007fc5aba8cdd5 in CallJSNative (this=0x7fffacea94b0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/jscntxtinlines.h:296
#14 CallCompiler::generateNativeStub (this=0x7fffacea94b0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/methodjit/MonoIC.cpp:937
#15 0x00007fc5aba89cb3 in js::mjit::ic::NativeCall (f=<value optimized out>, ic=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/methodjit/MonoIC.cpp:1162
#16 0x00007fc557792a7b in ?? ()
#17 0x00007fc58a83b190 in ?? ()
#18 0x0000000000000000 in ?? ()


To unfreeze it I usually do one or more time :
> gdb -p $(pidof firefox)
(gdb) next 10
(gdb) next (to be on the line http://mxr.mozilla.org/comm-central/source/mozilla/docshell/base/nsDocShell.cpp#10425)
(gdb) set var rootEntry = 0L
(gdb) cont


My firefox was build with :
--host=x86_64-linux-gnu --prefix=/usr --libexecdir=/usr/lib/firefox-9.0a2 '--with-l10n-base=/build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/l10n' '--srcdir=/build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla' --disable-install-strip --disable-updater --enable-application=browser --enable-startup-notification --with-distribution-id=com.ubuntu --enable-optimize --enable-tests --enable-ipdl-tests --disable-crashreporter --with-branding=browser/branding/aurora --disable-gnomevfs --enable-gio --enable-update-channel=aurora --disable-debug --disable-elf-hack --enable-extensions=default,globalmenu



Also this other bug seems to be a duplicate :
https://bugzilla.mozilla.org/show_bug.cgi?id=679444
So, are there exact steps to reproduce this?
(In reply to Olli Pettay [:smaug] from comment #11)
> So, are there exact steps to reproduce this?

I haven't tried on a different profile or OS, but this will reproduce it for me:

1. With firefox not running, save the attachment from this bug ( https://bugzilla.mozilla.org/attachment.cgi?id=561922 ) to $PROFILE/sessionstore.js.

2. Start up firefox on that profile.

3. Choose to restore the previous session.

That will immediately hang for me.

Note that deleting /windows/0/tabs/*/entries/*/children from my sessionstore.js does not fix the problem for me, so that might be an even "smaller" test case.
Build: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0a1) Gecko/20111004 Firefox/10.0a1. Built from http://hg.mozilla.org/mozilla-central/rev/149fc4a6efca

I've hit this bug today on OS X 10.6.8 with today's nightly.  Nightly went into an infinite loop using 100% CPU with a similar stacktrace and STR.  I had restarted Nightly a few minutes before and it loaded some tabs fine. Then I clicked to load a bugzilla tab with session restore since I have browser.sessionstore.restore_on_demand set to true and experience the indefinite hang.

Sample from activity monitor is attached.
Thanks Matt. I'm interested, can you reproduce this on a new profile with the same session data?
OS: Linux → All
Hardware: x86_64 → All
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #14)
> Thanks Matt. I'm interested, can you reproduce this on a new profile with
> the same session data?

I didn't try since there was already a reduced testcase attached and I can reproduce the indefinite hang with it on a clean profile.
Attached file very minimal test case
I got this down to the most minimal case which will reproduce the hang. I may have stripped out other info that could be relevant to figuring out how we got into this position (not hard to correlate back if need be)

docIdentifiers are what's guilty here. Somehow you're in a state where children's docIdentifiers are the same as an unrelated entry. docIdentifier is used for bfcache stuff. See https://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/src/nsSessionStore.js#1923 and https://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/src/nsSessionStore.js#3183 (the latter being what's actually going to result in the hang)

So we can craft this case, but I'm most interested in how we got into this state in real life. We pull this value off the entry directly.

I'm also confused by the number of children entries. I thought that had to do with frames on the page, but afaik bugzilla doesn't do that.
I can reproduce this in a clean mozilla-central debug build, STR:
1. create clean profile, set "show windows/tabs from last time", exit
2. firefox $(cat tp4-urls.txt)  # where tp4-urls.txt contains the 100 Tp4 test urls
3. exit
4. firefox  # => hang trying to restore those 100 tabs.
5. if it doesn't occur directly, try ctrl+pg_down/up to cycle through the tabs
Severity: normal → critical
Keywords: hang
Changing to track firefox 9 since that is what Comment 0 was reported on.
Assignee: nobody → Olli.Pettay
Actually, if the bug is in session restore, would be better if someone who knows that code fixes this.
Assigning back to default
Assignee: Olli.Pettay → nobody
Regression window(cached m-c hourly),
Works:
http://hg.mozilla.org/mozilla-central/rev/c2a23df05c4c
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110808 Firefox/8.0a1 ID:20110808000159
Hangs
http://hg.mozilla.org/mozilla-central/rev/9c7ab0a15292
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110808 Firefox/8.0a1 ID:20110808052250
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c2a23df05c4c&tochange=9c7ab0a15292


Regression window(cached m-i hourly),
Works:
http://hg.mozilla.org/integration/mozilla-inbound/rev/
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110806 Firefox/8.0a1 ID:20110806110835
Hangs
http://hg.mozilla.org/integration/mozilla-inbound/rev/53af4a6b8965
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110806 Firefox/8.0a1 ID:20110806154229
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f182f03aaee9&tochange=53af4a6b8965

Suspected bug:
Bug 646641 - If a document has multiple history entries (either by pushstate or anchor navs) only the last one is hooked into bfcache
Blocks: 646641
Component: Session Restore → Document Navigation
Product: Firefox → Core
QA Contact: session.restore → docshell
Thanks Alice.
> docIdentifiers are what's guilty here. Somehow you're in a state where children's docIdentifiers are 
> the same as an unrelated entry.

That's pretty bad!  I wonder how that's happening.
(In reply to Alice0775 White from comment #21)

I think I have misunderstood.
The regression range only shows that the browser hangs  with corrupted sessionstore.js of Comment 16.

So this does not show when sessionstore.js collapse began at.
Component: Document Navigation → Session Restore
Product: Core → Firefox
QA Contact: docshell → session.restore
(In reply to Alice0775 White from comment #24)
> (In reply to Alice0775 White from comment #21)
> 
> I think I have misunderstood.
> The regression range only shows that the browser hangs  with corrupted
> sessionstore.js of Comment 16.
> 
> So this does not show when sessionstore.js collapse began at.

Okay.  It's not unlikely that bug 646641 caused both the hang and the corruption, but if we didn't previously hang with that corrupted sessionstore.js, we should fix that even if we can't figure out why it's getting corrupted.
---------------------------------[ Triage Comment ]---------------------------------

We should track this for Firefox 8/9/10 as it appears to be a bad regression from bug 646641 and we need to figure what to do.

Is there any real downside to backing out bug 646641 for Firefox 8? Does it add more risk to do so in your opinion? How often will this be hit?
> Is there any real downside to backing out bug 646641 for Firefox 8?

A risk is that the sessionstore.js files generated with the patch from bug 646641 may not be compatible with session restore pre-bug 646641.  We could nuke peoples' tabs on restart.  I'll look at this in the morning.

> How often will this be hit?

We don't know what's causing the sessionstore.js corruption (it may be bug 646641, or it may be something else), so it's hard to say how often it'd be hit.
It sounds like the priority task on this bug is tracking down the cause(s) of corruption. Is this something QA can help out with? Can you give us some direction?
Whiteboard: [qa+]
Comment 17 could be a place to start.  You could narrow down the causes of the corruption to a smaller set of the pages in TP4, and you could then bisect to find where the corrupted sessionstore.js files are coming from.

You'd need to write a script to recognize the corruption, though.  See Paul's very minimal testcase.
(In reply to Justin Lebar [:jlebar] from comment #27)
> > Is there any real downside to backing out bug 646641 for Firefox 8?
> 
> A risk is that the sessionstore.js files generated with the patch from bug
> 646641 may not be compatible with session restore pre-bug 646641.  We could
> nuke peoples' tabs on restart.  I'll look at this in the morning.

I think they should be fine, but then again I don't know the format of entry.BFCacheEntry.ID vs entry.docIdentifier. I guess it's possible that we'll still have people you have sessionstore.js files with these circular references though. So if we back out bug 646641, we'll go back to what sessionstore was doing & setting entry.docIdentifier. Would that still have the same hang potential?
> Would that still have the same hang potential?

Probably not.  I think the hang is happening because with bug 646641, we take "these two shentries have the same doc identifier" much more seriously.  They share state (the BFCacheEntry).  Still not sure exactly what's going on, though.

In whatever branches keep bug 646641, I think we'll need to scrub the sessionrestore data for these loops...
Christian, can you advise on scheduling considerations for a backout?  I'd like to understand this problem better before brute-forcing it (and regressing what's fixed by bug 646641, which landed in FF7).  When's the next beta being spun?
This finds a cycle in both the testcases in this bug, but finds no cycles in my personal sessionrestore.js.

(It's not very smart; it happily finds the same cycle more than once.)
See a really strange cycle in bug 696973 comment 3.
(In reply to Mats Palmgren [:mats] from comment #17)
> I can reproduce this in a clean mozilla-central debug build, STR:
> 1. create clean profile, set "show windows/tabs from last time", exit
> 2. firefox $(cat tp4-urls.txt)  # where tp4-urls.txt contains the 100 Tp4
> test urls
> 3. exit
> 4. firefox  # => hang trying to restore those 100 tabs.
> 5. if it doesn't occur directly, try ctrl+pg_down/up to cycle through the
> tabs

This doesn't work for me, unfortunately.
Attached patch Hack v1 (obsolete) — Splinter Review
This fixes the hang on my machine, but I'm not sure it's right.

This patch says that the scope of a doc identifier is limited to a node's children.  You can share a document with your sibling, but not with your cousin.  bz or smaug, do you know if this is right?

I still have no idea how we're getting into this state in the first place...
Attachment #570361 - Flags: review?(paul)
Attachment #570361 - Flags: feedback?(bzbarsky)
Comment on attachment 570361 [details] [diff] [review]
Hack v1

Review of attachment 570361 [details] [diff] [review]:
-----------------------------------------------------------------

Can you add a test using a staste along the lines of the minimal test case to make sure the state following a restore is no longer cyclical.

This will wipe out some legitimate same-id cases, but I don't know that it's a big deal. For example:
     P
   /   \
  C     D
 / \   / \
G   H G   I

GHGI would then remap to GHJI
Attachment #570361 - Flags: review?(paul)
Attachment #570361 - Flags: review?(mark.finkle)
Attachment #570361 - Flags: review+
Actually based on comment 38, maybe my example isn't legitimate?
Attachment #570361 - Flags: review?(mark.finkle) → review+
> bz or smaug, do you know if this is right?

I don't think it is.

Consider document A framing document B framing document C.  Then you navigate to a named anchor in C, then you navigate to a named anchor in B.

Now you have three SHEntries for C, two of which are cousins.

What you _can_ guarantee should be true is that if two entries have the same doc identifier N than their parents both have doc identifier M.

A possible implementation for that would be to store the childDocIdentMap in aDocIdentMap as part of the stuff keyed to our identifier, so that it can be retrieved when processing the cousin...  Effectively, aDocIdentMap becomes a map from a doc identifier to whatever data we have now plus the doc identifier map for kids of that _document_.  Make sense?
> Make sense?

Think so.  Thanks.
I'm hoping I can write a fix for this in the next day or two.  But it seems unlikely that we're going to fix this one way or another for beta.  Backing out bug 646641 scares me, but so does taking a sessionrestore change in the last week of beta.
Attachment #570361 - Flags: feedback?(bzbarsky) → review-
Of course, this only works if you can't re-parent an shentry.  (Or, equivalently, if all shentries which legitimately have the same doc identifier have the same depth in the shentry tree.)

smaug tells me we currently can't re-parent without reloading the document, although we'd like to change this at some point.
Ok, so sounds like we are going to live with this for Fx8? If not, please email release-drivers@mozilla.org ASAP as we still have time for a rebuild (and will likely do so today).
The bug was present in 7 and didn't seem to cause a calamity, so I think biting the bullet for 8 is the safest bet at the moment.
Attached patch Patch v1 (obsolete) — Splinter Review
Attachment #570361 - Attachment is obsolete: true
Attached patch Patch v1.1 (obsolete) — Splinter Review
Attachment #570906 - Attachment is obsolete: true
Attachment #570908 - Flags: feedback?(bzbarsky)
We may have found the source of the sessionstore.js corruption: bug 698656.
Comment on attachment 570908 [details] [diff] [review]
Patch v1.1

s/resepect/respect/.

The last word of the big comment should be "document", not "parent".

Same for the Mr. Hyde version of browser.js

Otherwise looks good.
Attachment #570908 - Flags: feedback?(bzbarsky) → feedback+
Attached patch Patch v1.2Splinter Review
Attachment #570908 - Attachment is obsolete: true
Attachment #570911 - Flags: review?(paul)
Attachment #570911 - Flags: review?(mark.finkle)
Attachment #570911 - Flags: review?(paul) → review+
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

Review of attachment 570911 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/nsSessionStore.js
@@ +3218,5 @@
> +
> +        // We're mysteriously getting sessionrestore.js files with a cycle in
> +        // the doc-identifier graph.  (That is, we have an entry where doc
> +        // identifier A is an ancestor of doc identifier B, and another entry
> +        // where doc identifier B is an ancestor of A.)

This isn't so mysterious anymore is it now that you found & fixed bug 698656, right? If that's the case, please mention that bug so we have as few "things break mysteriously" comments as possible.
Attachment #570911 - Flags: review?(mark.finkle) → review+
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

Requesting aurora approval.  Relatively simple patch which fixes a full-browser hang on startup.
Attachment #570911 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/971d806aa96e

Please can you set assignee + milestone when landing on inbound - thanks :-)
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound
Assignee: nobody → justin.lebar+bug
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 10
Blocks: 699164
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

[triage comment]

Approved for aurora. Please land today if at all possible.
Attachment #570911 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Also verified this issue manually on Windows XP using the attached minimal sessionstore.js test case:
Mozilla/5.0 (Windows NT 5.1; rv:9.0) Gecko/20100101 Firefox/9.0 (20111116091359)
Mozilla/5.0 (Windows NT 5.1; rv:10.0a2) Gecko/20111121 Firefox/10.0a2
Mozilla/5.0 (Windows NT 5.1; rv:11.0a1) Gecko/20111121 Firefox/11.0a1
Depends on: 957150
You need to log in before you can comment on or make changes to this bug.