Last Comment Bug 687710 - Restoring session causes hang from nsSHEntry parent loop
: Restoring session causes hang from nsSHEntry parent loop
Status: VERIFIED FIXED
[qa!][inbound]
: hang, verified-aurora, verified-beta
Product: Firefox
Classification: Client Software
Component: Session Restore (show other bugs)
: Trunk
: All All
: -- critical with 1 vote (vote)
: Firefox 10
Assigned To: Justin Lebar (not reading bugmail)
:
Mentors:
: 696973 699737 (view as bug list)
Depends on: 957150
Blocks: 646641 699164
  Show dependency treegraph
 
Reported: 2011-09-19 16:06 PDT by Steve Fink [:sfink] [:s:]
Modified: 2014-01-15 13:10 PST (History)
21 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
affected
+
fixed
+
fixed


Attachments
minimal sessionstore.js test case (6.27 KB, text/plain)
2011-09-22 16:36 PDT, Steve Fink [:sfink] [:s:]
no flags Details
Sample of indefinite hang restoring webpage (65.70 KB, text/plain)
2011-10-04 14:40 PDT, Matthew N. [:MattN]
no flags Details
very minimal test case (866 bytes, text/plain)
2011-10-07 11:51 PDT, Paul O'Shannessy [:zpao] (not reading much bugmail, email directly)
no flags Details
List of Tp4 URLs (1.81 KB, text/plain)
2011-10-16 17:54 PDT, Mats Palmgren (:mats)
no flags Details
Python script for detecting loops in sessionrestore.py (1.56 KB, text/plain)
2011-10-27 12:14 PDT, Justin Lebar (not reading bugmail)
no flags Details
Hack v1 (2.96 KB, patch)
2011-10-28 14:27 PDT, Justin Lebar (not reading bugmail)
paul: review+
mark.finkle: review+
justin.lebar+bug: review-
Details | Diff | Splinter Review
Patch v1 (11.46 KB, patch)
2011-10-31 19:00 PDT, Justin Lebar (not reading bugmail)
no flags Details | Diff | Splinter Review
Patch v1.1 (10.88 KB, patch)
2011-10-31 19:03 PDT, Justin Lebar (not reading bugmail)
bzbarsky: feedback+
Details | Diff | Splinter Review
Patch v1.2 (10.88 KB, patch)
2011-10-31 19:26 PDT, Justin Lebar (not reading bugmail)
mark.finkle: review+
paul: review+
christian: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Steve Fink [:sfink] [:s:] 2011-09-19 16:06:16 PDT
For the last 2 or 3 nightlies, I've been getting some very, very slow responses, and now sometimes complete hangs.

For this latest nightly, I downloaded symbols and attached with gdb. It shows a pair of nsSHEntry objects that are parented to each other:

(gdb) p result->mURI.mRawPtr
$25 = (nsStandardURL *) 0x7f8bcf24d860 "http://www.google.com/" (22 more characters)
...walk through the loop, allowing it to do GetParent...
(gdb) p result->mURI.mRawPtr
$26 = (nsStandardURL *) 0x7f8bcfe73ba0 "http://arstechnica.com/tech-policy/news/2011/08/week-in-tech-how-fast-is-your-internet-edition.ars" (98 more characters)

(This is with jdm's archer-mozilla extensions partly applied, which is why I can see the strings.)

If I continue walking the loop, it toggles between those two.

It's now hanging fast enough that I'm having difficulty getting the about:buildconfig from it, but application.ini says:

Vendor=Mozilla
Name=Firefox
Version=9.0a1
BuildID=20110918030911
SourceRepository=http://hg.mozilla.org/mozilla-central
SourceStamp=06445f55f009
ID={ec8030f7-c20a-464f-9b0e-13a3a9e97384}

(gdb) bt
#0  GetRootSHEntry (aEntry=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10471
#1  0x00007f8c0a9885af in nsDocShell::SetHistoryEntry (this=0x7f8bf5c1d000, aPtr=0x7f8bf5c1d288, aEntry=0x7f8be1828780) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10488
#2  0x00007f8c0a98f8d1 in nsDocShell::InternalLoad (this=<value optimized out>, aURI=0x7f8bd6393080, aReferrer=0x7f8bd6393160, aOwner=<value optimized out>, aFlags=<value optimized out>, aWindowTarget=<value optimized out>, aTypeHint=0x7f8c0bd13d20 "", aPostData=0x0, aHeadersData=0x0, aLoadType=<value optimized out>, aSHEntry=0x7f8be1828780, aFirstParty=1, aDocShell=0x0, aRequest=0x0) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:8678
#3  0x00007f8c0a99cc00 in nsDocShell::LoadHistoryEntry (this=0x7f8bf5c1d000, aEntry=<value optimized out>, aLoadType=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10217
#4  0x00007f8c0a99158b in nsDocShell::LoadURI (this=0x7f8bf5c1d000, aURI=0x7f8bd6393080, aLoadInfo=<value optimized out>, aLoadFlags=<value optimized out>, aFirstParty=0) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:1364
#5  0x00007f8c0a9c047c in nsSHistory::InitiateLoad (this=<value optimized out>, aFrameEntry=<value optimized out>, aFrameDS=0x7f8bf5c1d100, aLoadType=<value optimized out>) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:1772
#6  0x00007f8c0a9c0d4d in nsSHistory::LoadEntry (this=0x7f8bf5c52160, aIndex=5, aLoadType=2, aHistCmd=3) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:1639
#7  0x00007f8c0a9bf958 in nsSHistory::ReloadCurrentEntry (this=0x7f8bf5c52160) at /builds/slave/m-cen-lnx64-ntly/build/docshell/shistory/src/nsSHistory.cpp:925
#8  0x00007f8c0aca5aca in NS_InvokeByIndex_P (that=<value optimized out>, methodIndex=<value optimized out>, paramCount=<value optimized out>, params=<value optimized out>) at ../../../../../../../xpcom/reflect/xptcall/src/md/unix/xptcinvoke_x86_64_unix.cpp:195
#9  0x00007f8c0a902f23 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) () from /home/sfink/Downloads/firefox/libxul.so
#10 0x00007f8c0a90a3a0 in XPC_WN_CallMethod(JSContext*, unsigned int, unsigned long*) () from /home/sfink/Downloads/firefox/libxul.so
#11 0x00007f8c0b039d92 in CallCompiler::generateNativeStub() () from /home/sfink/Downloads/firefox/libxul.so
#12 0x00007f8c0b037d93 in js::mjit::ic::NativeCall(js::VMFrame&, js::mjit::ic::CallICInfo*) () from /home/sfink/Downloads/firefox/libxul.so
#13 0x00007f8bf63f9748 in ?? ()
#14 0x0000000500000000 in ?? ()
#15 0x0000000000000142 in ?? ()
#16 0x00007fffae0abf60 in ?? ()
#17 0x0000000000000000 in ?? ()
Comment 1 Steve Fink [:sfink] [:s:] 2011-09-19 16:07:48 PDT
Sent it signal 11: 

http://crash-stats.mozilla.com/report/index/bp-7443fa96-631e-4189-a343-1fad12110919
Comment 2 Marco Bonardo [::mak] 2011-09-19 16:12:35 PDT
From the stack looks like session history, moving to Document Navigation
Comment 3 Steve Fink [:sfink] [:s:] 2011-09-19 16:14:53 PDT
Later crashes are showing the same URL pair loop.

When I start, it's alive for a little while, then hangs. I think it may be when the delayed loading makes it to the relevant URLs.
Comment 4 Steve Fink [:sfink] [:s:] 2011-09-19 16:19:42 PDT
If I go up a level in the stack, I think it tells me what URL it started with:

(gdb) up
#1  0x00007f64a30885af in nsDocShell::SetHistoryEntry (this=0x7f647b56e400, aPtr=0x7f647b56e688, aEntry=0x7f6467784780) at /builds/slave/m-cen-lnx64-ntly/build/docshell/base/nsDocShell.cpp:10488
(gdb) p aEntry->mURI.mRawPtr
$41 = (nsStandardURL *) 0x7f64677887c0 "https://blog.mozilla.com/sfink/wp-login.php?redirect_to=https%3A%2F%2Fblog.mozilla.com%2Fsfink%2Fwp-admin%2Fedit-comments.php&reauth=1" (134 more characters)

(that's going to do a redirect, though it probably isn't relevant).
Comment 5 Boris Zbarsky [:bz] 2011-09-19 19:06:50 PDT
Um.... having two shentries parented to each other is bad juju, obviously.  Any ideas on how to reproduce?
Comment 6 Steve Fink [:sfink] [:s:] 2011-09-19 23:33:16 PDT
1. Sit in my chair. Open up my laptop.
2. Run firefox or firefox -safe-mode (it happens either way.)
3. Restore the previous session.
4. Wait for 3 minutes. (Almost exactly 3 minutes, it seems. Perhaps that's a hint.)

I am on 64-bit Linux (Fedora 14).

It may be tied to my saved session. I'd rather not attach my sessionstore.js to a public bug, but I can email it to anyone interested. It's 32MB uncompressed, and has about 252 tabs in 3 windows. I made it to 116 tabs open before it froze (measured by hitting reload on about:tabs every few seconds, which is one of Dietrich's extensions.)

About Tabs

    252 tabs across 3 groups in 3 windows
    116 tabs open
    231 unique addresses
    91 unique hosts
    1 empty tab
    127 https:
    112 http:
    6 file:
    1 ftp:
    5 about:
Comment 7 Steve Fink [:sfink] [:s:] 2011-09-19 23:39:45 PDT
I'll try whittling down my sessionstore.js tomorrow to see if it's a particular URL or set of URLs. Any easy way to have it log the URL of each tab it restores?

I guess I should bisect it too.
Comment 8 Steve Fink [:sfink] [:s:] 2011-09-22 16:36:16 PDT
Created attachment 561922 [details]
minimal sessionstore.js test case

This was "fun" to whittle down. The attached sessionstore.js causes the hang immediately on my machine. It's a session with exactly two tabs. There are children entries in there (for back/forward, I guess?), but every single one of them is a b.m.o. URL. I deleted all cookies, extData, etc. (I didn't try removing all of the children entries.)

Next I'll have to blog about my interactive command-line JSON navigator/whacker that I wrote to do this...
Comment 9 Steve Fink [:sfink] [:s:] 2011-09-26 14:14:29 PDT
Some more detail:

There are two tabs, one for show_bug.cgi?id=578560 and show_bug.cgi?id=673451. Both have a single 'entries' entry, and each of those has 12 'children' entries, all of which are show_activity.cgi?id=(same). (So there are two sets of 12 children, with bug ids matching the parent tabs.)

What happens is that at one point there is an nsSHEntry tree with show_bug.cgi?id=578560 parented to show_bug.cgi?id=673451. Then we get an "show_bug.cgi?id=578560"->addChild(show_activity.cgi?id=578560). show_bug.cgi?id=673451 and show_activity.cgi?id=578560 have the same nsSHEntry.mShared field value, which seems very odd, and so after reparenting the two mShared fields form a loop.

If you don't mind me using the output of my json explorer:

> cat windows/0/tabs/*/entries/*/url
windows/0/tabs/0/entries/0/url = https://bugzilla.mozilla.org/show_bug.cgi?id=673451
windows/0/tabs/1/entries/0/url = https://bugzilla.mozilla.org/show_bug.cgi?id=578560
> cat windows/0/tabs/*/entries/*/children/*/url
windows/0/tabs/0/entries/0/children/0/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/1/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/2/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/3/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/4/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/5/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/6/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/7/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/8/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/9/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/10/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/0/entries/0/children/11/url = https://bugzilla.mozilla.org/show_activity.cgi?id=673451
windows/0/tabs/1/entries/0/children/0/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/1/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/2/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/3/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/4/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/5/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/6/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/7/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/8/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/9/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/10/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560
windows/0/tabs/1/entries/0/children/11/url = https://bugzilla.mozilla.org/show_activity.cgi?id=578560

Another odd thing I noticed is that two of the children's docShellIDs are the same, and match the docShellID of their tab (marked with "<---"):

> cd /windows/0/tabs/
0/
1/
/windows/0/tabs> cat */entries/0/docshellID 
0/entries/0/docshellID = 429
1/entries/0/docshellID = 439                  <---
/windows/0/tabs> cat */entries/0/children/*/docshellID 
0/entries/0/children/0/docshellID = 430
0/entries/0/children/1/docshellID = 243
0/entries/0/children/2/docshellID = 249
0/entries/0/children/3/docshellID = 239
0/entries/0/children/4/docshellID = 278
0/entries/0/children/5/docshellID = 354
0/entries/0/children/6/docshellID = 314
0/entries/0/children/7/docshellID = 226
0/entries/0/children/8/docshellID = 208
0/entries/0/children/9/docshellID = 175
0/entries/0/children/10/docshellID = 253
0/entries/0/children/11/docshellID = 247
1/entries/0/children/0/docshellID = 314
1/entries/0/children/1/docshellID = 226
1/entries/0/children/2/docshellID = 208
1/entries/0/children/3/docshellID = 175
1/entries/0/children/4/docshellID = 253
1/entries/0/children/5/docshellID = 247
1/entries/0/children/6/docshellID = 439       <---
1/entries/0/children/7/docshellID = 439       <---
1/entries/0/children/8/docshellID = 226
1/entries/0/children/9/docshellID = 178
1/entries/0/children/10/docshellID = 257
1/entries/0/children/11/docshellID = 251
Comment 10 Christophe PARIS 2011-10-01 08:17:02 PDT
I think I'm reproducing the same bug here.
I've set firefox to only load the focused tab at startup.
When I click on a tab which is not already loaded, at some point firefox freeze with 100% CPU load.

gdb back trace give me this :

#0  GetRootSHEntry (aEntry=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10427
#1  0x00007fc5ab3ffa33 in nsDocShell::SetHistoryEntry (this=0x7fc58079d800, aPtr=0x7fc58079da88, 
    aEntry=0x7fc580d61280)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10444
#2  0x00007fc5ab40aecb in nsDocShell::InternalLoad (this=0x7fc58079d800, aURI=0x7fc5816a6320, 
    aReferrer=0x7fc5816a6400, aOwner=0x7fc56dfe5d30, aFlags=0, aWindowTarget=0x7fc500000000, 
    aTypeHint=0x7fc5ac72f930 "", aPostData=0x0, aHeadersData=0x0, aLoadType=4, aSHEntry=0x7fc580d61280, 
    aFirstParty=1, aDocShell=0x0, aRequest=0x0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:8681
#3  0x00007fc5ab40689c in nsDocShell::LoadHistoryEntry (this=0x7fc58079d800, aEntry=0x7fc580d61280, aLoadType=4)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:10173
#4  0x00007fc5ab3f80a7 in nsDocShell::LoadURI (this=0x7fc58079d800, aURI=0x7fc5816a6320, 
    aLoadInfo=<value optimized out>, aLoadFlags=0, aFirstParty=0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/base/nsDocShell.cpp:1364
#5  0x00007fc5ab42a06a in nsSHistory::InitiateLoad (this=<value optimized out>, aFrameEntry=<value optimized out>, 
    aFrameDS=0x7fc58079d900, aLoadType=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:1772
#6  0x00007fc5ab42a8a6 in nsSHistory::LoadEntry (this=0x7fc5807efca0, aIndex=1, aLoadType=2, aHistCmd=3)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:1639
#7  0x00007fc5ab4296f8 in nsSHistory::ReloadCurrentEntry (this=0x7fc5807efca0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/docshell/shistory/src/nsSHistory.cpp:925
#8  0x00007fc5ab711a7d in NS_InvokeByIndex_P (that=<value optimized out>, methodIndex=<value optimized out>, 
    paramCount=0, params=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_x86_64_unix.cpp:195
#9  0x00007fc5ab379170 in Invoke (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:3147
#10 Call (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2370
#11 XPCWrappedNative::CallMethod (ccx=<value optimized out>, mode=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2334
#12 0x00007fc5ab37c7da in XPC_WN_CallMethod (cx=0x7fc595a5c400, argc=0, vp=0x7fc5972637e8)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1629
#13 0x00007fc5aba8cdd5 in CallJSNative (this=0x7fffacea94b0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/jscntxtinlines.h:296
#14 CallCompiler::generateNativeStub (this=0x7fffacea94b0)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/methodjit/MonoIC.cpp:937
#15 0x00007fc5aba89cb3 in js::mjit::ic::NativeCall (f=<value optimized out>, ic=<value optimized out>)
    at /build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/js/src/methodjit/MonoIC.cpp:1162
#16 0x00007fc557792a7b in ?? ()
#17 0x00007fc58a83b190 in ?? ()
#18 0x0000000000000000 in ?? ()


To unfreeze it I usually do one or more time :
> gdb -p $(pidof firefox)
(gdb) next 10
(gdb) next (to be on the line http://mxr.mozilla.org/comm-central/source/mozilla/docshell/base/nsDocShell.cpp#10425)
(gdb) set var rootEntry = 0L
(gdb) cont


My firefox was build with :
--host=x86_64-linux-gnu --prefix=/usr --libexecdir=/usr/lib/firefox-9.0a2 '--with-l10n-base=/build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla/l10n' '--srcdir=/build/buildd/firefox-9.0~a2~hg20110928r78199/build-tree/mozilla' --disable-install-strip --disable-updater --enable-application=browser --enable-startup-notification --with-distribution-id=com.ubuntu --enable-optimize --enable-tests --enable-ipdl-tests --disable-crashreporter --with-branding=browser/branding/aurora --disable-gnomevfs --enable-gio --enable-update-channel=aurora --disable-debug --disable-elf-hack --enable-extensions=default,globalmenu



Also this other bug seems to be a duplicate :
https://bugzilla.mozilla.org/show_bug.cgi?id=679444
Comment 11 Olli Pettay [:smaug] 2011-10-01 14:20:29 PDT
So, are there exact steps to reproduce this?
Comment 12 Steve Fink [:sfink] [:s:] 2011-10-01 21:25:55 PDT
(In reply to Olli Pettay [:smaug] from comment #11)
> So, are there exact steps to reproduce this?

I haven't tried on a different profile or OS, but this will reproduce it for me:

1. With firefox not running, save the attachment from this bug ( https://bugzilla.mozilla.org/attachment.cgi?id=561922 ) to $PROFILE/sessionstore.js.

2. Start up firefox on that profile.

3. Choose to restore the previous session.

That will immediately hang for me.

Note that deleting /windows/0/tabs/*/entries/*/children from my sessionstore.js does not fix the problem for me, so that might be an even "smaller" test case.
Comment 13 Matthew N. [:MattN] 2011-10-04 14:40:50 PDT
Created attachment 564676 [details]
Sample of indefinite hang restoring webpage

Build: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0a1) Gecko/20111004 Firefox/10.0a1. Built from http://hg.mozilla.org/mozilla-central/rev/149fc4a6efca

I've hit this bug today on OS X 10.6.8 with today's nightly.  Nightly went into an infinite loop using 100% CPU with a similar stacktrace and STR.  I had restarted Nightly a few minutes before and it loaded some tabs fine. Then I clicked to load a bugzilla tab with session restore since I have browser.sessionstore.restore_on_demand set to true and experience the indefinite hang.

Sample from activity monitor is attached.
Comment 14 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-10-04 15:14:03 PDT
Thanks Matt. I'm interested, can you reproduce this on a new profile with the same session data?
Comment 15 Matthew N. [:MattN] 2011-10-04 15:28:07 PDT
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #14)
> Thanks Matt. I'm interested, can you reproduce this on a new profile with
> the same session data?

I didn't try since there was already a reduced testcase attached and I can reproduce the indefinite hang with it on a clean profile.
Comment 16 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2011-10-07 11:51:00 PDT
Created attachment 565599 [details]
very minimal test case

I got this down to the most minimal case which will reproduce the hang. I may have stripped out other info that could be relevant to figuring out how we got into this position (not hard to correlate back if need be)

docIdentifiers are what's guilty here. Somehow you're in a state where children's docIdentifiers are the same as an unrelated entry. docIdentifier is used for bfcache stuff. See https://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/src/nsSessionStore.js#1923 and https://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/src/nsSessionStore.js#3183 (the latter being what's actually going to result in the hang)

So we can craft this case, but I'm most interested in how we got into this state in real life. We pull this value off the entry directly.

I'm also confused by the number of children entries. I thought that had to do with frames on the page, but afaik bugzilla doesn't do that.
Comment 17 Mats Palmgren (:mats) 2011-10-16 17:53:28 PDT
I can reproduce this in a clean mozilla-central debug build, STR:
1. create clean profile, set "show windows/tabs from last time", exit
2. firefox $(cat tp4-urls.txt)  # where tp4-urls.txt contains the 100 Tp4 test urls
3. exit
4. firefox  # => hang trying to restore those 100 tabs.
5. if it doesn't occur directly, try ctrl+pg_down/up to cycle through the tabs
Comment 18 Mats Palmgren (:mats) 2011-10-16 17:54:20 PDT
Created attachment 567373 [details]
List of Tp4 URLs
Comment 19 Matthew N. [:MattN] 2011-10-16 18:04:05 PDT
Changing to track firefox 9 since that is what Comment 0 was reported on.
Comment 20 Olli Pettay [:smaug] 2011-10-17 02:16:02 PDT
Actually, if the bug is in session restore, would be better if someone who knows that code fixes this.
Assigning back to default
Comment 21 Alice0775 White 2011-10-17 05:06:29 PDT
Regression window(cached m-c hourly),
Works:
http://hg.mozilla.org/mozilla-central/rev/c2a23df05c4c
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110808 Firefox/8.0a1 ID:20110808000159
Hangs
http://hg.mozilla.org/mozilla-central/rev/9c7ab0a15292
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110808 Firefox/8.0a1 ID:20110808052250
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c2a23df05c4c&tochange=9c7ab0a15292


Regression window(cached m-i hourly),
Works:
http://hg.mozilla.org/integration/mozilla-inbound/rev/
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110806 Firefox/8.0a1 ID:20110806110835
Hangs
http://hg.mozilla.org/integration/mozilla-inbound/rev/53af4a6b8965
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110806 Firefox/8.0a1 ID:20110806154229
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f182f03aaee9&tochange=53af4a6b8965

Suspected bug:
Bug 646641 - If a document has multiple history entries (either by pushstate or anchor navs) only the last one is hooked into bfcache
Comment 22 Olli Pettay [:smaug] 2011-10-17 05:24:34 PDT
Thanks Alice.
Comment 23 Justin Lebar (not reading bugmail) 2011-10-17 07:14:53 PDT
> docIdentifiers are what's guilty here. Somehow you're in a state where children's docIdentifiers are 
> the same as an unrelated entry.

That's pretty bad!  I wonder how that's happening.
Comment 24 Alice0775 White 2011-10-17 08:31:08 PDT
(In reply to Alice0775 White from comment #21)

I think I have misunderstood.
The regression range only shows that the browser hangs  with corrupted sessionstore.js of Comment 16.

So this does not show when sessionstore.js collapse began at.
Comment 25 Justin Lebar (not reading bugmail) 2011-10-17 08:37:27 PDT
(In reply to Alice0775 White from comment #24)
> (In reply to Alice0775 White from comment #21)
> 
> I think I have misunderstood.
> The regression range only shows that the browser hangs  with corrupted
> sessionstore.js of Comment 16.
> 
> So this does not show when sessionstore.js collapse began at.

Okay.  It's not unlikely that bug 646641 caused both the hang and the corruption, but if we didn't previously hang with that corrupted sessionstore.js, we should fix that even if we can't figure out why it's getting corrupted.
Comment 26 christian 2011-10-25 22:27:53 PDT
---------------------------------[ Triage Comment ]---------------------------------

We should track this for Firefox 8/9/10 as it appears to be a bad regression from bug 646641 and we need to figure what to do.

Is there any real downside to backing out bug 646641 for Firefox 8? Does it add more risk to do so in your opinion? How often will this be hit?
Comment 27 Justin Lebar (not reading bugmail) 2011-10-25 22:43:29 PDT
> Is there any real downside to backing out bug 646641 for Firefox 8?

A risk is that the sessionstore.js files generated with the patch from bug 646641 may not be compatible with session restore pre-bug 646641.  We could nuke peoples' tabs on restart.  I'll look at this in the morning.

> How often will this be hit?

We don't know what's causing the sessionstore.js corruption (it may be bug 646641, or it may be something else), so it's hard to say how often it'd be hit.
Comment 28 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-10-26 09:48:26 PDT
It sounds like the priority task on this bug is tracking down the cause(s) of corruption. Is this something QA can help out with? Can you give us some direction?
Comment 29 Justin Lebar (not reading bugmail) 2011-10-26 09:54:37 PDT
Comment 17 could be a place to start.  You could narrow down the causes of the corruption to a smaller set of the pages in TP4, and you could then bisect to find where the corrupted sessionstore.js files are coming from.

You'd need to write a script to recognize the corruption, though.  See Paul's very minimal testcase.
Comment 30 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2011-10-26 13:53:35 PDT
(In reply to Justin Lebar [:jlebar] from comment #27)
> > Is there any real downside to backing out bug 646641 for Firefox 8?
> 
> A risk is that the sessionstore.js files generated with the patch from bug
> 646641 may not be compatible with session restore pre-bug 646641.  We could
> nuke peoples' tabs on restart.  I'll look at this in the morning.

I think they should be fine, but then again I don't know the format of entry.BFCacheEntry.ID vs entry.docIdentifier. I guess it's possible that we'll still have people you have sessionstore.js files with these circular references though. So if we back out bug 646641, we'll go back to what sessionstore was doing & setting entry.docIdentifier. Would that still have the same hang potential?
Comment 31 Justin Lebar (not reading bugmail) 2011-10-26 14:33:47 PDT
> Would that still have the same hang potential?

Probably not.  I think the hang is happening because with bug 646641, we take "these two shentries have the same doc identifier" much more seriously.  They share state (the BFCacheEntry).  Still not sure exactly what's going on, though.

In whatever branches keep bug 646641, I think we'll need to scrub the sessionrestore data for these loops...
Comment 32 Justin Lebar (not reading bugmail) 2011-10-27 07:16:13 PDT
Christian, can you advise on scheduling considerations for a backout?  I'd like to understand this problem better before brute-forcing it (and regressing what's fixed by bug 646641, which landed in FF7).  When's the next beta being spun?
Comment 33 Justin Lebar (not reading bugmail) 2011-10-27 12:14:03 PDT
Created attachment 570048 [details]
Python script for detecting loops in sessionrestore.py

This finds a cycle in both the testcases in this bug, but finds no cycles in my personal sessionrestore.js.

(It's not very smart; it happily finds the same cycle more than once.)
Comment 34 Justin Lebar (not reading bugmail) 2011-10-27 12:21:50 PDT
*** Bug 696973 has been marked as a duplicate of this bug. ***
Comment 35 Justin Lebar (not reading bugmail) 2011-10-27 12:25:00 PDT
See a really strange cycle in bug 696973 comment 3.
Comment 36 Justin Lebar (not reading bugmail) 2011-10-28 13:48:28 PDT
(In reply to Mats Palmgren [:mats] from comment #17)
> I can reproduce this in a clean mozilla-central debug build, STR:
> 1. create clean profile, set "show windows/tabs from last time", exit
> 2. firefox $(cat tp4-urls.txt)  # where tp4-urls.txt contains the 100 Tp4
> test urls
> 3. exit
> 4. firefox  # => hang trying to restore those 100 tabs.
> 5. if it doesn't occur directly, try ctrl+pg_down/up to cycle through the
> tabs

This doesn't work for me, unfortunately.
Comment 37 Justin Lebar (not reading bugmail) 2011-10-28 14:27:10 PDT
Created attachment 570361 [details] [diff] [review]
Hack v1
Comment 38 Justin Lebar (not reading bugmail) 2011-10-28 14:29:34 PDT
This fixes the hang on my machine, but I'm not sure it's right.

This patch says that the scope of a doc identifier is limited to a node's children.  You can share a document with your sibling, but not with your cousin.  bz or smaug, do you know if this is right?

I still have no idea how we're getting into this state in the first place...
Comment 39 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2011-10-28 15:50:51 PDT
Comment on attachment 570361 [details] [diff] [review]
Hack v1

Review of attachment 570361 [details] [diff] [review]:
-----------------------------------------------------------------

Can you add a test using a staste along the lines of the minimal test case to make sure the state following a restore is no longer cyclical.

This will wipe out some legitimate same-id cases, but I don't know that it's a big deal. For example:
     P
   /   \
  C     D
 / \   / \
G   H G   I

GHGI would then remap to GHJI
Comment 40 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2011-10-28 16:10:35 PDT
Actually based on comment 38, maybe my example isn't legitimate?
Comment 41 Boris Zbarsky [:bz] 2011-10-28 20:27:19 PDT
> bz or smaug, do you know if this is right?

I don't think it is.

Consider document A framing document B framing document C.  Then you navigate to a named anchor in C, then you navigate to a named anchor in B.

Now you have three SHEntries for C, two of which are cousins.

What you _can_ guarantee should be true is that if two entries have the same doc identifier N than their parents both have doc identifier M.

A possible implementation for that would be to store the childDocIdentMap in aDocIdentMap as part of the stuff keyed to our identifier, so that it can be retrieved when processing the cousin...  Effectively, aDocIdentMap becomes a map from a doc identifier to whatever data we have now plus the doc identifier map for kids of that _document_.  Make sense?
Comment 42 Justin Lebar (not reading bugmail) 2011-10-28 20:48:02 PDT
> Make sense?

Think so.  Thanks.
Comment 43 Justin Lebar (not reading bugmail) 2011-10-30 17:15:43 PDT
I'm hoping I can write a fix for this in the next day or two.  But it seems unlikely that we're going to fix this one way or another for beta.  Backing out bug 646641 scares me, but so does taking a sessionrestore change in the last week of beta.
Comment 44 Justin Lebar (not reading bugmail) 2011-10-31 12:41:12 PDT
Of course, this only works if you can't re-parent an shentry.  (Or, equivalently, if all shentries which legitimately have the same doc identifier have the same depth in the shentry tree.)

smaug tells me we currently can't re-parent without reloading the document, although we'd like to change this at some point.
Comment 45 christian 2011-10-31 15:59:59 PDT
Ok, so sounds like we are going to live with this for Fx8? If not, please email release-drivers@mozilla.org ASAP as we still have time for a rebuild (and will likely do so today).
Comment 46 Justin Lebar (not reading bugmail) 2011-10-31 16:18:53 PDT
The bug was present in 7 and didn't seem to cause a calamity, so I think biting the bullet for 8 is the safest bet at the moment.
Comment 47 Justin Lebar (not reading bugmail) 2011-10-31 19:00:36 PDT
Created attachment 570906 [details] [diff] [review]
Patch v1
Comment 48 Justin Lebar (not reading bugmail) 2011-10-31 19:03:38 PDT
Created attachment 570908 [details] [diff] [review]
Patch v1.1
Comment 49 Justin Lebar (not reading bugmail) 2011-10-31 19:06:42 PDT
We may have found the source of the sessionstore.js corruption: bug 698656.
Comment 50 Boris Zbarsky [:bz] 2011-10-31 19:12:10 PDT
Comment on attachment 570908 [details] [diff] [review]
Patch v1.1

s/resepect/respect/.

The last word of the big comment should be "document", not "parent".

Same for the Mr. Hyde version of browser.js

Otherwise looks good.
Comment 51 Justin Lebar (not reading bugmail) 2011-10-31 19:26:46 PDT
Created attachment 570911 [details] [diff] [review]
Patch v1.2
Comment 52 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2011-11-01 13:02:14 PDT
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

Review of attachment 570911 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/nsSessionStore.js
@@ +3218,5 @@
> +
> +        // We're mysteriously getting sessionrestore.js files with a cycle in
> +        // the doc-identifier graph.  (That is, we have an entry where doc
> +        // identifier A is an ancestor of doc identifier B, and another entry
> +        // where doc identifier B is an ancestor of A.)

This isn't so mysterious anymore is it now that you found & fixed bug 698656, right? If that's the case, please mention that bug so we have as few "things break mysteriously" comments as possible.
Comment 53 Justin Lebar (not reading bugmail) 2011-11-01 15:12:47 PDT
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

Requesting aurora approval.  Relatively simple patch which fixes a full-browser hang on startup.
Comment 54 Justin Lebar (not reading bugmail) 2011-11-01 15:22:33 PDT
Inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/971d806aa96e
Comment 55 Ed Morley [:emorley] 2011-11-02 06:33:27 PDT
https://hg.mozilla.org/mozilla-central/rev/971d806aa96e

Please can you set assignee + milestone when landing on inbound - thanks :-)
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound
Comment 56 Justin Lebar (not reading bugmail) 2011-11-04 08:32:28 PDT
*** Bug 699737 has been marked as a duplicate of this bug. ***
Comment 57 christian 2011-11-07 13:34:15 PST
Comment on attachment 570911 [details] [diff] [review]
Patch v1.2

[triage comment]

Approved for aurora. Please land today if at all possible.
Comment 58 Justin Lebar (not reading bugmail) 2011-11-07 13:48:12 PST
Landed in Aurora for FF9: https://hg.mozilla.org/releases/mozilla-aurora/rev/15b028ac0dcf
Comment 60 Ioana (away) 2011-11-23 02:17:34 PST
Also verified this issue manually on Windows XP using the attached minimal sessionstore.js test case:
Mozilla/5.0 (Windows NT 5.1; rv:9.0) Gecko/20100101 Firefox/9.0 (20111116091359)
Mozilla/5.0 (Windows NT 5.1; rv:10.0a2) Gecko/20111121 Firefox/10.0a2
Mozilla/5.0 (Windows NT 5.1; rv:11.0a1) Gecko/20111121 Firefox/11.0a1

Note You need to log in before you can comment on or make changes to this bug.