Closed Bug 429110 Opened 16 years ago Closed 16 years ago

Random crashing ( [@ PR_AtomicIncrement] ?) ( [@ nsIOService::NewURI] ?)

Categories

(Core :: Security: PSM, defect)

defect
Not set
critical

RESOLVED FIXED

People

(Reporter: stevee, Assigned: KaiE)

Details

(Keywords: crash, regression, topcrash)

Attachments

(8 files, 2 obsolete files)

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9pre) Gecko/2008041406 Minefield/3.0pre ID:2008041406

This isn't going to be a very useful bug report, but recently I've been occasionally crashing for no apparent reason. I will be reading pages I normally visit and then POW! Firefox crashes. This is using my daily Firefox profile.

http://crash-stats.mozilla.com/report/index/9eb41a87-0a3c-11dd-8ecb-001cc4e2bf68
http://crash-stats.mozilla.com/report/index/185f4e5c-0adb-11dd-98a2-001b78bc73ea

Signature	PR_AtomicIncrement
UUID	185f4e5c-0adb-11dd-98a2-001b78bc73ea
Time	2008-04-15 03:58:51-07:00
Uptime	26
Product	Firefox
Version	3.0pre
Build ID	2008041406
OS	Windows NT
OS Version	5.1.2600 Service Pack 2
CPU	x86
CPU Info	AuthenticAMD family 6 model 8 stepping 1
Crash Reason	EXCEPTION_ACCESS_VIOLATION
Crash Address	0x42004c
Comments	
Frame  	Module  	Signature  	Source
0 	nspr4.dll 	PR_AtomicIncrement 	mozilla/nsprpub/pr/src/misc/pratom.c:306
1 	xul.dll 	nsACString_internal::Assign 	mozilla/xpcom/string/src/nsTSubstring.cpp:396
2 	xul.dll 	nsIOService::NewURI 	mozilla/netwerk/base/src/nsIOService.cpp:485
3 	xul.dll 	nsIOService::NewChannel 	mozilla/netwerk/base/src/nsIOService.cpp:579
4 	xul.dll 	nsHTTPDownloadEvent::Run 	mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:119
5 	xul.dll 	nsThread::ProcessNextEvent 	mozilla/xpcom/threads/nsThread.cpp:510
6 	xul.dll 	nsBaseAppShell::Run 	mozilla/widget/src/xpwidgets/nsBaseAppShell.cpp:170
7 	nspr4.dll 	PR_GetEnv 	
8 	firefox.exe 	wmain 	mozilla/toolkit/xre/nsWindowsWMain.cpp:87
9 	firefox.exe 	firefox.exe@0x217f 	
10 	kernel32.dll 	BaseProcessStart 	

From http://crash-stats.mozilla.com/report/list?range_unit=weeks&version=Firefox%3A3.0pre&range_value=2&signature=PR_AtomicIncrement we can see that the crashes first appear with the 2008-04-12 build.

Checkins to module PhoenixTinderbox between 2008-04-11 06:00 and 2008-04-12 07:00 : 
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=PhoenixTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-04-11+06&maxdate=2008-04-12+07&cvsroot=%2Fcvsroot
there's only one thread. please find some other nss/psm bug where i complained that there's only one thread. :)
timeless, this report http://crash-stats.mozilla.com/report/index/49a958fd-0aea-11dd-ad14-001cc45a2c28 has other threads listed. Is that a help at all?
Flags: blocking-firefox3?
I had another random crash today at startup. This may be related, or a different thing, so I will post the stack here anyway.

http://crash-stats.mozilla.com/report/index/9df3a71c-0b1b-11dd-998a-001cc4e2bf68

Signature	nsIOService::NewURI(nsACString_internal const&, char const*, nsIURI*, nsIURI**)
UUID	9df3a71c-0b1b-11dd-998a-001cc4e2bf68
Time	2008-04-15 11:41:24-07:00
Uptime	22
Product	Firefox
Version	3.0pre
Build ID	2008041506
OS	Windows NT
OS Version	5.1.2600 Service Pack 2
CPU	x86
CPU Info	AuthenticAMD family 6 model 8 stepping 1
Crash Reason	EXCEPTION_ACCESS_VIOLATION
Crash Address	0x43004e
Comments	

Frame  	Module  	Signature  	Source
0 	xul.dll 	nsIOService::NewURI 	mozilla/netwerk/base/src/nsIOService.cpp:485
1 	xul.dll 	nsIOService::NewChannel 	mozilla/netwerk/base/src/nsIOService.cpp:579
2 	xul.dll 	nsHTTPDownloadEvent::Run 	mozilla/security/manager/ssl/src/nsNSSCallbacks.cpp:119
3 	xul.dll 	nsThread::ProcessNextEvent 	mozilla/xpcom/threads/nsThread.cpp:510
4 	xul.dll 	nsBaseAppShell::Run 	mozilla/widget/src/xpwidgets/nsBaseAppShell.cpp:170
5 	nspr4.dll 	PR_GetEnv 	
6 	firefox.exe 	wmain 	mozilla/toolkit/xre/nsWindowsWMain.cpp:87
7 	firefox.exe 	firefox.exe@0x217f 	
8 	kernel32.dll 	BaseProcessStart 	
Summary: Random crashing ( [@ PR_AtomicIncrement] ?) → Random crashing ( [@ PR_AtomicIncrement] ?) ( [@ nsIOService::NewURI] ?)
Can't block on random crashes with no STR to cause it or specific area in which the crash occurs.
Flags: blocking-firefox3? → blocking-firefox3-
I'm getting random crashes as well.  Although, I think I've found one way to reproduce:  
1) Go to a site containing flash
2) While that site is reloading open another tab
3) Try to scroll in the newly opened tab
Result: Crash

This doesn't always happen, but it may be of some help.
Ouch! Could be my system as I'm pushing it for its age to run Vista HP with some rather old hardware, however - using the steps in comment #6 I had a total hard crash - system rebooted.  Have never seen this since starting to use Vista in over a year now.  

1. had a flash vid playing from youtube
2. went to betanews.com and while the page was loading in another tab started to scroll the page - 
3. System Crash!/Reboot

Nothing in the Event Manager/Application log or system logs points to the problem other than 'Unexpected Shutdown'. Yeah, no kidding.

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008041610 Minefield/3.0pre Firefox/3.0 ID:2008041610

I also am getting random crashes. Mostly just the browser crashes, but vista64 froze a few times completely too. I suspected flash to be the cause, which was a good guess I guess, reading the comments in this bug.
Given the comments, I'm fairly sure my issue is related to this.

I'm getting hangs, though.  It never crashes.  I have to kill the browser.... memory usage stays constant, and I've tried leaving it for several minutes without it ever coming back alive.

I've had it happen 16+ times in a day.  Recently I've been working with a lot of Flash applications in my development, which is probably the reason for that.

-[Unknown]
Hanging related to flash is bug 418811.
Keywords: topcrash
This is topcrash #4 for the last 2 days.  That makes it a serious regression, IMO.
Flags: blocking-firefox3- → blocking-firefox3?
Looks like either the pointer to the HTTP request session is garbage or the access to the string on multiple threads is not being happy.  Either way, looks like a regression from the recent PSM changes...
Requesting QA to help find steps to reproduce.
Keywords: qawanted
--> Core::General
Flags: blocking-firefox3?
Product: Firefox → Core
QA Contact: general → general
Carrying over Jesse's blocking nomination from comment 11
Flags: blocking1.9?
Yes, I can reproduce a crash using the URL given in comment 17.

My first thought was, this can't be related to PSM changes, because UI's URL is http. However, that page seems to load a webbug pixel gif from paypal over https...

Before we crash, there is a period of time when Firefox hangs in a busy loop. I've captured a stack while we are still running, see attachment 317040.

After we crash, I have the stack from attachment 317041. This is similar to what UI got; we have the same root. But we crash in a different location, which smells like memory corruption.

I've also attached the stacks of the other threads after the crash, attachment 317044.
I've backed out the patch from bug 420187 for testing purposes.
The page from comment 17 still takes ages to load, and Firefox is stuck in a busy loop for a while.

But then it eventually wakes up, displays the web page, connects to paypal, and succeeds in loading. No crash.

I'm not yet fully convinced that my patch is the culprit for the crash, but I agree, at least it triggered it.
Does it help to copy mRequestSession->mURL into the download event at event creation time, so you don't have multiple threads accessing the non-threadsafe string object?

That assert in your first post-crash stack trace is Really Bad.  At what point does it fire?  What do the thread stacks look like then?
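
The suggestion above (copy the URL into the download event when the event is created) can be sketched roughly as follows. This is only an illustration under stated assumptions: `RequestSession`, `DownloadEvent` and `RunAfterSessionDestroyed` are hypothetical stand-ins, not the real PSM classes, and `std::string` stands in for the Mozilla string types.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the request session; not the real PSM class.
struct RequestSession {
    std::string url;
};

// Hypothetical stand-in for the download event.
class DownloadEvent {
public:
    // Copy the URL at creation time, while the session is known to be alive.
    explicit DownloadEvent(const RequestSession& session) : mURL(session.url) {}

    // Runs later, possibly on another thread; it reads only its own copy
    // and never touches the session's string.
    std::string Run() const { return "GET " + mURL; }

private:
    std::string mURL;  // private copy owned by the event
};

// Shows that the event still works after the session has been destroyed.
inline std::string RunAfterSessionDestroyed() {
    auto session = std::make_unique<RequestSession>();
    session->url = "https://example.invalid/ocsp";  // made-up URL
    DownloadEvent event(*session);
    session.reset();  // session is gone before the event runs
    assert(!session);
    return event.Run();
}
```

A copy like this only covers the URL string; anything else the event reads from the session later would still need an ownership story.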
Trying to track down what it is on the page that caused the hang:

Removing all sidebar stuff didn't affect hang.

Removing the paypal webbug didn't affect the hang (so PSM is probably ruled out).

Removing first block of bogus links (the one hidden with the font tag, tpao.org) didn't affect the hang.

Removing the third block of bogus links (the one hidden with the u tag, secondlife.reuters.com) didn't affect the hang.

Removing the second block of bogus links (the one not hidden due to the corrupted a tag, fortt.com) resulted in no hang. (As I suspected, but wanted to clean up other stuff first.)

Replacing the corrupted

[a href="http://www.dreamhost.com%3E%20Dreamhost%3C/a%3E%3C/p%3E%0D%0A%20%20%20%20%3C/div%3E%0D%0A%3C/div%3E%0D%0A%3C%21--%20%7E%20--%3E%3Cu%20style=" display:none=""][/a]

With

[a href="http://www.dreamhost.com"]Dreamhost[/a][div style="display: none;"]

resulted in no hang.

[Note: replaced angle brackets with square brackets here because I'm not sure what will happen if I try to place the original markup in the bugzilla comment.]


Leaving the above modification in, but creating a new bogus link block using the same markup as the original and a few of the bogus links caused no hang.

Reinserting all of the bogus links (1500) caused the hang again.  Gradually deleting chunks of that block reduced the hang time.  At 321 links it was no more than a minor spike on the CPU.  At 715 it was 6 seconds.  At 977 links it was 21 seconds.  With all 1500 links it was 35 seconds.

So some combination of the malformed markup that was trying to hide the links and the large quantity of links being hidden caused the hang.  With proper markup, the large number of links was not an issue.  With the improper markup, a small number of links was not an issue.


(In reply to comment #24)
> Trying to track down what it is on the page that caused the hang:

David, in your experiments, did you ever crash after the hang?
(In reply to comment #23)
> That assert in your first post-crash stack trace is Really Bad.  At what point
> does it fire?  What do the thread stacks look like then?

I think it fires immediately before the crash.
I just ran with 
  XPCOM_DEBUG_BREAK=stack ./firefox -no-remote -P trunktest -g -d gdb

and I've attached the output and full stack from gdb.

I think I should run once more with XPCOM_DEBUG_BREAK=stack
so we get the real full stack of the assertion.


I'll try Boris' other proposal soon.
The performance problem there is a separate issue from the crash.  The stack trace in comment 18 indicates that we're just doing layout, which can easily get bogged down with certain kinds of deeply-nested DOMs.

It's probably worth filing a separate bug on the performance issue, especially if it's easy to reproduce with a standalone HTML file that doesn't involve hitting this site.

Oh, and bugzilla will escape whatever HTML you put into comments, so you can just use angle brackets without fear.  ;)
Kai, XPCOM_DEBUG_BREAK=stack more or less outputs garbage unless you pipe the output through fix-linux-stack.pl (which you didn't in this case, hence all it hands out are library offsets on your system, which are of limited use to anyone else).  Can you run with XPCOM_DEBUG_BREAK=break and do a gdb backtrace at that point?  Or take that output and run it through fix-linux-stack.pl?
Oh, and as I said I'd like to see all the thread stacks, not just the one the assert is firing on, at the point in time when we hit the assert.
Full stack for the assertion (stack 1). I told the debugger to continue, and it immediately crashed (stack 2).
OK.  Can you stop at the assert again, and see what |str| looks like there?  I'd really like to see all the members of that object.
(In reply to comment #31)
> OK.  Can you stop at the assert again, and see what |str| looks like there? 
> I'd really like to see all the members of that object.


(gdb) up
#2  0x00360187 in nsACString_internal::Assign (this=0xbf93ca04, str=@0xb29b4c0)
    at /home/kaie/moz/head/mozilla/xpcom/string/src/nsTSubstring.cpp:387
387             NS_ASSERTION(str.mFlags & F_TERMINATED, "shared, but not terminated");
(gdb) print str
$1 = (const nsACString_internal &) @0xb29b4c0: {<nsCSubstring_base> = {<No data fields>}, mData = 0xfb5628 "\034��",
  mLength = 16471688, mFlags = 16471716}
Boris, thanks a lot for asking helpful questions.
I'm now convinced I'm guilty.
Working on a patch.
Assignee: nobody → kengert
Component: General → Security: PSM
OS: Windows XP → All
QA Contact: general → psm
Hardware: PC → All
Attached patch Patch v1 (obsolete)
Attachment #317083 - Flags: review?(bzbarsky)
(In reply to comment #25)
> (In reply to comment #24)
> > Trying to track down what it is on the page that caused the hang:
> 
> David, in your experiments, did you ever crash after the hang?
> 

The site was crashing the browser on a build a couple days ago, but the current
build just hangs for a period of time.

I've stripped the web page down to the basics and will file another bug (bug
430332) for that.  It also involved 2 of the CSS rules, in addition to the
other factors.
Explanation: when we crash, we are accessing memory which has been already destroyed.

Originally, the code that executes an OCSP request always waited for the result. If it was necessary to cancel, it would send a cancel event, and wait until the download really got canceled.

Because of this design, object nsHTTPDownloadEvent simply used a pointer without any ownership or reference counting.

With the recent work in bug 420187 I changed the above design: in order to avoid deadlocks, the caller no longer waits. And here I made the mistake: I missed the no-ownership pointer.


The patch I've attached introduces reference counting for the object that needs to survive longer.
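
The lifetime problem described above can be illustrated with a minimal sketch. It is not the attached patch: it swaps in `std::shared_ptr` for the hand-written reference counting, and `SharedSession`, `OwningDownloadEvent`, `gLiveSessions` and `SessionOutlivesCaller` are hypothetical names made up for the illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

int gLiveSessions = 0;  // counts live sessions, for the demonstration only

// Hypothetical stand-in for the object that needs to survive longer.
struct SharedSession {
    std::string url;
    explicit SharedSession(std::string u) : url(std::move(u)) { ++gLiveSessions; }
    ~SharedSession() { --gLiveSessions; }
};

// The event holds an owning reference instead of a raw, non-owning pointer,
// so the session cannot be destroyed while the event is still queued.
class OwningDownloadEvent {
public:
    explicit OwningDownloadEvent(std::shared_ptr<SharedSession> session)
        : mSession(std::move(session)) {}
    std::string Run() const { return mSession->url; }

private:
    std::shared_ptr<SharedSession> mSession;
};

// The caller drops its reference without waiting; the event keeps the
// session alive until the event itself is destroyed.
inline bool SessionOutlivesCaller() {
    std::unique_ptr<OwningDownloadEvent> event;
    {
        auto session = std::make_shared<SharedSession>("https://example.invalid/ocsp");
        event = std::make_unique<OwningDownloadEvent>(session);
    }  // caller's reference goes away here
    assert(gLiveSessions == 1);  // still alive, owned by the event
    const std::string url = event->Run();
    event.reset();  // last reference dropped, session destroyed
    return gLiveSessions == 0 && url == "https://example.invalid/ocsp";
}
```

With a raw pointer in place of `mSession`, the `Run()` call in this sketch would read freed memory, which is the shape of the crash discussed in this bug.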
Attachment #317083 - Flags: review?(rrelyea)
Someone more familiar with this code needs to review that patch (and in particular, the ownership model)...
Depends on: 420187
Attached file test case
Thanks to David for reducing the web page.
I used the test case he attached to bug 430332 and added the webbug (loading image from paypal with https).

This attachment crashes for me, but works with the patch applied.
Comment on attachment 317083
Patch v1

r- because of the following reservations:

1) Is ++ deemed to be a valid atomic operation on all Mozilla platforms for 32-bit values? That is, are there platforms where ++ expands to:

Load r1, mRefCount
Add  r1, 1
Store r1, mRefCount

Or do we always know that we generate:

Add mRefCount,1

If not, then mRefCount needs to be incremented with PR_AtomicIncrement().


2) Explicitly incrementing the reference count seems wrong to me. It seems to need a function nsNSSHttpRequestSession::AddRef() or something similar that gets the reference (and probably returns 'this').

I think case 1 is a real bug, case 2 is more of a style thing.

bob
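
Both review points can be sketched together, assuming a hypothetical class `RefCountedSession` and using standard C++ `std::atomic` as a stand-in for `PR_AtomicIncrement()`. This is not the code from the patch: the counter is changed only through atomic read-modify-write operations, and `AddRef()` returns `this` instead of callers incrementing the member directly.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical class for illustration; not nsNSSHttpRequestSession itself.
class RefCountedSession {
public:
    // Atomically takes a reference and returns the object.
    RefCountedSession* AddRef() {
        mRefCount.fetch_add(1, std::memory_order_relaxed);
        return this;
    }

    // Atomically drops a reference; returns true if it was the last one.
    // (A heap-allocated object would typically delete itself at that point.)
    bool Release() {
        const int previous = mRefCount.fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0);
        return previous == 1;
    }

    int Count() const { return mRefCount.load(std::memory_order_acquire); }

private:
    std::atomic<int> mRefCount{1};  // the creator holds the first reference
};

// Several threads take and drop references concurrently. With a plain int
// and ++/--, the load/add/store sequences could interleave and lose updates.
inline int HammerRefCount(int threads, int iterations) {
    RefCountedSession session;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&session, iterations] {
            for (int i = 0; i < iterations; ++i) {
                session.AddRef();
                session.Release();
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return session.Count();  // back to the creator's single reference
}
```

Calling `HammerRefCount(4, 100000)` should leave the count at 1, since every `AddRef()` is paired with a `Release()` and no update can be lost.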
Attachment #317083 - Flags: review?(rrelyea) → review+
Attachment #317083 - Flags: review+ → review-
Attached patch Patch v2 (obsolete)
Attachment #317083 - Attachment is obsolete: true
Attachment #317103 - Flags: review?(rrelyea)
Attachment #317083 - Flags: review?(bzbarsky)
Comment on attachment 317103
Patch v2

r+ much better (though I would have liked to see this->AddRef() rather than just AddRef(), this patch is sufficient).

bob
Attachment #317103 - Flags: review?(rrelyea) → review+
Addressed Bob's proposal to use this->AddRef()

carrying forward r=rrelyea

requesting approval
Attachment #317103 - Attachment is obsolete: true
Attachment #317105 - Flags: review+
Attachment #317105 - Flags: approval1.9?
Comment on attachment 317105
Patch v2 with nit addressed

a=shaver
Attachment #317105 - Flags: approval1.9? → approval1.9+
Flags: in-testsuite?
Flags: blocking1.9?
Flags: blocking1.9+
checked in
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Keywords: qawanted
i just installed a build post-patch:
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008042218 Minefield/3.0pre ID:2008042218

The testcase from #39 hangs Fx indefinitely for me. On first clicking it, there is a short period (10 seconds or so) where Fx freezes, then it loads the page. Then some seconds later Fx will freeze and become unresponsive.
(In reply to comment #47)
> i just installed a build post-patch:
> Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008042218
> Minefield/3.0pre ID:2008042218
> 
> The testcase from #39 hangs Fx indefinitely for me. On first clicking it, there
> is a short (10seconds or so) where Fx freezes, then it loads the page. Then
> some seconds later Fx will freeze and become unresponsive.
> 
I see now, if you stop mousing over the links, the freeze will subside in about 10 seconds. Apologies! Still rather nasty :)
Bryan, please don't quote comments in their entirety.

The whole point of comment 39 is that the testcase comes from a performance bug (the one referenced in that comment, and comment 36).  So yes, you'll see a performance issue on that testcase.  That has nothing to do with this bug.
I'm seeing this problem as well.  The instructions in comment #6 sound strikingly similar to what I'm doing when it happens.  Usually I'm at www.linuxtoday.com and doing lots of right-click -> "open in new tab" when this happens.  I see this BZ is marked resolved.  I am using 2.0.0.14.  Does this mean the fix will be in 2.0.0.15 or just a standalone Gecko update?
bryan, this bug was filed against the trunk builds, not the 2.0 branch, so whatever you're seeing should be covered by another bug and not this one.
Operating System: Windows Vista Home Premium (32)

I was closing down a web page tab (a forum), leaving two pages/tabs still open (myspace and yahoo mail), when firefox crashed. Windows gave a pop up information box stating that the program was encountering issues, and gave the option to close the program. I clicked on the "close program" button. I re-opened firefox and was given the option of restoring the session, so nothing was lost from the prior session. I was playing an online playlist from a myspace page on one of the remaining open tabs. More flash issues? 
Crash Signature: [@ PR_AtomicIncrement] [@ nsIOService::NewURI]