Last Comment Bug 745141 - crash in imgRequestProxy::OnDiscard with abort message: "Main-thread-only object used off the main thread"
: crash in imgRequestProxy::OnDiscard with abort message: "Main-thread-only obj...
Status: RESOLVED FIXED
: crash, regression
Product: Core
Classification: Components
Component: ImageLib (show other bugs)
: 14 Branch
: x86 Windows 7
: -- critical (vote)
: mozilla15
Assigned To: Brian R. Bondy [:bbondy]
:
: Milan Sreckovic [:milan]
Mentors:
Depends on: 745662
Blocks: 735572
  Show dependency treegraph
 
Reported: 2012-04-13 03:45 PDT by Scoobidiver (away)
Modified: 2012-08-10 05:59 PDT (History)
17 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
unaffected
+
fixed
+
verified


Attachments
Patch v1. (953 bytes, patch)
2012-05-02 16:59 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v2. (3.76 KB, patch)
2012-05-05 08:18 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v2' (3.76 KB, patch)
2012-05-05 08:22 PDT, Brian R. Bondy [:bbondy]
justin.lebar+bug: review-
Details | Diff | Splinter Review
Patch v3. (6.70 KB, patch)
2012-05-06 13:28 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v4. (9.27 KB, patch)
2012-05-06 15:26 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v5. (6.38 KB, patch)
2012-05-06 20:49 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch v5' (6.03 KB, patch)
2012-05-06 20:51 PDT, Brian R. Bondy [:bbondy]
no flags Details | Diff | Splinter Review
Patch for Aurora (FF14) (1.08 KB, patch)
2012-05-06 21:46 PDT, Justin Lebar (not reading bugmail)
joe: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review
Patch v6. (10.04 KB, patch)
2012-05-07 07:47 PDT, Brian R. Bondy [:bbondy]
justin.lebar+bug: review+
Details | Diff | Splinter Review

Description Scoobidiver (away) 2012-04-13 03:45:00 PDT
It first appeared in 14.0a1/20120331. The regression range might be (low volume):
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=92fe907ddac8&tochange=4c43cfe73516

Signature 	mozalloc_abort(char const* const) | NS_DebugBreak_P | nsHTMLImageElement::AddRef() More Reports Search
UUID	8c6ee025-80d3-4164-a269-2ea072120412
Date Processed	2012-04-12 15:56:10
Uptime	605
Last Crash	7.6 weeks before submission
Install Age	1.0 days since version was first installed.
Install Time	2012-04-11 15:07:02
Product	Firefox
Version	14.0a1
Build ID	20120411030716
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	GenuineIntel family 6 model 42 stepping 7
Crash Reason	EXCEPTION_BREAKPOINT
Crash Address	0x71541998
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x1200, AdapterSubsysID: 00000000, AdapterDriverVersion: 8.17.12.9573
D2D? D2D+ DWrite? DWrite+ D3D10 Layers? D3D10 Layers+ xpcom_runtime_abort(###!!! ABORT: Main-thread-only object used off the main thread: file e:/builds/moz2_slave/m-cen-w32-ntly/build/xpcom/base/nsCycleCollector.cpp, line 1292)
EMCheckCompatibility	True
Total Virtual Memory	4294836224
Available Virtual Memory	3214364672
System Memory Use Percentage	30
Available Page File	13851955200
Available Physical Memory	5932273664

Frame 	Module 	Signature 	Source
0 	mozalloc.dll 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:79
1 	xul.dll 	NS_DebugBreak_P 	xpcom/base/nsDebugImpl.cpp:375
2 	xul.dll 	nsHTMLImageElement::AddRef 	content/html/content/src/nsHTMLImageElement.cpp:214
3 	xul.dll 	imgRequestProxy::OnDiscard 	image/src/imgRequestProxy.cpp:717
4 	xul.dll 	imgRequest::OnDiscard 	image/src/imgRequest.cpp:733
5 	xul.dll 	mozilla::image::RasterImage::Discard 	image/src/RasterImage.cpp:2201
6 	xul.dll 	mozilla::image::DiscardTracker::DiscardNow 	image/src/DiscardTracker.cpp:256
7 	xul.dll 	mozilla::image::DiscardTracker::DiscardRunnable::Run 	image/src/DiscardTracker.cpp:34
8 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:656
9 	xul.dll 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:289
10 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:426
11 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:122
12 	msvcr100.dll 	_callthreadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:314
13 	msvcr100.dll 	_threadstartex 	f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c:292
14 	kernel32.dll 	BaseThreadInitThunk 	
15 	ntdll.dll 	__RtlUserThreadStart 	
16 	ntdll.dll 	_RtlUserThreadStart 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char+const*+const%29+|+NS_DebugBreak_P+|+nsHTMLImageElement%3A%3AAddRef%28%29
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2012-04-13 08:56:58 PDT
Uh... how did this end up on a non-main thread?  The tracker uses NS_DispatchToCurrentThread for the event, but the tracker itself should never run on a non-main thread, I'd think...
Comment 2 Justin Lebar (not reading bugmail) 2012-04-13 09:13:51 PDT
We must be calling InformAllocation or Reset off the main thread.

InformAllocation is called from imgFrame::~imgFrame, imgFrame::Init, and imgFrame::Optimize.

Reset is called from RasterImage::DecodingComplete, RasterImage::WantDecodedFrames, and RasterImage::UnlockImage.

None of this looks particularly suspicious to me... Joe?
Comment 3 Justin Lebar (not reading bugmail) 2012-04-13 09:27:53 PDT
FWIW, I browsed around a bit with off-main-thread assertions in the discard tracker, and I couldn't trigger them.
Comment 4 Justin Lebar (not reading bugmail) 2012-04-15 18:50:25 PDT
I have no idea how we're ending up off the main thread here.  We could change NS_DispatchToCurrentThread to NS_DispatchToMainThread here, but that would still leave us touching the discard tracker off main-thread, which would be Bad.

I can't figure out how this is happening, so unless anyone has a better idea, I'll post a patch adding release-build fatal main-thread assertions here.
Comment 5 Joe Drew (not getting mail) 2012-04-17 08:37:16 PDT
(In reply to Justin Lebar [:jlebar] from comment #2)
> We must be calling InformAllocation or Reset off the main thread.
> 
> InformAllocation is called from imgFrame::~imgFrame, imgFrame::Init, and

These two are called from RasterImage management functions, and hence should be on the main thread (though I wouldn't bet my house on it).

> imgFrame::Optimize.

This is called on the main thread after decoding is finished.

> Reset is called from RasterImage::DecodingComplete,

Called from the decoder, and hence on the main thread.

> RasterImage::WantDecodedFrames, and RasterImage::UnlockImage.

I could buy that either of these two could be called off the main thread by cruel people.
Comment 6 Justin Lebar (not reading bugmail) 2012-04-20 20:53:44 PDT
This is simple enough to solve -- just use a lock.

Problem is, the discard tracker is a global -- it lives until shutdown.  But if I don't want this to "leak", I have to destroy the lock at some point.

Destroying it at XPCOM shutdown is not necessarily sufficient -- I recall that I've seen images used after shutdown.  And anyway, nsCOMPtr is not thread-safe, so even if I wanted to destroy the lock at shutdown, I don't think I could, safely.

Anyway, this is a general problem: It's straightforward to write correct code, but writing code which passes our tests is difficult and requires making the code less safe.

bsmedberg or cjones, do you have any thoughts on this?  It seems to me that I should be able to say "I know I'm leaking this mutex; it's OK" and somehow inform our testing framework of that.
Comment 7 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2012-04-20 21:32:56 PDT
So ignoring your question for a second, sticking a lock around something that's supposed to be single-threaded in order to fix a crash that's happening when the code doesn't run on that thread doesn't like a good idea.  This isn't Java ;).

CC'ing andreas who may or may not be thinking about/working on something possibly related.

To your question, so I'm no expert on XPCOM shutdown, but as I recall there's the elaborate dance with ObserverService that's usually referred to as "xpcom shutdown", but then there's also NS_LogTerm() which is when refcount logging dies.  If images or whatnot are living on past NS_LogTerm() then I think we have serious problems.

So I wonder if a reasonable solution here is to make an atlogterm(fn) helper, in the spirit of atexit(fn).  In release builds we could make atlogterm() calls no-ops to avoid wasting time on shutdown.  In debug builds, we would run through the list of registered functions and call them to allow them to free resources, and *then* shutdown refcount logging.

How does that sound?
Comment 8 Justin Lebar (not reading bugmail) 2012-04-20 21:40:05 PDT
> So ignoring your question for a second, sticking a lock around something that's supposed 
> to be single-threaded in order to fix a crash that's happening when the code doesn't run 
> on that thread doesn't like a good idea.

Sure.  I get the impression that the multi-threaded-ness here is intentional, but I admit to being clueless as to the motivations.

> In debug builds, we would run through the list of registered functions and call them to 
> allow them to free resources, and *then* shutdown refcount logging.

That's similar to what I was trying to do with ClearOnShutdown, but I like the idea that we do this right before we shut down logging, rather than at some undefined point during shutdown.
Comment 9 Joe Drew (not getting mail) 2012-04-20 21:49:25 PDT
Windows peoples, see https://bugzilla.mozilla.org/show_bug.cgi?id=745662#c3 : in short, we're {trans,en}coding an image from within the Windows 7 peek functionality, which is a huge no-no.

I'd much rather we just remove the multi-threading, or at least the use of imagelib off the main thread. Is that doable, he asked naïvely?
Comment 10 Rob Arnold [:robarnold] 2012-04-21 09:18:43 PDT
Can you explain me how the Windows 7 peek functionality is involved here? I don't see it in the stack or either bug. To the best of my knowledge, the native code for it is single threaded (yet async). There may be some JS which asynchronously loads the image data from the favicon database or off the network (I do not remember) but nothing related to PNGs that I can recall: only favicons and raw bitmaps (HBITMAP).
Comment 11 Bobby Holley (:bholley) (busy with Stylo) 2012-04-21 09:25:40 PDT
(In reply to Rob Arnold [:robarnold] from comment #10)
> Can you explain me how the Windows 7 peek functionality is involved here? I
> don't see it in the stack or either bug.

See bug 745662 comment 3.
Comment 12 Rob Arnold [:robarnold] 2012-04-21 09:36:40 PDT
(In reply to Bobby Holley (:bholley) from comment #11)
> (In reply to Rob Arnold [:robarnold] from comment #10)
> > Can you explain me how the Windows 7 peek functionality is involved here? I
> > don't see it in the stack or either bug.
> 
> See bug 745662 comment 3.

On a more careful read, I noticed this: JumpListBuilder.cpp in the stack trace; this appears to be the jumplist feature, not the taskbar previews. Jim should take a look but to get back to Joe's question: the images involved there are tiny (favicons I believe) so doing the image decoding on the main thread seems reasonable (shouldn't cause noticeable UI jank).
Comment 13 Jim Mathies [:jimm] 2012-04-21 09:54:52 PDT
Looks like our jumplist fave icon code might be involved. If I remember correctly, we wanted the writes for these images off the main thread for snappy reasons. Maybe that was overkill.
Comment 14 Brian R. Bondy [:bbondy] 2012-04-21 13:45:13 PDT
So there are 2 async operations going on close to there.  First we request the favicon from the favicon db. Then we dispatch another async request to write it out to disk.  The encoding of the image into an ICO happens on the second async request .  I think we could probably move the encoding of the image into the main thread after the first async request.  

But why can't you encode images on a different thread if you want to?
Comment 15 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-04-21 16:03:55 PDT
Encoding is not the problem here.  Decoding is the problem.
Comment 16 Brian R. Bondy [:bbondy] 2012-04-21 17:23:51 PDT
OK, the same holds though as above.

On the second async runnable we more specifically i) decode the image data we have in a buffer, ii) then encode it, iii) then write it to disk.

Would it be better to decode the image on the main thread once the favicon callback executes, then pass the imgIContainer to the thread to encode it as an icon and write it to disk?  I don't think the implementation is thread safe.  So we could both encode and decode in the main thread and then pass a buffer to write to disk? 

Also just for my knowledge, why can't you decode image data outside of the main thread?   It seems like keeping it as is off the main thread is optimal in this case.
Comment 17 Bobby Holley (:bholley) (busy with Stylo) 2012-04-21 18:15:30 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #16)
> Also just for my knowledge, why can't you decode image data outside of the
> main thread?   It seems like keeping it as is off the main thread is optimal
> in this case.

We're working on it, but the code isn't thread-safe at the moment.
Comment 18 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-04-22 01:26:05 PDT
I wonder if we could just stick a flag somewhere that bypasses the discard logic entirely.  For what this code is doing it doesn't make very much sense.
Comment 19 Justin Lebar (not reading bugmail) 2012-04-30 13:25:06 PDT
Ping on this bug?  This is a serious issue.  If the discard logic races against itself, we can end up with a cyclical linked list, causing infinite main-thread hangs.  Not to mention crashes and other, more general fail.
Comment 20 Brian R. Bondy [:bbondy] 2012-04-30 13:32:38 PDT
Why did this become an issue in v14? Bug 549472 landed in mozilla9.
Comment 21 Justin Lebar (not reading bugmail) 2012-04-30 13:36:07 PDT
Because I rewrote the discard logic to include an NS_DISPATCH_TO_CURRENT_THREAD, which triggered a crash.

But the code was never safe.
Comment 22 Brian R. Bondy [:bbondy] 2012-04-30 13:47:17 PDT
I'm not fighting that it is not safe, but just for my understanding, why *was* it unsafe? 

It was doing decode, and re-encoding on a single thread before and wasn't causing memory leaks or crashes as far as I know.  If it was main thread only usually we do something like MOZ_ASSERT(!NS_IsMainThread());
Comment 23 Justin Lebar (not reading bugmail) 2012-04-30 13:56:14 PDT
> but just for my understanding, why *was* it unsafe? 

The original discard code also modified a linked list, so the original code was subject to race conditions just like the new code.

Maybe we didn't hit this race condition often.  The discard logic used to run every 20s or so, although now it runs more often.

> If it was main thread only usually we do something like MOZ_ASSERT(!NS_IsMainThread());

95% of Gecko is main-thread only. As a result, we often assume that people who call are functions are also on the main thread, and that reviewers will look closely at code which runs off the main thread.
Comment 24 Justin Lebar (not reading bugmail) 2012-04-30 13:57:42 PDT
> Maybe we didn't hit this race condition often.

Also, hitting the race condition would not necessarily lead to a crash.  More likely, it would lead to an incorrect list of some sort.
Comment 25 Brian R. Bondy [:bbondy] 2012-04-30 14:06:46 PDT
Does anyone see a problem with khuey's suggestion in Comment 18? Otherwise we will have to do decoding + encoding of several fav icons and I think this can lead to some jank every 120 seconds.  Otherwise we should add a mutex for the critical section in the discard logic.
Comment 26 Brian R. Bondy [:bbondy] 2012-04-30 14:10:27 PDT
In any case I'll get to it this week so if you're not sure I will debug and find out.
Comment 27 Justin Lebar (not reading bugmail) 2012-04-30 14:13:19 PDT
khuey's suggestion in comment 18 is fine with me, but we'd still have to ensure that the rest of the image logic is thread-safe.

However, I think this

> Otherwise we will have to do decoding + encoding of several fav icons and I think this can lead to 
> some jank every 120 seconds.

is premature optimization.  A favicon is at most what, 128x128 px?  That should be extremely fast.  

To put this in perspective: When switching tabs, we synchronously decode on the main thread all images smaller than 150kb compressed.  This doesn't cause noticeable jank except on pages with many medium-sized images.  A few favicons is tiny in comparison.
Comment 28 Brian R. Bondy [:bbondy] 2012-04-30 14:18:46 PDT
> ...is premature optimization

Just for the record, I wouldn't say it was an optimization at all. The original code was made async because of the write to disk and not for the image decode+encode.  The image decode+encode was only done on the same thread as the write to disk because it made the most sense to be there.  I (nor the reviewer) was aware of the main thread only precondition.
Comment 29 Alex Keybl [:akeybl] 2012-04-30 16:09:36 PDT
(In reply to Justin Lebar [:jlebar] from comment #19)
> Ping on this bug?  This is a serious issue.  If the discard logic races
> against itself, we can end up with a cyclical linked list, causing infinite
> main-thread hangs.  Not to mention crashes and other, more general fail.

Tracking for FF14 and FF15 given this evaluation of the risk if left unfixed. Hangs will be difficult to detect.
Comment 30 Brian R. Bondy [:bbondy] 2012-05-02 16:59:12 PDT
Created attachment 620525 [details] [diff] [review]
Patch v1.

So I don't think the discard logic should be used at all if we are decoding + encoding an image off the main thread.  I see no reason to disallow encoding/decoding off the main thread otherwise.   So I added a check for NS_IsMainThread().
Comment 31 Justin Lebar (not reading bugmail) 2012-05-03 08:56:36 PDT
Comment on attachment 620525 [details] [diff] [review]
Patch v1.

Do you know that this is the only place where we end up touching the discard tracker off the main thread?  I'd expect not.

If the decoder is running off the main thread and the image isn't locked (I doubt it is, and that's a separate bug), then when we finish decoding we'll insert the image into the discard list.

If this patch plus the patch from bug 745662 doesn't crash when you trigger the OMT work, then let's talk.  But I think you're going to need a more generic patch which actually disables image discarding.  See RasterImage::Init and the mDiscardable flag.

We'll also have to disable this InformAllocation call, of course, perhaps based on that flag.
Comment 32 Brian R. Bondy [:bbondy] 2012-05-05 08:18:16 PDT
Created attachment 621303 [details] [diff] [review]
Patch v2.

So I initially added a new flag for this and extended the interface of imgTools to take a flags parameter, but I don't think that's needed.

We already have the INIT_FLAG_DISCARDABLE flag in Image.h it looks a bit strange to have another one like INIT_FLAG_DISABLE_DISCARD_LOGIC and I don't think that's needed.
We already don't specify INIT_FLAG_DISCARDABLE in the imgTools code.

ImgTools also doesn't call into LockImage or UnlockImage which I think are the only other places that could cause problems.

There was one other case I found that could be dangerous in the destructor.  I call that always if the image is discardable, I also call it always if we're on the main thread.
Comment 33 Brian R. Bondy [:bbondy] 2012-05-05 08:22:01 PDT
Created attachment 621304 [details] [diff] [review]
Patch v2'

qref'ed patch
Comment 34 Justin Lebar (not reading bugmail) 2012-05-05 09:15:15 PDT
Hmm.  Given that mDiscardable is always false in imgTools::DecodeImageData, do you have any idea how we got into the position in comment 0, where the image was in the discard list?  That's pretty bizarre; there may be some other bug lurking here.

I'd prefer not to use NS_IsMainThread() for anything except assertions; it muddles the point, can lead to surprising results ("this code does something different when it's off-main-thread!"), and is un-idiomatic.  Thankfully, I don't think we need it:

1) The code in DiscardTracker which keeps track of how much memory we've allocated to images is intended to track *all* images, not just main-thread images.  So even though these jumplist images aren't discardable, they should be tracked.

So let's just make the code which modifies sDecodedBytes use atomic operations (pratom.h).  You could use a mutex if you wanted, but then we'd have to figure out how to safely tear down the mutex at shutdown so we don't "leak", and that's tricky.
 
2) DiscardTracker::Remove should definitely be main-thread only, as you noticed.  But we can accomplish that by moving the ::Remove call up into the if(mDiscardable) statement above (and removing the now incorrect comment) and adding an assertion to RasterImage::Init() that discardable implies main thread.

Does that sound reasonable?  I'm sorry to drag you through another round here.
Comment 35 Brian R. Bondy [:bbondy] 2012-05-05 09:50:27 PDT
> Given that mDiscardable is always false in imgTools::DecodeImageData, 
> do you have any idea how we got into the position in comment 0, where the 
> image was in the discard list? 

We got in that position from imgFrame.cpp, discard logic is only used in imgFrame when mInformedDiscardTracker is true by the way.

> if (!mPalettedImageData) {		
>   DiscardTracker::InformAllocation(4 * mSize.width * mSize.height);		
>   mInformedDiscardTracker = true;
> }
Comment 36 Justin Lebar (not reading bugmail) 2012-05-05 10:08:38 PDT
> We got in that position from imgFrame.cpp

I agree that we touched the discard tracker form imgFrame.cpp, but the stack trace also shows that a raster image was discarded.  That's separate from the imgFrame issue.

If an image is discarded from DiscardTracker::DiscardNow, that means that we called DiscardTracker::Reset for the image.  Which shouldn't happen if the image is not discardable.

> 5 	xul.dll 	mozilla::image::RasterImage::Discard 	image/src/RasterImage.cpp:2201
> 6 	xul.dll 	mozilla::image::DiscardTracker::DiscardNow 	image/src/DiscardTracker.cpp:256
> 7 	xul.dll 	mozilla::image::DiscardTracker::DiscardRunnable::Run 	image/src
Comment 37 Justin Lebar (not reading bugmail) 2012-05-05 10:10:05 PDT
Oh, sorry, I see what you're saying.  DiscardRunnable is running off the main thread, because InformAllocation was called off the main thread.  Got it, right.
Comment 38 Brian R. Bondy [:bbondy] 2012-05-05 15:30:54 PDT
What about adding an allowDiscardDispatch boolean parameter with default value of true to InformAllocation?
It would have a value of false when !mDiscardable

And then inside ~RasterImage have   if (mDiscardable ) {
    DiscardTracker::Remove(&mDiscardTrackerNode);
  }
Comment 39 Brian R. Bondy [:bbondy] 2012-05-05 15:42:34 PDT
We would still need the mutex or equivalent in InformAllocation of course as well.
Comment 40 Justin Lebar (not reading bugmail) 2012-05-05 16:49:19 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #38)
> What about adding an allowDiscardDispatch boolean parameter with default
> value of true to InformAllocation?
> It would have a value of false when !mDiscardable

We could just dispatch explicitly to the main thread, and then this wouldn't be necessary, right?  Anyway we don't have mDiscardable inside imgFrame, which is where we call InformAllocation.  We could pass that around too, but feels heavy-handed.

You'd need synchronization around sDiscardRunnablePending too, by the way.
Comment 41 Brian R. Bondy [:bbondy] 2012-05-05 16:58:22 PDT
Ya I was working on a patch and set a flag in imgFrame before the call to imgFrame::Init, but dispatching to the main thread always sounds cleaner.
Comment 42 Brian R. Bondy [:bbondy] 2012-05-06 13:28:24 PDT
Created attachment 621454 [details] [diff] [review]
Patch v3.

Tested and it's working even when I InformAloocation with sCurrentDecodedImageBytes > sMaxDecodedImageKB * 1024 from the non main thread.

Details about the implementation:
Making them atomic operations only via pratom.h functions would not be enough by the way.
- For InformAllocation, there is no automic increment by this amount operation.  Using set value wouldn't be enough sine 2 threads could set at the same time.
- For the discard runnable lock, we need to check the value inside the lock
- I considered using a single lock for both but I think using 2 is best here.
Comment 43 Justin Lebar (not reading bugmail) 2012-05-06 13:44:15 PDT
I'm not against using locks in principal, but calling PR_NewLock from a static constructor is scary.  There are also startup implications (static constructors are relatively expensive, and we're trying to get rid of them).  You could lazily initialize, but then you have a race condition in the initializer that you'd need to solve with atomic ops.

Why exactly can't you use PR_ATOMIC_ADD for InformAllocation and PR_ATOMIC_SET for the have-i-dispatched bool?
Comment 44 Rob Arnold [:robarnold] 2012-05-06 13:52:08 PDT
(In reply to Justin Lebar [:jlebar] from comment #43)
> I'm not against using locks in principal, but calling PR_NewLock from a
> static constructor is scary.  There are also startup implications (static
> constructors are relatively expensive, and we're trying to get rid of them).
> You could lazily initialize, but then you have a race condition in the
> initializer that you'd need to solve with atomic ops.
NSPR has some threadsafe initialization functions that I think are designed for this. I'm not sure how they work but since it's a C library, I don't think static constructors will come into play.
Comment 45 Brian R. Bondy [:bbondy] 2012-05-06 14:50:53 PDT
> Why exactly can't you use PR_ATOMIC_ADD for InformAllocation 

We can, I thought you were suggesting PR_AtomicSet originally which wouldn't work for obvious reasons. 

> and PR_ATOMIC_SET for the have-i-dispatched bool?

Because then you'd have a race condition just after the !sDiscardRunnablePending check and just before the sDiscardRunnablePending = true operation.
Comment 46 Brian R. Bondy [:bbondy] 2012-05-06 14:53:25 PDT
By the way I originally had the initializations of the locks inside the DiscardTracker::Initialize function, but InformAllocation is called before DiscardTracker::Initialize.  I'm not sure if that's a bad thing, but if so please post a new bug for that.
Comment 47 Brian R. Bondy [:bbondy] 2012-05-06 15:19:13 PDT
I'll just make a DiscardTracker::Initialize() call from within imglib_Initialize() and then I'll remove the other call to it.  And make it public.
Comment 48 Brian R. Bondy [:bbondy] 2012-05-06 15:26:49 PDT
Created attachment 621466 [details] [diff] [review]
Patch v4.
Comment 49 Justin Lebar (not reading bugmail) 2012-05-06 19:49:35 PDT
> Because then you'd have a race condition just after the !sDiscardRunnablePending check 
> and just before the sDiscardRunnablePending = true operation.

The way PR_ATOMIC_SET works in this case is you say "set *foo to true" and then you get back the original value.  Really, it's an atomic get-and-set operation.

In this case, you'd atomic-set sDiscardRunnablePending to true, and then post the event if it returned false, indicating that we did a false --> true conversion.
Comment 50 Brian R. Bondy [:bbondy] 2012-05-06 20:20:26 PDT
Thanks for the quick feedback.  That sounds equivalent and more succinct, I'll update the patch shortly.
Comment 51 Brian R. Bondy [:bbondy] 2012-05-06 20:49:35 PDT
Created attachment 621495 [details] [diff] [review]
Patch v5.
Comment 52 Brian R. Bondy [:bbondy] 2012-05-06 20:51:15 PDT
Created attachment 621496 [details] [diff] [review]
Patch v5'

Removed unused forward declaration
Comment 53 Justin Lebar (not reading bugmail) 2012-05-06 21:20:12 PDT
> #include "nsComponentManagerUtils.h"
> #include "nsITimer.h"
> #include "RasterImage.h"
> #include "DiscardTracker.h"
> #include "mozilla/Preferences.h"
>+#include "prlock.h"

Don't need this anymore.

> void
> DiscardTracker::InformAllocation(PRInt64 bytes)
> {
>   // This function is called back e.g. from RasterImage::Discard(); be careful!
> 
>-  sCurrentDecodedImageBytes += bytes;
>+  PR_ATOMIC_ADD(&sCurrentDecodedImageBytes, bytes);

I think you're going to need to change sCurrentDecodedImageBytes to a PRInt32 to get this to compile on win32.  (And perhaps elsewhere.)  I'm not wild about that change, but oh well.  (If I'm wrong and this compiles on all our platforms, then great.)

(It's silly, because x86-32 has 64-bit atomic operations, and anyway, NSPR has functionality for faking them anyway.  We just don't have easily-accessible code to do a 64-bit atomic add.)

Also, if you wouldn't mind, could you indicate in the header file which of the public functions are main-thread only and which are thread-safe?

With that, r=me.  Thanks for being so patient here, Brian.
Comment 54 Justin Lebar (not reading bugmail) 2012-05-06 21:22:21 PDT
We should just make InformAllocation a nop in FF14.  That will fix this race condition with very low risk.
Comment 55 Brian R. Bondy [:bbondy] 2012-05-06 21:24:24 PDT
> We should just make InformAllocation a nop in FF14.  
> That will fix this race condition with very low risk.

Are you suggesting a different fix for FF14 and FF15? Or are you suggesting an alternate fix to the current r+'ed patch?  I don't know exactly what you want me to do here.
Comment 56 Justin Lebar (not reading bugmail) 2012-05-06 21:27:00 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #55)
> > We should just make InformAllocation a nop in FF14.  
> > That will fix this race condition with very low risk.
> 
> Are you suggesting a different fix for FF14 and FF15? Or are you suggesting
> an alternate fix to the current r+'ed patch?  I don't know exactly what you
> want me to do here.

Sorry, that was really unclear!  Yes, a different fix for FF14; it's safer, I think, and the InformAllocation business has been causing trouble anyway.  And you don't need to do anything; I'll spin the patch in a few minutes.
Comment 57 Brian R. Bondy [:bbondy] 2012-05-06 21:43:33 PDT
Comment on attachment 621496 [details] [diff] [review]
Patch v5'

> I think you're going to need to change sCurrentDecodedImageBytes to a 
> PRInt32 to get this to compile on win32

Well it compiles OK but it won't act correctly.
It won't work properly with Windows nor PR_AtomicSet because they only take in PRInt32*.  Even making it unsigned won't work correctly.

So which way should I proceed? 
1) Use a lock for this case
2) Add in the needed code to NSPR
3) Maybe you can think of something else?

I don't mind doing either way.
Comment 58 Justin Lebar (not reading bugmail) 2012-05-06 21:46:56 PDT
Created attachment 621501 [details] [diff] [review]
Patch for Aurora (FF14)
Comment 59 Brian R. Bondy [:bbondy] 2012-05-06 21:49:27 PDT
I can change to PRInt32 but are we sure it'll always be below 2GB?
Comment 60 Justin Lebar (not reading bugmail) 2012-05-06 21:55:54 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #59)
> I can change to PRInt32 but are we sure it'll always be below 2GB?

Sigh.  I don't know.  So we should probably do a PRInt64 and use a lock.  If you really want to add 64-bit atomics (to mfbt, not NSPR, for both our sakes), we can do that and I'll review it, but I bet you have better things to do.  We can also do it as a follow-up.

Sorry to keep going round and round here.
Comment 61 Brian R. Bondy [:bbondy] 2012-05-07 04:45:36 PDT
> to mfbt, not NSPR

It won't take long to add, but I think they really do belong in NSPR. 

I would use _InterlockedExchange64 on Windows, __sync_lock_test_and_set in the conditions its available (GNUC >= 4.1 + some other things) but it needs to otherwise compile to a new PR_AtomicAdd64 function.

In nsprpub/pr/src/misc/pratom.c, PR_AtomicAdd calls _PR_MD_ATOMIC_ADD which hashes the pointer to get a bucket index for an array of locks, then uses that lock to protect the atomic operation.  I could duplicate the array of lock and hashing in mfbt but that doesn't seem right.  Adding a PRAtomicAdd64 inline there would be fast and easy.

So I think NSPR is best but I know that adding to NSPR can be a pain and might take a while. 
I think it would be good to have in general in NSPR.

---

If you'd prefer the PRLock way, are you against the previous reworking I did to Initialize?
Comment 62 Brian R. Bondy [:bbondy] 2012-05-07 05:23:18 PDT
InterlockedExchange64 is only avail in Vista and up anyway, so let's just go ahead with the lock, but please let me know if you're ok with the previous way I did the initialize reworking though.
Comment 63 Justin Lebar (not reading bugmail) 2012-05-07 07:39:55 PDT
> It won't take long to add, but I think they really do belong in NSPR.

I don't disagree.  But adding code to NSPR is a huge pain, and if we did that, we wouldn't be able to land this patch until after an NSPR uplift...  Code in NSPR is basically hostage, which is why we don't touch often it these days.

If I were to write an atomic ops library, I'd probably do it with inline asm for MSVC, which would let it work on all versions of Windows.  We'd of course have to have a fallback for non-x86/ARM platforms, although maybe we could just rely on GCC to provide the right intrinsics.

There's also chromium atomic ops code we have in the tree that we might be able to mooch off.  But afaik atm nobody outside chromium code uses those methods.

> If you'd prefer the PRLock way, are you against the previous reworking I did to 
> Initialize?

My preference is to use atomic ops.  But I don't think we should need to write an NSPR patch or a new atomic ops library to fix this bug.

> please let me know if you're ok with the previous way I did the initialize reworking 
> though.

The initialization code is good.  But if we're going to destroy the lock in ::Shutdown, let's make it a mozilla::Mutex, and use AutoAquireMutex and nsRefPtr.  Alternatively, I'd be OK never freeing the lock, although then you'd have to use PR locks so we don't leak.  (At least, I know the mutex will "leak".  I suspect the PR locks won't.)
Comment 64 Brian R. Bondy [:bbondy] 2012-05-07 07:47:32 PDT
Created attachment 621597 [details] [diff] [review]
Patch v6.
Comment 65 Justin Lebar (not reading bugmail) 2012-05-07 07:58:32 PDT
> void
> DiscardTracker::InformAllocation(PRInt64 bytes)
> {
>   // This function is called back e.g. from RasterImage::Discard(); be careful!
> 
>+  MOZ_ASSERT(sInitialized);
>+
>+  PR_Lock(sAllocationLock);
>   sCurrentDecodedImageBytes += bytes;
>+  PR_Unlock(sAllocationLock);
>   MOZ_ASSERT(sCurrentDecodedImageBytes >= 0);

Move this assertion up into the lock, please.  That way, each addition gets
exactly one assertion of its result.

>     /**
>+     * Initializes the discard tracker.
>+     */
>+    static nsresult Initialize();

Nit: Main thread only.

>     /**
>      * Inform the discard tracker that we've allocated or deallocated some
>      * memory for a decoded image.  We use this to determine when we've
>-     * allocated too much memory and should discard some images.
>+     * allocated too much memory and should discard some images.  This function
>+     * can be called from any thread and is thread safe.

Nit: thread-safe.

If this doesn't leak on try, let's do it.
Comment 66 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-05-07 07:59:37 PDT
I haven't read the bug, so my apologies if the answer is yes, but is there a good reason to use raw PRLock instead of mozilla::Mutex?
Comment 67 Justin Lebar (not reading bugmail) 2012-05-07 08:01:09 PDT
Comment 63.
Comment 68 Joe Drew (not getting mail) 2012-05-07 10:39:01 PDT
I am not jumping up and down happy that we're adding thread-safety to imagelib a bit at a time here. In particular, just adding it to InformAllocation makes me worry that we're just going to be papering over things.
Comment 69 Brian R. Bondy [:bbondy] 2012-05-07 10:52:43 PDT
An alternative is to just not use DiscardTracker from non main threads.  I personally don't need thread safety, I only want to be able to use decoders from a single thread when that thread happens to not be the main thread.
Comment 70 Justin Lebar (not reading bugmail) 2012-05-07 10:56:03 PDT
Comment on attachment 621501 [details] [diff] [review]
Patch for Aurora (FF14)

[Approval Request Comment]
Regression caused by (bug #): bug 732820
User impact if declined: Race condition, potentially crashes.
Testing completed (on m-c, etc.): Joe and I both looked at the code... :)
Risk to taking this patch (and alternatives if risky): I can't think of a simpler change which eliminates the races.
String changes made by this patch: None
Comment 71 Justin Lebar (not reading bugmail) 2012-05-07 10:57:42 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #69)
> An alternative is to just not use DiscardTracker from non main threads.  I
> personally don't need thread safety, I only want to be able to use decoders
> from a single thread when that thread happens to not be the main thread.

That's basically what we're doing, except we're keeping the InformAllocation call.

Even if we didn't keep the InformAllocation call, we'd still be slowly adding thread-safety to imagelib, in the sense that we'd be asserting that this particular OMT use of imagelib is safe.

I don't like it either!
Comment 73 Brian R. Bondy [:bbondy] 2012-05-08 05:32:50 PDT
try results were all good by the way.
Comment 74 Ed Morley [:emorley] 2012-05-08 11:19:48 PDT
https://hg.mozilla.org/mozilla-central/rev/c2ddddc68432
Comment 75 Alex Keybl [:akeybl] 2012-05-09 15:53:03 PDT
Comment on attachment 621501 [details] [diff] [review]
Patch for Aurora (FF14)

[Triage Comment]
Approved for Aurora 14 to prevent a crash regression.
Comment 76 Justin Lebar (not reading bugmail) 2012-05-09 16:20:17 PDT
Thanks!

Pushed to Aurora for FF14: https://hg.mozilla.org/releases/mozilla-aurora/rev/22bc96126073
Comment 77 Justin Lebar (not reading bugmail) 2012-05-09 16:22:09 PDT
QA: If you want to verify this fix, simply push to try with the patch in bug 745662.  Without this fix, we'll crash and burn on Windows.  With this fix, we should be good.
Comment 78 Brian R. Bondy [:bbondy] 2012-05-09 17:33:27 PDT
I don't think pushing to try will help, you need to have a discardable image favicon as a frequent site list.  Also jump list builds only happen after 120 seconds and all tests complete before 120s.  Also you'd have to be over the limit so a discard is actually dispatched.   

Using a debug build should see it though with an assertion now after 120 seconds. 
You should clear you jump list cache so this code is hit here:
C:\Users\Brian\AppData\Local\Mozilla\Firefox\Profiles\SomeProfileDir\jumpListCache

You can increase the frequency of jump list builds by changing browser.taskbar.lists.refreshInSeconds to something like 10 seconds instead.

But I think we'll see this eventually because of the assertions that are in the patch.  For the aurora patch, since it differs from the Nightly one you'd have to try as I explained above if you want to test it.
Comment 79 Justin Lebar (not reading bugmail) 2012-05-09 22:22:10 PDT
I saw this crash in a try build.  See bug 745662 comment 3.  Or does that stack trace appear to be something different?
Comment 80 Brian R. Bondy [:bbondy] 2012-05-10 06:53:39 PDT
Ya the call stack looks the same. 
OK I guess it is possible then, maybe one of the tests take longer than 120 seconds.

Note You need to log in before you can comment on or make changes to this bug.