Tp4 freezes with decode-on-draw machinery

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
10 years ago
10 years ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

Trunk
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [decodeondraw])

Attachments

(1 attachment, 2 obsolete attachments)

Assignee

Updated

10 years ago
Whiteboard: [decodeondraw]
Assignee

Comment 1

10 years ago
The two most pressing issues:

1) false/false windows failures on wikipedia tu.tv, since that's trunk at the moment.

2) false/true failures on stumbleupon, since that's what's preventing us from getting Tp4 numbers on the false/true configuration.

they may in fact be the same bug. Going to try to reproduce locally.
Assignee

Updated

10 years ago
Depends on: 516486
Assignee

Comment 2

10 years ago
Correction - the stumbleupon failures are crashes, not freezes. This makes sense in context of bug 516486.
I'm seeing this as a reproducible (but not entirely predictabe) hang when trying to run the framecycler.html performance test, which loads about 400 pages in succession into a frame, and repeats the cycle 5 times by default. (Is this similar to what Tp4 runs?)

Prior to changeset f34cc41267d8, this test consistently runs to completion. After f34cc41267d8, it consistently hangs while loading one of the test pages, though at different stages of the cycle - the exact point of failure is not predictable. Sometimes it will hang within the first couple of hundred page loads, other times it may get through one or even more complete cycles of the 400-page set before hanging.

I first encountered the problem when trying to run performance tests on current trunk builds; testing with official nightlies confirmed that it was not specific to my local build. Searching for a regression range using nightly builds narrowed it down to Sept 13th; building specific changesets from that day points to f34cc41267d8 as the moment when the builds begin hanging.

Attaching GDB to the hung process shows a stack such as

0x909a5286 in mach_msg_trap ()
(gdb) bt
#0  0x909a5286 in mach_msg_trap ()
#1  0x909aca7c in mach_msg ()
#2  0x951dde7e in CFRunLoopRunSpecific ()
#3  0x951deaa8 in CFRunLoopRunInMode ()
#4  0x969ac2ac in RunCurrentEventLoopInMode ()
#5  0x969ac0c5 in ReceiveNextEventCommon ()
#6  0x969abf39 in BlockUntilNextEventMatchingListInMode ()
#7  0x900436d5 in _DPSNextEvent ()
#8  0x90042f88 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#9  0x9003bf9f in -[NSApplication run] ()
#10 0x01f494da in nsAppShell::Run ()
#11 0x01ddb5a7 in nsAppStartup::Run ()
#12 0x0150828a in XRE_main ()
#13 0x00002cb8 in main ()

This is from an optimized build, which is what I was trying to use for performance tests. With a recent trunk debug build, the stack when it hangs is very similar:

0x909a5286 in mach_msg_trap ()
(gdb) bt
#0  0x909a5286 in mach_msg_trap ()
#1  0x909aca7c in mach_msg ()
#2  0x951dde7e in CFRunLoopRunSpecific ()
#3  0x951deaa8 in CFRunLoopRunInMode ()
#4  0x969ac2ac in RunCurrentEventLoopInMode ()
#5  0x969ac0c5 in ReceiveNextEventCommon ()
#6  0x969abf39 in BlockUntilNextEventMatchingListInMode ()
#7  0x900436d5 in _DPSNextEvent ()
#8  0x90042f88 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#9  0x9003bf9f in -[NSApplication run] ()
#10 0x150785b6 in nsAppShell::Run (this=0x1585e050) at /Users/jonathan/mozdev/mozilla-central/widget/src/cocoa/nsAppShell.mm:770
#11 0x14d9cc2a in nsAppStartup::Run (this=0x85aa50) at /Users/jonathan/mozdev/mozilla-central/toolkit/components/startup/src/nsAppStartup.cpp:182
#12 0x000acc4c in XRE_main (argc=1, argv=0xbfffead8, aAppData=0x810f80) at /Users/jonathan/mozdev/mozilla-central/toolkit/xre/nsAppRunner.cpp:3420
#13 0x00002863 in main (argc=1, argv=0xbfffead8) at /Users/jonathan/mozdev/mozilla-central/browser/app/nsBrowserApp.cpp:156

Given that the changeset where this problem surfaces (for me, on the Mac) was the change that actually *disabled* decode-on-draw, I'm guessing that the real problem lies somewhere in the underlying architectural changes that were made, and just happens to show up only (or at least more readily) when DoD is turned off. The variability of the exact time of failure suggests a timing- or memory-related problem, perhaps involving threads, GC, caching, or something like that. In that case it's likely to be affected by specific machine speed and configuration, too.

When the test "hangs", note that the browser is not actually unresponsive; I can interact with it normally. It's just the test page that is stopped, with its "loading" spinner still running, apparently waiting "forever" for the load to complete.

As I seem to be able to reproduce this reliably here, is there any further information I can provide that would help track it down?
Assignee

Comment 4

10 years ago
Thanks for looking into this.

Probably the next step is to see if it's related to bug 512435 (which landed with the d-o-d bomb). I think there was another related patch on top of it, so probably the easiest thing to do to disable that functionality is to return true at the beginning of SetBlockingOnload:

http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsImageLoadingContent.cpp#968

This is the one "new thing" we did that would cause the behavior you're seeing. I'd be curious as to whether reverting it fixes things.
When you wrote "...return true" did you actually just mean "return", as SetBlockingOnload is a void function?

I tried this (so SetBlockingOnload effectively becomes a no-op), and have now run several iterations of the test, both with changeset f34cc41267d8 and with my current trunk, and seen no recurrence of the problem.

I'll be running more tests shortly, as I want to do some performance comparisons for a patch I'm working on. If the issue resurfaces at all, I'll note it here, but so far, it looks like you've identified the right problem area.
Assignee

Comment 6

10 years ago
(In reply to comment #5)
> When you wrote "...return true" did you actually just mean "return", as
> SetBlockingOnload is a void function?

Yes.

> I'll be running more tests shortly, as I want to do some performance
> comparisons for a patch I'm working on. If the issue resurfaces at all, I'll
> note it here, but so far, it looks like you've identified the right problem
> area.

Awesome.

Once we're pretty sure that this fixes it, the next step is to figure out if we can identify an image that's consistently triggering the block but never the unblock.

To do this, probably the best solution is to modify SetBlockingOnload to accept an additional parameter, the imgIRequest in question, and then log it. If aBlocking is PR_TRUE, get the uri, then get the spec, then print something like ("[this=%p]BLOCKING for %s\n", this, specString.get()). Otherwise, just print ("this=%p]UNBLOCKING\n"). This means that you only need to pass an imgIRequest from OnStartDecode, where it's very obvious what that imgIRequest should be.

After that, it's just mahjong once you hit a hang - Find a [this] where things don't match up, and then see what URI triggered the blocking.

I apologize for not writing this debugging patch for you myself, but I'm really busy with school. I appreciate your help on it.
Some results from logging BLOCKING and UNBLOCKING calls, to see what's blocked at the point where the test hangs.

First run, hung after about 12 pages with two blocked image loads:
[this=0x1bd3617c] BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.yahoo.com.cn/cn.yimg.com/i/hp/07/ic3.gif
[this=0x1bc9a75c] BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.yahoo.com.cn/cn.yimg.com/i/hp/07/ic4.gif

I added "sequence numbers" for every call into SetBlockingOnload(), to make it easier to sift the log data (sort by "this" ptr to make it easier to eliminate matched block/unblock pairs, then re-sort by sequence number to view the residue in order). Several more runs, to show the variability in where it hangs....

Hung after 100+ pages:
[this=0x1829facc] seq=00027771 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.msn.com.br/stb.msn.com/i/8F/27C9CB71DE67CA4BC06DEE2F5F1D5D.jpg

After 15 pages:
[this=0x1ae505bc] seq=00001555 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.taobao.com/pics.taobao.com/bao/album/promotion/lucky_760x90_061130.jpg

After about 1100 pages(!), four successive image loads remained blocked (3 of them for the same image):
[this=0x1b6d181c] seq=00203593 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.ctrip.com/www.ctrip.com/images/homepage/icon02.jpg
[this=0x1bd62dac] seq=00203594 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.ctrip.com/www.ctrip.com/images/homepage/icon02.jpg
[this=0x1bf9637c] seq=00203595 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.ctrip.com/www.ctrip.com/images/homepage/icon02.jpg
[this=0x1f3e2a8c] seq=00203596 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.ctrip.com/www.ctrip.com/images/help.jpg

(Note that this is on the third iteration over the page set, so the exact same page has loaded OK previously on this run.)

After 31 pages:
[this=0x1e7ebfbc] seq=00003821 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.aol.com/www.awm1.net/awm/www.aol.com/main/20/i.png

These are all running the same test on the same machine, all with the exact same build (except the first example, as obviously I recompiled to add the sequence numbers). It's pretty unpredictable; I don't think this is an issue with a particular image file, but some subtle timing-dependent interaction. (The test is running entirely locally, BTW, with networking turned off.)
More info: I think we may be closing in on this. I decided to log nsImageLoadingContent creation and destruction as well, and check for request and blocking status during destruction. I see there are assertions there, but of course those don't help us in an optimized build -- such as we normally use for these performance tests.

Here is the relevant trace data for the image that blocked in my next run:

[this=0x2202fe6c] seq=00051726 CREATED
[this=0x2202fe6c] seq=00051727 BLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.godaddy.com/imagesak.godaddy.com/aaa/home/icann.png

At this point, the test hangs, waiting for onload to fire so that the test script knows it can continue to the next page. There is one unmatched BLOCKING call.

If I quit the browser now, all the pending nsImageLoadingContent objects get destroyed, of course. In the process, we call SetBlockingOnload(PR_FALSE) on each of them, so an UNBLOCKING call is logged. However, for this object (only), it turns out that mBlockingOnload is still TRUE when the destructor is reached:

[this=0x2202fe6c] seq=00052738 UNBLOCKING, spec=file:///Users/jonathan/page_load_test/pages/www.godaddy.com/imagesak.godaddy.com/aaa/home/icann.png
[this=0x2202fe6c] seq=00052739 DELETED  WHILE BLOCKING!

Looking at SetBlockingOnload(), I think the only way this could happen is if GetOurDocument() returned null. In this case, it can't call doc->UnblockOnload, and the mBlockingOnload flag doesn't get cleared.

What I guess must be happening is that for some (timing-related) reason, it's possible for the nsImageLoadingContent to get "detached" from its document after it has called doc->BlockOnload(), leaving the document in an onload-blocked state. But I don't really know much about all this document and content stuff.
BTW, I wonder if this could be responsible for some of the "random" orange we're seeing on tinderbox? Cases where tests seem to simply hang/timeout with no output would be consistent with this.
This patch to OnStopDecode appears to resolve the possible "hang" in my local testing; I have run the complete 400-page x 5 iterations test numerous times with this, and it completes successfully every time. Tracing indicates that the situation where this patch applies is rare - just a handful of occurrences in a typical 2000-page test.
Assignee: bobbyholley → jfkthame
Attachment #405508 - Flags: review?(bobbyholley)
Assignee

Comment 11

10 years ago
Thanks a ton for tracking this down. I agree this is probably the source of some random orange - I feel bad that I've been too busy with school to spend much time on it.

So the weirdness here is that OnStopDecode doesn't do what it says it does (this is a bug, but hasn't been fixed yet).
See http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#687

So what your patch does is to stop blocking onload when the data from the "current" request is finished _loading_, which isn't actually what we want (in fact, that's the behavior that this whole complicated blocking mechanism is designed to prevent - we want to block until we're _decoded_, not just until we're loaded).

When you say "the situation where this patch applies is rare", I assume you mean the situation in which we hit that code _and_ we're blocked (ie, the unblock call isn't a no-op). We should usually be getting OnStopDecode (which is secretly OnStopRequest) at some point for the current request, and that should only not be the case in the case where some js does "img.src = bar;" when the old img.src==foo hadn't finished loading.

The two places where we actually unblock are OnStopFrame (assuming the image is good), and OnStopContainer (if the image is erroneous and never got to a frame). OnStopContainer is guaranteed (unless there's some bug) to always fire for any image, so we _should_ always catch any images there.

What pageset are you running? I couldn't find some of those sites in either Tp4 or standalone talos (though my standalone talos release is a bit old).

One interesting thing to note is that these images (as found on the web right now) are all rather "small". But I don't know enough about the size distribution on the web to know if this is any different from normal.

So as I see it, there are two possible explanations:
1) We're screwing up our tracking of "current" and "pending" images in nsImageLoadingContent such that, depending on the timing, we block on something and never unblock.
2) imagelib has a bug where, depending on the timing of how it gets its data, doesn't properly fire OnStopContainer.

I'm going to try to spend some time this afternoon to take a look at the code, see if I get any ideas, and see if I can figure out what the next step for debugging is.
Thanks for the explanations.... I definitely don't have a good handle on this whole process yet. OK, so this isn't really the right place for a fix, it's just band-aiding over the real issue (no big surprise - that's why I wanted you to see if it made any sense).

Yes, I meant that (in my testing, at least), it's very rare to hit that code while blocked. The frequency of finding that situation seemed comparable to the frequency of hangs I'm seeing - once in anything from a few dozen to a few hundred pages, typically.

Re the pageset, I'm running a set that Roc gave me last Nov (though I don't know how fresh it was then; the pages with dated material indicate they're several years old), so I'm not surprised if it differs from current Tp4/talos. Given how variable the point of failure is, I don't believe it's tied to specific images - though there do seem to be certain pages where I've seen several occurrences of a hang. Most likely certain patterns of use, such as numerous small images on a page, increase the probability.

If I have some time, I'll try to get my head around a bit more of this with the help of your comments, but anything further you come up with will be most welcome.
Assignee

Updated

10 years ago
Attachment #405508 - Flags: review?(bobbyholley) → review-
I've done some more tracing, with the patch from bug 521497 in place.

I added more tracing calls, to track the notifications that are received by the nsImageLoadingContent objects. Here's the result of a typical run, after filtering the trace to show only the objects responsible for the outstanding blocking calls. (Note that the images concerned *are* actually displayed in the browser window at this point.)

[this=0x1781f5fc] seq=00169891 CREATED
[this=0x1781f5fc] seq=00169892 OnStartRequest
[this=0x1781f5fc] seq=00169893 OnStartContainer
[this=0x1781f5fc] seq=00169894 OnStartDecode
[this=0x1781f5fc] seq=00169895 BLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u7/Disk1/cnc/2006/11/29/V7/8U/V78U7E90.jpg'
[this=0x1781f5fc] seq=00169896 OnStartFrame
[this=0x1781f5fc] seq=00169897 OnDataAvailable
[this=0x1781f5fc] seq=00169968 OnStopDecode
[this=0x1781f5fc] seq=00169969 OnStopRequest

[this=0x1f05bb4c] seq=00169911 CREATED
[this=0x1f05bb4c] seq=00169912 OnStartRequest
[this=0x1f05bb4c] seq=00169913 OnStartContainer
[this=0x1f05bb4c] seq=00169914 OnStartDecode
[this=0x1f05bb4c] seq=00169915 BLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u7/Disk1/cnc/2006/11/30/FK/RT/FKRTLOBS.jpg'
[this=0x1f05bb4c] seq=00169916 OnStartFrame
[this=0x1f05bb4c] seq=00169917 OnDataAvailable
[this=0x1f05bb4c] seq=00169972 OnStopDecode
[this=0x1f05bb4c] seq=00169973 OnStopRequest

[this=0x217de5bc] seq=00169925 CREATED
[this=0x217de5bc] seq=00169926 OnStartRequest
[this=0x217de5bc] seq=00169927 OnStartContainer
[this=0x217de5bc] seq=00169928 OnStartDecode
[this=0x217de5bc] seq=00169929 BLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u12/Disk1/ctc/2006/11/30/FG/23/FG23CR3P.jpg'
[this=0x217de5bc] seq=00169930 OnStartFrame
[this=0x217de5bc] seq=00169931 OnDataAvailable
[this=0x217de5bc] seq=00169970 OnStopDecode
[this=0x217de5bc] seq=00169971 OnStopRequest

[this=0xa7458c] seq=00169918 CREATED
[this=0xa7458c] seq=00169919 OnStartRequest
[this=0xa7458c] seq=00169920 OnStartContainer
[this=0xa7458c] seq=00169921 OnStartDecode
[this=0xa7458c] seq=00169922 BLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u1/Disk1/cnc/2006/11/30/3L/EW/3LEW8066.jpg'
[this=0xa7458c] seq=00169923 OnStartFrame
[this=0xa7458c] seq=00169924 OnDataAvailable
[this=0xa7458c] seq=00169966 OnStopDecode
[this=0xa7458c] seq=00169967 OnStopRequest

So in this case, there are 4 image requests that have blocked the onload event (more commonly, I just get one or two). For comparison, the trace for another similar image request on the same page looks like:

[this=0x17024c1c] seq=00169826 CREATED
[this=0x17024c1c] seq=00169827 OnStartRequest
[this=0x17024c1c] seq=00169828 OnStartContainer
[this=0x17024c1c] seq=00169829 OnStartDecode
[this=0x17024c1c] seq=00169830 BLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u12/Disk1/ctc/2006/11/27/JD/ZA/JDZAX1RH.jpg'
[this=0x17024c1c] seq=00169831 OnStartFrame
[this=0x17024c1c] seq=00169832 OnDataAvailable
[this=0x17024c1c] seq=00169833 OnStopFrame
[this=0x17024c1c] seq=00169834 UNBLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u12/Disk1/ctc/2006/11/27/JD/ZA/JDZAX1RH.jpg'
[this=0x17024c1c] seq=00169835 OnStopContainer
[this=0x17024c1c] seq=00169836 UNBLOCKING spec='file:///Users/jonathan/page_load_test/pages/www.mofile.com/cache.mofile.com/tv/static/picture/u12/Disk1/ctc/2006/11/27/JD/ZA/JDZAX1RH.jpg'
[this=0x17024c1c] seq=00169837 OnStopDecode
[this=0x17024c1c] seq=00169838 OnStopRequest

where we can see that OnStopFrame and OnStopContainer have been called, both of while resulted in unblocking. In the case of the "problem" requests, these two notifications didn't happen, though we did get the final OnStopDecode and OnStopRequest.

Manually quitting the stalled browser confirms that OnStopContainer is never called for these "blocking" requests, and they are eventually deleted while their mBlockingOnload flag is still TRUE, as described in comment #8.
Assignee

Comment 14

10 years ago
From what you're seeing, we're definitely looking at an imagelib bug here. I'm really swamped until wednesday, so I probably won't get to hunting it down until then. Jonathan - is this blocking you in any way? Just curious on the level of urgency I should tackle this with.
No worries - obviously, it'd be good to get this fixed (especially as it might cause random tinderbox orangeness, which is a Bad Thing), but for local testing of other stuff I can just patch around it by disabling SetBlockingOnload, so it's not specifically blocking me.
Assignee

Comment 16

10 years ago
Let's get down to business. First I'm going to do some analysis of the possibilities, and then propose the next steps for tracking this down.

So what we're seeing is that, for various jpegs (possibly fishy, but there was a png that was the culprit in an earlier run), we first get an OnStartDecode().

The OnStartDecode() can come from 2 places:

1) the JPEG decoder fires it off to the imgRequest, which in turn fires it off to all conumers:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#156
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#567

2) A new consumer is added to an existing request, and OnStartRequest is fired to "get it up to speed",
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#267
but the request still isn't finished, because OnStopContainer is not fired, meaning stateRequestStopped is not set:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#297

Presently, I don't see any reason why the distinction between 1 and 2 above would matter much, but I'm including it for completeness.

Then, something happens that triggers OnStopDecode() and OnStopRequest(), but OnStopContainer() never happens first.

Since imgRequest.cpp squelches all OnStopDecode()s from imgContainer.cpp and the decoders, there are only 3 places the OnStopDecode() can be coming from.

A) The place mentioned in part 2 above, which seems impossible:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#297

B) Necko tells us the request is done with OnStopRequest.
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#938

C) Somebody calls RemoveProxy() with aNotify == PR_TRUE
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#193

Case A shouldn't be the issue here, since OnStopContainer() is fired immediately before OnStopDecode().

So let's analyze case B. If this is what's happening, then we execute this code first:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#906
That means, that either mImage is null (unlikely, but I can't rule it out), or that we manage to call SourceDataComplete() on the image without triggering OnStopContainer(). So let's look at that method:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgContainer.cpp#1259
StoringSourceData() should always be false on trunk by these days (false/false configuration), so unless we hit one of the early return conditions, we should definitely call ShutdownDecoder(). The second early return case would imply that we've already called the function.

Thus for case B, we can deduce that either mImage is null, or that ((mError == true || we call ShutdownDecoder()) without ever firing OnStopContainer()). Either of these would be incorrect, and thus are two possible bugs.

Now let's analyze case C. RemoveProxy can be called in 4 places.
The destructor - http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequestProxy.cpp#105
DoCancel - http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequestProxy.cpp#250
CancelAndForgetObserver - http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequestProxy.cpp#267
ChangeOwner - http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequestProxy.cpp#154

The first 3 all fall into the category of "the consumer handle going away before the request finishes". The last one is an edge case with cache validation that always trips me up, so I'd avoid thinking about it unless I have to.




So the next step is to figure out whether we're dealing with case B or case C.
This involves instrumenting at the following locations:
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#935
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#193


jfkthame - can you figure out which of those last two locations is the source for the OnStopDecode() in the case where we're not getting an OnStopContainer()?
It's case B.

Just ran a couple of tests that hung with several outstanding blocks, and in all cases the OnStopDecode notification came from the observer-notification loop in imgRequest::OnStopRequest() at http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#938

(FWIW, my logging of the test runs shows absolutely no calls to imgRequest::RemoveProxy where aNotify==PR_TRUE.)
Assignee

Comment 18

10 years ago
That's good to hear, since it means this should be pretty straightforward to debug.

The next step is to hone in on it more. I'm not sure how long this stuff takes to reproduce, so it might make sense to binary search, or it might make sense to stick all the logging in at once.

1 - Do we end up calling SourceDataComplete() here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#910

2) - Is mError true when we enter SourceDataComplete()?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgContainer.cpp#1259

3) Do we end up calling ShutdownDecoder() here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgContainer.cpp#1276

4) Do we end up calling Close() here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgContainer.cpp#2150

5) (Assuming this is a jpeg) - Is mState JPEG_ERROR here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#205

5) Do we end up calling NotifyDone() here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#213

6) If we're good all the way up to this, then why don't we fire OnStopContainer() here?
http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#597
Attachment #405508 - Attachment is obsolete: true
As far as I can tell, answers are typically:
1) yes
2) mError is false
3) yes
4) yes
5) no error state
5b) NO - because in [almost?] every case I've traced, mNotifiedDone is already true at this point; apparently NotifyDone() is being called from term_source() at http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#924.

As mentioned on IRC, I found a typo at http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp#211 that could conceivably be relevant. However, I've changed the first && to & here, and although I initially thought this might have helped, after repeated tests it's not clear that this has made any significant difference.

I have seen the "hang" due to blocked onload happen on JPEG, PNG and GIF images, which makes me suspect that the real problem is not in a format-specific decoder but rather in the overall image-loading code.

It's been noticeable that as I add more logging to try and follow what's happening, the problem gets increasingly rare (e.g., no logging: typically hangs once in every 1-200 pageloads; with extensive logging, it often completes several thousand pageloads before hanging). This is consistent with the idea that the problem is very sensitive to timing; slowing down the image-loading components makes it less likely the problem will occur. Unfortunately this means that a debug build, or even a heavily-instrumented non-debug build, is unlikely to run into the issue! :(
Filed bug 531363 to fix the typo in nsJPEGDecoder.cpp; it seemed better to handle that separately, as fixing it does not (unfortunately) resolve this bug.
Assignee

Comment 21

10 years ago
Posted patch patch v1 (obsolete) — Splinter Review
I think I figured out what the problem is.

We keep track of decoding state so that consumers that are added to the imgRequest midway through processing can be brought up to speed with the appropriate notifications in NotifyProxyListener:

http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#254

However, it looks like currently we're not keeping track of when decoding stops (when we get OnStopDecode and OnStopContainer from the decoder), and NotifyProxyListener sends OnStopContainer with the OnStopRequest() notifications (this was probably a casualty of all the confusion surrounding the meaning of OnStopDecode, see bug 505385).

So the hang is the result of the following timing-dependent scenario:

1) An image kicks off decoding
2) The decoder gets enough data that it declares completion and fires OnStopDecode() and OnStopContainer() to the imgRequest.
3) The imgRequest() squelches OnStopDecode(), and passes OnStopContainer() off to all existing consumers.
4) A new consumer is added, before the necko request completes
5) NotifyProxyListener brings it up to speed, but does NOT fire OnStopContainer, because currently we fire that based on necko request completion (therein lies the rub), and that's not done yet
6) The necko request completes, and OnStopRequest gets called on the imgRequest
7) OnStopRequest calls SourceDataComplete on the container
8) SourceDataComplete closes the decoder, which eschews firing OnStopContainer because it already did.
9) The consumer that was added midstream waits indefinitely for OnStopContainer.

I've attached a patch that tracks decode completion and brings consumers up to speed appropriately.

I think that it would also make sense to make OnStopFrame dependent on this state variable as well, and move it from here:

http://mxr.mozilla.org/mozilla-central/source/modules/libpr0n/src/imgRequest.cpp#288

However, I want to test it first.

jfkthame - if you get the chance, can you apply this patch and see if the problem goes away?
Assignee: jfkthame → bobbyholley
Status: NEW → ASSIGNED
Looks good! That scenario sounds like a very plausible explanation for the behavior I've been seeing. I've just been running tests with this patch, and in 75 iterations of my pageset (almost 30,000 pageloads), there have been no blocking problems at all. Without the patch, it usually hangs within the first few hundred pageloads.
Assignee

Comment 23

10 years ago
The OnStopFrame thing looks more complicated than I thought, so I filed bug 531559 for it.
Assignee

Comment 24

10 years ago
Comment on attachment 414895 [details] [diff] [review]
patch v1

(In reply to comment #22)
> Looks good! That scenario sounds like a very plausible explanation for the
> behavior I've been seeing. I've just been running tests with this patch, and in
> 75 iterations of my pageset (almost 30,000 pageloads), there have been no
> blocking problems at all. Without the patch, it usually hangs within the first
> few hundred pageloads.

Woohoo! Here's to tracking down random orange.

Flagging joe for review.
Attachment #414895 - Flags: review?(joe)
Comment on attachment 414895 [details] [diff] [review]
patch v1

>diff --git a/modules/libpr0n/src/imgRequest.cpp b/modules/libpr0n/src/imgRequest.cpp

>   if (aNotify) {
>+
>+    // The "real" OnStopDecode - fix this with bug 505385.
>+    if (!(mState & stateDecodeStopped)) {
>+      proxy->OnStopContainer(mImage);
>+    }
>+
>     // make sure that observer gets an OnStopDecode message sent to it
>     if (!(mState & stateRequestStopped)) {
>       proxy->OnStopDecode(aStatus, nsnull);
>     }
> 
>   }

Can you tighten up the whitespace a bit in this block? We don't really need blank lines beginning and ending it.
Attachment #414895 - Flags: review?(joe) → review+
Assignee

Comment 26

10 years ago
Posted patch patch v2Splinter Review
updated patch
Attachment #414895 - Attachment is obsolete: true
Pushed this, now that the tree is open:
http://hg.mozilla.org/mozilla-central/rev/44c392db6672
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.