Open Bug 1103459 Opened 10 years ago Updated 2 years ago

Second play-through of animated GIF sometimes starts at a random point

Categories

(Core :: Graphics: ImageLib, defect)

x86_64
Linux
defect

Tracking

()

Tracking Status
firefox33 --- wontfix
firefox34 --- wontfix
firefox35 --- wontfix
firefox36 - wontfix
firefox37 - affected
firefox38 --- affected
firefox39 --- affected

People

(Reporter: dholbert, Unassigned)

Details

Attachments

(4 files)

STR:
 1. Load (or shift-reload) https://i.imgur.com/Y46SPPu.gif
 2. Watch the first playthrough, and **pay special attention** to where the second playthrough starts (after you've seen the guy fly through the hole at you)
 3. Shift-reload if nothing funny happened.


EXPECTED RESULTS: Second playthrough should start at the beginning of the GIF.

ACTUAL RESULTS: Some non-trivial fraction of the time (feels like maybe 1/3 of the time that I perform the STR), the second playthrough starts at a random point, not at the fully-zoomed-out first flyover.


This almost feels like we start the animation timer when we display the first frame (while we're downloading image data), and for the first playthrough, we show data when it becomes available; but for the second playthrough (including its starting-point), we respect the animation timer. This means the second playthrough doesn't start exactly at the beginning, depending on how much delay was incurred while we downloaded + decoded image-data during the first playthrough.

Given the timing, seems likely to be a very-recent regression from bug 1101759 or bug 1079653.

I'm hitting this Nightly 36.0a1 (2014-11-21) on 64-bit Linux, in my normal browsing profile (with e10s disabled).
Here's a screencast of what this looks like.

In particular:
 - The animated gif ends about 8 seconds into this screencast (after you see the guy fly past you)
 - Then ***instead of resetting to the initial zoomed-out view** for the second playthrough, it resets to a point about halfway through the animation (where you're about to fly through the hole in the cliff)

(Side note: I used a fresh profile this time, with e10s enabled -- so, not e10s-dependent, since it reproduces w/ and w/out e10s; and not profile-dependent.)
Comment on attachment 8527274 [details]
the gif in question (testcase)

I can reproduce with this GIF hosted on bugzilla.

FWIW: often it jumps to ~80% of the way through the animation, for the second playthrough, so the bug manifests as a "stutter" where I see the guy fly out of the cliff, and then teleport backwards and fly out again.

This is from my home internet connection, which is sometimes a bit flaky (and seems to be having some slow response times right now), so this might only be noticeable with a slightly-slowed internet connection... not sure. (That would fit my "This almost feels like" hypothesis in comment 0, at least.)
Attachment #8527274 - Attachment description: the gif in question (testcase?) → the gif in question (testcase)
[Tracking Requested - why for this release]: recent regression, close to merge day
All signs point to Seth - can you take a look here and see if a backout is the best option before we merge on Friday?
Flags: needinfo?(seth)
(In reply to Daniel Holbert [:dholbert] from comment #0)
> Given the timing, seems likely to be a very-recent regression from bug
> 1101759 or bug 1079653.

If it's either of those it's the latter; I don't see how the former could have caused this problem.

> This almost feels like we start the animation timer when we display the
> first frame (while we're downloading image data), and for the first
> playthrough, we show data when it becomes available; but for the second
> playthrough (including its starting-point), we respect the animation timer.
> This means the second playthrough doesn't start exactly at the beginning,
> depending on how much delay was incurred while we downloaded + decoded
> image-data during the first playthrough.

Sounds like something along those lines could be happening, yes. We actually do not send out FRAME_UPDATE notifications from decoding after the first frame anymore, so we don't literally send out data when it becomes available; it's controlled by the normal animation playback code. But that may be exactly the problem, as you say; we are starting the animation as soon as we realize the image is animated, so the timeline starts at that point but the framerate is limited by how fast we can download and decode. When we finish decoding, the animation code may decide that it has dropped a bunch of frames and it has to catch up. If so, this is probably fairly straightforward to fix.

(In reply to Lukas Blakk [:lsblakk] use ?needinfo from comment #5)
> All signs point to Seth - can you take a look here and see if a backout is
> the best option before we merge on Friday?

Right now I don't think a backout is likely to be necessary. FWIW, I'm skeptical that the regressing bug is correctly identified here; I'd guess it was one of the bugs hanging off of bug 910441.
Flags: needinfo?(seth)
I'm going to see if I can track down a regression range later this evening, FWIW.
Flags: needinfo?(dholbert)
Surprise -- I can reproduce in Firefox 33.0 (release)! So, looks like it's not so recent a regression after all. (This surprises me, because I don't remember ever seeing this before; maybe it's only noticable with long-duration GIFs, or something.)

UA string: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0

I'll do more thorough regression range tracking later, but for now, we can take this off of the worrisome-recent-regression list, & mark versions as "affected" back at least as far as Firefox 33.  Seth, sorry for mistakenly blaming your recent changes!
I tried a bit to get a regression range, but I didn't have much luck, largely because this only reproduces a fraction of the time (e.g. in some of my per-build tests, 1/10 shift-reloads triggered the bug, and the remaining 9/10 were fine).  So, I don't have much confidence when decreeing any particular build as "good".

Given that this isn't a recent regression (per comment 8), I'm not going to bother pursuing the regression range further, at this point. (Leaving the regressionrange-wanted keyword, though, in case someone else can reproduce this & has more success at build-classifying/range-narrowing than I did.)
Flags: needinfo?(dholbert)
Summary: Second play-through of animated GIF starts at a random point → Second play-through of animated GIF sometimes starts at a random point
So neither Daniel nor I can manage to reproduce this on a different network. (In particular, I can't reproduce it at all.) I'm kinda saddened by that, since it seems like there's something interesting going on here. If anyone can reproduce it happening and figure out what happens in the RasterImage animation code and FrameAnimator when the problem happens, that would be an amazing help (and solving it would probably be pretty easy).
I've been hoping to poke at this in a debugger -- haven't gotten to it yet, but FWIW, I can still reproduce this (using my main browsing profile, on my home network), with the latest nightly.

38.0a1 (2015-01-13)
So, it seems that nothing is going to happen about the 36. Stop tracking.
dholbert - Can you still reproduce as per comment 11 and, if so, can you find some time to poke at this so that we can see about fixing it in 37?
Flags: needinfo?(dholbert)
Yup -- I'll take a look at this from home (the only place I've been able to reproduce it, some fraction of the time) tonight, or some night this week.
I can reproduce this very reliably by tethering with my phone, FWIW.

What happens is:
 * The first time through the animation, many of our calls to RequestRefresh() fail to advance the frame, even though time has elapsed, because AdvanceFrame takes its early "!canDisplay" return-path (since we don't have the next frame decoded yet -- probably because we haven't received all of its data yet).

 * So, RequestRefresh() aTime keeps getting larger and larger -- BUT MEANWHILE, the local-var currentFrameEndTime' and 'mCurrentAnimationFrameTime' (also timestamps) are much smaller, because they're unaware that we've fallen behind. They're basically set to the time that we *should've* started/stopped showing the current frame, if we had all the data.

 * When we finally receive all the data and AdvanceFrame stops failing early all the time, then all of a sudden RequestRefresh()'s loop is free to spin until currentFrameEndTime catches up with aTime. This makes us skip a ton of frames on the second play-through.
Flags: needinfo?(dholbert)
Here's a debug log generated using the attached logging patch.  I included RequestRefresh calls for the first few frames, and then I jumped straight to the last frame (103).

At that point, since we've received all the data and AdvanceFrame has nothing to make it fail early, we have a particularly-eventful call to RequestRefresh, where we loop through 70 frames (jumping from frame 103, at the end of the first playthrough, to frame 70 for the start of the second playthrough.)

(Note: as you may be able to infer from the timestamps, this is on an extremely slow (~100kbps) connection -- I actually ate up my 1GB high-speed mobile data allowance while testing this bug today. :) I expect you should be able to reproduce something like this bug even on a faster connection, as long as the first playthrough is held up a bit, making curFrameEndTime fall behind.  It may not be obvious visibly, but it should be obvious with my logging patch.)
The most relevant chunk of the log is this bug, on the RequestRefresh() call where we go from the first playthrough to the second:
{
*****dholbert Entering RequestRefresh; frame=103, aTime=144469.994079, curFrameEndTime=7219.179671
*****dholbert --> GetCurrentImgFrameEndTime: frame=103, currentFrameTime=7179.179671, timeout=40
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=104, currentFrameTime=141619.179671, timeout=40
*****dholbert --> GetCurrentImgFrameEndTime: frame=104, currentFrameTime=141619.179671, timeout=40
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=0, currentFrameTime=141659.179671, timeout=40
*****dholbert --> GetCurrentImgFrameEndTime: frame=0, currentFrameTime=141659.179671, timeout=40
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=1, currentFrameTime=141699.179671, timeout=40
*****dholbert --> GetCurrentImgFrameEndTime: frame=1, currentFrameTime=141699.179671, timeout=40
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=2, currentFrameTime=141739.179671, timeout=40
*****dholbert --> GetCurrentImgFrameEndTime: frame=2, currentFrameTime=141739.179671, timeout=40
[...]
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=69, currentFrameTime=144419.179671, timeout=40
*****dholbert --> GetCurrentImgFrameEndTime: frame=69, currentFrameTime=144419.179671, timeout=40
*****dholbert  --> DID advance frame
*****dholbert --> GetCurrentImgFrameEndTime: frame=70, currentFrameTime=144459.179671, timeout=40
*****dholbert Leaving RequestRefresh; frame=70, aTime=144469.994079, curFrameEndTime=144499.179671
}
(Seth, I'm hoping this is enough data for you to go on. Sorry for taking a little while to capture it. :))
Flags: needinfo?(seth)
(Also, FWIW: billm says he recalls seeing symptoms like this bug [second playthrough starting from random point] for like a year or more. So this might not be a recent-ish regression after all.)
[Tracking Requested - why for this release]:

(In reply to Daniel Holbert [:dholbert] from comment #20)
> (Also, FWIW: billm says he recalls seeing symptoms like this bug [second
> playthrough starting from random point] for like a year or more. So this
> might not be a recent-ish regression after all.)

Let's not track it for Beta uplift then. Leaving the NI for Seth so he can diagnose later.
I think dropping tracking makes sense. This is an edge case that has been around since at least 33 (Oct 2014) and has not blown up. Given that we have good debugging information and STR I would like to see this prioritized and addressed. As a reminder, 38 is the next ESR cycle and it would be good to get a fix in for 38 if the risk can be managed.
I would guess that this is fixed now.

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit auto_nag documentation.

Flags: needinfo?(seth.bugzilla)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: