Closed Bug 1287246 Opened 3 years ago Closed 3 years ago

Add yielding support to StreamingLexer

Categories

(Core :: ImageLib, defect)


Tracking


RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: seth, Assigned: seth)

References

Details

Attachments

(5 files, 7 obsolete files)

31.55 KB, patch
Details | Diff | Splinter Review
5.77 KB, patch
Details | Diff | Splinter Review
24.93 KB, patch
Details | Diff | Splinter Review
4.99 KB, patch
Details | Diff | Splinter Review
23.22 KB, patch
Details | Diff | Splinter Review
Alright, this is what all of the recent refactorings have been working towards. In this bug we'll add support for yielding to StreamingLexer. This will in turn allow us to decode animated images one frame at a time, which will enable a drastic reduction in memory usage for animated images.
The first step is to provide a way to communicate to Decoder code that a yield
has happened. We already have a LexerResult type; this patch just propagates it
out far enough that Decoder::Decode() can get ahold of it.

There can be multiple reasons to yield that are semantically different. For that
reason, this patch also adds a Yield enumeration, to which we can add reasons
for yielding as needed. In this patch, the only value is Yield::NEED_MORE_DATA,
which takes the place of a |Nothing()| value in the old |Maybe<TerminalState>|
API.
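To make the API shape concrete, here's a rough sketch of a LexerResult that carries either a TerminalState or a Yield reason. This is not the actual Mozilla code; the TerminalState values and the IsTerminal()/IsYield() accessors are assumptions made for illustration.

```cpp
#include <cassert>

// Sketch only: real definitions live in image/StreamingLexer.h and differ.
enum class TerminalState { SUCCESS, FAILURE };

enum class Yield
{
  NEED_MORE_DATA  // The lexer cannot continue without more data.
};

// A lex step now either reaches a terminal state or yields for a reason,
// replacing the old |Maybe<TerminalState>| where Nothing() meant
// "need more data".
class LexerResult
{
public:
  explicit LexerResult(TerminalState aState)
    : mIsTerminal(true), mTerminal(aState) { }
  explicit LexerResult(Yield aYield)
    : mIsTerminal(false), mYield(aYield) { }

  bool IsTerminal() const { return mIsTerminal; }
  bool IsYield() const { return !mIsTerminal; }

private:
  bool mIsTerminal;
  union {
    TerminalState mTerminal;
    Yield mYield;
  };
};
```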
Attachment #8771632 - Flags: review?(n.nethercote)
Attached patch (Part 1b) - Update tests. (obsolete) — Splinter Review
The tests need to be updated to reflect the change in part 1a. This is all
pretty mechanical.
Attachment #8771633 - Flags: review?(n.nethercote)
This is a pretty small change: we're going to add another piece of state for
unbuffered reads in part 3, so it'll be handy to keep them together and manage
them atomically. This patch adds an UnbufferedState struct that lets us do just
that.
Attachment #8771635 - Flags: review?(n.nethercote)
Here's the meat of the bug. I wish I could've broken this out into smaller
pieces, but it's not easy.

You can read the newly added documentation to understand how the yielding API is
meant to work, and I've tried to add comments wherever anything in the code
might be unclear, so hopefully the patch itself will serve as a pretty good
explanation of what's going on here. The structure of the code has changed a
little in order to make things as easy to understand as possible.

It's discussed in the patch, but I wanted to discuss here as well the reason
that after yielding, we return the same data. (Or the same data minus whatever
has already been consumed, in the unbuffered case.) I want to make it easy for
someone writing a decoder to support the pattern "get some data, yield to allow
the caller to do something with the data or maybe even terminate decoding, and
then come back and keep processing the data". As an example, imagine if we
implement metadata decodes with yielding, so that decoders don't need to know if
they're performing a metadata decode or not: we can read the header, yield, and
then continue processing the data in the header (for example by allocating
textures) if the caller decides to continue processing. If we don't have a way
to get back the same data that the state we yielded from had, doing this will be
all kinds of tricky and complicated in many cases. I want to minimize the
statefulness of the decoders, and this design reduces the amount of state the
decoders need to keep.

This is even more crucial in the unbuffered case; yielding would be useless if
we couldn't return the same data. Consider the PNG and JPEG decoders, which are
currently implemented with one giant unbuffered read. If we want to allow
yielding from these decoders, we have to be able to say "I consumed X bytes, so
drop those X bytes, but give me back the remaining bytes that I haven't yet
consumed". We *do*, in fact, want to start yielding from those decoders, and
that's why we need the ability to get the same data back for unbuffered reads.

Hopefully this explains why yielding works this way, because I know it might not
be obvious at first glance. Of course, it's probably clear that if you *don't*
need the data from the state you yielded from, you can trivially just jump to
another state, so the convenience level remains high even in that case. Easy peasy.
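As a toy illustration of the redelivery contract described above: after a state yields, the state it yielded to receives the same data again, no matter what the caller feeds in next. This is a simplified model, not the real StreamingLexer; ToyLexer, Feed, and Action are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A lexer state either continues to the next state with fresh data,
// yields (pausing; on resume the *next* state sees the same bytes
// again), or finishes.
enum class Action { CONTINUE, YIELD, DONE };

struct ToyLexer
{
  using StateFn = std::function<Action(const std::vector<int>&)>;

  std::vector<StateFn> mStates;
  size_t mCurrentState = 0;
  bool mRedeliver = false;      // set when the last step yielded
  std::vector<int> mLastChunk;  // data to redeliver after a yield

  // Feed one chunk; returns true when the state machine finishes.
  bool Feed(const std::vector<int>& aChunk)
  {
    // After a yield, the resumed state gets the previous chunk again.
    const std::vector<int>& data = mRedeliver ? mLastChunk : aChunk;
    mRedeliver = false;
    Action action = mStates[mCurrentState](data);
    switch (action) {
      case Action::YIELD:
        mLastChunk = data;
        mRedeliver = true;
        [[fallthrough]];  // a yield also advances to the next state
      case Action::CONTINUE:
        ++mCurrentState;
        return false;
      case Action::DONE:
        return true;
    }
    return true;
  }
};
```

The point of the model: the second chunk the caller feeds is ignored, because the post-yield state is owed the data the yielding state saw.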
Attachment #8771637 - Flags: review?(n.nethercote)
Alright, if we want to trust this yielding stuff works as intended, we need to
be sure that yielding is giving us the same data that the state we yielded from
got. That's a lot easier to believe if the data isn't a repeated [1, 2, 3]
pattern, so this is a quick patch to make every byte in the test array in
TestStreamingLexer unique.
Attachment #8771638 - Flags: review?(n.nethercote)
Here are some new tests. The tests are mostly for the new yielding code,
although I also added a test or two related to the edge case of what state you
start StreamingLexer in. (I'm sure you noticed from the code in part 3 that I
gave this some thought.)
Attachment #8771639 - Flags: review?(n.nethercote)
Assignee: nobody → seth
Fixed a warning in the test code.
Attachment #8771806 - Flags: review?(n.nethercote)
Attachment #8771639 - Attachment is obsolete: true
Attachment #8771639 - Flags: review?(n.nethercote)
Comment on attachment 8771632 [details] [diff] [review]
(Part 1a) - Expose LexerResult from the StreamingLexer API and add an explicit Yield type.

Review of attachment 8771632 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/Decoder.cpp
@@ +121,5 @@
>      PROFILER_LABEL("ImageDecoder", "Decode", js::ProfileEntry::Category::GRAPHICS);
>      AutoRecordDecoderTelemetry telemetry(this);
>  
> +    return DoDecode(*mIterator, aOnResume);
> +  }();

Is using the lambda necessary here? Seems like an overly-clever way of doing things. Please remove it if possible.

::: image/StreamingLexer.h
@@ +299,5 @@
>            // the network yet. We don't have to do anything special; the
>            // SourceBufferIterator will ensure that |aOnResume| gets called when
>            // more data is available.
> +          result = Some(LexerResult(Yield::NEED_MORE_DATA));
> +          break;

With this change we'll go back around the loop; before we didn't. Is that intended?
Attachment #8771632 - Flags: review?(n.nethercote) → review+
Attachment #8771633 - Flags: review?(n.nethercote) → review+
Comment on attachment 8771635 [details] [diff] [review]
(Part 2) - Store the state for unbuffered reads in a separate structure in StreamingLexer.

Review of attachment 8771635 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/StreamingLexer.h
@@ +361,4 @@
>        MOZ_ASSERT(mTransition.UnbufferedState() ==
>                     unbufferedTransition.NextState());
> +      mUnbufferedState->mBytesRemaining -=
> +        std::min(mUnbufferedState->mBytesRemaining, aIterator.Length());

The introduction of the min() call here changes the functionality slightly. I think you're just being more cautious, to make sure mBytesRemaining never goes negative. Is that right? A comment about this might be useful, possibly with a reference to the new assertion at the top of the function.
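For reference, the clamped-subtraction idiom being discussed looks like this. SubtractClamped is an invented name for illustration; the patch inlines the expression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// |mBytesRemaining| is unsigned, so a plain subtraction would wrap
// around on underflow. Subtracting at most the current value clamps the
// counter at zero instead.
size_t SubtractClamped(size_t aRemaining, size_t aConsumed)
{
  return aRemaining - std::min(aRemaining, aConsumed);
}
```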
Attachment #8771635 - Flags: review?(n.nethercote) → review+
Comment on attachment 8771637 [details] [diff] [review]
(Part 3) - Add support for yielding to StreamingLexer.

Review of attachment 8771637 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/StreamingLexer.h
@@ +49,5 @@
>  /// Possible yield reasons for the lexer.
>  enum class Yield
>  {
> +  NEED_MORE_DATA,   // The lexer cannot continue without more data.
> +  OUTPUT_AVAILABLE  // There is output available for the caller to consume.

|Yield| and |ControlFlowStrategy| are isomorphic. Will this always be the case? Is having them as separate types important?

@@ +214,5 @@
> +  /**
> +   * Continue receiving unbuffered data. @aUnbufferedState should be the same
> +   * state as the @aUnbufferedState specified in the preceding call to
> +   * ToUnbuffered(). @aSize indicates the amount of data that has already been
> +   * consumed; the next state will same data that the current state delivered,

Garbled sentence: "the next state will same data".

@@ +357,5 @@
> +      // semantically (since yield states are supposed to deliver the same data
> +      // as previous states, and there's no previous state here), but more
> +      // importantly, it's necessary to advance a SourceBufferIterator at least
> +      // once before you can read from it, and adding the necessary checks to
> +      // avoid that issue to Lex() has the potential of masking real bugs. So

Moving the "to Lex()" after "checks" will make this sentence easier to parse, IMO.

@@ +383,5 @@
> +    // If the lexer requested a yield last time, we deliver the same data again
> +    // before we read anything else from |aIterator|. Note that although to the
> +    // callers of Lex(), Yield::NEED_MORE_DATA is just another type of yield,
> +    // internally they're different in that we don't redeliver the same data in
> +    // the Yield::NEED_MORE_DATA case, and |mYieldingToState| to state is not

"to state" doesn't make sense here. Omit?

@@ +385,5 @@
> +    // callers of Lex(), Yield::NEED_MORE_DATA is just another type of yield,
> +    // internally they're different in that we don't redeliver the same data in
> +    // the Yield::NEED_MORE_DATA case, and |mYieldingToState| to state is not
> +    // set. This means that for Yield::NEED_MORE_DATA, we just directly to the
> +    // loop below.

Garbled sentence: "we just directly to the".

@@ +488,5 @@
> +    // chunk. Make the necessary adjustments. (Note that the std::min call is
> +    // just belt-and-suspenders to keep this code memory safe even if there's
> +    // a bug somewhere.)
> +    const size_t toSkip =
> +      std::min(mUnbufferedState->mBytesConsumedInCurrentChunk, aIterator.Length());

Adding this kind of belt-and-suspenders comment would be a nice way to address my std::min comment from the previous patch's review.

@@ +524,5 @@
> +
> +    MOZ_ASSERT(mTransition.UnbufferedState() ==
> +                 unbufferedTransition.NextState());
> +
> +    // Perform bookkeeping.

Fun fact: "bookkeeping" is a word with three double letters in a row.

@@ +532,5 @@
> +    }
> +
> +    MOZ_ASSERT(unbufferedTransition.Size() == 0);
> +    mUnbufferedState->mBytesRemaining -=
> +      std::min(mUnbufferedState->mBytesRemaining, aLength);

Another brief comment about the std::min call?
Attachment #8771637 - Flags: review?(n.nethercote) → review+
Attachment #8771638 - Flags: review?(n.nethercote) → review+
Comment on attachment 8771806 [details] [diff] [review]
(Part 5) - Add GTests for StreamingLexer's support for yielding.

Review of attachment 8771806 [details] [diff] [review]:
-----------------------------------------------------------------

::: image/test/gtest/TestStreamingLexer.cpp
@@ +403,5 @@
> +    mSourceBuffer->Append(mData + i, 1);
> +    LexerResult result = mLexer.Lex(mIterator, mCountResumes, DoLexWithYield);
> +
> +    switch (i) {
> +      case 2:

I don't understand why 2 gets special treatment here.

@@ +681,5 @@
> +
> +  // We expect 2 resumes because TestState::ONE consumes 3 bytes and then
> +  // yields. When the lexer resumes at TestState::TWO, which receives the same 3
> +  // bytes, TerminateSuccess() gets called immediately.  That's three bytes
> +  // total,  which are delivered one at a time, requiring 2 resumes.

Nit: extra space after the comma.
Attachment #8771806 - Flags: review?(n.nethercote) → review+
BTW: thank you for all the lovely comments in these patches.
Thanks for the quick review! Some comments before I upload the updated patches:

(In reply to Nicholas Nethercote [:njn] from comment #10)
> Is using the lambda necessary here? Seems like an overly-clever way of doing
> things. Please remove it if possible.

I was wondering if I was going to get away with that. =) It's not necessary; the code is just a little uglier without it due to the necessity of initializing LexerResult with an arbitrary value. That can also be avoided by breaking that code out into a separate method, of course, so I may just do that.

> 
> ::: image/StreamingLexer.h
> @@ +299,5 @@
> >            // the network yet. We don't have to do anything special; the
> >            // SourceBufferIterator will ensure that |aOnResume| gets called when
> >            // more data is available.
> > +          result = Some(LexerResult(Yield::NEED_MORE_DATA));
> > +          break;
> 
> With this change we'll go back around the loop; before we didn't. Is that
> intended?

We'll drop out of the loop at the bottom, since |result| is a |Some()| value. I figured we may as well handle all of the cases the same way now that LexerResult can actually express the "need more data" concept.
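The control flow being described (every case just fills in |result|, and the loop exits at the bottom once it holds a value) can be sketched like this, using std::optional in place of Mozilla's Maybe. RunUntilResult and Outcome are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

enum class Outcome { NEED_MORE_DATA, DONE };

// Walk a sequence of "is data available?" flags; stop at the first gap
// or at the end. No case breaks out directly: each one assigns |result|
// and lets the loop condition at the bottom drop us out.
Outcome RunUntilResult(const std::vector<bool>& aHaveData)
{
  std::optional<Outcome> result;
  size_t i = 0;
  do {
    if (i >= aHaveData.size() || !aHaveData[i]) {
      // No more data available yet: record it; the loop exits below.
      result = Outcome::NEED_MORE_DATA;
    } else if (i + 1 == aHaveData.size()) {
      result = Outcome::DONE;
    }
    ++i;
  } while (!result);
  return *result;
}
```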

(In reply to Nicholas Nethercote [:njn] from comment #11)
> The introduction of the min() call here changes the functionality slightly.
> I think you're just being more cautious, to make sure mBytesRemaining never
> goes negative. Is that right? A comment about this might be useful, possibly
> with a reference to the new assertion at the top of the function.

Yep, that's right. I'll discuss this more in a followup comment, but I ended up pulling this code out into a separate method.

(In reply to Nicholas Nethercote [:njn] from comment #12)
> |Yield| and |ControlFlowStrategy| are isomorphic. Will this always be the
> case? Is having them as separate types important?

I don't expect them to always remain isomorphic, no. The idea is that we can expand this enum to provide more reasons for yielding, which will make calling code clearer and should make it easier to add assertions. In particular, Yield::METADATA_AVAILABLE seems like a natural extension. I haven't spent any time thinking this through in detail, but I think we can probably make decoders not even know whether they exist for metadata decoding or a normal decode; they can just yield, and continue their work if the caller wants to let them.
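To illustrate, here's a hypothetical caller-side dispatch if the enum grew the suggested value. Everything beyond Yield::NEED_MORE_DATA here is speculative: METADATA_AVAILABLE, CallerAction, and HandleYield are not part of the patch.

```cpp
#include <cassert>

enum class Yield { NEED_MORE_DATA, METADATA_AVAILABLE };

enum class CallerAction { WAIT_FOR_DATA, STOP_AFTER_METADATA, RESUME };

CallerAction HandleYield(Yield aReason, bool aMetadataDecodeOnly)
{
  switch (aReason) {
    case Yield::NEED_MORE_DATA:
      return CallerAction::WAIT_FOR_DATA;
    case Yield::METADATA_AVAILABLE:
      // The decoder doesn't know which kind of decode this is; the
      // caller decides whether to stop at the metadata or let decoding
      // continue.
      return aMetadataDecodeOnly ? CallerAction::STOP_AFTER_METADATA
                                 : CallerAction::RESUME;
  }
  return CallerAction::RESUME;  // unreachable; silences warnings
}
```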

(In reply to Nicholas Nethercote [:njn] from comment #14)
> BTW: thank you for all the lovely comments in these patches.

Glad they helped!
(In reply to Nicholas Nethercote [:njn] from comment #13)
> Comment on attachment 8771806 [details] [diff] [review]
> (Part 5) - Add GTests for StreamingLexer's support for yielding.
> 
> Review of attachment 8771806 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: image/test/gtest/TestStreamingLexer.cpp
> @@ +403,5 @@
> > +    mSourceBuffer->Append(mData + i, 1);
> > +    LexerResult result = mLexer.Lex(mIterator, mCountResumes, DoLexWithYield);
> > +
> > +    switch (i) {
> > +      case 2:
> 
> I don't understand why 2 gets special treatment here.

That's the place where DoLexWithYield() actually yields, so we check for it here.
Here's an updated version of part 1. The review comments are addressed, and parts
1a and 1b are folded together. I decided to just initialize the LexerResult
value in Decoder::Decode() with TerminalState::FAILURE, which eliminates the
need for the lambda. For some reason that option seemed grosser to me Friday
night than it does now. =)
Attachment #8771632 - Attachment is obsolete: true
Attachment #8771633 - Attachment is obsolete: true
Attachment #8771635 - Attachment is obsolete: true
This updated version of part 3 addresses the review comments. There's one other
change, too. I noticed a bug when writing the tests for my next (and final,
don't worry =) addition to StreamingLexer. It was possible to trip up the code
as follows:

1. Start an unbuffered read. (Say it's for 3 bytes.)

2. Yield after only partially consuming the current chunk of the read. (Say the
   chunk holds all 3 bytes, but you only consumed 2.)

3. When you return to the unbuffered state after yielding, don't yield again, so
   that the current chunk of the read gets finished. (In this case, you'd
   consume the remaining byte, completing the unbuffered read.)

If you did this with the previous version of the code, the bookkeeping (now I
can't unsee the double letters!) would be wrong: we'd end up subtracting *1* from
|mBytesRemaining|, not 3, so we'd wrongly think we still needed to read 2 more
bytes. The issue is that we subtracted off the amount of data we consumed on the
final read, which, because we yielded earlier, was not necessarily the entire
chunk. Effectively we ended up forgetting about the data that we consumed when
we yielded.

This bug got introduced because of a confusion in the code between the length of
the data that we're passing to the lexer function, and the length of the current
chunk. The fix *could've* been a one-liner - I just needed to change the value
that gets subtracted from |mBytesRemaining|. However, when I was fixing this
issue, I noticed that we should've been updating |mBytesRemaining| in the early
return in UnbufferedReadAfterYield() as well. I decided instead to pull the
update out into its own little method, FinishCurrentChunkOfUnbufferedRead(),
which has the simultaneous advantages of clarifying what's happening by its
name, giving me more space to write comments (including clarifying what the
|std::min| calls are for), and letting me explicitly name the parameter
|aChunkLength| for further self-documentation.

So the fix could've been a one-liner (well, two one-liners) but I made it more
like a 15-liner to try to make the code more clear so similar mistakes don't
happen again.
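The bookkeeping fix can be modeled in miniature. This mirrors the description above but is a sketch, not the actual patch; the struct layout is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct UnbufferedState
{
  size_t mBytesRemaining = 0;
  size_t mBytesConsumedInCurrentChunk = 0;
};

// Subtract the *whole chunk* from the running total, not just whatever
// the final (post-yield) lexer call consumed; data consumed before a
// yield is part of the same chunk. The std::min call is
// belt-and-suspenders against underflow of the unsigned counter.
void FinishCurrentChunkOfUnbufferedRead(UnbufferedState& aState,
                                        size_t aChunkLength)
{
  aState.mBytesRemaining -=
    std::min(aState.mBytesRemaining, aChunkLength);
  aState.mBytesConsumedInCurrentChunk = 0;
}
```

With the pre-fix logic (subtracting only the final call's consumption), the reproduction above would leave 2 bytes "remaining" after a completed 3-byte read.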
Attachment #8771637 - Attachment is obsolete: true
Attachment #8771638 - Attachment is obsolete: true
I updated this part by adding a few new tests inspired by the issue I discussed
above. There are a couple of paranoia tests but the important new one is
ChunkPerStateWithUnbufferedYield(), which is a combination I hadn't covered
before, and which would've exposed the issue immediately. The step-by-step
instructions for reproducing it in part 3 are taken directly from this test.
Attachment #8771806 - Attachment is obsolete: true
Blocks: 1287367
Pushed by mfowler@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3ce690bdd3a8
(Part 1) - Expose LexerResult from the StreamingLexer API and add an explicit Yield type. r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/d858390c1386
(Part 2) - Store the state for unbuffered reads in a separate structure in StreamingLexer. r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/1888681263a2
(Part 3) - Add support for yielding to StreamingLexer. r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/faf0c4549a12
(Part 4) - Make the data checking in TestStreamingLexer more discerning. r=njn
https://hg.mozilla.org/integration/mozilla-inbound/rev/c8ba6a6eec22
(Part 5) - Add GTests for StreamingLexer's support for yielding. r=njn
Blocks: 1287691