Closed Bug 815473 Opened 12 years ago Closed 12 years ago

Eliminate runtime computed constant tables

Categories

(Core :: General, defect)

All
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20
Tracking Status
firefox19 --- fixed
firefox20 --- fixed
b2g18 --- fixed

People

(Reporter: laszio.bugzilla, Assigned: laszio.bugzilla)

Attachments

(3 files, 7 obsolete files)

173.80 KB, patch
11.88 KB, patch
5.44 KB, patch
There are 4 coefficient tables computed at runtime:

   Num:    Value  Size Type    Bind   Vis      Ndx Name
194068: 012f3170 65536 OBJECT  LOCAL  DEFAULT   24 jpeg_nbits_table
180068: 012e0be4 65536 OBJECT  LOCAL  DEFAULT   24 _ZL17sPremultiplyTable
180066: 012d0be4 65536 OBJECT  LOCAL  DEFAULT   24 _ZL19sUnpremultiplyTable
 14744: 012bb15c 41984 OBJECT  LOCAL  DEFAULT   24 _ZL18gUnicodeToGBKTable

http://mxr.mozilla.org/mozilla-central/source/media/libjpeg/jchuff.c#24
http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUtils.cpp#24
http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUtils.cpp#25
http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ucvcn/nsGBKConvUtil.cpp#18

They can be easily converted to constants to save 233KB .bss at the expense of elf size. Is it worth it?
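
The 233KB figure can be checked by summing the sizes in the symbol listing above. This is only a minimal sketch: the sizes are copied from that listing, and the dictionary and variable names are made up for illustration.

```python
# Sizes in bytes, copied from the "Size" column of the symbol listing above.
table_sizes = {
    "jpeg_nbits_table": 65536,
    "sPremultiplyTable": 65536,
    "sUnpremultiplyTable": 65536,
    "gUnicodeToGBKTable": 41984,
}

total_bytes = sum(table_sizes.values())
total_kib = total_bytes // 1024  # the sum happens to divide evenly by 1024

print(total_bytes, total_kib)
```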

The following two dynamically allocated tables are actually redundant of the above, although they are in quite different source trees.
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/CanvasRenderingContext2D.cpp#3558
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/CanvasRenderingContext2D.cpp#3362
Blocks: 802446
> The following two dynamically allocated tables are actually redundant of the
> above, although they are in quite different source trees.
> http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> CanvasRenderingContext2D.cpp#3558
> http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> CanvasRenderingContext2D.cpp#3362

These two seem to be called rarely. I played for a while without hitting the breakpoints.
(In reply to Ting-Yuan Huang from comment #0)
> http://mxr.mozilla.org/mozilla-central/source/media/libjpeg/jchuff.c#24

Seems to me this could be converted to a 256-entry table storing the number of bits per byte, at very low cost (two array lookups and an add instead of one array lookup). Then storing it in .rodata definitely makes sense.

> http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUtils.cpp#24
> http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxUtils.cpp#25
> http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ucvcn/nsGBKConvUtil.
> cpp#18
> 
> They can be easily converted to constants to save 233KB .bss at the expense
> of elf size. Is it worth it?

Seems like a good idea especially since we can share that data across processes if we put it in .rodata.

> The following two dynamically allocated tables are actually redundant of the
> above, although they are in quite different source trees.
> http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> CanvasRenderingContext2D.cpp#3558
> http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> CanvasRenderingContext2D.cpp#3362

These are used by canvas getImageData and putImageData. These should definitely be modified to use the gfxUtils tables.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #2)
> (In reply to Ting-Yuan Huang from comment #0)
> > http://mxr.mozilla.org/mozilla-central/source/media/libjpeg/jchuff.c#24
> 
> Seems to me this could be converted to a 256-entry table storing the number
> of bits per byte, at very low cost (two array lookups and an add instead of
> one array lookup). Then storing it in .rodata definitely makes sense.

Sounds great, and it should have very little impact on performance. I'll have to bother the try server again...

By the way, despite its name, this is actually a ceil(lg(X + 1)) table, i.e. the bit length of X. Although we could use instructions like CLZ on modern CPUs, mixing in compiler-dependent built-ins or directives would be a mess.

> > The following two dynamically allocated tables are actually redundant of the
> > above, although they are in quite different source trees.
> > http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> > CanvasRenderingContext2D.cpp#3558
> > http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/
> > CanvasRenderingContext2D.cpp#3362
> 
> These are used by canvas getImageData and putImageData. These should
> definitely be modified to use the gfxUtils tables.

Maybe we can address this in the second patch?
(In reply to Ting-Yuan Huang from comment #3)
> Maybe we can address this in the second patch?

Sure.
Attachment #688692 - Flags: review?(smontagu)
Attachment #688694 - Flags: review?(justin.lebar+bug)
Comment on attachment 688692 [details] [diff] [review]
Part 1/3: Replace runtime computed gUnicodeToGBKTable by constants.

Review of attachment 688692 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, but can you just add a comment to cp936invmap.h explaining how it was generated?
Attachment #688692 - Flags: review?(smontagu) → review+
Comment on attachment 688693 [details] [diff] [review]
Part 2/3: Replace runtime calculated sUnpremultiplyTable/sPremultiplyTable with constants.

Review of attachment 688693 [details] [diff] [review]:
-----------------------------------------------------------------

Instead of checking in these giant tables, can we use macros or templates to generate them during compilation?

Something like
#define GEN_TABLE_ENTRY_PREMULTIPLY(v) (((v) >> 8)*((v) & 0xFF))/255,
#define GEN_TABLE_ENTRY_PREMULTIPLY_2(v) GEN_TABLE_ENTRY_PREMULTIPLY(v) \
                                         GEN_TABLE_ENTRY_PREMULTIPLY(v + 1)
#define GEN_TABLE_ENTRY_PREMULTIPLY_4(v) GEN_TABLE_ENTRY_PREMULTIPLY_2(v) \
                                         GEN_TABLE_ENTRY_PREMULTIPLY_2(v + 2)
...
#define GEN_TABLE_ENTRY_PREMULTIPLY_65536(v) GEN_TABLE_ENTRY_PREMULTIPLY_32768(v) \
                                             GEN_TABLE_ENTRY_PREMULTIPLY_32768(v + 32768)
static const uint8_t table[] = {
GEN_TABLE_ENTRY_PREMULTIPLY_65536(0)
};

You'd want to check that this produces the exact values we want :-)
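
For comparison, the same table can be produced by a small script. The sketch below uses exactly the arithmetic of the macro above, (high byte * low byte) / 255 with integer division. Whether that matches the rounding of the existing runtime-computed table is precisely what would need checking. The function names and the 16-values-per-line layout are assumptions made for illustration.

```python
def premultiply_entry(v):
    """Table entry for a 16-bit index v, using the macro's arithmetic."""
    high = v >> 8      # one operand (e.g. alpha) lives in the high byte
    low = v & 0xFF     # the other operand (e.g. color) lives in the low byte
    return (high * low) // 255  # integer division, as in the C macro


def build_premultiply_table():
    """All 65536 entries, in index order."""
    return [premultiply_entry(v) for v in range(65536)]


def format_c_initializer(values, per_line=16):
    """Render values as comma-separated hex bytes suitable for a C initializer."""
    lines = []
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]
        lines.append(", ".join("0x%02x" % x for x in chunk))
    return ",\n".join(lines) + "\n"


premultiply_table = build_premultiply_table()
```

The text returned by `format_c_initializer(premultiply_table)` could be written to a header and included between the braces of the array definition.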
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #9)
> Comment on attachment 688693 [details] [diff] [review]
> Part 2/3: Replace runtime calculated sUnpremultiplyTable/sPremultiplyTable
> with constants.
> 
> Review of attachment 688693 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Instead of checking in these giant tables, can we use macros or templates to
> generate them during compilation?
> 
> Something like
> #define GEN_TABLE_ENTRY_PREMULTIPLY(v) (((v) >> 8)*((v) & 0xFF))/255,
> #define GEN_TABLE_ENTRY_PREMULTIPLY_2(v) GEN_TABLE_ENTRY_PREMULTIPLY(v) \
>                                          GEN_TABLE_ENTRY_PREMULTIPLY(v + 1)
> #define GEN_TABLE_ENTRY_PREMULTIPLY_4(v) GEN_TABLE_ENTRY_PREMULTIPLY_2(v) \
>                                          GEN_TABLE_ENTRY_PREMULTIPLY_2(v + 2)
> ...
> #define GEN_TABLE_ENTRY_PREMULTIPLY_65536(v) GEN_TABLE_ENTRY_PREMULTIPLY_32768(v) \
>                                              GEN_TABLE_ENTRY_PREMULTIPLY_32768(v + 32768)
> static const uint8_t table[] = {
> GEN_TABLE_ENTRY_PREMULTIPLY_65536(0)
> };
> 
> You'd want to check that this produces the exact values we want :-)

Cool! But MSVC seems to be unable to handle this:

https://tbpl.mozilla.org/php/getParsedLog.php?id=17661781&tree=Try#error0
e:/builds/moz2_slave/try-w32-dbg/build/gfx/thebes/gfxUtils.cpp(61) : fatal error C1060: compiler is out of heap space

Although a simple compiler tweak should fix it, should we adopt the naive but compiler friendly approach?
Thanks, smontagu. I added the C snippet used to generate it. A more formal way to do this might be a Perl script similar to the one that generates cp936map.h. Since this table should be modified rarely, if ever, that seems unnecessary.
Attachment #688692 - Attachment is obsolete: true
Comment on attachment 688694 [details] [diff] [review]
Part 3/3: Replace the runtime computed FFS table with constants.

Have you done any performance measurements with this change?  It makes what appears to be a hot path in jpeg encoding significantly more complicated, by adding what appears to be an unpredictable branch.

We encode JPEGs when we take screenshots in B2G, so I'd want to know if this is a significant slowdown.

Please update media/libjpeg/MOZCHANGES and media/libjpeg/mozilla.diff.

Also, if the perf aspects don't look too bad, this would probably be worth upstreaming (perhaps with a compile-time pref to enable).
Attachment #688694 - Flags: review?(justin.lebar+bug) → feedback+
DRC, I'm curious if you have any thoughts on patch 3/3.  (DRC is the maintainer of libjpeg-turbo.)
(In reply to Ting-Yuan Huang from comment #10)
> Although a simple compiler tweak should fix it, should we adopt the naive
> but compiler friendly approach?

Another easy option is to generate it with a python script instead.
(In reply to Justin Lebar [:jlebar] from comment #12)
> Have you done any performance measurements with this change?  It makes what
> appears to be a hot path in jpeg encoding significantly more complicated, by
> adding what appears to be an unpredictable branch.

Are there any suggested benchmarks?

After tweaking the code a bit, the compiler now generates conditionally executed instructions, which are available on ARM:

    movs    r2, r0, lsr #8
    andeq   r2, r0, #255
    add     r3, pc, r3
    ldrb    r3, [r3, r2]    @ zero_extendqisi2
    movne   r0, #8
    moveq   r0, #0
    add     r0, r0, r3
    and     r0, r0, #255

Although it needs 7 more instructions, it should not be much slower due to better data locality, especially on CPUs with small caches.
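
The logic of the assembly above can be sketched in Python: pick the high byte if it is non-zero, otherwise the low byte, look it up in a 256-entry table, and add 8 when the high byte was used. The names `NBITS_256` and `nbits16` are hypothetical. This only models the logic and says nothing about performance.

```python
# 256-entry table: NBITS_256[b] is the number of bits needed to represent byte b.
NBITS_256 = [b.bit_length() for b in range(256)]


def nbits16(x):
    """Bit length of a 16-bit value using one lookup in the 256-entry table."""
    high = x >> 8
    if high:
        # High byte is non-zero: its bit length plus the 8 low bits.
        return 8 + NBITS_256[high]
    # High byte is zero: only the low byte matters.
    return NBITS_256[x & 0xFF]
```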

I'm also OK with the simplest approach: a 64KB table in .rodata. Since it can be shared, the overhead is really small :)
(In reply to Justin Lebar [:jlebar] from comment #13)
> DRC, I'm curious if you have any thoughts on patch 3/3.  (DRC is the
> maintainer of libjpeg-turbo.)

In libjpeg-turbo, I treat performance the same as I treat stability.  That means that every change that might potentially affect performance must undergo rigorous testing on a variety of platforms to ensure that performance has not regressed.  Since I don't get paid to work on libjpeg-turbo unless I have a specific commercial contract to enhance or fix something in it (which hasn't happened in a while-- mostly, it has reached a point at which everyone is happy with it), I'm very reluctant to make any modifications like this, because I would have to eat the cost of doing a complete performance regression test.

I did run a quick & dirty test with this on one platform, using the same benchmark methodology described here:  http://www.libjpeg-turbo.org/About/Performance.  I don't claim these images to be canonical, but they represent the sort of performance that I specifically care about (bearing in mind that the reason why I started maintaining libjpeg-turbo to begin with was as a way of accelerating remote display of 3D applications.)

Patch 3 regresses performance for 32-bit code by 5% (+/- 0.5%) in some cases and 64-bit code by 7% (+/- 0.5%) in some cases, and that's on my fastest machine.  It may very well be more on another machine, but this is enough that I do not trust the performance neutrality of the patch.  I've been doing this sort of thing long enough (17 years next month) that I can look at a patch like this and tell that it will have an impact, and my spidey sense told me 5% before I even ran the tests.  What probably would be performance-neutral is to build the entire 64k-entry table at compile time, but why bother?  This seems to be solving a problem that isn't a problem.  Additionally, I would still have to do a complete regression suite before even a patch I felt was performance-neutral would be accepted upstream.  Unless I could see a significant performance gain, there is no impetus for me to even fool with this, nor is there any impetus for a customer to pay me to fool with it.
(In reply to DRC from comment #16)
> What probably would be
> performance-neutral is to build the entire 64k-entry table at compile time,
> but why bother?  This seems to be solving a problem that isn't a problem.

Because if we have N different Gecko processes each instantiating the JPEG encoder, then the current approach means we have N*64K of identical data tables. We can at least convert this to 64K of .rodata which can be shared between the processes.

It sounds like we should just do that.
I'm sure there's a good reason why Mozilla does that, but it seems to me that spawning a lot of sub-processes in that manner is rather unique behavior.  The table is allocated globally in LJT, so even if you used threads, this would be a non-issue.

Since this is an issue that is likely to affect only Mozilla for the moment, I would suggest maintaining a downstream patch that pre-computes the entire 64k table at compile time.  Since LJT only computes the table once per process, I don't think a static table would improve upstream performance, so there isn't any reason for me to look at merging the patch upstream unless other applications were discovered to have the same sorts of multi-process memory footprint issues.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #14)
> (In reply to Ting-Yuan Huang from comment #10)
> > Although a simple compiler tweak should fix it, should we adopt the naive
> > but compiler friendly approach?
> 
> Another easy option is to generate it with a python script instead.

Did you mean
1) 2 giant table with scripts generating them, or
2) just 2 scripts to generate tables at compile time?

If 2), I would rather generate them by your nice macro, since either way I have to make changes to the build system :)
(In reply to Ting-Yuan Huang from comment #19)
> Did you mean
> 1) 2 giant table with scripts generating them, or
> 2) just 2 scripts to generate tables at compile time?

1 script could generate 2 files at build time.

> If 2), I would rather generate them by your nice macro, since either way I
> have to make changes to the build system :)

Those CPP macros aren't as nice as a python script :-)
Attachment #688693 - Attachment is obsolete: true
Attachment #688693 - Flags: review?(roc)
Attachment #691186 - Flags: review?(roc)
Comment on attachment 691186 [details] [diff] [review]
Part 2/3: Replace runtime calculated sUnpremultiplyTable/sPremultiplyTable with constants.

Review of attachment 691186 [details] [diff] [review]:
-----------------------------------------------------------------

Great!
Attachment #691186 - Flags: review?(roc) → review+
Attachment #688694 - Attachment is obsolete: true
Attachment #691206 - Flags: review?(justin.lebar+bug)
Comment on attachment 691206 [details] [diff] [review]
Part 3/3: Replace the runtime computed jpeg_nbits_table with constants.

r=me with some tweaks per below.

>diff --git a/media/libjpeg/genTables.py b/media/libjpeg/genTables.py
>new file mode 100755
>--- /dev/null
>+++ b/media/libjpeg/genTables.py
>@@ -0,0 +1,11 @@
>+#!/usr/bin/python
>+
>+import math
>+
>+def table_generator(f):
>+    return ",\n".join([", ".join(["0x%2.2x" % h for h in [f(i) for i in range(r,r+16)]]) for r in range(0, 65536, 16)])
>+
>+f = open("jpeg_nbits_table.h", "w")
>+f.write(table_generator(lambda i: math.ceil(math.log(i + 1, 2))) + "\n")
>+f.close()

This is kind of difficult to read (at least for me)  because of the nested list
comprehensions.

A simple loop feels more Pythonic to me:

> for i in range(65536):
>   f.write('%2d' % math.ceil(math.log(i + 1, 2)))
>   if i != 65535:
>     f.write(', ')
>   if (i + 1) % 16 == 0:
>     f.write('\n')

Note I switched to %2d instead of 0x%2.2x, although that's just my preference.
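
One caveat with both versions is that math.log() works in floating point, so the result at exact powers of two is not guaranteed to be exact in general. A sketch that stays in integer arithmetic, assuming a Python new enough to have int.bit_length() (2.7 or later), could look like the following. The function name and the choice to return the text rather than write a file are made up for illustration.

```python
def nbits_table_text(size=65536, per_line=16):
    """Return the table body as text: bit lengths of 0..size-1, per_line values per row."""
    out = []
    for i in range(size):
        # i.bit_length() equals ceil(log2(i + 1)) for i >= 0, computed without floats.
        out.append("%2d" % i.bit_length())
        if i == size - 1:
            out.append("\n")       # last entry: no trailing comma
        elif (i + 1) % per_line == 0:
            out.append(",\n")      # end of a row
        else:
            out.append(", ")
    return "".join(out)
```

The returned string could then be written to jpeg_nbits_table.h in the same way as in the patch.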

>diff --git a/media/libjpeg/Makefile.in b/media/libjpeg/Makefile.in
>--- a/media/libjpeg/Makefile.in
>+++ b/media/libjpeg/Makefile.in
>@@ -158,8 +158,13 @@ EXPORTS	= \
> 	jpegint.h \
> 	jpeglib.h \
> 	$(NULL)
> 
> # need static lib for some of the libimg componentry to link properly
> FORCE_STATIC_LIB = 1
> 
> include $(topsrcdir)/config/rules.mk
>+
>+CONSTANT_TABLES:
>+	$(PYTHON) $(srcdir)/genTables.py
>+
>+jchuff.$(OBJ_SUFFIX): CONSTANT_TABLES

I think you need to add CONSTANT_TABLES to .PHONY.  Or otherwise just add the
right dependencies:

  jpeg_nbits_table.h: genTables.py
      $(PYTHON) $(srcdir)/genTables.py

  jchuff.$(OBJ_SUFFIX): jpeg_nbits_table.h

should do it, I think.  You'd probably want to make this change in the other patch, too.


For the record, another alternative which might perform well is detailed at

  http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog.

This uses a small lookup table (so we avoid blowing out the cache), but is still branch-less.  We'd have to tweak it to make it work for 16-bit ints, though.

I don't think we need to investigate this unless the current approach is found to be lacking for some reason.
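
For illustration only, here is a Python model of one technique from that page, the "multiply and lookup" (de Bruijn) variant. Which variant was meant above is an assumption, and the function name is hypothetical. The 32-entry table and multiplier are the ones published for 32-bit values. For 16-bit inputs the final shift-by-16 smearing step is simply dropped, and the `(x != 0)` term turns the log2 result into a bit count without a branch.

```python
# Table and multiplier from the 32-bit "multiply and lookup" integer log2 technique.
DEBRUIJN_POSITION = [
    0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
    8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31,
]
DEBRUIJN_MULTIPLIER = 0x07C4ACDD


def nbits16_debruijn(x):
    """Bit length of a 16-bit value without data-dependent branches."""
    v = x
    # Smear the highest set bit downward so v becomes 2**k - 1.
    v |= v >> 1
    v |= v >> 2
    v |= v >> 4
    v |= v >> 8
    # Emulate the 32-bit unsigned multiply, then keep the top 5 bits as the index.
    index = ((v * DEBRUIJN_MULTIPLIER) & 0xFFFFFFFF) >> 27
    # The table yields floor(log2(x)); add 1 for non-zero x to get the bit count.
    return DEBRUIJN_POSITION[index] + int(x != 0)
```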
Attachment #691206 - Flags: review?(justin.lebar+bug) → review+
just renaming the title.
Attachment #689145 - Attachment is obsolete: true
Add the CONSTANT_TABLES rule to .PHONY.
Attachment #691186 - Attachment is obsolete: true
1. Add rule CONSTANT_TABLES to .PHONY.
2. Refine the script generating constant tables according to review comments.
Attachment #691206 - Attachment is obsolete: true
fix a typo in patch title...
Attachment #691684 - Attachment is obsolete: true
Ting-Yuan, can you ask for approval here if you think these patches are safe to take for B2G v1?
Comment on attachment 691681 [details] [diff] [review]
Part 1/3: Replace runtime computed gUnicodeToGBKTable by constants. r=smontagu

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: 41KB redundant memory per process
Testing completed: tryserver*1*2.
Risk to taking this patch (and alternatives if risky): Scrambled GBK characters if something goes wrong, which is very unlikely since this just replaces a runtime-computed table with a constant table. I also compared the tables by dumping them using GDB.
String or UUID changes made by this patch:

*1 There are two tests covering this change: xpcshell/tests/intl/uconv/tests/unit/test_encode_gbk.js, xpcshell/tests/intl/uconv/tests/unit/test_bug367026.js
*2 https://tbpl.mozilla.org/?tree=Try&rev=91097f392043
Attachment #691681 - Flags: approval-mozilla-b2g18?
Comment on attachment 691682 [details] [diff] [review]
Part 2/3: Replace runtime computed sUnpremultiplyTable/sPremultiplyTable with constants. r=roc

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: 128KB redundant memory per process
Testing completed: tryserver*1
Risk to taking this patch (and alternatives if risky): Rendering errors when using the corresponding APIs if something goes wrong, which is very unlikely since this just replaces a runtime-computed table with a constant table. I also compared the tables by dumping them using GDB.
String or UUID changes made by this patch:

*1 https://tbpl.mozilla.org/?tree=Try&rev=91097f392043
Attachment #691682 - Flags: approval-mozilla-b2g18?
Comment on attachment 692126 [details] [diff] [review]
Part 3/3: Replace runtime computed jpeg_nbits_table by constants. r=jlebar

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: 64KB redundant memory per process
Testing completed: tryserver*1
Risk to taking this patch (and alternatives if risky): Encoded/decoded JPEGs would be scrambled if something goes wrong, which is very unlikely since this just replaces a runtime-computed table with a constant table. I also compared the tables by dumping them using GDB.
String or UUID changes made by this patch:

*1 https://tbpl.mozilla.org/?tree=Try&rev=91097f392043
Attachment #692126 - Flags: approval-mozilla-b2g18?
Comment on attachment 691681 [details] [diff] [review]
Part 1/3: Replace runtime computed gUnicodeToGBKTable by constants. r=smontagu

I think we're triple-landing on m-c, aurora, and b2g18.
Attachment #691681 - Flags: approval-mozilla-aurora?
Attachment #691682 - Flags: approval-mozilla-aurora?
Attachment #692126 - Flags: approval-mozilla-aurora?
Product: Boot2Gecko → Core
Comment on attachment 691682 [details] [diff] [review]
Part 2/3: Replace runtime computed sUnpremultiplyTable/sPremultiplyTable with constants. r=roc

Review of attachment 691682 [details] [diff] [review]:
-----------------------------------------------------------------

Could you look over the Makefile change here?
Attachment #691682 - Flags: review?(gps)
Comment on attachment 691682 [details] [diff] [review]
Part 2/3: Replace runtime computed sUnpremultiplyTable/sPremultiplyTable with constants. r=roc

Review of attachment 691682 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/Makefile.in
@@ +397,5 @@
> +.PHONY: CONSTANT_TABLES
> +CONSTANT_TABLES:
> +	$(PYTHON) $(srcdir)/genTables.py
> +
> +gfxUtils.$(OBJ_SUFFIX): CONSTANT_TABLES

This is less than ideal for a few reasons.

1) genTables.py actually produces files. So, the use of a .PHONY target is inappropriate. You should instead list the files it produces explicitly in the Makefile.

2) The CONSTANT_TABLES rule will be evaluated needlessly. This isn't too bad because it doesn't take long. Still, it's something we like to avoid.

I'd do something like:

gen_tables_output := sPremultiplyTable.h sUnpremultiplyTable.h

$(gen_tables_output): $(srcdir)/genTables.py
    $(PYTHON) $(srcdir)/genTables.py

gfxUtils.$(OBJ_SUFFIX): $(gen_tables_output)

---

What you have will work, it just isn't fully proper. I think a follow-up bug should be sufficient if you don't want to deal with it now.
Attachment #691682 - Flags: review?(gps)
Comment on attachment 691681 [details] [diff] [review]
Part 1/3: Replace runtime computed gUnicodeToGBKTable by constants. r=smontagu

Approving since this appears to be a fairly low risk memory win. Before landing, please give QA some pointers on how to identify possible regressions in Aurora desktop/mobile and B2G18.
Attachment #691681 - Flags: approval-mozilla-b2g18?
Attachment #691681 - Flags: approval-mozilla-b2g18+
Attachment #691681 - Flags: approval-mozilla-aurora?
Attachment #691681 - Flags: approval-mozilla-aurora+
Attachment #691682 - Flags: approval-mozilla-b2g18?
Attachment #691682 - Flags: approval-mozilla-b2g18+
Attachment #691682 - Flags: approval-mozilla-aurora?
Attachment #691682 - Flags: approval-mozilla-aurora+
Attachment #692126 - Flags: approval-mozilla-b2g18?
Attachment #692126 - Flags: approval-mozilla-b2g18+
Attachment #692126 - Flags: approval-mozilla-aurora?
Attachment #692126 - Flags: approval-mozilla-aurora+
Keywords: verifyme
(In reply to Alex Keybl [:akeybl] from comment #39)
> Approving since this appears to be a fairly low risk memory win. Before
> landing, please give QA some pointers on how to identify possible
> regressions in Aurora desktop/mobile and B2G18.

(FYI, I'm holding off on uplifting based on this comment)
Flags: needinfo?(justin.lebar+bug)
> Approving since this appears to be a fairly low risk memory win. Before landing, please give QA some 
> pointers on how to identify possible regressions in Aurora desktop/mobile and B2G18.

Based on my experience with Mozilla QA and the nature of these patches, it seems extremely unlikely that our QA will be able to test this in a meaningful way beyond everyday use of the devices.

If you wanted to test this, you should look for

 1. artifacts in screenshots (screenshots use the JPEG encoder)
 2. artifacts in code which uses canvas, particularly canvas's getData functions (this tickles the premultiply code)
 3. (I'm not sure how you test the GBK table bit, or whether this is covered by automated tests.)

Given that the artifacts in all three cases might be too small to notice in casual observation, ensuring correctness demands either automated tests or code inspection, neither of which falls under QA's responsibilities.
Flags: needinfo?(justin.lebar+bug)
https://hg.mozilla.org/releases/mozilla-aurora/rev/36934087350a
https://hg.mozilla.org/releases/mozilla-aurora/rev/875a5bd397b4
https://hg.mozilla.org/releases/mozilla-aurora/rev/dfa3a86df2e7

This doesn't apply cleanly enough to b2g18 for me to comfortably uplift it (Part 2 is where I gave up).
Target Milestone: --- → mozilla20
I think I have this ported to b2g18; I'm compiling now to make sure.
Blocks: 823351
(In reply to Justin Lebar [:jlebar] from comment #41)
>  1. artifacts in screenshots (screenshots use the JPEG encoder)
>  2. artifacts in code which uses canvas, particularly canvas's getData
> functions (this tickles the premultiply code)

Thanks - this is a good list of things for QA to look out for (unexpected artifacts).
FWIW, the build parts of this patch are fixed in bug 827934
mass remove verifyme requests greater than 4 months old
Keywords: verifyme
Popping the stack on this, because I'm getting requests upstream for a similar solution for much the same reasons (reduced footprint on Android devices.)  The upstream submitter pointed out that, as long as we're using 8-bit JPEG, it should not be necessary for the table to have more than 2048 entries.  Can anyone comment on that?  It seems solid to me.  The original libjpeg code in fact checks for this:

  if (nbits > MAX_COEF_BITS+1)
    ERREXIT(state->cinfo, JERR_BAD_DCT_COEF);

(where MAX_COEF_BITS is 10 for 8-bit components.)
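
The 2048 figure follows directly from that check: if nbits never exceeds MAX_COEF_BITS + 1 = 11, every valid input is below 2^11. A minimal sketch of the arithmetic, with variable names chosen only for illustration:

```python
MAX_COEF_BITS = 10  # value quoted above for 8-bit components

max_nbits = MAX_COEF_BITS + 1          # largest bit count the check allows
table_entries = 1 << max_nbits         # number of inputs with at most max_nbits bits
largest_valid_input = table_entries - 1

# A table restricted to the valid input range.
small_nbits_table = [i.bit_length() for i in range(table_entries)]

print(table_entries, largest_valid_input, max(small_nbits_table))
```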
Because the check is removed altogether in LJT, I assume the inputs are always valid. Otherwise you may want to check the inputs before the lookup to prevent an out-of-bounds array index.
Hmmm...  Well, this is on the encoder side, so it's not as if a malformed JPEG file could cause the index to be out of bounds.  Basically, if the index is out of bounds, then it's because of a bug in the libjpeg code, and I'm assuming that since no such bug has surfaced in 15 years, it's probably safe.  :/

However, I'm asking the upstream submitter whether he can adopt the same approach that you guys are using, i.e. sharing the table among multiple instances.  Since Mozilla has already thoroughly tested that approach, it makes sense to leverage your testing.

Also, wanted to clarify that the latest version of this patch (where you pre-compute the entire table) is performance-neutral.  My comments above regarding a 5-7% performance deficit were referring to an earlier version.
The Mozilla patch for precomputing the bit table has been integrated upstream in our subversion trunk (libjpeg-turbo 1.4 evolving.)  I also integrated a patch that uses clz/bsr intrinsics for bit counting by default on ARM platforms, and that seems to actually improve performance significantly in some cases (x86 still uses the LUT, because the implementation of bsr seems to have more variation among x86 chips, and on some of them, it causes a significant performance regression.)

Feedback welcomed.
For newer gcc (>= 4.7?), would __builtin_clrsb() be a better fit than __builtin_clz()? Specifically,

#define JPEG_NBITS(x) (31 - __builtin_clrsb(x))
#define JPEG_NBITS_NONZERO(x) JPEG_NBITS(x)

Although it should be no faster on ARM and Thumb-2, which have conditional execution, it probably helps in Thumb mode. I don't know the behavior on x86/x64.
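
To see why the macro above needs no special case for zero, here is a Python model of the documented semantics of __builtin_clrsb(), which counts the redundant sign bits following the most significant bit, assuming a 32-bit int. The function names are hypothetical, and the model covers only the arithmetic, not the generated code.

```python
def clrsb32(x):
    """Count the bits below bit 31 that are identical to the sign bit (32-bit value)."""
    x &= 0xFFFFFFFF
    sign = (x >> 31) & 1
    count = 0
    for bit in range(30, -1, -1):
        if ((x >> bit) & 1) != sign:
            break
        count += 1
    return count


def jpeg_nbits_via_clrsb(x):
    """Model of the proposed macro: 31 minus the redundant sign bit count."""
    return 31 - clrsb32(x)
```

For x = 0 all 31 lower bits match the sign bit, so the result is 31 - 31 = 0, which is the value the table stores for zero.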
I'll ask the upstream submitter.  I'd be interested in seeing if clrsb() might benefit x86 as well.  Currently, none of the machines I have has a compiler new enough to test it.  :|
The reply was that, apparently, __builtin_clrsb() is implemented in software on ARM.  There is some question, however, as to whether the branch (x ? JPEG_NBITS_NONZERO(x) : 0) is necessary.  Apparently the ARM documentation says that clz(0) is well-defined, but the GCC documentation says that __builtin_clz(0) is undefined.

If you are interested in following up, please engage in the upstream discussion so I don't have to keep acting as intermediary:
https://sourceforge.net/p/libjpeg-turbo/patches/57