Closed
Bug 594883
Opened 14 years ago
Closed 13 years ago
Speed up ARGB premultiply in PNG, JPG, GIF and BMP decoders
Categories
(Core :: Graphics, defect)
RESOLVED
WONTFIX
People
(Reporter: justin.lebar+bug, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: perf)
In nsPNGDecoder::row_callback, if the PNG is ARGB, we do:
  for (PRUint32 x=width; x>0; --x) {
    *cptr32++ = GFX_PACKED_PIXEL(line[3], line[0], line[1], line[2]);
    if (line[3] != 0xff)
      rowHasNoAlpha = PR_FALSE;
    line += 4;
  }
I got a significant speedup in bug 519400 by using a lookup table for the canvas's premultiply (here the premultiply is performed in GFX_PACKED_PIXEL). The canvas's premultiply is a little different from GFX_PACKED_PIXEL's premultiply, so we might not get as large a perf win. It's probably still worth experimenting with.
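For reference, a table-based version of the loop above could look roughly like the sketch below. This is not the code from bug 519400 or from gfxUtils.cpp: PremultiplyArith, PremultiplyTable, PackArgb and ConvertRowWithTable are hypothetical names, and round-to-nearest is an assumption that may not match what GFX_PACKED_PIXEL does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: premultiply one 8-bit channel by alpha,
// rounding to nearest (an assumption, see the note above).
static inline uint8_t PremultiplyArith(uint8_t c, uint8_t a) {
  return static_cast<uint8_t>((c * a + 127) / 255);
}

// 256 x 256 one-byte entries (64 KiB), indexed as [alpha * 256 + channel].
// Built once on first use.
static const std::vector<uint8_t>& PremultiplyTable() {
  static const std::vector<uint8_t> table = [] {
    std::vector<uint8_t> t(256 * 256);
    for (int a = 0; a < 256; ++a)
      for (int c = 0; c < 256; ++c)
        t[a * 256 + c] = PremultiplyArith(static_cast<uint8_t>(c),
                                          static_cast<uint8_t>(a));
    return t;
  }();
  return table;
}

// Pack four channels as 0xAARRGGBB.
static inline uint32_t PackArgb(uint8_t a, uint8_t r, uint8_t g, uint8_t b) {
  return (uint32_t(a) << 24) | (uint32_t(r) << 16) | (uint32_t(g) << 8) | b;
}

// Convert one row of R,G,B,A bytes to premultiplied packed ARGB.
// Returns true if every pixel in the row is fully opaque.
static bool ConvertRowWithTable(const uint8_t* line, uint32_t* out,
                                size_t width) {
  const uint8_t* table = PremultiplyTable().data();
  bool opaque = true;
  for (size_t x = 0; x < width; ++x, line += 4) {
    const uint8_t a = line[3];
    const uint8_t* row = table + a * 256;  // all entries for this alpha
    out[x] = PackArgb(a, row[line[0]], row[line[1]], row[line[2]]);
    opaque = opaque && (a == 0xff);
  }
  return opaque;
}
```

Each pixel costs three table loads (one per colour channel) in place of three multiply-and-divide steps, so any win depends on the table staying in cache.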
Comment 1•13 years ago
This seems interesting to test out. The logical place to change GFX_PACKED_PIXEL is in Decoder.cpp, as only the image decoders (GIF, JPG, PNG and BMP) use it.
The premultiply tables are already in gfxUtils.cpp, but they need to be externalized so that the image decoders in libpr0n, gfx itself and nsCanvasRenderingContext2D can all use the same tables. This will save 64K of memory as well as speed up the image decoders.
See Bug 633467 - 2 Premultiply tables both gfxUtil and nsCanvasRenderingContext2D for that.
So by applying bug 633467 we could then also let the image decoders use the same tables.
Depends on: 633467
Summary: Speed up ARGB premultiply in PNG decoder → Speed up ARGB premultiply in PNG, JPG, GIF and BMP decoders
Reporter
Comment 2•13 years ago
See also bug 659725, which adds NEON code for this premultiply. You should be able to do something similar (albeit with more pain) using SSE.
Comment 3•13 years ago
Testing a premultiply table in place of some other local improvements showed no reduction in instruction count. Instead of register shuffling, it requires access to three 64K tables, which will blow any cache, etc...
See bug 517713 for some simpler optimizations.
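To illustrate the trade-off, the table-free alternative keeps everything in registers: a multiply, a shift and a couple of adds per channel, with no memory access beyond the pixel itself. The sketch below is not claimed to be the patch from bug 517713. DivideBy255 and PremultiplyPixel are hypothetical names, and the result is truncated rather than rounded, which is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper: exact floor(v / 255) for 0 <= v <= 255 * 255,
// computed as (257 * v + 255) >> 16, i.e. with shifts and adds only.
static inline uint32_t DivideBy255(uint32_t v) {
  return ((v << 8) + v + 255) >> 16;
}

// Premultiply one pixel given as separate 8-bit channels and pack it
// as 0xAARRGGBB. Colour channels are truncated, not rounded.
static inline uint32_t PremultiplyPixel(uint8_t a, uint8_t r, uint8_t g,
                                        uint8_t b) {
  return (uint32_t(a) << 24) |
         (DivideBy255(uint32_t(r) * a) << 16) |
         (DivideBy255(uint32_t(g) * a) << 8) |
         DivideBy255(uint32_t(b) * a);
}
```

The shift-based divide is exact over the whole range a product of two bytes can take, so it can replace the division without changing results.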
Comment 4•13 years ago
(In reply to Alfred Kayser from comment #3)
> Tested using a premultiply table instead of some other local improvements
> proved that there is no reduction in instruction count, and instead of
> register shuffling, it requires access to three 64K tables, so that will
> blow any cache, etc...
> See bug 517713 for some simpler optimizations.
Closing based on this. Please reopen if it turns out that there's still a speedup we can achieve here after all.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX