Open Bug 1406366 Opened 5 years ago Updated 5 years ago
Manually vectorize ns
Bidi::Get Logical Run
Before bug 1288761, nsBidi::GetLogicalRun uses ubidi_GetLogicalRun which is based on runs, so it's faster when there are a few runs but lots of characters. Bug 1288761 changes it to a level-based algorithm to address O(n^2) performance issue when there are lots of runs. But its current implementation can be slow in case there are few runs but lots of characters, since it compares one byte each time. We should be able to vectorize it to compare 4byte/8byte/16byte each time so that it can run faster. I've verified that there is no compiler we use so far do this auto-vectorization.
With SSE2, we can do 16 bytes each loop, with intrinsics * __m128i _mm_cmpeq_epi8 to compare 16 bytes together * _mm_movemask_epi8 to compress the result into an int32_t then we can locate the first one bit. Bad side of using SSE2 is that we would probably need more tests to ensure its correctness because it cannot be generalized to a template function... We can probably just write a 4/8 bytes version first...
Before spending too much effort on this, we should profile (or instrument) some examples with really long direction runs, and see whether the current implementation actually takes enough time to be worth optimizing further.
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.