Bug 638091 — Possible speed improvement by using shift-and in string searching
Status: NEW (open)
Opened 14 years ago · Updated 17 minutes ago
Categories: Core :: JavaScript Engine, enhancement, P5
People: Reporter: h4writer, Unassigned
References: Blocks 1 open bug
Attachments (5 files):
- patch, 1.66 KB
- patch, 2.36 KB
- patch, 1.84 KB
- text/plain, 67.85 KB
- patch, 1.67 KB
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:2.0b13pre) Gecko/20110302 Firefox/4.0b13pre
Build Identifier:
In string matching, when the pattern length is longer than 255 (sBMHPatLenMax), either UnrolledMatch<MemCmp> or UnrolledMatch<ManualCmp> is used. In my case (Linux) the slower manual loop is used.
The complexity of the manual loop is O(mn) (m = pattern length, n = text length), so it isn't that good.
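For reference, the basic manual loop being discussed is essentially the following (an illustrative Python sketch, not the actual UnrolledMatch<ManualCmp> code):

```python
# The baseline "manual loop": try the pattern at every text position.
# Worst case O(m*n) character comparisons. Illustrative sketch only.

def manual_find(text, pat):
    m, n = len(pat), len(text)
    for i in range(n - m + 1):      # n - m + 1 candidate positions
        for j in range(m):          # up to m comparisons per position
            if text[i + j] != pat[j]:
                break
        else:
            return i                # all m characters matched
    return -1
```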
Shift-and has the same O(mn) complexity, but it has an interesting characteristic: the search itself is actually O(mn/31), so when the pattern length is lower than 32 the complexity is effectively O(n). In simulations I've done it's quite obvious that it is linear, but the overhead is too big and BMH is faster for smaller patterns.
But BMH only works up to a pattern length of 255. Beyond that, the shift-and algorithm with some adjustments becomes interesting. The adjustments are as follows: use the shift-and algorithm on the first 31 characters of the pattern. When a match is found in the text, use the slower manual loop to check the rest of the pattern. Usually the slower manual loop only runs once, because if the first 31 characters match, the chance is quite high that the others will match too.
A nice thing about this implementation is that even in the worst case, when the manual loop is always executed, it will be as fast as or faster than the manual loop alone.
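The hybrid described above can be sketched roughly as follows. This is illustrative Python, not the attached patch; the function name is made up, and only the 31-character prefix limit is taken from the description:

```python
# Sketch of the proposed hybrid: shift-and (bitap) on the first 31
# pattern characters, then a plain comparison loop to verify the rest.

def shift_and_find(text, pat, prefix_max=31):
    m = len(pat)
    if m == 0:
        return 0
    k = min(m, prefix_max)            # width of the bit-parallel state
    # B[c]: bit i is set iff pat[i] == c, for the first k characters.
    B = {}
    for i in range(k):
        B[pat[i]] = B.get(pat[i], 0) | (1 << i)
    hit = 1 << (k - 1)                # set when the whole prefix matches
    D = 0
    for j, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & hit:
            start = j - k + 1
            # Prefix matched: verify the remaining characters with the
            # slow loop (usually this runs only once per search).
            if text[start:start + m] == pat:
                return start
    return -1
```

Each text character costs only a few word operations regardless of pattern content, which is why the scan behaves like O(n) whenever the prefix fits in one machine word.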
Now some timings
- text with 13033779 characters, query had 1722 characters:
Without shift-and:
real 0m0.941s
real 0m0.933s
real 0m0.945s
real 0m0.943s
With shift-and:
real 0m0.908s
real 0m0.926s
real 0m0.920s
real 0m0.919s
real 0m0.921s
- Took the "best BMH/normal ratio" testcase from bug #526348.
I altered it to start with patterns of length 33. I also disabled BMH.
Without shift-and:
415 424 434 443 450 455 469 473 486 486 504 503 514 514 528 548 544
With shift-and:
343 352 360 370 381 387 400 405 412 426 432 445 450 462 465 478 484
My questions: is it worthwhile to keep looking into this? Or is the speedup negligible? I also don't know how much faster MemCmp is than ManualCmp. I suppose this would be Linux-only, because I think MemCmp is much faster due to the use of SIMD?
Reproducible: Always
Reporter
Comment 1•14 years ago
This patch adds the shift-and algorithm. Code works, but doesn't follow the guidelines atm.
Comment 2•14 years ago
> I also don't know much MemCmp is faster then ManualCmp?
On Linux it's not; that's why ManualCmp is used there... We can't tell what the heck the glibc memcmp is doing, but it's slow...
Luke, want to take a look at that patch?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 3•14 years ago
I like the idea of using a smart algorithm up to some max pattern size and then falling back to the basic O(mn) loop past that.
Both BMH and shift-and have an up-front pre-processing cost and a small per-iteration cost (compared to the basic O(mn) algorithm), which is paid in the speculation that the algorithm will pay off. The BMH payoff is that it can skip characters. IIUC, the shift-and payoff is that the algorithm will stay linear even with short partial matches. In this comparison, the BMH return on investment seems better given comparable investments.
However, the additional benefit of the shift-and code you've demonstrated is that it can still be used for longer patterns. So that raises the question: can BMH be modified, similar to your shift-and code, to work for longer patterns? If so, then we could get the super-linear behavior of BMH even for long patterns with partial matches of length < sBMHPatLenMax.
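A minimal sketch of that idea (BMH on a fixed-length prefix of the pattern, then verify the tail with the plain loop) might look like the following, assuming the existing sBMHPatLenMax limit of 255. The names here are illustrative, not from the eventual patch:

```python
# Sketch of "adjusted BMH": Boyer-Moore-Horspool on the first
# prefix_max pattern characters; on a prefix hit, verify the full
# pattern with a plain comparison. Illustrative, not the landed code.

def adjusted_bmh_find(text, pat, prefix_max=255):
    m, n = len(pat), len(text)
    if m == 0:
        return 0
    k = min(m, prefix_max)            # BMH runs on the first k chars
    # Bad-character skip table for the k-character prefix.
    skip = {}
    for i in range(k - 1):
        skip[pat[i]] = k - 1 - i
    i = 0
    while i + m <= n:
        if text[i:i + k] == pat[:k]:
            # Prefix matched: verify the whole pattern (the slow part,
            # but it usually runs at most once).
            if text[i:i + m] == pat:
                return i
        # Skip based on the text character under the prefix's last slot.
        i += skip.get(text[i + k - 1], k)
    return -1
```

The skip distances still come from the prefix only, so the super-linear skipping of BMH is preserved even when m is far larger than prefix_max.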
Reporter
Comment 4•14 years ago
As requested, the BMH variant. Again, this is not final (e.g. not using MemCmp when possible).
- text with 13033779 characters, query had 1722 characters:
real 0m0.921s
real 0m0.905s
real 0m0.917s
real 0m0.917s
real 0m0.919s
Better than the loop, and the same as the shift-and variant. Still, I would have thought this would be a bit faster than shift-and. Well, probably because there is only one full hit (there is only one place where the first 255 characters match, and that is the hit too), so actually not a really good testcase.
I'm not sure how to test it with memcmp (I'm on Linux). But I think it would be interesting to know if the hybrid is faster than memcmp alone.
Comment 5•14 years ago
What data are you measuring with? It'd be good to know if this is affecting real sites.
Reporter
Comment 6•14 years ago
(In reply to comment #5)
> What data are you measuring with?
- The data is a sequence, so a small alphabet (c, t, g, a).
> It'd be good to know if this is affecting real sites.
- I didn't find a site using patterns longer than 255 chars yet. If I do, I will definitely run some benchmarks on it.
I did another test, now with a bigger alphabet. So more in line with real searches.
I took 203733904 characters of lorem ipsum from http://www.lipsum.com.
And for the query/pattern I took the last line (400 characters)
- The loop:
real 0m0.974s
real 0m0.935s
real 0m0.953s
- Shift-and algorithm (the loop is heavily optimized; shift-and could use the same optimizations, like loop unrolling):
real 0m1.731s
real 0m1.727s
real 0m1.743s
real 0m1.736s
- adjusted BMH
real 0m0.755s
real 0m0.752s
real 0m0.759s
In this case it is really clear the adjusted-BMH would be faster.
Note 1: if I need to add my tests, just let me know.
Note 2: I'm leaving Friday for a week, so I won't be posting updates next week.
Comment 7•14 years ago
(In reply to comment #6)
Thanks for trying out the adjusted-BMH and continuing experimentation. Using lorem ipsum was a good idea. Since random English doesn't seem like it would cause the O(mn) worst-case of "The loop" to manifest often, it doesn't surprise me that Shift-and would be slower than "The loop", unrolling or not, since Shift-and is doing more work per iteration. The perf of adjusted-BMH is exciting and makes me wonder how much more speedup we could get by uglifying BMH with unrolling analogous to The loop.
Since we are in fancy/speculative-algorithm territory, it would be good to get a few more "real world" measurements before seriously considering landing a change. Lorem ipsum is probably a good sample of English searches, but the web probably has a lot of weird searches that aren't like English. One idea would be to hack up a browser build to print out the text/pattern for every Nth search, surf around for a bit, and turn the resulting log into a benchmark. Also useful would be to measure, for this harvested benchmark, the average text/pattern length, variance, and max.
If comment 6 is any predictor, it seems like there is some real optimization potential here, so I hope you are still interested in pursuing this :)
Reporter
Comment 8•14 years ago
(In reply to comment #7)
I like the idea of creating a benchmark with real-life search examples. It would make it easier to check that changes don't regress the search speed, and it would definitely help to have some data on searches with 255+ pattern lengths.
I'm willing to test the loop unrolling. I'm also curious how much it would improve BMH. I'll do that when I'm back.
Reporter
Comment 9•14 years ago
I've looked a bit further. No loop unrolling yet; I couldn't find any speedup by using it so far.
But I've updated my patch to be faster. In my previous patch there was first a loop running until the 'reduced' pattern was matched, and afterwards a new loop was run to match the whole pattern. Now I only run the loop to match the whole pattern.
The times listed here are again from "lorem ipsum"
- no patch applied:
real 0m1.015s
real 0m1.001s
real 0m0.983s
real 0m0.996s
real 0m1.042s
real 0m0.997s
- using the patch "adjusted BMH":
real 0m0.814s
real 0m0.804s
real 0m0.816s
real 0m0.802s
real 0m0.807s
- using the new patch "improved adjusted BMH":
real 0m0.792s
real 0m0.799s
real 0m0.784s
real 0m0.779s
real 0m0.792s
real 0m0.774s
real 0m0.782s
real 0m0.776s
Reporter
Comment 10•14 years ago
Reporter
Comment 11•14 years ago
Reporter
Comment 12•14 years ago
I adjusted the patch to use memcmp. This is to test whether the loop or memcmp is faster on Windows/Mac. If somebody is willing to test the speed-up on those platforms, please go ahead. Just report the time it takes to complete the added testcase: first without any patches, then with the "improved adjusted BMH" patch, and afterwards with the "Improved adjusted BMH using memcmp" patch.
Comment 13•14 years ago
I just came across this:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
The part that I found relevant was Boyer-Moore+unrolling :)
Reporter
Comment 14•14 years ago
(In reply to comment #13)
> I just came across this:
> http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
> The part that I found relevant was Boyer-Moore+unrolling :)
That post is indeed very interesting. The loop unrolling is different from what I have done, and I expect some improvements there. They also use another shift implementation. I'll try to implement that as soon as possible, but I have a lot of work and because of that I haven't been able to update/test as regularly as I wanted.
Now I've been able to make a Windows build to see the difference in performance between the patch with memcmp and without.
the lorem testcase:
without patch:
0m0.772s
0m0.765s
0m0.774s
0m0.769s
0m0.762s
0m0.769s
patch without memcmp
0m0.638s
0m0.636s
0m0.632s
0m0.636s
0m0.669s
0m0.637s
0m0.674s
patch with memcmp
0m0.656s
0m0.657s
0m0.661s
0m0.658s
0m0.670s
0m0.662s
0m0.664s
The only thing I can conclude atm is that the performance is better with the patch. The difference between 'with memcmp' and 'without memcmp' isn't that big. It is in favor of 'without memcmp', but maybe the difference isn't big enough to rule memcmp out.
Comment 15•13 years ago
Is this still relevant?
Reporter
Comment 16•13 years ago
A month ago I looked at it again and it still applies.
It is definitely a win when searching in strings larger than 256 characters.
I tried loop unrolling, like suggested in comment 13 by Luke.
This didn't give any gains (I suspect because in our implementation there is a bad-character test, and that made loop unrolling harder).
Also, there is a difference between BMH (what we use) and BM (what grep uses), so I'm not eager to switch to BM without having enough tests. BM with loop unrolling also uses some tricks that mostly increase speed, but the worst-case scenario also gets noticeably slower.
I don't have enough time to do those tests atm. Maybe I will in the future, but I can't guarantee it.
Updated•11 years ago
Assignee: general → nobody
Comment 17•11 years ago
Is this a duplicate of the more recent string matching bugs?
Flags: needinfo?(hv1989)
Reporter
Comment 18•11 years ago
Actually, no.
The recent string matching bugs are for when we cannot use a search table (e.g. the table is too big, or the overhead of creating the table doesn't outweigh the search speed gain).
This patch is to improve the speed for the case where we can create a search table.
Flags: needinfo?(hv1989)
Updated•2 years ago
Severity: normal → S3
Comment 19•6 months ago
Chrome: https://share.firefox.dev/4gCD716 (130ms)
Nightly: 64ms
Should we close this bug?
Comment 20•5 months ago
Even if we are faster than Chrome here, it seems like there is some opportunity for future improvement, so I'm inclined to keep this around.