Open Bug 638091 Opened 14 years ago Updated 17 minutes ago

Possible speed improvement by using shift-and in string searching

Categories

(Core :: JavaScript Engine, enhancement, P5)

x86
Linux
enhancement


People

(Reporter: h4writer, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:2.0b13pre) Gecko/20110302 Firefox/4.0b13pre
Build Identifier:

In string matching, when the pattern length is longer than 255 (sBMHPatLenMax), either UnrolledMatch<MemCmp> or UnrolledMatch<ManualCmp> is used. In my case (Linux) the slower manual loop is used. The complexity of the manual loop is O(mn) (m = pattern length, n = text length), so it isn't that good. Shift-and has the same worst-case complexity O(mn), but it has an interesting characteristic: the search itself is actually O(mn/31), so when the pattern length is less than 32 the complexity is effectively O(n). In the simulations I've done it is quite obviously linear, but the overhead is too big and BMH is faster for smaller patterns. BMH, however, only works up to a pattern length of 255. Beyond that, the shift-and algorithm with some adjustments becomes interesting.

The adjustments are as follows: use the shift-and algorithm on the first 31 characters of the pattern. When a match is found in the text, use the slower manual loop to check the remaining part of the pattern. Mostly the slower manual loop is executed only once, because if the first 31 characters match, the chance is quite high that the rest will match too. A nice property of this implementation is that even in the worst case, when the manual loop is always executed, it will be as fast as or faster than the manual loop alone.

Now some timings:

- Text with 13033779 characters, query with 1722 characters:
  Without shift-and: real 0m0.941s, 0m0.933s, 0m0.945s, 0m0.943s
  With shift-and: real 0m0.908s, 0m0.926s, 0m0.920s, 0m0.919s, 0m0.921s

- The "best BMH/normal ratio" testcase from bug 526348, altered to start with patterns of length 33, and with BMH disabled:
  Without shift-and: 415 424 434 443 450 455 469 473 486 486 504 503 514 514 528 548 544
  With shift-and: 343 352 360 370 381 387 400 405 412 426 432 445 450 462 465 478 484

My questions: is it worthwhile to keep looking into this, or is the speedup negligible? I also don't know how much faster MemCmp is than ManualCmp. I suppose this would be Linux-only, because I think MemCmp is much faster elsewhere due to the use of SIMD?

Reproducible: Always
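For concreteness, here is a minimal sketch of the hybrid described above. It assumes 8-bit characters for the mask table (the engine's 16-bit jschar strings would need a different table strategy), and every identifier is illustrative rather than taken from the attached patch:

#include <cstdint>
#include <cstddef>

// Shift-and over the first (up to) 31 pattern characters; a plain
// loop verifies the remainder whenever the tracked prefix matches.
// Returns the match start, or -1 if there is no match.
static int shiftAndHybrid(const char* text, size_t textLen,
                          const char* pat, size_t patLen)
{
    if (patLen == 0 || patLen > textLen)
        return patLen == 0 ? 0 : -1;

    const size_t prefixLen = patLen < 31 ? patLen : 31;

    // mask[c] has bit i set iff pat[i] == c, for i < prefixLen.
    uint32_t mask[256] = { 0 };
    for (size_t i = 0; i < prefixLen; i++)
        mask[(unsigned char) pat[i]] |= uint32_t(1) << i;

    const uint32_t hit = uint32_t(1) << (prefixLen - 1);
    uint32_t state = 0;

    for (size_t j = 0; j < textLen; j++) {
        // One shift and one AND advance every partial match in parallel.
        state = ((state << 1) | 1) & mask[(unsigned char) text[j]];
        if (state & hit) {
            size_t start = j + 1 - prefixLen;
            if (start + patLen > textLen)
                return -1;  // no later match can fit either
            size_t k = prefixLen;
            while (k < patLen && text[start + k] == pat[k])
                k++;        // the slower manual check of the tail
            if (k == patLen)
                return (int) start;
        }
    }
    return -1;
}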
Attached patch: Temporary patch
This patch adds the shift-and algorithm. The code works, but doesn't follow the style guidelines atm.
> I also don't know how much faster MemCmp is than ManualCmp.

On Linux it's not; that's why ManualCmp is used there. We can't tell what the heck the glibc memcmp is doing, but it's slow.

Luke, want to take a look at that patch?
Status: UNCONFIRMED → NEW
Ever confirmed: true
I like the idea of using a smart algorithm up to some max pattern size and then falling back to the basic O(mn) loop past that. Both BMH and shift-and have an up-front pre-processing cost and a small per-iteration cost (compared to the basic O(mn) algorithm), which is paid on the speculation that the algorithm will pay off. The BMH payoff is that it can skip characters. IIUC, the shift-and payoff is that the algorithm stays linear even with short partial matches. In this comparison, the BMH return on investment seems better for a comparable investment.

However, the additional benefit of the shift-and code you've demonstrated is that it can still be used for longer patterns. So that raises the question: can BMH be modified, similar to your shift-and code, to work for longer patterns? If so, then we could get the sublinear scanning behavior of BMH even for long patterns with partial matches of length < sBMHPatLenMax.
Attached patch: BMH adjusted
As requested, the BMH variant. Again, this is not final (e.g. not using MemCmp when possible).

- Text with 13033779 characters, query with 1722 characters: real 0m0.921s, 0m0.905s, 0m0.917s, 0m0.917s, 0m0.919s

Better than the loop, and the same as the shift-and variant. Still, I would have thought this would be a bit faster than shift-and. That's probably because there is only one full hit (only once are the first 255 characters the same, and that occurrence is the hit itself), so this isn't actually a very good testcase. I'm not sure how to test it against memcmp (I'm on Linux), but I think it would be interesting to know whether the hybrid is faster than memcmp alone.
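For reference, a hedged sketch of the bad-character table a Horspool-style search builds. A plausible reason for the 255 cap is that limiting the scanned prefix to 255 characters lets every skip value fit in a byte; that rationale, like all names below, is my assumption rather than the patch's code:

#include <cstdint>
#include <cstddef>

// Horspool bad-character table over the first m <= 255 pattern
// characters: skip[c] is how far the window may slide when c is the
// last text character under the window.
static void buildSkipTable(const char* pat, size_t m, uint8_t skip[256])
{
    for (size_t c = 0; c < 256; c++)
        skip[c] = (uint8_t) m;                          // c not in pattern
    for (size_t i = 0; i + 1 < m; i++)
        skip[(unsigned char) pat[i]] = (uint8_t)(m - 1 - i);
}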
What data are you measuring with? It'd be good to know if this is affecting real sites.
(In reply to comment #5)
> What data are you measuring with?

The data is a DNA sequence, so a small alphabet (c, t, g, a).

> It'd be good to know if this is affecting real sites.

I haven't found a site using patterns longer than 255 characters yet. If I do, I will definitely run some benchmarks on it.

I did another test, now with a bigger alphabet, so more in line with real searches. I took 203733904 characters of lorem ipsum from http://www.lipsum.com, and for the query/pattern I took the last line (400 characters):

- The loop: real 0m0.974s, 0m0.935s, 0m0.953s
- Shift-and (note that the loop is really heavily optimized; shift-and could use the same optimizations, like loop unrolling): real 0m1.731s, 0m1.727s, 0m1.743s, 0m1.736s
- Adjusted BMH: real 0m0.755s, 0m0.752s, 0m0.759s

In this case it is really clear that the adjusted BMH is faster.

Note 1: if I need to attach my tests, just let me know.
Note 2: I'm leaving Friday for a week, so I won't be posting updates next week.
(In reply to comment #6)
Thanks for trying out the adjusted BMH and continuing the experimentation; using lorem ipsum was a good idea. Since random English doesn't seem likely to make the O(mn) worst case of "the loop" manifest often, it doesn't surprise me that shift-and is slower than "the loop", unrolling or not, since shift-and does more work per iteration. The perf of the adjusted BMH is exciting and makes me wonder how much more speedup we could get by uglifying BMH with unrolling analogous to the loop.

Since we are in fancy/speculative-algorithm territory, it would be good to get a few more "real world" measurements before seriously considering landing a change. Lorem ipsum is probably a good sample for English searches, but the web probably has a lot of weird searches that aren't like English. One idea would be to hack up a browser build to print out the text/pattern for every Nth search, surf around for a bit, and turn the resulting log into a benchmark. Also useful would be to measure, for this harvested benchmark, the average text/pattern length, the variance, and the max. If comment 6 is any predictor, there seems to be some real optimization potential here, so I hope you are still interested in pursuing this :)
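A hypothetical sketch of that instrumentation idea (nothing like this exists in the tree; the hook point, log path, and sampling rate are all made up):

#include <cstdio>
#include <cstddef>

// Call from the string-search entry point: logs every Nth
// (text length, pattern) pair so the log can be replayed as a benchmark.
static void maybeLogSearch(const char* pat, size_t patLen, size_t textLen)
{
    static size_t counter = 0;
    const size_t N = 1000;
    if (++counter % N != 0)
        return;
    if (FILE* f = fopen("/tmp/str-search.log", "a")) {
        fprintf(f, "textLen=%zu patLen=%zu pat=%.*s\n",
                textLen, patLen, (int) patLen, pat);
        fclose(f);
    }
}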
(In reply to comment #7)
I like the idea of creating a benchmark from real-life search examples. It would make it easier to check that changes don't regress search speed, and it would definitely help to have some data on searches with 255+ pattern lengths. I'm willing to test the loop unrolling; I'm also curious how much it would improve BMH. I'll do that when I'm back.
I've looked a bit further. No loop unrolling yet; I couldn't find any speedup from it so far. But I've updated my patch to be faster. In my previous patch, a first loop ran until the 'reduced' pattern was matched, and afterwards a new loop was run to match the whole pattern. Now I run only one loop, which matches the whole pattern. The times listed here are again from "lorem ipsum":

- No patch applied: real 0m1.015s, 0m1.001s, 0m0.983s, 0m0.996s, 0m1.042s, 0m0.997s
- Patch "adjusted BMH": real 0m0.814s, 0m0.804s, 0m0.816s, 0m0.802s, 0m0.807s
- New patch "improved adjusted BMH": real 0m0.792s, 0m0.799s, 0m0.784s, 0m0.779s, 0m0.792s, 0m0.774s, 0m0.782s, 0m0.776s
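A sketch of what that single-loop version could look like, reusing buildSkipTable from the earlier sketch and verifying the full pattern in the same pass (again illustrative, not the patch itself):

// Scan with the (up to) 255-char prefix; on each prefix hit, verify
// the tail in place instead of running a second search loop.
static int bmhAdjusted(const char* text, size_t textLen,
                       const char* pat, size_t patLen)
{
    if (patLen == 0 || patLen > textLen)
        return patLen == 0 ? 0 : -1;

    const size_t m = patLen < 255 ? patLen : 255;  // scanned prefix
    uint8_t skip[256];
    buildSkipTable(pat, m, skip);

    size_t i = 0;
    while (i + patLen <= textLen) {
        size_t k = m;
        while (k > 0 && text[i + k - 1] == pat[k - 1])
            k--;                     // match the prefix right to left
        if (k == 0) {
            size_t t = m;
            while (t < patLen && text[i + t] == pat[t])
                t++;                 // verify the tail in the same pass
            if (t == patLen)
                return (int) i;
        }
        i += skip[(unsigned char) text[i + m - 1]];  // Horspool shift
    }
    return -1;
}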
Attached file: Testcase: lorem ipsum
I adjusted the patch to use memcmp, to test whether the loop or memcmp is faster on Windows/Mac. If somebody is willing to test the speed-up on those platforms, please go ahead: report the time it takes to complete the added testcase without any patches, with the "improved adjusted BMH" patch, and then with "Improved adjusted BMH using memcmp".
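For clarity, the only difference from the previous sketch would be the tail verification, along these lines (hypothetical naming again):

#include <cstring>

// Replaces the manual tail loop in the sketch above. Whether libc's
// (possibly SIMD) memcmp beats the manual loop is exactly the
// platform question this patch is meant to answer.
static bool tailMatches(const char* text, size_t i,
                        const char* pat, size_t m, size_t patLen)
{
    return memcmp(text + i + m, pat + m, patLen - m) == 0;
}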
I just came across this: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html The part that I found relevant was Boyer-Moore+unrolling :)
(In reply to comment #13)
> I just came across this:
> http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
> The part that I found relevant was Boyer-Moore+unrolling :)

That post is indeed very interesting. The loop unrolling is different from what I have done, and I expect some improvements there. They also use another shift implementation. I'll try to implement that as soon as possible, but I have a lot of work, and because of that I haven't been able to update/test as regularly as I wanted.

I've now been able to make a Windows build to see the difference in performance between the patch with memcmp and without, on the lorem testcase:

- Without patch: 0m0.772s, 0m0.765s, 0m0.774s, 0m0.769s, 0m0.762s, 0m0.769s
- Patch without memcmp: 0m0.638s, 0m0.636s, 0m0.632s, 0m0.636s, 0m0.669s, 0m0.637s, 0m0.674s
- Patch with memcmp: 0m0.656s, 0m0.657s, 0m0.661s, 0m0.658s, 0m0.670s, 0m0.662s, 0m0.664s

The only thing I can conclude at the moment is that performance is better with the patch. The difference between 'with memcmp' and 'without memcmp' isn't that big; it is in favor of 'without memcmp', but maybe the difference isn't big enough to rule memcmp out.
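A hedged reconstruction of the fast-skip-loop idea from that FreeBSD post: the table maps the last pattern character to 0, so the hot loop is a single table lookup per text character and only falls out on a candidate alignment. Tuned implementations unroll the i += d[...] step and plant a sentinel past the end of the text to drop the bounds check; the guarded form is kept below for safety, and the identifiers are illustrative:

#include <cstddef>

static int tunedBMH(const char* text, size_t n, const char* pat, size_t m)
{
    if (m == 0 || m > n)
        return m == 0 ? 0 : -1;

    size_t d[256];
    for (size_t c = 0; c < 256; c++)
        d[c] = m;
    for (size_t i = 0; i < m; i++)
        d[(unsigned char) pat[i]] = m - 1 - i;   // note: d[pat[m-1]] == 0

    size_t i = m - 1;
    for (;;) {
        size_t s;
        // The hot skip loop: races through the text and only stops
        // when text[i] equals the last pattern character.
        while (i < n && (s = d[(unsigned char) text[i]]) != 0)
            i += s;
        if (i >= n)
            return -1;
        // Candidate alignment ends at i; compare the rest right to left.
        size_t j = m - 1;
        while (j > 0 && text[i - m + j] == pat[j - 1])
            j--;
        if (j == 0)
            return (int)(i - m + 1);
        i += 1;  // simplest resume; tuned versions use a mismatch shift
    }
}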
Is this still relevant?
A month ago I looked at this again, and it still applies: it is definitely a win when searching strings larger than 256 characters. I tried loop unrolling, as suggested in comment 13 by Luke. That didn't give any gains (I suspect because our implementation has a bad-character test, which makes loop unrolling harder). There is also a difference between BMH (what we use) and BM (what grep uses), so I'm not eager to switch to BM without enough tests. BM with loop unrolling also uses some tricks that mostly increase speed, but the worst case gets noticeably slower. I don't have enough time to do those tests at the moment. Maybe I will in the future, but I can't guarantee it.
Assignee: general → nobody
Is this a duplicate of the more recent string matching bugs?
Flags: needinfo?(hv1989)
Actually, no. The recent string matching bugs are for cases where we cannot use a search table (e.g. the table would be too big, or the overhead of creating the table doesn't outweigh the search-speed gain). This patch is to improve the speed in cases where we can create a search table.
Flags: needinfo?(hv1989)
Severity: normal → S3

Chrome: https://share.firefox.dev/4gCD716 (130ms)
Nightly: 64ms

Should we close this bug?

Even if we are faster than Chrome here, it seems like there is some opportunity for future improvement, so I'm inclined to keep this around.

Blocks: sm-js-perf
Severity: S3 → S4
Priority: -- → P5