Bug 933149 (closed). Opened 12 years ago; closed 4 months ago.

PodCopy "optimizes" memcpy to a loop of smaller memcpys

Categories

(Core :: MFBT, enhancement, P3)

enhancement

Tracking


RESOLVED FIXED
140 Branch
Tracking Status
firefox140 --- fixed

People

(Reporter: jruderman, Assigned: mgaudet)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(2 files)

http://hg.mozilla.org/mozilla-central/rev/30cabba00388 added PodCopy, including a small-array path (for compilers that don't understand memcpy?).
http://hg.mozilla.org/mozilla-central/rev/b36175bbda47 changed the small-array path from an "=" loop to a silly "memcpy" loop.
http://hg.mozilla.org/mozilla-central/rev/de6afab8b383 moved PodCopy from jsutil.h to mfbt/PodOperations.h.
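For readers without the patches in front of them, the shape of the code being discussed is roughly the following. This is a simplified sketch, not the exact mfbt/PodOperations.h source; the threshold constant and names are illustrative only.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the small-array heuristic described above (names and threshold
// are hypothetical). For short arrays, the original code copied element by
// element with "="; a later revision turned that into a loop of per-element
// memcpy calls, which is the "silly" version this bug is about. Large
// arrays go through a single bulk memcpy either way.
template <typename T>
void PodCopySketch(T* dst, const T* src, std::size_t nelem) {
  if (nelem < 128) {
    // Small-array path: one copy per element.
    for (const T* end = src + nelem; src < end; ++src, ++dst) {
      *dst = *src;                       // the original "=" loop
      // std::memcpy(dst, src, sizeof(T)); // the later per-element memcpy loop
    }
  } else {
    // Large-array path: a single bulk copy.
    std::memcpy(dst, src, nelem * sizeof(T));
  }
}
```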
Attachment #825165 - Flags: review?(luke)
Attachment #825165 - Flags: review?(jwalden+bmo)
I believe the idea was that the memcpy calls with constant arguments would be inlined by the compiler. Do you have data that suggests this isn't the case?
(In reply to :Ms2ger from comment #1)
> I believe the idea was that the memcpy calls with constant arguments would
> be inlined by the compiler. Do you have data that suggests this isn't the
> case?

In the interest of sharing: I too was confused by this path in PodOperations. GCC and Clang will both inline memcpy with constant arguments (assuming the constant is small enough). MSVC (2010 is the version I checked, I think; maybe they've made this smarter) *will not* when optimizing for size, even when the inlining is obviously beneficial:

memcpy(&an_int, &some_other_int_sized_thing, 4);

Two movs is shorter than the function call, but MSVC prefers the function call.

The argument for the short-array path (made in bugzilla comments or IRC, unfortunately not preserved in the code) is twofold:

- The above;
- For arrays of runtime length, the inline loop saves the call to memcpy, which can be expensive relative to the cost of the copy.
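The constant-size case being discussed looks like the following in practice. This is a standalone illustration, not code from the tree; GCC and Clang compile it down to a plain load, while the old MSVC behavior described above was to emit a call when optimizing for size.

```cpp
#include <cstdint>
#include <cstring>

// A constant-size memcpy like this is the case discussed above: the size
// argument is a compile-time constant (4 bytes), so GCC/Clang inline it to
// a couple of moves. As a bonus, this idiom also avoids the unaligned-access
// undefined behavior of casting p to uint32_t* and dereferencing.
std::uint32_t LoadU32(const unsigned char* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));  // constant 4-byte copy
  return v;
}
```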
I don't understand your first argument. If MSVC [optimizing-for-size] never inlines even a constant-size memcpy, how can it be better to call memcpy in a loop rather than call memcpy once? For gcc and clang, we should just file bugs if they make bad decisions about whether to inline non-constant-size memcpy.
(In reply to Jesse Ruderman from comment #3)
> I don't understand your first argument. If MSVC [optimizing-for-size] never
> inlines even a constant-size memcpy, how can it be better to call memcpy in
> a loop rather than call memcpy once?

Oh, I see what's going on here. I assumed we were copying with operator=.
Comment on attachment 825165 [details] [diff] [review]
patch: remove the small-array path

That's a lot of snark there, chief. The small-length case was added because memcpy was a measurable slowdown on workloads that did a lot of small (but not constant-length) memcpys. The reason was that the memcpy becomes a libc call (PLT overhead), and the memcpy implementation spends extra operations preparing for a large copy that don't make sense when there is little to copy. This was all measured with old gcc/msvc, though. Feel free to find some workloads that pound on small PodCopy and remeasure on modern gcc/msvc.
Attachment #825165 - Flags: review?(luke)
Comment on attachment 825165 [details] [diff] [review]
patch: remove the small-array path

Review of attachment 825165 [details] [diff] [review]:
-----------------------------------------------------------------

What luke said. Numbers, Mrs. Landingham.
Attachment #825165 - Flags: review?(jwalden+bmo)
Severity: normal → S3

I think we should drop the PodCopy small-array heuristic and just use memcpy. I wandered into Bug 661957 yesterday, and then thought a bit about how things have changed in the last decade -- surely memcpy has had a lot of optimization pressure since then. So I did some experiments:

Evidence:

Based on this I strongly suspect we'd like to land the equivalent of Jesse Ruderman's patch above (I will provide a phabricator revision shortly); however, I would also like to understand the regression a little more:

Markus, would you be able to do a perf-compare on the provided patch, to see if you can find where the time has gone in ChartJS? (I suppose one other possibility is an unfortunate move of a GC into the timed area.)

In the meantime, I'll dispatch try runs for the rest of the OSes -- it's possible this will be OS/compiler dependent, and we may need to #ifdef this per platform.

Blocks: speedometer3
Flags: needinfo?(mstange.moz)
See Also: → 661957
Severity: S3 → N/A
Type: defect → enhancement
Priority: -- → P3
Whiteboard: [sp3]

Huh. A re-test showed more regressions.

See Also: → 1936756

Pushed to try with updated patch:
Base revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=d9485fabe00f5b6033e94899fd7e653c412d65e0
Local revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=4e363c28dfa1c4c68f038ef50530e23e8f770297

I can run a comparison report on the PGO builds from these pushes once they're ready.

Flags: needinfo?(mstange.moz)

(Retriggered the tests for more comparability)

See Also: → 1863131
Assignee: nobody → mgaudet
Attachment #9421328 - Attachment description: WIP: Bug 933149 - Remove PodCopy loop heuristic → Bug 933149 - Remove PodCopy loop heuristic r?mstange
Status: NEW → ASSIGNED

Basically, perf seems to be a wash, and at this point we've talked about this for 12 years. We should land the patch, and investigate further only if we see actual regressions.

Also, this lets us undo a compiler workaround added in Bug 1863131.

Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 140 Branch
Regressions: 1967062
QA Whiteboard: [qa-triage-done-c141/b140]
