Bug 933149 (closed). Opened 12 years ago; closed 4 months ago.

PodCopy "optimizes" memcpy to a loop of smaller memcpys

Categories

(Core :: MFBT, enhancement, P3)

enhancement

Tracking


RESOLVED FIXED
140 Branch
Tracking Status
firefox140 --- fixed

People

(Reporter: jruderman, Assigned: mgaudet)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(2 files)

http://hg.mozilla.org/mozilla-central/rev/30cabba00388 added PodCopy, including a small-array path (for compilers that don't understand memcpy?).
http://hg.mozilla.org/mozilla-central/rev/b36175bbda47 changed the small-array path from an "=" loop to a silly "memcpy" loop.
http://hg.mozilla.org/mozilla-central/rev/de6afab8b383 moved PodCopy from jsutil.h to mfbt/PodOperations.h.
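For readers without the patches in front of them, the shape of the code being discussed is roughly the following. This is a simplified sketch, not the exact mfbt/PodOperations.h source; the threshold constant and names are illustrative only.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the small-array heuristic described above (names and threshold
// are hypothetical). For short arrays, the original code copied element by
// element with "="; a later revision turned that into a loop of per-element
// memcpy calls, which is the "silly" version this bug is about. Large
// arrays go through a single bulk memcpy either way.
template <typename T>
void PodCopySketch(T* dst, const T* src, std::size_t nelem) {
  if (nelem < 128) {
    // Small-array path: one copy per element.
    for (const T* end = src + nelem; src < end; ++src, ++dst) {
      *dst = *src;                       // the original "=" loop
      // std::memcpy(dst, src, sizeof(T)); // the later per-element memcpy loop
    }
  } else {
    // Large-array path: a single bulk copy.
    std::memcpy(dst, src, nelem * sizeof(T));
  }
}
```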
Attachment #825165 - Flags: review?(luke)
Attachment #825165 - Flags: review?(jwalden+bmo)
I believe the idea was that the memcpy calls with constant arguments would be inlined by the compiler. Do you have data that suggests this isn't the case?
(In reply to :Ms2ger from comment #1)
> I believe the idea was that the memcpy calls with constant arguments would
> be inlined by the compiler. Do you have data that suggests this isn't the
> case?

In the interest of sharing: I too was confused by this path in PodOperations. GCC and Clang will both inline memcpy with constant arguments (assuming the constant is small enough). MSVC (2010 is the version I checked, I think; maybe they've made this smarter) *will not* when optimizing for size, even when the inlining is obviously beneficial:

memcpy(&an_int, &some_other_int_sized_thing, 4);

Two movs is shorter than the function call, but MSVC prefers the function call.

The argument for the short-array path (made in bugzilla comments or IRC, unfortunately not preserved in the code) is twofold:

- The above;
- For arrays of runtime length, the inline loop saves the call to memcpy, which can be expensive relative to the cost of the copy.
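The constant-size case being discussed looks like the following in practice. This is a standalone illustration, not code from the tree; GCC and Clang compile it down to a plain load, while the old MSVC behavior described above was to emit a call when optimizing for size.

```cpp
#include <cstdint>
#include <cstring>

// A constant-size memcpy like this is the case discussed above: the size
// argument is a compile-time constant (4 bytes), so GCC/Clang inline it to
// a couple of moves. As a bonus, this idiom also avoids the unaligned-access
// undefined behavior of casting p to uint32_t* and dereferencing.
std::uint32_t LoadU32(const unsigned char* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));  // constant 4-byte copy
  return v;
}
```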
I don't understand your first argument. If MSVC [optimizing-for-size] never inlines even a constant-size memcpy, how can it be better to call memcpy in a loop rather than call memcpy once? For gcc and clang, we should just file bugs if they make bad decisions about whether to inline non-constant-size memcpy.
(In reply to Jesse Ruderman from comment #3)
> I don't understand your first argument. If MSVC [optimizing-for-size] never
> inlines even a constant-size memcpy, how can it be better to call memcpy in
> a loop rather than call memcpy once?

Oh, I see what's going on here. I assumed we were copying with operator=.
Comment on attachment 825165 [details] [diff] [review]
patch: remove the small-array path

That's a lot of snark there, chief. The small-length case was added because memcpy was a measurable slowdown on workloads that did a lot of small (but not constant-length) memcpys. The reason was that the memcpy becomes a libc call (PLT overhead), and the memcpy implementation spends extra operations preparing for a large copy that don't make sense when there is little to copy. This was all measured with old gcc/msvc, though. Feel free to find some workloads that pound on small PodCopy and remeasure on modern gcc/msvc.
Attachment #825165 - Flags: review?(luke)
Comment on attachment 825165 [details] [diff] [review]
patch: remove the small-array path

Review of attachment 825165 [details] [diff] [review]:
-----------------------------------------------------------------

What luke said. Numbers, Mrs. Landingham.
Attachment #825165 - Flags: review?(jwalden+bmo)
Severity: normal → S3

I think we should drop the PodCopy small-array heuristic and just use memcpy. I wandered into Bug 661957 yesterday, and then thought a bit about how things have changed in the last decade -- surely memcpy has had a lot of optimization pressure since then. So I did some experiments:

Evidence:

Based on this I strongly suspect we'd like to land the equivalent of Jesse Ruderman's patch above (I will provide a phabricator revision shortly); however, I would also like to understand the regression a little more:

Markus, would you be able to do a perf-compare on the provided patch, to see if you can find where the time has gone in ChartJS? (I suppose one other possibility is an unfortunate move of a GC into the timed area.)

In the meantime, I'll dispatch try runs for the rest of the OSes -- it's possible this will be OS/compiler dependent, and we may need to #ifdef this per platform.

Blocks: speedometer3
Flags: needinfo?(mstange.moz)
See Also: → 661957
Severity: S3 → N/A
Type: defect → enhancement
Priority: -- → P3
Whiteboard: [sp3]

Huh. A re-test showed more regressions.

See Also: → 1936756

Pushed to try with updated patch:
Base revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=d9485fabe00f5b6033e94899fd7e653c412d65e0
Local revision's try run: https://treeherder.mozilla.org/jobs?repo=try&revision=4e363c28dfa1c4c68f038ef50530e23e8f770297

I can run a comparison report on the PGO builds from these pushes once they're ready.

Flags: needinfo?(mstange.moz)

(Retriggered the tests for more comparability)

See Also: → 1863131
Assignee: nobody → mgaudet
Attachment #9421328 - Attachment description: WIP: Bug 933149 - Remove PodCopy loop heuristic → Bug 933149 - Remove PodCopy loop heuristic r?mstange
Status: NEW → ASSIGNED

Basically, perf seems to be a wash, and at this point we've talked about this for 12 years. We should land the patch, and investigate further only if we see actual regressions.

Also, this lets us undo a compiler workaround added in Bug 1863131.

Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 140 Branch
Regressions: 1967062
QA Whiteboard: [qa-triage-done-c141/b140]
