Closed Bug 1754207 Opened 3 years ago Closed 3 years ago

2.8 - 1.76% compiler_metrics num_static_constructors / compiler_metrics num_static_constructors + 9 more (Windows) regression on Fri February 4 2022

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

defect

Tracking

()

RESOLVED WONTFIX
99 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox97 --- unaffected
firefox98 --- wontfix
firefox99 --- fixed

People

(Reporter: aesanu, Assigned: aaggarwal)

References

(Regression)

Details

(Keywords: perf-alert, regression)

Attachments

(1 file)

Perfherder has detected a build_metrics performance regression from push d4be0d9bbd6885ecaf5dc780195f16de7461455d. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
3% compiler_metrics num_static_constructors windows2012-64-shippable 107.00 -> 110.00
3% compiler_metrics num_static_constructors windows2012-64 110.00 -> 113.00
3% compiler_metrics num_static_constructors windows2012-32-shippable 114.00 -> 117.00
3% compiler_metrics num_static_constructors windows2012-32 118.00 -> 121.00
2% compiler_metrics num_static_constructors windows2012-64-shippable instrumented 132.00 -> 135.00
2% compiler_metrics num_static_constructors windows2012-32-shippable instrumented 141.00 -> 144.00
2% compiler_metrics num_static_constructors windows2012-64 157.00 -> 160.00
2% compiler_metrics num_static_constructors windows2012-32 159.00 -> 162.00
2% compiler_metrics num_static_constructors windows2012-64 fuzzing 159.00 -> 162.00
2% compiler_metrics num_static_constructors windows-mingw32 32 clang debug 169.00 -> 172.00
2% compiler_metrics num_static_constructors windows-mingw32 64 clang debug 170.00 -> 173.00

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(aaggarwal)

The push in question contains 3 of the 4 patches that eventually landed in central to close the bug 1746631. These 3 patches caused some failures which were later fixed as the 4th patch and landed in the form of the push 7907a10250a1a42425a1eb4918cc0be6d2e83dfa.

Does the build_metrics performance regression happen even after the 4th patch?

Flags: needinfo?(aaggarwal) → needinfo?(aesanu)

I don't fully understand this metric and its effect on the performance. (cc'ing author of bug 1464522) Is this regression acceptable?

There is a chance the constructors are coming from third-party intgemm library. Is there a way to tell which new three symbols were added?

Flags: needinfo?(mh+mozilla)

After running dump_syms on windows for jsshell build I can see only two "dynamic initializers" symbols:

    FUNC 11cd2c0 6c 0 intgemm::Int8Shift::PrepareBiasImpl<intgemm::callbacks::UnquantizeAndAddBiasAndWrite>::`dynamic initializer for 'run'()
    FUNC 11cd9b0 73 0 intgemm::Int8Shift::MultiplyImpl<intgemm::callbacks::UnquantizeAndAddBiasAndWrite>::`dynamic initializer for 'run'()

Let's assume that https://github.com/kpu/intgemm/blob/768aa689dda91f49024e647b5aee23e6b5d08556/intgemm/intgemm.h#L221 is the offender here. Looking at the code of GetCPUID will probably need to replace it with our own, e.g. to remove undesired get-environment-variable logic, console I/O, etc.

let us know your plans within 3 business days, or the offending patch(es) will be backed out

The intgemm is built only for Nightly and there is no urgency to back it out.

Flags: needinfo?(bugzilla)
Flags: needinfo?(mh+mozilla)

I think it's commenting that there are static constructors that implement the CPUID dispatch by setting function pointers to CPU-dependent functions. With some refactoring this could be reduced to one static constructor by using a struct.

The EnvironmentCPUID() function is only there to support tests that need to emulate different CPU models. It can be removed entirely if you'd like.

diff --git a/intgemm/intgemm.cc b/intgemm/intgemm.cc
index 31370e2..62e090b 100644
--- a/intgemm/intgemm.cc
+++ b/intgemm/intgemm.cc
@@ -14,8 +14,6 @@
 
 #include <stdlib.h>
 
-#include <iostream>
-
 namespace intgemm {
 
 namespace {
@@ -88,29 +86,10 @@ CPUType RealCPUID() {
 #endif
 }
 
-CPUType EnvironmentCPUID() {
-#if defined(_MSC_VER)
-  char env_override[11];
-  size_t len = 0;
-  if (getenv_s(&len, env_override, sizeof(env_override), "INTGEMM_CPUID")) return CPUType::AVX512VNNI;
-  if (!len) return CPUType::AVX512VNNI;
-#else
-  const char *env_override = getenv("INTGEMM_CPUID");
-  if (!env_override) return CPUType::AVX512VNNI; /* This will be capped to actual ID */
-#endif
-  if (!strcmp(env_override, "AVX512VNNI")) return CPUType::AVX512VNNI;
-  if (!strcmp(env_override, "AVX512BW")) return CPUType::AVX512BW;
-  if (!strcmp(env_override, "AVX2")) return CPUType::AVX2;
-  if (!strcmp(env_override, "SSSE3")) return CPUType::SSSE3;
-  if (!strcmp(env_override, "SSE2")) return CPUType::SSE2;
-  std::cerr << "Unrecognized INTGEMM_CPUID " << env_override << std::endl;
-  return CPUType::AVX512VNNI;
-}
-
 } // namespace
 
 CPUType GetCPUID() {
-  static const CPUType kLocalCPU = std::min(RealCPUID(), EnvironmentCPUID());
+  static const CPUType kLocalCPU = RealCPUID();
   return kLocalCPU;
 }

(Though in intgemm master I would want to ifdef this somehow to avoid breaking tests.)

And if the static const CPUType kLocalCPU = RealCPUID(); is a problem then it can be replaced with just return RealCPUID(); at the expense of repeatedly calling the function at load time.

It is true that intgemm::Int8Shift::MultiplyImpl<intgemm::callbacks::UnquantizeAndAddBiasAndWrite> is dynamically dispatched. This is by design so it can depend on the CPU model. However, if push comes to shove, the template arguments used are finite and could be enumerated in the cc file. I'm not sure that would change the static constructor count, so much as move them around though.

Flags: needinfo?(bugzilla)

At a high level, intgemm is a matrix multiply library that compiles multiple implementations for different CPU models. Then, at load time, it runs CPUID and initializes function pointers to go to the correct implementation. This load time mechanism is triggered by a static initializer. This initialization appears to add .000162 s to average load time i.e. it is below the precision of the time tool.

Can you help me understand the goal/requirements?

If it's loading time, eliminate the environment variable code as instructed above and I don't think you'll be able to measure the difference.

If it's the number of static initializers but one is still allowed, this might be doable by eliminating iostream (though the unsupported CPU error would also need to go and we like crash reporting) and instantiating C++ templates in the cc file rather than the header.

If it's a complete elimination of static initializers, the pointer initialization would move to an Init function. Risk of segfault due to uninitialized function pointers if Init isn't called before the other functions.

Script to reproduce timing:

#!/bin/bash
set -eo pipefail
mkdir repro
cd repro

# Download and compile intgemm
git clone https://github.com/kpu/intgemm
mkdir intgemm/build
pushd intgemm/build
cmake -DINTGEMM_DONT_BUILD_TESTS=ON ..
make -j4
popd

# A minimal program with just main
g++ -O3 -x c++ - <<<"int main() {}" -o without

# A minimal program using intgemm
cat >with.cc <<EOF
#include "intgemm/intgemm.h"
int main() { return (int)intgemm::kCPU; }
EOF
LIBRARY_PATH=intgemm/build g++ -Iintgemm -Iintgemm/build -O3 with.cc -o with -lintgemm

n=1000
# Run both programs 1000 times and time them.
for ((i=0;i<$n;++i)); do
  (time ./with) 2>>with.times || true
  (time ./without) 2>>without.times
done

#Sum up the seconds.  Assumes we won't have minutes.
without_total=$(grep real without.times |cut -d m -f 2 |cut -d s -f 1 |paste -sd+ |bc)
with_total=$(grep real with.times |cut -d m -f 2 |cut -d s -f 1 |paste -sd+ |bc)
without_avg=$(bc <<<"scale=10; $without_total/$n")
with_avg=$(bc <<<"scale=10; $with_total/$n")
echo Average time without intgemm $without_avg second, with intgemm $with_avg second
difference=$(bc <<<$with_avg-$without_avg)
echo Average difference in time $difference second

Script output:

Average time without intgemm .0011670000 second, with intgemm .0013290000 second
Average difference in time .0001620000 second

Can you help me understand the goal/requirements?

From my understanding, but I can be corrected here, the goal is zero impact on startup performance of Firefox and adding initializers can potentially impact that. It is impossible to proof that by just looking at the number of static constructors -- it needs to be further analyzed.

There is some uncertainty in EnvironmentCPUID since the performance of getenv is unpredictable and vary across platforms -- my recommendation to remove that using preprocessor. The RealCPUID operation is still under question, but we can argue it is needed/required here. There will be a change in the amount of dynamic initializers when intgemm is included.

Priority: -- → P2

Set release status flags based on info from the regressing bug 1746631

Has Regression Range: --- → yes

Changed iostream to fprintf in intgemm master.

Regarding EnvironmentCPUID hiding behind a preprocessor macro, take a look at: https://github.com/kpu/intgemm/pull/98/ . Have a look at the PR to confirm then I'll merge.

Neither of these will change the number of static constructors metric, but as you say this is metric is more of an indicator to look at startup time.

Flags: needinfo?(ydelendik)

EnvironmentCPUID hiding behind a preprocessor macro now in master https://github.com/kpu/intgemm following review by Yury. Since you're not using cmake anyway, there should be no change to your build setup to have getenv off.

I will create a phab updating the intgemm to the latest commit to resolve this issue.

(In reply to Abhishek from comment #1)

The push in question contains 3 of the 4 patches that eventually landed in central to close the bug 1746631. These 3 patches caused some failures which were later fixed as the 4th patch and landed in the form of the push 7907a10250a1a42425a1eb4918cc0be6d2e83dfa.

Does the build_metrics performance regression happen even after the 4th patch?

Looking at this graph the regression remains even after the 4th patch and retriggers for it.
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&selected=1689407,1486464709&series=autoland,1689407,1,2&timerange=1209600&zoom=1643928337692,1644295637778,117.23413376247181,121.43626433669351

Flags: needinfo?(aesanu)
Assignee: nobody → aaggarwal
Status: NEW → ASSIGNED

Hey Andra. Thanks a lot for confirming the regression after 4th patch.

Changes pulled as https://phabricator.services.mozilla.com/D138462:

  1. Changed iostream to fprintf
  2. Disabled using getenv by hiding EnvironmentCPUID behind a preprocessor macro
Pushed by ydelendik@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6074d13732e8 Updating the vendor intgemm library r=yury
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch
Flags: needinfo?(ydelendik)

The patch landed in nightly and beta is affected.
:aaggarwal, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(aaggarwal)
Blocks: 1722102
See Also: → 1755325

Firefox integer gemm intrinsic feature is currently only restricted to run in Nightly. So, we don't need to uplift it to beta. Thanks @Pascal for setting the flag.

Flags: needinfo?(aaggarwal)

(In reply to Abhishek from comment #14)

Hey Andra. Thanks a lot for confirming the regression after 4th patch.

Changes pulled as https://phabricator.services.mozilla.com/D138462:

  1. Changed iostream to fprintf
  2. Disabled using getenv by hiding EnvironmentCPUID behind a preprocessor macro

Abhishek, the regression still appears to be present in the graph (https://treeherder.mozilla.org/perfherder/graphs?timerange=5184000&series=autoland,1689410,1,2) is this expected?

Flags: needinfo?(aaggarwal)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Hi Dave. I was on PTO. So, I couldn't reply earlier.

The regression is expected for the reasons explained in the comment below:

(In reply to Yury Delendik (:yury) from comment #6)

From my understanding, but I can be corrected here, the goal is zero impact on startup performance of Firefox and adding initializers can potentially impact that. It is impossible to proof that by just looking at the number of static constructors -- it needs to be further analyzed.

There is some uncertainty in EnvironmentCPUID since the performance of getenv is unpredictable and vary across platforms -- my recommendation to remove that using preprocessor. The RealCPUID operation is still under question, but we can argue it is needed/required here. There will be a change in the amount of dynamic initializers when intgemm is included.

Please let me know if more information is required here. Thanks :)

Flags: needinfo?(aaggarwal)

Dave, is comment 20 sufficient justification?

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Flags: needinfo?(dave.hunt)
Resolution: --- → FIXED

(Didn't mean to close, sorry.)

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Yes, sorry, all good here.

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Flags: needinfo?(dave.hunt)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: