Closed Bug 1894170 Opened 1 year ago Closed 1 year ago

aarch64 Linux CPU information missing

Categories

(Core :: XPCOM, defect)

Tracking

VERIFIED FIXED
127 Branch
Tracking Status
firefox127 --- fixed

People

(Reporter: jlorenzo, Assigned: glandium)

Attachments

(7 files, 1 obsolete file)

In bug 1889793, we officially published aarch64 (aka ARM64) nightlies. A few days later, this new architecture still doesn't show up in telemetry.clients_last_seen. Thanks to :jan-erik's and :Dexter's help, we could rule out a problem within Firefox (see attached logs), notably by following 74e689ac-c19f-4834-9cb5-36d16a2b2260, a clientId I created to debug the issue. We confirmed that this clientId doesn't appear in telemetry.clients_last_seen. :Dexter noticed that this clientId shows up in Glean but not in legacy telemetry.

This explains why we can't get DAU/MAU on this CPU architecture.

At the moment, we are still determining the root cause.

Original slack thread: https://mozilla.slack.com/archives/C4D5ZA91B/p1714038934756329 (will expire in 6 months)

Hey Chris,

can you give this a look?

Flags: needinfo?(chutten)
10:06:32.240 1714385192240	Toolkit.Telemetry	TRACE	TelemetrySend::_doPing - server: https://incoming.telemetry.mozilla.org, persisted: true, id: 0fda9eeb-2fee-4150-98b5-df9f018195b2
...
10:06:32.485 1714385192485	Toolkit.Telemetry	INFO	TelemetrySend::_doPing - successfully loaded, status: 200

This means a ping with document id 0fda9eeb-2fee-4150-98b5-df9f018195b2 has been successfully transmitted, as far as the client is able to tell. It's possible there's network hardware intercepting this request and blindly responding 200. It's also possible that, though this is a well-formed enough request to be received by the HTTP Edge, it isn't well-formed enough to be fully ingested.

Johan, a bit of an out-of-left-field request, but what is the value of environment.system.os.distro for your build? Is it null? Would it be null for all the aarch64 builds?

Flags: needinfo?(chutten) → needinfo?(jlorenzo)

I'm no expert here so maybe I didn't look at the right location. It's set to "Ubuntu" (Edit: see attached screenshot below).

Flags: needinfo?(jlorenzo)

That's good, it means you're not falling afoul of something I just found.

Luckily, I've found the ping that your log submitted! It was indeed in the "this is a well-formed enough request to be received by the HTTP Edge it isn't well-formed enough to be fully ingested" camp as it hits the following ParsePayload error: org.everit.json.schema.ValidationException: #/environment/system/cpu/cores: #: no subschema matched out of the total 2 subschemas

Could you dump the raw JSON of your current payload (about:telemetry, click on "Raw JSON" at the bottom left) into an attachment here? I'd like to validate it against the schema.

Flags: needinfo?(jlorenzo)
Attached file telemetry_raw.json

Sure! Here it is.

Flags: needinfo?(jlorenzo)

Yup, wouldn't you know it, your environment.system.cpu is:

      "cpu": {
        "extensions": [
          "hasNEON"
        ]
      },

Whereas my Linux nightly has:

      "cpu": {
        "count": 16,
        "cores": 8,
        "vendor": "GenuineIntel",
        "name": "Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz",
        "family": 6,
        "model": 79,
        "stepping": 1,
        "l2cacheKB": 256,
        "l3cacheKB": 20480,
        "speedMHz": 4000,
        "extensions": [
          "hasMMX",
          "hasSSE",
          "hasSSE2",
          "hasSSE3",
          "hasSSSE3",
          "hasSSE4_1",
          "hasSSE4_2",
          "hasAVX",
          "hasAVX2",
          "hasAES"
        ]
      }

So, first and foremost, it appears as though there's some processinfo and cpuinfo stuff that needs to be filled in. ...but why do they need to be filled in, I wonder? The schema for the relevant portion is:

            "cores": {
              "type": [
                "integer",
                "null"
              ],
              "minimum": 1,
              "maximum": 2048,
              "description": "The number of physical CPU cores. Desktop only, e.g. 4, or `null` on failure."
            },

It's not required, and so it should be permitted to be absent, present and null, or present and a number between 1 and 2048. Your current payload has it as absent, so what's the deal? Maybe the payload for 0fda9eeb-2fee-4150-98b5-df9f018195b2 was different? Can you pull it up (I think it was the bhr ping in your archive) and attach its Raw JSON as well?
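For reference, the rule that schema fragment expresses can be written down as a small standalone check. This is only a sketch of the constraint as quoted above, not the ingestion pipeline's validator; `Absent`, `JsonNull`, `CoresField` and `CoresFieldIsValid` are hypothetical names made up for illustration.

```cpp
#include <variant>

// Hypothetical stand-ins for the three states a JSON field can be in.
struct Absent {};    // key not present in the object
struct JsonNull {};  // key present with the value null
using CoresField = std::variant<Absent, JsonNull, long>;

// Mirrors the quoted schema: the field is optional, its type is
// integer-or-null, and integers must satisfy 1 <= n <= 2048.
bool CoresFieldIsValid(const CoresField& field) {
  if (std::holds_alternative<Absent>(field) ||
      std::holds_alternative<JsonNull>(field)) {
    return true;
  }
  const long cores = std::get<long>(field);
  return cores >= 1 && cores <= 2048;
}
```

Under this reading, an absent or null field passes, while an integer outside the range (0, for example) is rejected.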

...oh dang, PTO notice. Well, there's not really much I would be able to do anyway (I'm the wrong person to try and implement PR_GetNumberOfProcessors for aarch64, not least because I don't have that hardware myself). Gonna move this to Core :: XPCOM because the last few meaningful changes to nsSystemInfo.cpp were under that component, and there's engineering work to be done to figure out why processinfo and friends aren't being populated under aarch64.

Component: Telemetry → XPCOM
Product: Toolkit → Core
Summary: Linux aarch64 nightlies don't show up in legacy tables → aarch64 Linux CPU information missing

It looks like PR_GetNumberOfProcessors() is called in a bunch of places, so if that isn't returning the correct value, a lot of things could be performing suboptimally. I'm not sure how easy it is to change NSPR nowadays.

On my Raspberry Pi, /sys/devices/system/cpu/present contains 0-3, so PR_GetNumberOfProcessors should work. However, none of the fields that the code in nsSystemInfo.cpp looks for in /proc/cpuinfo are there, although the CPU cache sizes are available.
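To illustrate why a value of 0-3 means four CPUs, here is a minimal sketch of counting the entries in a kernel CPU list string. It is not NSPR's implementation; `CountCpusInList` is a hypothetical helper, and the caller is assumed to have already read the file contents into a string.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts the CPUs named in a kernel "cpulist" string such as "0-3" or
// "0,2-5". Returns 0 if the string is empty or malformed.
std::size_t CountCpusInList(const std::string& list) {
  std::size_t total = 0;
  std::istringstream ranges(list);
  std::string range;
  while (std::getline(ranges, range, ',')) {
    std::istringstream item(range);
    unsigned long first = 0;
    if (!(item >> first)) {
      return 0;
    }
    unsigned long last = first;
    if (item.peek() == '-') {
      item.get();  // consume the '-' separating both ends of the range
      if (!(item >> last) || last < first) {
        return 0;
      }
    }
    total += last - first + 1;
  }
  return total;
}
```

With the contents seen on the Raspberry Pi, `CountCpusInList("0-3\n")` yields 4.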

That said, even if PR_GetNumberOfProcessors returned a wrong value, the field should still be present in the payload, but it isn't, so something else must be going on.

This probably has nothing to do with aarch64, because I'm getting the same problem on x86_64, where environment.system.cpu only contains extensions.

After the browser has been up for a minute, the information is filled in. On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.

(In reply to Mike Hommey [:glandium] from comment #11)

On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.

That was on a Mac under UTM, where /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq and /sys/devices/system/cpu/cpu0/cache/*/size don't exist. I don't have X or Wayland on the Raspberry Pi.

Considering the schema, it would seem something like:

-  JS::Rooted<JS::Value> valCoreInfo(aCx, JS::Int32Value(info.cpuCores));
+  JS::Rooted<JS::Value> valCoreInfo(
+      aCx, info.cpuCores ? JS::Int32Value(info.cpuCores) : JS::NullHandleValue);

would be the sensible thing to do. Or changing the schema to allow 0.
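Outside of Gecko, the same idea (treat a non-positive core count as "unknown" and report null instead) can be sketched with the standard library. This is an illustration only: `std::optional` stands in for a nullable JS value, and `CoresForTelemetry` is a hypothetical name, not a function in nsSystemInfo.cpp.

```cpp
#include <cstdint>
#include <optional>

// Maps a raw core count to the value that would be reported. Anything below
// 1 is treated as "detection failed" and becomes an empty optional (the
// stand-in for JSON null here), which the schema accepts.
std::optional<std::int32_t> CoresForTelemetry(std::int32_t cpuCores) {
  if (cpuCores < 1) {
    return std::nullopt;
  }
  return cpuCores;
}
```

A caller would serialize the empty optional as null and any other value as the integer itself.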

FWIW, the lscpu command line tool looks at e.g. /sys/devices/system/cpu/cpu*/topology/thread_siblings and core_siblings. There's also a lot of manual work to figure out the CPU type. https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c
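As a rough idea of what part of that manual work involves: on ARM Linux, /proc/cpuinfo reports fields such as "CPU implementer" and "CPU part" as hexadecimal numbers, which then have to be mapped to names. Below is a minimal sketch of parsing one such line with the standard library. `ParseHexField` is a hypothetical helper, and the sample lines in the note after the code are illustrative rather than taken from the machines discussed here.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Parses the value of a "key : 0x..." line as hexadecimal. Returns an empty
// optional if there is no ':' separator or the value is not a hex number.
std::optional<unsigned long> ParseHexField(const std::string& line) {
  const std::size_t colon = line.find(':');
  if (colon == std::string::npos) {
    return std::nullopt;
  }
  const std::string value = line.substr(colon + 1);
  try {
    std::size_t used = 0;
    // Base 16 accepts an optional "0x" prefix and skips leading whitespace.
    const unsigned long parsed = std::stoul(value, &used, 16);
    // Reject values with trailing non-whitespace, e.g. "Cortex-A72".
    if (value.find_first_not_of(" \t\r\n", used) != std::string::npos) {
      return std::nullopt;
    }
    return parsed;
  } catch (const std::exception&) {
    return std::nullopt;
  }
}
```

For a line like "CPU implementer : 0x41" this returns 0x41, while a free-text line such as "model name : Cortex-A72" yields an empty optional.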

(In reply to Mike Hommey [:glandium] from comment #13)

Or changing the schema to allow 0.

I'd be happy to help with this, but is 0 actually a valid value or is it a sentinel value that means something else (unknown, failure, complication)? If it's a sentinel, I think data consumers would prefer it to be null with a slight change to the docs saying that null can mean either failure or this sentinel.

0 is definitely not a valid value.

The telemetry schema doesn't want a value below 1, and 0 is a value that
doesn't make sense anyways. So in the case where we didn't get a proper
value, set to null, which the schema will accept.

Assignee: nobody → mh+mozilla
Status: NEW → ASSIGNED
Blocks: 1894549

Comment on attachment 9399670 [details]
Bug 1894170 - Make cpu.cores null instead of 0.

Revision D209163 was moved to bug 1894549. Setting attachment 9399670 [details] to obsolete.

Attachment #9399670 - Attachment is obsolete: true

Let's keep this bug about adding the missing information. Bug 1894549 is about making the data conform to the schema.

(In reply to Mike Hommey [:glandium] from comment #11)

After the browser has been up for a minute, then the information is filled up. On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.

Filed bug 1894554, FWIW.

(In reply to Mike Hommey [:glandium] from comment #14)

FWIW, the lscpu command line tool looks at e.g. /sys/devices/system/cpu/cpu*/topology/thread_siblings and core_siblings.

In fact, we should do that on all platforms, and not only arm64, because here's what I can see on a real machine:

$ grep "cpu cores" /proc/cpuinfo 
cpu cores	: 1
cpu cores	: 1
cpu cores	: 1
cpu cores	: 1

That machine has 4 sockets (so four 1-core CPUs). Multi-socket machines are rare as desktop machines these days, but still.
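One way to picture the topology-based approach: logical CPUs that share a physical core report the same thread_siblings mask, so counting distinct masks gives the number of cores across all sockets. The sketch below assumes the masks have already been read, one string per logical CPU; `CountPhysicalCores` is a hypothetical name, and the patches that landed for this bug may well do it differently.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Each entry is the thread_siblings mask read for one logical CPU, as text.
// Logical CPUs on the same physical core share a mask, so the number of
// distinct masks is the number of physical cores on the whole machine.
std::size_t CountPhysicalCores(const std::vector<std::string>& threadSiblings) {
  const std::set<std::string> distinct(threadSiblings.begin(),
                                       threadSiblings.end());
  return distinct.size();
}
```

For a 4-socket machine with one single-threaded core per socket, the masks would be four distinct values and the result is 4, whereas reading "cpu cores" from /proc/cpuinfo gives 1.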

Severity: -- → S2
See Also: → 1883730
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/6f90a9eecde4
Simplify reading integers from /proc/cpuinfo and /sys/devices/**. r=xpcom-reviewers,emilio
https://hg.mozilla.org/integration/autoland/rev/99a3593f7b4a
Add a Tokenizer helper to read hexadecimal. r=xpcom-reviewers,emilio
https://hg.mozilla.org/integration/autoland/rev/81d5a5a1166b
Get proper CPU vendor, model and stepping for ARM/aarch64. r=xpcom-reviewers,emilio,dnazer
https://hg.mozilla.org/integration/autoland/rev/fb5bf71faa24
Get the proper count of cores on Linux. r=xpcom-reviewers,emilio

I just got back from PTO. I checked this query[1] and I now see aarch64 in the results. Thanks :glandium, :emilio, :dnazer, and :chutten! 👍

[1] https://sql.telemetry.mozilla.org/queries/99585/source

Status: RESOLVED → VERIFIED