aarch64 Linux CPU information missing
Categories
(Core :: XPCOM, defect)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox127 | --- | fixed |
People
(Reporter: jlorenzo, Assigned: glandium)
References
Details
Attachments
(7 files, 1 obsolete file)
In bug 1889793, we officially published aarch64 (aka ARM64) nightlies. A few days later, this new architecture still doesn't show up in telemetry.clients_last_seen. With help from :jan-erik and :Dexter, we could rule out a problem within Firefox (see attached logs), notably by following 74e689ac-c19f-4834-9cb5-36d16a2b2260, a clientId I created to debug the issue. We confirmed that this clientId doesn't appear in telemetry.clients_last_seen. :Dexter noticed that this clientId shows up in Glean but not in legacy telemetry.
This explains why we can't get DAU/MAU on this CPU architecture.
At the moment, we are still determining the root cause.
Original slack thread: https://mozilla.slack.com/archives/C4D5ZA91B/p1714038934756329 (will expire in 6 months)
Comment 2•1 year ago
10:06:32.240 1714385192240 Toolkit.Telemetry TRACE TelemetrySend::_doPing - server: https://incoming.telemetry.mozilla.org, persisted: true, id: 0fda9eeb-2fee-4150-98b5-df9f018195b2
...
10:06:32.485 1714385192485 Toolkit.Telemetry INFO TelemetrySend::_doPing - successfully loaded, status: 200
This means a ping with document id 0fda9eeb-2fee-4150-98b5-df9f018195b2 has been successfully transmitted, as far as the client is able to tell. It's possible that network hardware is intercepting this request and blindly responding 200, or that though this is a well-formed enough request to be received by the HTTP Edge it isn't well-formed enough to be fully ingested.
Johan, a bit of an out-of-left-field request, but what is the value of environment.system.os.distro for your build? Is it null? Would it be null for all the aarch64 builds?
Comment 3•1 year ago (Reporter)
I'm no expert here so maybe I didn't look at the right location. It's set to "Ubuntu" (Edit: see attached screenshot below).
Comment 4•1 year ago (Reporter)
Comment 5•1 year ago
That's good, it means you're not falling afoul of something I just found.
Luckily, I've found the ping that your log submitted! It was indeed in the "this is a well-formed enough request to be received by the HTTP Edge it isn't well-formed enough to be fully ingested" camp as it hits the following ParsePayload error: org.everit.json.schema.ValidationException: #/environment/system/cpu/cores: #: no subschema matched out of the total 2 subschemas
Could you dump the raw JSON of your current payload (about:telemetry, click on "Raw JSON" at the bottom left) into an attachment here? I'd like to validate it against the schema.
Comment 7•1 year ago
Yup, wouldn't you know it, your environment.system.cpu is:
"cpu": {
  "extensions": [
    "hasNEON"
  ]
},
Whereas my Linux nightly has:
"cpu": {
  "count": 16,
  "cores": 8,
  "vendor": "GenuineIntel",
  "name": "Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz",
  "family": 6,
  "model": 79,
  "stepping": 1,
  "l2cacheKB": 256,
  "l3cacheKB": 20480,
  "speedMHz": 4000,
  "extensions": [
    "hasMMX",
    "hasSSE",
    "hasSSE2",
    "hasSSE3",
    "hasSSSE3",
    "hasSSE4_1",
    "hasSSE4_2",
    "hasAVX",
    "hasAVX2",
    "hasAES"
  ]
}
So, first and foremost, it appears as though there's some processinfo and cpuinfo stuff that needs to be filled in. ...but why do they need to be filled in, I wonder? The schema for the relevant portion is:
"cores": {
  "type": [
    "integer",
    "null"
  ],
  "minimum": 1,
  "maximum": 2048,
  "description": "The number of physical CPU cores. Desktop only, e.g. 4, or `null` on failure."
},
It's not required, and so it should be permitted to be absent, present and null, or present and a number between 1 and 2048. Your current payload has it as absent, so what's the deal? Maybe the payload for 0fda9eeb-2fee-4150-98b5-df9f018195b2 was different? Can you pull it up (I think it was the bhr ping in your archive) and attach its Raw JSON as well?
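For illustration only, the three permitted states of `cores` can be written out as a small Python check. `cores_is_valid` is a hypothetical helper that mirrors just the type, `minimum`, and `maximum` constraints quoted above; the real pipeline uses a full JSON Schema validator, not this code.

```python
def cores_is_valid(cpu: dict) -> bool:
    """Mirror the quoted `cores` rule: absent, null, or an integer in [1, 2048]."""
    if "cores" not in cpu:
        return True  # the property is not required, so absence is fine
    value = cpu["cores"]
    if value is None:
        return True  # an explicit null is allowed
    # bool is a subclass of int in Python, but JSON true/false are not integers.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 1 <= value <= 2048


print(cores_is_valid({"extensions": ["hasNEON"]}))  # key absent
print(cores_is_valid({"cores": None}))              # present and null
print(cores_is_valid({"cores": 8}))                 # present and in range
print(cores_is_valid({"cores": 0}))                 # present but below the minimum
```

Under these rules an absent key passes, so a ping rejected on `cores` would have to carry a value outside the allowed set, such as an integer below 1.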
...oh dang, PTO notice. Well, there's not really much I would be able to do anyway (I'm the wrong person to try to implement PR_GetNumberOfProcessors for aarch64, not least because I don't have that hardware myself). Gonna move this to Core :: XPCOM because the last few meaningful changes to nsSystemInfo.cpp were under that component, and there's engineering work to be done to figure out why processinfo and friends aren't being populated under aarch64.
Comment 8•1 year ago
It looks like PR_GetNumberOfProcessors() is called in a bunch of places, so if that isn't returning the correct value, a lot of things could be performing suboptimally. I'm not sure how easy it is to change NSPR nowadays.
Comment 9•1 year ago (Assignee)
On my Raspberry Pi, /sys/devices/system/cpu/present contains 0-3, so PR_GetNumberOfProcessors should work, but none of the fields that the code in nsSystemInfo.cpp looks for in /proc/cpuinfo are there, although the CPU cache sizes are there.
That said, even if PR_GetNumberOfProcessors returned a wrong value, a value should still be present in the payload, but it isn't, so something else must be going on.
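As a rough sketch of what counting from that file involves, the Python below parses a kernel CPU list such as `0-3`. `count_cpus` is a hypothetical name, and this is not NSPR's implementation of PR_GetNumberOfProcessors, only an illustration of the file format.

```python
def count_cpus(cpu_list: str) -> int:
    """Count the CPUs named by a kernel CPU list such as "0-3" or "0,2-5"."""
    total = 0
    for part in cpu_list.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a stray comma
        first, _, last = part.partition("-")
        # A single index like "7" has no "-", so `last` is empty.
        total += int(last or first) - int(first) + 1
    return total


print(count_cpus("0-3\n"))  # the Raspberry Pi value quoted above
```

On Linux the input would come from reading /sys/devices/system/cpu/present; the function itself takes the text so it can be tried anywhere.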
Comment 10•1 year ago (Assignee)
This probably has nothing to do with aarch64, because I'm getting the same problem on x86_64, where environment.system.cpu only contains extensions.
Comment 11•1 year ago (Assignee)
After the browser has been up for a minute, then the information is filled up. On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.
Comment 12•1 year ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #11)
On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.
That was on a Mac under UTM, where /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq and /sys/devices/system/cpu/cpu0/cache/*/size don't exist. I don't have X or Wayland on the Raspberry Pi.
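A minimal sketch of tolerating those missing files, assuming the usual sysfs formats (`cpuinfo_max_freq` in kHz, cache `size` as a number followed by `K`). All function names here are hypothetical and this is not the nsSystemInfo.cpp code; the point is only that a missing file can map to "unknown" rather than 0 or an empty string.

```python
from pathlib import Path
from typing import Optional


def read_sysfs_text(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None  # e.g. the file does not exist inside this VM


def parse_cache_size_kb(text: Optional[str]) -> Optional[int]:
    """Parse a cache size such as "256K"; None when missing or unexpected."""
    if not text or not text.endswith("K"):
        return None
    digits = text[:-1]
    return int(digits) if digits.isdigit() else None


def parse_max_freq_mhz(text: Optional[str]) -> Optional[int]:
    """Convert a kHz value such as "4000000" to MHz; None when missing."""
    if not text or not text.isdigit():
        return None
    return int(text) // 1000


freq = read_sysfs_text("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq")
print(parse_max_freq_mhz(freq))  # None where the file is absent
```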
Comment 13•1 year ago (Assignee)
Considering the schema, it would seem something like:
- JS::Rooted<JS::Value> valCoreInfo(aCx, JS::Int32Value(info.cpuCores));
+ JS::Rooted<JS::Value> valCoreInfo(
+ aCx, info.cpuCores ? JS::Int32Value(info.cpuCores) : JS::NullHandleValue);
would be the sensible thing to do. Or changing the schema to allow 0.
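The same idea, sketched in Python rather than the C++ above: treat 0 as "detection failed" and emit JSON null instead. `cores_for_telemetry` is a hypothetical helper, not Gecko code.

```python
import json
from typing import Optional


def cores_for_telemetry(cpu_cores: int) -> Optional[int]:
    """Return the detected core count, or None when detection produced 0."""
    return cpu_cores if cpu_cores > 0 else None


# None serializes to JSON null, which the schema's "null" type accepts.
print(json.dumps({"cores": cores_for_telemetry(0)}))  # {"cores": null}
print(json.dumps({"cores": cores_for_telemetry(8)}))  # {"cores": 8}
```

Mapping the sentinel on the client keeps the schema's `minimum` of 1 meaningful for data consumers; loosening the schema would instead let a nonsensical 0 through.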
Comment 14•1 year ago (Assignee)
FWIW, the lscpu command line tool looks at e.g. /sys/devices/system/cpu/cpu*/topology/thread_siblings and core_siblings. There's also a lot of manual work to figure out the CPU type. https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c
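A rough Python sketch of the sibling-based approach, assuming each CPU's `thread_siblings` file holds a mask that is identical for all hardware threads of one core. The function names are hypothetical, and this is neither lscpu's nor Firefox's code.

```python
import glob
import os


def count_physical_cores(siblings_by_cpu: dict) -> int:
    """Count cores as the number of distinct thread-sibling masks."""
    return len({mask.strip().lower() for mask in siblings_by_cpu.values()})


def read_thread_siblings(root: str = "/sys/devices/system/cpu") -> dict:
    """Map "cpuN" to the raw contents of its topology/thread_siblings file."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings")
    for path in glob.glob(pattern):
        cpu = path.split(os.sep)[-3]  # .../cpuN/topology/thread_siblings
        try:
            with open(path) as handle:
                result[cpu] = handle.read()
        except OSError:
            continue  # skip CPUs whose topology files are unreadable
    return result


# Made-up example: 4 logical CPUs forming 2 cores with 2 threads each.
example = {"cpu0": "5\n", "cpu1": "a\n", "cpu2": "5\n", "cpu3": "a\n"}
print(count_physical_cores(example))
```

When no topology files can be read the result is 0, which a caller would want to report as unknown rather than as a real core count.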
Comment 15•1 year ago
(In reply to Mike Hommey [:glandium] from comment #13)
Or changing the schema to allow 0.
I'd be happy to help with this, but is 0 actually a valid value or is it a sentinel value that means something else (unknown, failure, complication)? If it's a sentinel, I think data consumers would prefer it to be null with a slight change to the docs saying that null can mean either failure or this sentinel.
Comment 16•1 year ago (Assignee)
0 is definitely not a valid value.
Comment 17•1 year ago (Assignee)
The telemetry schema doesn't want a value below 1, and 0 is a value that
doesn't make sense anyways. So in the case where we didn't get a proper
value, set to null, which the schema will accept.
Comment 18•1 year ago
Comment on attachment 9399670 [details]
Bug 1894170 - Make cpu.cores null instead of 0.
Revision D209163 was moved to bug 1894549. Setting attachment 9399670 [details] to obsolete.
Comment 19•1 year ago (Assignee)
Let's keep this bug about adding the missing information. Bug 1894549 is about making the data conform to the schema.
Comment 20•1 year ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #11)
After the browser has been up for a minute, then the information is filled up. On aarch64, I get a correct cpu.count, but 0 or empty string for everything else.
Filed bug 1894554, FWIW.
Comment 21•1 year ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #14)
FWIW, the lscpu command line tool looks at e.g. /sys/devices/system/cpu/cpu*/topology/thread_siblings and core_siblings.
In fact, we should do that on all platforms, and not only arm64, because here's what I can see on a real machine:
$ grep "cpu cores" /proc/cpuinfo
cpu cores : 1
cpu cores : 1
cpu cores : 1
cpu cores : 1
That machine has 4 sockets (so four 1-core CPUs). Multi-socket machines are rare as desktop machines these days, but still.
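To show why one `cpu cores` line undercounts here, this sketch sums `cpu cores` once per distinct `physical id`. The sample text is made up to resemble the machine above (the `physical id` values are an assumption, since only the `cpu cores` lines were shown), and `total_cores_from_cpuinfo` is a hypothetical helper.

```python
def total_cores_from_cpuinfo(cpuinfo: str) -> int:
    """Sum "cpu cores" over distinct "physical id" values (one per socket)."""
    cores_by_package = {}
    # /proc/cpuinfo separates logical processors with blank lines.
    for block in cpuinfo.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "cpu cores" in fields:
            cores_by_package[fields["physical id"]] = int(fields["cpu cores"])
    return sum(cores_by_package.values())


# Hypothetical 4-socket machine: each socket reports a single core.
sample = "\n\n".join(
    f"processor\t: {i}\nphysical id\t: {i}\ncpu cores\t: 1" for i in range(4)
)
print(total_cores_from_cpuinfo(sample))  # 4, not the 1 a single line suggests
```

This only works where /proc/cpuinfo exposes both fields; where it does not, the function returns 0 and another source such as the sysfs topology files would be needed.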
Comment 22•1 year ago (Assignee)
Comment 23•1 year ago (Assignee)
Comment 24•1 year ago (Assignee)
Comment 25•1 year ago (Assignee)
Comment 26•1 year ago
Comment 27•1 year ago (bugherder)
https://hg.mozilla.org/mozilla-central/rev/6f90a9eecde4
https://hg.mozilla.org/mozilla-central/rev/99a3593f7b4a
https://hg.mozilla.org/mozilla-central/rev/81d5a5a1166b
https://hg.mozilla.org/mozilla-central/rev/fb5bf71faa24
Comment 28•1 year ago (Reporter)
I just got back from PTO. I checked this query[1] and I now see aarch64 in the results. Thanks :glandium, :emilio, :dnazer, and :chutten! 👍