1284240 - Telemetry for seccomp-bpf support looks faulty

Reporter

Description

•

9 years ago

https://sql.telemetry.mozilla.org/queries/622/source#table For some kernel versions that outright should have seccomp, there still a lot more false entries than true entries. Either the detection is not working correctly, or Telemetry is sent before the setup code has had a chance to run. Due to the use of a "flag" variable, this may cause false "false's" to be sent.

Georg Fritzsche [:gfritzsche]

Comment 1

•

9 years ago

You grab values from each submission of each client, so you can run into pseudoreplication (clients that submit often will bias the distribution). You could just grab the latest value for each client (`sandbox_capabilities_seccomp_bpf[1]` etc.). I would limit this to the "release" & "beta" channels (`WHERE channel IS ...`); especially Nightly can have noisy developer data.

Georg Fritzsche [:gfritzsche]

Comment 2

•

9 years ago

Telemetry is not sent very early - we don't start sending the pings that collect this data until 60 seconds after startup. If you want to compare this explicitly to make sure, we could only look at those sent on shutdown (`WHERE reason = 'shutdown'`) or filter out data with short session lengths.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 3

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] [away Jun 24 - Jul 3] from comment #1) > You grab values from each submission of each client, so you can run into > pseudoreplication (clients that submit often will bias the distribution). > You could just grab the latest value for each client > (`sandbox_capabilities_seccomp_bpf[1]` etc.). I doubt this is the case. This would require those strange systems with a new kernel yet no seccomp-bpf to systematically be the ones over-reporting. That makes little sense? > I would limit this to the "release" & "beta" channels (`WHERE channel IS > ...`); especially Nightly can have noisy developer data. I don't think the Telemetry has propagated there yet, as it was introduced in 49.

Georg Fritzsche [:gfritzsche]

Comment 4

•

9 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #3) > (In reply to Georg Fritzsche [:gfritzsche] [away Jun 24 - Jul 3] from > comment #1) > > You grab values from each submission of each client, so you can run into > > pseudoreplication (clients that submit often will bias the distribution). > > You could just grab the latest value for each client > > (`sandbox_capabilities_seccomp_bpf[1]` etc.). > > I doubt this is the case. This would require those strange systems with a > new kernel yet no seccomp-bpf to systematically be the ones over-reporting. > That makes little sense? I wouldn't know, but its easy to fix and check. Sampling once per client is a good practice in general.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 5

•

9 years ago

I did: SELECT os, system_os[1].version AS osversion, sandbox_capabilities_seccomp_bpf[1] AS seccomp, COUNT(1) AS hits FROM longitudinal WHERE sandbox_capabilities_seccomp_bpf[1] IS NOT NULL AND reason[1] = 'shutdown' GROUP BY os, system_os[1].version, sandbox_capabilities_seccomp_bpf[1] ORDER BY os, system_os[1].version DESC, hits DESC and SELECT os, system_os[1].version AS osversion, sandbox_capabilities_seccomp_bpf[1] AS seccomp FROM longitudinal WHERE sandbox_capabilities_seccomp_bpf[1] IS NOT NULL AND reason[1] = 'shutdown' ORDER BY os, system_os[1].version DESC which mostly reveals that only a few people run Linux ;-) Seccomp enabled or not is still wildly inconsistent with kernel versions, but appears to shift for the same users, which makes no sense.

Georg Fritzsche [:gfritzsche]

Comment 6

•

9 years ago

I changed this a little to always use the latest available shutdown ping etc., but the numbers are still very low: https://sql.telemetry.mozilla.org/queries/634/source#table Cross-checking with the Aurora population numbers on the Firefox dashboard here, the user numbers we see here are definitely too low: https://metrics.services.mozilla.com/firefox-dashboard/

Georg Fritzsche [:gfritzsche]

Comment 7

•

9 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #5) > Seccomp enabled or not is still wildly inconsistent with kernel versions, > but appears to shift for the same users, which makes no sense. I think there are currently very few users, so its hard to tell whether its noise or an actual issue. For now, how about: * looking at the histories per user in more detail to see if a pattern stands out (e.g. false when session_length is <Nsec, but true otherwise) * testing the behavior locally to confirm the probe works as intended

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 8

•

9 years ago

https://sql.telemetry.mozilla.org/queries/641/source The data is quite consistent per user, that is if they have "true" it generally stays so, and if it's "false" it also generally stays so. I see no relation to session length. Looks like the detection might be faulty, then?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 9

•

9 years ago

I've downgraded my Ubuntu development VM from kernel 4.4.0-28-generic to 4.4.0-22-generic (a version that shows up in the linked data with a lot of "false" but also some "true") to test this, with current mozilla-central, and I'm seeing "true" everywhere in about:support#sandbox and "1" everywhere in about:telemetry, as expected. For whatever that's worth. The test in SandboxInfo.cpp is fairly simple: calls prctl(PR_SET_SECCOMP, 2, nullptr), if it fails with EFAULT then it's supported, if it fails with EINVAL then it's not supported, and if anything else happens (other error code or success) then a there's a MOZ_DIAGNOSTIC_ASSERT. (Chromium uses the same test, but without the assertions, so if we're actually reading a false negative at this layer then so are they.) The telemetry side of this has a lot more “moving parts”, so I'd tend to wonder if something has gone wrong there. Another thing I'm wondering: are these actually our builds, or could some of them be modified downstream builds somehow? Using --disable-sandbox will disable building everything in security/sandbox and won't poke values into telemetry, but the probes as specified in the JSON are conditional only on XP_LINUX. We might be able to get information from the build IDs, if nothing else.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 10

•

9 years ago

(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #9) > Another thing I'm wondering: are these actually our builds, or could some of > them be modified downstream builds somehow? That's a possibility, although given that this Telemetry mostly comes from Nightly, distro repacks shouldn't be a thing, and 3rd party builds would need to have explicitly opted out of the sandbox.

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Blocks: 1098428

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 11

•

9 years ago

Don't see any obvious correlations between build_id and getting seccomp on or off.

Jim Mathies [:jimm]

Updated

•

9 years ago

Whiteboard: sblc2

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 12

•

9 years ago

I got a local repro for this problem. No exact steps, but: 1) On startup, SandboxInfo::SubmitTelemetry() was called. 2) After running for a while, about:telemetry showed SANDBOX_CAPABILITIES_SECCOMP_BPF 1 samples, average = 0, sum = 0 All Sandbox histograms were 0. About:support showed all true.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 13

•

9 years ago

Values are OK right after startup: SANDBOX_CAPABILITIES_SECCOMP_BPF 1 samples, average = 1, sum = 1

Georg Fritzsche [:gfritzsche]

Comment 14

•

9 years ago

Hm, so the problem seems to be that: * in the first subsession, the right values are submitted * later we start a new subsession (e.g. on local midnight or addons being installed or disabled) * the new subsession resets flags to their default (false) value This should be easy to reproduce with enabling/disabling an addon and watching the value change in about:telemetry. Switching the "show subsession data" box on and off should show the two different values. I never liked the "default to false" behavior of flags, i think it is confusing - but we can't break that right now. In bug 1281214 though we are about to implement a "boolean scalar value" measurement, which matches your intent better. So you should be able to: * in re:dash, look at the first subsessions/pings only instead of the shutdown one (subsession_counter = 1) * if the non-subsession value looks correct: * look at the old saved-session pings, which contain values for the whole session * either on t.m.o [1] * or raw in a Spark analysis [1] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-07-07&keys=__none__!__none__!__none__&max_channel_version=nightly%252F50&measure=SANDBOX_CAPABILITIES_SECCOMP_BPF&min_channel_version=null&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-06-06&table=0&trim=1&use_submission_date=0

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 15

•

9 years ago

You are right, I could reproduce it with those steps. Given that we actually discussed boolean vs flag in the original bug, and still shot ourselves in the foot, I think it might be a good idea to mark flag as deprecated and not recommend it any more. I wonder if there's any other flag telemetry that is broken in the same way.

Georg Fritzsche [:gfritzsche]

Comment 16

•

9 years ago

I think flag histograms are working, just counter-intuitively. At this stage i want to wait a short time and then push people on using scalar measurements (i.e. we will soon disallow adding new flag & count histograms). Does this work for you for now, checking the value on the t.m.o dashboard or on sql.t.m.o grabbing by subsession_counter=1?

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 17

•

9 years ago

It appears to be more consistent per user, but I'm still seeing a lot of false on kernels where we know it's supposed to be supported. So I don't know what to think.

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 18

•

9 years ago

Attached file Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry. — Details

Review commit: https://reviewboard.mozilla.org/r/65918/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/65918/

Attachment #8773191 - Flags: review?(gfritzsche)

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 19

•

9 years ago

I still see strange behavior in the Telemetry. Because of this bug, even after filtering on subsession_counter, I'm still not confident I can trust the data I'm seeing. But we do need this data, so I want to change the Telemetry so I know I can trust the result. This renames the telemetry and changes everything from flag to boolean. I verified that this is set in the first subsession, and nothing gets set for subsequent ones.

Georg Fritzsche [:gfritzsche]

Comment 20

•

9 years ago

Comment on attachment 8773191 [details] Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry. https://reviewboard.mozilla.org/r/65918/#review62854 If that solves your problem, lets do it. I'm looking forward to when we (soon) can point people to scalar booleans for this kind of usage.

Attachment #8773191 - Flags: review?(gfritzsche) → review+

Pulsebot

Comment 21

•

9 years ago

Pushed by gpascutto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0ce00a7b3b58 Use boolean rather than flag for Sandboxing Telemetry. r=gfritzsche

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 22

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0ce00a7b3b58

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox50: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla50

Gian-Carlo Pascutto [:gcp]

Reporter

Comment 23

•

9 years ago

Comment on attachment 8773191 [details] Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry. Approval Request Comment [Feature/regressing bug #]: Bug 1098428 [User impact if declined]: No visible user impact. Useless data is collected. [Describe test coverage new/current, TreeHerder]: Landed on Nightly a few days ago. [Risks and why]: Very low. No user exposed functionality. [String/UUID change made/needed]: N/A.

Attachment #8773191 - Flags: approval-mozilla-aurora?

Gerry Chang [:gchang]

Updated

•

9 years ago

status-firefox49: --- → affected

Gerry Chang [:gchang]

Comment 24

•

9 years ago

Comment on attachment 8773191 [details] Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry. This patch fixes collecting useless telemetry data. Take it in aurora.

Attachment #8773191 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Carsten Book [:Tomcat]

Comment 25

•

9 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/8d3761e5445c

status-firefox49: affected → fixed