Closed Bug 1284240 Opened 3 years ago Closed 3 years ago

Telemetry for seccomp-bpf support looks faulty

Categories

(Core :: Security: Process Sandboxing, defect)

x86
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: gcp, Unassigned)

References

Details

(Whiteboard: sblc2)

Attachments

(1 file)

https://sql.telemetry.mozilla.org/queries/622/source#table

For some kernel versions that outright should have seccomp, there still a lot more false entries than true entries. Either the detection is not working correctly, or Telemetry is sent before the setup code has had a chance to run. Due to the use of a "flag" variable, this may cause false "false's" to be sent.
You grab values from each submission of each client, so you can run into pseudoreplication (clients that submit often will bias the distribution).
You could just grab the latest value for each client (`sandbox_capabilities_seccomp_bpf[1]` etc.).

I would limit this to the "release" & "beta" channels (`WHERE channel IS ...`); especially Nightly can have noisy developer data.
Telemetry is not sent very early - we don't start sending the pings that collect this data until 60 seconds after startup.
If you want to compare this explicitly to make sure, we could only look at those sent on shutdown (`WHERE reason = 'shutdown'`) or filter out data with short session lengths.
(In reply to Georg Fritzsche [:gfritzsche] [away Jun 24 - Jul 3] from comment #1)
> You grab values from each submission of each client, so you can run into
> pseudoreplication (clients that submit often will bias the distribution).
> You could just grab the latest value for each client
> (`sandbox_capabilities_seccomp_bpf[1]` etc.).
 
I doubt this is the case. This would require those strange systems with a new kernel yet no seccomp-bpf to systematically be the ones over-reporting. That makes little sense?

> I would limit this to the "release" & "beta" channels (`WHERE channel IS
> ...`); especially Nightly can have noisy developer data.

I don't think the Telemetry has propagated there yet, as it was introduced in 49.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #3)
> (In reply to Georg Fritzsche [:gfritzsche] [away Jun 24 - Jul 3] from
> comment #1)
> > You grab values from each submission of each client, so you can run into
> > pseudoreplication (clients that submit often will bias the distribution).
> > You could just grab the latest value for each client
> > (`sandbox_capabilities_seccomp_bpf[1]` etc.).
>  
> I doubt this is the case. This would require those strange systems with a
> new kernel yet no seccomp-bpf to systematically be the ones over-reporting.
> That makes little sense?

I wouldn't know, but its easy to fix and check. Sampling once per client is a good practice in general.
I did:

SELECT os,
       system_os[1].version AS osversion,
       sandbox_capabilities_seccomp_bpf[1] AS seccomp,
       COUNT(1) AS hits       
FROM longitudinal
WHERE sandbox_capabilities_seccomp_bpf[1] IS NOT NULL AND reason[1] = 'shutdown'
GROUP BY os,
         system_os[1].version,
         sandbox_capabilities_seccomp_bpf[1]
ORDER BY os,
         system_os[1].version DESC,
         hits DESC

and

SELECT os,
       system_os[1].version AS osversion,
       sandbox_capabilities_seccomp_bpf[1] AS seccomp       
FROM longitudinal
WHERE sandbox_capabilities_seccomp_bpf[1] IS NOT NULL AND reason[1] = 'shutdown'
ORDER BY os,
         system_os[1].version DESC         

which mostly reveals that only a few people run Linux ;-)

Seccomp enabled or not is still wildly inconsistent with kernel versions, but appears to shift for the same users, which makes no sense.
I changed this a little to always use the latest available shutdown ping etc., but the numbers are still very low:
https://sql.telemetry.mozilla.org/queries/634/source#table

Cross-checking with the Aurora population numbers on the Firefox dashboard here, the user numbers we see here are definitely too low:
https://metrics.services.mozilla.com/firefox-dashboard/
(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)
> Seccomp enabled or not is still wildly inconsistent with kernel versions,
> but appears to shift for the same users, which makes no sense.

I think there are currently very few users, so its hard to tell whether its noise or an actual issue.
For now, how about:
* looking at the histories per user in more detail to see if a pattern stands out (e.g. false when session_length is <Nsec, but true otherwise)
* testing the behavior locally to confirm the probe works as intended
https://sql.telemetry.mozilla.org/queries/641/source

The data is quite consistent per user, that is if they have "true" it generally stays so, and if it's "false" it also generally stays so. I see no relation to session length.

Looks like the detection might be faulty, then?
I've downgraded my Ubuntu development VM from kernel 4.4.0-28-generic to 4.4.0-22-generic (a version that shows up in the linked data with a lot of "false" but also some "true") to test this, with current mozilla-central, and I'm seeing "true" everywhere in about:support#sandbox and "1" everywhere in about:telemetry, as expected.  For whatever that's worth.

The test in SandboxInfo.cpp is fairly simple: calls prctl(PR_SET_SECCOMP, 2, nullptr), if it fails with EFAULT then it's supported, if it fails with EINVAL then it's not supported, and if anything else happens (other error code or success) then a there's a MOZ_DIAGNOSTIC_ASSERT.  (Chromium uses the same test, but without the assertions, so if we're actually reading a false negative at this layer then so are they.)  The telemetry side of this has a lot more “moving parts”, so I'd tend to wonder if something has gone wrong there.

Another thing I'm wondering: are these actually our builds, or could some of them be modified downstream builds somehow?  Using --disable-sandbox will disable building everything in security/sandbox and won't poke values into telemetry, but the probes as specified in the JSON are conditional only on XP_LINUX.  We might be able to get information from the build IDs, if nothing else.
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #9)
> Another thing I'm wondering: are these actually our builds, or could some of
> them be modified downstream builds somehow?

That's a possibility, although given that this Telemetry mostly comes from Nightly, distro repacks shouldn't be a thing, and 3rd party builds would need to have explicitly opted out of the sandbox.
Don't see any obvious correlations between build_id and getting seccomp on or off.
Whiteboard: sblc2
I got a local repro for this problem. No exact steps, but:

1) On startup, SandboxInfo::SubmitTelemetry() was called.
2) After running for a while, about:telemetry showed 

SANDBOX_CAPABILITIES_SECCOMP_BPF
1 samples, average = 0, sum = 0

All Sandbox histograms were 0. About:support showed all true.
Values are OK right after startup:

SANDBOX_CAPABILITIES_SECCOMP_BPF
1 samples, average = 1, sum = 1
Hm, so the problem seems to be that:
* in the first subsession, the right values are submitted
* later we start a new subsession (e.g. on local midnight or addons being installed or disabled)
* the new subsession resets flags to their default (false) value

This should be easy to reproduce with enabling/disabling an addon and watching the value change in about:telemetry.
Switching the "show subsession data" box on and off should show the two different values.
I never liked the "default to false" behavior of flags, i think it is confusing - but we can't break that right now.
In bug 1281214 though we are about to implement a "boolean scalar value" measurement, which matches your intent better.

So you should be able to:
* in re:dash, look at the first subsessions/pings only instead of the shutdown one (subsession_counter = 1)
* if the non-subsession value looks correct:
  * look at the old saved-session pings, which contain values for the whole session
  * either on t.m.o [1]
  * or raw in a Spark analysis

[1] https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2016-07-07&keys=__none__!__none__!__none__&max_channel_version=nightly%252F50&measure=SANDBOX_CAPABILITIES_SECCOMP_BPF&min_channel_version=null&product=Firefox&sanitize=0&sort_keys=submissions&start_date=2016-06-06&table=0&trim=1&use_submission_date=0
You are right, I could reproduce it with those steps.

Given that we actually discussed boolean vs flag in the original bug, and still shot ourselves in the foot, I think it might be a good idea to mark flag as deprecated and not recommend it any more. I wonder if there's any other flag telemetry that is broken in the same way.
I think flag histograms are working, just counter-intuitively.
At this stage i want to wait a short time and then push people on using scalar measurements (i.e. we will soon disallow adding new flag & count histograms).

Does this work for you for now, checking the value on the t.m.o dashboard or on sql.t.m.o grabbing by subsession_counter=1?
It appears to be more consistent per user, but I'm still seeing a lot of false on kernels where we know it's supposed to be supported. So I don't know what to think.
I still see strange behavior in the Telemetry. Because of this bug, even after filtering on subsession_counter, I'm still not confident I can trust the data I'm seeing. But we do need this data, so I want to change the Telemetry so I know I can trust the result.

This renames the telemetry and changes everything from flag to boolean.

I verified that this is set in the first subsession, and nothing gets set for subsequent ones.
Comment on attachment 8773191 [details]
Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry.

https://reviewboard.mozilla.org/r/65918/#review62854

If that solves your problem, lets do it.
I'm looking forward to when we (soon) can point people to scalar booleans for this kind of usage.
Attachment #8773191 - Flags: review?(gfritzsche) → review+
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0ce00a7b3b58
Use boolean rather than flag for Sandboxing Telemetry. r=gfritzsche
https://hg.mozilla.org/mozilla-central/rev/0ce00a7b3b58
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8773191 [details]
Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry.

Approval Request Comment
[Feature/regressing bug #]: Bug 1098428 
[User impact if declined]: No visible user impact. Useless data is collected.
[Describe test coverage new/current, TreeHerder]: Landed on Nightly a few days ago.
[Risks and why]: Very low. No user exposed functionality.
[String/UUID change made/needed]: N/A.
Attachment #8773191 - Flags: approval-mozilla-aurora?
Comment on attachment 8773191 [details]
Bug 1284240 - Use boolean rather than flag for Sandboxing Telemetry.

This patch fixes collecting useless telemetry data. Take it in aurora.
Attachment #8773191 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.