Closed Bug 1070036 Opened 10 years ago Closed 10 years ago

[Linux] Exceptions during openh264 provider initialization

Categories

(Core :: WebRTC, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
firefox33 - fixed
firefox34 - fixed
firefox35 - fixed

People

(Reporter: Irving, Assigned: gfritzsche)

References

Details

Attachments

(3 files)

The add-on manager reports exceptions during provider start up via Telemetry. Recent telemetry reports are showing a high frequency (29% of beta 33 sessions) of uncaught exceptions in the OpenH264 provider initialization:

{"message":"[Exception... \"Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [mozIGeckoMediaPluginService.addPluginDirectory]\" nsresult: \"0x80040111 (NS_ERROR_NOT_AVAILABLE)\" location: \"JS frame :: resource://gre/modules/addons/OpenH264Provider.jsm :: OpenH264Provider.startup :: line 266\" data: no]","context":"provider startup","module"

Aggregated data is at https://s3-us-west-2.amazonaws.com/telemetry-public-analysis/addon_perf/data/weekly_exceptions_20140915.csv.gz, quick&dirty spreadsheet at https://docs.google.com/spreadsheets/d/1xwCv8s8mxTIwYWkqBZVssVWAlVW3Clm51V8gLliE-v8/edit?usp=sharing
We should hop on figuring out why this is failing so often on Linux.

I don't see any non-Linux failures in that spreadsheet.
Flags: needinfo?(georg.fritzsche)
This is here, so not critical now for the provider setup in release:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/toolkit/mozapps/extensions/internal/OpenH264Provider.jsm#l271

Looking at addPluginDirectory, this seems the likely source:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/content/media/gmp/GMPService.cpp#l392
Do we actually support sandboxing on Linux yet? Should OpenH264 work there?
Flags: needinfo?(rjesup)
Flags: needinfo?(jld)
Flags: needinfo?(georg.fritzsche)
We do support sandboxing on linux, and OpenH264 should work there.
Flags: needinfo?(rjesup)
Hm, and 29% percent sessions seems a little high for scenarios where MOZ_DISABLE_GMP_SANDBOX is set:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/security/sandbox/linux/Sandbox.h#l26
I'm hoping for some input from jld.
The next-best option i can think of would be to confirm the failure source in GeckoMediaPluginService::AddPluginDirectory() through Telemetry.
Hmm. My exception telemetry hook (AddonManagerPrivate.recordException) is JS-only. You could add new NS_ERROR codes (or just choose different existing codes) to indicate exactly where in the call chain the error was found.

(/me shakes fist at error handling systems that can't report context information)

The JS caller still needs to handle the error, because there are legitimate reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.
(In reply to :Irving Reid from comment #6)
> Hmm. My exception telemetry hook (AddonManagerPrivate.recordException) is
> JS-only. You could add new NS_ERROR codes (or just choose different existing
> codes) to indicate exactly where in the call chain the error was found.
> 
> (/me shakes fist at error handling systems that can't report context
> information)

/me joins

> The JS caller still needs to handle the error, because there are legitimate
> reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.


What would those be?  We're not writing to the directory/profile or creating the directory (and for that matter, we aren't even creating a sandboxed GMP process here either - that only happens when it gets used).  Not to discount handling errors!  but I want to know what you're thinking of.
[Tracking Requested - why for this release]:
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> [Tracking Requested - why for this release]:

... This may affect Loop functionality for a significant chunk of the Linux population.
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > [Tracking Requested - why for this release]:
> 
> ... This may affect Loop functionality for a significant chunk of the Linux
> population.

It's unlikely to impact Loop functionality because Desktop-Desktop calls currently prefer VP8.  It would impact calls between the Loop mobile app on FxOS and Linux Desktop users who had this problem;  those calls would have to use VP8.  

This would also impact interop between Linux Desktop and any partner endpoints that only supported H.264.
(In reply to Randell Jesup [:jesup] from comment #7)
> (In reply to :Irving Reid from comment #6)
...
> > The JS caller still needs to handle the error, because there are legitimate
> > reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.
> 
> 
> What would those be?  We're not writing to the directory/profile or creating
> the directory (and for that matter, we aren't even creating a sandboxed GMP
> process here either - that only happens when it gets used).  Not to discount
> handling errors!  but I want to know what you're thinking of.

If Sandbox.isSupported ends up false at http://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/Sandbox.cpp#74, calls to addPluginDirectory will throw. Maybe that condition should be handled some other way, but with the current design we should catch the exception.
(In reply to :Irving Reid from comment #11)

> If Sandbox.isSupported ends up false at
> http://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/Sandbox.
> cpp#74, calls to addPluginDirectory will throw. Maybe that condition should
> be handled some other way, but with the current design we should catch the
> exception.

So that would be a linux system that didn't support the fundamentals needed for sandboxing ("Determine whether seccomp-bpf is supported by trying to enable it with an invalid pointer for the filter.")  Perhaps these are VMs?  or much older/odder linux systems?  Can we get a more detailed host system breakdown than 'linux'?
Flags: needinfo?(irving)
I could pull a sample of telemetry packets showing this error; the useful information is likely to be in the System Information section (you can see what we get by looking at about:telemetry on your own browser)

'version' is the kernel version ID string, 'arch' the CPU architecture, 'adapter*' the video adapter (may allow you to identify VMs, e.g. VMware shows up in video adapter)
Flags: needinfo?(irving)
This is the result of bug 1043733; see also bug 1039819 and bug 1041885.

To summarize: I used the Firefox Health Report data on Linux kernel version (the string printed by `uname -r`) for profiles from the then-current release (Firefox 30) to estimate seccomp-bpf support; the fraction lacking it should have been an order of magnitude less than what's reported in comment #0.

Questions:
1. What is the cause of this discrepancy?
2. Can we resolve the discrepancy and arrive at a better estimate of the actual figure?
3. How many of the seccomp-lacking users are on EOL'ed versions of their distributions?

Sandboxing was never a hard requirement for OpenH264 as far as I know, so we could triple-land a patch to disable bug 1043733 if needed (release management willing)… but then we need to figure out what to do for EME.  Options are:

1. Ship a setuid-root executable, so we can use the full Chromium sandbox.  Bug 1041885 goes into some of the problems here.  Additionally: for users who lack seccomp-bpf support and haven't been getting regular security updates, this sandbox is probably escapeable via known kernel vulnerabilities.

2. Continue desupporting non-seccomp-capable users for EME, but let them run OpenH264 unsandboxed.
Assignee: nobody → jld
Flags: needinfo?(jld)
See Also: → 1043733, 1039819, 1041885
(In reply to :Irving Reid from comment #13)
> I could pull a sample of telemetry packets showing this error; the useful
> information is likely to be in the System Information section (you can see
> what we get by looking at about:telemetry on your own browser)
> 
> 'version' is the kernel version ID string, 'arch' the CPU architecture,
> 'adapter*' the video adapter (may allow you to identify VMs, e.g. VMware
> shows up in video adapter)

The kernel version string should be helpful.

For the architecture, non-x86 systems don't enable MOZ_GMP_SANDBOX, so they shouldn't be showing this error, but it couldn't hurt to have it.  VMness isn't directly relevant (in terms of whether sandboxing works), but combined with the kernel version info it might tell us something.
ni? for telemetry data with kernel versions.
Flags: needinfo?(irving)
So this is... odd.  Here's a summary of the FHR data submitted since 2014-09-02 (Firefox 32 release).  Following the Metrics group's usual procedures, it's based on a random sample rather than the entire data set.  The columns are version, estimated percent with seccomp-bpf support, and total number of records.

21 85 511
22 96 866
23 85 1108
24 45 4462
25 96 1457
26 92 2805
27 84 1133
28 96 2844
29 96 2493
30 96 4233
31 93 10230
32 97 15078
33 74 31
34 100 21
35 100 21

Note in particular the line for Firefox 33.  I think it's worth rerunning this with the full FHR data, but at the moment it looks more likely that Beta is, for whatever reason, significantly unrepresentative.
Merging in Health Reports submitted during the Firefox 31 cycle (and not superseded by a more recent report from the same profile), grouped by version relative to current (so 0 is release, 1 is beta, etc.):

-3 96 4919
-2 96 6259
-1 94 14802
0 96 25851
1 74 39
2 100 34
3 95 44

Again, beta stands out.  Some other demographic differences between beta and release:

Debian stable/oldstable: 17.9% of beta, 0.8% of release
           RHEL 6 and 5:  5.2% of beta, 0.2% of release
   Ubuntu, all versions: 58% of beta,  90% of release

(Disclaimer: these are guesses based on the version string.  /-generic(-|$)/ → Ubuntu, /el[56]/ → RHEL 5/6, /-(686(|-pae|-bigmem)|amd64)$/ → Debian; Debian and (3.2.0 or 2.6.32) → stable/oldstable.  Also, the seccomp-bpf approximation is: ≥ 3.5 or (Ubuntu and 3.2).)

The Debian and RedHat versions in question are known to lack seccomp-bpf support and are still supported, whereas Ubuntu backported seccomp-bpf for 12.04 which is its oldest supported desktop version.

So this would explain the disturbingly large number reported for beta telemetry in comment #0.
Attached file beta-33-versions
This is count vs. 'version' string for Beta 33 telemetry received on 2014-09-18 (I have a separate file for Nightly 35 on the same day)

I didn't count the total number of Linux reports, so I don't have a percentage, but this confirms that the exceptions are happening in older kernels.
Flags: needinfo?(irving)
Attached file nightly-35-versions
Count vs. kernel version for Nightly 35 telemetry received on 2014-09-18 that contained the NOT_AVAILABLE exception.
(In reply to :Irving Reid from comment #19)
> I didn't count the total number of Linux reports, so I don't have a
> percentage, but this confirms that the exceptions are happening in older
> kernels.

And exactly the older kernels expected: < 3.5 in general, < 3.2 for Ubuntu "-generic", RHEL 6 but not 7, no Fedora ("fc") newer than 15 (actually lower than the expected 17, but I'm not complaining, and in any case 19 is the oldest currently supported).  One or two exceptions in the nightly data, which I'm guessing are local custom builds.

So, the answer to the questions in comment #14 is that the telemetry stats on beta aren't a reason to start panicking about release, and there's no reason to doubt the FHR data, but they might be a reason to start worrying about beta/aurora/nightly users.
Should we be reporting this error differently, from a programming/telemetry perspective?  It looks like there wasn't a resolution to what was brought up in comment #6 and comment #7.

(There is a separate question of what we should do about the UX of this error condition.  That needs a separate bug.)
Flags: needinfo?(rjesup)
Tracking since H.264 is important.
FYI, we have just released beta 6. So, a fix will need to land soon if we want to see that bug fixed in 33.
(In reply to Jed Davis [:jld] from comment #22)
> Should we be reporting this error differently, from a programming/telemetry
> perspective?  It looks like there wasn't a resolution to what was brought up
> in comment #6 and comment #7.

It's showing up in telemetry because it's an unhandled exception during provider start-up (and thus *may* indicate bustage in the field). Catching the exception inside your startup() and cleaning up the provider initialization appropriately is all I'm looking for.

If you want to add telemetry probes to explicitly count how many clients have sandboxing available and/or disabled, I'd treat that as a separate bug.
So, we should handle catching the exception in this bug and do a follow-up for proper reporting of this.

(In reply to Sylvestre Ledru [:sylvestre] from comment #23)
> Tracking since H.264 is important.
> FYI, we have just released beta 6. So, a fix will need to land soon if we
> want to see that bug fixed in 33.

Per comment 2 and comment 21 this is not critical.
Blocks: 1072363
(In reply to :Irving Reid from comment #24)
> If you want to add telemetry probes to explicitly count how many clients
> have sandboxing available and/or disabled, I'd treat that as a separate bug.

I filed bug 1072363 on reporting an error so we can tell, although we could morph it to just to proper reporting too. Jed, is this something you can take?
Flags: needinfo?(jld)
Georg, OK. Thanks. Not blocking 33 for this then.
I attached a patch to bug 1072363 which removes OpenH264 from the plugin list in about:addons and avoids the addPluginDirectory (by disabling the OpenH264Provider) if GMP is disabled for sandboxing reasons… but, rereading the earlier comments, maybe that patch makes more sense on this bug?

The one thing I haven't done yet is note the error in a form that telemetry or FHR would pick up.  But if they collect add-on information, that might also work.
Flags: needinfo?(jld)
Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Review of attachment 8494579 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/mozapps/extensions/internal/OpenH264Provider.jsm
@@ +271,5 @@
> +      try {
> +        gmpService.addPluginDirectory(this.gmpPath);
> +      } catch (e if e.name == 'NS_ERROR_NOT_AVAILABLE') {
> +        this._log.warning("startup() - adding gmp directory failed with " + e.name + " - sandboxing not available?");
> +      }

This is fine. Note that Log.jsm knows how to format exceptions, so you could use:

this._log.warming("startup() - adding gmp directory failed, sandboxing may not be available", e);

though in this case the extra info in the formatted exception wouldn't be very helpful, so no need to change this patch.
Attachment #8494579 - Flags: review?(irving) → review+
Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Approval Request Comment
[Feature/regressing bug #]: OpenH264 integration
[User impact if declined]: Avoids spamming Addon exception telemetry with a known issue.
[Describe test coverage new/current, TBPL]: We have proper test-coverage for this path.
[Risks and why]: Low-risk, this just trivially catches one expected exception.
[String/UUID change made/needed]: None.
Attachment #8494579 - Flags: approval-mozilla-beta?
Attachment #8494579 - Flags: approval-mozilla-aurora?
Flags: needinfo?(rjesup)
Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Approving even it is not in m-c to make sure it is in beta 8
Attachment #8494579 - Flags: approval-mozilla-beta?
Attachment #8494579 - Flags: approval-mozilla-beta+
Attachment #8494579 - Flags: approval-mozilla-aurora?
Attachment #8494579 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/d71444b75291
Assignee: jld → georg.fritzsche
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Depends on: 1079312
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: