1070036 - [Linux] Exceptions during openh264 provider initialization

:Irving Reid (No longer working on Firefox)

Reporter

Description

•

10 years ago

The add-on manager reports exceptions during provider start up via Telemetry. Recent telemetry reports are showing a high frequency (29% of beta 33 sessions) of uncaught exceptions in the OpenH264 provider initialization:

{"message":"[Exception... \"Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [mozIGeckoMediaPluginService.addPluginDirectory]\" nsresult: \"0x80040111 (NS_ERROR_NOT_AVAILABLE)\" location: \"JS frame :: resource://gre/modules/addons/OpenH264Provider.jsm :: OpenH264Provider.startup :: line 266\" data: no]","context":"provider startup","module"

Aggregated data is at https://s3-us-west-2.amazonaws.com/telemetry-public-analysis/addon_perf/data/weekly_exceptions_20140915.csv.gz, quick&dirty spreadsheet at https://docs.google.com/spreadsheets/d/1xwCv8s8mxTIwYWkqBZVssVWAlVW3Clm51V8gLliE-v8/edit?usp=sharing

Randell Jesup [:jesup] (needinfo me)

Comment 1

•

10 years ago

We should hop on figuring out why this is failing so often on Linux.

I don't see any non-Linux failures in that spreadsheet.

Flags: needinfo?(georg.fritzsche)

Georg Fritzsche [:gfritzsche]

Assignee

Comment 2

•

10 years ago

This is here, so not critical now for the provider setup in release:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/toolkit/mozapps/extensions/internal/OpenH264Provider.jsm#l271

Looking at addPluginDirectory, this seems the likely source:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/content/media/gmp/GMPService.cpp#l392
Do we actually support sandboxing on Linux yet? Should OpenH264 work there?

Flags: needinfo?(rjesup)

Flags: needinfo?(jld)

Flags: needinfo?(georg.fritzsche)

Randell Jesup [:jesup] (needinfo me)

Comment 3

•

10 years ago

We do support sandboxing on linux, and OpenH264 should work there.

Flags: needinfo?(rjesup)

Georg Fritzsche [:gfritzsche]

Assignee

Comment 4

•

10 years ago

Hm, and 29% percent sessions seems a little high for scenarios where MOZ_DISABLE_GMP_SANDBOX is set:
https://hg.mozilla.org/releases/mozilla-beta/file/f14c89b414b6/security/sandbox/linux/Sandbox.h#l26

Georg Fritzsche [:gfritzsche]

Assignee

Comment 5

•

10 years ago

I'm hoping for some input from jld.
The next-best option i can think of would be to confirm the failure source in GeckoMediaPluginService::AddPluginDirectory() through Telemetry.

:Irving Reid (No longer working on Firefox)

Reporter

Comment 6

•

10 years ago

Hmm. My exception telemetry hook (AddonManagerPrivate.recordException) is JS-only. You could add new NS_ERROR codes (or just choose different existing codes) to indicate exactly where in the call chain the error was found.

(/me shakes fist at error handling systems that can't report context information)

The JS caller still needs to handle the error, because there are legitimate reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.

Randell Jesup [:jesup] (needinfo me)

Comment 7

•

10 years ago

(In reply to :Irving Reid from comment #6)
> Hmm. My exception telemetry hook (AddonManagerPrivate.recordException) is
> JS-only. You could add new NS_ERROR codes (or just choose different existing
> codes) to indicate exactly where in the call chain the error was found.
> 
> (/me shakes fist at error handling systems that can't report context
> information)

/me joins

> The JS caller still needs to handle the error, because there are legitimate
> reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.


What would those be?  We're not writing to the directory/profile or creating the directory (and for that matter, we aren't even creating a sandboxed GMP process here either - that only happens when it gets used).  Not to discount handling errors!  but I want to know what you're thinking of.

Georg Fritzsche [:gfritzsche]

Assignee

Comment 8

•

10 years ago

[Tracking Requested - why for this release]:

tracking-firefox33: --- → ?

Georg Fritzsche [:gfritzsche]

Assignee

Comment 9

•

10 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> [Tracking Requested - why for this release]:

... This may affect Loop functionality for a significant chunk of the Linux population.

Maire Reavy [:mreavy]

Comment 10

•

10 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > [Tracking Requested - why for this release]:
> 
> ... This may affect Loop functionality for a significant chunk of the Linux
> population.

It's unlikely to impact Loop functionality because Desktop-Desktop calls currently prefer VP8.  It would impact calls between the Loop mobile app on FxOS and Linux Desktop users who had this problem;  those calls would have to use VP8.  

This would also impact interop between Linux Desktop and any partner endpoints that only supported H.264.

:Irving Reid (No longer working on Firefox)

Reporter

Comment 11

•

10 years ago

(In reply to Randell Jesup [:jesup] from comment #7)
> (In reply to :Irving Reid from comment #6)
...
> > The JS caller still needs to handle the error, because there are legitimate
> > reasons for addPluginDirectory to throw NS_ERROR_NOT_AVAILABLE.
> 
> 
> What would those be?  We're not writing to the directory/profile or creating
> the directory (and for that matter, we aren't even creating a sandboxed GMP
> process here either - that only happens when it gets used).  Not to discount
> handling errors!  but I want to know what you're thinking of.

If Sandbox.isSupported ends up false at http://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/Sandbox.cpp#74, calls to addPluginDirectory will throw. Maybe that condition should be handled some other way, but with the current design we should catch the exception.

Randell Jesup [:jesup] (needinfo me)

Comment 12

•

10 years ago

(In reply to :Irving Reid from comment #11)

> If Sandbox.isSupported ends up false at
> http://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/Sandbox.
> cpp#74, calls to addPluginDirectory will throw. Maybe that condition should
> be handled some other way, but with the current design we should catch the
> exception.

So that would be a linux system that didn't support the fundamentals needed for sandboxing ("Determine whether seccomp-bpf is supported by trying to enable it with an invalid pointer for the filter.")  Perhaps these are VMs?  or much older/odder linux systems?  Can we get a more detailed host system breakdown than 'linux'?

Flags: needinfo?(irving)

:Irving Reid (No longer working on Firefox)

Reporter

Comment 13

•

10 years ago

I could pull a sample of telemetry packets showing this error; the useful information is likely to be in the System Information section (you can see what we get by looking at about:telemetry on your own browser)

'version' is the kernel version ID string, 'arch' the CPU architecture, 'adapter*' the video adapter (may allow you to identify VMs, e.g. VMware shows up in video adapter)

Flags: needinfo?(irving)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 14

•

10 years ago

This is the result of bug 1043733; see also bug 1039819 and bug 1041885.

To summarize: I used the Firefox Health Report data on Linux kernel version (the string printed by `uname -r`) for profiles from the then-current release (Firefox 30) to estimate seccomp-bpf support; the fraction lacking it should have been an order of magnitude less than what's reported in comment #0.

Questions:
1. What is the cause of this discrepancy?
2. Can we resolve the discrepancy and arrive at a better estimate of the actual figure?
3. How many of the seccomp-lacking users are on EOL'ed versions of their distributions?

Sandboxing was never a hard requirement for OpenH264 as far as I know, so we could triple-land a patch to disable bug 1043733 if needed (release management willing)… but then we need to figure out what to do for EME.  Options are:

1. Ship a setuid-root executable, so we can use the full Chromium sandbox.  Bug 1041885 goes into some of the problems here.  Additionally: for users who lack seccomp-bpf support and haven't been getting regular security updates, this sandbox is probably escapeable via known kernel vulnerabilities.

2. Continue desupporting non-seccomp-capable users for EME, but let them run OpenH264 unsandboxed.

Assignee: nobody → jld

Flags: needinfo?(jld)

Comment 15

•

10 years ago

(In reply to :Irving Reid from comment #13)
> I could pull a sample of telemetry packets showing this error; the useful
> information is likely to be in the System Information section (you can see
> what we get by looking at about:telemetry on your own browser)
> 
> 'version' is the kernel version ID string, 'arch' the CPU architecture,
> 'adapter*' the video adapter (may allow you to identify VMs, e.g. VMware
> shows up in video adapter)

The kernel version string should be helpful.

For the architecture, non-x86 systems don't enable MOZ_GMP_SANDBOX, so they shouldn't be showing this error, but it couldn't hurt to have it.  VMness isn't directly relevant (in terms of whether sandboxing works), but combined with the kernel version info it might tell us something.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 16

•

10 years ago

ni? for telemetry data with kernel versions.

Flags: needinfo?(irving)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 17

•

10 years ago

So this is... odd.  Here's a summary of the FHR data submitted since 2014-09-02 (Firefox 32 release).  Following the Metrics group's usual procedures, it's based on a random sample rather than the entire data set.  The columns are version, estimated percent with seccomp-bpf support, and total number of records.

21 85 511
22 96 866
23 85 1108
24 45 4462
25 96 1457
26 92 2805
27 84 1133
28 96 2844
29 96 2493
30 96 4233
31 93 10230
32 97 15078
33 74 31
34 100 21
35 100 21

Note in particular the line for Firefox 33.  I think it's worth rerunning this with the full FHR data, but at the moment it looks more likely that Beta is, for whatever reason, significantly unrepresentative.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 18

•

10 years ago

Merging in Health Reports submitted during the Firefox 31 cycle (and not superseded by a more recent report from the same profile), grouped by version relative to current (so 0 is release, 1 is beta, etc.):

-3 96 4919
-2 96 6259
-1 94 14802
0 96 25851
1 74 39
2 100 34
3 95 44

Again, beta stands out.  Some other demographic differences between beta and release:

Debian stable/oldstable: 17.9% of beta, 0.8% of release
           RHEL 6 and 5:  5.2% of beta, 0.2% of release
   Ubuntu, all versions: 58% of beta,  90% of release

(Disclaimer: these are guesses based on the version string.  /-generic(-|$)/ → Ubuntu, /el[56]/ → RHEL 5/6, /-(686(|-pae|-bigmem)|amd64)$/ → Debian; Debian and (3.2.0 or 2.6.32) → stable/oldstable.  Also, the seccomp-bpf approximation is: ≥ 3.5 or (Ubuntu and 3.2).)

The Debian and RedHat versions in question are known to lack seccomp-bpf support and are still supported, whereas Ubuntu backported seccomp-bpf for 12.04 which is its oldest supported desktop version.

So this would explain the disturbingly large number reported for beta telemetry in comment #0.

:Irving Reid (No longer working on Firefox)

Reporter

Comment 19

•

10 years ago

Attached file beta-33-versions — Details

This is count vs. 'version' string for Beta 33 telemetry received on 2014-09-18 (I have a separate file for Nightly 35 on the same day)

I didn't count the total number of Linux reports, so I don't have a percentage, but this confirms that the exceptions are happening in older kernels.

Flags: needinfo?(irving)

:Irving Reid (No longer working on Firefox)

Reporter

Comment 20

•

10 years ago

Attached file nightly-35-versions — Details

Count vs. kernel version for Nightly 35 telemetry received on 2014-09-18 that contained the NOT_AVAILABLE exception.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 21

•

10 years ago

(In reply to :Irving Reid from comment #19)
> I didn't count the total number of Linux reports, so I don't have a
> percentage, but this confirms that the exceptions are happening in older
> kernels.

And exactly the older kernels expected: < 3.5 in general, < 3.2 for Ubuntu "-generic", RHEL 6 but not 7, no Fedora ("fc") newer than 15 (actually lower than the expected 17, but I'm not complaining, and in any case 19 is the oldest currently supported).  One or two exceptions in the nightly data, which I'm guessing are local custom builds.

So, the answer to the questions in comment #14 is that the telemetry stats on beta aren't a reason to start panicking about release, and there's no reason to doubt the FHR data, but they might be a reason to start worrying about beta/aurora/nightly users.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 22

•

10 years ago

Should we be reporting this error differently, from a programming/telemetry perspective?  It looks like there wasn't a resolution to what was brought up in comment #6 and comment #7.

(There is a separate question of what we should do about the UX of this error condition.  That needs a separate bug.)

Flags: needinfo?(rjesup)

Sylvestre Ledru [:Sylvestre]

Comment 23

•

10 years ago

Tracking since H.264 is important.
FYI, we have just released beta 6. So, a fix will need to land soon if we want to see that bug fixed in 33.

status-firefox33: --- → affected

status-firefox34: --- → affected

status-firefox35: --- → affected

tracking-firefox33: ? → +

tracking-firefox34: --- → +

tracking-firefox35: --- → +

:Irving Reid (No longer working on Firefox)

Reporter

Comment 24

•

10 years ago

(In reply to Jed Davis [:jld] from comment #22)
> Should we be reporting this error differently, from a programming/telemetry
> perspective?  It looks like there wasn't a resolution to what was brought up
> in comment #6 and comment #7.

It's showing up in telemetry because it's an unhandled exception during provider start-up (and thus *may* indicate bustage in the field). Catching the exception inside your startup() and cleaning up the provider initialization appropriately is all I'm looking for.

If you want to add telemetry probes to explicitly count how many clients have sandboxing available and/or disabled, I'd treat that as a separate bug.

Georg Fritzsche [:gfritzsche]

Assignee

Comment 25

•

10 years ago

So, we should handle catching the exception in this bug and do a follow-up for proper reporting of this.

(In reply to Sylvestre Ledru [:sylvestre] from comment #23)
> Tracking since H.264 is important.
> FYI, we have just released beta 6. So, a fix will need to land soon if we
> want to see that bug fixed in 33.

Per comment 2 and comment 21 this is not critical.

Georg Fritzsche [:gfritzsche]

Assignee

Updated

•

10 years ago

Blocks: 1072363

Georg Fritzsche [:gfritzsche]

Assignee

Comment 26

•

10 years ago

(In reply to :Irving Reid from comment #24)
> If you want to add telemetry probes to explicitly count how many clients
> have sandboxing available and/or disabled, I'd treat that as a separate bug.

I filed bug 1072363 on reporting an error so we can tell, although we could morph it to just to proper reporting too. Jed, is this something you can take?

Flags: needinfo?(jld)

Georg Fritzsche [:gfritzsche]

Assignee

Comment 27

•

10 years ago

Attached patch Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider — Details — Splinter Review

Attachment #8494579 - Flags: review?(irving)

Sylvestre Ledru [:Sylvestre]

Comment 28

•

10 years ago

Georg, OK. Thanks. Not blocking 33 for this then.

tracking-firefox33: + → -

tracking-firefox34: + → -

tracking-firefox35: + → -

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 29

•

10 years ago

I attached a patch to bug 1072363 which removes OpenH264 from the plugin list in about:addons and avoids the addPluginDirectory (by disabling the OpenH264Provider) if GMP is disabled for sandboxing reasons… but, rereading the earlier comments, maybe that patch makes more sense on this bug?

The one thing I haven't done yet is note the error in a form that telemetry or FHR would pick up.  But if they collect add-on information, that might also work.

Flags: needinfo?(jld)

:Irving Reid (No longer working on Firefox)

Reporter

Comment 30

•

10 years ago

Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Review of attachment 8494579 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/mozapps/extensions/internal/OpenH264Provider.jsm
@@ +271,5 @@
> +      try {
> +        gmpService.addPluginDirectory(this.gmpPath);
> +      } catch (e if e.name == 'NS_ERROR_NOT_AVAILABLE') {
> +        this._log.warning("startup() - adding gmp directory failed with " + e.name + " - sandboxing not available?");
> +      }

This is fine. Note that Log.jsm knows how to format exceptions, so you could use:

this._log.warming("startup() - adding gmp directory failed, sandboxing may not be available", e);

though in this case the extra info in the formatted exception wouldn't be very helpful, so no need to change this patch.

Attachment #8494579 - Flags: review?(irving) → review+

Georg Fritzsche [:gfritzsche]

Assignee

Comment 31

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/d71444b75291

Georg Fritzsche [:gfritzsche]

Assignee

Comment 32

•

10 years ago

Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Approval Request Comment
[Feature/regressing bug #]: OpenH264 integration
[User impact if declined]: Avoids spamming Addon exception telemetry with a known issue.
[Describe test coverage new/current, TBPL]: We have proper test-coverage for this path.
[Risks and why]: Low-risk, this just trivially catches one expected exception.
[String/UUID change made/needed]: None.

Attachment #8494579 - Flags: approval-mozilla-beta?

Attachment #8494579 - Flags: approval-mozilla-aurora?

Randell Jesup [:jesup] (needinfo me)

Updated

•

10 years ago

Flags: needinfo?(rjesup)

Sylvestre Ledru [:Sylvestre]

Comment 33

•

10 years ago

Comment on attachment 8494579 [details] [diff] [review]
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider

Approving even it is not in m-c to make sure it is in beta 8

Attachment #8494579 - Flags: approval-mozilla-beta?

Attachment #8494579 - Flags: approval-mozilla-beta+

Attachment #8494579 - Flags: approval-mozilla-aurora?

Attachment #8494579 - Flags: approval-mozilla-aurora+

Georg Fritzsche [:gfritzsche]

Assignee

Comment 34

•

10 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/155cfcc5efa9
https://hg.mozilla.org/releases/mozilla-beta/rev/b6985e15046b

status-firefox33: affected → fixed

status-firefox34: affected → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

10 years ago

status-firefox35: affected → fixed

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 35

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/d71444b75291

Assignee: jld → georg.fritzsche

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Georg Fritzsche [:gfritzsche]

Assignee

Updated

•

10 years ago

Depends on: 1079312

beta-33-versions 10 years ago :Irving Reid (No longer working on Firefox) 3.22 KB, text/plain		Details
nightly-35-versions 10 years ago :Irving Reid (No longer working on Firefox) 1.49 KB, text/plain		Details
Catch NS_ERROR_NOT_AVAILABLE in OpenH264Provider 10 years ago Georg Fritzsche [:gfritzsche] 1.62 KB, patch	Irving : review+ Sylvestre : approval-mozilla-aurora+ Sylvestre : approval-mozilla-beta+	Details \| Diff \| Splinter Review