Closed Bug 1354395 Opened 3 years ago Closed 3 years ago

Some older extensions with XPCOM dylib will cause Firefox 53 to crash

Categories

(Core :: js-ctypes, defect)

53 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox53 + wontfix
firefox54 + fixed
firefox55 --- fixed

People

(Reporter: jerry, Assigned: glandium)

References

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.1 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.1

Steps to reproduce:

• Get on a Mac running macOS 10.12.
• Download and install BookMacster version 2.2.18 from here:

https://sheepsystems.com/bookmacster/BookMacster.2.2.18.zip

(Do not get it from the link given on my website because I hope to soon have it replaced with a newer version 2.3 which contains a new WebExtension instead.)

• Launch Firefox 53.0b9 (64-bit) Build ID 20170403072723.
• Launch BookMacster.  Close or cancel any windows that open.
• Click in the menu: BookMacster > Manage Browser Extensions.  A window will open.
• In the top "Syncing" section, row for "Firefox", click button "Install", and follow the instructions given.
• Relaunch Firefox 53.0b9.
• Click in the menu Tools > Add-Ons and verify that you have Sheep Systems Sync Extension version 330 loaded.
• Return to the Manage Browser Extensions window in BookMacster and click "Test" in the row for Firefox.  This sends a message to the .dylib in my extension.



Actual results:

Firefox 53 crashes immediately.  This is reproducible.  Typical crash report:

https://crash-stats.mozilla.com/report/index/bp-9b566232-4dd8-448d-90a7-855cf2170407


Expected results:

Something more graceful, such as Firefox preventing the extension from loading or just not responding to the function call.
The crash occurs because the XPCOM dylib in my old extension calls _NS_GetServiceManager.  Because there is no such symbol in Firefox 53 due to the FIX of Bug 1306329, Firefox crashes.

Mike Hommey asked if my dylib was weak linked.  Maybe so.  The old XPCOM interface was quite complicated.  I do remember that, at some point, I had to build a "xulrunner" library and link it to my dylib.  But I don't have that xulrunner any more.

Now, of course the way it will happen in real life is that users who have been running version 2.2.18 of BookMacster or similar apps, and have installed my old extension, and been happily using it in Firefox 52 or earlier, and have not updated BookMacster before the week of April 18 because they have automatic updating turned off, will find that, after Firefox updates itself to version 53, Firefox will crash whenever they add, delete or change a bookmark in Firefox, or perform certain operations in my app.

Please note that there is apparently nothing in Firefox 53 which prevents my old extension from loading.  The install.rdf in my old extension specifies the maxVersion for Firefox = 51.0, but I read somewhere that Firefox has never enforced this :(
Blocks: 1306329
So, I took a quick look at your addon, and what it's doing is that it's loading a native library with ctypes. Obviously, anything bad can happen in that case, but we should probably avoid random runtime problems when libs loaded with ctypes use symbols that can't be resolved.

Essentially, the problem is that we're loading the ctypes libs with RTLD_LAZY, which makes missing symbols lead to crashes when they are used, rather than the lib failing to load at all.

I think the obvious fix here is something like:

--- a/js/src/ctypes/Library.cpp
+++ b/js/src/ctypes/Library.cpp
@@ -143,17 +143,17 @@ Library::Create(JSContext* cx, HandleValue path, const JSCTypesCallbacks* callba
                 pathStr->length(), pathBytes, &nbytes));
     pathBytes[nbytes] = 0;
   }
 
   libSpec.value.pathname = pathBytes;
   libSpec.type = PR_LibSpec_Pathname;
 #endif
 
-  PRLibrary* library = PR_LoadLibraryWithFlags(libSpec, 0);
+  PRLibrary* library = PR_LoadLibraryWithFlags(libSpec, PR_LD_NOW);
 
 #ifndef XP_WIN
   JS_free(cx, pathBytes);
 #endif
 
   if (!library) {
 #define MAX_ERROR_LEN 1024
     char error[MAX_ERROR_LEN] = "Cannot get error from NSPR.";


But I don't know if some things are actually relying on the RTLD_LAZY behavior... (and would in turn, break ; but they would at least break more gracefully than the opposite breakage)

Benjamin, what do you think?
Component: Untriaged → js-ctypes
Flags: needinfo?(benjamin)
Product: Firefox → Core
FWIW, the patch in comment 2 breaks hazard builds with:
[task 2017-04-07T03:45:05.304063Z] /home/worker/checkouts/gecko/js/src/devtools/rootAnalysis/utility.js:146:15 Error: couldn't open library /home/worker/workspace/sixgill/usr/bin/xdb.so: /home/worker/workspace/sixgill/usr/bin/xdb.so: undefined symbol: __gmpz_sizeinbase
Flags: needinfo?(sphink)
Oh wow, I definitely don't think we should be using RTLD_LAZY for ctypes! That's a footgun.

I'm really only worried about whether this affects startup performance because os.file uses ctypes.
Flags: needinfo?(benjamin)
Depends on: 1354553
I filed bug 1354553 to add the relevant extensions here to the blocklist for Firefox 53+.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Oh wow, I definitely don't think we should be using RTLD_LAZY for ctypes!
> That's a footgun.
> 
> I'm really only worried about whether this affects startup performance
> because os.file uses ctypes.

Mmmm I wonder what the linker actually does when an already RTLD_LAZY-loaded lib is then loaded with RTLD_NOW...
(looking glibc source) it looks like, from a quick glance at the code, that it doesn't do anything wrt symbols. The question is still open for OSX, though, but I think it's likely it's the same.

That still leaves us with the xdb.so problem on hazard builds, though :-/
[Tracking Requested - why for this release]:
Given that there is a possibility that a number of add-on ill crash Firefox as soon as their code runs (potentially startup), this should be on the release management radar. It looks like the concrete add-on listed here is now blocklisted but we may want to keep our eyes open to see if there are others.
We can track this for 53, but we're already heading into the RC build today. 
This sounds maybe more complicated than we can handle before the release. 

glandium, are you intending to take this on or do you know who might?
Flags: needinfo?(mh+mozilla)
I can take it, the patch is rather trivial, but we need to find a way out for hazard build, which is waiting for sfink's answer to the ni?
Flags: needinfo?(mh+mozilla)
What's happening is that the hazard analysis runs a script using the JS shell to load in databases constructed earlier, and it uses ctypes to dlopen xdb.so. It looks like xdb.so has a number of undefined gmp symbols that it never actually calls, which was previously harmless but with RTLD_NOw will complain about. The set of symbols (for my local build, at least, which might be using different versions of things):

                 U __gmpz_add_ui
                 U __gmpz_clear
                 U __gmpz_cmp_ui
                 U __gmpz_fits_slong_p
                 U __gmpz_get_si
                 U __gmpz_get_str
                 U __gmpz_init
                 U __gmpz_mul_si
                 U __gmpz_sizeinbase

I don't really understand why, given that I'm linking to gmp statically. But I guess I either need to force it to resolve these symbols against the static lib, or stop it from emitting those symbols in the first place. Looking...
Thanks. If you end up feeling like this should go into 53, please let me know - we can do an RC2 build if we have to.
I double checked what happens in practice when a library previously dlopen'ed with RTLD_LAZY is re-dlopen'ed with RTLD_NOW, on both Linux and OSX. In both cases, there is no forced bindings happening on the second dlopen, which means OS.file (or anything else) dlopen'ing libxul through ctypes won't cause symbols to be bound.

FWIW, this is how the behavior is defined in POSIX:

    RTLD_NOW
        All necessary relocations shall be performed when the object is first loaded.

    (...)

    If a file is specified in multiple dlopen() invocations, mode is interpreted at each invocation. Note, however, that once RTLD_NOW has been specified all relocations shall have been completed rendering further RTLD_NOW operations redundant and any further RTLD_LAZY operations irrelevant.

(http://pubs.opengroup.org/onlinepubs/009695399/functions/dlopen.html)
Comment on attachment 8856890 [details]
Bug 1354395 - Update sixgill to a rebuild that links against GMP statically.

https://reviewboard.mozilla.org/r/128804/#review131398

If that version works, I'm fine with it. I rebuilt https://queue.taskcluster.net/v1/task/Y7_CuK8USHGx6w913ir5vQ/runs/0/artifacts/public/sixgill.tar.xz using my toolchain job, but I didn't have time yesterday to re-figure out how to upload to tooltool. I'm fine with using either version.
Attachment #8856890 - Flags: review?(sphink) → review+
Comment on attachment 8856891 [details]
Bug 1354395 - Always bind symbols at load time for ctypes libraries.

https://reviewboard.mozilla.org/r/128806/#review131610

Given the time constrains, my recommendation is to *not* land this into 53.
Attachment #8856891 - Flags: review?(benjamin) → review+
(In reply to Benjamin Smedberg [:bsmedberg] from comment #18)
> Comment on attachment 8856891 [details]
> Bug 1354395 - Always bind symbols at load time for ctypes libraries.
> 
> https://reviewboard.mozilla.org/r/128806/#review131610
> 
> Given the time constrains, my recommendation is to *not* land this into 53.

Users caught in a crash loop at startup will end up in safe mode with addons disabled, right?
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/4b7a0c136074
Update sixgill to a rebuild that links against GMP statically. r=sfink
https://hg.mozilla.org/integration/autoland/rev/fe837e8247ba
Always bind symbols at load time for ctypes libraries. r=bsmedberg
Flags: needinfo?(sphink)
Assignee: nobody → mh+mozilla
Comment on attachment 8856890 [details]
Bug 1354395 - Update sixgill to a rebuild that links against GMP statically.

Approval Request Comment
[Feature/Bug causing the regression]: This patch fixes a problem that is unveiled by the second patch in this bug.
[User impact if declined]: Hazard build bustage when landing the other patch.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: On autoland, yes.
[Needs manual test from QE? If yes, steps to reproduce]: N/A
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: It shouldn't be.
[Why is the change risky/not risky?]: It's a rebuild of the compiler plugin used for the hazard builds. AIUI, there are unit tests checking that the plugin catches what it's supposed to catch.
[String changes made/needed]: N/A
Attachment #8856890 - Flags: approval-mozilla-aurora?
Comment on attachment 8856891 [details]
Bug 1354395 - Always bind symbols at load time for ctypes libraries.

Approval Request Comment
[Feature/Bug causing the regression]: Technically, this has been a problem for a long time, but the removal of some core symbols made the problem more apparent in 53. However, it's late in the 53 schedule to fully appreciate the risk of this change. It may break addons that otherwise work fine, while fixing crashes due to addons that load native code with ctypes and use symbols that aren't available.
[User impact if declined]: Crashes when using addons that load native code with ctypes and use symbols that aren't available.
[Is this code covered by automated tests?]: Only for the "normal" case (where ctypes-loaded libraries don't have missing symbols)
[Has the fix been verified in Nightly?]: On autoland.
[Needs manual test from QE? If yes, steps to reproduce]: N/A
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: To some degree.
[Why is the change risky/not risky?]: It may break addons that work fine currently (because for some reason they don't actually use the missing symbols, so lazy binding makes the missing symbols a non-issue), while fixing crashes due to addons that load native code with ctypes and use symbols that aren't available.
[String changes made/needed]: N/A
Attachment #8856891 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/4b7a0c136074
https://hg.mozilla.org/mozilla-central/rev/fe837e8247ba
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment on attachment 8856890 [details]
Bug 1354395 - Update sixgill to a rebuild that links against GMP statically.

Fix a bustage. Aurora54+.
Attachment #8856890 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8856891 [details]
Bug 1354395 - Always bind symbols at load time for ctypes libraries.

Fix a potential crash. Aurora54+.
Attachment #8856891 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
has conflicts uplifting to aurora so need rebasing
Flags: needinfo?(mh+mozilla)
(In reply to Carsten Book [:Tomcat] from comment #26)
> has conflicts uplifting to aurora so need rebasing

You can apply the conflicting patch if you pass it through `sed 's/^\(.\) */\1/'`
Flags: needinfo?(mh+mozilla)
Is this something we could/should consider for a desktop dot-release ride-along?
Flags: needinfo?(lhenry)
The description of risk in comment 22 (and in general in the comments here) make me think it might be better to wait for 54. 

I left this tracked for 53 in case we see widespread reports of problems.
Flags: needinfo?(lhenry)
You need to log in before you can comment on or make changes to this bug.