Closed Bug 608109 Opened 14 years ago Closed 14 years ago

Off-thread crypto must not send closures across threads

Categories

(Firefox :: Sync, defect)

x86
All
defect
Not set
critical

Tracking

()

RESOLVED WONTFIX
Tracking Status
blocking2.0 --- beta7+

People

(Reporter: gal, Unassigned)

References

Details

(Keywords: crash, regression, reproducible)

JavaScript is single threaded now and sending closures across threads is no longer supported. Chrome workers are probably the proper tool to use here.
This is very like a major cause of the JSString::flatten() crashes that are blocking b7.
No longer depends on: 604756, 606705
Blocks: 604756
No longer blocks: 602157
Blocks: 570619
(In reply to comment #0)
> JavaScript is single threaded now

Shrug! So should we stop exposing nsIThreadManager::newThread to JavaScript then?

> and sending closures across threads is no
> longer supported. Chrome workers are probably the proper tool to use here.

My first prototype for threaded crypto was based on Chrome workers but I ran into a few issues:

* Chrome workers are hung of XUL DOMs, making it hard to get them in JavaScript modules that want to be independent of the application they're being used in (Firefox, Fennec, Thunderbird, SeaMonkey, ...). It's not impossible, my prototype just grabbed the constructor off the first XUL window it found, but I'm not sure how safe this is.

* Chrome workers are lacking various globals we need, e.g. atob/btoa. And I don't remember if we also have access to Components.* but I'm guessing we don't.
(In reply to comment #2)
> (In reply to comment #0)
> > JavaScript is single threaded now
> 
> Shrug! So should we stop exposing nsIThreadManager::newThread to JavaScript
> then?

You have a good point, one I made like a cranky old man when people talked blithely about ripping out JS_THREADSAFE and "seeing what breaks". We have had JS_THREADSAFE and XPCOM threads for over a decade. We can't just "see what breaks" without lots of schedule and quality risk, some already come due and paid in sweat and tears if not blood.

One answer is to rip out the long-standing APIs. This will cause more work, and it may be unnecessary for the same-thread runnable uses from JS of threadmgr.

Another is MT wrappers (bug 566951). I still think we must do this, to mitigate all the unknown uses of threadmgr from JS.

> * Chrome workers are hung of XUL DOMs, making it hard to get them in JavaScript
> modules that want to be independent of the application they're being used in
> (Firefox, Fennec, Thunderbird, SeaMonkey, ...). It's not impossible, my
> prototype just grabbed the constructor off the first XUL window it found, but
> I'm not sure how safe this is.
> 
> * Chrome workers are lacking various globals we need, e.g. atob/btoa. And I
> don't remember if we also have access to Components.* but I'm guessing we
> don't.

These seem fixable at some cost. Bent?

/be
Hm. atob/btoa would be easy to add. Making workers independent of DOM windows is a bit harder, not sure exactly what assumptions that would break.

And yeah, Components is still off limits.
(In reply to comment #0)
> JavaScript is single threaded now and sending closures across threads is no
> longer supported. Chrome workers are probably the proper tool to use here.
Er, so where/when was this major change announced?
(In reply to comment #4)
> And yeah, Components is still off limits.

To be fair, WeaveCrypto.js only uses Components in a couple of places, all of which could be avoided or placed outside the worker.
We did a code audit as part of the compartment landing. Unfortunately the sync stuff landed just about in parallel to that. Anyway, lets make a list of stuff we have to fix to get crypto off the ground with chrome workers. This blocks b7.
(In reply to comment #7)
> We did a code audit as part of the compartment landing. Unfortunately the sync
> stuff landed just about in parallel to that. Anyway, lets make a list of stuff
> we have to fix to get crypto off the ground with chrome workers. This blocks
> b7.
Sure, but where's the announcement about this change?  I reviewed the Sync code change that introduced this, and had I known that we were going to change our API guarantees, I would have asked for a different approach.  Additionally, I know add-ons out there do this type of thing.  This change *really* needs to be announced loudly.
Sync-hosting apps providing Sync with the DOM window to use for the worker seems like a straightforward solution, and not very burdensome for those apps.  Specifically, we should make sure that we are optimizing *hard* for Firefox on mobile and desktop being great experiences in the shortest timeframe, and not compromising that for the sake of simple generalization to other apps at this time.

Shawn: that's not a thread to start in this bug.  I'd recommend leaning over the wall and asking, or doing so in writing on a mailing list.

Add-ons will have to make changes to support FF4, including/especially where threading is involved; XUL applications likewise.  Le Monde and other dark matter can choose to make that adaptation, stay downrev, or rip and replace everything in favour of a more modern model (HTML + Web Workers, etc.).  Similar choices face those who are on PPC or OS X 10.4, which is likely a larger set of users!

The exceedingly-late timing of this change, and apparently insufficient notice to our own application developers let alone add-on developers, is the problem to my mind -- not the righteous decision to take away the flaky and ill-supported "raw" threading model.  Chrome workers should carry that load, and add-on authors should be supported with documentation and otherwise as needed in making such a transition.  We need to improve our communication here, as with other things that changed in the compartments landing around wrappers, and I'll send some mail around to recruit that help.
(In reply to comment #9)
> 
> The exceedingly-late timing of this change, and apparently insufficient notice
> to our own application developers let alone add-on developers, is the problem
> to my mind -- not the righteous decision to take away the flaky and
> ill-supported "raw" threading model. 

Our own application developers (and add-on developers!) should know to stay away from things that are flaky and ill-supported. This code shouldn't have passed review even if the API and JS_THREADSAFE remained unchanged.

But I do agree that we need to move forward as quickly as possible.
The problem isn't just lack of announcement, it also that Chrome Workers are not yet an option for the reasons Philipp gave.

MT wrappers as of gal's last patch (bitrotted now, best to redo by hand instead of patch --fuzz) make a frozen snapshot of the closure's environment. This is probably safer than the old JS_THREADSAFE.

I think it's bad to tell unknown (and non-listening) app, addon, and other platform users to use what we don't even test or do not have yet (Chrome workers that have Components, etc.) as a "rip and replace" option. Let's do MT wrappers and avoid wasting more time on bugs like this one, if possible.

/be
(In reply to comment #10)
> (In reply to comment #9)
> > 
> > The exceedingly-late timing of this change, and apparently insufficient notice
> > to our own application developers let alone add-on developers, is the problem
> > to my mind -- not the righteous decision to take away the flaky and
> > ill-supported "raw" threading model. 
> 
> Our own application developers (and add-on developers!) should know to stay
> away from things that are flaky and ill-supported. This code shouldn't have
> passed review even if the API and JS_THREADSAFE remained unchanged.

You're right, sharing main-thread-only objects across XPCOM-created threads is just unsafe and always has been. Yet it is too easy to do. So we need at least new error cases, or else MT wrappers.

/be
Yeah, #12 is spot on. Whats not supported has to throw clearly, and we can then improve the API from there. Trying to whip up that patch as we speak.
(In reply to comment #9)
> Sync-hosting apps providing Sync with the DOM window to use for the worker
> seems like a straightforward solution, and not very burdensome for those apps.

How would that work? DOM window can't be shared cross-thread.

> Specifically, we should make sure that we are optimizing *hard* for Firefox on
> mobile and desktop being great experiences in the shortest timeframe, and not
> compromising that for the sake of simple generalization to other apps at this
> time.

We don't even have our own code working correctly with shortest-path JS_THREADSAFE removal, and this bug and the topcrash behind it are evidence.

Shortcut hacking when you're the platform steward will break your own platform uses, along with your customers (who won't react well if you don't give them a migration path -- which we have not).

We don't get to be an app, not a platform. That ship sailed 11 years ago and we have broad and deep platform APIs that currently invite cross-thread JS object sharing. That's a problem but it was not delaying b7 one bit, by itself.

> Add-ons will have to make changes to support FF4, including/especially where
> threading is involved; XUL applications likewise.  Le Monde and other dark
> matter can choose to make that adaptation, stay downrev, or rip and replace
> everything in favour of a more modern model (HTML + Web Workers, etc.). 
> Similar choices face those who are on PPC or OS X 10.4, which is likely a
> larger set of users!

We don't know populations (does Le Monde's readership count) but that's not the point.

The point is we made our own misery here, by taking a shortcut of the kind you are recommending. I argued against a majority position to do MT wrappers before turning off JS_THREADSAFE, threw in the towel, and here we are.

> The exceedingly-late timing of this change, and apparently insufficient notice
> to our own application developers let alone add-on developers, is the problem
> to my mind

That assumes the plan was to break our own code and others -- a bad plan! The longer-term plan was to do MT wrappers and then turn off JS_THREADSAFE, but for perf reasons we went the other way, assuming we could get through a beta with a few addons that used cross-thread JS not working (or, more broken).

> -- not the righteous decision to take away the flaky and
> ill-supported "raw" threading model.

It is not a righteous decision and the topcrash and this bug are proof.

>  Chrome workers should carry that load,

They lack a bunch of APIs. Are we going to add these for Firefox 4?

> and add-on authors should be supported with documentation and otherwise as
> needed in making such a transition.  We need to improve our communication here,
> as with other things that changed in the compartments landing around wrappers,
> and I'll send some mail around to recruit that help.

This misdiagnoses the problem as one of miscommunication, when it was technical miscoordination -- doing work out of order assuming only a few addons, not our own code, would care.

We would probably be better off with JS_THREADSAFE for another beta, allowing time to revive MT wrappers. This is still an option. It's how the b7 with just JM, not compartments, that I was urging a month go would have worked.

/be
Blocks: 608142
(In reply to comment #6)
> To be fair, WeaveCrypto.js only uses Components in a couple of places, all of
> which could be avoided or placed outside the worker.

Cool. Just to be absolutely clear, not having Components in ChromeWorker is by design as most of the components and services you can access are not threadsafe themselves.
No longer blocks: 608142
JS_THREADSAFE is no longer enough. Keep in mind that we are crashing in rope code. So we would have to eagerly make immutable the entire graph (every string in it) that the object we are sending to another thread can reach. If thats a chrome object, its usually most of the chrome heap graph (parent -> global -> functions).
The backstage pass needs btoa/atob and the ChromeWorker constructor for this patch. Both easy. We need bugs filed. Bent?
There was no substantial increase or decrease in the crash volume for the flatten crash when jorendorff removed JSObject::title.
Resolving this as WONTFIX since off-thread crypto (bug 570619) has been backed out for now. I filed bug 608156 to redo it with ChromeWorkers and we'll file bugs against ChromeWorkers for any issues we run into in the process.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
(In reply to comment #16)
> JS_THREADSAFE is no longer enough. Keep in mind that we are crashing in rope
> code. So we would have to eagerly make immutable the entire graph (every string
> in it) that the object we are sending to another thread can reach.

The JS_THREADSAFE code did this, in jslock.cpp:

FinishSharingTitle(JSContext *cx, JSTitle *title)
{
    js_InitLock(&title->lock);
    title->u.count = 0;     /* NULL may not pun as 0 */

    JSObject *obj = TITLE_TO_OBJECT(title);
    if (obj) {
        uint32 nslots = obj->slotSpan();
        for (uint32 i = 0; i != nslots; ++i) {
            Value v = obj->getSlot(i);
            if (v.isString() &&
                !js_MakeStringImmutable(cx, v.toString())) {
                /*
                 * FIXME bug 363059: The following error recovery changes
                 * runtime execution semantics, arbitrarily and silently
                 * ignoring errors except out-of-memory, which should have been
                 * reported through JS_ReportOutOfMemory at this point.
                 */
                obj->setSlot(i, UndefinedValue());
            }
        }
    }

When did we stop using the request model, or do whatever it was (if not ripping out titles and JS_THREADSAFE) that kept this code running?

> If thats a
> chrome object, its usually most of the chrome heap graph (parent -> global ->
> functions).

No, the above code has to only through the direct ("own") property values. The transitive closure is handled lazily, not eagerly. You can have an MT object with references to ST objects as values of its properties. The old title-locking fast path would promote these if it couldn't claim the title for the new thread.

/be
Moving CC out of the request model (bug 606752 at least -- we must fix that one) could bypass the code shown in comment 20 before it was removed (bug 606029).

/be
(In reply to comment #21)
> Moving CC out of the request model (bug 606752 at least -- we must fix that
> one) could bypass the code shown in comment 20 before it was removed (bug
> 606029).

The CC was not moved out of the request, it was moved out of JS_GC. That allowed DOM workers to run when CC is active. But I doubt that 606752 is responsible for the string flatten problem as the CC ignores strings.
(In reply to comment #22)

> The CC was not moved out of the request, it was moved out of JS_GC.

Sorry, this is wrong. CC is indeed running outside the request. But that does not invalidate the conclusion as the GC can only happen on the main thread and CC does ignore strings.
(In reply to comment #19)
> Resolving this as WONTFIX since off-thread crypto (bug 570619) has been backed
> out for now. I filed bug 608156 to redo it with ChromeWorkers and we'll file
> bugs against ChromeWorkers for any issues we run into in the process.

Does that need to block b7, or does this effectively put us a bit more in the clear to ship? Having trouble understanding if some of the follow-up work identified in this bug needs to get out there for our add-on/third-party development target or not.
(In reply to comment #24)
> Does that need to block b7, or does this effectively put us a bit more in the
> clear to ship? Having trouble understanding if some of the follow-up work
> identified in this bug needs to get out there for our add-on/third-party
> development target or not.
I think that adding APIs to chrome workers is likely fine to do after beta 7 since they are only accessed by script, and adding APIs for script won't break existing scripts (well, shouldn't is more accurate I suppose).

Moving crypto off of the main thread is a fennec beta 2 blocker, but I don't think it would or should block firefox's beta 7.
(In reply to comment #24)
> (In reply to comment #19)
> > Resolving this as WONTFIX since off-thread crypto (bug 570619) has been backed
> > out for now. I filed bug 608156 to redo it with ChromeWorkers and we'll file
> > bugs against ChromeWorkers for any issues we run into in the process.
> 
> Does that need to block b7, or does this effectively put us a bit more in the
> clear to ship?

Yes, this puts us more in the clear to ship b7 as the do-over bug 608156 blocks b8. There's some platform work that needs to be done for that, though, and depending on interface change policies etc. some of that might block b7.
What's the effect of sync w/ foreground crypto on responsiveness?

/be
(In reply to comment #27)
> What's the effect of sync w/ foreground crypto on responsiveness?

Severe.
Assignee: nobody → jorendorff
Assignee: jorendorff → nobody
(In reply to comment #27)
> What's the effect of sync w/ foreground crypto on responsiveness?

Sync has to encrypt/decrypt items on each sync so it blocks the main thread many times per sync. This problem seems to be particularly severe on mobile where panning/zooming is no longer responsive during syncs. It's also noticeable on desktop right now when we do RSA crypto (some time after startup and first sync per start), though we've got plans to get rid of that.
(In reply to comment #29)
> (In reply to comment #27)
> > What's the effect of sync w/ foreground crypto on responsiveness?
> 
> Sync has to encrypt/decrypt items on each sync so it blocks the main thread
> many times per sync. This problem seems to be particularly severe on mobile
> where panning/zooming is no longer responsive during syncs. It's also
> noticeable on desktop right now when we do RSA crypto (some time after startup
> and first sync per start), though we've got plans to get rid of that.

Which functions are actually causing poor performance outside of RSA and PBKDF2?

I have done lots of AES and HMAC calculations in the main thread of another mobile application (on weaker ARM processors than what Android targets) with acceptable responsiveness. I would not expect AES or HMAC calculations to be the main issue as far as responsiveness goes, especially for typical small items like bookmarks and passwords. It also looks like there are performance improvements that could be made in the way keys are being copied between NSS and Javascript, encoded/decoded, and stored in memory. I would try optimizing the code in the patch for bug 603489 and then measure the performance again.
Indeed, I suspect the real blocking is due to the lack of async APIs in Places, once bug 603489 is fixed.  That said, I think moving those calculations to be async is still useful, on the basis of sheer volume (decrypting 50-150 records per collection query on mobile).  But we should discuss this in bug 608156 if there's questions of whether this is actually worthwhile.
Component: Firefox Sync: Crypto → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.