Closed Bug 621363 Opened 14 years ago Closed 13 years ago

SpecialPowers ipc setter code does not receive new value locally until next event loop run

Categories

(Core :: IPC, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla10

People

(Reporter: jdm, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

(Keywords: dev-doc-needed, Whiteboard: [specialpowers][inbound])

Attachments

(3 files, 8 obsolete files)

13.39 KB, patch
jdm
: review+
Details | Diff | Splinter Review
13.42 KB, patch
Details | Diff | Splinter Review
5.89 KB, patch
Details | Diff | Splinter Review
The way IPC prefs are implemented, we send a synchronous message via SpecialPowers to set a pref in the parent.  However, the parent has an observer that listens for pref changes and forwards them (via an async message) to the child's local pref cache which is used by native code (ie. not SpecialPowers references in tests).  This update message isn't processed until the event loop runs again, so any test code that depends on the pref being updated fails.
Depends on: 619559
Depends on: 621367
How many tests set prefs?  (Rough order of magnitude.)
http://mxr.mozilla.org/mozilla-central/search?string=set%28Int|Bool|Char%29Pref&regexp=1&find=\.htm&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central shows me 60 tests that set prefs.  There will also be tests that add and remove permissions, but that's harder to search for since the calls are just |add| and |remove|.
IMHO, alas, we should change these tests.  Spinning a nested event loop here is dangerous and seems like a recipe for intermittent orange.  We could add some kind of platform support for setting prefs+perms "immediately" in the child, but that doesn't seem wise because (i) we don't actually want those APIs to be used; (ii) not waiting for observer notifications also seems like a recipe for orange; and (iii) maintenance burden.

If we're going to change the tests, maybe we can take an opportunity to fix some other annoyances too.  The API I have in mind would be something like

  setupPrefEnv({'set': ['foo.bar', 2], 'remove': 'bar.baz', ...},
               callback);

where |callback| would be invoked after the indicated DB changes complete.  We could use this kind of API to fix the problem of needing to reset prefs after tests.  Make it |pushPrefEnv(c, f)| instead, with a |popPrefEnv()|, and an implicit pop-env-stack at end-of-test.  (Probably need to forbid multiple in-flight pushEnvs, but that's easy.)

Similarly for permissions, but a bit simpler.

This API should permit remote=false optimizations too, so there's not much overhead for current tinderbox runs.

Thoughts?
What would be the correct way to change the tests, then?  Just split any block that's dependent on a pref value being changed to use an execute_soon?
Sorta.  Factor out the code that depends on the pref/perm changes into a continuation.  E.g.

  setPref('blah', ...);
  //...
  x(); y(); z();

-->

  pushPrefEnv({'set': ['blah', ...},
              function() { x(); y(); z(); });

Whatever implements pushPrefEnv would invoke the continuation when the environment was changed per the spec arg.
There's (shameless plug) some awesome helpers in cookie tests to do this: they use generators, so you don't need to split your test across multiple functions.

See http://mxr.mozilla.org/mozilla-central/source/extensions/cookie/test/unit/head_cookies.js#74 and nearby code; usage example at http://mxr.mozilla.org/mozilla-central/source/extensions/cookie/test/unit/test_eviction.js.
I have a simple parser on bug 582472 that will count the uses and attempt to divine the reason for using EnablePrivilege across the codebase.  It shows me that we use the Preference interfaces 89 times in these tests.  We use Permission interfaces about 21 times.  

I'd also rather not have to spin an event loop each time we have to use special powers.  So, if we can avoid that by adding better APIs to an interface like special powers or to the frame script itself, I'd like to do that.

On a related note, mochitest has its own "harness level" set of things that need frame scripts and the ability to talk between child processes and parent processes, so I'd like to consolidate all this type of stuff into one API, whether that is special powers or dwitte's special generator or what have you.  So, if you come up with a better way to do what "special powers" does, I'd also like to add in the set of mochitest harness chrome-level calls that we need to make from the content process as well. (writing to files, quitting the process, etc)
Oh, sorry for the spam, but if we're revisiting the special powers api, then we also need to design in a way that developers can easily extend the set of available APIs so that this API will be flexible for the 400 odd uses of EnablePriviledge tests in the codebase.
There's a hacky way around this - in the content process, we could create a pref cache that would store new values until the corresponding notification arrived.
Specifically, this would be a cache in the underlying prefbranch component, not on the level of SpecialPowers.
I'm not in favor of hacking around this problem in the platform.  I agree that spinning the event loop sucks, but unless it's actively causing problems, I'd rather stick with that hack until we have a "real" fix like what bug 636063 would allow.
Blocks: 664029
Attachment #539046 - Attachment is obsolete: true
I don't really know how this code worked for me for so long, but it turns out on closer inspection that my observers had some serious problems. This update is proven in the field of fire by the IndexedDB mochitests; we should probably make a decision soon and either push it or shelve it.
Attachment #499707 - Attachment is obsolete: true
Blocks: 668283
Chris, Ted, this problem is holding up being able to run lots of tests on mobile. What's the final word here? Are we going to go with this gross solution, or is there some other direction I can start moving so we can make progress?
I think we should create the Right APIs and change the tests to use them.  Re-reading above, it looks like we have ~100 tests that are broken from pref manip.  That doesn't seem like an insurmountable effort.  It might be more expedient in the short term to to hack around the old APIs in various ways, but that might end up being a false economy.

I'm not familiar with the tests, though, and it would be good to have a more accurate estimate of the work involved in upgrading the tests from someone who knows them better.
I suspect the number is higher than the ~100 previously reported; there are a number of helper js scripts that do things like set up the pref environment for every media or indexeddb test, so each of those would have to be modified to account for the new change.
Unbitrotted.
Attachment #539048 - Attachment is obsolete: true
jsreftests (|make jstestbrowser|) need this sort of thing as well (bug 669949). Modifying them all would mean rewriting thousands of reftests (most of the 3,328 seem to set at least one pref).

How about a setting that allows one to set prefs from the child? And it would only be allowed in tests, of course. Either just applying the value locally (no sync with parent - good enough for jsreftests, see patch in bug 669949), or a sync message to the parent (safer in one sense, but less in another)?
(In reply to comment #19)
> jsreftests (|make jstestbrowser|) need this sort of thing as well (bug
> 669949). Modifying them all would mean rewriting thousands of reftests (most
> of the 3,328 seem to set at least one pref).

Do you have an idea of whether the rewrite would be mechanical or more complicated, in general?  A mechanical rewrite of a few thousand tests wouldn't be so bad, and would be write-once, but if it would be more complicated that's less attractive.

> How about a setting that allows one to set prefs from the child? And it
> would only be allowed in tests, of course. Either just applying the value
> locally (no sync with parent - good enough for jsreftests, see patch in bug
> 669949), or a sync message to the parent (safer in one sense, but less in
> another)?

I think magic modes like this will probably end up causing more trouble than they save.  But the specter of non-mechanical rewrite of thousands of tests might move that needle :/.
cc'ing Igor who I hope can help us understand what should be done for the jsreftests. Igor, to summarize so that you do not need to read this entire bug, the issue is that the jsreftests set prefs (like gczeal), which does not work with multiple processes - we currently do not allow the child process to set preferences (primarily for security reasons, is my understanding). We have some possible approaches for what to do here, quoted in the rest of this comment.

(In reply to comment #20)
> (In reply to comment #19)
> > jsreftests (|make jstestbrowser|) need this sort of thing as well (bug
> > 669949). Modifying them all would mean rewriting thousands of reftests (most
> > of the 3,328 seem to set at least one pref).
> 
> Do you have an idea of whether the rewrite would be mechanical or more
> complicated, in general?  A mechanical rewrite of a few thousand tests
> wouldn't be so bad, and would be write-once, but if it would be more
> complicated that's less attractive.

The cases I saw so far are things like a jsreftest having

  gczeal(0)

and a few other similar commands in the beginning, each of which ends up setting a pref. That seems like it could be rewritten in theory so it is asynchronous, but my concerns are that

1. This will make the tests more complicated. None of the jsreftests right now (that I have read, which is of course a small sample out of thousands) have any async element that I can see - they all expect to run in a simple linear way, and they are pretty simple thanks to that. They are really more like js shell tests that are run in the browser, rather than full-fledged reftests I think, and I would be afraid to change that.

2. The rewriting would not be simple if there are multiple prefs and conditions between them. And maybe there are other potential issues. I have not seen such things, but I haven't read any significant percentage of the jsreftests. I can't think of a simple and quick way to find this out.

> 
> > How about a setting that allows one to set prefs from the child? And it
> > would only be allowed in tests, of course. Either just applying the value
> > locally (no sync with parent - good enough for jsreftests, see patch in bug
> > 669949), or a sync message to the parent (safer in one sense, but less in
> > another)?
> 
> I think magic modes like this will probably end up causing more trouble than
> they save.  But the specter of non-mechanical rewrite of thousands of tests
> might move that needle :/.

I agree there is a risk here, if we have any bugs in how the magic mode is handled then that would be very bad. But I think we can be very careful about it, and it is a small amount of code.
(In reply to comment #21)
> jsreftests set prefs (like gczeal), which does
> not work with multiple processes - we currently do not allow the child
> process to set preferences (primarily for security reasons, is my
> understanding).

We disallow that so as to define away the problem of synchronizing preference state.

> The cases I saw so far are things like a jsreftest having
> 
>   gczeal(0)
> 
> and a few other similar commands in the beginning, each of which ends up
> setting a pref. That seems like it could be rewritten in theory so it is
> asynchronous, but my concerns are that

Oh!  We can make things like this a SpecialPower.  Setting gczeal per-test doesn't need to change global state.  This should be a straightforward sed rewrite too.
(In reply to comment #22)
> (In reply to comment #21)
> > The cases I saw so far are things like a jsreftest having
> > 
> >   gczeal(0)
> > 
> > and a few other similar commands in the beginning, each of which ends up
> > setting a pref. That seems like it could be rewritten in theory so it is
> > asynchronous, but my concerns are that
> 
> Oh!  We can make things like this a SpecialPower.  Setting gczeal per-test
> doesn't need to change global state.  This should be a straightforward sed
> rewrite too.

By 'global state' do you mean the parent process (and other processes)? If so then I agree, just setting the pref locally would work for these tests.

It is my understanding that dbaron is opposed to enabling SpecialPowers in reftests in general. For that reason I suggested that we set the pref locally without using SpecialPowers, instead just doing so directly, and as a secondary benefit to doing it that way we would not need any rewriting either (this is what the patch in bug 669949 does).
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > > The cases I saw so far are things like a jsreftest having
> > > 
> > >   gczeal(0)
> > > 
> > > and a few other similar commands in the beginning, each of which ends up
> > > setting a pref. That seems like it could be rewritten in theory so it is
> > > asynchronous, but my concerns are that
> > 
> > Oh!  We can make things like this a SpecialPower.  Setting gczeal per-test
> > doesn't need to change global state.  This should be a straightforward sed
> > rewrite too.
> 
> By 'global state' do you mean the parent process (and other processes)? If
> so then I agree, just setting the pref locally would work for these tests.

Yes, but I had in mind the SpecialPowers.gczeal() interface bypassing the pref entirely.  Having inconsistent pref state sounds like a slippery slope to me.

> It is my understanding that dbaron is opposed to enabling SpecialPowers in
> reftests in general.

jsreftests are a special beast; they're not really reftests, just kind of shoe-horned into the reftest framework.  It should be possible to only enable SpecialPowers for jsreftests, if dbaron is OK with that.  That the jsreftests are altering GC params already violates the spirit of the idea I think dbaron wants for the rendering reftests.  CC'ing dbaron for his thoughts.

> For that reason I suggested that we set the pref
> locally without using SpecialPowers, instead just doing so directly, and as
> a secondary benefit to doing it that way we would not need any rewriting
> either (this is what the patch in bug 669949 does).

It's not clear that setting that pref is helping anyone.  I would prefer to change the tests than introduce inconsistent pref state across processes, but for the sake of expedience that might be what we're forced to do.

Or maybe we could not SpecialPowers-ize gczeal(), but just change its implementation to avoid the prefs system?
Sadly the discussion of this issue has ended up split between this bug and bug 669949. Do we want to dupe one into the other?
(In reply to comment #21)
> cc'ing Igor who I hope can help us understand what should be done for the
> jsreftests. Igor, to summarize so that you do not need to read this entire
> bug, the issue is that the jsreftests set prefs (like gczeal), which does
> not work with multiple processes - we currently do not allow the child
> process to set preferences (primarily for security reasons, is my
> understanding). We have some possible approaches for what to do here, quoted
> in the rest of this comment.

Typically the tests sets the preference to check for the issue when run in the shell. If we cannot provide a good way to implement the check in the browser, we can just skip the corresponding test. Yes, this would decrease the test coverage as some tests caught regressions when run in the browser as opposed to the js shell, but our fuzziers are much more effective at catching regressions.
jsreftests aren't a problem anymore.

We need more data here in order to decide on an approach!  Does anyone have a good estimate of how many mochitests need to set prefs and so would need to be updated?  Does anyone have an idea of how to estimate that accurately?  (I don't know all the different ways mochitests can set prefs.)
there are 91 instances of nsIPref in the mochitest tests/ directory.  Some of these are in libraries that many other tests use as well, but I don't think that matters as we fix it in the library and it is good.
Doesn't that exclude tests that have been migrated to SpecialPowers 1.0?  Are there any other ways mochitests can access prefs directly, besides "nsIPref*" and SpecialPowers?  (Excluding indirection through helpers.)
there are 44 instances of SpecialPowers.*Pref in mochitests.  Also in mochitest-chrome there are 41 instances of nsIPref, but chrome isn't run in content (well for the most part).

I can't think of any other ways we set prefs.  Round up and say 150 instances in Mochitest.
so what is the status here?  Should we assume editing all the mochitests?  We are trying to get more tests turned on for Fennec and in bug 676394, it looks like we are hitting this same problem.
Let's go forward with that.  Three issues

 (1) I heard jdm was doing something with generators here.  Anything come out of that?

 (2) Does anyone object to the API in comment 4?  (pushPrevEnv + implicit or explicit pop)

 (3) Who can volunteer for this?
I have not done anything besides read this bug and rebase the existing patch.
Alright.  Recap of the concrete API proposal:

// Push a temporary modification to the preferences environment.  The
// first param to pushPrefEnv is an object with option 'set' and
// 'remove' properties.
//
// The 'set' property maps to an array of [
// prefname, value ] tuples that are the preferences to be modified.
// The type of |prefname| is inferred from the type of |value| in the
// tuple.
//
// The 'remove' property maps to an array of prefname strings to be
// removed.
//
// The second param |callback| is a function to be invoked when the
// pref environment has been modified as specified by the first
// argument.  It must not be null.
//
// If the same prefname appears multiple times in the 'set' array,
// behavior is undefined.  Passing the same prefname to be both set
// and removed also results in undefined behavior.
pushPrefEnv({'set': [ ['foo.bar', 2], ... ],
            'remove': [ 'bar.baz', ... ]
            },
            callback);

// "Pop" the most recent pushPrefEnv(), meaning that all modifications
// specified by the pushPrevEnv() are undone.  |callback| is a functon
// to be invoked when the pop is complete.  It must not be null.
//
// It's OK to call popPrefEnv() if there are no applied
// pushPrefEnv()'s.
popPrefEnv(callback);

// All applied pushPrefEnv() modifications are implicitly undone when
// a mochitest finishes.


Objections?
We'll want a similar API For adding and removing permissions, as well.
I like this proposal.  Is this intended to be inside of SpecialPowers, or is this something that will require us to rewrite tests?  Currently our tests are not written to wait for a callback.  If we can encapsulate this change in SpecialPowers and not return from a SpecialPowers.setIntPref call until it is set, that will be a huge win.

++ for the state tracking and automatic cleanup.

In the spirit of geting things done and not delaying, can we get this implemented for prefs and then tackle permissions afterwards?
(In reply to Joel Maher (:jmaher) from comment #36)
> I like this proposal.  Is this intended to be inside of SpecialPowers, or is
> this something that will require us to rewrite tests?

This API would be exposed directly to tests.  The API would be implemented in SpecialPowers.

> Currently our tests
> are not written to wait for a callback.  If we can encapsulate this change
> in SpecialPowers and not return from a SpecialPowers.setIntPref call until
> it is set, that will be a huge win.

That's what the earlier discussion in the bug was about ;).  It's not really possible to create that illusion without also introducing the potential for subtle breakage.  Since it appears that we're not looking at too many tests that use prefs, let's rewrite them to use a non-broken API.

> In the spirit of geting things done and not delaying, can we get this
> implemented for prefs and then tackle permissions afterwards?

Certainly.
Blocks: 676394
Attached patch implement pushPrefEnv (1.0) (obsolete) — Splinter Review
this patch implements pushPrefEnv and actually works with the modified test case in this patch.  In addition when a test case is finished, the harness will call flushPrefEnv and it will reset all the preferences.

This uses some of the code in the previous patch on this bug + the recent comments calling for a stack solution for the preferences.  This is my interpretation of it all, please let me know if this is not what was envisioned.
Attachment #553517 - Flags: superreview?(jones.chris.g)
Attachment #553517 - Flags: review?(josh)
Attached patch implement pushPrefEnv (2.0) (obsolete) — Splinter Review
updated patch to include other scenarios besides just nsPref:changed in the observer service.  This is much more in parity with what :jdm had done originally.
Assignee: nobody → jmaher
Attachment #553517 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #553553 - Flags: review?(josh)
Attachment #553553 - Flags: review?(jones.chris.g)
Attachment #553517 - Flags: superreview?(jones.chris.g)
Attachment #553517 - Flags: review?(josh)
Comment on attachment 553553 [details] [diff] [review]
implement pushPrefEnv (2.0)

>+  this._prefStack = [];

Nit: would call this prefEnvUndoStack.  prefStack sounds like a stack
of individual prefs, but this is a stack of sets of changes required
to undo pushPrefEnv() calls.

>+  // Return the last prefName in the prefs object.  Also if we are doing a 'set'
>+  // we will need to wait for the observer.  
>+  _getLastPref: function(prefs) {

We have to wait for pref clears too, since observers see them and can
change state when a pref is cleared.

In general this function feels a little gross to me.  I'd prefer to
see something that normalized the object with set/clear properties
into an array of modifications the SpecialPowers code needs to make.
Then if you make the modifications sequentially here, you can assume
that the messages are processed in chrome sequentially, and since pref
updates and observer notifications are synchronous, then you can
legally wait for the last action in the array to be completed before
invoking the callback.  The "do nothing" special case in pushPrefEnv
would then be a check for an empty action array.  It would probably
make sense to move the type inference and "build undo" logic into this
function too, and you could return an { actions: [ ... ], antiActions:
{ } } object, or something like that.

>+  /*
>+   * Take in a list of prefs and put the original value on the _prefStack so we can undo
>+   * preferences that we set.  Note, that this is a stack of values to revert to, not
>+   * what we have set.
>+   *
>+   * prefs: {action: [[pref, value], [pref, value], ...], action: [[pref, value], ...], ...}
>+   * ex: {'set': [['foo.bar', 2], ['browser.magic', '0xfeedface']], 'remove': [['bad.pref']] }

Nit: I would replace "action" with explicit set/clear in the docs
here, since the expected types of the property values are different.

Another nit: on second thought, we should probably s/remove/clear/ to
match the pref API more closely.  Sorry about that.

>+   *
>+   * In the scenario where our prefs specify the same pref more than once, we do not guarantee
>+   * the behavior.  

Another note to add: what happens if there's a second call to
pushPrefEnv() while a first call is still in progress (i.e., callback
hasn't been invoked yet because required changes haven't been
applied).  Right now this patch "synchronously" applies pref updates
by spinning the event loop below, but this code can still be
reentered.  That would cause things to break.

I would be OK with saying in the docs "behavior is undefined", but if
we modify the pref-update wait as noted below, you could check for an
in-progress update and throw an exception.

>+  _waitForObserver: function(aObserver, aTopic, aIsPref) {
>+    var thread = Cc["@mozilla.org/thread-manager;1"].getService()
>+                 .QueryInterface(Ci.nsIThreadManager).currentThread;
>+    while (!aObserver._success) {
>+      thread.processNextEvent(true);
>+    }
>+    this._removeObserver(aObserver, aTopic, aIsPref);

What's the motivation for spinning the event loop here?  Is it to
support tests that modify prefs but don't use waitForExplicitFinish()?
Have you come across tests like this already?

The other option is to require one and only one observer at all times,
and track it in a global variable.  That would allow us to more easily
check for overlapping pushPrefEnv()/popPrefEnv() as noted above.  But
if there are a bunch of tests that would be annoying to convert to
waitForExplicitFinish(), I'm OK with the impl here.

>+  // This is a little clunky as we are resetting all the prefs, but
>+  // only waiting on the last pref

That's not going to work properly.  A pref might be modified by
multiple pushPrefEnv() calls.

>+  flushPrefEnv: function(callback) {

You could alternately write this as an "event-loop recursive"
function, which has the advantage of not depending on whether applying
pref modifications is synchronous or asynchronous.  In a nutshell,

  if (this._prefEnvUndoStack.length == 0) {
    callback();  // or setTimeout(0)
    return;
  }
  var undo = this._prefEnvUndoStack.pop();
  // start undo, setting up flushPrefEnv(callback) as the completion
  // callback

sr- for not waiting on clears to complete and buggy flushPrefEnv, but
this is looking very good otherwise :).  (Reset the flag to sr, because
I did an sr-style review, not a detailed one.)
Attachment #553553 - Flags: review?(jones.chris.g) → superreview-
(In reply to Chris Jones [:cjones] [:warhammer] from comment #40)
> Comment on attachment 553553 [details] [diff] [review]
> implement pushPrefEnv (2.0)
> 
>   var undo = this._prefEnvUndoStack.pop();
>   // start undo, setting up flushPrefEnv(callback) as the completion
>   // callback
> 

Er, by all this crap I meant

  popPrefEnv(function () { flushPrefEnv(callback) });
Attached patch implement pushPrefEnv (3.0) (obsolete) — Splinter Review
updated patch to:
 * wait for all changes that occur on the pref branch
 * serialize the changes into an array for applying and cleanup
 * no more wait loop, just call the callback directly from the observer 
 * finally if we are in the middle of applying changes, we push the new set of changes in a pending queue and they will be applied when the other changes are done.

This should address the comments and feedback received so far.  This passes local tests and running on try server right now.  A run last night with most of this patch implemented yielded 2 leaks, but no failing tests.
Attachment #553553 - Attachment is obsolete: true
Attachment #554183 - Flags: superreview?(jones.chris.g)
Attachment #554183 - Flags: review?(josh)
Attachment #553553 - Flags: review?(josh)
Comment on attachment 554183 [details] [diff] [review]
implement pushPrefEnv (3.0)

>+  /*
>+   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
>+   * preferences that we set.  Note, that this is a stack of values to revert to, not
>+   * what we have set.

Nit: _prefEnvUndoStack is an implementation detail, but this comment
is something that developers would reference.  I would reword this so
that it describes the abstract stack of "pref environments" created by
pushPrefEnv, not the implementation of having a stack of undo items.

Looks a lot better, thanks.
Attachment #554183 - Flags: superreview?(jones.chris.g) → superreview+
Comment on attachment 554183 [details] [diff] [review]
implement pushPrefEnv (3.0)

>+        content.window.setTimeout(this._sp._finishPrefEnv, 0, this._callback);

I think we should be setting one timeout for the callback and another one for _finishPrefEnv since they are logically separate actions. You'll need to change the test to accommodate this; just use constructs like |SpecialPowers.pushPrefEnv({'clear': [['font.minimum-size.x-western']]}, function() setTimeout(step1, 0));| and everything should just work.

>+        /* We need to exit this observe thread, so execute inside a setTimeout */

This comment would be better phrased as "The callback must execute asynchronously after all the preference observers have run"

>+  cleanup: function(aTopic, aIsPref) {

Given that the observer object has _topic and _isPref, the parameters are unneeded.

>+   * TODO: ensure this is only called once (global lock?)

This is no longer a todo, yes?

>+          if (prefs.prefHasUserValue(prefName) && action == 'clear')
>+            originalValue = this._getPref(prefName, prefType);
>+          else if (action == 'set')
>+            originalValue = this._getPref(prefName, prefType);

I think merging these two conditions makes sense.

>+  flushPrefEnv: function(callback) {
>+    while (this._prefEnvUndoStack.length > 0)
>+      this.popPrefEnv(null);
>+
>+    content.window.setTimeout(callback, 0);
>+  },

This doesn't look correct to me; the callback could be called before any preferences have actually been updated. I think we need to pop all but the last pref environments using null callbacks, then pop the last environment using the given callback.

>+  /* Iterate through all preferences in pendingActions and set/clear appropriately.
>+     Assumes that all actions will change the pref
>+   */

Given that pendingActions isn't a member var, this comment doesn't make much sense. Maybe something like "Iterate through one atomic set of pref actions and perform sets/clears as appropriate. All actions performed must destructively modify the relevant pref."

>+  _applyPrefs: function() {
>+    
>+    if (this._applyingPrefs) {
>+      return;

Nix the blank line.

>+    if (this._pendingPrefs.length <= 0) {
>+      return;
>+    }

A simple |!this._pendingPrefs.length| will suffice.

>+  /* called from the observer when we get a pref:changed.  */
>+  _finishPrefEnv: function(aCallback) {
>+    if (aCallback)
>+      content.window.setTimeout(aCallback, 0);
>+
>+    this.wrappedJSObject.SpecialPowers._applyingPrefs = false;
>+    this.wrappedJSObject.SpecialPowers._applyPrefs();
>+  },

As mentioned earlier, I think the callback should be the responsibility of the caller, not this function. In its current incarnation, the callback will be invoked after further pref environments have started to be pushed, and the handwavy nature of what happens when makes me nervous. I think we should specify that when the last preference change for the first environment is observed, the callback is invoked asynchronously, and any pending environment pushes that occurred while we were previously waiting will begin after the callback finishes executing. Chris, does that sound reasonable to you?

Finally, I think it's worth adding a comment for the call to _applyPrefs along the lines of "Any subsequent pref environment pushes that occurred while waiting for the preference update are pending, and will now be executed."

This is a really great implementation, but I'm going to insist on seeing another revision that fixes the ordering problems I highlighted.
Attachment #554183 - Flags: review?(josh) → review-
Attached patch implement pushPrefEnv (3.1) (obsolete) — Splinter Review
updated patch to address comments in previous r-.  Also updated tests in layout/style/test/ to show how we can use this.  My run last night on try server yielded great results, but this is also depending on the specialpowers refactoring.
Attachment #554183 - Attachment is obsolete: true
Attachment #555402 - Flags: review?(josh)
Comment on attachment 555402 [details] [diff] [review]
implement pushPrefEnv (3.1)

Review of attachment 555402 [details] [diff] [review]:
-----------------------------------------------------------------

Great, this is almost ready for prime time. A couple last changes and it can ship!

::: layout/style/test/test_bug389464.html
@@ -40,4 +30,4 @@
> >  var cs1 = getComputedStyle(document.getElementById("one"), "");
> >  var cs2 = getComputedStyle(document.getElementById("two"), "");
> >  
> > -var oldVariable = get_pref("variable.x-western");
> > +SpecialPowers.pushPrefEnv({'set': [['variable.x-western', 25], ['fixed.x-western', 20]]}, function() { setTimeout(part1, 0); });

Single statement functions don't need to have {} surrounding the body, or the semicolon, and I think they just clutter up this code anyways. Could you remove them in all of these statements?

::: testing/mochitest/tests/SimpleTest/specialpowersAPI.js
@@ -89,0 +92,20 @@
> > +function Observer(specialPowers, aTopic, aCallback, aIsPref) {
> > +  this._sp = specialPowers;
> > +  this._topic = aTopic;
> > +  this._callback = aCallback;
NaN more ...

I'd like to see two setTimeout calls here, the first for the callback, and the second for _finishPrefEnv. Also, can you not use this._sp here instead of the wrappedJSObject craziness?

@@ -126,0 +167,7 @@
> > +  /*
> > +   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
> > +   * preferences that we set.  Note, that this is a stack of values to revert to, not
> > +   * what we have set.
NaN more ...

set|clear, not set|remove.

@@ -126,0 +167,91 @@
> > +  /*
> > +   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
> > +   * preferences that we set.  Note, that this is a stack of values to revert to, not
> > +   * what we have set.
NaN more ...

I'm not sure why you made these changes, but please revert them. The callback should be executed asynchronously in all cases.

@@ -126,0 +167,108 @@
> > +  /*
> > +   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
> > +   * preferences that we set.  Note, that this is a stack of values to revert to, not
> > +   * what we have set.
NaN more ...

I think need needs to be >1.

@@ -126,0 +167,161 @@
> > +  /*
> > +   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
> > +   * preferences that we set.  Note, that this is a stack of values to revert to, not
> > +   * what we have set.
NaN more ...

As mentioned earlier, make this happen in the pref observer instead of in this function please.
Attachment #555402 - Flags: review?(josh) → review-
Comment on attachment 555402 [details] [diff] [review]
implement pushPrefEnv (3.1)

>+SpecialPowers.pushPrefEnv({'set': [['variable.x-western', 25], ['fixed.x-western', 20]]}, function() { setTimeout(part1, 0); });

Single statement functions don't need to have {} surrounding the body, or the semicolon, and I think they just clutter up this code anyways. Could you remove them in all of these statements?

>+  observe: function(aSubject, aTopic, aData) {
>+    if ((!this._isPref && aTopic == this._topic) ||
>+        (this._isPref && aTopic == "nsPref:changed")) {
>+      if (aData == this._topic) {
>+       this.cleanup();
>+        /* The callback must execute asynchronously after all the preference observers have run */
>+        content.window.setTimeout(content.window.wrappedJSObject.SpecialPowers._finishPrefEnv, 0, this._callback);

I'd like to see two setTimeout calls here, the first for the callback, and the second for _finishPrefEnv. Also, can you not use this._sp here instead of the wrappedJSObject craziness?

>+  /*
>+   * Take in a list of prefs and put the original value on the _prefEnvUndoStack so we can undo
>+   * preferences that we set.  Note, that this is a stack of values to revert to, not
>+   * what we have set.
>+   *
>+   * prefs: {set|remove: [[pref, value], [pref, value, Iid], ...], set|remove: [[pref, value], ...], ...}
>+   * ex: {'set': [['foo.bar', 2], ['browser.magic', '0xfeedface']], 'remove': [['bad.pref']] }

set|clear, not set|remove.

>+    if (pendingActions.length > 0) {
>+      this._prefEnvUndoStack.push(cleanupActions);
>+      this._pendingPrefs.push([pendingActions, callback]);
>+      this._applyPrefs();
>+    } else {
>+      callback();
>+//      content.window.setTimeout(callback, 0);
>+    }

I'm not sure why you made these changes, but please revert them. The callback should be executed asynchronously in all cases.

>+  flushPrefEnv: function(callback) {
>+    while (this._prefEnvUndoStack.length > 0)
>+      this.popPrefEnv(null);
>+
>+    this.popPrefEnv(callback);
>+  },

I think need needs to be >1.

>+  /* called from the observer when we get a pref:changed.  */
>+  _finishPrefEnv: function(aCallback) {
>+    if (aCallback) {
>+      aCallback();
>+//      content.window.setTimeout(aCallback, 0);
>+    }

As mentioned earlier, make this happen in the pref observer instead of in this function please.
updated patch with cleanup mentioned below.  Passes tests locally, pushed to try.  fwiw, the push I did on my previous patch as green on try (except for known oranges); I expect the same results from this one.
Attachment #555402 - Attachment is obsolete: true
Attachment #555492 - Flags: review?(josh)
Comment on attachment 555492 [details] [diff] [review]
implement pushPrefEnv (3.14)

Great! One last thing to fix when checking in:

>+   * prefs: {set|clear: [[pref, value], [pref, value, Iid], ...], set|clear: [[pref, value], ...], ...}
>+   * ex: {'set': [['foo.bar', 2], ['browser.magic', '0xfeedface']], 'remove': [['bad.pref']] }

Another 'remove' that should be 'clear'.
Attachment #555492 - Flags: review?(josh) → review+
Backed out on inbound because this appeared to cause a new frequent mochitest-browser-chrome leak on Linux.
https://hg.mozilla.org/integration/mozilla-inbound/rev/4dc1e45afdd7

Example: http://tbpl.allizom.org/php/getParsedLog.php?id=6145965
Joel, https://developer.mozilla.org/en/Debugging_memory_leaks has lots of info on debugging leaks.  If any of the instructions there are unclear or you get stuck, there are several people in #content on IRC who can help out (bent and peterv come immediately to mind).
Andrew, can you help Joel out here, maybe point your tools at this, or show Joel how to use the tools etc? Thanks!
The problem I am running into is that I cannot reproduce this leak locally and for the most part this leak takes place on linux64 or osx64 builds.  I have bits that do show a leak for linux, but when I run them locally it fails.  So right now my only way to fix this is via try server.  I am looking into getting a slave that I can use and put some try server bits on.
Let me know when you can reproduce this locally, and I can apply my leak finding tools and see what I find, or help you use them.
After talking to jst, I'm going to throw together a patch that will dump the CC and GC heaps on shutdown.  I'll probably be able to put that up tomorrow morning PDT.
Okay, here you go.  Hopefully this should apply cleanly.

This patch logs the shut down cycle collections and garbage collections.  It will create files cc-edges-{1,2,3}.log and gc-edges-{1,2,3}.log.  Really only cc-edges-3.log and gc-edges-3.log are of interest, because those are the final ones.

Once you have those logs, there are some scripts that you or I can run on them to figure out why they are alive.  I'm perfectly happy to do it, as it is pretty easy, or I can explain it to you.  You'll just have to let me know what is leaking that you want to figure out.  Usually the canary is an nsGlobalWindow.

The analysis scripts are located at https://github.com/amccreight/heapgraph
the patch almost applies cleanly, was easy to fix up.  One question I have is where are these logs stored at?  The application directory or the profile directory?  Why I ask is that I cannot reproduce this and have to run on a try server.  I will need to read these files from the specific directories in the python test script runner and pump the data into the overall stdout/logfile.
When I run it myself, it ends up in the directory I run it from.  I don't know how that translates on the try server.

If you know something about the directory structure on the try server, you could also hard code in absolute paths yourself.  There are two places to change:

js/src/xpconnect/src/nsXPConnect.cpp:
  sprintf(dumpName, "gc-edges-%d.log", ++gcLogCounter);
xpcom/base/nsCycleCollector.cpp:
  sprintf(name, "cc-edges-%d.log", ++gLogCounter);
For posterity, I applied my patch to revision 76368:7d3d1c2c75f8.  The dump code has undergone a bit of churn lately so it may be fairly sensitive to exactly what version you are using.
IIRC the "biggest" thing leaked in that log was a document, so that's probably where you'd want to start.
Well, jmaher got my patch rigged up, and extracted a cycle collector log from a run that showed a leaking nsDocument... but there's no nsDocument in the cycle collector graph!  Very odd.  That would certainly explain why it is leaking, if the cycle collector doesn't see it.  I have no idea how that could happen, though, beyond some sort of weird cycle collector failure.
(In reply to Andrew McCreight [:mccr8] from comment #61)
> Well, jmaher got my patch rigged up, and extracted a cycle collector log
> from a run that showed a leaking nsDocument... but there's no nsDocument in
> the cycle collector graph!  Very odd.  That would certainly explain why it
> is leaking, if the cycle collector doesn't see it.  I have no idea how that
> could happen, though, beyond some sort of weird cycle collector failure.

That just means it's not getting suspected in that generation, right?  So you want to find the last generation it is suspected in and look at that log, if I understand things correctly.
Hmm.  That's a good point.  So, I think all XPC things are always(?) suspected, but I guess if the nsDocument isn't part of a cycle involving JS then maybe it could show up at some prior CC in the run of the program, but never again.
Actually, according to the refcount log, there's one 1 outstanding reference to the document, so the CC graph might not be relevant here at all.
It still could be part of a cycle.

One possibility could be incomplete unlinking at some earlier point: if the CC detects a cycle and then kills it, but doesn't actually kill it, then it will just stick around forever.  Pretty hard to know, though.  If this could be reproduced locally, you could dump every CC, then modify the bloat log to print the address of the nsDocument, to see if it ever showed up in a CC graph, but that would probably be a ludicrous amount of data.

One thing we could do to rule out the incomplete unlinking is to log the address of every nsDocument that is freed by the CC, then somehow get the bloat log to dump the address of any nsDocuments that survive.  Of course, if a document gets allocated multiple times at the same address, then all bets are off.  And there'd be no way to tell if this happened, in general...
(In reply to Andrew McCreight [:mccr8] from comment #65)
> It still could be part of a cycle.

True.

> One possibility could be incomplete unlinking at some earlier point: if the
> CC detects a cycle and then kills it, but doesn't actually kill it, then it
> will just stick around forever.

Since the document only has a refcount of one, I think it has almost certainly been unlinked.  We could verify this by putting printf("%p", this); in the ctor and dtor of nsDocument, and in the Unlink method of nsDocument's CC helper.

Once we've verified that it is an insufficient unlink, we can trace the cycle manually with the refcount logging tools and then figure out where the unlink is insufficient.
So an nsDocument normally has a bunch of things pointing back at it?  One thing I worry about is that it may not be the nsDocument itself that is the problem, but something that is holding on to it.

refcount logging probably isn't doable on a full try server run, for a number of reasons, but I guess we can hope that the incomplete Unlink is always there, it is just that it only shows up as a full leak in some cases.
(In reply to Andrew McCreight [:mccr8] from comment #67)
> So an nsDocument normally has a bunch of things pointing back at it?

1201 nsXMLDocument                                1240     1240     1356        1 (   93.69 +/-    25.49) 11682067        1 (  236.76 +/-    46.69)

There were 11 million references to 1356 documents, so yeah :-)

> One
> thing I worry about is that it may not be the nsDocument itself that is the
> problem, but something that is holding on to it.

Yeah ... 

> refcount logging probably isn't doable on a full try server run, for a
> number of reasons, but I guess we can hope that the incomplete Unlink is
> always there, it is just that it only shows up as a full leak in some cases.

Can you figure out from an incomplete unlinking what else needs to be unlinked?  I suppose with the entire CC graph in hand that should be doable.
a lot of good ideas here.  Is there a way I can try some of these on try server?  Any other ideas in the last couple days?
(In reply to Joel Maher (:jmaher) from comment #69)
> a lot of good ideas here.  Is there a way I can try some of these on try
> server?  Any other ideas in the last couple days?

I (or Andrew) could write a patch that will dump the documents leaked at shutdown and whether or not they've been Unlinked so that we can determine if this is insufficient unlinking.
please do, I can run that through try and follow up.
Attached patch Diagnostic patchSplinter Review
This patch will dump diagnostics on shutdown about leaked documents.  It runs after what should be the last cycle collections.

I added a tmp->AddRef in the unlink code in nsDocument to test it.  The output looks like:

Document leaked!
    this=07ADEA10
    uri=moz-nullprincipal:{d9fd98b1-25c7-4309-87b9-6f1683e9c97
    unlinked=true
----------------
Document leaked!
    this=080B0280
    uri=about:blank
    unlinked=true
----------------
Document leaked!
    this=081272D0
    uri=resource://gre-resources/hiddenWindow.html
    unlinked=true
----------------

We can add more things to the output if it we find other things that would be interesting later.
I was wrong about the unlinking:

Document leaked!
    this=0xbc99f60
    uri=chrome://global/content/bindings/scrollbox.xml
    wasunlinked=false
You mean, you had a run that recreated the leak, but in the document was never unlinked?  I assume this was a try server run?  That's unfortunate.
(In reply to Andrew McCreight [:mccr8] from comment #75)
> You mean, you had a run that recreated the leak, but in the document was
> never unlinked?  I assume this was a try server run?  That's unfortunate.

Yes, that is what I mean.
ahh, you beat me to it :khuey.  So any thoughts on this?  I will look through the browser-chrome tests which use scrollbox.xml and see if those specific tests might not clean up very well.  Any other suggestions are more than welcome.
(In reply to Joel Maher (:jmaher) from comment #77)
> ahh, you beat me to it :khuey.  So any thoughts on this?  I will look
> through the browser-chrome tests which use scrollbox.xml and see if those
> specific tests might not clean up very well.  Any other suggestions are more
> than welcome.

It's not used in any tests, it's part of the XBL that makes up the browser window.  We're well into dark and evil territory here ...

I'll think about this tonight, but my current thought on how to move forward is to try to get a refcount log and see what's owning the doc.
From a run with XPCOM_MEM_LOG_CLASSES=nsXMLDocument

Serial Numbers of Leaked Objects:
699 @0x54ce170 (1 references; 0 from COMPtrs)

Going to attempt to gather a refcount log on try.
Unfortunately refcount logging seems to be perturbing things too much :-(  With refcount logging turned on I didn't hit the leak in any of 30ish runs.
http://people.mozilla.org/~khuey/621363.log

These are the relevant parts of a log with backtraces for AddRef and Release on the leaked doc.
This appears to be a traversal issue, in a log which has no leak we destroy the doc through cycle collection.
So, when we don't leak, the document is destroyed during cycle collection.  The last ref is held by an nsXULPrototypeElement's nodeinfo which is destroyed when the nsXULPrototypeElement is destroyed.

When we do leak, that reference is never cleared (and that nsXULPrototypeElement is presumably one of the ones we're leaking).

Based on a suggestion from Olli, I'm going to investigate the nsXULElements that are surviving next.
We're leaking 8 nsXULElements, all of which have one reference.  Presumably we're leaking one element and its children.

503638 @0xc563108 (1 references; 0 from COMPtrs)
503644 @0xb815528 (1 references; 0 from COMPtrs)
503643 @0xb815408 (1 references; 0 from COMPtrs)
503639 @0xc4fa530 (1 references; 0 from COMPtrs)
503645 @0xb815858 (1 references; 0 from COMPtrs)
503641 @0xc466aa8 (1 references; 0 from COMPtrs)
503642 @0xc68d148 (1 references; 0 from COMPtrs)
503640 @0xc4fa6b8 (1 references; 0 from COMPtrs)

Unfortunately the serial numbers aren't stable here, so getting at this is going to be more tricky.
Is the owner document of those leaked nsXULElements or their binding parent the 699th document? If so, maybe such a test can be used to gather a more limited log here?
The xul elements' prototype's document is the 699th document.
Actually, the xul elements' owner doc is the 699th document too, so this is even easier to pin down.
It appears that the xul elements are the issue here.  Something appears to be holding onto the root element of the XBL binding.  Investigating that now.
*It appears that the xul elements are *not* the issue here.
My current understanding of the leak is ...

There is a cycle that proceeds as follows:

- Document -> Nodes -> NodeInfos -> NodeInfoManager -> Document -

The cycle would ordinarily be broken by the cycle collector.  However, in this case, one of the nodes in the DOM tree is being held alive by something outside this cycle.  The <xbl:binding> element for the arrowscrollbox has a refcount that is 2 AddRefs higher than would be expected from this cycle.

The question to investigate now is wtf is holding on to this <xbl:binding>.
The answer to that question is the two data nodes that are attached to it :-/

Going to try to grab a CC log after the second to last reference is dropped from the doc ...
Kyle, any news here?
so I found that by having both the chromepowers (needed for browser-chrome specifically) and specialpowers installed the memory leak doesn't show up.  The patch queue was specifically removing the specialpowers extension from the chrome/a11y/browser-chrome tests because we didn't need it.

Here is a try server run:
https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=a9def0cb7577

I propose going forward with this route and figuring out the memory leak when we don't have the extension installed.  any objections?
Have you verified that it just hasn't gone away entirely?

I don't have any objections to unblocking this.  I haven't made any progress since the last time I posted here.
I have done 3 pushes to try and this last one I did a lot of retriggers on the debug m-oth.  No leaks show up in the automated log processing.

Is there anything else I should be checking?  Historically we would find at least one leak with every push and on 2/3 retriggers if not all of them.
(In reply to Joel Maher (:jmaher) from comment #95)
> Is there anything else I should be checking?

Not that I can think of.

> Historically we would find at
> least one leak with every push and on 2/3 retriggers if not all of them.

Yeah.

It may still be worth investigating this further on the cset we know is bad, but this drops the priority a bit.
inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/68c5f6d6a37b
Whiteboard: [specialpowers][inbound]
https://hg.mozilla.org/mozilla-central/rev/68c5f6d6a37b

Leaving open due to unreviewed, non-obsoleted patches present & last sentence of comment 96. To make merging easier in the future, please could you add whiteboard annotations indicating whether a bug should be left open, if there are multiple patches attached and just one landing (https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound). 

Thanks! :-)
Target Milestone: --- → mozilla10
Blocks: 693312
opened bug 693312 to track the leak when removing the specialpowers extension from browser-chrome.

the other patches were added for diagnostics in tracking down the leak.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: