Closed Bug 294260 Opened 19 years ago Closed 12 years ago

Safe Mode: Auto detect previous start-up failure and offer to start in safe mode

Categories

(Toolkit :: Startup and Profile System, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: jaime.bugzilla, Assigned: MattN)

References

(Depends on 3 open bugs, Blocks 3 open bugs)

Details

Attachments

(2 files, 7 obsolete files)

1. User upgrades or installs incompatible extension.

2. Restarts Firefox and it hangs or crashes on loading.

3. Next time user starts Firefox they should be prompted "Firefox previously
failed to start correctly, do you wish to start in safe mode (recommended)"

4. Firefox then starts in safe mode with a local text page displayed instead of
the home page with a few trouble shooting steps. - probably covered by bug 291534

User can then disable extensions as in bug 290836 and try restarting which
should then restart in normal mode

Set flag after extension/theme loading to signal successful start-up. During
initial start-up check that previous start-up was successful. - detection could
occur at the same time as bug 291223 (however that is probably wont fix as per
bug 286830#c5)

This is similar to the way windows offers the user the choice to start in safe
mode if the previous start-up failed and is becoming more common for applications.

A common question in #firefox is my browser will not start after I upgraded or
installed extension x. A lot of users do not think of trying starting in safe
mode and any documentation about the safe mode (bug 286589) is only easily
accessible from   inside Firefox.
Status -> NEW

I'm for this. Since the landing of bug 293461 this should be easier I think.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Sort of related, bug 291223.
Product: Firefox → Toolkit
Blocks: safe-mode
If there are more than browser.startup.max_resumed_crashes startup crashes and the last one was within the last 6 hours (copied this time from session restore) then restart with a modified safe mode dialog (based on text from faaborg in attachment 336775 [details]).

This is only to handle startup crashes whereas session restore will handle crashes afterwards when restoring tabs.  

My initial patch started tracking in XRE_main but it was discussed that if we are crashing so early, there isn't much we can do.  The reasoning for hooking into "final-ui-startup" and not earlier is that dolske suggested we have to be at least able to show the safe mode dialog so if we crash before that then we can't do anything about it.  As I type this I'm starting to think that this logic is flawed since it assumes that running in safe mode (when we show the dialog to make permanent changes) isn't going to fix any problems.  I think we should set the tracking pref before "profile-after-change" since that's the earliest notification that extensions can observe. [1]


[1] https://developer.mozilla.org/en/Observer_Notifications
Assignee: nobody → mnoorenberghe+bmo
Status: NEW → ASSIGNED
Attachment #567697 - Flags: feedback?(paul)
Attachment #567697 - Flags: feedback?(bmcbride)
(In reply to Matthew N. [:MattN] from comment #4)
> The reasoning for
> hooking into "final-ui-startup" and not earlier is that dolske suggested we
> have to be at least able to show the safe mode dialog so if we crash before
> that then we can't do anything about it.

We need to determine what's up and trigger safe mode before any add-ons are loaded, or else going into safe mode won't help at all.

That said, some time in the future we ideally should find some way to maybe even be able to start up *something* when even regular safe mode doesn't launch correctly, as it would be nice to have us figure out when third-party libraries hooking into our process are the problem - but that's material for another different mechanism and a different bug.

The bug here is for getting into some stable mode (and be able to run product, blocklist and perhaps add-on updates) when some add-on being loaded is causing the startup crash. We should make sure that scenario is being caught be the patch here.
That said, thanks for working on it!
Comment on attachment 567697 [details] [diff] [review]
v.1 start in safe mode after multiple startup crashes

Review of attachment 567697 [details] [diff] [review]:
-----------------------------------------------------------------

Would like to use this for detecting when a potentially incompatible addon is crashing the application. See: https://wiki.mozilla.org/Features/Add-ons/Add-ons_Default_to_Compatible

However, that can't rely on anything in /browser/. Could the detection code and related prefs be moved into /toolkit/ somewhere?
Attachment #567697 - Flags: feedback?(bmcbride) → feedback-
(In reply to Blair McBride (:Unfocused) from comment #6)
Is tracking before just before "profile-after-change" (possibly profile-do-change) early enough?
(In reply to Matthew N. [:MattN] from comment #7)
> Is tracking before just before "profile-after-change" (possibly
> profile-do-change) early enough?

profile-do-change looks like it will work, and it's the earliest notification before extensions are loaded. See https://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsXREDirProvider.cpp#746
* Moved prefs and detection to toolkit per Blair's request.  
* Started tracking earlier in startup using a new notification right before "profile-do-change" to beat anything which may crash which is listening to that.
** The pref service is something which gets setup with profile-do-change so I fire the event early to it.
* I made a new XPCOM component for this code but didn't spend much time on naming/location yet so suggestions are welcome.

Paul, I'm firing "startup-success" just before an init of sessionstore but the comment implies that it may already be running.  Where's the best place to fire the notification just before any tabs are restored and before sessionrestore changes state.  I don't want there to be overlap between the crash detection since it would be misleading if safe mode is started and about:sessionrestore is displayed since then we don't know if a crash was startup or a page-related. Is there an existing notification like this?
Attachment #567697 - Attachment is obsolete: true
Attachment #568349 - Flags: feedback?(paul)
Attachment #568349 - Flags: feedback?(bmcbride)
Attachment #567697 - Flags: feedback?(paul)
Comment on attachment 568349 [details] [diff] [review]
WIP v.2 move to toolkit and start tracking earlier

Review of attachment 568349 [details] [diff] [review]:
-----------------------------------------------------------------

f+, but I have one concern... This looks like it will be completely separate from session restore's crash detection. What's the interaction going to be like there?

Otherwise, I'll defer to Blair about addons interaction.

While you're here, you might want to look at bug 347680, which was a similar attempt a couple years ago, integrating with session restore.

::: browser/base/content/safeMode.js
@@ +138,5 @@
>  }
>  
>  function onLoad() {
> +  var startup = Components.classes["@mozilla.org/startup/startup;1"]
> +                          .getService().wrappedJSObject;

What's this now? You're adding a new class but not an interface? If it's just going to be accessible from JS you should probably just do this as a jsm and not do the wrappedJSObject stuff. Otherwise, add an interface. (There's already nsIAppStartup which is sort of becoming overloaded, but could be useful).

::: toolkit/components/startup/startup.js
@@ +55,5 @@
> +  this.wrappedJSObject = this;
> +}
> +
> +Startup.prototype = {
> +observe: function (aSubject, aTopic, aData) {

style nit: indent all of the object (maybe that's just personal pref, but I haven't seen any code here like that so I stand by it)
Attachment #568349 - Flags: feedback?(paul) → feedback+
Comment on attachment 568349 [details] [diff] [review]
WIP v.2 move to toolkit and start tracking earlier

Review of attachment 568349 [details] [diff] [review]:
-----------------------------------------------------------------

Could you send a notification after startup when it's been forced into safemode? Something like "safemode-forced", and maybe "crash" as the data. That way the Addons Manager (and any other potential consumer) can listen for that, and try to do some self-repair. Note that that notification would have to occur *after* the addons-startup notification - maybe once profile-after-change has fired?
Attachment #568349 - Flags: feedback?(bmcbride) → feedback+
(In reply to Paul O'Shannessy [:zpao] from comment #10)
> f+, but I have one concern... This looks like it will be completely separate
> from session restore's crash detection. What's the interaction going to be
> like there?

There should be no interaction between them and I think that is the least confusing for the user.  They should either be crashing during startup in which case they get safe mode or they are crashing during session restore in which case they see about:sessionrestore.  The user shouldn't see both since they cover two mutually-exclusive parts of startup.  Does this answer your question?

> While you're here, you might want to look at bug 347680, which was a similar
> attempt a couple years ago, integrating with session restore.

Yes, I saw that bug and patch and incorporated feedback from there.   I think that a button on about:sessionrestore would still make sense and that bug can focus on that feature.
 
> ::: browser/base/content/safeMode.js
> @@ +138,5 @@
> >  }
> >  
> >  function onLoad() {
> > +  var startup = Components.classes["@mozilla.org/startup/startup;1"]
> > +                          .getService().wrappedJSObject;
> 
> What's this now? You're adding a new class but not an interface? If it's
> just going to be accessible from JS you should probably just do this as a
> jsm and not do the wrappedJSObject stuff. Otherwise, add an interface.
> (There's already nsIAppStartup which is sort of becoming overloaded, but
> could be useful).

Yeah, I figured I would be moving this somewhere else so just did that to get feedback on approach and so I could write it in JS.  I think I'll move it nsIAppStartup unless there is an objection to that. I guess means porting to C++?

Paul, do you think that I'm firing "startup-success" at an appropriate time to be as close to session restore startup without overlapping. (See paragraph 2 of comment 9).

I'm not sure what to do about the hack in nsXREDirProvider.cpp is it acceptable to init the pref service early and then make sure it doesn't get init'd again later?

Is it fine to add the two startup notifications in my patch or should both/any of them just call directly into nsIAppStartup?
(In reply to Matthew N. [:MattN] from comment #12)
> (In reply to Paul O'Shannessy [:zpao] from comment #10)
> > f+, but I have one concern... This looks like it will be completely separate
> > from session restore's crash detection. What's the interaction going to be
> > like there?
> 
> There should be no interaction between them and I think that is the least
> confusing for the user.  They should either be crashing during startup in
> which case they get safe mode or they are crashing during session restore in
> which case they see about:sessionrestore.  The user shouldn't see both since
> they cover two mutually-exclusive parts of startup.  Does this answer your
> question?

Ok, that's what I figured was happening.

> Paul, do you think that I'm firing "startup-success" at an appropriate time
> to be as close to session restore startup without overlapping. (See
> paragraph 2 of comment 9).

Session restore is started from the first window's delayedStartup, so I think you're fine.

> I'm not sure what to do about the hack in nsXREDirProvider.cpp is it
> acceptable to init the pref service early and then make sure it doesn't get
> init'd again later?

Maybe. I think you might want to check with somebody who knows that whole thing better. I don't know if we've done any optimizations at startup to move pref service initialization later already or what sort of hit it is. I'm guessing we already do things pretty early since so much needs prefs. Taras might now (you might want to talk to him anyway and see if there's anything telemetry related he might have thoughts on)

> Is it fine to add the two startup notifications in my patch or should
> both/any of them just call directly into nsIAppStartup?

Both are fine. I would imagine there's additional overhead to using observers, but keeping it open for something else to easily know about starup steps may be useful.
* Ported new JS to nsAppStartup.cpp
* Changed from using notifications to calling ...Begin/End directly
* Unit tests for added functions
* Simple mochitest to make sure that in_progress pref gets cleared on a regular startup
Attachment #568349 - Attachment is obsolete: true
Attachment #571253 - Flags: review?(dtownsend)
Comment on attachment 571253 [details] [diff] [review]
v.3 start in safe mode after multiple startup crashes

Review of attachment 571253 [details] [diff] [review]:
-----------------------------------------------------------------

Please put the pref names into constants to save us from typos

::: browser/base/content/aboutDialog.js
@@ +274,3 @@
>        // If already in safe mode restart in safe mode (bug 327119)
>        if (Services.appinfo.inSafeMode) {
> +        appStartup.restartInSafeMode(Components.interfaces.nsIAppStartup.eAttemptQuit);

Need a return here I think

::: modules/libpref/src/init/all.js
@@ +3377,5 @@
> +
> +// Startup Crash Tracking
> +// number of startup crashes that can occur before starting into safe mode automatically
> +// (this pref has no effect if more than 6 hours have passed since the last crash)
> +pref("toolkit.startup.max_resumed_crashes", 2);

I think this should go into firefox.js

::: toolkit/components/startup/nsAppStartup.cpp
@@ +797,5 @@
> +  NS_ENSURE_SUCCESS(rv, rv);
> +
> +  PRInt32 maxResumedCrashes;
> +  rv = branch->GetIntPref("toolkit.startup.max_resumed_crashes", &maxResumedCrashes);
> +  NS_ENSURE_SUCCESS(rv, rv);

You need to test if the pref exists at all and return false if it doesn't

@@ +808,5 @@
> +
> +  PRInt32 recentCrashes = 0;
> +  rv = branch->GetIntPref("toolkit.startup.recent_crashes", &recentCrashes);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  *_retval = (recentCrashes > maxResumedCrashes && maxResumedCrashes != -1);

What is the -1 test for?

@@ +815,5 @@
> +
> +NS_IMETHODIMP
> +nsAppStartup::TrackStartupCrashBegin()
> +{
> +  const PRInt32 SIX_HOURS_IN_SECONDS = 6 * 60 * 60;

This isn't a good constant name. If we really want a constant use something like MAX_TIME_SINCE_STARTUP or something. I think we probably want this to be a pref though so apps can control it easily.

@@ +825,5 @@
> +    return NS_ERROR_FAILURE;
> +
> +  xr->GetInSafeMode(&inSafeMode);
> +
> +  nsCOMPtr<nsIPrefBranch> branch = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);

You're not getting a pref branch here, just the full service. Name it appropriately and get it with Preferences::GetService().

@@ +835,5 @@
> +
> +  PRInt64 temp;
> +  LL_DIV(temp, PR_Now(), PR_USEC_PER_SEC);
> +  PRInt32 now;
> +  LL_L2I(now, temp);

I may be mistaken but I don't think we need these special operators to do math on 64-bit numbers. Maybe check with someone who knows more, bsmedberg perhaps

@@ +861,5 @@
> +      // and we don't want someone to get stuck in safe mode if their prefs are read-only.
> +      rv = branch->ClearUserPref("toolkit.startup.recent_crashes");
> +    }
> +  }
> +  NS_ENSURE_SUCCESS(rv, rv);

This needs to be inside the if block

@@ +864,5 @@
> +  }
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  rv = branch->SetIntPref("toolkit.startup.in_progress", now);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  nsIPrefService* prefs = Preferences::GetService();

You already get the service above, no need to again.

@@ +867,5 @@
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  nsIPrefService* prefs = Preferences::GetService();
> +  if (!prefs)
> +    return NS_ERROR_FAILURE;
> +  rv = prefs->SavePrefFile(nsnull); // flush prefs to disk since we are tracking crashes

What's the perf hit here? Wonder if it's cheaper to just do a simple file instead.

@@ +877,5 @@
> +nsAppStartup::TrackStartupCrashEnd()
> +{
> +  nsresult rv;
> +  nsCOMPtr<nsIPrefBranch> branch = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
> +  NS_ENSURE_SUCCESS(rv, rv);

Same problem as the last method

@@ +889,5 @@
> +    xr->GetInSafeMode(&inSafeMode);
> +
> +  bool safeModeNecessary;
> +  rv = this->AutomaticSafeModeNecessary(&safeModeNecessary);
> +  if (NS_FAILED(rv)) NS_WARNING("Could not determine whether we are in automatic safe mode.");

If you're going to proceed anyway then you should give safeModeNecessary a default value before you access it.

@@ +903,5 @@
> +    // clear the count of recent crashes after a succesful startup when not in safe mode
> +    rv = branch->ClearUserPref("toolkit.startup.recent_crashes");
> +    if (NS_FAILED(rv)) NS_WARNING("Could not clear startup crash count.");
> +  } else {
> +    // manual safe mode – don't reset recent crash count

This else block is unnecessary

@@ +906,5 @@
> +  } else {
> +    // manual safe mode – don't reset recent crash count
> +  }
> +
> +  nsIPrefService* prefs = Preferences::GetService();

Again, not needed again

@@ +928,5 @@
> +  PR_SetEnv("MOZ_SAFE_MODE_RESTART=1");
> +
> +  nsCOMPtr<nsIAppStartup> appStartup = do_GetService("@mozilla.org/toolkit/app-startup;1");
> +  if (!appStartup)
> +    return NS_ERROR_FAILURE;

This is already the app startup component, no need to get it again.

::: toolkit/components/startup/public/nsIAppStartup.idl
@@ +87,5 @@
> +
> +    /**
> +     * Whether automatic safe mode is necessary at this time
> +     */
> +    bool automaticSafeModeNecessary();

I think this should be named inForcedSafeMode

::: toolkit/components/startup/tests/unit/test_startup_crash.js
@@ +27,5 @@
> +  test_AutomaticSafeModeNecessary_safeMode();
> +}
> +
> +// reset prefs to default
> +function init() {

I always find it weird to have a method called init called a lot. How about naming it resetPrefs?

@@ +47,5 @@
> +  let beforeBegin = now_seconds();
> +  appStartup.trackStartupCrashBegin();
> +  let afterBegin = now_seconds();
> +  do_check_true(prefService.getIntPref(pref_in_progress) >= beforeBegin);
> +  do_check_true(prefService.getIntPref(pref_in_progress) <= afterBegin);

Talk to Marco about this, IIRC it isn't safe to compare values from Date.now() and PR_Now() like this.

@@ +49,5 @@
> +  let afterBegin = now_seconds();
> +  do_check_true(prefService.getIntPref(pref_in_progress) >= beforeBegin);
> +  do_check_true(prefService.getIntPref(pref_in_progress) <= afterBegin);
> +  do_check_false(prefService.prefHasUserValue(pref_recent_crashes));
> +  // TODO: ensure it was actually written to storage

Test that it hasn't force safe mode

@@ +59,5 @@
> +  appStartup.trackStartupCrashBegin();
> +  afterBegin = now_seconds();
> +  do_check_true(prefService.getIntPref(pref_in_progress) >= beforeBegin);
> +  do_check_true(prefService.getIntPref(pref_in_progress) <= afterBegin);
> +  do_check_eq(1, prefService.getIntPref(pref_recent_crashes)); // should remain the same

Ditto

@@ +70,5 @@
> +  afterBegin = now_seconds();
> +  do_check_true(prefService.getIntPref(pref_in_progress) >= beforeBegin);
> +  do_check_true(prefService.getIntPref(pref_in_progress) <= afterBegin);
> +  // should remain the same since the last startup was not a crash
> +  do_check_eq(max_resumed + 1, prefService.getIntPref(pref_recent_crashes));

Test that it is now in safe mode, for the rest of these too

@@ +75,5 @@
> +
> +  init();
> +  // normal startup after 1 non-recent crash (1 year ago), no other recent
> +  beforeBegin = now_seconds();
> +  prefService.setIntPref(pref_in_progress, beforeBegin - 365*24*60*60);

Spaces around operators

::: toolkit/xre/nsXREDirProvider.cpp
@@ +761,5 @@
> +
> +      rv = appStartup->AutomaticSafeModeNecessary(&safeModeNecessary);
> +      if (NS_FAILED(rv)) NS_ERROR("Could not determine whether automatic safe mode is necessary");
> +      if (!inSafeMode && safeModeNecessary)
> +        appStartup->RestartInSafeMode(nsIAppStartup::eForceQuit);

You need to return after calling this so we don't continue into the rest of startup

@@ +786,5 @@
>                                          "profile-after-change");
>  
> +    if (inSafeMode && safeModeNecessary) {
> +      static const PRUnichar kCrashed[] = {'c','r','a','s','h','e','d','\0'};
> +      obsSvc->NotifyObservers(nsnull, "safemode-forced", kCrashed);

What is this notification for?
Attachment #571253 - Flags: review?(dtownsend) → review-
(In reply to Dave Townsend (:Mossop) from comment #15)
> > +    if (inSafeMode && safeModeNecessary) {
> > +      static const PRUnichar kCrashed[] = {'c','r','a','s','h','e','d','\0'};
> > +      obsSvc->NotifyObservers(nsnull, "safemode-forced", kCrashed);
> 
> What is this notification for?

See comment 11 and bug 695584.
Comment on attachment 571253 [details] [diff] [review]
v.3 start in safe mode after multiple startup crashes

Review of attachment 571253 [details] [diff] [review]:
-----------------------------------------------------------------

Once I ran this through talos and saw the perf hit I got distracted from submitting these comments.  I've fixed anything I didn't comment in but I'm looking for a new approach to deal with the perf hit.

::: modules/libpref/src/init/all.js
@@ +3377,5 @@
> +
> +// Startup Crash Tracking
> +// number of startup crashes that can occur before starting into safe mode automatically
> +// (this pref has no effect if more than 6 hours have passed since the last crash)
> +pref("toolkit.startup.max_resumed_crashes", 2);

Oh, I moved it out of there based on Comment 6 from Blair.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +808,5 @@
> +
> +  PRInt32 recentCrashes = 0;
> +  rv = branch->GetIntPref("toolkit.startup.recent_crashes", &recentCrashes);
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  *_retval = (recentCrashes > maxResumedCrashes && maxResumedCrashes != -1);

Startup crash detection can be disabled by setting the pref to -1. (Same as browser.sessionstore.max_resumed_crashes).

@@ +815,5 @@
> +
> +NS_IMETHODIMP
> +nsAppStartup::TrackStartupCrashBegin()
> +{
> +  const PRInt32 SIX_HOURS_IN_SECONDS = 6 * 60 * 60;

Yeah, I got this from sessionrestore code. It uses a const and no pref now.  Do you want me to add a pref?

@@ +835,5 @@
> +
> +  PRInt64 temp;
> +  LL_DIV(temp, PR_Now(), PR_USEC_PER_SEC);
> +  PRInt32 now;
> +  LL_L2I(now, temp);

Other code seems to do without and bent confirmed that it's not necessary anymore. https://mxr.mozilla.org/mozilla-central/search?string=PR_NOW.*PR_USEC_PER_SEC&regexp=on

@@ +867,5 @@
> +  NS_ENSURE_SUCCESS(rv, rv);
> +  nsIPrefService* prefs = Preferences::GetService();
> +  if (!prefs)
> +    return NS_ERROR_FAILURE;
> +  rv = prefs->SavePrefFile(nsnull); // flush prefs to disk since we are tracking crashes

The perf hit of this whole patch is 10-25% on startup and changing to touching a simple file and using it's timestamp to replace toolkit.startup.in_progress only helped about 5%. Some options we discussed are using the existing profile lock file, integrating with the crash reporter, environment variables or using OS specific storage (ie. registry on Windows).

::: toolkit/xre/nsXREDirProvider.cpp
@@ +786,5 @@
>                                          "profile-after-change");
>  
> +    if (inSafeMode && safeModeNecessary) {
> +      static const PRUnichar kCrashed[] = {'c','r','a','s','h','e','d','\0'};
> +      obsSvc->NotifyObservers(nsnull, "safemode-forced", kCrashed);

See comment 11
The following is the approach I'm starting to test now.  If anyone knows of a better approach that does minimal disk I/O at startup, let me know.

At a high level, we use the existence of an old profile lock at startup to know whether there was a crash.  We use the timestamp of that lock compared to a timestamp pref which gets set on successful startup to know whether that existing lock was for a startup-related crash or if it happened after.  This moves most of the I/O out of regular startup to the case where there was a crash on last run.

1) At profile lock time, if there exists a stale lock, get the modification date (t1) of it and keep track of this on the profile lock object. This requires some disk I/O but only if the application crashed on last run.
2) Lock the profile as usual, replacing any existing locks.  Record this timestamp (t2) for later.
3) In TrackStartupCrashBegin, if stale lock existed and modification date (t1) != toolkit.startup.last_success pref containing last successful startup, increment toolkit.startup.recent_crashes pref.
4) Restart in safe mode if AutomaticSafeModeNecessary() is true (same as before)
5) Once we declare successful startup, call TrackStartupCrashEnd() as before.  This will store t2 in a pref toolkit.startup.last_success to be used in step 3 of the next startup if necessary.

If there's no suggestions, I'll benchmark this approach.
(In reply to Matthew N. [:MattN] from comment #18)
> 3) In TrackStartupCrashBegin, if stale lock existed and modification date
> (t1) != toolkit.startup.last_success pref containing last successful
> startup, increment toolkit.startup.recent_crashes pref.

I guess we also zero out if no stale lock existed? (i.e. reset when we manage to not crash so that a couple of crashes between a number of non-crashing runs won't trigger safe mode in the end)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> I guess we also zero out if no stale lock existed? (i.e. reset when we
> manage to not crash so that a couple of crashes between a number of
> non-crashing runs won't trigger safe mode in the end)

Yes, this bug just handles consecutive startup crashes and I glossed over those details since it's the same as the previous approach.
My first talos try run using the approach in Comment 18 seems to no impact on performance except maybe 5% on Windows.  I'm going to do more talos runs to check the numbers and then clean up my patch.
MAX_STARTUP_BUFFER is used since I don't actually stat the profile lock during startup.  Instead I use the timestamp of StartupTimeline::MAIN as an approximation since it's recorded right around the same time.  The current value is a 10 second buffer which would seem reasonable to get from the beginning of XRE_main (line 2623 is where it's recorded) to line 3192 (SelectProfile) in the same function.

Here are some rough notes of cases I've tried to consider:
* handle crash in second crash ever startup using feature
* backwards migration and forward again lock will be gone
* new profile
* upgrade and crash from version without crash tracking
* auto safe mode then start again => should go in regular mode, num crashes = max
* manual safe mode vs. auto safe mode
* handle bogus case where lastSuccess > replacedLockTime

I'm not sure of the best (safe) way to handle the case where we fallback to using a symlink.  Right now it always removes the symlink but I could change it to only remove the symlink |if (!aFatalSignal)| but then we have to make sure that the next instance doesn't think we're still locked.
Attachment #571253 - Attachment is obsolete: true
Attachment #576666 - Flags: review?(dtownsend)
Comment on attachment 576666 [details] [diff] [review]
v.4 detect startup crashes using profile lock modification time

Review of attachment 576666 [details] [diff] [review]:
-----------------------------------------------------------------

Some first comments, generally this looks really good but I have to look at it a second time tomorrow with a clearer head

::: profile/dirserviceprovider/src/nsProfileLock.cpp
@@ +355,5 @@
> +    // don't replace an existing lock time if fcntl already got one
> +    if (!mReplacedLockTime) {
> +        nsresult hasMod = aLockFile->GetLastModifiedTimeOfLink(&mReplacedLockTime);
> +        if (NS_FAILED(hasMod))
> +            mReplacedLockTime = 0;

In LockWithFcntl you assume that if GetLastModifiedTime fails it won't alter mReplacedLockTime, is there reason to think differently here? I'd opt for consistency and don't bother checking the return either time.

@@ +611,5 @@
>          return rv;
>  
> +    nsresult hasMod = lockFile->GetLastModifiedTime(&mReplacedLockTime);
> +    if (NS_FAILED(hasMod))
> +        mReplacedLockTime = 0;

Same here

::: profile/public/nsIProfile.idl
@@ +74,5 @@
> +     * Modification time of an existing lock file (LOCKFILE_NAME) at startup.
> +     * This is used for startup crash detection and assumes that the lock file
> +     * is not modified after startup.
> +     */
> +    readonly attribute PRTime replacedLockTime;

Don't know if this code is even used anymore, since you haven't patched the C++ and this compiles I guess you don't need touch the idl here.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +792,5 @@
> +  if (prefType != nsIPrefBranch::PREF_INT)
> +    return NS_ERROR_NOT_AVAILABLE;
> +
> +  PRInt32 maxResumedCrashes;
> +  rv = Preferences::GetInt(kPrefMaxResumedCrashes, &maxResumedCrashes);

I wouldn't bother with the test above, if it isn't there then this will just throw an error anyway

@@ +836,5 @@
> +  PRInt32 lockSeconds = (PRInt32)(mReplacedLockTime / PR_MSEC_PER_SEC);
> +
> +  // started close enough to good startup so call it good
> +  if (lockSeconds <= lastSuccessfulStartup + MAX_STARTUP_BUFFER
> +      && lockSeconds >= lastSuccessfulStartup - MAX_STARTUP_BUFFER) {

The last lock is always going to have a later timestamp than the last successful start right?

::: toolkit/components/startup/public/nsIAppStartup.idl
@@ +177,5 @@
>      attribute boolean interrupted;
> +
> +    /*
> +     * Modification time of the profile lock before the profile was locked on
> +     * this startup. Used to know the last time the profile was used.

"... was used and not closed cleanly" or something

::: toolkit/xre/nsAppRunner.cpp
@@ +3391,5 @@
>          NS_TIME_FUNCTION_MARK("Created AppStartup");
>  
> +        PRTime replacedLockTime;
> +        profileLock->GetReplacedLockTime(&replacedLockTime);
> +        appStartup->SetReplacedLockTime(replacedLockTime);

Why do this shuffling here instead of having nsAppStartup pull it directly from the profile when needed?
Comment on attachment 576666 [details] [diff] [review]
v.4 detect startup crashes using profile lock modification time

Review of attachment 576666 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.

::: profile/dirserviceprovider/src/nsProfileLock.cpp
@@ +355,5 @@
> +    // don't replace an existing lock time if fcntl already got one
> +    if (!mReplacedLockTime) {
> +        nsresult hasMod = aLockFile->GetLastModifiedTimeOfLink(&mReplacedLockTime);
> +        if (NS_FAILED(hasMod))
> +            mReplacedLockTime = 0;

Not sure if this was because I got some weird return values if the link doesn't exist or I was just being safe since the return value wasn't defined in that case.  I'll check.

@@ +611,5 @@
>          return rv;
>  
> +    nsresult hasMod = lockFile->GetLastModifiedTime(&mReplacedLockTime);
> +    if (NS_FAILED(hasMod))
> +        mReplacedLockTime = 0;

IIRC, I received a negative integer when the lock file didn't exist on Windows.  I was trying to workaround that case but I could just fix GetLastModifiedTime or at least document that here in a comment.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +836,5 @@
> +  PRInt32 lockSeconds = (PRInt32)(mReplacedLockTime / PR_MSEC_PER_SEC);
> +
> +  // started close enough to good startup so call it good
> +  if (lockSeconds <= lastSuccessfulStartup + MAX_STARTUP_BUFFER
> +      && lockSeconds >= lastSuccessfulStartup - MAX_STARTUP_BUFFER) {

Yes, later or equal in the normal cases but not when the last lock time is 0 since it didn't exist (ie. first run on Windows with this feature or if the user deleted the lock) but this is already covered by line 826.

::: toolkit/components/startup/public/nsIAppStartup.idl
@@ +177,5 @@
>      attribute boolean interrupted;
> +
> +    /*
> +     * Modification time of the profile lock before the profile was locked on
> +     * this startup. Used to know the last time the profile was used.

I'll also fix the comment opening -> /**

::: toolkit/xre/nsAppRunner.cpp
@@ +3391,5 @@
>          NS_TIME_FUNCTION_MARK("Created AppStartup");
>  
> +        PRTime replacedLockTime;
> +        profileLock->GetReplacedLockTime(&replacedLockTime);
> +        appStartup->SetReplacedLockTime(replacedLockTime);

I couldn't see a clean way to get a reference to the profile lock from within the "app" which starts up.
(In reply to Matthew N. [:MattN] from comment #24)
> ::: toolkit/xre/nsAppRunner.cpp
> @@ +3391,5 @@
> >          NS_TIME_FUNCTION_MARK("Created AppStartup");
> >  
> > +        PRTime replacedLockTime;
> > +        profileLock->GetReplacedLockTime(&replacedLockTime);
> > +        appStartup->SetReplacedLockTime(replacedLockTime);
> 
> I couldn't see a clean way to get a reference to the profile lock from
> within the "app" which starts up.

nsIToolkitProfileService.selectedProfile ought to work
Comment on attachment 576666 [details] [diff] [review]
v.4 detect startup crashes using profile lock modification time

Review of attachment 576666 [details] [diff] [review]:
-----------------------------------------------------------------

A few more issues here that I think we need to figure out.

::: browser/base/content/browser.js
@@ +1614,5 @@
> +    let appStartup = Cc["@mozilla.org/toolkit/app-startup;1"]
> +                       .getService(Ci.nsIAppStartup);
> +    appStartup.trackStartupCrashEnd();
> +  } catch (ex) {
> +    dump("could not end startup crash tracking: " + ex + "\n");

Use Components.utils.reportError here

::: modules/libpref/src/init/all.js
@@ +3366,5 @@
>  
> +// Startup Crash Tracking
> +// number of startup crashes that can occur before starting into safe mode automatically
> +// (this pref has no effect if more than 6 hours have passed since the last crash)
> +pref("toolkit.startup.max_resumed_crashes", 2);

As I said before, I think this should go into firefox.js

::: profile/dirserviceprovider/src/nsProfileLock.cpp
@@ +611,5 @@
>          return rv;
>  
> +    nsresult hasMod = lockFile->GetLastModifiedTime(&mReplacedLockTime);
> +    if (NS_FAILED(hasMod))
> +        mReplacedLockTime = 0;

XPCOM rules say that if a method returns an error then it should not affect the out parameter. Maybe GetLastModified is doing something wrong there that should be fixed. Really I just want consistency though, either we can trust it to not change the out param or we can't.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +780,5 @@
>    return NS_OK;
>  }
> +
> +NS_IMETHODIMP
> +nsAppStartup::AutomaticSafeModeNecessary(bool *_retval)

This method is called at least twice in every startup which seems unnecessary. What I think would be better would be to do the work in TrackStartupCrashBegin and just store it as a field on the class. Have TrackStartupCrashBegin return that too so nsXREDirProvider doesn't need to call back to nsIAppStartup to get it.

As I mentioned before I think the getter for this would be better named inForcedSafeMode.

::: toolkit/xre/nsXREDirProvider.cpp
@@ +760,5 @@
> +
> +      rv = appStartup->AutomaticSafeModeNecessary(&safeModeNecessary);
> +      if (NS_FAILED(rv)) NS_WARNING("Could not determine whether automatic safe mode is necessary");
> +      if (!inSafeMode && safeModeNecessary) {
> +        appStartup->RestartInSafeMode(nsIAppStartup::eForceQuit);

Why do we need to restart here? Can't we just switch to safe mode at this point?
Attachment #576666 - Flags: review?(dtownsend) → review-
(In reply to Blair McBride (:Unfocused) from comment #6)

> Would like to use this for detecting when a potentially incompatible addon
> is crashing the application. See:
> https://wiki.mozilla.org/Features/Add-ons/Add-ons_Default_to_Compatible
> 
> However, that can't rely on anything in /browser/. Could the detection code
> and related prefs be moved into /toolkit/ somewhere?

(In reply to Dave Townsend (:Mossop) from comment #26)
> ::: modules/libpref/src/init/all.js
> > +pref("toolkit.startup.max_resumed_crashes", 2);
> 
> As I said before, I think this should go into firefox.js

I'm trying to accommodate comment 6 too so does that mean I can leave the prefix "toolkit.startup." but have it located in firefox.js?  Is the idea that each app should opt-in for this feature rather than having this as the default for toolkit consumers?

(In reply to Dave Townsend (:Mossop) from comment #25)
> (In reply to Matthew N. [:MattN] from comment #24)
> > I couldn't see a clean way to get a reference to the profile lock from
> > within the "app" which starts up.
> 
> nsIToolkitProfileService.selectedProfile ought to work

This gives me an object referring to the profile (from which I can get the lock) but not the same profile object that is used during startup since we actually create two instances of the nsToolkitProfileService. At startup nsToolkitProfileService is created using NS_NewToolkitProfileService, I assume because XPCOM isn't ready at that point.  The instance of the service is not used for future GetService calls. This means that replacedLockTime is always null on the instance of the profile that I get from this new instance of the profile service. Should I change the code to remember the service instance in a global somewhere or am I missing an easier solution?

(In reply to Dave Townsend (:Mossop) from comment #26)
> ::: toolkit/components/startup/nsAppStartup.cpp
> @@ +780,5 @@
> >    return NS_OK;
> >  }
> > +
> > +NS_IMETHODIMP
> > +nsAppStartup::AutomaticSafeModeNecessary(bool *_retval)
> 
> This method is called at least twice in every startup which seems
> unnecessary. What I think would be better would be to do the work in
> TrackStartupCrashBegin and just store it as a field on the class. Have
> TrackStartupCrashBegin return that too so nsXREDirProvider doesn't need to
> call back to nsIAppStartup to get it.
> 
> As I mentioned before I think the getter for this would be better named
> inForcedSafeMode.

The reason I didn't go with that name is because this returns true when we are not in safe mode but we're going to restart in safe mode.  See my comment below on why we need to restart.  Without the restart I would agree.

(In reply to Dave Townsend (:Mossop) from comment #26)
> ::: toolkit/xre/nsXREDirProvider.cpp
> @@ +760,5 @@
> > +
> > +      rv = appStartup->AutomaticSafeModeNecessary(&safeModeNecessary);
> > +      if (NS_FAILED(rv)) NS_WARNING("Could not determine whether automatic safe mode is necessary");
> > +      if (!inSafeMode && safeModeNecessary) {
> > +        appStartup->RestartInSafeMode(nsIAppStartup::eForceQuit);
> 
> Why do we need to restart here? Can't we just switch to safe mode at this
> point?

Code has already made decisions at this point of whether we're in safe mode (by checking gSafeMode) so it's too late to just set gSafeMode = true.  For example in nsAppRunner: https://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp?rev=13185042410d#3271 
and nsXREDirProvider::GetFile.
> (In reply to Dave Townsend (:Mossop) from comment #25)
> > (In reply to Matthew N. [:MattN] from comment #24)
> > > I couldn't see a clean way to get a reference to the profile lock from
> > > within the "app" which starts up.
> > 
> > nsIToolkitProfileService.selectedProfile ought to work
> 
> This gives me an object referring to the profile (from which I can get the
> lock) but not the same profile object that is used during startup since we
> actually create two instances of the nsToolkitProfileService. At startup
> nsToolkitProfileService is created using NS_NewToolkitProfileService, I
> assume because XPCOM isn't ready at that point.  The instance of the service
> is not used for future GetService calls. This means that replacedLockTime is
> always null on the instance of the profile that I get from this new instance
> of the profile service. Should I change the code to remember the service
> instance in a global somewhere or am I missing an easier solution?

Also, nsIToolkitProfileService.selectedProfile is not necessarily equal to the currently running profile if there is more than one.  From what I understand, it's only used really changed when you use the profile manager UI.  It returns the default profile to use which would be selected by default in the profile manager UI.  This could also probably be fixed but I don't know if there's a reason for selectedProfile not to always return the currently running profile after we've chosen one.  I'll still have the problem mentioned in comment 27 with NS_NewToolkitProfileService though.

Maybe bsmedberg can shed some light?
I don't understand why you want to be using the profile service at all (note that the currently running profile directory may not be one of the listed profiles at all), and I'd still like to remove the profile service altogether (not a high priority currently though).

It appears that you just want access to the current profile lock, right? If so, please just expose the current profile lock on nsIXULRuntime, or even just expose the specific data you care about directly on nsIXULRuntime and don't mess with the profile lock object (which is fragile).

Also there's some very strange stuff in this patch which calls profile-do-change on the prefservice directly, which sounds like a really bad idea offhand. The comment " to be able to catch crashes caused by other code run from this notification." isn't clear why this specific call is necessary in order to catch these crashes. What is special about the prefservice?

I also have style nits related to the patch, but I can go over those in review later.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #29)
> I don't understand why you want to be using the profile service at all (note
> that the currently running profile directory may not be one of the listed
> profiles at all), and I'd still like to remove the profile service
> altogether (not a high priority currently though).
Yeah, I realized before writing comment 28 that the profile service is only really for the profile manager UI and isn't very useful outside of it.  It was suggested to me in comment 25 probably because the name of the attribute is misleading.
I'm possibly going to be adding another consumer for the toolkit profile service in the near term though but that's not related to this bug.

> It appears that you just want access to the current profile lock, right? If
> so, please just expose the current profile lock on nsIXULRuntime, or even
> just expose the specific data you care about directly on nsIXULRuntime and
> don't mess with the profile lock object (which is fragile).

This is to address the last point in comment 23. I already have access to the lock from that code but the suggestion was to move that logic.  I need a way to get the modification time of an existing lock file before it's replaced so I think that makes sense to be in nsProfileLock.cpp.  I'm kinda doing what you say when I set the lock time on an attribute on the nsIAppStartup interface (rather than nsIXULRuntime but both are fine by me).

What do you mean to not "mess with the profile lock object"?  What kinds of changes don't you like.  I understand that breaking profile locks would be a big problem so I definitely trying to avoid that.  Is the way I'm getting the data in the current patch OK?

> Also there's some very strange stuff in this patch which calls
> profile-do-change on the prefservice directly, which sounds like a really
> bad idea offhand. The comment " to be able to catch crashes caused by other
> code run from this notification." isn't clear why this specific call is
> necessary in order to catch these crashes. What is special about the
> prefservice?

I need to be able to read prefs in nsAppStartup::TrackStartupCrashBegin and AutomaticSafeModeNecessary in order to determine if we've exceeded the crash threshold and then go into safe mode.  The pref service needs to be initialized before they are called.  The reasoning for just initing that component first is because any of the other components which will run code on profile-do-change may be the cause of the startup crashes. We want to be able to detect these crashes so I can't just move the calls to the added nsAppStartup methods after the profile-do-change call or we wouldn't be able to detect crashes that happen before that point.  

The goal is to detect profile-related startup crashes (until session restore starts).  This is why I'm trying to start crash tracking as early as possible which is as soon as the pref service is ready.  Using a file rather than prefs led to startup perf. regressions mentioned earlier.

Non-goals:
* application crashes not related to the profile - since safe mode is less likely to help in this case
* crashes after we start session restore - about:sessionrestore already helps recovery from these situations
(In reply to Matthew N. [:MattN] from comment #27)
> (In reply to Blair McBride (:Unfocused) from comment #6)
> 
> > Would like to use this for detecting when a potentially incompatible addon
> > is crashing the application. See:
> > https://wiki.mozilla.org/Features/Add-ons/Add-ons_Default_to_Compatible
> > 
> > However, that can't rely on anything in /browser/. Could the detection code
> > and related prefs be moved into /toolkit/ somewhere?
> 
> (In reply to Dave Townsend (:Mossop) from comment #26)
> > ::: modules/libpref/src/init/all.js
> > > +pref("toolkit.startup.max_resumed_crashes", 2);
> > 
> > As I said before, I think this should go into firefox.js
> 
> I'm trying to accommodate comment 6 too so does that mean I can leave the
> prefix "toolkit.startup." but have it located in firefox.js?  Is the idea
> that each app should opt-in for this feature rather than having this as the
> default for toolkit consumers?

Yep, exactly that.
> I'm possibly going to be adding another consumer for the toolkit profile
> service in the near term though but that's not related to this bug.

Please talk to me directly before doing anything else with this.

> > It appears that you just want access to the current profile lock, right? If
> > so, please just expose the current profile lock on nsIXULRuntime, or even
> > just expose the specific data you care about directly on nsIXULRuntime and
> > don't mess with the profile lock object (which is fragile).
> 
> This is to address the last point in comment 23. I already have access to

I want to override Mossop here and say that it would be better to expose this on nsIXULRuntime or nsIAppStartup rather than trying to expose the profile lock object. As long as the profile service isn't involved we're ok ;-)

> What do you mean to not "mess with the profile lock object"?  What kinds of
> changes don't you like.  I understand that breaking profile locks would be a
> big problem so I definitely trying to avoid that.  Is the way I'm getting
> the data in the current patch OK?

The current patch exposes something on the toolkit profile service, which is not necessary or acceptable.

> I need to be able to read prefs in nsAppStartup::TrackStartupCrashBegin and
> AutomaticSafeModeNecessary in order to determine if we've exceeded the crash
> threshold and then go into safe mode.  The pref service needs to be
> initialized before they are called.  The reasoning for just initing that
> component first is because any of the other components which will run code
> on profile-do-change may be the cause of the startup crashes. We want to be
> able to detect these crashes so I can't just move the calls to the added
> nsAppStartup methods after the profile-do-change call or we wouldn't be able
> to detect crashes that happen before that point.

What kinds of startup crashes will be affected? I'm pretty sure that no addons are loaded during profile-do-change (if I'm remember and reading the code correctly, they are only loaded during the addons-startup notification, which immediately follows this). So are you worried about crashes that happen in Mozilla's own code during the profile-do-change notification? I'm not completely against this change if there is a good example of a crash that we would be catching with it, but initialization ordering has traditionally been risky and I'd rather be conservative about this and do this work after profile-do-change if we can get away with it. And it seems better at that point to just invent a custom notification for the prefservice and have it stop registering for profile-do-change altogether.

Do we know what exactly caused such a massive startup regression? Could we instead of using prefs use a file we already read at startup like compatibility.ini? Prefs are a pretty heavyweight solution for non-preference data (in general).
Assignee: mnoorenberghe+bmo → nobody
Assignee: nobody → mnoorenberghe+bmo
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #32)
> > I'm possibly going to be adding another consumer for the toolkit profile
> > service in the near term though but that's not related to this bug.
> 
> Please talk to me directly before doing anything else with this.

OK, it's bug 273874 that I was going to use the service to list the possible profiles to migrate from. I'll get in contact with you about that though.

> > > It appears that you just want access to the current profile lock, right? If
> > > so, please just expose the current profile lock on nsIXULRuntime, or even
> > > just expose the specific data you care about directly on nsIXULRuntime and
> > > don't mess with the profile lock object (which is fragile).
> > 
> > This is to address the last point in comment 23. I already have access to
> 
> I want to override Mossop here and say that it would be better to expose
> this on nsIXULRuntime or nsIAppStartup rather than trying to expose the
> profile lock object. As long as the profile service isn't involved we're ok
> ;-)

OK, I discussed this review comment on IRC with Mossop last week and we were going to keep replacedLockTime on nsIAppStartup and leave nsAppRunner.cpp setting it on nsIAppStartup which I think aligns with what you're saying. Are you suggesting that nsProfileLock would get a reference to nsIAppStartup and set the replacedLockTime directly?  I tried this but do_GetService failed which I assume was because it was too early in startup?

> > What do you mean to not "mess with the profile lock object"?  What kinds of
> > changes don't you like.  I understand that breaking profile locks would be a
> > big problem so I definitely trying to avoid that.  Is the way I'm getting
> > the data in the current patch OK?
> 
> The current patch exposes something on the toolkit profile service, which is
> not necessary or acceptable.

I am adding an attribute to nsToolkitProfileLock which is in nsToolkitProfileService.cpp (not on the service itself) but from what I can tell from the patch on bug 214675, this new attribute could easily migrate to the new ProfileLocker class in xre/nsToolkitProfileService.h when that lands.

> > I need to be able to read prefs in nsAppStartup::TrackStartupCrashBegin and
> > AutomaticSafeModeNecessary in order to determine if we've exceeded the crash
> > threshold and then go into safe mode.  The pref service needs to be
> > initialized before they are called.  The reasoning for just initing that
> > component first is because any of the other components which will run code
> > on profile-do-change may be the cause of the startup crashes. We want to be
> > able to detect these crashes so I can't just move the calls to the added
> > nsAppStartup methods after the profile-do-change call or we wouldn't be able
> > to detect crashes that happen before that point.
> 
> What kinds of startup crashes will be affected? I'm pretty sure that no
> addons are loaded during profile-do-change (if I'm remember and reading the
> code correctly, they are only loaded during the addons-startup notification,
> which immediately follows this). So are you worried about crashes that
> happen in Mozilla's own code during the profile-do-change notification? I'm
> not completely against this change if there is a good example of a crash
> that we would be catching with it, but initialization ordering has
> traditionally been risky and I'd rather be conservative about this and do
> this work after profile-do-change if we can get away with it. And it seems
> better at that point to just invent a custom notification for the
> prefservice and have it stop registering for profile-do-change altogether.

Yes, the goal is to catch more than just add-on related crashes so we'd catch crashes from all of the other components registered for profile-do-change.  From what I can tell there is currently no guaranteed ordering for observers to be notified of profile-do-change so if there is a problem with initialization ordering then we should just fix the specific case.

> Do we know what exactly caused such a massive startup regression? Could we
> instead of using prefs use a file we already read at startup like
> compatibility.ini? Prefs are a pretty heavyweight solution for
> non-preference data (in general).

Any kind of disk I/O added to startup was noticeable. I don't know exact causes but I tried different approaches (touching + checking modification of a new file and setting a pref at the beginning of startup then flushing them) on try server and they all increased startup time. The problem is that I need to be able to persist state that indicates whether the last startup was a crash so it's reading and writing. Also, startup.max_resumed_crashes makes sense as a pref and this needs to be read at startup so the pref service would need to be initialized for it anyways.

I considered using compatibility.ini before but it would require refactoring WriteVersion and associated functions into another class so that I could call it after a successful startup to write the new success timestamp.  Writing this file on every successful startup would probably be a performance hit since we don't currently do that and even touching a file was a perf. hit.  This is something we get for free using prefs. Also, at the end of startup where I would call WriteVersion, I may not have access to all of the required arguments so I may have to restructure the code to get access to them or parse the INI first before I write a new version with a modification.

I'd like to get clear direction on how to proceed. I'd be willing to have a meeting with the two of you if that helps move this forward.
bsmedberg, can you take a look at my reply in comment 33 and let me know how I can proceed?  This is part of a P1 feature and I'd like to get this part finished.
Dammit, I wrote a comment which apparently disappeared at some point. Let me try to reconstruct it...

> OK, I discussed this review comment on IRC with Mossop last week and we were
> going to keep replacedLockTime on nsIAppStartup and leave nsAppRunner.cpp
> setting it on nsIAppStartup which I think aligns with what you're saying.
> Are you suggesting that nsProfileLock would get a reference to nsIAppStartup
> and set the replacedLockTime directly?  I tried this but do_GetService
> failed which I assume was because it was too early in startup?

I don't understand this paragraph at all. Why is there any "setting" at all, instead of just getting it on-demand from the profile lock object?

Also I don't think the current patch 

> Yes, the goal is to catch more than just add-on related crashes so we'd
> catch crashes from all of the other components registered for
> profile-do-change.  From what I can tell there is currently no guaranteed
> ordering for observers to be notified of profile-do-change so if there is a
> problem with initialization ordering then we should just fix the specific
> case.

Do you have any examples of crashes you'd be protecting against? Note that code has access to the profile "early" though ProfDS and ProfLDS, so there exists the *possibility* of crashing even before profile-do-change. If profile-do-change is really a common crash point, then just make the pref service explicitly aware of our startup order (instead of reusing profile-do-change but delivered early).

> > Do we know what exactly caused such a massive startup regression? Could we
> > instead of using prefs use a file we already read at startup like
> > compatibility.ini? Prefs are a pretty heavyweight solution for
> > non-preference data (in general).
> 
> Any kind of disk I/O added to startup was noticeable. I don't know exact
> causes but I tried different approaches (touching + checking modification of
> a new file and setting a pref at the beginning of startup then flushing
> them) on try server and they all increased startup time. The problem is that
> I need to be able to persist state that indicates whether the last startup
> was a crash so it's reading and writing. Also, startup.max_resumed_crashes
> makes sense as a pref and this needs to be read at startup so the pref
> service would need to be initialized for it anyways.

Are you worried about catching crashes which are not caught by our crash reporting? If not, we can just move all the disk writing activity to the crash handler itself. If we wrote a special tag to compatibility.ini, 

> was a perf. hit.  This is something we get for free using prefs. Also, at

Prefs are not written until idle (or shutdown), and I expect that in typical usage prefs are often not written at all (because there aren't changed prefs). So they aren't free, but it's true that we already are reading them in the critical path.

But won't the fact that prefs are written on a significant delay disrupt your system, if there is a crash in the 5 minutes after browsers start? Does that matter?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #35)
> Dammit, I wrote a comment which apparently disappeared at some point. Let me
> try to reconstruct it...
> 
> > OK, I discussed this review comment on IRC with Mossop last week and we were
> > going to keep replacedLockTime on nsIAppStartup and leave nsAppRunner.cpp
> > setting it on nsIAppStartup which I think aligns with what you're saying.
> > Are you suggesting that nsProfileLock would get a reference to nsIAppStartup
> > and set the replacedLockTime directly?  I tried this but do_GetService
> > failed which I assume was because it was too early in startup?
> 
> I don't understand this paragraph at all. Why is there any "setting" at all,
> instead of just getting it on-demand from the profile lock object?

Well, when I need access to it, I wouldn't have any way to get it since there is currently no access to the profile lock object inside the application that is started up by the XRE (from what I can tell) and you said in comment 32 that you didn't want me to expose the profile lock object.

(Quote from Benjamin Smedberg  [:bsmedberg] comment #32)
> ... it would be better to expose
> this on nsIXULRuntime or nsIAppStartup rather than trying to expose the
> profile lock object. As long as the profile service isn't involved we're ok
> ;-)
 
> Also I don't think the current patch 

looks like you didn't finish this thought.

> > Yes, the goal is to catch more than just add-on related crashes so we'd
> > catch crashes from all of the other components registered for
> > profile-do-change.  From what I can tell there is currently no guaranteed
> > ordering for observers to be notified of profile-do-change so if there is a
> > problem with initialization ordering then we should just fix the specific
> > case.
> 
> Do you have any examples of crashes you'd be protecting against? Note that
> code has access to the profile "early" though ProfDS and ProfLDS, so there
> exists the *possibility* of crashing even before profile-do-change. If
> profile-do-change is really a common crash point, then just make the pref
> service explicitly aware of our startup order (instead of reusing
> profile-do-change but delivered early).

Unfortunately I wasn't given specific examples of crashes since this feature request is coming from support where the focus is on getting user's back up and running with Firefox ASAP and not necessarily on diagnosing the crashes reported.  I would assume that the recent Roboform startup crash would have been desirable to detect.  

I don't have data to say that profile-do-change is a really common crash point but I do know that a large portion of extensions observe it and therefore could run code that would cause a crash. Changing the pref service initialization works for me.

> > > Do we know what exactly caused such a massive startup regression? Could we
> > > instead of using prefs use a file we already read at startup like
> > > compatibility.ini? Prefs are a pretty heavyweight solution for
> > > non-preference data (in general).
> > 
> > Any kind of disk I/O added to startup was noticeable. I don't know exact
> > causes but I tried different approaches (touching + checking modification of
> > a new file and setting a pref at the beginning of startup then flushing
> > them) on try server and they all increased startup time. The problem is that
> > I need to be able to persist state that indicates whether the last startup
> > was a crash so it's reading and writing. Also, startup.max_resumed_crashes
> > makes sense as a pref and this needs to be read at startup so the pref
> > service would need to be initialized for it anyways.
> 
> Are you worried about catching crashes which are not caught by our crash
> reporting? If not, we can just move all the disk writing activity to the
> crash handler itself. If we wrote a special tag to compatibility.ini, 

Yes, I'd like to catch crashes not caught by the crash reporter since they both have the same effect on the user but I wouldn't say it is a hard requirement.

> > was a perf. hit.  This is something we get for free using prefs. Also, at
> 
> Prefs are not written until idle (or shutdown), and I expect that in typical
> usage prefs are often not written at all (because there aren't changed
> prefs). So they aren't free, but it's true that we already are reading them
> in the critical path.

Is bug 161711 a dupe of a resolved bug.  I can't find the code that write prefs on idle.

> But won't the fact that prefs are written on a significant delay disrupt
> your system, if there is a crash in the 5 minutes after browsers start? Does
> that matter?

Yeah, I've been thinking about this and it probably makes sense to save the prefs a minute or two after session restore starts thus expanding the time duration for "startup" in this context.  Dolske and I also discussed the possibility of writing prefs asynchronously so that it doesn't block the main thread which may minimize the perf impact.
> Well, when I need access to it, I wouldn't have any way to get it since
> there is currently no access to the profile lock object inside the

nsAppRunner.cpp has the profile lock object, and since nsXULAppInfo is implemented inside nsAppRunner.cpp it also has access. So rather than "setting" anything, just expose it on nsIXULRuntime.


> I don't have data to say that profile-do-change is a really common crash
> point but I do know that a large portion of extensions observe it and
> therefore could run code that would cause a crash. Changing the pref service
> initialization works for me.

http://hg.mozilla.org/mozilla-central/annotate/9a230265bad5/toolkit/xre/nsXREDirProvider.cpp#l746

Extensions are not loaded until after profile-do-change. AFAIK any profile-do-change observers are vestigial from before Firefox 4 when we changed component registration. So I really don't think that there is the possibility of extensions causing crashes until line 746, or at the earliest line 741.

> > Prefs are not written until idle (or shutdown), and I expect that in typical
> > usage prefs are often not written at all (because there aren't changed
> > prefs). So they aren't free, but it's true that we already are reading them
> > in the critical path.
> 
> Is bug 161711 a dupe of a resolved bug.  I can't find the code that write
> prefs on idle.

Ah, in this case we don't write prefs until shutdown, and in cases where we crash we just don't write the pref file at all, at least according to my grepping in Preferences.cpp. This makes using prefs even more fragile for this use-case, right?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #37)
> > Well, when I need access to it, I wouldn't have any way to get it since
> > there is currently no access to the profile lock object inside the
> 
> nsAppRunner.cpp has the profile lock object, and since nsXULAppInfo is
> implemented inside nsAppRunner.cpp it also has access. So rather than
> "setting" anything, just expose it on nsIXULRuntime.

Ah, I see now. Thanks for clarifying.

> > I don't have data to say that profile-do-change is a really common crash
> > point but I do know that a large portion of extensions observe it and
> > therefore could run code that would cause a crash. Changing the pref service
> > initialization works for me.
> 
> http://hg.mozilla.org/mozilla-central/annotate/9a230265bad5/toolkit/xre/
> nsXREDirProvider.cpp#l746
> 
> Extensions are not loaded until after profile-do-change. AFAIK any
> profile-do-change observers are vestigial from before Firefox 4 when we
> changed component registration. So I really don't think that there is the
> possibility of extensions causing crashes until line 746, or at the earliest
> line 741.

OK, I got the impression from [1] that extensions could receive the notification since it doesn't say otherwise like the prior two notifications.
 
> > > Prefs are not written until idle (or shutdown), and I expect that in typical
> > > usage prefs are often not written at all (because there aren't changed
> > > prefs). So they aren't free, but it's true that we already are reading them
> > > in the critical path.
> > 
> > Is bug 161711 a dupe of a resolved bug.  I can't find the code that write
> > prefs on idle.
> 
> Ah, in this case we don't write prefs until shutdown, and in cases where we
> crash we just don't write the pref file at all, at least according to my
> grepping in Preferences.cpp. This makes using prefs even more fragile for
> this use-case, right?

Well, it's just a matter of flushing the prefs to disk by calling SavePrefFile when we want the startup crash tracking to end.  If I have to make it later to avoid disk I/O during startup then that's still acceptable IMO.

I'll attach a new patch addressing your review shortly. Thanks for the timely responses.

[1] https://developer.mozilla.org/en/Observer_Notifications#Application_startup
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #37)
> > Well, when I need access to it, I wouldn't have any way to get it since
> > there is currently no access to the profile lock object inside the
> 
> nsAppRunner.cpp has the profile lock object, and since nsXULAppInfo is
> implemented inside nsAppRunner.cpp it also has access. So rather than
> "setting" anything, just expose it on nsIXULRuntime.

I made a getter for the replacedLockTime on nsIXULRuntime which just gets the value from the profile lock object which I moved to be static.  This works but leaks in xpcshell tests and gives a warning otherwise: "WARNING: XPCOM objects created/destroyed from static ctor/dtor".

Could you give me more details on how to do this without the leak or warning?
Attachment #576666 - Attachment is obsolete: true
Attachment #589693 - Flags: feedback?(benjamin)
Comment on attachment 589693 [details] [diff] [review]
v.5 For review of usage of the profile lock in nsAppRunner

The completely static comptr is dangerous because you never clear it. The scoping of `profileLock` in XRE_main is actually quite important, because the timing of the destructor (at http://hg.mozilla.org/mozilla-central/annotate/f76b576a9e28/toolkit/xre/nsAppRunner.cpp#l3604) guarantees that we don't leak the object and that we don't leave the profile locked when we restart the app.

I would suggest keeping the nsCOMPtr scoped as it currently is and just set a `static nsIProfileLock* gProfileLock;` to it. Also, please null-check gProfileLock before you use it, because there are cases (such as the profile selectiohn dialog and some embeddings) where there is no profile and your calls to nsXULAppInfo::GetReplacedLockTime would just crash.
Attachment #589693 - Flags: feedback?(benjamin)
Depends on: 719654
Attached patch v.6 addressing review comments (obsolete) — Splinter Review
* This patch applies upon a backout of 5ba5e09d2dbe which flattened the startup/tests directory

* This doesn't detect startup crashes if the profile is using symlinks (rather than fcntl) as the primary locking mechanism.  This is because replacedLockTime requires that the existing lock is still on disk but not unlinking the symlink would cause profile locked messages.  A possible workaround is to change the symlink during an expected shutdown so that it is detected as stale on next startup.  My understanding is that only a small proportion of users will be relying on symlinks since the fallback is for old NFS servers that don't support fcntl.  I propose that crash detection support for users falling back to symlinks be added in a follow-up (bug 719654).

I'm going to do some more manual testing now to see if there are edge cases not caught by the tests.
Attachment #589693 - Attachment is obsolete: true
Attachment #590052 - Flags: review?(benjamin)
Attached patch v.6.1 Fixes some problems in v.6 (obsolete) — Splinter Review
Attachment #590052 - Attachment is obsolete: true
Attachment #590128 - Flags: review?(benjamin)
Attachment #590052 - Flags: review?(benjamin)
Comment on attachment 590128 [details] [diff] [review]
v.6.1 Fixes some problems in v.6

I only reviewed the bootstrap/startup/prefs portion of these changes. They appear to be correct. I would explicitly null out gProfileLock here, though:

http://hg.mozilla.org/mozilla-central/annotate/758005504cab/toolkit/xre/nsAppRunner.cpp#l3570

Did this pass tryserver? there are some xpcshell tests which issue "fake" profile-do-change notifications, but it probably doesn't matter for this since they wouldn't be creating a prefs.js
Attachment #590128 - Flags: review?(benjamin) → review+
Comment on attachment 590128 [details] [diff] [review]
v.6.1 Fixes some problems in v.6

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #43)
> Comment on attachment 590128 [details] [diff] [review]
> v.6.1 Fixes some problems in v.6
> 
> I only reviewed the bootstrap/startup/prefs portion of these changes. They
> appear to be correct. I would explicitly null out gProfileLock here, though:
> 
> http://hg.mozilla.org/mozilla-central/annotate/758005504cab/toolkit/xre/
> nsAppRunner.cpp#l3570

Thanks, I'll do that.
 
> Did this pass tryserver? there are some xpcshell tests which issue "fake"
> profile-do-change notifications, but it probably doesn't matter for this
> since they wouldn't be creating a prefs.js

Yes, it did pass try server: https://tbpl.mozilla.org/?tree=Try&rev=5ad4f22898f1

Mossop, could you review the rest of the patch (with the addition of the nulling out)?
Attachment #590128 - Flags: review?(dtownsend)
Comment on attachment 590128 [details] [diff] [review]
v.6.1 Fixes some problems in v.6

Review of attachment 590128 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/aboutDialog.js
@@ +276,5 @@
>        if (cancelQuit.data)
>          return;
>  
> +      let appStartup = Cc["@mozilla.org/toolkit/app-startup;1"]
> +                         .getService(Ci.nsIAppStartup);

Cc and Ci don't seem to be defined in aboutDialog.js so I think this hunk is broken

::: browser/base/content/browser.js
@@ +1763,5 @@
> +  // End startup crash tracking after a delay to catch crashes while restoring
> +  // tabs and to postpone saving the pref to disk.
> +  try {
> +    let appStartup = Cc["@mozilla.org/toolkit/app-startup;1"]
> +                       .getService(Ci.nsIAppStartup);

There are a few different variants of styling this in browser.js I know, but both me and gavin prefer it as this going forwards:

let appStartup = Cc["@mozilla.org/toolkit/app-startup;1"].
                 getService(Ci.nsIAppStartup);

Can you update your patch throughout to match.

::: browser/locales/en-US/chrome/browser/safeMode.dtd
@@ +39,5 @@
>  <!ENTITY safeModeDialog.title         "&brandShortName; Safe Mode">
>  <!ENTITY window.width                 "37em">
>  
> +<!ENTITY autoSafeModeIntro.label      "&brandShortName; Closed Unexpectedly While Starting">
> +<!ENTITY autoSafeModeDescription.label "We sincerely apologize for the inconvenience. &brandShortName; is now running in Safe Mode, which temporarily disables your custom settings, themes, and extensions.">

We've had UX sign-off on these strings right?
Attachment #590128 - Flags: review?(dtownsend) → review-
(In reply to Dave Townsend (:Mossop) from comment #45)
> Comment on attachment 590128 [details] [diff] [review]
> v.6.1 Fixes some problems in v.6

I'll post a new patch ASAP.

> ::: browser/locales/en-US/chrome/browser/safeMode.dtd
> @@ +39,5 @@
> >  <!ENTITY safeModeDialog.title         "&brandShortName; Safe Mode">
> >  <!ENTITY window.width                 "37em">
> >  
> > +<!ENTITY autoSafeModeIntro.label      "&brandShortName; Closed Unexpectedly While Starting">
> > +<!ENTITY autoSafeModeDescription.label "We sincerely apologize for the inconvenience. &brandShortName; is now running in Safe Mode, which temporarily disables your custom settings, themes, and extensions.">
> 
> We've had UX sign-off on these strings right?

They came from faaborg's mockups in a related bug.  See "about:safemode" on the right of attachment 336775 [details].
browser/base/content/safeMode.js is consistent in using the other indentation/wrapping style.  Are you fine with this one line being inconsistent?

I actually thought general consensus was that the style I used was preferred since it's easier to read (less ambiguous as to whether the wrapped line is a new statement or whether it's, in fact, wrapped).
Attachment #590128 - Attachment is obsolete: true
Attachment #592966 - Flags: review?(dtownsend)
Comment on attachment 592966 [details] [diff] [review]
v.7 Fix Cc/Ci and wrapping style

Review of attachment 592966 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/safeMode.js
@@ +138,5 @@
>  }
>  
>  function onLoad() {
> +  var appStartup = Components.classes["@mozilla.org/toolkit/app-startup;1"].
> +                   getService(Components.interfaces.nsIAppStartup);

Yeah so when the file is actually consistent then consistency with the file wins, so make this match the example in onCancel, then this patch is ready to land I think.
Attachment #592966 - Flags: review?(dtownsend) → review+
Disable startup crash tracking in test profiles to avoid disruptions if there's consecutive crashes.  I just looked for wherever browser.sessionstore.resume_from_crash = false since it has a similar effect of causing different behaviour after crashes.
Attachment #594016 - Flags: review?(rcampbell)
Attachment #594016 - Flags: review?(jwalden+bmo)
By the way, both of these patches passed try: https://tbpl.mozilla.org/?tree=Try&rev=88c3d7771142
Blocks: 723802
Comment on attachment 594016 [details] [diff] [review]
v.1 Disable startup crash tracking in tests

Review of attachment 594016 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #594016 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 594016 [details] [diff] [review]
v.1 Disable startup crash tracking in tests

ok!
Attachment #594016 - Flags: review?(rcampbell) → review+
(In reply to Matthew N. [:MattN] from comment #53)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/21cc56ed5fe6

This also needed to have modified toolkit/toolkit-makefiles.sh
(Also, it re-introduced an #ifdef ENABLE_TESTS, which are being phased out in favour of |TEST_DIRS = foo|, see bug 702388)
(In reply to Ed Morley [:edmorley] from comment #55)
> This also needed to have modified toolkit/toolkit-makefiles.sh

Ed offered to take care of this in bug 725429.

(In reply to Ed Morley [:edmorley] from comment #56)
> (Also, it re-introduced an #ifdef ENABLE_TESTS, which are being phased out
> in favour of |TEST_DIRS = foo|, see bug 702388)

This will be fixed in bug 702388 now.
Depends on: 726043
Depends on: 726877
Depends on: 731613
Blocks: 732303
Version: unspecified → Trunk
Depends on: 734886
Depends on: 735573
Blocks: FF2SM
No longer blocks: 734883
Depends on: 745154
Depends on: 920083
Depends on: 1095799
Depends on: 1361262
You need to log in before you can comment on or make changes to this bug.