Closed Bug 1410270 Opened 2 years ago Closed 2 years ago

Run Places maintenance for certain Sync bookmark errors

Categories

(Firefox :: Sync, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
Firefox 61
Tracking Status
firefox61 --- fixed

People

(Reporter: Lina, Assigned: Lina)

References

Details

Attachments

(1 file)

Thom suggested that we can have Sync immediately run Places maintenance if it sees records with missing or invalid GUIDs, or tombstones for items that aren't deleted.
If immediately sounds problematic, maybe just "soon"?
(In reply to Thom Chiovoloni [:tcsc] from comment #1)
> If immediately sounds problematic, maybe just "soon"?

Makes sense. I was thinking we could even let the current sync fail, and run `PlacesDBUtils.maintenanceOnIdle` in the `catch` handler if we got a non-server error. If maintenance was able to fix the problem, the next bookmark sync should succeed.
Priority: -- → P3
See Also: → 1422672
Priority: P3 → P2
WDYT is the best way to record telemetry for this, Mark? I'm thinking two fields in the ping, or two separate Boolean histograms:

1. Did we run maintenance this sync? (This would be our denominator, the number of maintenance attempts).
2. If we ran maintenance last sync, did the current sync succeed?

I'm guessing we'll want a pref or some kind of way to keep track of (2)? At first, I was thinking we could just include (1) in the sync ping, and determine if maintenance succeeded when we build our dashboards...but IIUC finding the "next success" after a failure where we ran maintenance would require trawling through every sync ping, since we're looking for successes, and ~99% of our syncs succeed. So it seems like we'd be doing a lot of work to reconstruct info that the client already knows...but maybe not?
Flags: needinfo?(markh)
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3)

> finding the "next success" after a failure where we ran maintenance would
> require trawling through every sync ping, since we're looking for successes,
> and ~99% of our syncs succeed. So it seems like we'd be doing a lot of work
> to reconstruct info that the client already knows...

While I've had to do something similar in the past (find nearest row to a row with some property), it definitely makes the query messier and increases run time. So if having (2) is plausible without much work, it gets my vote.
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #3)
> WDYT is the best way to record telemetry for this, Mark? I'm thinking two
> fields in the ping, or two separate Boolean histograms:
> 
> 1. Did we run maintenance this sync? (This would be our denominator, the
> number of maintenance attempts).
> 2. If we ran maintenance last sync, did the current sync succeed?

As Leif says, let's go with 2 just to help with analysis.

> I'm guessing we'll want a pref or some kind of way to keep track of (2)?

I don't think it would hurt to miss some of these, so could we just keep a flag in memory and accept the fact that if the last sync in the session was the repair we just lose the ability to report how it went?
Flags: needinfo?(markh)
Comment on attachment 8967110 [details]
Bug 1410270 - Run Places maintenance for certain Sync bookmark errors.

https://reviewboard.mozilla.org/r/235760/#review241556

::: services/sync/modules/engines.js:841
(Diff revision 1)
>        throw new Error("engine does not implement _sync method");
>      }
>  
> -    return this._notify("sync", this.name, this._sync)();
> +    return this._notify("sync", this.name, async () => {
> +      await this._sync();
> +      return this;

This is really unfortunate, do you think there's a better way to pass success info to the `weave:engine:sync:finish` notification, or to pass along side-channel info like this in general?

I tried using `engineImpl` at first, but that complicates tests in the cases we unregister the built-in engine. We could also return a POJO like `{ succeededAfterMaintenance }`, but that seems icky, since it's just copying a property that's already on the engine, and doesn't immediately map to the fact that we return it so that it can also be passed to the observer...

::: services/sync/modules/engines/bookmarks.js:420
(Diff revision 1)
> +                        { elapsedSinceMaintenance });
> +        await PlacesDBUtils.maintenanceOnIdle();
> +        this._ranMaintenanceOnLastSync = true;
> +        throw new RecoverableError(
> +          "Bookmark sync failed; ran maintenance to recover",
> +          { originalError: ex, ranMaintenance: true });

I'm not really happy with this "re-wrap the original exception but also flag that we ran maintenance so we can tell telemetry" pattern, either. :-/
Assignee: nobody → kit
Status: NEW → ASSIGNED
Comment on attachment 8967110 [details]
Bug 1410270 - Run Places maintenance for certain Sync bookmark errors.

https://reviewboard.mozilla.org/r/235760/#review241620

I don't like the changes to the schema (around error), and I kind of feel like the right place to report this in telemetry is via event telemetry. I'd accept a non-event telemetry solution that doesn't add properties to the error types though, if you have a reason events wouldn't work for it.

::: services/sync/modules/engines.js:841
(Diff revision 1)
>        throw new Error("engine does not implement _sync method");
>      }
>  
> -    return this._notify("sync", this.name, this._sync)();
> +    return this._notify("sync", this.name, async () => {
> +      await this._sync();
> +      return this;

Yes, I agree that this is a bit crazy... Is there a reason we wouldn't want to use event telemetry here? (We might need more than one event type which isn't ideal, but might be better than what we have here).

::: services/sync/modules/engines/bookmarks.js:39
(Diff revision 1)
>    PlacesSyncUtils.bookmarks.DESCRIPTION_ANNO,
>    PlacesSyncUtils.bookmarks.SIDEBAR_ANNO, PlacesUtils.LMANNO_FEEDURI,
>    PlacesUtils.LMANNO_SITEURI,
>  ]);
>  
> +const PLACES_MAINTENANCE_INTERVAL = 1 * 60 * 60 * 1000; // 1 hour.

This seems a bit frequent to me... How heavyweight is places maintenance?

::: services/sync/modules/engines/bookmarks.js:416
(Diff revision 1)
> +        "places.database.lastMaintenance", 0) * 1000;
> +      if (elapsedSinceMaintenance >= PLACES_MAINTENANCE_INTERVAL) {
> +        this._log.error("Bookmark sync failed, ${elapsedSinceMaintenance}s " +
> +                        "elapsed since last run; running Places maintenance",
> +                        { elapsedSinceMaintenance });
> +        await PlacesDBUtils.maintenanceOnIdle();

Is there any risk to actually doing this in the middle of a sync instead of after one?

... I don't exactly know what I mean by 'risk', admittedly, so the answer is probably "no", but it seems possible that it could cause some confusion...

::: services/sync/tests/unit/sync_ping_schema.json:192
(Diff revision 1)
>      "otherError": {
>        "required": ["name"],
>        "properties": {
>          "name": { "enum": ["othererror"] },
> -        "error": { "type": "string" }
> +        "error": { "type": "string" },
> +        "ranMaintenance": { "type": "boolean" }

I'd *really* rather not add additional properties to errors, since the way the scala handles it is somewhat problematic, it just assumes there are two properties and whatever one isnt "name" holds the information about what error occured.

Essentially, Android and iOS both messed up on the names (since we have so many, e.g. `code`, `error`, etc...), which forced the scala to do it this way in order to be compatible without additional craziness. It does mean that adding additional properties isn't practical, though, at least not without some thought.
Attachment #8967110 - Flags: review?(tchiovoloni)
Comment on attachment 8967110 [details]
Bug 1410270 - Run Places maintenance for certain Sync bookmark errors.

https://reviewboard.mozilla.org/r/235760/#review241620

> This seems a bit frequent to me... How heavyweight is places maintenance?

`maintenanceOnIdle` is fairly lightweight, it runs at least once a day already...though the intent here is to avoid repeatedly running maintenance when it's unlikely to help, so bumping to 4 hours or so is reasonable.

> Is there any risk to actually doing this in the middle of a sync instead of after one?
> 
> ... I don't exactly know what I mean by 'risk', admittedly, so the answer is probably "no", but it seems possible that it could cause some confusion...

I don't think so...do you mean in one of the storage methods?

> I'd *really* rather not add additional properties to errors, since the way the scala handles it is somewhat problematic, it just assumes there are two properties and whatever one isnt "name" holds the information about what error occured.
> 
> Essentially, Android and iOS both messed up on the names (since we have so many, e.g. `code`, `error`, etc...), which forced the scala to do it this way in order to be compatible without additional craziness. It does mean that adding additional properties isn't practical, though, at least not without some thought.

Ouch. Thanks for the clarification, this is very good to know! Changed to use event telemetry, and two events is still much simpler than rigging the Sync ping to support this.
Comment on attachment 8967110 [details]
Bug 1410270 - Run Places maintenance for certain Sync bookmark errors.

https://reviewboard.mozilla.org/r/235760/#review241662

This looks great, thanks.
Attachment #8967110 - Flags: review?(tchiovoloni) → review+
Pushed by kcambridge@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/57093565dccc
Run Places maintenance for certain Sync bookmark errors. r=tcsc
https://hg.mozilla.org/mozilla-central/rev/57093565dccc
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 61
You need to log in before you can comment on or make changes to this bug.