Open Bug 1398930 Opened 7 years ago Updated 2 years ago

Migration of corrupt Chrome history taking 18+ hours

Categories

(Firefox :: Migration, enhancement, P3)

Tracking

Tracking Status
firefox57 - ---

People

(Reporter: bugzilla, Unassigned)

[Tracking Requested - why for this release]: Given the (hopeful) expectation that we will pick up users from Chrome when we release 57, we probably should be ensuring that we do not provide such a poor migration experience to users.


As reported on Reddit:
https://www.reddit.com/r/firefox/comments/6yz2t4/is_firefox_supposed_to_take_18_hours_to_import/

It looks like the Chrome profile was corrupt.

Yes, Chrome profile corruption is not our fault, but our reaction to such a state is. We should figure out a solution so that we don't just sit there for hours, since we are likely to get the blame for this.

Even a timeout would be better than nothing.
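To illustrate the timeout idea, here is a minimal Python sketch, not Firefox's migrator code. It assumes the Chrome history is a SQLite file with a `urls` table containing `url`, `title` and `visit_count` columns; the function name `read_urls_with_deadline` and its parameters are hypothetical. It uses SQLite's progress handler to abort a read that runs past a deadline.

```python
import sqlite3
import time


def read_urls_with_deadline(db_path, timeout_s):
    """Read rows from a Chrome-style `urls` table, giving up after timeout_s seconds.

    Returns the rows, or None if the read was aborted or failed.
    """
    conn = sqlite3.connect(db_path)
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 virtual-machine instructions;
    # a truthy return value aborts the running query.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(
            "SELECT url, title, visit_count FROM urls"
        ).fetchall()
    except sqlite3.OperationalError:
        # Raised when the handler interrupts the query, and also when the
        # table is missing or cannot be read.
        return None
    finally:
        conn.close()
```

A caller could treat `None` as "skip history and carry on with the rest of the migration" rather than leaving the user waiting.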
Matt, is there something we can do to work around this sort of issue in the migration tool in the 57 timeframe?
Flags: needinfo?(MattN+bmo)
I replied asking for more info on Reddit.

It sounds like everything worked fine when the history database was recreated by Chrome sync so it doesn't sound like an issue of a schema change in Chrome.

I don't think we rolled out auto-migration to release (I asked what channel the user was on but didn't hear back) but did wonder if we would have problems if manual migration was run at the same time as auto-migration but there is no evidence to point to that in this case.
Summary: Migration of corrupt Chrome profile taking 18+ hours → Migration of corrupt Chrome history taking 18+ hours
Since this is part of onboarding I'm tagging it for the onboarding team.
Whiteboard: [photon-onboarding] [triage]
(In reply to Matthew N. [:MattN] (huge backlog; PM if requests are blocking you) from comment #2)
> I replied asking for more info on Reddit.
> 
> It sounds like everything worked fine when the history database was
> recreated by Chrome sync so it doesn't sound like an issue of a schema
> change in Chrome.
> 
> I don't think we rolled out auto-migration to release (I asked what channel
> the user was on but didn't hear back) but did wonder if we would have
> problems if manual migration was run at the same time as auto-migration but
> there is no evidence to point to that in this case.
Yes, this should not be the case.
The auto-migration was disabled by bug 1381714 for all channels, and that was done before the manual migration, so the two shouldn't run at the same time.
Whiteboard: [photon-onboarding] [triage] → [photon-structure] [triage]
Were we able to get STR for this?
Still no reply on Reddit.
Hmm… bug 1340115 got uplifted to Fx53 and bug 1357448 is filed to lower the limit. Looking at the graphs, it seems like most Chrome history migrations happen within seconds.

I think adding a timeout wouldn't be bad for the FTU migration (which doesn't seem to be what the Reddit user was doing) but really I think it'd be better to fix the underlying issue. We'll see if Photon teams have time to do anything here.
Flags: needinfo?(MattN+bmo)
Priority: -- → P3
See Also: → 1357448
(In reply to Matthew N. [:MattN] (huge backlog; PM if requests are blocking you) from comment #7)
> Hmm… bug 1340115 got uplifted to Fx53 and bug 1357448 is filed to lower the
> limit. Looking at the graphs, it seems like most Chrome history migrations
> happen within seconds.
> 
> I think adding a timeout wouldn't be bad for the FTU migration (which
> doesn't seem to be what the Reddit user was doing) but really I think it'd
> be better to fix the underlying issue.

From what I recall from previous investigations (see deps of bug 1332225, esp. bug 1341097), making history import fast is non-trivial, especially once the browser is up, because so much UI hooks into history and everything goes through a multithreaded bit of C++ that still does individual inserts. I don't think we have much chance of doing anything to that code in time for 57.

If we think it's important enough that we do something, we could try to write an alternative "mass history import" thing in History.jsm that omits update notifications and then tells consumers at the end to just throw away their cached information. This is more feasible for 57 than it was previously because of the add-on compat side, but I believe that there are webextension history APIs, so we might still struggle there, I'm not sure.
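As a rough illustration of that idea only, here is a toy Python sketch. It is not History.jsm or the Places API; `HistoryStore`, `insert`, `bulk_import` and the observer lists are all hypothetical names. It contrasts a normal insert path that notifies observers per visit with an import path that notifies once at the end.

```python
from typing import Callable, Iterable, List, Tuple

# (url, visit time): a simplified stand-in for a history entry.
Visit = Tuple[str, int]


class HistoryStore:
    """Toy in-memory store used only to illustrate the notification pattern."""

    def __init__(self) -> None:
        self.visits: List[Visit] = []
        # Observers called once per inserted visit (normal browsing path).
        self.on_visit: List[Callable[[Visit], None]] = []
        # Observers told to throw away whatever they have cached.
        self.on_invalidate: List[Callable[[], None]] = []

    def insert(self, visit: Visit) -> None:
        # Normal path: every insert notifies every per-visit observer.
        self.visits.append(visit)
        for callback in self.on_visit:
            callback(visit)

    def bulk_import(self, visits: Iterable[Visit]) -> int:
        # Import path: no per-visit notifications, one invalidation at the end.
        before = len(self.visits)
        self.visits.extend(visits)
        for callback in self.on_invalidate:
            callback()
        return len(self.visits) - before
```

The trade-off is that every consumer has to handle the "drop your caches" signal, which is where the webextension history APIs mentioned above could complicate things.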

That's in addition to the time needed to actually read the Chrome history information, which is significantly longer if Chrome has the database file open and we have to apply hacks to get the information in the first place.

> We'll see if Photon teams have time
> to do anything here.

From what I know, most of us are still pretty busy, so either we drop other work or someone else needs to pick this up.


To be clear, 18h is completely unacceptable, but I also don't understand how that would ever happen, so without solid STR it's hard to know what to do there. It's possible that the data someone (else, I think?) left in bug 1332225 would help here, but I haven't had time to look.
(In reply to :Gijs from comment #8)
> To be clear, 18h is completely unacceptable, but I also don't understand how
> that would ever happen, so without solid STR it's hard to know what to do
> there. It's possible that the data someone (else, I think?) left in bug
> 1332225 would help here, but I haven't had time to look.

Note that here there's evidence of a corrupt database, so the issue doesn't look like a normal history import, but rather that we fail to detect some specific kind of history corruption. While we do have a plan for the 58 cycle to investigate the history views perf, that may not help at all if it's a bad kind of corruption.
The fact that Chrome could rebuild the DB from their sync tells us nothing useful; of course you can rebuild a DB from fresh data.

Having one of the databases that cause the issue would be of great help. Bug 1367023, which I've been asking for for some time, may also help, as it's meant exactly to detect problems with queries in the wild.

As a quick workaround, we could issue a PRAGMA quick_check(1) on the Chrome database, and start importing only if it returns OK, otherwise give up. It's not a complete check like PRAGMA integrity_check(), but it may exclude some obvious cases of corruption without delaying the process too much. It should be a simple "fix" to implement.
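A minimal Python sketch of that pre-flight check, using the standard `sqlite3` module rather than Firefox's storage code. The function name `chrome_history_looks_intact` is hypothetical, and the caller is assumed to pass the path to a copy of Chrome's history file.

```python
import sqlite3
from pathlib import Path


def chrome_history_looks_intact(db_path: str) -> bool:
    """Return True only if PRAGMA quick_check(1) reports a single "ok" row."""
    # Build a file: URI so the database can be opened read-only; the check
    # then never creates or modifies the source file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # missing or unopenable file
    try:
        # The argument 1 stops the check after the first reported problem.
        rows = conn.execute("PRAGMA quick_check(1)").fetchall()
    except sqlite3.DatabaseError:
        return False  # e.g. the file is not a SQLite database at all
    finally:
        conn.close()
    return rows == [("ok",)]
```

An importer could call this first and skip history import when it returns `False`. As noted above, passing the check does not prove the database is free of corruption.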
Needinfo dolske to consider who can/should own this. I'm afraid I don't understand why the onboarding team marked this for structure triage, as it does not actually have anything to do with the photon structure project.
Flags: needinfo?(dolske)
Another thing in the Reddit thread is "firefox takes up around 333-268 MB of memory when importing (i think around 60% of my memory)". Does that mean this is a 512MB system? That would swap like crazy.
Do we have telemetry on this? (average time to import Chrome data?) This would help to know if this is a common issue or just rare.
(In reply to Sylvestre Ledru [:sylvestre] from comment #12)
> Do we have telemetry on this? (average time to import Chrome data?) This
> would help to know if this is a common issue or just rare.

We do and it's been mentioned in comment 7. See bug 1357448.

I think the problem is that, like mak says, it sounds like it's never finishing and the user cancels, so we probably aren't recording the time in this case. So while we have times for successful completions, we don't have data on how long the dialog stays open overall.

Someone may be able to write a complex query to figure out a success rate: the count of FX_MIGRATION_HISTORY_IMPORT_MS submissions divided by the history counts in FX_MIGRATION_USAGE, restricted to profiles which clearly aren't on their first run. That would probably take many hours of work to get right, so it depends on how badly we want this data.
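The ratio itself is simple once the two counts exist; the hard part is the telemetry query that produces them. A small Python sketch of the arithmetic, with hypothetical function and parameter names:

```python
def migration_success_rate(completed: int, started: int) -> float:
    """Fraction of started history imports that recorded a completion time.

    completed: count of FX_MIGRATION_HISTORY_IMPORT_MS submissions.
    started:   count of history imports recorded in FX_MIGRATION_USAGE.
    Both are assumed to be pre-filtered to non-first-run profiles.
    """
    if started <= 0:
        raise ValueError("started must be a positive count")
    return completed / started


def suspected_failure_rate(completed: int, started: int) -> float:
    """Share of imports that started but never reported completion."""
    return 1.0 - migration_success_rate(completed, started)
```

For example, with made-up counts of 990 completions out of 1000 starts, the success rate would be 0.99.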
Whiteboard: [photon-structure] [triage]
Ok, I talked with MattN about this. Nutshell takeaway is that with a refined analysis he did, it looks like ~0.33% of Chrome users are possibly hitting an issue (more specifically, are having something happen where the number of migrations completed < migrations started). In absolute numbers, this works out to ~200 users over the month that 55.0.3 has been out.

While there are some caveats in the details of the data, we both agree that this is unlikely to be a major issue that we need to jump on for 57. That seems a fair conclusion, even given the increased sensitivity we have here, in that prior experience says users just don't report major problems with this stuff, so a lack of reports != proof it isn't happening.

It would be great if the onboarding team could revisit this as part of our work to improve onboarding, either to attempt fixing the problem, or to add telemetry to get a more direct answer to how much this is happening.
Flags: needinfo?(dolske)
(In reply to Justin Dolske [:Dolske] from comment #15)
> Ok, I talked with MattN about this. Nutshell takeaway is that with a refined
> analysis he did

https://gist.github.com/mnoorenberghe/251257ca8236e1c569b906dd34424f02
Not tracking for 57 based on comment 15's analysis.
Severity: normal → S3