Closed Bug 1527945 Opened 6 years ago Closed 6 years ago

High error rate in the Tippytop collection

Categories

(Firefox :: Remote Settings Client, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 67
Tracking Status
firefox67 --- fixed

People

(Reporter: leplatrem, Assigned: peterbe)

References

Details

(Keywords: github-merged)

Attachments

(6 files)

Using Peter's script to monitor RS health [0], we noticed that Tippytop had a high error rate on synchronization (~15%).

These are the status obtained via the Uptake Telemetry [1] helper:

main/onboarding                          (good:627,994 bad:21)                          0.00%
main/sites-classification                (good:628,048 bad:41)                          0.01%
main/tippytop                            (good:597,947 bad:110,066)                    15.55%
pinning/pins                             (good:657,038 bad:2,560)                       0.39%
settings-changes-monitoring              (good:1,402,779 bad:39,606)                    2.75%

2 settings have a bad ratio over threshold.
main/tippytop                            (network_error:103,977 sync_error:5,917 sign_retry_error:113 unknown_error:56 sign_error:3)
settings-changes-monitoring              (network_error:39,356 server_error:204 unknown_error:46)

The data hasn't changed recently. But it is pulled entirely (no local JSON dump) when RemoteSettings.get() is called.

I suppose that this happens when a new window/tab is opened.

Is it because their local database has issues (IndexedDB™), and Remote Settings keeps on trying to sync without success?

Is it because the browser is never left open long enough for the sync to finish?

We see huge amount of network and server errors compared with other collections. Why?

Is it because users switch off the network while it sync? Is it because Tippytop tries to pull data too soon on new profiles/instances/windows/tab?

We should investigate what is going on, and see if we can come up with solutions!

[0] https://github.com/mozilla/remote-settings-uptake-health
[1] https://firefox-source-docs.mozilla.org/main/latest/toolkit/components/telemetry/telemetry/collection/uptake.html

which release channel is it for this metric?

FWIW, there was a slight drop (about 0.2% - 0.5%) on the tippytop icon coverage in Activity-Stream in release. https://sql.telemetry.mozilla.org/queries/59956#154948

Unfortunately, since RemoteSettings.get() hides all the errors behind the library users, we are unable to capture the error(s) and report the telemetry in Activity Stream.

It's a bit concerning to see that this issue only happened to tippytop, as we had decreased the collection size from 1000 to 200 a while ago.

which release channel is it for this metric?

I would say that it contains all channels. Peter, can you confirm?

Unfortunately, since RemoteSettings.get() hides all the errors behind the library users, we are unable to capture the error(s) and report the telemetry in Activity Stream.

Originally, a call to .get() had no reason to fail, since it was just reading the local data. In Bug 1509066 we made some changes to start a sync when .get() is called on empty collections. We kept the original «signature» by hiding errors.
We could add options to disable this adhoc sync or throw on error if we identify that this become a problem.

BTW in your case, you know that if .get() returns an empty list, this means that something didn't work. And we report the errors internally as we can see in this metric.

It's a bit concerning to see that this issue only happened to tippytop, as we had decreased the collection size from 1000 to 200 a while ago.

Especially «network error»!

Nan, do you have information about the context (eg. user offline?) when RemoteSettings("tippytop").get() is called? Do you think we should disable the implicit sync on .get() for Tippytop and return an empty list until a scheduled sync happens?

Flags: needinfo?(peterbe)
Attached image Release
Flags: needinfo?(peterbe)
Attached image Beta
Attached image Dev edition
Attached image Nightly

My guess is as good as yours but it seems pretty bad across all channels.

(In reply to Mathieu Leplatre [:leplatrem] from comment #2)

Originally, a call to .get() had no reason to fail, since it was just reading the local data. In Bug 1509066 we made some changes to start a sync when .get() is called on empty collections. We kept the original «signature» by hiding errors.
We could add options to disable this adhoc sync or throw on error if we identify that this become a problem.

Yes, throwing an error sounds better than being silent here, as it allows Activity Stream to report it in the telemetry.

BTW in your case, you know that if .get() returns an empty list, this means that something didn't work. And we report the errors internally as we can see in this metric.

Hmm, but .get(key) could also return an empty list if the given "key" is missing in the collection, which is totally fine for tippytop. So we still can't tell the difference, correct?

Nan, do you have information about the context (eg. user offline?) when RemoteSettings("tippytop").get() is called? Do you think we should disable the implicit sync on .get() for Tippytop and return an empty list until a scheduled sync happens?

Unfortunately, we don't have that context either in AS. If the user gets offline, our telemetry would fly in blind as well. Agreed, the implicit sync at least sounds unnecessary for tippytop, as it's totally fine to have some data latency or even absence. Perhaps we can leave this implicit sync as an option for the RS users? They can decide whether or not to enable it based on their demand of data freshness. What do you think?

Depends on: 1528704

Quick note: the network error rate went down in the last days (8.85% compared to 14%)

Perhaps we can leave this implicit sync as an option for the RS users? They can decide whether or not to enable it based on their demand of data freshness. What do you think?

Yes, and with the patch in Bug 1528704, when .get() does not sync, there will be no swallowed errors.

Hmm, but .get(key) could also return an empty list if the given "key" is missing in the collection, which is totally fine for tippytop. So we still can't tell the difference, correct?

Well, what I meant is that with implicit sync in .get() you would have expected it to always return a non-empty array.

One thing to note, is that when I wake up my macbook pro, from hibernation, it's not unusual that I can get into Firefox faster than my macOS can re-connect to the WiFi.

If I'm able to start the browser and hit Cmd-T it's not unrealistic at all that the newtab is started without a network connection available.

It's beyond me if this would/should cause a "network error" because it might just take the socket request and if the Wifi isn't available, it'll automatically wait a few seconds just as if it would when a server is slow to respond.

Assignee: nobody → peterbe
Blocks: 1532321
Status: NEW → RESOLVED
Closed: 6 years ago
Keywords: github-merged
Resolution: --- → FIXED
Target Milestone: --- → Firefox 67

Error rate is dropping in Nightly :)

\o/ The patch is working.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: