High error rate in the Tippytop collection
Categories
(Firefox :: Remote Settings Client, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox67 | --- | fixed |
People
(Reporter: leplatrem, Assigned: peterbe)
References
Details
(Keywords: github-merged)
Attachments
(6 files)
Using Peter's script to monitor RS health [0], we noticed that Tippytop had a high error rate on synchronization (~15%).
These are the status obtained via the Uptake Telemetry [1] helper:
main/onboarding (good:627,994 bad:21) 0.00%
main/sites-classification (good:628,048 bad:41) 0.01%
main/tippytop (good:597,947 bad:110,066) 15.55%
pinning/pins (good:657,038 bad:2,560) 0.39%
settings-changes-monitoring (good:1,402,779 bad:39,606) 2.75%
2 settings have a bad ratio over threshold.
main/tippytop (network_error:103,977 sync_error:5,917 sign_retry_error:113 unknown_error:56 sign_error:3)
settings-changes-monitoring (network_error:39,356 server_error:204 unknown_error:46)
The data hasn't changed recently. But it is pulled entirely (no local JSON dump) when RemoteSettings.get()
is called.
I suppose that this happens when a new window/tab is opened.
Is it because their local database has issues (IndexedDB™), and Remote Settings keeps on trying to sync without success?
Is it because the browser is never left open long enough for the sync to finish?
We see huge amount of network and server errors compared with other collections. Why?
Is it because users switch off the network while it sync? Is it because Tippytop tries to pull data too soon on new profiles/instances/windows/tab?
We should investigate what is going on, and see if we can come up with solutions!
[0] https://github.com/mozilla/remote-settings-uptake-health
[1] https://firefox-source-docs.mozilla.org/main/latest/toolkit/components/telemetry/telemetry/collection/uptake.html
Comment 1•6 years ago
|
||
which release channel is it for this metric?
FWIW, there was a slight drop (about 0.2% - 0.5%) on the tippytop icon coverage in Activity-Stream in release. https://sql.telemetry.mozilla.org/queries/59956#154948
Unfortunately, since RemoteSettings.get()
hides all the errors behind the library users, we are unable to capture the error(s) and report the telemetry in Activity Stream.
It's a bit concerning to see that this issue only happened to tippytop, as we had decreased the collection size from 1000 to 200 a while ago.
Reporter | ||
Comment 2•6 years ago
|
||
which release channel is it for this metric?
I would say that it contains all channels. Peter, can you confirm?
Unfortunately, since RemoteSettings.get() hides all the errors behind the library users, we are unable to capture the error(s) and report the telemetry in Activity Stream.
Originally, a call to .get()
had no reason to fail, since it was just reading the local data. In Bug 1509066 we made some changes to start a sync when .get()
is called on empty collections. We kept the original «signature» by hiding errors.
We could add options to disable this adhoc sync or throw on error if we identify that this become a problem.
BTW in your case, you know that if .get()
returns an empty list, this means that something didn't work. And we report the errors internally as we can see in this metric.
It's a bit concerning to see that this issue only happened to tippytop, as we had decreased the collection size from 1000 to 200 a while ago.
Especially «network error»!
Nan, do you have information about the context (eg. user offline?) when RemoteSettings("tippytop").get()
is called? Do you think we should disable the implicit sync on .get()
for Tippytop and return an empty list until a scheduled sync happens?
Assignee | ||
Comment 3•6 years ago
|
||
Assignee | ||
Comment 4•6 years ago
|
||
Assignee | ||
Comment 5•6 years ago
|
||
Assignee | ||
Comment 6•6 years ago
|
||
Assignee | ||
Comment 7•6 years ago
|
||
My guess is as good as yours but it seems pretty bad across all channels.
Comment 8•6 years ago
•
|
||
(In reply to Mathieu Leplatre [:leplatrem] from comment #2)
Originally, a call to
.get()
had no reason to fail, since it was just reading the local data. In Bug 1509066 we made some changes to start a sync when.get()
is called on empty collections. We kept the original «signature» by hiding errors.
We could add options to disable this adhoc sync or throw on error if we identify that this become a problem.
Yes, throwing an error sounds better than being silent here, as it allows Activity Stream to report it in the telemetry.
BTW in your case, you know that if
.get()
returns an empty list, this means that something didn't work. And we report the errors internally as we can see in this metric.
Hmm, but .get(key)
could also return an empty list if the given "key" is missing in the collection, which is totally fine for tippytop. So we still can't tell the difference, correct?
Nan, do you have information about the context (eg. user offline?) when
RemoteSettings("tippytop").get()
is called? Do you think we should disable the implicit sync on.get()
for Tippytop and return an empty list until a scheduled sync happens?
Unfortunately, we don't have that context either in AS. If the user gets offline, our telemetry would fly in blind as well. Agreed, the implicit sync at least sounds unnecessary for tippytop, as it's totally fine to have some data latency or even absence. Perhaps we can leave this implicit sync as an option for the RS users? They can decide whether or not to enable it based on their demand of data freshness. What do you think?
Comment 9•6 years ago
|
||
Reporter | ||
Comment 10•6 years ago
|
||
Quick note: the network error rate went down in the last days (8.85% compared to 14%)
Perhaps we can leave this implicit sync as an option for the RS users? They can decide whether or not to enable it based on their demand of data freshness. What do you think?
Yes, and with the patch in Bug 1528704, when .get()
does not sync, there will be no swallowed errors.
Hmm, but .get(key) could also return an empty list if the given "key" is missing in the collection, which is totally fine for tippytop. So we still can't tell the difference, correct?
Well, what I meant is that with implicit sync in .get()
you would have expected it to always return a non-empty array.
Assignee | ||
Comment 11•6 years ago
|
||
One thing to note, is that when I wake up my macbook pro, from hibernation, it's not unusual that I can get into Firefox faster than my macOS can re-connect to the WiFi.
If I'm able to start the browser and hit Cmd-T it's not unrealistic at all that the newtab is started without a network connection available.
It's beyond me if this would/should cause a "network error" because it might just take the socket request and if the Wifi isn't available, it'll automatically wait a few seconds just as if it would when a server is slow to respond.
Updated•6 years ago
|
Comment 12•6 years ago
|
||
Reporter | ||
Comment 13•6 years ago
|
||
Reporter | ||
Comment 14•6 years ago
|
||
Error rate is dropping in Nightly :)
Comment 15•6 years ago
|
||
\o/ The patch is working.
Description
•