Experiment with a larger DNS grace time and a larger DNS cache size
Categories
(Core :: Networking: DNS, task, P2)
Tracking
| Flag | Tracking | Status |
|---|---|---|
| firefox143 | --- | fixed |
People
(Reporter: valentin, Assigned: omansfeld)
Details
(Whiteboard: [necko-triaged])
Attachments
(1 file)
Description
`dnsGracePeriod` and `maxDnsCacheEntries` have been added to the FeatureManifest in bug 1936996. We should run a Nightly experiment that increases both to see how this affects the cache hit rate.
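Roughly, the grace period works like serve-stale-while-revalidate. The sketch below is illustrative only, not the actual Necko code; the function and enum names are made up:

```cpp
// Minimal sketch of the grace-period idea this experiment tunes: an entry
// past its TTL but still inside the grace window is served immediately
// (counting as a hit) while a background renewal refreshes the record.
#include <cstdint>

enum class CacheDecision { Fresh, ServeStaleAndRefresh, Miss };

CacheDecision Lookup(uint64_t nowSec, uint64_t expirySec, uint64_t graceSec) {
  if (nowSec <= expirySec) {
    return CacheDecision::Fresh;  // within TTL: plain cache hit
  }
  if (nowSec <= expirySec + graceSec) {
    // Within the grace period: serve the stale entry, trigger a renewal.
    return CacheDecision::ServeStaleAndRefresh;
  }
  return CacheDecision::Miss;  // too old: a full resolve is required
}

int main() {
  // Entry expired 30s ago; with a 60s grace period it is still served.
  return Lookup(1030, 1000, 60) == CacheDecision::ServeStaleAndRefresh ? 0 : 1;
}
```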
Comment 1 • Oskar Mansfeld (Assignee) • 6 months ago
Update
The first experiment, increasing the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).
TL;DR
- The number of times the cache was full and an entry had to be evicted roughly halved.
- This did not, however, lead to any significant increase in the cache hit rate.
- That's because the number of cache-full events in the experiment cohort (~10k) is tiny compared to the total number of lookups (hundreds of millions), so the halving couldn't move the hit rate (see the back-of-envelope sketch below).
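A back-of-envelope check of that argument; the total-lookup figure is an assumed stand-in for "hundreds of millions":

```cpp
// Even if every avoided eviction turned into an extra cache hit, the effect
// on the overall hit rate would be far below measurement noise.
#include <cstdio>

int main() {
  const double evictionsBefore = 10'000;    // cache-full events, control
  const double evictionsAfter = 5'000;      // roughly halved in treatment
  const double totalLookups = 300'000'000;  // assumed order of magnitude
  const double maxExtraHits = evictionsBefore - evictionsAfter;
  // Upper bound on the hit-rate change: ~0.0017 percentage points.
  std::printf("max hit-rate change: %.4f%%\n",
              100.0 * maxExtraHits / totalLookups);
  return 0;
}
```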
Next steps
- Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
- Can you have a look at that and its brief, Valentin?
- Maybe we should still increase the cache size on Nightly, as I think there is no downside to it?
Comment 2 • Valentin Gosu [:valentin] (Reporter) • 6 months ago
(In reply to Oskar Mansfeld from comment #1)
> Update
> The first experiment, increasing the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).
> TL;DR
> - The number of times the cache was full and an entry had to be evicted roughly halved.
> - This did not, however, lead to any significant increase in the cache hit rate.
> - That's because the number of cache-full events in the experiment cohort (~10k) is tiny compared to the total number of lookups (hundreds of millions), so the halving couldn't move the hit rate.
This one is a bit odd to me: once we've resolved `network.dnsCacheEntries` distinct domains, every following lookup of a new domain should register in telemetry (though negative cache entries are not counted).
> Next steps
> - Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
> - Can you have a look at that and its brief, Valentin?
The follow-up experiment looks good.
> - Maybe we should still increase the cache size on Nightly, as I think there is no downside to it?
I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep the change Nightly-only.
Comment 3 • Oskar Mansfeld (Assignee) • 6 months ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #2)
> This one is a bit odd to me: once we've resolved `network.dnsCacheEntries` distinct domains, every following lookup of a new domain should register in telemetry (though negative cache entries are not counted).
They only show up in telemetry if the cache is full, right? To me that means the cache simply doesn't hit its limit very often. But if you think that's unreasonable, maybe there is something wrong with my query or with my assumptions around it.
Since we don't have a probe that counts each time the cache is full (which I assume is equivalent to an entry being evicted under our current logic), I counted the number of entries recorded for `dns.cleanup_age` + `dns.by_type_cleanup_age`.
This is my SQL for it (STMO link).
EDIT: I double-checked and there is definitely something wrong with it: GLAM shows sample counts in the ~5M range per Nightly build for just one of those probes (I only get around 10k for that probe if I take the experiment filter out of my query). I'd be interested in any pointers on how GLAM derives the sample count and how I could reproduce that in my own query.
Of course, even at a few million evictions, that is still low compared to the hundreds of millions of lookups, so the argument that this is why it doesn't show up in the hit-rate statistics may still hold. (One plausible explanation for the counting gap is sketched below.)
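One guess at the discrepancy, assuming GLAM's sample count sums every histogram bucket while my query effectively counts rows; this is purely illustrative, the struct and numbers are made up:

```cpp
// If GLAM sums every bucket of every submitted histogram, while a query
// counts one row per ping, the two numbers differ by the average
// samples-per-ping factor.
#include <cstdio>
#include <vector>

struct HistogramPing {
  std::vector<long> bucketCounts;  // per-bucket sample counts in one ping
};

int main() {
  // Two example pings carrying a dns.cleanup_age-style histogram.
  std::vector<HistogramPing> pings = {{{120, 30, 5}}, {{200, 40, 10}}};

  long rowCount = 0;     // what a row-counting query measures
  long sampleCount = 0;  // what GLAM presumably reports
  for (const auto& ping : pings) {
    rowCount += 1;
    for (long c : ping.bucketCounts) {
      sampleCount += c;
    }
  }
  std::printf("rows=%ld, samples=%ld\n", rowCount, sampleCount);  // 2 vs 405
  return 0;
}
```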
> The follow-up experiment looks good.
Thanks for confirming. I'll check experiment coordination with the team and, if everything is cleared, go ahead and start enrollment after next week's version bump.
>> - Maybe we should still increase the cache size on Nightly, as I think there is no downside to it?
> I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep the change Nightly-only.
I'll put up a patch doubling the cache size then. Should we still land it even though the soft code freeze for 137 has started?
Comment 4 • Valentin Gosu [:valentin] (Reporter) • 6 months ago
(In reply to Oskar Mansfeld from comment #3)
>> I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep the change Nightly-only.
> I'll put up a patch doubling the cache size then. Should we still land it even though the soft code freeze for 137 has started?
Let's do it after the code freeze.
Comment 5 • Oskar Mansfeld (Assignee) • 6 months ago
Re: the follow-up experiment with an increased grace period
I think it makes sense to implement telemetry that captures when the longer grace period could lead to breakage.
My thought process: the cache renewal triggered when an entry is used within its grace period may produce a result that differs from the cached entry. In that case we have potentially broken something by introducing the grace period, and we should record it in telemetry so we can see whether the experiment has a negative impact (and how big that impact is). A rough sketch of what I mean follows below.
I will look into implementing that telemetry tomorrow in a follow-up bug, but if you have any input or initial pointers about it (or other thoughts on whether it's necessary), please share!
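Roughly what I have in mind, as an illustrative sketch rather than real Necko code; the hook, the address type, and the counter are hypothetical stand-ins (real code would record a Glean metric instead of bumping a global):

```cpp
// After a grace-period-triggered renewal completes, compare the fresh
// address set against the stale entry we served and count mismatches.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using AddrSet = std::vector<std::string>;  // stand-in for resolved addresses

static uint64_t gGraceRefreshMismatches = 0;  // stand-in for a Glean counter

void OnGraceRefreshComplete(AddrSet stale, AddrSet fresh) {
  // Address order is irrelevant to "did the answer change", so compare the
  // two sets after sorting.
  std::sort(stale.begin(), stale.end());
  std::sort(fresh.begin(), fresh.end());
  if (stale != fresh) {
    // The entry served during the grace period no longer matches the fresh
    // answer: a potential breakage window worth counting.
    ++gGraceRefreshMismatches;
  }
}

int main() {
  OnGraceRefreshComplete({"192.0.2.1"}, {"192.0.2.2"});
  return gGraceRefreshMismatches == 1 ? 0 : 1;
}
```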
Comment 6 • Valentin Gosu [:valentin] (Reporter) • 6 months ago
I'm very much in favor of adding some telemetry for this, though the grace period isn't the only case in which we refresh DNS records.
If you have tabs A and B open to the same origin and tab B gets force-refreshed, that also updates the DNS cache entries available to tab A, which is similar to the grace-period refresh; but I'll grant you that the grace period is something we'll probably hit more often.
Regardless, I think this is good telemetry to have.
I'm also quite curious to see how often the refreshed DNS request fails.
Comment 7 • Oskar Mansfeld (Assignee) • 6 months ago
Should we hold off on starting the experiment until that telemetry lands, and maybe push the telemetry work into the priority queue? I think that makes sense; in my opinion we really need the telemetry for the evaluation. I didn't manage to spend time on it this week, but I will look at it on Monday.
Comment 8 • Valentin Gosu [:valentin] (Reporter) • 6 months ago
Yes, let's wait until the telemetry lands.
Comment 9 • Oskar Mansfeld (Assignee) • 5 months ago
Removed from the priority queue: this is already tracked as part of the performance work, and we should rather use the queue for bugs/improvements that emerge from the experiment results.
Comment 10 • Oskar Mansfeld (Assignee) • 3 months ago
Update
Results from the Nightly grace-period experiment. Dashboard: https://yardstick.mozilla.org/dashboard/snapshot/SFH0Pt3bnBmbJyjZggM8MncLQaEXar1r
- The share of DNS cache entries used within their grace period nearly doubled (going from 2.03% to 3.46%), resulting in a ~1.1% increase in positive DNS cache outcomes, visible through the noise.
- We have not found any side effects.
Next steps
Run a follow-up experiment on Beta to confirm the results and keep watching for side effects/bugs. Eventually push to release.
Comment 11 • Oskar Mansfeld (Assignee) • 3 months ago
The Beta follow-up experiment is running and has finished enrolling users.
Comment 12 • Oskar Mansfeld (Assignee) • 13 days ago
Results from the Beta experiment
Dashboard: https://yardstick.mozilla.org/dashboard/snapshot/R1d0A3D5629oXp2pf7xXZC4r65p2HqCN
- The Beta experiment confirmed the Nightly results.
- We saw the share of DNS cache entries used within their grace period grow by 1.46 percentage points (going from 2.58% to 4.04%, i.e. more than half again as many), landing us at ~93% positive DNS caching outcomes.
- We have not found any substantial side effects.
Next steps
- We will increase the pref and let it ride the trains.
- Further optimization of the concrete value is not expected to be worth the effort at the moment.
- The patch increasing the pref will thus close this bug.
Comment 13 • Oskar Mansfeld (Assignee) • 13 days ago
After experimentation gave us a good read that increasing the grace period yields ~1.5% more cache usage with no side effects, we're now shipping that increase.