Closed Bug 1937022 Opened 8 months ago Closed 11 days ago

Experiment with a larger DNS grace time and a larger DNS cache size

Categories

(Core :: Networking: DNS, task, P2)

Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: valentin, Assigned: omansfeld)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

dnsGracePeriod and maxDnsCacheEntries have been added to the FeatureManifest in bug 1936996.
We should run a Nightly experiment that increases these values to see how that affects the cache hit rate.
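
For orientation, a minimal sketch of how the two knobs are read (assumption: the FeatureManifest wires dnsGracePeriod and maxDnsCacheEntries to the pref names below; the fallback values are illustrative, not the shipped defaults):

    // Sketch only - the real consumers live in Necko's host resolver.
    #include "mozilla/Preferences.h"

    static uint32_t DnsCacheCapacity() {
      // Maximum number of entries the DNS cache may hold.
      return mozilla::Preferences::GetUint("network.dnsCacheEntries", 400);
    }

    static uint32_t DnsGracePeriodSeconds() {
      // How long past its TTL an entry may still be served while a
      // refresh is triggered in the background.
      return mozilla::Preferences::GetUint(
          "network.dnsCacheExpirationGracePeriod", 60);
    }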

Depends on: 1947414
Assignee: nobody → omansfeld

Update

The first experiment, which increased the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).

TLDR

  • The number of times the cache was full and an entry had to be evicted roughly halved.
  • This didn't lead to any significant increase in cache hit rate, though.
  • That's because the number of cache-full events in the experiment cohort (~10k) was tiny compared to the total number of lookups (100s of millions), so halving it couldn't move the hit rate; see the rough bound sketched below.
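
For a back-of-the-envelope sense of scale (using only the magnitudes quoted above, so the exact numbers are assumptions): halving ~10k cache-full events avoids roughly 5k evictions, and even if every one of those had turned directly into an extra cache hit,

    Δ(hit rate) ≤ 5×10³ / 10⁸ = 0.005 percentage points,

which is far below anything the hit-rate dashboards can resolve.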

Next steps

  • Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
    • Can you have a look at that and its brief, Valentin?
  • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?
Flags: needinfo?(valentin.gosu)

(In reply to Oskar Mansfeld from comment #1)

> Update
>
> The first experiment, which increased the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).

> TLDR
>
>   • The number of times the cache was full and an entry had to be evicted roughly halved.

👍

>   • This didn't lead to any significant increase in cache hit rate, though.
>   • That's because the number of cache-full events in the experiment cohort (~10k) was tiny compared to the total number of lookups (100s of millions), so halving it couldn't move the hit rate.

This one is a bit odd to me: once we have resolved more than network.dnsCacheEntries distinct domains, every subsequent lookup should register an eviction in telemetry - though negative cache entries are not counted.
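
For concreteness, a minimal sketch (not the real nsHostResolver code) of the eviction behavior this reasoning assumes; RecordCleanupAge is a hypothetical stand-in for the dns.cleanup_age probe:

    #include <cstdint>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Hypothetical telemetry hook: one sample per evicted entry.
    static void RecordCleanupAge(uint64_t /*aAgeMs*/) { /* accumulate */ }

    class DnsCacheSketch {
      size_t mMaxEntries;
      // Front = newest; back = oldest. Pair is (host, insert time in ms).
      std::list<std::pair<std::string, uint64_t>> mEntries;
      std::unordered_map<std::string,
                         std::list<std::pair<std::string, uint64_t>>::iterator>
          mIndex;

     public:
      explicit DnsCacheSketch(size_t aMax) : mMaxEntries(aMax) {}

      void Insert(const std::string& aHost, uint64_t aNowMs) {
        if (mIndex.count(aHost)) {
          return;  // already cached, nothing to evict
        }
        if (mEntries.size() >= mMaxEntries) {
          // Cache full: every new host now costs one eviction, and each
          // eviction produces one telemetry sample.
          auto& oldest = mEntries.back();
          RecordCleanupAge(aNowMs - oldest.second);
          mIndex.erase(oldest.first);
          mEntries.pop_back();
        }
        mEntries.emplace_front(aHost, aNowMs);
        mIndex[aHost] = mEntries.begin();
      }
    };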

> Next steps
>
>   • Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
>     • Can you have a look at that and its brief, Valentin?

Follow-up experiment looks good.

>   • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?

I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

Flags: needinfo?(valentin.gosu)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #2)

> This one is a bit odd to me: once we have resolved more than network.dnsCacheEntries distinct domains, every subsequent lookup should register an eviction in telemetry - though negative cache entries are not counted.

They only show up in telemetry if the cache is full, right? So to me that means the cache just isn't hitting its limit very often. But if you think that's unreasonable, maybe there is something wrong with my query or the assumptions behind it.

Since we don't have a probe that counts each time the cache is full (which I assume is equivalent to an entry being evicted with our current logic), I counted the number of samples recorded for dns.cleanup_age + dns.by_type_cleanup_age.

This is my SQL for it (STMO link).

EDIT: I double-checked and there is definitely something wrong with it; GLAM shows sample counts in the ~5M range for just one of those probes on Nightly per build (I get around 10k for that probe alone when I take the experiment filter out of my query). I'd be interested in any pointers on how GLAM computes the sample count and how I could reproduce that in my own query.

Of course, even at a few million evictions, that's still very low compared to the 100s of millions of lookups happening, so the argument that this is why it doesn't show up in the hit-rate statistics may still hold.

> Follow-up experiment looks good.

Thanks for confirming. I'll check experiment coordination with the team and, if everything is cleared, go ahead and start enrollment after next week's version bump.

> >   • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?

> I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

I'll put in a patch doubling the cache size, then. Should we still land it even though the soft code freeze for 137 has started?

Flags: needinfo?(valentin.gosu)
Depends on: 1950852

(In reply to Oskar Mansfeld from comment #3)

> > I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

> I'll put in a patch doubling the cache size, then. Should we still land it even though the soft code freeze for 137 has started?

Let's do it after the code freeze.

Re: follow-up experiment with increased grace period

I think it makes sense to implement telemetry that detects when the longer grace period could lead to breakage.

My thought process: if the cache refresh triggered by an entry being used within its grace period produces a result that differs from the cached entry, then serving the stale entry has potentially broken something. We should record that in telemetry so we can see whether the experiment has a negative impact (and how big that impact is).
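
A minimal sketch of the proposed check, with hypothetical names (the real implementation belongs to the follow-up bug):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Hypothetical telemetry hook standing in for whatever probe the
    // follow-up bug adds.
    static void RecordGraceRefreshMismatch(bool /*aMismatch*/) { /* accumulate */ }

    static void OnGraceRefreshComplete(std::vector<std::string> aStaleAddrs,
                                       std::vector<std::string> aFreshAddrs) {
      // Order doesn't matter for "did the answer change"; compare as sets.
      std::sort(aStaleAddrs.begin(), aStaleAddrs.end());
      std::sort(aFreshAddrs.begin(), aFreshAddrs.end());
      // A mismatch means a consumer may have connected to an address the
      // fresh answer no longer contains (or missed a newly added one).
      RecordGraceRefreshMismatch(aStaleAddrs != aFreshAddrs);
    }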

I will look into implementing that telemetry tomorrow in a follow-up bug, but if you have any input or initial pointers (or other thoughts on whether it's necessary), please share!

I'm very much in favor of adding some telemetry for this, though the grace period isn't the only case in which we refresh DNS records.

If you have tabs A and B open to the same origin and tab B gets force-refreshed, that also updates the DNS cache entries available to tab A - which is similar to the grace-period refresh, but I'll grant you that the grace period is something we'll hit more often.

Regardless, I think this is good telemetry to have.
I'm also quite curious to see how often the refreshed DNS request fails.

Flags: needinfo?(valentin.gosu)
Depends on: 1952221

Should we hold the experiment start until that telemetry lands, and maybe push the telemetry into the priority queue? I think it makes sense; in my opinion we really need the telemetry for the evaluation. I didn't manage to spend time on it this week, but will look at it on Monday.

Flags: needinfo?(valentin.gosu)

Yes, let's wait until we land telemetry.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]
Status: NEW → ASSIGNED
Depends on: 1954708

Removed from the priority queue, as this is tracked under performance work; we should instead use the queue for bugs/improvements that emerge from the experiment results.

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

Update

https://yardstick.mozilla.org/dashboard/snapshot/SFH0Pt3bnBmbJyjZggM8MncLQaEXar1r

The share of DNS cache entries used within their grace period went from 2.03% to 3.46% (a ~1.7x increase), resulting in a ~1.1% increase in positive DNS cache outcomes, visible through the noise.

Have not found any side effects.

Next steps

Run a follow-up experiment on Beta to confirm the results and keep watching for side effects/bugs, then eventually push to release.

The Beta follow-up experiment is running and has finished enrolling users.

Results from beta experiment

dashboard: https://yardstick.mozilla.org/dashboard/snapshot/R1d0A3D5629oXp2pf7xXZC4r65p2HqCN

  • The Beta experiment confirmed the Nightly results.
  • The share of DNS cache entries used within their grace period increased by 1.46 percentage points (from 2.58% to 4.04%, a relative increase of over 50%), landing us at ~93% positive DNS caching outcomes.
  • We have not found any substantial side effects.

Next steps

We will increase the pref and let it ride the trains.

  • Further optimization of the exact value is not expected to be worth the effort at the moment.
  • The patch increasing the pref will thus close this bug.

After experimentation gave us a good read that increasing the grace period yields ~1.5% more cache usage with no side effects, we're now shipping that increase.

Status: ASSIGNED → RESOLVED
Closed: 11 days ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch