Closed Bug 1937022 Opened 8 months ago Closed 11 days ago

Experiment with a larger DNS grace time and a larger DNS cache size

Categories

(Core :: Networking: DNS, task, P2)

Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: valentin, Assigned: omansfeld)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

dnsGracePeriod and maxDnsCacheEntries have been added to the FeatureManifest in bug 1936996.
We should run a Nightly experiment that increases these values to see how that affects the cache hit rate.
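
For orientation, a minimal sketch of how the two knobs are read (assumption: the FeatureManifest wires dnsGracePeriod and maxDnsCacheEntries to the pref names below; the fallback values are illustrative, not the shipped defaults):

    // Sketch only - the real consumers live in Necko's host resolver.
    #include "mozilla/Preferences.h"

    static uint32_t DnsCacheCapacity() {
      // Maximum number of entries the DNS cache may hold.
      return mozilla::Preferences::GetUint("network.dnsCacheEntries", 400);
    }

    static uint32_t DnsGracePeriodSeconds() {
      // How long past its TTL an entry may still be served while a
      // refresh is triggered in the background.
      return mozilla::Preferences::GetUint(
          "network.dnsCacheExpirationGracePeriod", 60);
    }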

Depends on: 1947414
Assignee: nobody → omansfeld

Update

The first experiment, which increased the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).

TLDR

  • The number of times the cache was full and an entry had to be evicted roughly halved.
  • This didn't lead to any significant increase in cache hit rate, though.
  • That's because the number of cache-full events in the experiment cohort (~10k) was tiny compared to the total number of lookups (100s of millions), so halving it couldn't move the hit rate; see the rough bound sketched below.
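
For a back-of-the-envelope sense of scale (using only the magnitudes quoted above, so the exact numbers are assumptions): halving ~10k cache-full events avoids roughly 5k evictions, and even if every one of those had turned directly into an extra cache hit,

    Δ(hit rate) ≤ 5×10³ / 10⁸ = 0.005 percentage points,

which is far below anything the hit-rate dashboards can resolve.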

Next steps

  • Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
    • Can you have a look at that and its brief, Valentin?
  • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?
Flags: needinfo?(valentin.gosu)

(In reply to Oskar Mansfeld from comment #1)

> Update
>
> The first experiment, which increased the cache size, has ended (pending review by the experimenter team). For detailed results, see the results section in the experiment brief (found via the first Experimenter link in the sidebar).

> TLDR
>
>   • The number of times the cache was full and an entry had to be evicted roughly halved.

👍

>   • This didn't lead to any significant increase in cache hit rate, though.
>   • That's because the number of cache-full events in the experiment cohort (~10k) was tiny compared to the total number of lookups (100s of millions), so halving it couldn't move the hit rate.

This one is a bit odd to me: once we have resolved more than network.dnsCacheEntries distinct domains, every subsequent lookup should register an eviction in telemetry - though negative cache entries are not counted.
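
For concreteness, a minimal sketch (not the real nsHostResolver code) of the eviction behavior this reasoning assumes; RecordCleanupAge is a hypothetical stand-in for the dns.cleanup_age probe:

    #include <cstdint>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Hypothetical telemetry hook: one sample per evicted entry.
    static void RecordCleanupAge(uint64_t /*aAgeMs*/) { /* accumulate */ }

    class DnsCacheSketch {
      size_t mMaxEntries;
      // Front = newest; back = oldest. Pair is (host, insert time in ms).
      std::list<std::pair<std::string, uint64_t>> mEntries;
      std::unordered_map<std::string,
                         std::list<std::pair<std::string, uint64_t>>::iterator>
          mIndex;

     public:
      explicit DnsCacheSketch(size_t aMax) : mMaxEntries(aMax) {}

      void Insert(const std::string& aHost, uint64_t aNowMs) {
        if (mIndex.count(aHost)) {
          return;  // already cached, nothing to evict
        }
        if (mEntries.size() >= mMaxEntries) {
          // Cache full: every new host now costs one eviction, and each
          // eviction produces one telemetry sample.
          auto& oldest = mEntries.back();
          RecordCleanupAge(aNowMs - oldest.second);
          mIndex.erase(oldest.first);
          mEntries.pop_back();
        }
        mEntries.emplace_front(aHost, aNowMs);
        mIndex[aHost] = mEntries.begin();
      }
    };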

> Next steps
>
>   • Bug 1947414 was fixed, so we can now also experiment with increasing the grace period; a follow-up experiment is already linked above.
>     • Can you have a look at that and its brief, Valentin?

Follow-up experiment looks good.

>   • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?

I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

Flags: needinfo?(valentin.gosu)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #2)

> This one is a bit odd to me: once we have resolved more than network.dnsCacheEntries distinct domains, every subsequent lookup should register an eviction in telemetry - though negative cache entries are not counted.

They only show up in telemetry if the cache is full, right? So to me that means the cache just isn't hitting its limit very often. But if you think that's unreasonable, maybe there is something wrong with my query or the assumptions behind it.

Since we don't have a probe that counts each time the cache is full (which I assume is equivalent to an entry being evicted with our current logic), I counted the number of samples recorded for dns.cleanup_age + dns.by_type_cleanup_age.

This is my SQL for it (STMO link).

EDIT: I double-checked and there is definitely something wrong with it; GLAM shows sample counts in the ~5M range for just one of those probes on Nightly per build (I get around 10k for that probe alone when I take the experiment filter out of my query). I'd be interested in any pointers on how GLAM computes the sample count and how I could reproduce that in my own query.

Of course, even at a few million evictions, that's still very low compared to the 100s of millions of lookups happening, so the argument that this is why it doesn't show up in the hit-rate statistics may still hold.

> Follow-up experiment looks good.

Thanks for confirming. I'll check experiment coordination with the team and, if everything is cleared, go ahead and start enrollment after next week's version bump.

> >   • Maybe still increase the cache size for Nightly, as there is no downside to it, I think?

> I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

I'll put in a patch doubling the cache size, then. Should we still land it even though the soft code freeze for 137 has started?

Flags: needinfo?(valentin.gosu)
Depends on: 1950852

(In reply to Oskar Mansfeld from comment #3)

> > I think we can increase it and let it ride the trains. If there are no red flags in the experiment data, there's no need to keep it Nightly-only.

> I'll put in a patch doubling the cache size, then. Should we still land it even though the soft code freeze for 137 has started?

Let's do it after the code freeze.

Re: follow-up experiment with increased grace period

I think it makes sense to implement telemetry that detects when the longer grace period could lead to breakage.

My thought process: if the cache refresh triggered by an entry being used within its grace period produces a result that differs from the cached entry, then serving the stale entry has potentially broken something. We should record that in telemetry so we can see whether the experiment has a negative impact (and how big that impact is).
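
A minimal sketch of the proposed check, with hypothetical names (the real implementation belongs to the follow-up bug):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Hypothetical telemetry hook standing in for whatever probe the
    // follow-up bug adds.
    static void RecordGraceRefreshMismatch(bool /*aMismatch*/) { /* accumulate */ }

    static void OnGraceRefreshComplete(std::vector<std::string> aStaleAddrs,
                                       std::vector<std::string> aFreshAddrs) {
      // Order doesn't matter for "did the answer change"; compare as sets.
      std::sort(aStaleAddrs.begin(), aStaleAddrs.end());
      std::sort(aFreshAddrs.begin(), aFreshAddrs.end());
      // A mismatch means a consumer may have connected to an address the
      // fresh answer no longer contains (or missed a newly added one).
      RecordGraceRefreshMismatch(aStaleAddrs != aFreshAddrs);
    }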

I will look into implementing that telemetry tomorrow in a follow-up bug, but if you have any input or initial pointers (or other thoughts on whether it's necessary), please share!

I'm very much in favor of adding some telemetry for this, though the grace period isn't the only case in which we refresh DNS records.

If you have tabs A and B open to the same origin and tab B gets force-refreshed, that also updates the DNS cache entries available to tab A - which is similar to the grace-period refresh, but I'll grant you that the grace period is something we'll hit more often.

Regardless, I think this is good telemetry to have.
I'm also quite curious to see how often the refreshed DNS request fails.

Flags: needinfo?(valentin.gosu)
Depends on: 1952221

Should we hold the experiment start until that telemetry lands, and maybe push the telemetry into the priority queue? I think it makes sense; in my opinion we really need the telemetry for the evaluation. I didn't manage to spend time on it this week, but will look at it on Monday.

Flags: needinfo?(valentin.gosu)

Yes, let's wait until we land telemetry.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]
Status: NEW → ASSIGNED
Depends on: 1954708

Removed from the priority queue, as this is tracked under performance work; we should instead use the queue for bugs/improvements that emerge from the experiment results.

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]

Update

https://yardstick.mozilla.org/dashboard/snapshot/SFH0Pt3bnBmbJyjZggM8MncLQaEXar1r

The share of DNS cache entries used within their grace period went from 2.03% to 3.46% (a ~1.7x increase), resulting in a ~1.1% increase in positive DNS cache outcomes, visible through the noise.

Have not found any side effects.

Next steps

Run a follow-up experiment on Beta to confirm the results and keep watching for side effects/bugs, then eventually push to release.

The Beta follow-up experiment is running and has finished enrolling users.

Results from beta experiment

dashboard: https://yardstick.mozilla.org/dashboard/snapshot/R1d0A3D5629oXp2pf7xXZC4r65p2HqCN

  • The Beta experiment confirmed the Nightly results.
  • The share of DNS cache entries used within their grace period increased by 1.46 percentage points (from 2.58% to 4.04%, a relative increase of over 50%), landing us at ~93% positive DNS caching outcomes.
  • We have not found any substantial side effects.

Next steps

We will increase the pref and let it ride the trains.

  • Further optimization of the exact value is not expected to be worth the effort at the moment.
  • The patch increasing the pref will thus close this bug.

After experimentation gave us a good read that increasing the grace period yields ~1.5% more cache usage with no side effects, we're now shipping that increase.

Status: ASSIGNED → RESOLVED
Closed: 11 days ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch