Closed Bug 1655840 Opened 1 year ago Closed 1 year ago

Confidence/replayed/not-replayed metrics are not being calculated correctly

Categories

(Testing :: Raptor, task, P3)

task

Tracking

(firefox84 fixed)

RESOLVED FIXED
84 Branch

People

(Reporter: sparky, Assigned: kanishk509, Mentored)

References

Details

(Keywords: good-first-bug)

User Story

See here for information on how to contribute: https://wiki.mozilla.org/TestEngineering/Performance/NewContributors

Attachments

(1 file)

See this graph: https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&highlightedRevisions=5a874f1887ff&series=autoland,2396440,1,10&series=autoland,2396442,1,10&series=autoland,2396439,1,10&timerange=5184000&zoom=1591223605643,1591227721330,33.62381966393652,886.2515267675393

Note how both replayed and not-replayed decreased, which should not be possible: if replayed decreases, not-replayed should increase. This skews the confidence calculation and makes the metric very difficult to interpret.

We should modify how the confidence is calculated so it uses a total number of expected requests that is known beforehand. At the moment we assume that all requests will be seen by mitmproxy, so we calculate the total number of requests as replayed + not-replayed, which is a bad assumption.
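To illustrate the problem, here is a minimal sketch of the current style of calculation (the function name and numbers are illustrative, not taken from the mozproxy source). Because the denominator is inferred from whatever the proxy happened to observe, two runs with very different coverage can report identical confidence:

```python
# Hypothetical sketch of the current-style calculation; names and
# numbers are illustrative, not from the mozproxy source.
def confidence(replayed, not_replayed):
    # The "total" is inferred from what mitmproxy happened to see,
    # so it silently shifts from run to run.
    total = replayed + not_replayed
    return replayed / total if total else 0.0

run_a = confidence(replayed=90, not_replayed=10)  # total assumed to be 100
run_b = confidence(replayed=45, not_replayed=5)   # total assumed to be 50
# Both runs report 0.9 even though run B replayed only half as many
# requests, so the metric cannot flag the drop visible in the graph.
```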

See Also: → 1655841

We could think of these as hits (requested, and replayed from the recording), misses (requested, but not replayed from the recording), and spares (in the recording, but not requested). The confidence could then be the percentage of the total (hits + misses + spares) that are hits.
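A minimal sketch of that proposal (hypothetical names, not actual mozproxy code). The point is that the denominator is fixed by the recording itself, so it no longer drifts with what the proxy observes:

```python
# Hypothetical sketch of the proposed hits/misses/spares confidence.
def confidence_with_spares(hits, misses, spares):
    # hits + misses + spares is known from the recording itself,
    # so the denominator is stable across runs.
    total = hits + misses + spares
    return hits / total if total else 0.0

# e.g. 90 hits, 10 misses, 20 spare entries in the recording:
# confidence_with_spares(90, 10, 20) == 90 / 120 == 0.75
```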

I think that if we can get the number of spares, that would solve the issue and we wouldn't need to redo the confidence calculation. We could add a new metric for recordingProportionUsed (which would include the spares in the total) and then for confidence, we could keep the same calculation but rename it to replayConfidence or networkConfidence to make their purpose clearer.

It would be great if we could find a way to make sure devs know what these metrics are and how to interpret them as well.

Priority: P2 → P3
Mentor: gmierz2
User Story: (updated)
Keywords: good-first-bug

Hey Greg,
I would like to look into this bug. It would be great if you could point me to the relevant code and give me some context on the confidence calculations being discussed.
Cheers.

Flags: needinfo?(gmierz2)

Hi Kanishk!

You should start with getting your environment going first: https://wiki.mozilla.org/TestEngineering/Performance/NewContributors

After that you'll want to be looking in these files:

  1. Confidence: https://searchfox.org/mozilla-central/source/testing/mozbase/mozproxy/mozproxy/backends/mitm/mitm.py#372
  2. Where the metrics are gathered: https://searchfox.org/mozilla-central/source/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py#241,252

You'll need to change the existing confidence metric to replayConfidence; its calculation will stay the same. Then, you'll also need to add a new recordingProportionUsed metric, calculated as (hits + misses) / total, where total is the maximum number of requests the recording can serve.

I think you can get that total from the size of self.flowmap, :bebe can you confirm?
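A rough sketch of the two metrics as described above (hypothetical names; the real counters live in alternate-server-replay.py). The caveat in the comment, if it holds, would explain proportions over 100%:

```python
# Hypothetical sketch; the names mirror the mozproxy counters but
# this is not the actual implementation.
def replay_metrics(replayed, not_replayed, flowmap_size):
    seen = replayed + not_replayed
    # Same formula as the existing confidence, just renamed.
    replay_confidence = replayed / seen if seen else 0.0
    # Caveat: "seen" counts every request, so repeated requests for the
    # same flow (or requests absent from the recording) can push this
    # above 1.0, i.e. a proportion over 100%.
    proportion_used = seen / flowmap_size if flowmap_size else 0.0
    return replay_confidence, proportion_used
```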

Flags: needinfo?(gmierz2) → needinfo?(fstrugariu)
Assignee: nobody → kanishk509
Status: NEW → ASSIGNED

I've submitted an initial patch, taking (hits + misses) as (self._replayed + self._not_replayed) and the total as len(self.flowmap). I've updated the metrics in mitm.py too.
Let me know what further changes are required.
Cheers.

Flags: needinfo?(gmierz2)

:sparky, that sounds right.
:Kanishk, thanks for the patch, I will also take a look.

Flags: needinfo?(fstrugariu)

:bebe, we're getting some odd numbers using the flowmap attribute. Do you have any suggestions for a better way of doing this?

See the amazon tests here and how the metric is >100%: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e8ab6e0986ae83ddaa94e9eed2503903a4a38358&selectedTaskRun=eWo5gunAR6uGnv0VOXp3eg.0

Flags: needinfo?(gmierz2) → needinfo?(fstrugariu)

Here is how I see it:
The proxy is used for all the retries in a test, meaning it's started at the beginning of the test and closed at the end.

If the metric we're looking for is the percentage of the available recording used during all the retries, we should calculate it like this:

Count every time we use a hash (a compiled package available in the recording) and generate the metric by dividing the number of used hashes by the total number of available hashes.

One issue I see here is that a hash is generated from a combination of request attributes (URL components, port, scheme, raw content, host, etc.),
so one or more replies can be bundled under a single hash (of which we use only the last added reply to respond). This means we ignore part of the recording from the start, but it has always been like this. So should we compare against the total number of requests available in the recorded file, or against the number of hashes generated?

Flags: needinfo?(fstrugariu)

Here is how I would do it:

diff --git a/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py b/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
--- a/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
+++ b/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
 
@@ -123,9 +123,10 @@ class AlternateServerPlayback:
                 )
             elif i.response and self.mitm_version in ("4.0.2", "4.0.4", "5.1.1"):
                 # see: https://github.com/mitmproxy/mitmproxy/issues/3856
-                l = self.flowmap.setdefault(self._hash(i), [])
-                l.append(i)
-                self._max_requests += 1
+                l = self.flowmap.setdefault(self._hash(i),
+                                            {"items": [],
+                                             "reply_count": 0})
+                l["items"].append(i)
             else:
                 ctx.log.info(
                     "Recorded request %s has no response. Removing from recording list"
@@ -196,8 +197,9 @@ class AlternateServerPlayback:
             found.
         """
         hsh = self._hash(request)
-        if hsh in self.flowmap:
-            return self.flowmap[hsh][-1]
+        if hsh in self.flowmap.keys():
+            self.flowmap[hsh]["reply_count"] += 1
+            return self.flowmap[hsh]["items"][-1]
 
     def configure(self, updated):
         if not self.configured and ctx.options.server_replay_files:
@@ -208,8 +210,18 @@ class AlternateServerPlayback:
         if self._done or not ctx.options.upload_dir:
             return
 
+        recording_proportion_count = 0
+        for flow in self.flowmap:
+            if self.flowmap[flow]["reply_count"] > 0:
+                recording_proportion_count += 1
+
         replayConfidence = float(self._replayed) / (self._replayed + self._not_replayed)
-        recordingProportionUsed = float(self._replayed + self._not_replayed) / self._max_requests
+
+        if recording_proportion_count > 0:
+            recordingProportionUsed = recording_proportion_count / len(self.flowmap)
+        else:
+            recordingProportionUsed = 0
+
         stats = {"totals": dict(self.netlocs),
                  "calls": self.calls,
                  "replayed": self._replayed,
@@ -283,6 +295,7 @@ if hasattr(signal, 'SIGBREAK'):
     def _shutdown(sig, frame):
         ctx.master.shutdown()
 
+
     signal.signal(signal.SIGBREAK, _shutdown)
 
 addons = [playback]

:bebe, so basically we should calculate the proportion of hashes that were actually used vs. the total hashes we generated/stored?
One question (might be unrelated to the task at hand): why do we store a list of replies for a colliding hash when we always use only the most recent one?

Flags: needinfo?(fstrugariu)

Updated the patch based on :bebe 's suggestions.

(In reply to Kanishk from comment #11)

> :bebe, so basically we should calculate the proportion of hashes that were actually used vs. the total hashes we generated/stored?
> One question (might be unrelated to the task at hand): why do we store a list of replies for a colliding hash when we always use only the most recent one?

That code has supported multiple changes over time, depending on the proxy version and usage. But you have a great point there...

Could you file a bug to review that code and add improvements? Any other suggestions are really appreciated!

Flags: needinfo?(fstrugariu) → needinfo?(kanishk509)

Okay. My first time filing a bug.

Flags: needinfo?(kanishk509)
Attachment #9181489 - Attachment description: Bug 1655840 - New confidence metrics replayConfidence, recordingProportionUsed. r=sparky → Bug 1655840 - Add new mozproxy metrics replay-confidence, and recording-proportion-used. r=sparky
Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/766437adb95f
Add new mozproxy metrics replay-confidence, and recording-proportion-used. r=sparky,perftest-reviewers
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch