Confidence/replayed/not-replayed metrics are not being calculated correctly
Categories
(Testing :: Raptor, task, P3)
Tracking
(firefox84 fixed)
People
(Reporter: sparky, Assigned: kanishk509, Mentored)
References
Details
(Keywords: good-first-bug)
User Story
See here for information on how to contribute: https://wiki.mozilla.org/TestEngineering/Performance/NewContributors
Attachments
(1 file)
Note how both replayed and not-replayed decreased, which should not be the case: if replayed decreases, not-replayed should increase. This results in a poor calculation of the confidence metric, and it becomes very difficult to interpret.
We should modify how the confidence is calculated to use a total number of expected requests that is known beforehand. At the moment we assume that all requests will be seen in mitmproxy, so we calculate the total number of requests as replayed + not-replayed, which is a bad assumption.
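To see why deriving the denominator from observed traffic is misleading, consider a minimal sketch (the function and the numbers are illustrative, not the real mozproxy code):

```python
# Hypothetical numbers: the current confidence only uses what mitmproxy saw.
def confidence(replayed, not_replayed):
    # Denominator is derived from observed traffic, not from the recording.
    return replayed / (replayed + not_replayed)

# Run 1: 90 requests reach the proxy, 80 are replayed.
run1 = confidence(80, 10)
# Run 2: far fewer requests reach the proxy at all; both counters drop.
run2 = confidence(40, 5)

# Both runs report the same ~0.889, so the drop in replayed traffic is invisible.
print(run1, run2)
```

Because the denominator shrinks along with the numerator, a run that replays half as much traffic can report the same confidence, which matches the behaviour described above.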
Comment 1•4 years ago
We could think of these as hits (requested, and replayed from recording), misses (requested, but not replayed from recording), and spares (in recording, but not requested). The confidence could be the percentage of the total (hits+misses+spares) that are hits.
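A hedged sketch of this hits/misses/spares idea (the function name and numbers are illustrative):

```python
def confidence_with_spares(hits, misses, spares):
    # The total is known up front: everything in the recording, plus
    # anything requested that the recording could not serve.
    total = hits + misses + spares
    return hits / total if total else 0.0

# 80 replayed, 10 requested but not replayed, 30 recorded but never requested.
print(confidence_with_spares(80, 10, 30))  # 80 / 120 ~= 0.667
```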
Reporter
Comment 2•4 years ago
I think that if we can get the number of spares, that would solve the issue and we wouldn't need to redo the confidence calculation. We could add a new metric, recordingProportionUsed (which would include the spares in the total), and then for confidence we could keep the same calculation but rename it to replayConfidence or networkConfidence to make their purpose clearer.
It would be great if we could find a way to make sure devs know what these metrics are and how to interpret them as well.
Reporter
Updated•4 years ago
Reporter
Updated•4 years ago
Comment 3•4 years ago
Hey Greg.
I would like to look into this bug. It would be great if you could point me to the relevant code and give me a little context about the confidence calculations being discussed.
Cheers.
Reporter
Comment 4•4 years ago
Hi Kanishk!
You should start with getting your environment going first: https://wiki.mozilla.org/TestEngineering/Performance/NewContributors
After that you'll want to be looking in these files:
- Confidence: https://searchfox.org/mozilla-central/source/testing/mozbase/mozproxy/mozproxy/backends/mitm/mitm.py#372
- Where the metrics are gathered: https://searchfox.org/mozilla-central/source/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py#241,252
You'll need to rename the existing confidence metric to replayConfidence; its calculation stays the same. Then you'll also need to add a new recordingProportionUsed metric, calculated as (hits + misses) / total, where total is the maximum number of requests that can be done in this recording. I think you can get that total from the size of self.flowmap, :bebe can you confirm?
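A standalone sketch of the two metrics described here (the class and field names mirror the discussion but are illustrative, not the real AlternateServerPlayback code):

```python
class PlaybackMetrics:
    def __init__(self, flowmap, replayed, not_replayed):
        self.flowmap = flowmap             # hash -> recorded flows
        self._replayed = replayed          # hits
        self._not_replayed = not_replayed  # misses

    def replay_confidence(self):
        # Same calculation as the old "confidence" metric.
        seen = self._replayed + self._not_replayed
        return self._replayed / seen if seen else 0.0

    def recording_proportion_used(self):
        # total = maximum distinct requests the recording can serve.
        total = len(self.flowmap)
        return (self._replayed + self._not_replayed) / total if total else 0.0

m = PlaybackMetrics({"h1": [], "h2": [], "h3": [], "h4": []},
                    replayed=2, not_replayed=1)
print(m.replay_confidence())          # 2 / 3
print(m.recording_proportion_used())  # 3 / 4
```

Note that _replayed counts every request, including repeats of the same asset, while len(self.flowmap) counts distinct hashes, so this proportion can exceed 100%; that mismatch surfaces later in this bug.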
Updated•4 years ago
I've submitted an initial patch, taking (hits + misses) as (self._replayed + self._not_replayed) and the total as len(self.flowmap). I've updated the metrics in mitm.py too.
Let me know what further changes are required.
Cheers.
Comment 7•4 years ago
:sparky, that sounds right.
:Kanishk, thanks for the patch. I will also take a look.
Reporter
Comment 8•4 years ago
:bebe, we're getting some odd numbers using the flowmap attribute. Do you have any suggestions for a better way of doing this?
See the amazon tests here and how the metric is >100%: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e8ab6e0986ae83ddaa94e9eed2503903a4a38358&selectedTaskRun=eWo5gunAR6uGnv0VOXp3eg.0
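A rough illustration of how the first patch can exceed 100% (the numbers are made up):

```python
# Repeated requests bump _replayed several times per recorded hash,
# while the denominator only counts distinct hashes.
flowmap_size = 10    # distinct hashes in the recording
replayed = 25        # same assets fetched repeatedly across page retries
not_replayed = 2

recording_proportion_used = (replayed + not_replayed) / flowmap_size
print(recording_proportion_used)  # 2.7, i.e. reported as 270%
```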
Comment 9•4 years ago
Here is how I see it:
The proxy is used for all the retries in the test: it's started at the beginning of the test and closed at the end.
If the metric we are looking for is the percentage of the available recording used during all the retries, we should calculate it like this: count every time we use a hash (a compiled package available in the recording) and generate the metric by dividing the number of used replays by the total number of available replays.
An issue I see here is that a hash is generated from a combination of request attributes (URL components, port, scheme, raw content, host, etc.), so one or more replies are bundled into one hash (from which we only use the last added reply to respond). This means we ignore part of the recording from the start, but it has always worked that way. So should we compare against the total number of requests available in the recorded file, or against the number of hashes generated?
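The hash-bundling behaviour described above can be sketched like this (toy values, not the real flow objects):

```python
flowmap = {}

def add_recorded_flow(hsh, flow):
    # Several recorded replies can share one hash if their URL, port,
    # scheme, host, etc. combine to the same key.
    flowmap.setdefault(hsh, []).append(flow)

add_recorded_flow("abc", "reply-v1")
add_recorded_flow("abc", "reply-v2")  # collides with the first entry

# Playback always answers with the most recent reply for a hash,
# so "reply-v1" is part of the recording that can never be used.
print(flowmap["abc"][-1])  # reply-v2
```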
Comment 10•4 years ago
Here is how I would do it:
diff --git a/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py b/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
--- a/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
+++ b/testing/mozbase/mozproxy/mozproxy/backends/mitm/scripts/alternate-server-replay.py
@@ -123,9 +123,10 @@ class AlternateServerPlayback:
)
elif i.response and self.mitm_version in ("4.0.2", "4.0.4", "5.1.1"):
# see: https://github.com/mitmproxy/mitmproxy/issues/3856
- l = self.flowmap.setdefault(self._hash(i), [])
- l.append(i)
- self._max_requests += 1
+ l = self.flowmap.setdefault(self._hash(i),
+ {"items": [],
+ "reply_count": 0})
+ l["items"].append(i)
else:
ctx.log.info(
"Recorded request %s has no response. Removing from recording list"
@@ -196,8 +197,9 @@ class AlternateServerPlayback:
found.
"""
hsh = self._hash(request)
- if hsh in self.flowmap:
- return self.flowmap[hsh][-1]
+ if hsh in self.flowmap.keys():
+ self.flowmap[hsh]["reply_count"] += 1
+ return self.flowmap[hsh]["items"][-1]
def configure(self, updated):
if not self.configured and ctx.options.server_replay_files:
@@ -208,8 +210,18 @@ class AlternateServerPlayback:
if self._done or not ctx.options.upload_dir:
return
+ recording_proportion_count = 0
+ for flow in self.flowmap:
+ if self.flowmap[flow]["reply_count"] > 0:
+ recording_proportion_count += 1
+
replayConfidence = float(self._replayed) / (self._replayed + self._not_replayed)
- recordingProportionUsed = float(self._replayed + self._not_replayed) / self._max_requests
+
+ if recording_proportion_count > 0:
+ recordingProportionUsed = recording_proportion_count / len(self.flowmap)
+ else:
+ recordingProportionUsed = 0
+
stats = {"totals": dict(self.netlocs),
"calls": self.calls,
"replayed": self._replayed,
@@ -283,6 +295,7 @@ if hasattr(signal, 'SIGBREAK'):
def _shutdown(sig, frame):
ctx.master.shutdown()
+
signal.signal(signal.SIGBREAK, _shutdown)
addons = [playback]
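Extracting just the counting logic from the diff above, a minimal standalone version (with illustrative data) looks like this:

```python
# Each hash counts at most once toward the "used" proportion, so the
# metric is naturally capped at 1.0 regardless of request repeats.
flowmap = {
    "h1": {"items": ["r1"], "reply_count": 3},  # requested several times
    "h2": {"items": ["r2"], "reply_count": 0},  # recorded but never requested
    "h3": {"items": ["r3"], "reply_count": 1},
}

used = sum(1 for entry in flowmap.values() if entry["reply_count"] > 0)
recording_proportion_used = used / len(flowmap) if flowmap else 0.0
print(recording_proportion_used)  # 2 of 3 hashes used -> ~0.667
```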
Assignee
Comment 11•4 years ago
:bebe, So basically we should calculate the proportion of hashes that we later used vs the total hashes we generated/stored?
One question (might be unrelated to the task at hand): why are we storing a list of replies for a colliding hash when we are always using the most recent one?
Assignee
Comment 12•4 years ago
Updated the patch based on :bebe's suggestions.
Comment 13•4 years ago
(In reply to Kanishk from comment #11)
:bebe, So basically we should calculate the proportion of hashes that we later used vs the total hashes we generated/stored?
One question (might be unrelated to the task at hand): why are we storing a list of replies for a colliding hash when we are always using the most recent one?
That code has supported multiple changes over time, according to the proxy version and usage. But you have a great point there...
Can you file a bug to review that code and add improvements? Any other suggestions are really appreciated!
Updated•4 years ago
Comment 15•4 years ago
Pushed by gmierz2@outlook.com: https://hg.mozilla.org/integration/autoland/rev/766437adb95f Add new mozproxy metrics replay-confidence, and recording-proportion-used. r=sparky,perftest-reviewers
Comment 16•4 years ago
bugherder