Closed Bug 1005176 Opened 6 years ago Closed 5 years ago

15% tp5o responsiveness regression on osx 10.6 on april 30 from push 72215f1d9de0


(Firefox Build System :: General, defect)




(Reporter: jmaher, Unassigned)



(Keywords: perf, regression, Whiteboard: [talos_regression])

this is an odd one, but I am pretty certain this regression is on this push:

I had done some retriggers to help reduce noise:

I don't know much about ICU and how building it differently could cause this.
Someone needs to verify whether the regression goes away with those patches backed out, perhaps one by one.
this seems to be problematic on windows also.  I can start backing these out later this weekend or early next week.
Bug 1002437 makes no sense as a Mac regression, since the configuration that the patch added was already the ICU default on a Mac. That is, the patch should only have had a (small) effect on Windows and Linux.

The other two bugs removed dead code. Unless the removal of code made the layout of other code radically different, it makes no sense that the two other bugs could have caused responsiveness regression, either.

Is there any chance that the regression could be misattributed to the wrong changesets?
oh this is (bug 943252)

here is the original pushlog: (responsiveness went from 60->70)

and here is me backing out 9eb4f13a78d8: (back to 60)

I also backed out 72215f1d9de0 (bug 1002437) and we remained at 70:

So we have the changeset, there are a lot of changes in that, Henri, any thoughts on that?
Flags: needinfo?(hsivonen)
(In reply to Joel Maher (:jmaher) from comment #4)
> So we have the changeset, there are a lot of changes in that, Henri, any
> thoughts on that?

This is *really* weird.

As you can see, the changeset only removes code, except there are some added (changed) comment lines. It makes no sense that a pure removal of code that we were already not using causes that kind of tp change.

So it looks like the change in responsiveness is indeed reproducible and attributable to a changeset. Still, it looks like it can't be any sort of normal consequence of the code being changed. It still looks like the patch is unlucky in some weird way, such as pushing some size over a magic threshold or something.

I double-checked that there shouldn't be any dangling references from XUL or chrome JS to the removed code.

I think putting the dead code back in the tree isn't the answer. Any ideas what the real cause could be?
Flags: needinfo?(hsivonen)
I have no idea here, maybe ted or mbrubeck know more about responsiveness and why removing code would cause this.

Alternatively we could bisect that patch on try and see where the offending cause is.
Flags: needinfo?(ted)
Flags: needinfo?(mbrubeck)
Is it possible that something implicitly depends on part of the removed code and takes a more expensive path with it gone?
Looking over a longer period of time, this benchmark appears to be multimodal:

At any given point it seems to cluster around two or three values separated by gaps of about 10.  This seems pretty suspect given what the result is supposed to represent.  In fact, looking at how this value is calculated, it seems completely broken:

We calculate the sum of squares of event delays and divide by 10^6.  But we never divide by the number of samples in the list, so instead of computing the average (squared) delay, we are computing the "total squared delay", which has nonsensical units of seconds^2 and always grows larger as we record more samples.

For example, comparing raw data from these logs before and after the regression:
n = 9072 samples; average = 60.5 ms; responsiveness_Metric = 59.9
n = 9132 samples; average = 65.7 ms; responsiveness_Metric = 70.2

So the number reported is only coincidentally close to the average delay in ms.  It also depends on how many samples are recorded, which probably depends on how long the test runs and maybe other factors.
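(An illustration, not talos code: a minimal sketch with hypothetical delay values showing that the reported value scales with sample count even when the delay distribution is unchanged.)

```python
# Hypothetical delay samples (ms). The formula sums squared delays
# scaled by 10^6 without dividing by n, so recording the same
# distribution twice doubles the reported value.

def metric(val_list):
    return sum(float(x) * float(x) / 1000000.0 for x in val_list)

base = [30.0, 40.0, 50.0]   # mean delay 40 ms
m1 = metric(base)           # 0.005
m2 = metric(base * 2)       # 0.01 -- same distribution, twice the samples
```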
Flags: needinfo?(mbrubeck)
(In reply to Matt Brubeck (:mbrubeck) from comment #8)
> So the number reported is only coincidentally close to the average delay in
> ms.  It also depends on how many samples are recorded, which probably
> depends on how long the test runs and maybe other factors.

This raises the question of whether Tp5 scores regressed at the same time as Tp5 Responsiveness.  They did not; the main Tp5 latency numbers appear totally flat during the time range in question.
I'd have to see if I still have the notes from when we decided on how we'd calculate the responsiveness metric. We involved bz and jlebar in the design, so I find it hard to believe we'd come up with something nonsensical. Perhaps we simply implemented it incorrectly.
Flags: needinfo?(ted)
I can't find where we defined the metric, but I do see this is where we added it:

if we need to change this, let's do it.
I found the email thread and forwarded it to mbrubeck/jmaher. It looks like jlebar gave us that exact formula, so it's not something that got lost in translation. There should be more information in the thread about how we arrived at that formula, though.
here is what is in talos:
return sum([float(x)*float(x) / 1000000.0 for x in val_list])

and here is what the original formula was defined as:
def responsiveness_Metric(val_list):
  s = sum([x*x / 1000000.0 for x in val_list])
  return str(round(s))

we switched from int to floats in bug 904642
Ah, I missed the fact that we only record a sample when it is above a threshold of 20ms.  Now the code makes sense to me.  (I had thought we recorded a sample for every "tracer event".)  Sorry for the confusion.
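Putting the pieces together, a minimal Python sketch of the pipeline as described in this thread: only delays above the 20 ms threshold are recorded, and the quoted formula is applied to those samples. The helper names and sample values here are hypothetical.

```python
# Assumption: delays are in milliseconds and only delays above the
# 20 ms threshold become recorded samples (per comment discussion).
THRESHOLD_MS = 20.0

def record_samples(event_delays_ms):
    # Only delays over the threshold are recorded at all.
    return [d for d in event_delays_ms if d > THRESHOLD_MS]

def responsiveness_metric(val_list):
    # Formula quoted earlier in the bug: sum of squares / 10^6,
    # with no division by the number of samples.
    return sum(float(x) * float(x) / 1000000.0 for x in val_list)

delays = [5.0, 12.0, 25.0, 60.0, 18.0, 45.0]
samples = record_samples(delays)        # [25.0, 60.0, 45.0]
value = responsiveness_metric(samples)  # (625 + 3600 + 2025) / 10^6
```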
I would like to close this bug, it is not actionable and any regressions have shipped.

Matt, any objections?
Flags: needinfo?(mbrubeck)
Closed: 5 years ago
Flags: needinfo?(mbrubeck)
Resolution: --- → WONTFIX
Product: Core → Firefox Build System