Closed Bug 1046162 Opened 5 years ago Closed 5 years ago

Significant drop in OF score after Mochitest structured logging landed

Categories

(Tree Management :: OrangeFactor, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: automatedtester, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: sheriffing-P1)

Attachments

(1 file)

The OF score currently seems very low (2.3 on 29/07/2014) and looking at http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&endday=2014-07-30&startday=2014-07-09&tree=trunk there was a significant drop around the 18th when https://hg.mozilla.org/integration/mozilla-inbound/rev/a283c1237d96 landed for bug 886570

I wonder if either there is an issue with either the structured logging not producing what we expect or if the log parser needs updating (or our score is finally correct having been wrong the whole time)
Some other milestones of note:
* The fix for bug 1041016 was deployed on 21st July.
* A whole bunch of additional platform/repo/testsuite names were added by bug 817269 on 24th July.
Note that my sleuthing has not been extensive but there appears to be an issue around this
Also helpful is looking at:
http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&endday=2014-07-30&startday=2014-07-02&tree=trunk
(slightly longer range than comment 0 link)
...and clicking on the graph legend "orange count" and "test run count" labels, which displays those lines on the graph. The former is the data counts provided by TBPL, the latter the test run counts as found by logparser (with the orangefactor score being the former divided by the latter).
Blocks: 886570
OS: Mac OS X → All
Hardware: x86 → All
If you look at the raw bug and testrun counts since July 1, it looks like the testrun count is _higher_ since around July 21, and the raw orange counts somewhat lower.

I'm not exactly sure what's going on, but an OF of 2.3 looks wrong, just eyeballing TBPL.
Since it landed there have been a fair number of regressions that Ahmed and Chris have been busy fixing. At least one of them (bug 1045255) did hide a fair number of oranges. I don't know if that would explain the entire discrepancy or if there are other issues still in play here.
Note that the dips (particularly for orange count) correspond to the weekends, which makes sense.
Keywords: sheriffing-P1
Maybe this is related to the e10s tests that were activated after the leak checking "disappeared"? The graph definitely looks fishy.
On the comment 3 URL:
"plus 2505 oranges with no daily test-run count"

That seems like an awfully high number of runs that presumably weren't parsed correctly by the logparser and thus were missing from the run count? Just the missing jobs prior to bug 1041016 being deployed?
(In reply to Ed Morley [:edmorley] from comment #8)
> On the comment 3 URL:
> "plus 2505 oranges with no daily test-run count"
> 
> That seems like an awfully high number of runs that presumably weren't
> parsed correctly by the logparser and thus were missing from the run count?
> Just the missing jobs prior to bug 1041016 being deployed?

Yes, this is the number reported on July 21 as well, when bug 1041016 was deployed.
(In reply to Andrew Halberstadt [:ahal] from comment #5)
> Since it landed there have been a fair number of regressions that Ahmed and
> Chris have been busy fixing. At least one of them (bug 1045255) did hide a
> fair number of oranges. I don't know if that would explain the entire
> discrepancy or if there are other issues still in play here.

As far as I know the leak logging issue was the only regression confirmed to have hidden oranges. Mochitest structured logging also caused at least one orange (bug 1041075).
(In reply to Chris Manchester [:chmanchester] from comment #10)
> (In reply to Andrew Halberstadt [:ahal] from comment #5)
> > Since it landed there have been a fair number of regressions that Ahmed and
> > Chris have been busy fixing. At least one of them (bug 1045255) did hide a
> > fair number of oranges. I don't know if that would explain the entire
> > discrepancy or if there are other issues still in play here.
> 
> As far as I know the leak logging issue was the only regression confirmed to
> have hidden oranges. Mochitest structured logging also caused at least one
> orange (bug 1041075).

Sorry, bug 1044206 also hid oranges.
Keep in mind that we are missing all data from the 9th to the 19th. This means that any failure that was either fixed or disabled during this time period would contribute to the drop in OF. I'm not saying there isn't something fishy going on (the OF is never that low), but it's something to keep in mind.

I went through the list of top intermittents from July 1st to July 9th [1] and found the following (in order from most intermittents during that range to fewest):

* Bug 1035172 <- workaround landed on the 10th
* Bug 1031928 <- test was disabled on the 18th
* Bug 996465  <- fixed on the 11th
* Bug 1020707 <- failure rate reduced drastically due to bug 1033121 on the 10th
* Bug 924622  <- fixed on the 10th
* Bug 1030024 <- inexplicably "fixed" itself on the 16th
* Bug 1019572 <- fixed on the 18th
* Bug 972826  <- fixed on the 22nd
* Bug 1035446 <- change backed out on the 10th
* Bug 1007824 <- fixed on the 15th
* Bug 663657  <- inexplicably "fixed" itself on the 22nd
* Bug 1034559 <- change backed out on the 9th
* Bug 1035443 <- change backed out on the 9th
* Bug 1034397 <- test disabled on the 14th
* Bug 1008422 <- inexplicably fixed itself on the 10th

Every other bug in that range was either:

a) Fixed before the 9th, or
b) Is still reporting failures, or
c) Had fewer than 20 intermittents during that period (I had to stop somewhere)

None of the bugs I looked at "magically" stopped reporting failures around the 18th. This also ignores all new intermittents filed after the 9th.

I'm not sure this data is thorough enough to draw any solid conclusions other than "None of the bugs I looked at were regressed by structured logging".

[1] http://brasstacks.mozilla.com/orangefactor/?display=OrangeFactor&endday=2014-07-10&startday=2014-07-01&tree=trunk
Filed bug 1048920 for the recent failure to ingest data, which is presumably unrelated to this issue but will hinder further work here.
I believe this is at least partly coincidental, and that there is likely a problem with the logparser that needs to be resolved.

The problem seems to be that we're reporting a testrun count that's too high.

For August 5, OF reports 18 testruns, but looking at TBPL, I see only 7.  For August 4, OF reports 11 testruns, but looking at TBPL I see only 4.  I need to take a look at the raw data OF is returning and see where these extra runs are coming from.
(In reply to Jonathan Griffin (:jgriffin) from comment #14)
> I believe this is at least partly coincidental, and that there is likely a
> problem with the logparser that needs to be resolved.
> 
> The problem seems to be that we're reporting a testrun count that's too high.
> 
> For August 5, OF reports 18 testruns, but looking at TBPL, I see only 7. 
> For August 4, OF reports 11 testruns, but looking at TBPL I see only 4.  I
> need to take a look at the raw data OF is returning and see where these
> extra runs are coming from.

This is all on mozilla-central, FYI.
We're also misreporting oranges.  Here's the data for Aug 3 for m-c:

"2014-08-03": {

    "orangecount": 12,
    "testruns": 1,
    "oranges": [ ]

},

There are no runs at all on this date in TPBL; the first run before this (on Aug 2) had 16 oranges, not 12, and the run after it (on Aug 3) had only 5.

Does anyone know if we may be double-reporting data from Treeherder?
I'm trying some hacky solutions to address this, meanwhile I've filed bug 1050002.
Mauro did the OF integration in TH.
Flags: needinfo?(mdoglio)
This is what treeherder production has sent to tbpl's starcomment.php until now on m-c:

mysql> select bug_id,  FROM_UNIXTIME(submit_timestamp) from bug_job_map ;
+---------+---------------------------------+
| bug_id  | FROM_UNIXTIME(submit_timestamp) |
+---------+---------------------------------+
|  979207 | 2014-06-28 01:36:16             |
|  920475 | 2014-06-28 01:48:06             |
|  722166 | 2014-06-30 11:02:02             |
| 1024734 | 2014-07-07 10:45:25             |
| 1035172 | 2014-07-09 08:10:02             |
|  944624 | 2014-07-23 14:15:16             |
+---------+---------------------------------+

This very low number of submissions is due to Bug 1042695 which I'm  currently working on.

This wouldn't prevent submissions from other treeherder instances though, even on developer machines. It would probably be a good idea to restrict the access to that script so that only tbpl.mozilla.org and treeherder.mozilla.org can post data to it. Also, is there a way to get a log of the POST requests to starcomment.php on tbpl's side? That would give us more info to understand what's going on.
Flags: needinfo?(mdoglio)
Bug 1052371 hid shutdown leaks too, from 2014-07-18 until 2014-08-14.
I did some more digging into this.  The root of the problem is that we've never defined the concept of "testrun" very well...it's an artificial construct that doesn't match very well what happens in buildbot.

The ultimate solution, like edmorley said today on IRC, is to switch OF to calculate the number of oranges per job, not per testrun (maybe multiplied by some power of 10)...this will take care of all the strange edge cases that make identifying the number of testruns hard.

Probably the largest contributing factor to this sudden drop (aside from the mochitest refactoring) was the cleanup that edmorley did in bug 817269; among other things, this caused us to start parsing logs for Android tests, which we were previously missing.  It turns out that buildbot doesn't cooperate very well with the way we calculate testruns here; we rely on the buildid, and buildbot usually produces a different buildid for Android runs than for anything else, so a given push (without periodic builds or retriggers) often has what OF sees as two distinct testruns, one for Android, one for everything else.

I've turned off Android logs again in OF, and this should probably return OF to about what it recently was, but longer term we should plan to migrate to a orange/job ratio.
(In reply to Jonathan Griffin (:jgriffin) from comment #21)
> I did some more digging into this.  The root of the problem is that we've
> never defined the concept of "testrun" very well...it's an artificial
> construct that doesn't match very well what happens in buildbot.

Thank you for digging into this :-)

> Probably the largest contributing factor to this sudden drop (aside from the
> mochitest refactoring) was the cleanup that edmorley did in bug 817269;
> among other things, this caused us to start parsing logs for Android tests,
> which we were previously missing.

We already had support for some Android builds and tests, I presume this just bumped the number up enough that it became more significant from a test run inflation POV?

> I've turned off Android logs again in OF, and this should probably return OF
> to about what it recently was, but longer term we should plan to migrate to
> a orange/job ratio.

I couldn't see any commits to the logparser repo, so took a look on orangefactor1 - a lot more than the bug 817269 changes have been reverted - was this intentional? eg: "cppunit", "jetpack", "jittest" etc.

These changes shouldn't be needed if we exclude Android from the platforms list, right?

Either way, we should get this checked into the repo, since uncommitted changes in the working directory have bitten us in the past.
(In reply to Ed Morley [:edmorley] from comment #22)
> I couldn't see any commits to the logparser repo, so took a look on
> orangefactor1 - a lot more than the bug 817269 changes have been reverted -
> was this intentional? eg: "cppunit", "jetpack", "jittest" etc.

Compare these bug 817269 commits:
https://hg.mozilla.org/automation/logparser/rev/c5cb4428fb6c
https://hg.mozilla.org/automation/logparser/rev/07c09f48b3ef

With:
https://bugzilla.mozilla.org/attachment.cgi?id=8475068&action=diff
Yes, I'm trying to find a magic set of platforms/tests that gives us better numbers.  I was thinking of checking these in after we find that, but arguably we should check things in as we change them on production.

I think I can revert most of the test suite changes (sans the Talos ones).
> We already had support for some Android builds and tests, I presume this just bumped the number up 
> enough that it became more significant from a test run inflation POV?

So if you look at a random commit on TBPL, like https://tbpl.mozilla.org/?rev=4d94eeca89f3, and open individual test logs so you can see the buildid at the top (from which OF calculates testruns), you can see that generally speaking, all of the desktop and B2G tests share a common buildid.  All of the android tests share a slightly different buildid.  This results in a given commit having at least 2 buildid's (more in the case of periodic builds or retriggers), which means at least two testruns.

I'm not sure exactly what we were parsing before bug 817269, and it's hard to say, since it isn't always obvious how those values matched platforms in the pulse stream...i.e., I'm not sure what just 'android' would have matched.
Depends on: 1057558
Depends on: 1052371
Depends on: 1044206
Product: Testing → Tree Management
The real fix here will be OF v2 using Treeherder's API.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.