Closed Bug 848834 Opened 11 years ago Closed 11 years ago

Switch OrangeFactor over to the new IT scl3 ES instance

Categories

(Tree Management Graveyard :: OrangeFactor, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file)

See bug 772503.

TBPL patch to submit data to both old and new simultaneously is in bug 848826; once that's landed and we have some data there to test with, we can test the OF patch locally and then just land it. (Not like we have much data in the old ES instance to transition now, thanks to the data loss)
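For reference, a minimal Python sketch of the dual-write idea (not the actual TBPL patch from bug 848826): the same star document gets indexed on both clusters. The index and type names below are placeholders, not the real TBPL schema.

    # Sketch only: index the same document on both the old and new ES clusters.
    # The "tbpl"/"bugs" index/type names are placeholders.
    import json
    import requests

    ES_HOSTS = [
        "http://buildbot-es.metrics.scl3.mozilla.com:9200",       # old metrics cluster
        "http://elasticsearch-zlb.webapp.scl3.mozilla.com:9200",  # new IT scl3 cluster
    ]

    def submit_star(doc, index="tbpl", doc_type="bugs"):
        """Index the same starring document on every configured cluster."""
        for host in ES_HOSTS:
            requests.post("%s/%s/%s/" % (host, index, doc_type),
                          data=json.dumps(doc))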
Attached patch Patch v1 (Splinter Review)
We'll need to change the not-checked-in server config file to the new value too; but thought it made sense to keep the example one up to date to save unnecessary confusion for people testing locally in the future :-)
Attachment #722317 - Flags: review?(mcote)
(In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> We'll need to change the not-checked-in server config file to the new value
> too

To avoid confusion: This step will need to wait until bug 848826 is in production.
Comment on attachment 722317 [details] [diff] [review]
Patch v1

Review of attachment 722317 [details] [diff] [review]:
-----------------------------------------------------------------

Cool cool.  Let me know when I should make the change to the live config file.
Attachment #722317 - Flags: review?(mcote) → review+
Depends on: 848870
(In reply to Mark Côté ( :mcote ) from comment #3)
> Cool cool.  Let me know when I should make the change to the live config
> file.

Sure, will do. I've tried to test the TBPL part (bug 848826) to put some data in the new DB, so I can then test the OrangeFactor change locally. However, MPT-VPN access isn't sufficient for the new ES instance, so I've had to file bug 849161 to see if they'll let me have access to SCL3 VPN.

Presuming that goes ok, I'll test + land the TBPL part, then let you know we can switch.

In the meantime I've landed the example config change:
https://hg.mozilla.org/automation/orangefactor/rev/e81972ea9bec
Depends on: 849161
(Still waiting on bug 849161)
Whiteboard: [Waiting on bug 849161]
Depends on: 860256
(In reply to Mark Côté ( :mcote ) from comment #3)
> Cool cool.  Let me know when I should make the change to the live config
> file.

There doesn't seem to be a decent way around bug 860256 other than getting the client to submit to both, so let's just go ahead with this and deal with bug 860256 next time we need to do significant OF backend work (and in the meantime we can tunnel through brasstacks to prod).

Please can you update the live file to the ES server value that is in the example config :-)
Flags: needinfo?(mcote)
Whiteboard: [Waiting on bug 849161]
mcote: edmorley: yikes that broke OF
mcote: having trouble figuring out what the problem is
mcote: okay reverting it
edmorley: the ES server version is newer on the new ES server, wonder if that has anything to do with OF not working...
edmorley: bzcache index is empty on the new server, guess we need to run the manually populate script?
mcote: hm I'm not sure how that part works actually
mcote: yeah you/I/we will have to do some debugging here
edmorley: I've also spotted two parts of the bzcache repo that reference the old ES server
May not be the cause of the problems, but we'll need to change these at some point too, to elasticsearch-zlb.webapp.scl3.mozilla.com:9200:
http://hg.mozilla.org/users/jgriffin_mozilla.com/bzcache/file/f0180691151f/bzcache/bz_cache_refresh.py#l9
http://hg.mozilla.org/users/jgriffin_mozilla.com/bzcache/file/f0180691151f/bzcache/bzcache.py#l7
Depends on: 869648
Depends on: 869652
So the problem is just that the "logs" index didn't exist.  This is because the logparser hasn't been writing to the new database(s).  jgriffin created the index, but OrangeFactor won't work with a blank logs index, since it can't correlate oranges to test runs.  So we will need to patch it (and bzcache) to write to the existing instance, as well as the new prod and dev instances, as discussed in bug 860256.
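(For anyone checking the new cluster by hand: assuming the cluster supports HEAD requests on an index, a quick existence check looks like the sketch below; it is not part of OrangeFactor itself.)

    # Quick check for whether the "logs" index exists on the new cluster; a
    # HEAD request against an index returns 200 if present, 404 if not.
    import requests

    NEW_ES = "http://elasticsearch-zlb.webapp.scl3.mozilla.com:9200"

    resp = requests.head("%s/logs" % NEW_ES)
    print("logs index exists" if resp.status_code == 200 else "logs index missing")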
Flags: needinfo?(mcote)
(In reply to Mark Côté ( :mcote ) from comment #9)
> jgriffin created the index, but OrangeFactor won't work with a blank logs
> index, since it can't correlate oranges to test runs. 

I've often wondered this, but why does OF even need to parse the logs at all? The user star data from TBPL contains all the required information, so the logs shouldn't be required - unless I'm missing something?

> patch it (and bzcache) to write to the existing instance, as well as the new
> prod and dev instances, as discussed in bug 860256.

To keep things simple, I'm happy not writing to the dev instance for now, given our workaround on IRC. (We'll have to figure out the TTL story etc otherwise).
(CCing jgriffin for the first part of comment 10).
I can answer that--we need to know how many test runs we have done in total, including those that have no oranges, which, by definition, would not be recorded in TBPL star data.  For an accurate count of oranges per test run, we need to be sure we know about all test runs (and the platform, build type, and test suites for each run).  We parse test logs to accomplish this.
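(In other words, the score is just oranges per test run; the arithmetic below is only an illustration, not the actual OF code, which is why an accurate total-run count, including green runs, matters.)

    # Illustration of the arithmetic, not the real OF implementation: the
    # score is oranges per test run, so green runs must be counted too.
    def orange_factor(num_oranges, num_testruns):
        if num_testruns == 0:
            return 0.0
        return float(num_oranges) / num_testruns

    # e.g. 150 starred oranges over 3000 test runs -> OrangeFactor of 0.05
    print(orange_factor(150, 3000))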
Ah of course - I was thinking it was only done for the latter, and had forgotten we need total run count in order to calculate the OrangeFactor score.

That said, I still don't think we need to parse the logs - IMO I'd trust the sheriff TBPL star data much more than the out of date regexps in OrangeFactor (also - how does it even guess which bug to match against?).
(In reply to Ed Morley [:edmorley UTC+1] from comment #13)
> That said, I still don't think we need to parse the logs 

As in, we only need to monitor the completed jobs in the pulse stream, so we can count total jobs, not download and parse the log itself.
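Something along these lines; the sketch below shows only the shape of the idea. The message keys are assumptions, not the real pulse schema, and the consumer wiring is omitted.

    # Hypothetical callback a pulse consumer would invoke for each
    # job-finished message, counting total runs per (tree, platform, suite)
    # without downloading any logs. The message keys are assumptions.
    from collections import defaultdict

    testrun_counts = defaultdict(int)

    def on_job_finished(message):
        key = (message.get("tree"),
               message.get("platform"),
               message.get("suite"))
        testrun_counts[key] += 1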
Depends on: 870559
According to jgriffin, the pass/fail data in the messages is not sufficiently accurate:

12:42 < mcote> if only we could get the necessary info from the pulse messages themselves
12:43 < mcote> jgriffin: think that would be possible?
12:43 <@jgriffin> what info?
12:43 < mcote> whatever we're parsing out of the logs :)
12:43  * mcote is not sure what that is exactly
12:44 <@jgriffin> oh, we possibly could, although it's not as reliable as parsing the logs
12:44 < mcote> jgriffin: meaning what, that we might miss some, or that the data in the pulse messages is not always right?
12:45 <@jgriffin> we would occasionally miss some
12:45 < mcote> jgriffin: the logparser also scrapes ftp then?
12:46 <@jgriffin> no, just listens to pulse, but the pass/fail data in the pulse stream isn't as accurate as parsing the logs
To be more specific, there are known flaws in our test harnesses that can cause some errors to be missed in the pass/fail/todo summaries, which is what's used by TBPL to determine orange status.

The logparser doesn't have this problem, as it parses every line in the log and accounts for every TEST-UNEXPECTED-FAILURE.

We have used this data from time to time to identify such "hidden oranges", so I'd hate to lose this data.  But, it probably wouldn't significantly impact OF either way.
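(Conceptually, the logparser approach boils down to scanning every line of the raw log for explicit failure markers, independent of the pass/fail summary; a much-simplified sketch, not the real logparser code:)

    # Simplified version of what the logparser does: scan every line of a
    # raw log and count explicit failure markers, ignoring the summary.
    def count_unexpected_failures(log_path):
        count = 0
        with open(log_path) as log:
            for line in log:
                if "TEST-UNEXPECTED-" in line:
                    count += 1
        return count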
I mean use TBPL submitted star data to determine:
* if a run was a failure
* what platform/suite/... that failure was recorded on

And pulse to record total test runs completed.

In the pre-perma-sheriff world where not everything got starred, I can understand why we tried to parse the log for failure lines - I just think it's not needed now.
(In reply to Jonathan Griffin (:jgriffin) from comment #16)
> To be more specific, there are known flaws in our test harnesses that can
> cause some errors to be missed in the pass/fail/todo summaries, which is
> what's used by TBPL to determine orange status.

Ah sorry I missed this - I think this is the root of our misunderstanding/confusion. I don't believe there are any of these now (I fixed some for mobile 6-9 months ago) - and if there are, it's a major bug that needs to be fixed. (ie: if a run failed but is showing on TBPL as green, we have major issues).
(Sorry I should learn to pause before hitting submit)

Note TBPL's job results come from buildbot, not the pass/fail summaries directly. Buildbot uses a bunch of factors, including exit code, pass/fail summaries, looking for specific infra keywords ([1]) & also looking for things like TEST-UNEXPECTED-FAIL ([2]).

[1] https://hg.mozilla.org/build/buildbotcustom/file/3e95f8313a4c/status/errors.py#l5
[2] https://hg.mozilla.org/build/buildbotcustom/file/3e95f8313a4c/steps/unittest.py#l92
https://hg.mozilla.org/build/buildbotcustom/file/3e95f8313a4c/steps/talos.py#l116
https://hg.mozilla.org/build/buildbotcustom/file/3e95f8313a4c/steps/mobile.py#l42
https://hg.mozilla.org/build/mozharness/file/6f326eb887f3/mozharness/mozilla/testing/errors.py#l47
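(The real logic lives in the buildbotcustom/mozharness files linked above; the sketch below only illustrates the general pattern of combining exit status with keyword regexes, and the sample pattern is illustrative.)

    # Illustration only of combining the signals listed above (exit code
    # plus keyword regexes); see [1] and [2] for the actual buildbot logic.
    import re

    FAILURE_PATTERNS = [
        re.compile(r"TEST-UNEXPECTED-"),
        re.compile(r"command timed out"),  # example infra-style keyword
    ]

    def job_failed(exit_code, log_text):
        if exit_code != 0:
            return True
        return any(p.search(log_text) for p in FAILURE_PATTERNS)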
Currently, the data from parsed logs is intended to be used by OF in a few different ways:

- it's used to generate the testrun count
- it's supposed to be used to display the "JSON details" column, but this hasn't worked in many months due to a flaw in the way the last ES upgrade on the metrics cluster was performed
- it used to be used to display the number of unstarred oranges, although this code doesn't seem to be hooked up to the UI any longer

Regarding the latter, this included some failures that showed up as green in TBPL, rather than orange, since the mechanism that buildbot uses to determine pass/fail doesn't always account for every TEST-UNEXPECTED-FAIL.  See bug 677964.  AFAICT from looking at all the links you mention, this is still true.  (And yes, we should fix this.)

I think there's some value in retaining this data, but that probably belongs to the domain of treeherder in the future, and not the OF logparser, since we're moving towards a single source of truth.

Because it appears that the only thing we currently use from the parsed logs is the testrun count, it makes sense to get that information from the pulse stream.  But because we have no access to the historical pulse stream, it doesn't solve the problem of how to preserve historical data during the ES switch, unless we just intend to dump the historical data altogether.  If we want to preserve that data, then moving the testrun count from the logparser to the pulse stream would actually seem to complicate things, unless I'm missing something.
I agree with everything in comment #20, except that I believe the oranges-with-no-test-run count is still working fine--we just haven't had any since the Great History Reset of 2013.  For example, trying to use OF with the new cluster (no test run count) results in "0 failures in 0 testruns (plus 2937 oranges with no daily test-run count)".
Oh wait, you're talking about the converse.  Hm I don't remember what happened there...
(In reply to Jonathan Griffin (:jgriffin) from comment #20)
> But because we have no access to the historical pulse stream, it
> doesn't solve the problem of how to preserve historical data during the ES
> switch, unless we just intend to dump the historical data altogether. If we
> want to preserve that data, then moving the testrun count from the logparser
> to the pulse stream would actually seem to complicate things, unless I'm
> missing something.

Historical data is available from https://secure.pub.build.mozilla.org/builddata/buildjson/ though it would likely require more effort to convert than a scrolling re-index using the current ES instance. I was more thinking longer term - ie: could we avoid burning CPU cycles parsing logs unnecessarily (though it may just be easier to accept the unnecessary parsing until we switch entirely to using treeherder's data store).
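A scrolling re-index along these lines is roughly what I had in mind; this sketch assumes an ES version on the old cluster that supports the scroll and bulk APIs, and the index name and batch size are illustrative.

    # Rough sketch of a scrolling re-index from the old cluster to the new
    # one; index name and batch size are illustrative.
    import json
    import requests

    OLD_ES = "http://buildbot-es.metrics.scl3.mozilla.com:9200"
    NEW_ES = "http://elasticsearch-zlb.webapp.scl3.mozilla.com:9200"
    INDEX = "logs"

    # Open a scroll cursor over everything in the old index.
    resp = requests.get("%s/%s/_search" % (OLD_ES, INDEX),
                        params={"scroll": "5m", "size": 500},
                        data=json.dumps({"query": {"match_all": {}}})).json()

    while resp["hits"]["hits"]:
        # Replay each batch into the new cluster via the bulk API.
        lines = []
        for hit in resp["hits"]["hits"]:
            lines.append(json.dumps({"index": {"_index": INDEX,
                                               "_type": hit["_type"],
                                               "_id": hit["_id"]}}))
            lines.append(json.dumps(hit["_source"]))
        requests.post("%s/_bulk" % NEW_ES, data="\n".join(lines) + "\n")

        # Pull the next batch using the scroll id from the previous call.
        resp = requests.get("%s/_search/scroll" % OLD_ES,
                            params={"scroll": "5m",
                                    "scroll_id": resp["_scroll_id"]}).json()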
I would say that it's not worth it.  The current way works well enough for now, and we have enough OF work to do. :)
No longer depends on: 860256
bzcache in the OrangeFactor installation has now been switched to using elasticsearch-zlb.webapp.scl3.mozilla.com:9200.  Once bug 870559 is resolved, we can switch OrangeFactor itself to using the new cluster and resolve this bug.
Blocks: 871889
Mark, I've toggled the config to the new ES instance, all seems to work - just checking this was sufficient?

[webtools@brasstacks1.dmz.scl3 server]$ cp orangefactor.conf orangefactor.conf.backup
[webtools@brasstacks1.dmz.scl3 server]$ cp orangefactor.conf.example orangefactor.conf
[webtools@brasstacks1.dmz.scl3 server]$ exit
logout
[root@brasstacks1.dmz.scl3 ~]# /etc/init.d/orangefactor stop; /etc/init.d/orangefactor start
stopping orangefactor                                      [  OK  ]
starting orangefactorspawn-fcgi: child spawned successfully: PID: 26098
                                                           [  OK  ]
Flags: needinfo?(mcote)
Blocks: 883218
For reference:

[webtools@brasstacks1.dmz.scl3 server]$ cat orangefactor.conf.backup

[servers]
es = buildbot-es.metrics.scl3.mozilla.com:9200
bzapi = https://api-dev.bugzilla.mozilla.org/latest/

[orangefactor]
trees = mozilla-central, mozilla-inbound, fx-team, mozilla-aurora, mozilla-beta, mozilla-esr10
trunk_trees = mozilla-central, mozilla-inbound, fx-team
exclude_platforms = win64, linuxqt, android-r7, android-r7-nothumb
exclude_tbpl_os = windows7-64, maemo4, maemo5



[webtools@brasstacks1.dmz.scl3 server]$ cat orangefactor.conf

[servers]
es = elasticsearch-zlb.webapp.scl3.mozilla.com:9200
bzapi = https://api-dev.bugzilla.mozilla.org/latest/

[orangefactor]
trees = mozilla-central, mozilla-inbound, build-system, fx-team, ionmonkey, profiling, services-central, mozilla-aurora, mozilla-beta, mozilla-b2g18, mozilla-esr17
trunk_trees = mozilla-central, mozilla-inbound, build-system, fx-team, ionmonkey, profiling, services-central
exclude_platforms = win64
exclude_tbpl_os = windows7-64
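For anyone testing locally, a config file like the one above can be read with the standard library; this is illustrative and not necessarily how OrangeFactor's own server code loads it.

    # Reading the [servers]/[orangefactor] sections shown above with the
    # stdlib parser; not necessarily how the OF server itself does it.
    from ConfigParser import ConfigParser  # Python 2 module name

    config = ConfigParser()
    config.read("orangefactor.conf")

    es_server = config.get("servers", "es")
    trees = [t.strip() for t in config.get("orangefactor", "trees").split(",")]

    print(es_server)  # elasticsearch-zlb.webapp.scl3.mozilla.com:9200
    print(trees)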
Depends on: 883825
The logparser and TBPL are still writing to both databases for a bit, in case we need to switch back, but otherwise this bug is done.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: needinfo?(mcote)
Resolution: --- → FIXED
Now that things have been proven working:

[webtools@brasstacks1.dmz.scl3 orangefactor]$ rm server/orangefactor.conf.backup
Product: Testing → Tree Management
Product: Tree Management → Tree Management Graveyard