Closed
Bug 1367362
Opened 8 years ago
Closed 7 years ago
Create a replacement to OrangeFactor that uses Treeherder as a data source
Categories
(Tree Management :: Treeherder, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: sclements)
References
Details
Attachments
(4 files, 3 obsolete files)
47 bytes, text/x-github-pull-request | emorley: review+, emorley: checkin+ | Details | Review
47 bytes, text/x-github-pull-request | camd: review+ | Details | Review
47 bytes, text/x-github-pull-request | emorley: review+, camd: review+ | Details | Review
47 bytes, text/x-github-pull-request | emorley: review+ | Details | Review
I realised that whilst we've talked about this a fair bit, we didn't yet have a bug on file.
Currently:
* Treeherder mirrors failure classifications to OrangeFactor's Elasticsearch instance
* OrangeFactor also pulls public data from hg.mozilla.org pushlog and Bugzilla
* This is made available via the OrangeFactor API / https://brasstacks.mozilla.com/orangefactor/
* For full details, see: https://wiki.mozilla.org/Auto-tools/Projects/OrangeFactor#Architecture
This is daft since the data is already available in Treeherder, and it adds another layer of indirection (each layer needs updating when adding new features) plus ageing infra to keep alive (eg bug 1356410). In addition, moving the dashboard into Treeherder will simplify any future switch to a bugless, crash-stats-like workflow, since doing so currently would require completely changing the way the documents are indexed in ES.
An MVP would likely need to:
* Be named something other than OrangeFactor (what's an orange? factor of what? etc)
* Be in Treeherder's code base (like perfherder, ...)
* Allow viewing top N intermittent failures over a certain time span
* Allow viewing individual occurrences for a specific intermittent failure
* Periodically comment on bugs, like woo_commenter.py does currently
* Periodically email the mailing list, like woo_mailer.py does currently
* ...? (there has been some discussion via email but unfortunately not to public mailing lists, so I can't link to it - boo)
Reporter | ||
Comment 1•7 years ago
|
||
Status for anyone watching this bug: due to insufficient headcount this isn't being worked on currently, however there might be a chance of making this an outreachy project.
Updated•7 years ago
|
Priority: P3 → P1
Reporter | ||
Updated•7 years ago
|
Priority: P1 → P2
Comment 2•7 years ago
|
||
Hey Joel and Geoff-- Would you guys chime in here on the absolute minimum use cases you'd need solved on the first rev of an Orange Factor replacement? We will work through designing the UI at the all-hands with our new Outreachy Intern, Sarah. So if we can get the required use-cases ahead of time, we'll be able to hit the ground running.
Thanks guys!
Flags: needinfo?(gbrown)
Updated•7 years ago
|
Flags: needinfo?(jmaher)
Comment 3•7 years ago
|
||
our entry point for stockwell triage is:
https://charts.mozilla.org/NeglectedOranges/index.html
this is important because we can determine when a bug has had no activity for the last 7 days (ignoring robots). What we do from there is access OF; a few things are important:
1) history of the bug over time with a list of instances so we can find patterns in failures per os/config and see logs to get more information
2) we don't need machine name, that is irrelevant now (except for some of the talos hardware bugs)
3) being able to view this on a timeline, minimum 30 days, up to 90+ days
4) comment in bugzilla as the current orangefactor robot does (see the sketch below):
* every day if bugs are >15 failures for the day
* every day look for 200 failures in the last 30 days, mark whiteboard as [stockwell disable-recommended]
* every 7 days comment in all bugs that have activity in the last 7 days (1+ failures)
* every 7 days adjust the whiteboard if we have [stockwell needswork] or [stockwell needswork:owner] if the failures in the last 7 days are >30
* every 7 days adjust the whiteboard to [stockwell unknown] if we have <20 failures in the last 7 days
5) ideally be able to edit the dates and bug number via the URL
6) support a redirect from orangefactor to bugzilla so existing comments in bugs will work (probably need this for 6 months, at least 3 months)
there are a lot of exceptions to the rules in comment 4.
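For illustration only (not the actual woo_commenter.py logic), here's roughly how those daily/weekly rules could be expressed - `failure_count`, `post_comment` and `set_whiteboard` are assumed helpers, not existing APIs:
```
# Minimal sketch of the daily/weekly triage rules above; failure_count(),
# post_comment() and set_whiteboard() are assumed helpers, not the real
# woo_commenter.py or Treeherder APIs.

def daily_pass(bug):
    if failure_count(bug, days=1) > 15:
        post_comment(bug, "more than 15 failures today")
    if failure_count(bug, days=30) >= 200:
        set_whiteboard(bug, "[stockwell disable-recommended]")

def weekly_pass(bug):
    weekly = failure_count(bug, days=7)
    if weekly >= 1:
        post_comment(bug, "%d failures in the last 7 days" % weekly)
    if weekly > 30:
        # adjust the needswork / needswork:owner whiteboard tag
        set_whiteboard(bug, "[stockwell needswork]")
    elif weekly < 20:
        set_whiteboard(bug, "[stockwell unknown]")
```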
Flags: needinfo?(jmaher)
Comment 4•7 years ago
|
||
I agree with comment 3, but if we are looking for simple, bare essentials, I think Ed nailed it in the Description. Top 3 needs, prioritized:
#1 * Allow viewing individual occurrences for a specific intermittent failure
** like https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1387222&endday=2017-11-16&startday=2017-11-10&tree=trunk
#2 * Allow viewing top N intermittent failures over a certain time span
** like https://brasstacks.mozilla.com/orangefactor/
#3 * Periodically comment on bugs, like woo_commenter.py does currently
** like https://bugzilla.mozilla.org/show_bug.cgi?id=1418406#c5
I'd be happy to meet Sarah at the all-hands, and/or otherwise help define requirements.
Flags: needinfo?(gbrown)
Reporter | ||
Comment 6•7 years ago
|
||
This probably doesn't affect the already planned timeline/priority, but for completeness... We've been told there is a hard cutoff of 2018-09-01 for migrating OrangeFactor out of SCL3, so will need to have completed the OrangeFactor decom (bug 1367364) by then.
This seems very doable so long as:
(a) we have at least a rough MVP by the end of Q1,
(b) in Q2+ we pro-actively identify/prioritise the remaining use-cases that block switching off OrangeFactor. From my experience with the TBPL->Treeherder transition, it's easy for this phase to drag out if issues/missing workflows are only discovered in a piecemeal manner (though hopefully with OrangeFactor's significantly smaller user-base it should be easier to directly engage with users who are having to revert to the legacy system for certain workflows, to encourage feedback early).
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → sclements313
Assignee | ||
Comment 7•7 years ago
|
||
Attachment #8947260 -
Flags: review?(emorley)
Attachment #8947260 -
Flags: review?(cdawson)
Reporter | ||
Comment 8•7 years ago
|
||
Comment on attachment 8947260 [details] [review]
Updates to BugJobMap and Bugscache models for intermittents view
I've left some comments :-)
Attachment #8947260 -
Flags: review?(emorley)
Comment 9•7 years ago
|
||
I left some comments as well. :)
Updated•7 years ago
|
Attachment #8947260 -
Flags: review?(cdawson)
Assignee | ||
Comment 10•7 years ago
|
||
Attachment #8949636 -
Flags: review?(emorley)
Reporter | ||
Comment 11•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
I've left some comments on the PR :-)
Attachment #8949636 -
Flags: review?(emorley) → feedback+
Reporter | ||
Comment 12•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
Meant to say - whilst testing this PR and trying a couple of approaches with the model changes/migrations, I hit a Django migrations bug that results in an `OperationalError` due to the `operations` in the generated migration being inverted.
I created a reduced testcase and reported upstream, here:
https://code.djangoproject.com/ticket/29123
If you hit this issue, the workaround is to manually edit the generated migration to swap the `RemoveField` and `AddField`. Though this may not be necessary if instead adjusting the PR to use the approach in the blog post mentioned here:
https://github.com/mozilla/treeherder/pull/3182#discussion_r167413596
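For illustration, a hand-edited migration might look something like this - the app label and field names are made up, not the ones from PR #3182; the point is only that the auto-generated `AddField`/`RemoveField` ordering had to be swapped by hand:
```
# Hypothetical hand-edited migration illustrating the workaround above;
# app label and field names are placeholders, not the landed code.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('model', '0001_initial'),  # placeholder dependency
    ]

    # The auto-generated file had these two operations in the opposite
    # order, which failed with an OperationalError; swapping them by hand
    # lets the migration apply.
    operations = [
        migrations.AddField(
            model_name='bugjobmap',
            name='new_field',
            field=models.IntegerField(null=True),
        ),
        migrations.RemoveField(
            model_name='bugjobmap',
            name='old_field',
        ),
    ]
```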
Updated•7 years ago
|
Priority: P2 → P1
Reporter | ||
Updated•7 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 13•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
Third time's the charm? :)
Attachment #8949636 -
Flags: review?(emorley)
Reporter | ||
Comment 14•7 years ago
|
||
Hmm so I was thinking more about what timestamp should be used for intermittent failure occurrences, and it's perhaps not as straightforward as we thought.
Each occurrence has the following time attributes (in order of when the events occur):
1) the timestamp of the push on which the job was run (Job.push.time)
2) when the job was put in the queue (Job.submit_time)
--> for some jobs this is very soon after #1
--> however for others (tests scheduled after the build completed, or say later retriggers) it may be minutes/hours/days later
3) when the job started running (Job.start_time)
--> this can be minutes/hours after #2 at peak times, due to infra load
4) when the job completed (Job.end_time)
--> minutes/hours after #3
5) when the failure was classified (BugJobMap.created)
--> minutes/hours/days after #4, particularly for repositories not monitored by the sheriffs
We need to decide:
* which should be used for bucketing/graphing the occurrences.
--> Any of 1-4 seem viable, it really depends on the use-case - for example:
- #1 would mean intermittents on retriggers get associated with the pushtimestamp of the existing push (possibly good for correlating with when regressing commits landed)
- whereas 2-4 would mean the intermittent gets associated with the more recent timestamp of the retrigger (possibly good for correlating things like "is our infra less flaky at weekends when lower load?")
* whether the BugJobMap should be slightly denormalised and contain that timestamp for performance reasons (like Perfherder does for `push_timestamp`), or whether an index on the original table field is fine
Legacy OrangeFactor uses #3 (which is why I'd suggested we use that previously), however from what I can tell Perfherder uses #1.
Will, I'm presuming the reason Perfherder uses push timestamp is since it needs to be much more closely correlated with when things landed than the actual time of the run? Would you denormalise `push_timestamp` again if starting from scratch?
Geoff/Joel, do you have any thoughts on this?
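Purely as an illustration of option #1 (not code from the PR), bucketing occurrences by push timestamp via the job → push join could look roughly like this; the import path and relation names are assumptions:
```
# Rough sketch only: count classified failure occurrences per day using the
# push timestamp (option #1), via the job -> push join. Import path and
# relation names are assumptions.
from datetime import datetime

from django.db.models import Count
from django.db.models.functions import TruncDate

from treeherder.model.models import BugJobMap  # assumed location

start, end = datetime(2018, 2, 24), datetime(2018, 3, 3)

daily_counts = (
    BugJobMap.objects
    .filter(job__push__time__range=(start, end))
    .annotate(date=TruncDate('job__push__time'))
    .values('date', 'bug_id')
    .annotate(occurrences=Count('job_id'))
    .order_by('date')
)
```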
Flags: needinfo?(wlachance)
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Reporter | ||
Comment 15•7 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #14)
> Legacy OrangeFactor uses #3
Correction - it uses a combination of #3 and #2.
I believe #2 was used (even though #3 seems superior to it) because we allow classifying jobs that are still in the pending state, and pending jobs don't have a start time. (iirc this was to allow sheriffs to mass-select still-pending/running jobs and classify them in one go, rather than needing to play whack-a-mole - which sounds like another plus of #1, since it's not affected by that)
Reporter | ||
Comment 16•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
I've left some comments on the PR - will hold off final review until the questions answered above. Sorry for the churn! :-)
Attachment #8949636 -
Flags: review?(emorley)
Comment 17•7 years ago
|
||
I am torn between push timestamp (code) and job start timestamp (infra).
The purpose here is not real-time monitoring, but trends over multi-day time ranges. There are cases where infra is highlighted via orange factor, and likewise where code is highlighted as the problem. More often, though, orangefactor shows trends that are hard to track down to either code or infra - the serious issues typically show up in regular tree sheriffing.
Given that, I am leaning more towards the push timestamp, since it aligns with perfherder and gives us the opportunity for future work where we can spot a trending issue that isn't almost perma-failing and find the root cause much faster. It also avoids retriggers causing odd spikes in the timeline - instead, spikes show up where they correlate to the code.
:gbrown is on holiday today, so you will have to wait an extra day to hear his perspective.
Flags: needinfo?(jmaher)
Comment 18•7 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #14)
> ...
> Will, I'm presuming the reason Perfherder uses push timestamp is since it
> needs to be much more closely correlated with when things landed than the
> actual time of the run? Would you denormalise `push_timestamp` again if
> starting from scratch?
Yes, that's exactly the reason why we use push timestamp. Note that you could also get this information by joining on the push table (which the performance datum also has a reference to).
If I had to do it over, my inclination would be to keep things as is (i.e. keep the push_timestamp column intact). At least half the queries to this table also want to look up the push timestamp, so I think it makes things quite a bit simpler to keep the redundant information on hand in the table and avoid the overhead of a join. But I would probably double check that with a DBA and/or run some experiments.
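To make the trade-off concrete, the two query shapes being compared look roughly like this (model/field names are assumptions, not the exact Perfherder schema):
```
# Illustrative comparison only; model/field names are assumptions.

# (a) look up the push timestamp via a join each time:
recent = PerformanceDatum.objects.filter(push__time__gte=cutoff)

# (b) read the denormalised copy kept on the datum itself (no join):
recent = PerformanceDatum.objects.filter(push_timestamp__gte=cutoff)
```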
Flags: needinfo?(wlachance)
Comment 19•7 years ago
|
||
Great discussion / I don't have anything to add.
My preference is for push timestamp, #1. I think that will be most useful most often.
Flags: needinfo?(gbrown)
Assignee | ||
Updated•7 years ago
|
Attachment #8949636 -
Flags: feedback+ → review?(emorley)
Reporter | ||
Comment 20•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
This is almost ready - couple of small tweaks and we should be good to go :-)
Attachment #8949636 -
Flags: review?(emorley) → review-
Assignee | ||
Updated•7 years ago
|
Attachment #8949636 -
Flags: review- → review?(emorley)
Reporter | ||
Comment 21•7 years ago
|
||
Comment on attachment 8949636 [details] [review]
PR#3182: Model changes
Worked well on prototype (https://github.com/mozilla/treeherder/pull/3182#pullrequestreview-97266505) - ready to merge :-)
Attachment #8949636 -
Flags: review?(emorley) → review+
Reporter | ||
Updated•7 years ago
|
Attachment #8947260 -
Attachment is obsolete: true
Comment 22•7 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/4db1aa42109ab537c8458a8628ef278d334ce86e
Bug 1367362 - Model changes to support intermittents view (#3182)
change BugJobMap bug_id field to foreign key field to create a join on
Bugscache model; add whiteboard to bugzilla, add index to Push time
field for search by date range; update associated tests
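As a rough sketch of the shape of those changes (not the code that landed - field options and names are assumptions):
```
# Rough sketch of the model changes described in the commit message above;
# exact field options and related names are assumptions, not the landed code.
from django.db import models


class Bugscache(models.Model):
    summary = models.CharField(max_length=255)
    whiteboard = models.CharField(max_length=100, blank=True, default='')  # newly added


class Push(models.Model):
    time = models.DateTimeField(db_index=True)  # index added for date-range queries


class BugJobMap(models.Model):
    job = models.ForeignKey('Job', on_delete=models.CASCADE)
    # bug_id changed from a plain integer to a FK so queries can join on
    # Bugscache (this particular part was later reverted; see below)
    bug = models.ForeignKey(Bugscache, on_delete=models.CASCADE)
```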
Reporter | ||
Updated•7 years ago
|
Attachment #8949636 -
Attachment description: Bug 1367362 updates → PR#3182: Model changes
Attachment #8949636 -
Flags: checkin+
Comment 23•7 years ago
|
||
Reporter | ||
Updated•7 years ago
|
Attachment #8951740 -
Flags: review?(cdawson)
Updated•7 years ago
|
Attachment #8951740 -
Flags: review?(cdawson) → review+
Comment 24•7 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/5052c5dfaae007011cfb8ec7e025a89b8974263c
Revert BugJobMap foreign key part of 4db1aa4 (bug 1367362) (#3236)
Since unfortunately the migration failed on stage due to:
```
IntegrityError: (1452, 'Cannot add or update a child row: a foreign
key constraint fails (`treeherder`.`#sql-1c26_1936a5`, CONSTRAINT
`bug_job_map_bug_id_013db14b_fk_bugscache_id` FOREIGN KEY (`bug_id`)
REFERENCES `bugscache` (`id`))')
```
I believe this is due to the fact that bug numbers are sometimes
manually entered when classifying failures (eg for infra failures),
meaning that not every bug ID in `BugJobMap` is for an intermittent
bug, and so doesn't exist in `Bugscache` (which is currently only
populated with bugs that have the keyword `intermittent-failure`).
This wasn't caught on prototype, since whilst that has some failure
classifications saved, none of them were hand-entered and so all
existed in `Bugscache`, satisfying the new constraint.
I've edited the existing migrations file since in most environments
it won't have already run (I'll fix up prototype/stage by hand). The
non-FK changes have been left as-is, since they are working fine.
Reporter | ||
Comment 25•7 years ago
|
||
So unfortunately I've had to revert this (see commit message above). I've hand-fixed stage (since the migration failed halfway through, which confuses Django) and used the Django `migrate` command to roll back prototype.
I'd completely forgotten that sheriffs hand-type bug numbers when classifying, and as such not all `BugJobMap` entries have a corresponding intermittent-failure bug entry in `Bugscache`. Sorry for not thinking of this before.
So I'm not 100% sure yet of the best way to deal with this (I'll be more awake Monday). One possibility is to make the "classify a failure API" always check if the bug exists in Bugscache, and if not, either fetch the bug details there and then, or else add a dummy entry which a later background task would go back and populate. For legacy entries we'd also need to go and backfill the missing bugs (probably best done via a Django management command).
However this would mean needing to think about:
* the fact that log failure messages would then possibly match against these new bugs in Bugscache, which aren't actually keyword=intermittent-failure bugs, but random "infrastructure went down today" bugs. Does this matter? Probably need to speak to sheriffs
* how we would keep these non keyword=intermittent-failure bugs up to date, given the existing bugs cache population process only fetches intermittent-failure bugs (so if someone updated a bug summary or changed the whiteboard, it wouldn't be reflected in bugscache). Though possible fixes for this (re-working how we populate bugscache) might also fix the bug ingestion performance issues (bug 1330783) so might actually be a good thing longer term
Let's chat through this at the Treeherder meeting Monday.
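As a rough sketch of the first option above (checking/creating a Bugscache row at classification time) - the field defaults and the background task are assumptions, not a settled design:
```
# Rough sketch of one option discussed above (not what eventually landed):
# when a failure is classified against a bug number that isn't in Bugscache,
# insert a placeholder row and let a background task fill in the details.
# The defaults and the fetch task are assumptions for illustration.
from django.utils import timezone

from treeherder.model.models import Bugscache  # assumed location


def ensure_bugscache_entry(bug_id):
    bug, created = Bugscache.objects.get_or_create(
        id=bug_id,
        defaults={
            'status': 'NEW',
            'summary': '(placeholder - details pending)',
            'modified': timezone.now(),
        },
    )
    if created:
        # hypothetical background task that later pulls the real bug data
        fetch_bug_details.delay(bug_id)
    return bug
```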
Reporter | ||
Comment 26•7 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #25)
> Let's chat through this at the Treeherder meeting Monday.
Ah today is a US holiday too (the all@ email the other day only mentioned Canada in the subject confusingly), so the Treeherder meeting is cancelled. We'll discuss another day.
Comment 27•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Attachment #8954225 -
Flags: review?(emorley)
Attachment #8954225 -
Flags: review?(cdawson)
Comment 28•7 years ago
|
||
Comment on attachment 8954225 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3271
This mostly looks awesome. Only setting to r- because I have a few questions and opinions we can discuss between the 3 of us when Ed's done his review as well.
Attachment #8954225 -
Flags: review?(cdawson) → review-
Reporter | ||
Comment 29•7 years ago
|
||
Comment on attachment 8954225 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3271
Left a comment :-)
Attachment #8954225 -
Flags: review?(emorley)
Assignee | ||
Comment 30•7 years ago
|
||
Attachment #8954631 -
Flags: review?(emorley)
Attachment #8954631 -
Flags: review?(cdawson)
Comment 31•7 years ago
|
||
Comment on attachment 8954631 [details] [review]
PR#3271: Add new APIs
One more nit, but otherwise looks great to me!
Attachment #8954631 -
Flags: review?(cdawson) → review+
Reporter | ||
Comment 32•7 years ago
|
||
Comment on attachment 8954631 [details] [review]
PR#3271: Add new APIs
Forgot to update the bug earlier :-)
Attachment #8954631 -
Flags: review?(emorley) → review-
Assignee | ||
Comment 33•7 years ago
|
||
Attachment #8955369 -
Flags: review?(emorley)
Reporter | ||
Comment 34•7 years ago
|
||
Comment on attachment 8955369 [details] [review]
Bug 1367362 endpoint updates
Looks good to me! Could you add this as the first commit in your intermittents_ui branch (and rebase on master), so I can deploy to prototype to test it out? Presuming the API parts work fine then, we can merge them and iterate on the UI.
Attachment #8955369 -
Flags: review?(emorley) → review+
Comment 35•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Attachment #8955714 -
Flags: review?(emorley)
Attachment #8955714 -
Flags: review?(cdawson)
Reporter | ||
Comment 36•7 years ago
|
||
Since we're still waiting on bug 1442369, I created a new RDS instance from the latest prod snapshot, named `treeherder-dev-bug1367362`, and have updated the treeherder-prototype heroku app's DATABASE_URL to point at that rather than the existing dev RDS instance.
I've also deployed the `intermittents_ui` branch to prototype - so the new UI can be seen here:
https://treeherder-prototype.herokuapp.com/intermittentsview.html
(The API requests made by the page currently time out - so some adjustment to the queries made by the APIs will be needed)
Reporter | ||
Comment 37•7 years ago
|
||
Looking at the MySQL slow query log via the AWS control panel, I see:
# Query_time: 90.017612 Lock_time: 0.000197 Rows_sent: 0 Rows_examined: 6505
SELECT COUNT(*) FROM (SELECT `bug_job_map`.`bug_id` AS Col1, COUNT(`bug_job_map`.`job_id`) AS `bug_count` FROM `bug_job_map` INNER JOIN `job` ON (`bug_job_map`.`job_id` = `job`.`id`) INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`) WHERE (`job`.`failure_classification_id` = 4 AND `job`.`repository_id` IN (1, 2, 77, 14) AND `push`.`time` BETWEEN '2018-02-24 00:00:00' AND '2018-03-03 23:59:59.999999') GROUP BY `bug_job_map`.`bug_id` ORDER BY NULL) subquery;
# Query_time: 90.097798 Lock_time: 0.000083 Rows_sent: 0 Rows_examined: 31480
SELECT DATE(`push`.`time`) AS `date`, COUNT(`job`.`id`) AS `failure_count` FROM `job` INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`) WHERE (`push`.`time` BETWEEN '2018-02-24 00:00:00' AND '2018-03-03 23:59:59.999999' AND `job`.`repository_id` IN (1, 2, 77, 14) AND `job`.`failure_classification_id` = 4) GROUP BY DATE(`push`.`time`) ORDER BY `date` ASC;
I would start by looking at the EXPLAIN <QUERY> of each :-)
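For reference, one way to run those EXPLAINs from a Django shell (paste the full statements from the slow query log in place of the placeholder):
```
# From a Django shell: run EXPLAIN on one of the slow queries above to see
# which indexes (if any) MySQL picks.
from django.db import connection

sql = "SELECT COUNT(*) FROM ..."  # paste the full slow query here

with connection.cursor() as cursor:
    cursor.execute("EXPLAIN " + sql)
    for row in cursor.fetchall():
        print(row)
```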
Assignee | ||
Comment 38•7 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #37)
> Looking at the MySQL slow query log via the AWS control panel, I see:
>
> # Query_time: 90.017612 Lock_time: 0.000197 Rows_sent: 0 Rows_examined: 6505
> SELECT COUNT(*) FROM (SELECT `bug_job_map`.`bug_id` AS Col1,
> COUNT(`bug_job_map`.`job_id`) AS `bug_count` FROM `bug_job_map` INNER JOIN
> `job` ON (`bug_job_map`.`job_id` = `job`.`id`) INNER JOIN `push` ON
> (`job`.`push_id` = `push`.`id`) WHERE (`job`.`failure_classification_id` = 4
> AND `job`.`repository_id` IN (1, 2, 77, 14) AND `push`.`time` BETWEEN
> '2018-02-24 00:00:00' AND '2018-03-03 23:59:59.999999') GROUP BY
> `bug_job_map`.`bug_id` ORDER BY NULL) subquery;
>
> # Query_time: 90.097798 Lock_time: 0.000083 Rows_sent: 0 Rows_examined: 31480
> SELECT DATE(`push`.`time`) AS `date`, COUNT(`job`.`id`) AS `failure_count`
> FROM `job` INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`) WHERE
> (`push`.`time` BETWEEN '2018-02-24 00:00:00' AND '2018-03-03
> 23:59:59.999999' AND `job`.`repository_id` IN (1, 2, 77, 14) AND
> `job`.`failure_classification_id` = 4) GROUP BY DATE(`push`.`time`) ORDER BY
> `date` ASC;
>
> I would start by looking at the EXPLAIN <QUERY> of each :-)
Are there differences between the db mirror I'm using and the db instance/environment you're using for prototype? The latter should still have the same indexes (the new one added to push date), right? Here's what I get with the same queries on my local (using the db mirror):
mysql> SELECT COUNT(*) FROM (SELECT `bug_job_map`.`bug_id` AS Col1, COUNT(`bug_job_map`.`job_id`) AS `bug_count` FROM `bug_job_map` INNER JOIN `job` ON (`bug_job_map`.`job_id` = `job`.`id`) INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`) WHERE (`job`.`failure_classification_id` = 4 AND `job`.`repository_id` IN (1, 2, 77, 14) AND `push`.`time` BETWEEN '2018-02-24 00:00:00' AND '2018-03-03 23:59:59.999999') GROUP BY `bug_job_map`.`bug_id` ORDER BY NULL) subquery;
1 row in set (0.99 sec)
* actually more like 2000ms in the ORM because of the group by count (necessary for correctly ordered results)
mysql> SELECT DATE(`push`.`time`) AS `date`, COUNT(`job`.`id`) AS `failure_count` FROM `job` INNER JOIN `push` ON (`job`.`push_id` = `push`.`id`) WHERE (`push`.`time` BETWEEN '2018-02-24 00:00:00' AND '2018-03-03 23:59:59.999999' AND `job`.`repository_id` IN (1, 2, 77, 14) AND `job`.`failure_classification_id` = 4) GROUP BY DATE(`push`.`time`) ORDER BY `date` ASC;
8 rows in set (0.84 sec)
* also 2000-ish ms in the ORM, as the group by date is needed for pagination
Can you give me access to the prototype db instance so I can check the explain query there?
Reporter | ||
Comment 39•7 years ago
|
||
The new instance is an m4.xlarge, the same as the read-only replica. The sclements user/password should work on the new instance too -- replace the instance name `treeherder-prod-ro` with `treeherder-dev-bug1367362`.
However the queries now succeed without timing out - I'm guessing this is due to the way EBS initialisation works. The RDS "restore from snapshot" docs don't say this explicitly, but the generic EBS docs mention lazy initialisation:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html
ie even after the instance is created, not all of the data has been pulled down from S3 to EBS yet, so initial reads are slow. And given this query is fairly intensive as queries go (not an issue for now, and something we can and were planning on iterating on in the future), that meant it hit the timeout.
So thankfully this has fixed itself :-)
Reporter | ||
Updated•7 years ago
|
Attachment #8954225 -
Attachment is obsolete: true
Reporter | ||
Updated•7 years ago
|
Attachment #8955369 -
Attachment is obsolete: true
Reporter | ||
Updated•7 years ago
|
Attachment #8951740 -
Attachment description: Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3236 → PR#3236: Revert BugJobMap foreign key part of #3182
Reporter | ||
Comment 40•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
I left some initial comments on Friday and some more just now - looks great overall though!
(I've obsoleted some of the duplicate attachments - when requesting re-review existing attachments can be re-used, no need to attach another)
Attachment #8955714 -
Attachment description: Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3296 → PR#3296: Intermittents view UI
Attachment #8955714 -
Flags: review?(emorley) → review-
Reporter | ||
Updated•7 years ago
|
Attachment #8954631 -
Attachment description: Bug 1367362 endpoint updates → PR#3271: Add new APIs
Attachment #8954631 -
Flags: review- → review+
Comment 41•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
I did an initial review and found a couple things, but will continue to look over the rest today.
Looks really nice and well done! The animations on the graph redraw are really nice, I must say. :) This is looking really great so far!
Attachment #8955714 -
Flags: review?(cdawson) → review-
Assignee | ||
Updated•7 years ago
|
Attachment #8955714 -
Flags: review?(emorley)
Attachment #8955714 -
Flags: review?(cdawson)
Attachment #8955714 -
Flags: review-
Comment 42•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
I commented on a couple little nit-picks. But this looks great to me! Ship it! :)
Attachment #8955714 -
Flags: review?(cdawson) → review+
Reporter | ||
Comment 43•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
Last couple of changes (some due to the new ESLint rules on master; sorry!) and we should be good to go! :-)
Attachment #8955714 -
Flags: review?(emorley) → review-
Assignee | ||
Updated•7 years ago
|
Attachment #8955714 -
Flags: review- → review?(emorley)
Reporter | ||
Comment 44•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
I've had a look and seems great! We might need a tweak to the naming depending on the response to:
https://groups.google.com/forum/#!topic/mozilla.tools.treeherder/_m4sscUO2Po
...other than that should be good to merge :-)
Attachment #8955714 -
Flags: review?(emorley)
Assignee | ||
Updated•7 years ago
|
Attachment #8955714 -
Flags: review+ → review?(emorley)
Reporter | ||
Comment 45•7 years ago
|
||
Comment on attachment 8955714 [details] [review]
PR#3296: Intermittents view UI
Woohoo! :-D
Attachment #8955714 -
Flags: review?(emorley) → review+
Comment 46•7 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/30df0aae6c94183391fb6311075a90ba91bac405
Bug 1367362 - Add an intermittent failures view (#3296)
Adds a new view to display intermittent test failure occurrences,
to replace the functionality currently provided by the legacy
OrangeFactor tool.
Includes the new API endpoints originally reviewed in #3271.
Reporter | ||
Comment 47•7 years ago
|
||
This is now deployed to production :-)
https://treeherder.mozilla.org/intermittent-failures.html
And I've created a new Bugzilla component for it in bug 1445262 (don't forget to adjust component watching and saved bug queries/whines):
https://bugzilla.mozilla.org/describecomponents.cgi?product=Tree%20Management#Intermittent%20Failures%20View
I'll comment in bug 1367364 with what steps are left before OrangeFactor can be decommissioned.
Thank you for all your hard work Sarah!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter | ||
Updated•7 years ago
|
Component: Treeherder → Intermittent Failures View
Updated•3 years ago
|
Component: Intermittent Failures View → TreeHerder