Closed Bug 943611 Opened 8 years ago Closed 8 years ago

Datazilla: Generate datapoints based on commit range

Categories

(Firefox OS Graveyard :: Performance, enhancement, P1)

x86
macOS
enhancement

Tracking

(Not tracked)

RESOLVED INCOMPLETE
2.0 S5 (4july)

People

(Reporter: kgrandon, Assigned: davehunt)

References

Details

(Keywords: perf, Whiteboard: [c=automation p= s= u=])

We need a way to narrow down commit ranges when we encounter a performance regression. One option would be to implement an interface that would allow us to run device-based tests based on a commit range.

I'm sure there's more, but the basic information we would need would be:

- Device
- Branch
- Starting/Ending Commits
Summary: Generate datapoints based on range → Datazilla: Generate datapoints based on commit range
Whiteboard: [c= p= s= u=] → [c=automation p= s= u=]
Component: User interface → General
Product: Datazilla → Firefox OS
Blocks: 943594
Blocks: 942893
I've added a few bugs which block this so we have insights into what kinds of regression ranges this would help us triage.
No longer blocks: 943594
No longer blocks: 942893
Hi Jonathan,

When will A*Team be able to work on this? Is this being targeted for Q1?

Thanks,
Mike
Yes, this is targeted for Q1, probably will begin work on this in February.
Jonathan,

We're currently doing sprint planning for our next sprint, Feb 3 - 14. Who on your team can work on this during that time?

Thanks,
Mike
Flags: needinfo?(jgriffin)
Priority: -- → P1
Dave is going to do this, but he won't be back until about Feb 12.  Is it urgent that it gets worked on sooner?
Flags: needinfo?(jgriffin)
No we can wait for him to get back. I'll move this into the following sprint Feb 17 - 28. Thanks.
Bug 966586 is another bug that would benefit from this tooling.
Dave is going to start working on this.

Potentially we might need to bisect over either gecko or gaia, and those will require different tooling.  I think starting with gecko probably makes the most sense, since manual bisection over gecko is more difficult.  Mike, do you agree?

This work will have some dependencies...we'll have to make sure RelEng is storing per-commit device builds for trees and devices we care about.  Normally, will those trees be b2g-inbound and mozilla-central?

Also, we'll have to do some datazilla work to make it possible to interleave bisection data correctly with existing data.  Bug 974860 might be relevant here.
Flags: needinfo?(mlee)
It's hard for me to say that either Gecko or Gaia is more relevant here re: bisection. We get regressions caused by code in both.

This is essentially the same issue as the bisector Clint's been working on based on John Ford's tool. That tool looks to do a full chronological expansion of paired repo commits and then bisects across that. (https://github.com/jhford/bisect_b2g/blob/master/README.md) Whatever we do for that can (and probably should) be done here as the primary option. I've cc'd Clint with his comments on this in mind.

If we're allowing options, I think there's also value in logic that bisects Gaia primarily and pulls the last Gecko commit chronologically previous to the Gaia commit to match, as well as an option to bisect Gecko primarily and pull the first Gaia commit chronologically after the Gecko commit. 

That gets the most likely intentionally matched pairs with any given app code depending on most recent platform code, and is a shorter bisect for including Gaia + later Gecko commits. It'll be especially useful if we're pretty certain where the problem is re: Gaia vs. Gecko since there'll be less reason to test the same Gaia with multiple Geckos or vice versa.

Options to freeze on a particular Gecko commit and just bisect Gaia over it or vice-versa might be interesting. That's what I think you're suggesting, but could be misunderstanding. I'm unsure whether the resultant builds could be considered valid, however, so it's not the first way I'd go. The further you get away from the chronologically matched pair, the more likely the code wasn't intended to be together.

Really, though, the best answer is that any Gaia commit should include a file with a reference to the Gecko commit it was written against. If we had that you could just bisect Gaia and pull Gecko via that reference and that would give you valid bisection of the full stack.

But that also reflects an attitude that the app layer should make an intentional choice to upgrade to a different library version (i.e. if you want to go to the next Gecko commit, explicitly bump the reference in the Gaia repo) which may or may not align with the rest of our strategy. It is, however, the best way to make sure you always bisect valid builds IMO.
(In reply to Geo Mealer [:geo] from comment #9)
> That gets the most likely intentionally matched pairs with any given app
> code depending on most recent platform code, and is a shorter bisect for
> including Gaia + later Gecko commits. It'll be especially useful if we're

Oops, typoed that and flipped my meaning. It's a shorter bisect for *not* including Gaia commits paired with a later Gecko commit.
I think it's worth discussing in more detail how we envision this to work.

Currently, b2gperf isn't necessarily run against chronologically valid pairs, AFAIK.  The Jenkins jobs that run b2gperf use gecko from the latest nightly, against the most recent gaia, several times per day.  (I.e., gaia.json is ignored.)

Whenever a regression is noted that someone wants to bisect, there will almost always be differences in gaia commit, and possibly in the gecko commit, if the regression happened between different nightlies.

In the former case, we could probably safely bisect over just gaia; in the latter case, perhaps we'd want to bisect over valid gaia/gecko pairs, starting with the earliest gecko commit before the regression, and ending with the most recent gaia commit after the regression, which would yield the largest range of combinations.

If the tool you pointed to could give us a list of those combinations, and I think it can, that would be helpful; I'm not sure whether its model of evaluating bisection results would be useful here.
The fact that b2gperf might not be pairing in a valid way doesn't fill me with joy either, but what you describe is basically valid. You want either gaia+last gecko or gecko+last gaia, since either one gives you a pair that existed at some point in real life. latest+latest will be one or the other.

John's tool gives you both sets. So, if you check in Gaia1, Gecko1, Gecko2, Gaia2, Gecko3, Gaia3 you get:

Gaia1+Gecko1 (gecko + last gaia)
Gaia1+Gecko2 (gecko + last gaia)
Gecko2+Gaia2 (gaia + last gecko)
Gecko3+Gaia2 (gecko + last gaia)
Gaia3+Gecko3 (gaia + last gecko)

That does represent a full list of all pairs that existed in real life.
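As a minimal sketch of the expansion described above (not John's actual implementation), the pairs can be produced by walking the landings in order and pairing each new commit with the latest commit from the other repo. The `(repo, revision)` input format and the function name are assumptions made for illustration.

```python
def expand_pairs(commits):
    """Return (gaia, gecko) pairs that coexisted, in landing order.

    `commits` is a chronological list of (repo, revision) tuples, where
    repo is "gaia" or "gecko". This input format is hypothetical.
    """
    latest = {"gaia": None, "gecko": None}
    pairs = []
    for repo, revision in commits:
        latest[repo] = revision
        # A pair only exists once both repos have at least one commit.
        if latest["gaia"] is not None and latest["gecko"] is not None:
            pairs.append((latest["gaia"], latest["gecko"]))
    return pairs


history = [
    ("gaia", "Gaia1"), ("gecko", "Gecko1"), ("gecko", "Gecko2"),
    ("gaia", "Gaia2"), ("gecko", "Gecko3"), ("gaia", "Gaia3"),
]
pairs = expand_pairs(history)
```

For the check-in order used in this comment, `pairs` holds the same five combinations listed above, in the same order.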

Re: the balance of your logic re: only keying on gecko or gaia changes, I think that includes an assumption we're only bisecting between nightlies. I think we'd want to generalize this to any two commits. With that in mind, John's logic is probably the most general. In your case where only one changed:

Gaia1
Gecko1
Gecko2
Gecko3
Gecko4

...you'd get:

Gaia1+Gecko1
Gaia1+Gecko2
Gaia1+Gecko3
Gaia1+Gecko4

...which is what you have in mind, I think.

The other comments I made about either only bisecting across Gaia or Gecko or freezing stuff may amount to smoke and optimization. I'd start with this since it's the most valid, and then I'd probably experiment with variations after deploying this.

Aside from that, what I'm saying more than anything is bisection is coming up in a number of contexts at this point, and I think solving it the same way everywhere makes a ton of sense. Either we actually put in pointers, in which case we have a source of truth, or decide on a most-valid heuristic. 

I was a little shaky on John's at first, but having analyzed it further now I think he's right on. You can optimize where there's a voluntary dependency between app and library, per my last comment, but since we don't have that right now the expansion works best.
(In reply to Jonathan Griffin (:jgriffin) from comment #11)

> I'm not sure whether its model of
> evaluating bisection results would be useful here.

Re: this, btw, the tool is relatively process agnostic. 

At its most basic, it actually just updates your repos then runs a verification script (presumably by grepping code or whatever, though I think it can run an interactive shell as well). Clint has been improving it to allow a build script to be hooked in after the repo updates. 

Theoretically, you could pass control to it and have the verification script run the performance tests in question, perhaps comparing it to a known-good threshold or whatever. How to figure out what is a "big enough" regression to stop the bisect vs. natural variation will probably be a pain no matter what we do, especially since performance tests reflect a bit of nondeterministic behavior.

Right answer may be to separate out the changeset generation engine into its own module, though, if it's not already and then use it from various tools like this one.
What we were initially thinking was to have a Jenkins job that would take a revision and submit the results to datazilla. The results would ideally show up between the original datapoints so that they can be compared with the good/bad points, though this would likely be a manual task initially.

The job could even take the good/bad revisions and run a script to determine suitable intermediate revisions for testing, which sounds a bit like the tool mentioned in comment 9. It could then automatically run the tests on a handful of commits evenly spaced between good and bad, again for manual review via the datazilla dashboard. This process could be repeated as much as necessary to narrow the regression range.
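A rough sketch of the "handful of commits evenly spaced between good and bad" step, assuming the intermediate revisions are already available as an ordered list that excludes the good and bad endpoints. The function name and the integer spacing rule are illustrative choices, not something the Jenkins job is known to use.

```python
def pick_evenly_spaced(revisions, count):
    """Pick up to `count` revisions spread evenly through `revisions`.

    `revisions` is the ordered list of candidates strictly between the
    known-good and known-bad revisions (hypothetical input format).
    """
    total = len(revisions)
    if count <= 0:
        return []
    if count >= total:
        return list(revisions)
    # Split the range into count + 1 roughly equal gaps and take the
    # revision at each internal boundary.
    indexes = [(i + 1) * (total + 1) // (count + 1) - 1 for i in range(count)]
    return [revisions[i] for i in indexes]
```

Running the same function again on the narrower range between the last good and first bad result is the "repeat as much as necessary" part.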

While we can relatively easily reset gaia to any specified revision, we want to avoid building gecko on the Jenkins node, so having per-commit builds available would be necessary, as mentioned in comment 8.

Currently datazilla does not display data based on time submitted or time of revision, so we will need to enhance this to either somehow plot based on a combined gecko/gaia timestamp or allow the user to pick which of these to track against, and plot based on that repository's timestamps.
Mike: Any chance you could provide feedback?
Geo: Any thoughts on comment 14?
Flags: needinfo?(gmealer)
The datazilla work necessary to display this well has been added as https://www.pivotaltracker.com/story/show/66375364.   Shorter-term, this might need to be done without datazilla visualization.
(In reply to Jonathan Griffin (:jgriffin) from comment #8)
> we'll have to make sure RelEng is storing per-commit device builds for trees
> and devices we care about.

We're currently running the performance tests on mozilla-central nightly builds against inari and hamachi devices. We have tinderbox builds available for b2g-inbound-hamachi-eng and b2g-inbound-inari-eng as well as the same for mozilla-inbound. For mozilla-central we only have mozilla-central-hamachi-eng and nothing for inari.

Datazilla currently displays the git changeset from sources.xml for gecko rather than the mercurial one. We now have access to the mercurial revision via mozversion if we want to switch this over.

Either way, once we have the range we need to be able to download the appropriate tinderbox build by revision. The only way I can see to do this is to examine the source.xml files on pvtbuilds. Jonathan: Do you know if there's another way to achieve this?

I'm currently looking into setting something up using the bisect_b2g tool by John Ford as mentioned by Geo in comment 9.
Flags: needinfo?(jgriffin)
(In reply to Dave Hunt (:davehunt) from comment #17)
> 
> I'm currently looking into setting something up using the bisect_b2g tool by
> John Ford as mentioned by Geo in comment 9.
That's a good tool, but it can get a little wonky with merge commits in the stream when it tries to create its unified history. (If you're only bisecting one repo at a time, and if you skip merge commits, it works superbly.) John and I were talking the other day about creating a server that would store all the landings for gecko, gaia, etc. and would enable us to more easily craft a unified history. Just wanted to give you a heads-up on that. For details, I'd recommend sitting down with jhford.
Depends on: 979554
(In reply to Dave Hunt (:davehunt) from comment #17)
> (In reply to Jonathan Griffin (:jgriffin) from comment #8)
> Either way, once we have the range we need to be able to download the
> appropriate tinderbox build by revision. The only way I can see to do this
> is to examine the source.xml files on pvtbuilds. Jonathan: Do you know if
> there's another way to achieve this?

I've filed bug 979554 to have B2G device build URLs added to builds-4hr.js; once we have this we can expose it via Treeherder or a separate mapping service.
Flags: needinfo?(jgriffin)
(In reply to Dave Hunt (:davehunt) from comment #14)
> What we were initially thinking was to have a Jenkins job that would take a
>snip<

Based on how I understand the problem: finding a narrower regression range between a known-good and known-bad build showing on datazilla, I think you're on target here.

I think I made the problem too complicated up above, and we probably don't need to bisect. I think we can reuse the tinderbox builds we'll be storing for doing functional QA regression range finding and just run all the tinderbox builds between two days.

So, yeah, what you said, but concretely:

Identify the first bad day (point of regression). Run that test against all the tinderbox builds between the build tested the day before and the build tested that day and graph the results. Basically, drill down between nightly builds using tinderbox builds.

That'll narrow the regression within a few hours, and it'll be against known-valid build sets. *If* that's not sufficient, I think we should add logic for bisecting, but that's a lot better than what we have now and where we should start IMO.

We've been having a version of the same discussion with similar conclusions around functional regression ranges. I don't see any reason to do things differently here. Only difference is not bisecting because it'll be hard for the system to understand when to step, but easy for the human to look at the graph and see the jump.

Added Kevin for needinfo, since he's really the one that should be vetting the solution.
Flags: needinfo?(gmealer) → needinfo?(kgrandon)
(In reply to Geo Mealer [:geo] from comment #20)
> Only difference is not bisecting because it'll be hard for
> the system to understand when to step, but easy for the human to look at the

Oops, I mean understand when to stop (natural variance makes it hard to 100% identify the point of regression).
I'm not sure if I have enough understanding of our tinderbox builds, but if we have a build for each and every gecko commit - it seems like we should absolutely be able to use those. I think you guys will ultimately be able to design a much better solution to the problem given your knowledge of the current tooling than I could though.

My original thought was that, given two datazilla points, it would show you a datapoint for every commit, played in chronological order, probably mapping to the same order as: http://hg.mozilla.org/mozilla-central/shortlog. Unfortunately I'm still relatively uninformed when it comes to the majority of our gecko code and workflows, so sorry I can't be of more help here.
Flags: needinfo?(kgrandon)
In datazilla it's easy enough to grab the gecko and gaia revisions for the good/bad builds. With the gaia revisions it's easy enough to checkout that revision and reset gaia on the device. With the gecko revisions, however, it's not so easy to associate a revision with a build.

(In reply to Geo Mealer [:geo] from comment #20)

> (In reply to Dave Hunt (:davehunt) from comment #14)
> I think I made the problem too complicated up above, and we probably don't
> need to bisect. I think we can reuse the tinderbox builds we'll be storing
> for doing functional QA regression range finding and just run all the
> tinderbox builds between two days.

If I understand this correctly, we'd need to identify the timestamp of the good/bad gecko builds, and from that identify all of the tinderbox builds that were produced between those times. Would this be limited to mozilla-central builds or would it include mozilla-inbound or b2g-inbound? Determining a timestamp wouldn't be too difficult, but collecting the appropriate tinderbox builds might be more of a challenge.

> That'll narrow the regression within a few hours, and it'll be against
> known-valid build sets. *If* that's not sufficient, I think we should add
> logic for bisecting, but that's a lot better than what we have now and where
> we should start IMO.

What I was starting to look into was using John's tool to create a combined history, allowing whoever is triggering the 'bisect' to specify how many revisions to test, and selecting equally spaced history combinations to run the tests against. We would need some way to grab a tinderbox build from a revision, and allow for the possibility that a match may not be found (perhaps trying the next/previous revision until one is matched).
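The "try the next/previous revision until one is matched" fallback could look roughly like the sketch below. The `has_build` callback is a hypothetical stand-in for whatever lookup maps a revision to a tinderbox build; no such lookup existed at the time of this comment.

```python
def nearest_with_build(revisions, index, has_build):
    """Return the revision closest to revisions[index] that has a build.

    Searches outward one step at a time, checking the previous revision
    before the next one at each distance. Returns None if nothing in
    the list has a build. `has_build` is a hypothetical callable that
    takes a revision and returns True or False.
    """
    for offset in range(len(revisions)):
        for candidate in (index - offset, index + offset):
            if 0 <= candidate < len(revisions) and has_build(revisions[candidate]):
                return revisions[candidate]
    return None
```

Preferring the earlier revision on a tie is an arbitrary choice here; either direction keeps the selected build inside the good/bad range as long as the list itself is bounded by that range.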

I think resetting gaia is optional though, and we may just want to use the tinderbox builds as suggested by the recent comments. If this is the case, John's tool may not be needed here.
(In reply to Kevin Grandon :kgrandon from comment #22)
> I'm not sure if I have enough understanding of our tinderbox builds, but if
> we have a build for each and every gecko commit - it seems like we should
> absolutely be able to use those. I think you guys will ultimately be able to
> design a much better solution to the problem given your knowledge of the
> current tooling than I could though.

It wouldn't be every commit. It'd be every commit tested for TBPL, which means commits are chunked together while tests are running IIUC. But it's a lot less expensive and error-prone than a per-commit solution--any combined history solution involves a bit of guesswork and logic--which is why I'm suggesting it as a first step.

(In reply to Dave Hunt (:davehunt) from comment #23)
> In datazilla it's easy enough to grab the gecko and gaia revisions for the
> good/bad builds. With the gaia revisions it's easy enough to checkout that
> revision and reset gaia on the device. With the gecko revisions, however,
> it's not so easy to associate a revision with a build.

Yeah, we've discussed the need for a two-way pointer but don't have it yet. That's why I keep banging on the "known valid combinations" bit.

> If I understand this correctly, we'd need to identify the timestamp of the
> good/bad gecko builds, and from that identify all of the tinderbox builds
> that were produced between those times. Would this be limited to
> mozilla-central builds or would it include mozilla-inbound or b2g-inbound?
> Determining a timestamp wouldn't be too difficult, but collecting the
> appropriate tinderbox builds might be more of a challenge.

So, a couple of things. 

If we're dealing with datazilla and especially with real hardware, we shouldn't think "gecko", or "gaia" for that matter, we should think "stack." I think focusing on one half or the other makes sense from what devs want to concentrate on, but since it's not what datazilla's actually showing I think it might mislead us. We want interim stacks.

But re: your direct question, we have an inbound tree and a central tree to work from. I think there are a few valid heuristics: use the one with the most builds, use the one with the most-valid builds, use the tree used on datazilla, combine both into a unified chrono history, probably a few others I'm not considering. The trees themselves are laid out chronologically, so the actual grabbing A->B bit shouldn't be hard.

I am making one assumption that I realized might be faulty: can we deploy those builds directly, assuming they're engineering builds for the correct target? Or do you do magic before building? Because that's one reason I wanted to do this--saves a ton of time from having to rebuild every commit combo.

But even if we can't, we can extract their manifests and use those to build valid combos--gets rid of the two way pointer issue.

> What I was starting to look into was using John's tool to create a combined
> history, allowing whoever is triggering the 'bisect' to specify how many
> revisions to test, and selecting equally spaced history combinations to run
> the tests against. We would need some way to grab a tinderbox build from a
> revision, and allow for the possibility that a match may not be found
> (perhaps trying the next/previous revision until one is matched).

Makes sense. I just like the idea of starting with a dead simple solution of "use what we already have" then iterating from there. I realize the request is focused on a per-commit solution, and that is ideal, but it may not make as much sense from a diminishing-returns POV. If nothing else, "better, sooner" has its own obvious advantages for doing this in steps.

And fwiw, I just don't think bisecting will work, as much as I went on and on about it above. After a lot of further thought, I think it only really works when you have something that can look at the build and say unequivocally "good" or "bad". The perf results won't work reliably that way, I don't think. Technically you could try to get the dev to put in threshold criteria, but I don't think it'll be effective compared to brute-forcing and looking at a graph.

> 
> I think resetting gaia is optional though, and we may just want to use the
> tinderbox builds as suggested by the recent comments. If this is the case,
> John's tool may not be needed here.

How about this?

Iteration 1: brute force tinderbox builds between good->bad as a "drill" operation between current datazilla points.

Iteration 2: brute force interleaved commits between good->bad tinderbox builds as an additional drill operation.

Iteration 3: Try to figure out how to reliably and automatically bisect for a perf issue.

...where we stop when we've cut performance debugging time sufficiently to be effective.
(In reply to Geo Mealer [:geo] from comment #24)
> If we're dealing with datazilla and especially with real hardware, we
> shouldn't think "gecko", or "gaia" for that matter, we should think "stack."
> I think focusing on one half or the other makes sense from what devs want to
> concentrate on, but since it's not what datazilla's actually showing I think
> it might mislead us. We want interim stacks.

Good point. The data in datazilla is currently a nightly 'stack' plus latest gaia at that time. If this regression hunting tool were to just use the tinderbox 'stacks' then it certainly makes the first iteration simpler.

> But re: your direct question, we have an inbound tree and a central tree to
> work from. I think there are a few valid heuristics: use the one with the
> most builds, use the one with the most-valid builds, use the tree used on
> datazilla, combine both into a unified chrono history, probably a few others
> I'm not considering. The trees themselves are laid out chronologically, so
> the actual grabbing A->B bit shouldn't be hard.

Let's use an example. Looking at datazilla, I've picked two consecutive points that differ in gecko revision, labeling them as good and bad for the purposes of this example:

Good:
73e32652feb5e05de4ff838d16138810dfc8c73d (git revision)
8122ffa9e1aa (hg revision)
2014-03-06-04-02-04 (build timestamp)

Bad:
cb6ce085e9672073c276a07bd6b1323c394d4c76 (git revision)
8095b7dd8f58 (hg revision)
2014-03-06-13-41-06 (build timestamp)

Using these build timestamps, the following tinderbox mozilla-central builds would fit the criteria:
20140305180907   06-Mar-2014 04:14 (c7d401d189e0)
20140305182107   06-Mar-2014 04:31 (8122ffa9e1aa)

The following tinderbox mozilla-inbound builds would apply:
20140305201605   06-Mar-2014 04:52 (8adacb553312)
20140305204706   06-Mar-2014 06:36 (b8551123b3da)
20140305210305   06-Mar-2014 06:00 (2e6afd113f7a)
20140305211506   06-Mar-2014 05:56 (3069330887e4)
20140305212405   06-Mar-2014 07:35 (e7e2197a831d)
20140305221905   06-Mar-2014 07:03 (407e82dd7c3c)
20140305225505   06-Mar-2014 08:44 (0f81cbeae0d4)
20140306020405   06-Mar-2014 12:09 (6fb8bc793891)
20140306023106   06-Mar-2014 12:40 (4c9d799155d2)

The following tinderbox b2g-inbound builds would apply:
20140305185807   06-Mar-2014 05:04
20140305201906   06-Mar-2014 05:17
20140305232206   06-Mar-2014 08:24
20140306001706   06-Mar-2014 09:29
20140306003405   06-Mar-2014 08:58
20140306003706   06-Mar-2014 09:43
20140306011706   06-Mar-2014 10:23
20140306012706   06-Mar-2014 09:51
20140306020605   06-Mar-2014 10:30
20140306022206   06-Mar-2014 10:41
20140306022706   06-Mar-2014 10:48
20140306030206   06-Mar-2014 11:19
20140306032905   06-Mar-2014 12:38

This totals 24 builds. Given a single testrun currently takes just over 2 hours, I don't think we can afford to run all of these. We could ask the person using the tool how many builds they want to try, with a reasonable default.

I'm not convinced that using the build timestamp is the best approach (I'd rather use the revision timestamp), but it's possibly the best we can do for the first iteration.
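A minimal sketch of the timestamp-based selection used in the example above, assuming each tinderbox build is known by its directory name plus the listing time shown next to it, and that a build qualifies when that listing time falls strictly between the good and bad build timestamps. The tuple layout and function name are assumptions for illustration.

```python
from datetime import datetime

BUILD_STAMP = "%Y-%m-%d-%H-%M-%S"  # e.g. 2014-03-06-04-02-04
LISTING_STAMP = "%d-%b-%Y %H:%M"   # e.g. 06-Mar-2014 04:14 (English month names)


def builds_in_window(listing, good_stamp, bad_stamp):
    """Return build IDs whose listing time lies between good and bad.

    `listing` is a list of (build_id, listing_time) string pairs, a
    hypothetical shape for a parsed directory index.
    """
    good = datetime.strptime(good_stamp, BUILD_STAMP)
    bad = datetime.strptime(bad_stamp, BUILD_STAMP)
    selected = []
    for build_id, modified in listing:
        when = datetime.strptime(modified, LISTING_STAMP)
        if good < when < bad:
            selected.append((when, build_id))
    # Directory order is not always time order, so sort before returning.
    return [build_id for when, build_id in sorted(selected)]


listing = [
    ("20140305180907", "06-Mar-2014 04:14"),
    ("20140305182107", "06-Mar-2014 04:31"),
    ("20140305204706", "06-Mar-2014 06:36"),
    ("20140305210305", "06-Mar-2014 06:00"),
]
in_range = builds_in_window(listing, "2014-03-06-04-02-04", "2014-03-06-13-41-06")
```

Swapping the listing time for a revision timestamp would only change how `when` is obtained; the window comparison stays the same.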

> I am making one assumption that I realized might be faulty: can we deploy
> those builds directly, assuming they're engineering builds for the correct
> target? Or do you do magic before building? Because that's one reason I
> wanted to do this--saves a ton of time from having to rebuild every commit
> combo.

These are all engineering builds, so it should just be a case of downloading, flashing the device, and running the tests.

> And fwiw, I just don't think bisecting will work, as much as I went on and
> on about it above. After a lot of further thought, I think it only really
> works when you have something that can look at the build and say
> unequivocally "good" or "bad". The perf results won't work reliably that
> way, I don't think. Technically you could try to get the dev to put in
> threshold criteria, but I don't think it'll be effective compared to
> brute-forcing and looking at a graph.

Our initial goal is just to provide the user triggering the tool with an email report containing the results for the tests triggered on the builds selected. Ultimately we are hoping to be able to push the results to datazilla, and have regression alerts triggered.

> Iteration 1: brute force tinderbox builds between good->bad as a "drill"
> operation between current datazilla points.
> 
> Iteration 2: brute force interleaved commits between good->bad tinderbox
> builds as an additional drill operation.
> 
> Iteration 3: Try to figure out how to reliably and automatically bisect for
> a perf issue.

This sounds reasonable to me. Do you have any further thoughts based on the recent comments, Kevin?
Flags: needinfo?(kgrandon)
(In reply to Dave Hunt (:davehunt) from comment #25)
> > Iteration 1: brute force tinderbox builds between good->bad as a "drill"
> > operation between current datazilla points.
> > 
> This sounds reasonable to me. Do you have any further thoughts based on the
> recent comments, Kevin?

I think Iteration 1 sounds fine for me as a starting point - and after we implement it I'm sure we will have some experience/learning from doing so. Thanks for hashing this out guys!
Flags: needinfo?(kgrandon)
(In reply to Dave Hunt (:davehunt) from comment #25)

> This totals 24 builds. Given a single testrun currently takes just over 2
> hours, I don't think we can afford to run all of these. We could ask the
> person using the tool how many builds they want to try, with a reasonable
> default.

That's an option. Another option is maybe running fewer iterations. That's balancing resolution vs. reliability.

Possible that the right answer is just it'll take a long time, and we queue and add more slaves as demand increases. We might not want to sacrifice resolution or reliability.

> 
> I'm not convinced that using the build timestamp is the best approach (I'd
> rather use the revision timestamp), but it's possibly the best we can do for
> the first iteration.

I don't have much opinion here aside from wanting whatever is actually between the stacks represented by the datapoints and not outside them. Since we're talking ranging, that's probably important.

The sources.xml hashes can probably be traced back to a revision timestamp.

> 
> Our initial goal is just to provide the user triggering the tool with an
> email report containing the results for the tests triggered on the builds
> selected. Ultimately we are hoping to be able to push the results to
> datazilla, and have regression alerts triggered.

Hm, not sure we'd want to do a global alert, since someone might retry the same set of datapoints (could cache them I guess). I see this more like a personal "try server" of past builds, but might have the wrong perspective.

Re: email report, will be interesting to see how well it works. Really you need to know not just the result, but also the error margins, etc., that you visualize in datazilla. It can be hard to compare those numerically.

What might work is attaching a CSV onto the email report so that it can be imported into something else that'll graph it (spreadsheet, whatever), or someone can write a visualizing tool against it.
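One possible shape for such an attachment, sketched with the standard library only. The column names, the choice of median and sample standard deviation as the summary, and the `results` mapping are all assumptions; the real report format is not specified in this bug.

```python
import csv
import io
import statistics


def results_to_csv(results):
    """Summarize per-build samples as CSV text.

    `results` maps a build ID to a list of timing samples in
    milliseconds (hypothetical structure).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["build", "runs", "median_ms", "stdev_ms"])
    for build_id, samples in results.items():
        # stdev needs at least two samples; report 0.0 for a single run.
        spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
        writer.writerow([build_id, len(samples), statistics.median(samples), round(spread, 1)])
    return buffer.getvalue()
```

Including the spread next to the median gives the reader at least a rough stand-in for the error margins that the datazilla graph would otherwise show.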

Question, though: given the runtime/queuing issues and the fact that we're not value-adding with a graph, does it make sense to do this against datazilla vs. a run-locally solution?

> 
> > Iteration 1: brute force tinderbox builds between good->bad as a "drill"
> > operation between current datazilla points.
> > 
> > Iteration 2: brute force interleaved commits between good->bad tinderbox
> > builds as an additional drill operation.
> > 
> > Iteration 3: Try to figure out how to reliably and automatically bisect for
> > a perf issue.
> 
> This sounds reasonable to me. Do you have any further thoughts based on the
> recent comments, Kevin?
Dave,

Looks like you, Geo and Kevin worked out the details on this. What're next steps?

Thanks,
Mike
Assignee: nobody → dave.hunt
Status: NEW → ASSIGNED
Flags: needinfo?(mlee) → needinfo?(dave.hunt)
I'm working on an initial version of the b2gperf regression hunter tool. I'm hoping to have something to demonstrate by the end of next week.
Flags: needinfo?(dave.hunt)
I have an initial version available here: https://github.com/davehunt/b2ghaystack

Please feel free to take a look, and to try it out (use --dry-run if you want to avoid triggering any jobs). I'm planning on writing up a blog post introducing the tool, and giving an end-to-end example of using it to determine the cause of an actual regression we experienced.
Thanks Dave Hunt.

Dave Huseby,

Please take a look at Dave Hunt's tool in comment 30. And reply here with any feedback.

Thanks, Mike.
Flags: needinfo?(dhuseby)
I've published a blog post [1] that goes into some detail for using the tool, and gives an end-to-end example. I think we can close this bug (I'm also happy to wait for feedback) and iterate on the tool by opening new bugs.

[1] http://blargon7.com/2014/03/hunting-for-performance-regressions-in-firefox-os/
I was able to set up a local instance of Jenkins and get it configured properly.  If I run the example command from your blog post--substituting with my credentials--the tool seems to work fine (regression range 07739c5c874f 318c0d6e24c5).

But then I tried regression ranges for several of my outstanding bugs and nothing happened.  I kept getting output like this:

Getting revisions from: https://hg.mozilla.org/mozilla-central/json-pushes?fromchange=7a2edc5171e6&tochange=d01bf8596d3b
--------> 5 revisions found
Getting builds from: https://pvtbuilds.mozilla.org/pvt/mozilla.org/b2gotoro/tinderbox-builds/mozilla-central-hamachi-eng/
--------> 152 builds found
--------> 0 builds within range
--------> 0 builds matching revisions
No builds to trigger.

Here are the regression ranges:

Bug 981995: 7a2edc5171e6  d01bf8596d3b
Bug 950673: df82be9d89a5  8b5875dc7e31
Bug 987994: fa098f9fe89c  5c0673441fc8

What am I doing wrong?
Flags: needinfo?(dhuseby) → needinfo?(dave.hunt)
(In reply to Dave Huseby [:huseby] from comment #33)
> What am I doing wrong?

What branch are you selecting on the command line? That will make a difference to how many builds match the range. It's also possible - though unlikely - in some cases that there are no tinderbox builds available between two revisions.

> Bug 981995: 7a2edc5171e6  d01bf8596d3b

For me, running the following for mozilla-central found one build to test, none on mozilla-inbound, and five on b2g-inbound:

$ b2ghaystack --dry-run -b mozilla-central --eng -u ***** -p ***** hamachi b2g.hamachi.perf 7a2edc5171e6 d01bf8596d3b

> Bug 950673: df82be9d89a5  8b5875dc7e31

I also found no valid builds on any branch between these two revisions. Are you aware of any that should have been matched?

> Bug 987994: fa098f9fe89c  5c0673441fc8

I found no builds on mozilla-central, but two builds on mozilla-inbound, and 22 builds on b2g-inbound.
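To make the two counts in the tool output easier to reason about, here is a sketch of the revision lookup and matching steps. It is not taken from b2ghaystack. The URL pattern is the one shown in the output in comment 33; the response shape (push IDs mapping to objects with a "changesets" list) and the `builds` mapping are assumptions.

```python
from urllib.parse import urlencode

PUSHLOG = "https://hg.mozilla.org/{branch}/json-pushes"


def pushlog_url(branch, fromchange, tochange):
    """Build the pushlog query URL for a revision range."""
    query = urlencode({"fromchange": fromchange, "tochange": tochange})
    return PUSHLOG.format(branch=branch) + "?" + query


def revisions_from_pushes(pushes):
    """Flatten a parsed pushlog response into an ordered revision list.

    Assumes `pushes` looks like {"<push id>": {"changesets": [...]}}.
    """
    revisions = []
    for push_id in sorted(pushes, key=int):
        revisions.extend(pushes[push_id]["changesets"])
    return revisions


def matching_builds(builds, revisions):
    """Keep builds whose short revision prefixes a revision in the range.

    `builds` maps a build ID to a short hg revision (hypothetical).
    """
    return [
        build_id
        for build_id, short in builds.items()
        if any(full.startswith(short) for full in revisions)
    ]
```

If the revision list is short and no stored build was produced from exactly one of those revisions, the match count is zero even when many builds exist, which is consistent with the output reported above.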
Flags: needinfo?(dave.hunt)
In terms of saving old builds, we save 12 weeks on mozilla-central and b2g-inbound, per Bug 969767. 

The other branches save much less (looks like ~4 weeks).
try again.
Flags: needinfo?(dhuseby)
Guys, what's going on with this issue? Is there any active work, and if not, what's the timeframe for delivering this?
Component: General → Performance
Dave and Dave please see comment 37. Thanks, Mike
Flags: needinfo?(dave.hunt)
I'm not currently working on this. I'm waiting to hear feedback from Dave Huseby, and from that I will likely raise some enhancement bugs for b2ghaystack, if it's worth proceeding with.
Flags: needinfo?(dave.hunt)
Dave Huseby,

What's the latest here? Are we still moving forward with this? Dave Hunt is waiting on you per comment 39.
Flags: needinfo?(dhuseby)
Target Milestone: --- → 2.0 S5 (4july)
Marking as incomplete. The b2ghaystack tool goes some way towards resolving this, but I suspect we'll be taking a different approach in the future.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE