Closed Bug 853029 Opened 9 years ago Closed 9 years ago

[meta] Hand Off Stone Ridge to A-Team & Releng

Categories

(Core :: Networking, defect)

defect
Not set
normal

RESOLVED FIXED

People

(Reporter: u408661, Assigned: u408661)

Details

Just talked to Doug: we want to hand Stone Ridge off to the A-Team & releng so the whole necko team can work directly on necko instead of on testing infrastructure. This is a meta-bug to track the progress of the hand-off. Filing in Core :: Networking for now, since we currently own Stone Ridge.
I'm not sure I can agree to this just yet. I have a bunch of questions.  Let's start with the top four.

1. What is the work and steps involved to make this happen? 
2. What else is necessary to complete the stone ridge project? Is the project in maintenance mode at this point? 
3. If we need something fixed in it later (particularly in the network shaping layer), can we turn to you?  What needs to be kept running here?
4. What are the future plans that the network team has for stoneridge? Does it already do everything you want?  Has it been useful?  What has it accomplished since it has been switched on?
(In reply to Clint Talbert ( :ctalbert ) from comment #1)
> I'm not sure I can agree to this just yet. I have a bunch of questions. 
> Let's start with the top four.

What, you don't want to take this nice, used car that was only driven by a sweet older woman to the grocery store once a week? :)

> 1. What is the work and steps involved to make this happen? 

If we only want to hand off maintenance of the existing infrastructure, very little. I should change the tests from being run under my personal account on the machines to being run under some stone ridge-only account, and we would need to get appropriate accounts made for new maintainers (releng & ateam).

Ideally, we would at some point (perhaps not now) scale up the infrastructure to increase capacity (stone ridge needs Real Live Hardware, so using VMs won't do it). Scaling up the infrastructure enough could even make the code simpler, as we could partition the client hardware per network configuration, instead of having to do it all with one set of clients as we do now (this is my dream state, fwiw). Also, ideally, we would integrate this into the regular build infrastructure, so stone ridge tests run on every push (and are available as options when pushing to try). We have something kind of like this now - stone ridge runs every night on the most recent nightly (close enough to every push for the purposes of networking), and we can tell stone ridge to run against an existing try push, but full integration would be even better.
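
To illustrate the partitioning idea described above, here is a minimal Python sketch. It is not taken from the Stone Ridge code: the configuration names, client names, and function names are all hypothetical. It only contrasts the two scheduling shapes, one shared set of clients versus dedicated clients per network configuration, by counting how often a client would have to be switched to a different configuration.

```python
from itertools import product

# Hypothetical names, for illustration only.
NETWORK_CONFIGS = ["netconfig_a", "netconfig_b", "netconfig_c"]
CLIENTS = ["client_1", "client_2", "client_3"]


def shared_schedule(clients, configs):
    """One set of clients: every client runs against every config in turn."""
    return [(client, config) for config, client in product(configs, clients)]


def partitioned_schedule(clients_by_config):
    """Dedicated clients: each client only ever runs against its own config."""
    return [
        (client, config)
        for config, clients in clients_by_config.items()
        for client in clients
    ]


def count_switches(schedule):
    """Count how often a client has to change network configuration."""
    last_config = {}
    switches = 0
    for client, config in schedule:
        if client in last_config and last_config[client] != config:
            switches += 1
        last_config[client] = config
    return switches
```

In the shared shape every client changes configuration between rounds, while in the partitioned shape no client ever changes, which is where the simplification would come from.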

> 2. What else is necessary to complete the stone ridge project? Is the
> project in maintenance mode at this point? 

This all depends on who you ask :) It's certainly in a usable state right now (barring a few bugs I have to shake out in the most recent feature), and it's miles better than what we had before (namely: nothing). However, there are some features on our wish-list, which is at https://github.com/mozilla/stoneridge/issues. Some on the necko team would argue more vigorously for those to be complete before considering it "done" enough to hand over, some less vigorously. (To be clear, I'm in the latter camp, for multiple reasons.)

Assuming we can agree to consider it "done" enough for now, the most common modification is likely to be adding a new page to the set of pages that are tested against. This is pretty easy, and lends itself well to automation after someone (the person who wants the new page, imho) records a snapshot of the page.

I should note that, when the necko team made our Q1 2013 goals, the idea was that I would spend the quarter getting it into the state it's (very close to) in right now, and then we would put off further feature development or (if we were lucky) find a contractor to continue feature development in Q2, so it shouldn't be too much of a shock to the necko team that we're trying to consider this done for now, until we get the resources to do more.

> 3. If we need something fixed in it later (particularly in the network
> shaping layer), can we turn to you?  What needs to be kept running here?

I have no problem offering myself as an expert (especially given I'm the only one that exists at this point). The code is all in Python, so I can't imagine your team would have too much of a problem getting a handle on it. I've also tried my best to comment the tricky bits thoroughly. In terms of infrastructure, the hardware pretty much takes care of itself. The trickiest bit comes in changing the DNS servers on a particular machine to get it to run against a particular network configuration. This can be eliminated, as I mentioned above, by scaling up the infrastructure. The DNS changing is by far the most fragile bit of the whole thing, and continues to periodically be a thorn in my side (I have to manually reset the DNS servers every few months - not a long process, just annoying when it crops up).
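
As a rough illustration of why the DNS switching is fragile, here is a minimal Python sketch of the general "switch, run, always restore" pattern. It is an assumption for illustration, not how Stone Ridge actually does it: `FakeResolverConfig` and `temporary_dns` are hypothetical names, the addresses are placeholders, and the sketch only changes an in-memory list rather than a machine's real DNS settings.

```python
from contextlib import contextmanager


class FakeResolverConfig:
    """Stand-in for a machine's DNS settings (hypothetical, in-memory only)."""

    def __init__(self, servers):
        self.servers = list(servers)


@contextmanager
def temporary_dns(resolver, servers):
    """Point the resolver at `servers` for one test run, then restore it."""
    original = list(resolver.servers)
    resolver.servers = list(servers)
    try:
        yield resolver
    finally:
        # Runs even if the test run raises, so the machine is not left
        # pointing at the DNS servers of the wrong network configuration.
        resolver.servers = original
```

A restore step like this cannot help if the process is killed outright or the machine reboots mid-run, which is the kind of situation where a manual reset would still be needed.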

> 4. What are the future plans that the network team has for stoneridge? Does
> it already do everything you want?  Has it been useful?  What has it
> accomplished since it has been switched on?


Future plans are to (1) add more pages, (2) keep using this to test necko performance, and (at some point) (3) add the features listed on the github page I linked above. One other thing (that I should really add to the issues list) is to make sure that stone ridge can somehow capture cache performance. It may (or may not) do so now, that's a completely untested use case (and not what stone ridge was initially intended to test, either). My suspicion is that it doesn't currently handle caching, but it wouldn't be exceedingly difficult to make it do so.

As for its current usefulness/what it's accomplished, the answer is: "so far, not very/not much". This is a combination of 2 factors: (1) we didn't have the ability to test full page loads until very recently (I'm still shaking some bugs out of that feature, in fact) and (2) we haven't had any big changes that were likely to significantly impact resource loading performance since stone ridge has been deployed. We have artificially proven its usefulness by testing a build that was intended to make performance horrible, and lo, the graph did spike, and there was much rejoicing amongst the necko hackers.
(In reply to Nick Hurley [:hurley] from comment #2)
> (In reply to Clint Talbert ( :ctalbert ) from comment #1)
> > I'm not sure I can agree to this just yet. I have a bunch of questions. 
> > Let's start with the top four.
> 
> What, you don't want to take this nice, used car that was only driven by a
> sweet older woman to the grocery store once a week? :)
> 
Something like that. :)
No, I remember chatting with you when StoneRidge was first discussed, so I knew this was likely going to come about one day.  I just want to see what exactly is involved with taking it over.

> 
> Future plans are to (1) add more pages, (2) keep using this to test necko
> performance, and (at some point) (3) add the features listed on the github
> page I linked above. One other thing (that I should really add to the issues
> list) is to make sure that stone ridge can somehow capture cache
> performance. It may (or may not) do so now, that's a completely untested use
> case (and not what stone ridge was initially intended to test, either). My
> suspicion is that it doesn't currently handle caching, but it wouldn't be
> exceedingly difficult to make it do so.
> 
> As for its current usefulness/what it's accomplished, the answer is: "so
> far, not very/not much". This is a combination of 2 factors: (1) we didn't
> have the ability to test full page loads until very recently (I'm still
> shaking some bugs out of that feature, in fact) and (2) we haven't had any
> big changes that were likely to significantly impact resource loading
> performance since stone ridge has been deployed. We have artificially proven
> its usefulness by testing a build that was intended to make performance
> horrible, and lo, the graph did spike, and there was much rejoicing amongst
> the necko hackers.

I applaud that you did the barebones "make sure it will catch completely broken performance issues" test. So many of the performance benchmarks we currently have to support have never been through even that simple test that it's hard not to look at them askance.

You talk about scaling up the infrastructure on a system that hasn't really provided much benefit.  This seems to me like putting the cart before the horse a little bit. Also, I question why we need this running on a per-push basis. There aren't that many checkins going into necko, and this is an expensive operation to scale up (in terms of machines and datacenter space etc).  I think making it run intermittently (like it does now) but yet making it visible on our dashboards is more the right approach -- that and somehow wiring it more seamlessly into the try server.  That said, I don't necessarily have any time budgeted in Q2 to really dig into these issues.

If you agree with those priorities for the short term, then I think that it does make sense to hand it off. I would hope and encourage you to work with us as we make this transition over so that we can ensure that things aren't totally dropped on the floor.

I do think that expanding our automation into networking analysis is something we are going to want to be doing, and that stone ridge is a good first step in that direction.
(In reply to Clint Talbert ( :ctalbert ) from comment #3)
> You talk about scaling up the infrastructure on a system that hasn't really
> provided much benefit.  This seems to me like putting the cart before the
> horse a little bit.

I talked about that only for 2 reasons: (1) scaling up (in the way I have in mind) would simplify the code and make the tests less brittle (this scaling up requires 6 more machines to run as clients and that's it), and (2) I've spent a decent amount of time on this, and the big scale-up is my "wouldn't it be cool if ..." state, not something I expect to happen any time soon.

> Also, I question why we need this running on a per-push
> basis. There aren't that many checkins going into necko, and this is an
> expensive operation to scale up (in terms of machines and datacenter space
> etc).  I think making it run intermittently (like it does now) but yet
> making it visible on our dashboards is more the right approach -- that and
> somehow wiring it more seamlessly into the try server.  That said, I don't
> necessarily have any time budgeted in Q2 to really dig into these issues.

That's totally reasonable. I think of running it on a per-push basis as one of the "wouldn't it be cool if"s, as well. I won't argue that we absolutely need that today. Seamless integration with try and displaying on dashboards are more important, but given that I wasn't originally planning on having any time in Q2 to work on this myself, I wouldn't expect them to magically happen in Q2 just because I'm handing it off entirely :)

> If you agree with those priorities for the short term, then I think that it
> does make sense to hand it off. I would hope and encourage you to work with
> us as we make this transition over so that we can ensure that things aren't
> totally dropped on the floor.

Absolutely! I would definitely want to work with you guys on getting this handed off in a sane fashion, and (time permitting) even further into the future, at least to help make sure the necko team gets what we want/need out of stone ridge in the long term.

It sounds like we're in a good place here to make a hand off happen without overburdening your team.
After emailing the a-team (I was able to find them!), I think we've come to agreement that this is as done as it's going to get, until they have to jump into a fire somewhere, and we all realize I forgot to tell them something :)

I'm going to mark this bug as fixed, and we can deal with future issues as they arise (if they arise).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED