Closed Bug 693544 Opened 14 years ago Closed 10 years ago

Migrate WebQA's Jenkins to ci.mozilla.org

Categories

(Infrastructure & Operations :: IT-Managed Tools, task, P4)

Tracking

(Not tracked)

VERIFIED INCOMPLETE

People

(Reporter: stephend, Unassigned)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/119] [triaged 20120910])

While we figure out what to do once bug 672218 is fixed (the plan is to eventually integrate WebQA's Jenkins jobs with Webdev's CI), we'd like to open up our Jenkins (currently running at http://qa-selenium.mv.mozilla.com:8080/) to the outside world. We've already got LDAP-based authentication for viewing/changing/building jobs/configs set up, from bug 691586. From Dave Hunt: "Jenkins communicates with Selenium Grid hub on port 4444 (configurable). Selenium Grid hub communicates with Selenium nodes on port 5555 (also configurable, and may vary in the future)." Phong/IT: how soon could we get qa-selenium.mv.mozilla.com (single Mac Mini) in the colo (I'm guessing SJC, since it's closest), and talking with the rest of the qa-selenium nodes (1-6), still on the QA VLAN on the 2nd floor, on port 5555? Longer-term, we're talking about potentially moving QA assets into a colo, but we're a ways off from that yet, and we need to open this up for our 3rd-party automation contractors (Waverly), who don't have LDAP.
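The two flows quoted above (Jenkins to the hub on 4444, hub to the nodes on 5555) can be probed with a plain TCP connect before and after any move. This is only a minimal sketch: the helper names `port_is_open` and `check_flows` are illustrative, and it assumes that a successful TCP connection is a good enough signal that the network flow is open.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_flows(flows, timeout=3.0):
    """Map each (host, port) pair to whether it is currently reachable."""
    return {(host, port): port_is_open(host, port, timeout) for host, port in flows}
```

Run from the Jenkins host, `check_flows([("qa-selenium.mv.mozilla.com", 4444)])` would cover the first flow; the node host names for the port 5555 checks would have to be filled in, since they are not spelled out here.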
Can we not do this until SCL3 comes online (a couple more months), or until new jenkins is online (whichever comes first)? We are short on power in sjc1 and will be looking to move stuff out of there soon, not in. Or, another option may be a VM in the new environment we are looking to set up in phx1. Even then we are a few weeks off at best.
(In reply to Corey Shields [:cshields] from comment #1) > Can we not do this until SCL3 comes online (a couple more months), or until > new jenkins is online (whichever comes first).. Thx, Corey; this can wait until bug 672218 -- Shyam says it's next on his to-do list, after another high-priority item, so the likelihood is good that it'll be soon. Sooner than a few weeks or a few months, at least.
Depends on: 672218
Oh the pressure ;)
Assignee: server-ops → shyam
How much space does your jenkins take up? (I'm making sure we don't run out of space on the new box)
(In reply to Shyam Mani [:fox2mike] from comment #4) > How much space does your jenkins take up? (I'm making sure we don't run out > of space on the new box) About 127GB, currently, though we expect that number to grow a bit as we add projects. There are things we're still talking through, though, on our team: A) do we even need old builds for some/all projects, in the future? If not, which can go? B) can/should we get rid of old jobs right now?
Ouch. The new machine has about 246 GB of space, out of which roughly 80 will go away for the existing sjc1 VM of jenkins. You should clear out as much cruft as you can :) Corey, should we plan on getting a storage blade for hudson in the future?
Can you run a du -hscx /var/lib/jenkins/* (or /var/lib/hudson/*) and post that on the bug. That's all I care about, TBH.
(In reply to Shyam Mani [:fox2mike] from comment #7) > Can you run a du -hscx /var/lib/jenkins/* (or /var/lib/hudson/*) and post > that on the bug. That's all I care about, TBH. Coz I found out that the initial 80GB estimate was off by 60GB :) So there's about 21 GB of data... but that leaves over 220 GB free on the disk.
Our /var/lib/ dir only has postfix :-) We have a /plugins dir off /Users/webqa/.jenkins -- our binary lives on the desktop (I know!); will follow-up with you or someone else to figure out what you're looking for, and where it lives :-)
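Since the Jenkins home is not under /var/lib on this box, the same per-entry breakdown that `du -hscx` gives can be approximated with a short script pointed at wherever the home directory actually is. A rough sketch, with two stated differences from `du`: it sums apparent file sizes rather than disk blocks, and it does not stop at filesystem boundaries. The function names are illustrative.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):  # does not follow directory symlinks
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def usage_by_entry(home):
    """Return {entry name: size in bytes} for each top-level entry in home."""
    usage = {}
    for entry in os.scandir(home):
        if entry.is_dir(follow_symlinks=False):
            usage[entry.name] = dir_size_bytes(entry.path)
        elif entry.is_file(follow_symlinks=False):
            usage[entry.name] = entry.stat().st_size
    return usage
```

Calling `usage_by_entry("/Users/webqa/.jenkins")` assumes that directory is the Jenkins home, which the comment above suggests but does not confirm.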
Changing title to reflect what we're actually going to be doing.
Summary: Need help opening up WebQA's Jenkins box (by moving it to SJC/colo?) → Migrate WebQA's Jenkins to ci.mozilla.org
fox2mike: this also/still depends on bug 692912, right?
I guess. I'm not sure :) It depends on what you need. I just got caseconductor up today, so I'll poke at your jenkins tomm or so.
(In reply to Shyam Mani [:fox2mike] from comment #12) > I guess. I'm not sure :) It depends on what you need. I just got > caseconductor up today, so I'll poke at your jenkins tomm or so. We need our Selenium boxes and the new https://ci.mozilla.org/ host talking to each other (comment 0); we might need to open up some new network flows. fox2mike, can you/we just try a WebQA build of something in https://ci.mozilla.org, and work from there? Anything stopping us from adding jobs, now?
I'm going to be away at MozCamp Asia till the 23rd, so I'll get back and work on the github repo etc. Sorry for the delay!
(In reply to Shyam Mani [:fox2mike] from comment #14) > I'm going to be away at MozCamp Asia till the 23rd, so I'll get back and > work on the github repo etc. Sorry for the delay! Is there anyone else from IT who can continue your work in the meantime? This is a Q4 goal of ours (I realize it might not be/probably isn't one of yours), and, while we don't need it done ASAP (though that would be nice!), I'm concerned that there might be a series of issues to work through as we continue to get further along.
I doubt it. Corey is the QA contact on this bug, he can re-assign to someone who's free, but I really doubt people aren't already occupied with other things. Don't worry, I'll work towards making your goal :)
Also, the mozillait user on github needs access to the private git repo.
(In reply to Shyam Mani [:fox2mike] from comment #17) > Also, the mozillait user on github needs access to the private git repo. Dave Hunt or James, can you help out with that?
David Burns: Can you sort out the Github request in comment 17? I think it requires an administrator. Thanks.
Comment 17 was actioned by someone else. The user is in the WebQA Team
(In reply to Stephen Donner [:stephend] from comment #15) > (In reply to Shyam Mani [:fox2mike] from comment #14) > > I'm going to be away at MozCamp Asia till the 23rd, so I'll get back and > > work on the github repo etc. Sorry for the delay! > > Is there anyone else from IT who can continue your work in the meantime? Not really, sorry :(
(In reply to Shyam Mani [:fox2mike] from comment #14) > I'm going to be away at MozCamp Asia till the 23rd, so I'll get back and > work on the github repo etc. Sorry for the delay! Yay 23rd, today :-) Know you might have other stuff too, of course! :-)
Haha, yup. Like getting new ssl certs, replacing the *.mozilla.org one with individual certs etc ;) <3
Any (realistic) chance of this happening in Q4, or can/should we rather expect it in early Q1? <3
(In reply to Stephen Donner [:stephend] - off until 12/29 from comment #24) > Any (realistic) chance of this happening in Q4, or can/should we rather > expect it in early Q1? <3 I'll poke at this right now :) And keep you posted.
I'm summarizing what we talked about in person - All you need is a handful of Windows VMs in Phoenix. You also mentioned that the existing Mac Minis are running Fusion and you can export the vmdks from there. Punting to Dan.
Assignee: shyam → dparsons
We should be able to set up a Selenium Grid instance for Jenkins to run tests against with the following: 1x Linux instance - to run Selenium Grid hub on. 2x Windows 7 Professional VMs - to run the Selenium server nodes and the actual tests. This will allow us to run tests in parallel, and we should be able to scale up in future by adding more VMs. Ultimately we may scale the Phoenix grid up to replace the grid in MV. Marlena: Could you provide Dan with a vmdk from one of our virtual machines? We would need to be able to access the Linux box via SSH, and be able to install Java, Ant, and Git on there. We would need to be able to access the Windows VMs via remote desktop or similar. A suggestion for the hostnames: * selenium-hub (Linux box) * selenium-node1 * selenium-node2
Tell me the FQDN of the server, as well as root login, that has the vmdks now and I'll move them, assuming you can give me a downtime window to temporarily shut down the VM.
I think we're good to go on this bug, as Dan's indicating. I think these will go in the dmz VLAN with jenkins1/jenkins2, but I'm happy to be told otherwise. Let's call the hub "selenium-hub1", since we may end up with another hub at some point. To clarify, the vmdk's will be loaded up as the W7 VMs, rather than the W7 VM's running vmware or fusion or something crazy like that, right?
(In reply to Dustin J. Mitchell [:dustin] from comment #29) > I think we're good to go on this bug, as Dan's indicating. I think these > will go in the dmz VLAN with jenkins1/jenkins2, but I'm happy to be told > otherwise. Let's call the hub "selenium-hub1", since we may end up with > another hub at some point. This sounds like a fine suggestion to me, and allows for the future, as you say (although I don't know that we'd ever need a 2nd Selenium Hub, but hey). > To clarify, the vmdk's will be loaded up as the W7 VMs, rather than the W7 > VM's running vmware or fusion or something crazy like that, right? I don't think we have objections; currently (for historical reasons), our vmdk's are running on several of our Mac Minis, inside VMWare Fusion (because we need the ability to run on all platforms, and when we started out a couple years ago, we couldn't virtualize Mac OS X). We're not beholden to Macs, particularly, though, so although we'll move the current Minis over (likely to SCL3), Windows is our preferred platform, and we're happy for those to be hosted however IT sees fit.
Dan, Reclaiming this bug (since it's meta bug of sorts), spinning a new one for you to track these VMs and making that bug block this.
Assignee: dparsons → shyam
Depends on: 727353
Stephen, What's left here? Do you need my help moving things around?
(In reply to Shyam Mani [:fox2mike] from comment #33) > Stephen, > > What's left here? Do you need my help moving things around? 1) We need to double-check that credentials aren't leaked in any form (Dave, can you help us figure that out?) 2) You mentioned that Web QA adding the 3 jobs we did (+ our plugins) severely impacted startup time -- ours takes upwards of 30 minutes (tons of jobs), so we weren't sure you were ready to add the plethora of jobs we have to that, without further investigation 3) We were finishing up Q2, and are still in the process of formulating Q3 goals, so haven't gotten around to standing up the other Selenium VM + a restart/monitoring script for Hub
(In reply to Stephen Donner [:stephend] from comment #34) > 1) We need to double-check that credentials aren't leaked in any form (Dave, > can you help us figure that out?) You reminded me that we do need to fix a leak in Jenkins: bug 770226
The issue is that a stack trace for a failure could contain credentials for test users. The risk is that anyone viewing the stack trace could use these credentials and cause future testruns to fail (consequences could be worse on production). An option to hide this output from unprivileged users would resolve this (it would affect the console log, test failure reports, HTML reports, Selenium logs, and email notifications). My preference would be for each test that requires credentials to take care of creating the user itself via an API. This would mean tests would not be dependent on a single (potentially insecure) test user account. We nearly have this with Browser ID.
This would hurt outside contributors who may need to watch builds and learn from their failures. Would it be possible to write a py.test (or whatever) plugin to deliver scrubbed tracebacks? We use middleware to remove credentials from tracebacks in Django
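To make the idea of a scrubbed traceback concrete, here is a minimal, standalone sketch that masks values following credential-like keys in traceback text while leaving the rest of the trace readable. It is not the Django middleware mentioned above and not an existing py.test plugin; the key list and the `scrub` name are assumptions for illustration, and a real filter would need to match how the tests actually pass credentials.

```python
import re

# Keys treated as sensitive are an assumption; extend to match the test suite.
_SENSITIVE = re.compile(
    r"(?i)\b(password|passwd|secret|token|api_key)\b"  # key name
    r"(\s*[=:]\s*)"                                     # separator
    r"(['\"]?)"                                         # optional opening quote
    r"([^'\"\s,)]+)"                                    # the value to mask
    r"\3"                                               # matching closing quote
)


def scrub(text):
    """Replace values of credential-like keys with a fixed mask."""
    return _SENSITIVE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}********{m.group(3)}",
        text,
    )
```

For a line such as `login(username='bob', password='hunter2')`, only the password value is masked; the user name, file names, and line numbers in the trace are left intact.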
What do you mean by 'scrubbed tracebacks'? I suspect these wouldn't be very useful to us investigating failures. Also, this seems to be a response to my first option, which I'm not even sure is possible in Jenkins. As I said in comment 36, my preference would be throwaway test accounts created by tests on demand, that way it really wouldn't matter who knew the credentials after the test had finished.
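The throwaway-account approach could look roughly like the sketch below: each test generates random credentials, creates the user, and removes it afterwards, so a leaked password is useless once the test has finished. The `create_user` and `delete_user` callables are hypothetical placeholders for whatever account API the site under test exposes; nothing here is an existing WebQA helper.

```python
import secrets
from contextlib import contextmanager


def make_throwaway_credentials(prefix="webqa-test"):
    """Generate a random username/password pair for a single test run."""
    return {
        "username": f"{prefix}-{secrets.token_hex(4)}",
        "password": secrets.token_urlsafe(16),
    }


@contextmanager
def throwaway_user(create_user, delete_user):
    """Create a user for the duration of a test, then always remove it."""
    creds = make_throwaway_credentials()
    create_user(creds)  # hypothetical call into the site's account API
    try:
        yield creds
    finally:
        delete_user(creds)  # runs even if the test body raises
```

A test would wrap its body in `with throwaway_user(create, delete) as creds:` and log in with `creds["username"]` and `creds["password"]`.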
(In reply to Stephen Donner [:stephend] from comment #34) > 2) You mentioned that Web QA adding the 3 jobs we did (+ our plugins) > severely impacted startup time -- ours takes upwards of 30 minutes (tons of > jobs), so we weren't sure you were ready to add the plethora of jobs we have > to that, without further investigation This seems to be the gazillion job directories that are lying around without being archived. Like webqa-sumo has been failing for maybe 8 months, tries to build every 2 mins. Each of those is a "build" and has a directory. Therefore, when jenkins fires up, it runs through all these directories (not sure what it's looking for exactly)...but cleaning those up will significantly improve start up times. This is what I was referring to. You can add more jobs, just not on 2 min crons when they keep failing :)
A sensible build retention policy should be set to prevent this from happening. It's simple to set this to a number of days or a number of builds. Even just keeping the last 100 builds would improve this a lot. Also, no job should be running every 2 minutes, so that's concerning.
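The arithmetic behind this is worth spelling out, along with what "keep the last 100 builds" selects. The sketch below only computes numbers and picks candidates; it deletes nothing, and the per-job retention setting in Jenkins itself is the better place to enforce the policy. The function names are illustrative.

```python
def builds_per_day(interval_minutes):
    """Number of builds a job triggers per day at a fixed interval."""
    return (24 * 60) // interval_minutes


def builds_to_discard(build_numbers, keep=100):
    """Return the build numbers that fall outside the newest `keep` builds."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(build_numbers)
    return ordered[: max(len(ordered) - keep, 0)]
```

A job on a 2-minute schedule produces `builds_per_day(2)`, i.e. 720 builds a day. If such a job really ran unattended for something like 8 months (taken here as about 240 days), that is on the order of 720 × 240 = 172,800 build directories for Jenkins to walk at startup, versus 100 under the retention policy suggested above.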
I think the only reason they were 2 mins was to make it easier to debug..but I guess they just got left at that.
Handing off Jenkins to webops, I'll still be around to assist if needed.
Assignee: shyam → server-ops-webops
Component: Server Operations → Server Operations: Web Operations
Priority: -- → P4
Whiteboard: [triaged 20120910][waiting][webops q4 review]
Stephen, can we get a summary of what work needs done on this? We'll evaluate it as a potential Q4 goal/project.
Whiteboard: [triaged 20120910][waiting][webops q4 review] → [triaged 20120910][waiting][webops q4 meetup][2012q4?]
Whiteboard: [triaged 20120910][waiting][webops q4 meetup][2012q4?] → [triaged 20120910][waiting][qa][webops q4 meetup][2012q4?]
(In reply to Jake Maul [:jakem] from comment #43) > Stephen, can we get a summary of what work needs done on this? We'll > evaluate it as a potential Q4 goal/project. Jake, pretty sure we're mostly waiting on bug 773116, before wanting to complete this move -- that's the best/approved path (from NetOps) for running our Selenium tests. The only other issue is the potential to leak usernames and passwords (credentials) when/if authentication fails, via a stacktrace, which could be publicly-viewable. Not sure what else we should be doing, Web QA-side, to mitigate that.
WebOps isn't in a great position to take care of this in Q1 (clearly didn't happen in Q4). Honestly, I'm still not sure what we would even be doing here... we don't generally create or maintain the individual jobs within Jenkins. If there's a convenient way to move projects between instances, we can do that. https://wiki.jenkins-ci.org/display/JENKINS/Administering+Jenkins seems to say you can simply pick up the job directory from one install and drop it onto another. I'm skeptical of this, but we can certainly try. :) I don't have a good answer for the credential concern either. Is it possible to mark certain projects as private, to prevent this possibility? This seems like it wouldn't be any worse than what we have now, where they're "private" because they're on a VPN only. You may want to talk with some of the folks that pay attention to the jobs already on ci.mozilla.org, to see how they handle the same situations. Of course, you're obviously welcome to migrate anything you want by hand. I know there's a lot so this may not be feasible as a whole, but it might suffice to demonstrate that it works for your needs, and let you get the most important stuff moved sooner than we'll get to it.
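Trying the "pick up the job directory and drop it onto another install" approach on a single job first could be as small as the sketch below. It assumes the usual layout where each job lives in `jobs/<name>/` under the Jenkins home with a `config.xml` inside, and that both homes are reachable as local paths; the `copy_job` name is illustrative. The target Jenkins would typically need a restart or a configuration reload before the job shows up.

```python
import shutil
from pathlib import Path


def copy_job(src_home, dst_home, job_name):
    """Copy one job directory from one Jenkins home to another."""
    src = Path(src_home) / "jobs" / job_name
    dst = Path(dst_home) / "jobs" / job_name
    if not (src / "config.xml").is_file():
        raise FileNotFoundError(f"no config.xml under {src}")
    if dst.exists():
        # Refuse to overwrite a job that already exists on the target.
        raise FileExistsError(f"{dst} already exists")
    shutil.copytree(src, dst, symlinks=True)  # keep symlinks as symlinks
    return dst
```

Copying a single job this way, for example `copy_job(old_home, new_home, "webqa-sumo")`, would show whether plugins and paths carry over before committing to the full set. Pruning old builds first keeps the copy small.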
Flags: needinfo?(stephen.donner)
Whiteboard: [triaged 20120910][waiting][qa][webops q4 meetup][2012q4?] → [triaged 20120910]
Sorry for the delay; this isn't something we're likely to get to for at least a full quarter -- we're now spinning up Firefox OS automation, and are attaching Jenkins slaves to our master. It's critical to what we do now, and we're probably only going to be building on it in the very-near future. Also, we do still have the potential leakage of credentials, and while I'd LOVE to be on a publicly-hosted server, this needs careful thought, planning, and execution, which I/we just don't have time for, right now.
Flags: needinfo?(stephen.donner)
(In reply to Stephen Donner [:stephend] from comment #46) > Also, we do still have the potential leakage of credentials, and while I'd > LOVE to be on a publicly-hosted server, this needs careful thought, > planning, and execution, which I/we just don't have time for, right now. Doesn't mean you can't have a private instance, but I'm guessing you wanted it under the same ci.mozilla.org banner. I'm not a fan of stuff running in Mountain View as that's not a Datacenter. We shouldn't be running anything critical out of an office.
(In reply to Stephen Donner [:stephend] from comment #46) > Also, we do still have the potential leakage of credentials, and while I'd > LOVE to be on a publicly-hosted server, this needs careful thought, > planning, and execution, which I/we just don't have time for, right now. We do have a strategy for django projects, at least. Shyam, if you have time, maybe we can start spinning up the infrastructure we'd need to run Selenium and then webdev can take some initiative on moving those tests over. Then we could start to address the move, and build out what WebQA will need, without taking their time this quarter.
(In reply to James Socol [:jsocol, :james] from comment #48) > (In reply to Stephen Donner [:stephend] from comment #46) > > Also, we do still have the potential leakage of credentials, and while I'd > > LOVE to be on a publicly-hosted server, this needs careful thought, > > planning, and execution, which I/we just don't have time for, right now. > > Shyam, if you have time, maybe we can start spinning up the infrastructure > we'd need to run Selenium and then webdev can take some initiative on moving > those tests over. Then we could start to address the move, and build out > what WebQA will need, without taking their time this quarter. I was just adding my 2 cents. I'm pretty wiped out this Q with stuff :( I can help with the hardware and work with solarce, since most of the jenkins stuff is in puppet...so standing up a new instance is trivial and can be behind LDAP auth...
Component: Server Operations: Web Operations → WebOps: IT-Managed Tools
Product: mozilla.org → Infrastructure & Operations
Blocks: 819469
Whiteboard: [triaged 20120910] → [kanban:https://kanbanize.com/ctrl_board/4/83] [triaged 20120910]
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/83] [triaged 20120910] → [kanban:https://webops.kanbanize.com/ctrl_board/2/119] [triaged 20120910]
If this is still a valid request, I recommend looking at setting this up in AWS when that becomes available ~ end of Q1, beginning of Q2. Thanks, Stephanie
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
(In reply to schan from comment #50) > If this is still a valid request, I recommend that look at setting this up > in AWS when that becomes available ~ end of Q1,begin of Q2. > > Thanks, > Stephanie Thx Stephanie -- old bug, and we've decided to go ahead and create a new, replacement instance for our current, behind-VPN Jenkins, here: bug 1112555.
Status: RESOLVED → VERIFIED