Closed
Bug 943992
Opened 12 years ago
Closed 11 years ago
Need High Performance Virtual machine for web crawling and testing
Categories
(Infrastructure & Operations :: Virtualization, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cmtalbert, Assigned: cknowles)
Details
Hallvord and his team need some scalable power to run a web crawler which will crawl over a large selection of top sites looking for compatibility problems. They want to both do "proactive" spidering (i.e. give the script a list of 500 important URLs for a given country or topic, let it find problems) and "regression testing" spidering (i.e. on a regular basis run through a list of URLs that had/have known issues, check if the issues still apply, or even if they re-appear after we got the site to fix things). To make sure our testing is realistic, just spidering with wget or curl or some such isn't good enough - the project we're working on uses real browser instances (currently based on the WebKitGTK Python API) on the backend, and checks various issues by comparing one instance which is spoofing as Safari on iOS with the other one spoofing as FirefoxOS.
This will help ensure our browsers and products are well supported by web developers in general and will help our developer evangelism team to target web sites that need to be changed to better support us.
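To illustrate the comparison step described above, here is a minimal sketch in Python. It is not taken from the mozcompat code: the function name `diff_observations` and the observation keys (`final_url`, `status`) are hypothetical. It assumes each browser instance yields a flat dictionary of observations for a URL, and it reports only the keys whose values differ between the two instances.

```python
def diff_observations(ios_result, fxos_result):
    """Return {key: (ios_value, fxos_value)} for every key whose values differ.

    Both arguments are assumed to be flat dicts of observations collected
    by one browser instance (hypothetical shape, not the project's schema).
    A key missing from one side is reported with None for that side.
    """
    differences = {}
    for key in sorted(set(ios_result) | set(fxos_result)):
        ios_value = ios_result.get(key)
        fxos_value = fxos_result.get(key)
        if ios_value != fxos_value:
            differences[key] = (ios_value, fxos_value)
    return differences


if __name__ == "__main__":
    # Made-up observations: the iOS-spoofing instance was redirected to a
    # mobile site, the FirefoxOS-spoofing instance was not.
    ios = {"final_url": "http://m.example.com/", "status": 200}
    fxos = {"final_url": "http://example.com/", "status": 200}
    print(diff_observations(ios, fxos))
```

A non-empty result would flag the URL as a candidate compatibility problem worth a closer look.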
The code we will be running can be found here: https://github.com/seiflotfy/mozcompat/
One thing it will need is the GObject introspection library; I'm not sure if that is in our usual RHEL images or available from our RPM stores. For that reason, we might prefer to use Ubuntu (it's all been engineered on Ubuntu and we know it works there out of the box).
As for storage, it uses MongoDB on the backend and stores 1 row with 15-20 fields (text blobs, numbers, booleans) per URL. We plan to run around 1000 URLs a day, so we are going to see some growth in the storage system, but we aren't taking snapshots or downloading entire pages.
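As a rough back-of-envelope check on that growth, here is a small Python sketch. The row count and field count come from the figures above; the average size per field (512 bytes) is purely an assumption, and the estimate ignores MongoDB's own overhead and indexes.

```python
def estimate_storage_bytes(urls_per_day, days, fields_per_row=20, avg_field_bytes=512):
    """Estimate raw data volume for one row per URL per run.

    fields_per_row uses the upper end of the 15-20 range from the request;
    avg_field_bytes is an assumed average, not a measured value.
    """
    return urls_per_day * days * fields_per_row * avg_field_bytes


if __name__ == "__main__":
    one_year = estimate_storage_bytes(urls_per_day=1000, days=365)
    print("Estimated raw data after one year: %.2f GiB" % (one_year / 2**30))
```

Under these assumptions a year of crawling stays in the low single-digit gigabytes, well below the 250 GB of disk requested.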
We don't care if this is in our datacenter or AWS, but we need a place to host this where the developers can access it, easily deploy new code to it, and continue to tweak its parameters as we finalize the last production settings. AWS is probably better for access and turnaround time, but it's your call. As for whether we need web access *to* the box, I don't believe so, but it is associated with the arewecompatibleyet.com site. I'll needinfo Hallvord for more on how that works since I don't see anything in the repo that seems to either generate or communicate with that site.
So, a system:
* 64-bit Linux (prefer Ubuntu)
* 4 GB memory
* 250 GB of disk
Hallvord, please correct if I missed anything.
Flags: needinfo?(hsteen)
Comment 1•12 years ago
The interaction between the crawler/tester and the AWCY site is yet to be fully fleshed out. Right now all we have is a feature in the crawler that loads the big table of sites from AWCY and crawls through it - we do want to be able to query testing data from the site eventually, either with XHR (potentially CORS), or by direct database querying on the backend if we get a db connection. We can do whatever is convenient for you.
Otherwise, this looks good - think you got all the information.
Flags: needinfo?(hsteen)
Updated•12 years ago
Assignee: server-ops → server-ops-virtualization
Component: Server Operations → Server Operations: Virtualization
QA Contact: shyam → dparsons
Updated•12 years ago
Flags: sec-review?(sbennetts)
Hi everyone, this is something Hallvord would like to deploy by the end of the year. How soon can we at least get him a system to test on and for the security guys to take a look at? I.e. I think we should at least be able to deploy the code and test it in our environment by the holidays. I can understand if the sec review can't happen by then (but that would be nice ;-) )
Comment 3•12 years ago
:ctalbert, it's not 100% clear exactly what this request needs. If all you need is 1x 64-bit Ubuntu with 4GB RAM and 250GB disk, then please specify hostname, datacenter and VLAN and we can get it done.
Comment 4•12 years ago
Please put the system in a DMZ (74) VLAN.
Assignee
Comment 5•12 years ago
OK, due to templates and other things, I'd like to put this in SCL3, unless there's a specific need for it in PHX1.
So, I've got <HOSTNAME>.dmz.scl3.mozilla.com
With 64-bit Ubuntu
1 Core (?)
4GB RAM,
250GB disk.
So, at the very least I need <HOSTNAME> and the number of cores answered. If that's all there is, I can have something ready for initial work a couple of days after the details arrive. With any other changes, of course, that might change.
CJK
Flags: needinfo?(ctalbert)
Comment 6•12 years ago
can we have compatipede.dmz.scl3.mozilla.com ? "Compatipede" is the naming competition front runner for this tool :)
Comment 7•12 years ago
Hallvord, "compatipede" is probably the best VM name I've ever seen. I hope it wins the competition :)
Assignee
Comment 8•12 years ago
We always number things, so compatipede1.dmz.scl3.mozilla.com - it's close to EOD here, so I'll get started on it tomorrow. (# of cores can be changed relatively easily, but I'll start it at 1)
And I second Dan's enthusiasm for the name.
CJK
Assignee: server-ops-virtualization → cknowles
Comment 9•12 years ago
Re the security review: it would be great to have a chat about this project. Who would be the best person (people) for me to talk to?
Comment 10•12 years ago
Me and Seif Lotfy (the volunteer who has written the code) - perhaps we could meet in a vidyo room or something?
Comment 11•12 years ago
That works for me.
What timezones are you 2 in?
I'm based in the UK.
Assignee
Comment 12•12 years ago
Alright, the VM is up and has been added to puppet and nagios, and is ready for further customization and work as you need.
CJK
Comment 13•12 years ago
We're both in Europe, so TZ-wise it will be easy. Following up by E-mail..
Reporter
Comment 14•12 years ago
Hallvord, thanks for answering all the questions here. Thanks also guys for all the quick work. I really appreciate it. My role here was just to get the ball rolling, so I'll let Hallvord take over from here. Sorry I missed these, I instituted a new bugmail filtering system and temporarily lost this thread. :-(
Flags: needinfo?(ctalbert)
Assignee
Comment 15•12 years ago
So, I'm sorry, do you guys need anything else from the virtualization team? Or can I route the ticket somewhere else for attention, or are we all done?
CJK
Comment 16•12 years ago
stupid question: how do I access this machine? I need to be on VPN?
Assignee
Comment 17•12 years ago
Yes, you'll likely need to be on the VPN. I've asked for some help in making sure you've got access through to it.
CJK
Comment 18•12 years ago
(In reply to Hallvord R. M. Steen from comment #16)
> stupid question: how do I access this machine? I need to be on VPN?
Yes, you will need to be on a VPN; instructions are here: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=30769829
What kind of access do you need on it? SSH?
Comment 19•11 years ago
My plan is to SSH to the machine and do a "git clone https://github.com/ .." to get the stuff we want to run there, plus some apt-get, pip or easy_install to pull in the various dependencies. Will that work?
Once the script is running, it's supposed to be listening on an open port so I will be able to queue URLs for testing from a Python script on my local machine. (I can be on VPN to do so - at least for now). The script on the server will then process the queue, load those URLs in web browsers, gather information and save to DB.
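A minimal sketch of what such a local queuing script could look like is below. The wire format is an assumption (newline-delimited UTF-8 URLs over a plain TCP connection), and the port number in the usage comment is hypothetical; the real listener in the project may expect something different.

```python
import socket


def encode_queue_request(urls):
    """Encode URLs as newline-delimited UTF-8 (assumed wire format).

    Blank entries are dropped and surrounding whitespace is stripped.
    """
    cleaned = [url.strip() for url in urls if url.strip()]
    if not cleaned:
        return b""
    return ("\n".join(cleaned) + "\n").encode("utf-8")


def queue_urls(host, port, urls, timeout=10.0):
    """Send the encoded URL list to the listening crawler and return bytes sent."""
    payload = encode_queue_request(urls)
    if not payload:
        return 0
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)


# Usage while connected to the VPN (port 8000 is a made-up example):
#   queue_urls("compatipede1.dmz.scl3.mozilla.com", 8000,
#              ["http://example.com/", "http://example.org/"])
```

Keeping the encoding separate from the network call makes it possible to check the payload locally before anything is sent to the server.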
I've set up VPN (on a Windows laptop right now) and tried to use PuTTY with the private key that I think matches the public key in my LDAP entry. However, PuTTY fails to connect - it says "No supported authentication methods available (server sent publickey)". Am I doing something wrong?
Comment 20•11 years ago
Belated update - I had a chat with Seif on Jan 7th and it looks like a pretty low risk project to me, so I don't have any major security concerns.
I will have a play with the code but definitely don't want to hold anything up.
Assignee
Comment 21•11 years ago
Having not seen any further updates from Ed Lim, I'm updating this asking for his input on the login problems you're having. I'll also poke in IRC so he gets the notice to take a look.
CJK
Comment 22•11 years ago
Access granted to host and sudo given as well.
Comment 23•11 years ago
All set indeed. Thank you, Ed!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations
Comment 24•10 years ago
Hi Chris,
I don't know if the machine provisioned here is still active, but if so it's unused and can be taken down. Is there a way to check?
<hallvord> I think the VM was taken down at some point when we didn't use it anymore, have forgotten..
<hallvord> I don't even remember how to get on VPN, so I can't check if compatipede1.dmz.scl3.mozilla.com is alive..
Flags: needinfo?(cknowles)
Comment 25•10 years ago
It's alive. We'll get rid of it based on comment 24.
With the name freeing up, we can use it for the new request in bug 1216188.
Flags: needinfo?(cknowles)
Comment 26•10 years ago
Thanks.
Comment 27•10 years ago
Old form of compatipede1.dmz.scl3 deleted.