Closed
Bug 763804
Opened 12 years ago
Closed 12 years ago
Mozilla validator.nu instance for linting HTML
Categories
(Infrastructure & Operations Graveyard :: WebOps: Labs, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: humph, Assigned: gozer)
Attachments
(1 file)
44.69 KB,
text/plain
I want to run an instance of the HTML5 Validator (http://about.validator.nu/#src) for use in MoFo software project build systems. I want to be able to do HTML linting in our build systems for Popcorn, Popcorn Maker, etc. and thought that we should run an instance instead of spamming http://validator.nu/ with requests all the time.

According to the setup info, this wants Ubuntu with Python, Mercurial, Subversion, JDK5 or 6, and a bunch of deps:

"commons-codec-1.4/commons-codec-1.4.jar",
"commons-httpclient-3.1/commons-httpclient-3.1.jar",
"commons-logging-1.1.1/commons-logging-1.1.1.jar",
"commons-logging-1.1.1/commons-logging-adapters-1.1.1.jar",
"commons-logging-1.1.1/commons-logging-api-1.1.1.jar",
"icu4j-charsets-4_4_2.jar",
"icu4j-4_4_2.jar",
"iri-0.5/lib/iri.jar",
"jetty-6.1.26/lib/servlet-api-2.5-20081211.jar",
"jetty-6.1.26/lib/jetty-6.1.26.jar",
"jetty-6.1.26/lib/jetty-util-6.1.26.jar",
"jetty-6.1.26/lib/ext/jetty-ajp-6.1.26.jar",
"apache-log4j-1.2.15/log4j-1.2.15.jar",
"rhino1_7R1/js.jar",
"xerces-2_9_1/xercesImpl.jar",
"xerces-2_9_1/xml-apis.jar",
"slf4j-1.5.2/slf4j-log4j12-1.5.2.jar",
"commons-fileupload-1.2.1/lib/commons-fileupload-1.2.1.jar",
"isorelax.jar",
"mozilla/intl/chardet/java/dist/lib/chardet.jar",
"saxon9.jar"

I'm not clear on hardware requirements for the box, and have CC'ed Henri in case he has suggestions. It won't get pummeled with traffic, but given that we'll use it in our build systems and automation scripts, it will get used fairly often.

Calling it something like validator.mozillalabs.org or something would be useful. It also seems to run on port 8888, but that might be configurable. We won't be sending any sensitive/private HTML to the service, so HTTP is fine.
Comment 1•12 years ago
Best to try with a low-end config, and see if we need to bump it up.

Distro: Ubuntu
Memory: 512M
CPU count: 1
Disk size: 8G is the minimum, say if you need more.
Project name: Validator
Project owner: whatever humph uses
Needs https access from the internet: Y
Need other ports besides 80/https: N
Needs a DNS domain name (i.e. myproject.org): validator.mozillalabs.com
Good to know what installed services you'll be running: Java
Does the project have a home page explaining what it is? http://validator.nu/
Assignee | ||
Updated•12 years ago
Assignee: server-ops-labs → gozer
Status: NEW → ASSIGNED
Comment 2•12 years ago
(In reply to David Humphrey (:humph) from comment #0)
> I want to run an instance of the HTML5 Validator
> (http://about.validator.nu/#src) for use in MoFo software project build
> systems. I want to be able to do HTML linting in our build systems for
> Popcorn, Popcorn Maker, etc. and thought that we should run an instance
> instead of spamming http://validator.nu/ with requests all the time.

If you want HTML5 linting, it is probably best to configure the Labs instance like http://html5.validator.nu/ rather than http://validator.nu/. This way, it will default to HTML5 and you don't need to pass the API parameters for requesting HTML5.

> According to the setup info, this wants Ubuntu with Python, Mercurial,
> Subversion, JDK5 or 6, and a bunch of deps:

JDK 5 is legacy software at this point. I recommend running OpenJDK 6. OpenJDK 7 most likely works, but I haven't tested it.

I have tested the software only on Ubuntu and Mac OS X, but there's no reason to believe that it wouldn't run on any Linux distribution, so it probably makes sense to use whatever distribution works best for server operations and not be too focused on the fact that I've been using Ubuntu.

> Calling it something like validator.mozillalabs.org or something would be
> useful. It also seems to run on port 8888, but that might be configurable.
> We won't be sending any sensitive/private HTML to the service, so HTTP is
> fine.

A non-root process can't bind to port 80, you're not supposed to start the validator as root, and on Linux elevating the binding privileges of a multi-threaded Java process doesn't really work. This is why it makes sense to make the Java process bind to a non-privileged port and to use iptables in the kernel to redirect port 80 to whatever non-privileged port the Java process bound itself to.

(In reply to David Ascher (:davida) from comment #1)
> Best to try with a low-end config, and see if we need to bump it up.
>
> Distro: Ubuntu
> Memory: 512M
> CPU count: 1

http://html5.validator.nu/ runs on a VM with 512 MB of RAM and a single CPU, so those specs should be okay.

When the validator is configured to do only HTML5 validation and the user's ability to choose a schema is disabled, there's no Schematron to be run, so the validator should perform pretty well. This is the configuration of http://html5.validator.nu/. On the other hand, if the validator is configured to allow arbitrary user-provided schemas or the legacy HTML 4 schema, it is easy to cause a denial of service by Schematron. This is why http://validator.nu/ occasionally suffers from overloading even when a single post on the Internet spams it.

> Needs https access from the internet Y
> Need other ports besides 80/https: N

Note that unless you patch the validator to only accept input uploaded to it via the Web service interface, the validator can make outbound connections. The content of these connections can be exposed to whoever makes an HTTP request to the validator (indirectly via error messages or directly via "Show Source"). Therefore, it's important to run the validator outside Mozilla's trusted network if users are allowed to connect to the validator from outside Mozilla's trusted network or, alternatively, the validator needs to be patched never to make outbound connections and work only with content uploaded to it via the Web service interface.

My deployment process goes like this:

First, I build the validator on my desktop development machine by running the following command:

python build/build.py all

This pulls fresh code from the repositories, compiles it and runs a local instance for testing. Then I connect to the local instance for quick smoke testing. Then I press return in the terminal to terminate the process.
Then I run a deployment shell script roughly like this (I've omitted the part that deploys http://validator.nu/ since you probably want to deploy just a duplicate of http://html5.validator.nu/; I've changed the hostnames in the example):

#!/bin/sh
python build/build.py --no-self-update --control-port=9999 --port=8888 --log4j=/home/validator/log4j.properties --promiscuous-ssl=on --genericpath=unguessable.bogus.host.name/ --html5path=validator.mozillalabs.com/ --heap=-170 script
python build/build.py --no-self-update tar
python build/build.py --no-self-update --control-port=9999 --scp-target=validator@validator.mozillalabs.com:/home/validator deploy

The first line generates a shell script that will launch the validator with control port 9999 and HTTP port 8888. The control port is used for shutting down the validator, so when the next deployment happens, the next instance will connect to port 9999 of the old instance to get rid of the old instance. There's also the path to the log4j configuration on the target machine, a flag that tells the validator to turn off TLS certificate validation for outbound connections, settings for what paths the generic validator and the HTML5 validator are bound to, and a flag that says to configure the JVM so that the JVM heap is the total memory on the target machine minus 170 MB.

At present, the validator always gets configured with the path for the HTML5-only facet and the generic facet. There's no way to turn the generic facet off altogether right now, so someone malicious could trigger resource-consuming Schematron-based validation by spoofing the Host header in the request. I should probably fix this at some point. For now, I suggest binding the generic facet to an unguessable bogus hostname.

The second line generates tarballs of all the files on the build system that need to be copied to the deployment system.
The third line copies the tarballs to the deployment machine via scp (public key-based authentication assumed) and also creates an empty file called "DEPLOY" in the target directory.

On the deployment virtual machine, I have a shell script like this running frequently as a Cron job. The Cron job frequency determines how quickly after the upload of a new version the new version gets run.

#!/bin/sh
VALIDATOR_HOME=/home/validator
cd "$VALIDATOR_HOME"
pwd
echo "About to check DEPLOY"
if [ -f "DEPLOY" ]; then
  echo "Will deploy"
  rm DEPLOY
  # fall through
elif [ -f "HEAP" ]; then
  echo "Will examine heap"
  NEWHEAP=`grep MemTotal /proc/meminfo | awk '{print $2}'`
  OLDHEAP=`cat HEAP`
  echo "Old: $OLDHEAP"
  echo "New: $NEWHEAP"
  if [ "$NEWHEAP" != "$OLDHEAP" ]; then
    echo "$NEWHEAP" > "HEAP"
    echo "Will deploy with new heap"
  else
    echo "Exiting"
    exit
  fi
else
  echo "Exiting"
  exit
fi
echo "End game"
if [ -f "deps.tar.gz" ]; then
  tar zxf deps.tar.gz
fi
if [ -f "jars.tar.gz" ]; then
  tar zxf jars.tar.gz
fi
if [ -f "run-validator.sh" ]; then
  exec sh run-validator.sh
fi

If a file called "DEPLOY" exists in the validator home directory, this script extracts the two tar files that contain the new version and then executes run-validator.sh, which was generated on the development machine in the deployment phase and got uploaded together with the two tar files. (Note that deps.tar.gz is re-uploaded every time, even when it doesn't need to be. The system currently doesn't have any checks to see if it needs to re-upload that tarball.)

If a file called "DEPLOY" doesn't exist in the validator home directory, but a file called "HEAP" exists, the script will compare the current total RAM size of the machine (MemTotal from /proc/meminfo) with the size recorded in the file called "HEAP". If the sizes differ, the script will redeploy the validator. This is useful when running on a virtual machine whose RAM size can change while the virtual machine is running.
Since the Cron job keeps polling the total RAM size, no manual action is needed after changing the RAM size of the virtual machine. Note that the heap size for the JVM was defined relative to the total RAM size of the deployment machine.
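For readers who want to reason about the redeploy rule without reading the shell, here is a minimal Python sketch of the same decisions: read MemTotal, derive a heap size as "total RAM minus 170 MB", and decide whether to redeploy. The function names are hypothetical, and the rounding used for the heap figure is an assumption; how build.py actually computes the value for --heap=-170 is not shown in this bug.

```python
import re

RESERVE_MB = 170  # mirrors --heap=-170 from the deployment script above


def mem_total_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1))


def heap_mb(total_kb, reserve_mb=RESERVE_MB):
    """Heap size in MB: total RAM minus a fixed reserve (rounding is assumed)."""
    return total_kb // 1024 - reserve_mb


def should_redeploy(deploy_flag_present, recorded_kb, current_kb):
    """Same branching as the Cron script.

    DEPLOY present      -> redeploy
    HEAP file present   -> redeploy only if the recorded size differs
    neither present     -> do nothing (recorded_kb is None)
    """
    if deploy_flag_present:
        return True
    if recorded_kb is None:
        return False
    return recorded_kb != current_kb
```

On a 512 MB machine that reports MemTotal as 524288 kB, heap_mb gives 342 under these assumptions.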
Comment 3•12 years ago
Oh and it's essential to block access to the control port (in my example scripts above 9999) from outside localhost. Otherwise, anyone can ask the validator to terminate.
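One way to check that the block is in effect is to attempt a TCP connection to the control port from a machine other than the validator host. The sketch below is an illustration only: the helper name is hypothetical, and a refused or timed-out connection is merely consistent with the port being blocked, not proof of a correct firewall rule.

```python
import socket


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Run this from OUTSIDE the validator host; hostname taken from the
    # example scripts above, port 9999 is the example control port.
    reachable = port_accepts_connections("validator.mozillalabs.com", 9999)
    print("control port reachable from here:", reachable)
```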
Reporter | ||
Comment 4•12 years ago
Philippe, can we do this soon?
Assignee | ||
Comment 5•12 years ago
(In reply to David Humphrey (:humph) from comment #4)
> Philippe, can we do this soon?

I can start setting up a VM this week.
Assignee | ||
Comment 6•12 years ago
VM will be called: validator1.vm.labs.scl3.mozilla.com
Assignee | ||
Comment 7•12 years ago
Running:

$> python build/build.py all
[...]
"hg" pull --update -R htmlparser http://hg.mozilla.org/projects/htmlparser/
pulling from http://hg.mozilla.org/projects/htmlparser/
searching for changes
no changes found
http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar
Traceback (most recent call last):
  File "build/build.py", line 1081, in <module>
    downloadDependencies()
  File "build/build.py", line 822, in downloadDependencies
    downloadDependency(url, md5sum)
  File "build/build.py", line 816, in downloadDependency
    fetchUrlTo(url, path, md5sum)
  File "build/build.py", line 688, in fetchUrlTo
    f = urllib2.urlopen(url)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Reporter | ||
Comment 8•12 years ago
From the docs:

"
mkdir checker
cd checker
hg clone https://bitbucket.org/validator/build build
python build/build.py all
python build/build.py all

(Yes, the last line is there twice intentionally. Running the script twice tends to fix a ClassCastException on the first run.)
"

Does a second `python build/build.py all` fix it?
Assignee | ||
Comment 9•12 years ago
Yes, each time, it dies the same way.

http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar is 404, looks like it moved to http://downloads.sourceforge.net/project/junit/junit/4.4/junit-4.4.jar

Made the following patch to build/build.py

diff -r ad87a79e8de3 build.py
--- a/build.py Thu Apr 05 18:32:33 2012 +0900
+++ b/build.py Mon Jul 23 17:20:22 2012 -0700
@@ -98,7 +98,7 @@
 ("http://archive.apache.org/dist/commons/fileupload/binaries/commons-fileupload-1.2.1-bin.zip", "975100c3f74604c0c22f68629874f868"),
 ("http://archive.apache.org/dist/ant/binaries/apache-ant-1.7.0-bin.zip" , "ac30ce5b07b0018d65203fbc680968f5"),
 ("http://surfnet.dl.sourceforge.net/sourceforge/iso-relax/isorelax.20041111.zip" , "10381903828d30e36252910679fcbab6"),
-("http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar", "f852bbb2bbe0471cef8e5b833cb36078"),
+("http://downloads.sourceforge.net/project/junit/junit/4.4/junit-4.4.jar", "f852bbb2bbe0471cef8e5b833cb36078"),
 ("http://kent.dl.sourceforge.net/sourceforge/jchardet/chardet.zip", "4091d24451ee9a840933bce34b9e3a55"),
 ("http://kent.dl.sourceforge.net/sourceforge/saxon/saxonb9-1-0-2j.zip", "9e649eec59103593fb75befaa28e1f3d"),
 ]

And things seem to be progressing better.
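The patch works because each dependency is pinned to an MD5 checksum rather than to a particular host, so the same file can be fetched from a different URL and still be verified. The following is a minimal sketch of that idea in Python 3, with a hypothetical mirror-fallback loop added; it is not code from build.py, and the function names are made up for illustration.

```python
import hashlib
import urllib.request


def md5_matches(data, expected_md5):
    """Compare the MD5 hex digest of data with the pinned checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


def fetch_first_valid(urls, expected_md5, opener=urllib.request.urlopen):
    """Try each URL in order and return the first body whose MD5 matches.

    The opener parameter exists only so the function can be exercised
    without network access.
    """
    for url in urls:
        try:
            with opener(url, timeout=30) as response:
                data = response.read()
        except OSError:
            # URLError and HTTPError are OSError subclasses: try the next mirror.
            continue
        if md5_matches(data, expected_md5):
            return data
    raise RuntimeError("no URL returned a file with the expected MD5")


if __name__ == "__main__":
    # Both URLs and the checksum are the ones quoted in this comment.
    junit = fetch_first_valid(
        [
            "http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar",
            "http://downloads.sourceforge.net/project/junit/junit/4.4/junit-4.4.jar",
        ],
        "f852bbb2bbe0471cef8e5b833cb36078",
    )
    print(len(junit), "bytes verified")
```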
Assignee | ||
Comment 10•12 years ago
I ran build/build.py all and the error was:

Could not find the main class: nu.validator.servlet.Main. Program will exit.

Full log attached.
Comment 11•12 years ago
The first error is:

./util/src/nu/validator/xml/PrudentHttpEntityResolver.java:50: package org.apache.log4j does not exist

which means that the log4j dependency was not found. Does ./dependencies/apache-log4j-1.2.15/log4j-1.2.15.jar exist?

I guess I should go over the dependency list again. It's annoying that upstreams don't have stable URLs for these things, so the autodownloader in build.py breaks. (And no, Maven is not the answer. That would be jumping from the frying pan into the fire.)
Comment 12•12 years ago
I just tested a clean install of the validator and log4j download WFM.
Assignee | ||
Comment 13•12 years ago
The VM was missing 'ant', so building log4j didn't work. I installed ant, then nuked dependencies/apache-log4j* and re-ran build/build.py all. It worked this time. I got an instance of the validator up and running.
Assignee | ||
Comment 14•12 years ago
Deployment scripts and cron jobs done as well. All that's missing at this point is the public facing access. For now, the VM can be accessed/tested from inside the VPN at http://validator1.vm.labs.scl3.mozilla.com:8888/
Assignee | ||
Comment 15•12 years ago
It's somewhat live right now over here: https://validator.mozillalabs.com/ One thing of note, the "More Options..." link exposes the supposedly 'hidden' hostname, unfortunately.
Reporter | ||
Comment 16•12 years ago
Awesome. Question, what about supporting this via http? I want to use http://about.validator.nu/html5check.py with this, and it's not happy with https.
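For build scripts that do not want to depend on html5check.py, a small client can post a document directly. This is a sketch only, assuming the instance exposes the validator.nu Web service interface (document in the POST body, out=json for a JSON report with a "messages" list); the URL is the hostname from this bug and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the Labs instance behaves like validator.nu's Web service API.
VALIDATOR_URL = "https://validator.mozillalabs.com/?out=json"


def build_request(html, url=VALIDATOR_URL):
    """Build a POST request carrying the HTML document as the body."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


def extract_errors(response_body):
    """Return only the messages of type "error" from a JSON report."""
    report = json.loads(response_body)
    return [m for m in report.get("messages", []) if m.get("type") == "error"]


def lint(html, url=VALIDATOR_URL):
    """Send the document to the validator and return its error messages."""
    with urllib.request.urlopen(build_request(html, url), timeout=30) as resp:
        return extract_errors(resp.read().decode("utf-8"))
```

A build step could fail when lint(...) returns a non-empty list. Since the request is built separately from the network call, the same code works over http or https by changing the URL.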
Assignee | ||
Comment 17•12 years ago
(In reply to David Humphrey (:humph) from comment #16)
> Awesome. Question, what about supporting this via http? I want to use
> http://about.validator.nu/html5check.py with this, and it's not happy with
> https.

Can you try now? http://validator.mozillalabs.com/ redirects to https://validator.mozillalabs.com/. Or do you mean it needs to be available over http:// directly?
Reporter | ||
Comment 18•12 years ago
That code wants http, though maybe it can be hacked to fix this. How hard is it to serve over http?
Assignee | ||
Comment 19•12 years ago
It's just the default setting for labs sites. I can easily change it.
Reporter | ||
Comment 20•12 years ago
Actually, I think I've fixed it so it can just use https. I'll confirm in a bit.
Assignee | ||
Comment 21•12 years ago
http://validator.mozillalabs.com/ doesn't redirect anymore. Warning: the https:// variant used to send an HSTS header, so your browser will most likely insist on switching over to https for a day.
Assignee | ||
Comment 22•12 years ago
(In reply to David Humphrey (:humph) from comment #20)
> Actually, I think I've fixed it so it can just use https. I'll confirm in a
> bit.

If that's the case, please let me know so I can switch things back to the way they were.
Reporter | ||
Comment 23•12 years ago
The HTTPS one is working perfectly. Thanks so much, Philippe and Henri.

Question: I want to tell various people within Mozilla about this, so they can use it, too. What's the best channel so as to not have it become a public box (or maybe that's OK too)?
Assignee | ||
Comment 24•12 years ago
(In reply to David Humphrey (:humph) from comment #23)
> The HTTPS one is working perfectly. Thanks so much, Philippe and Henri.

Great, I've undone my change, and now http:// access will forward to https://

> Question: I want to tell various people within Mozilla about this, so they
> can use, too.

That's fine.

> What's the best channel so as to not have it become a public
> box (or maybe that's OK too)?

Good point. At the moment, it's effectively a public service from an access control point of view. Should I restrict it somehow, or make it LDAP authenticated?
Reporter | ||
Comment 25•12 years ago
> Good point, at the moment, it's effectively a public service, from an access
> control point of view. Should I restrict it somehow, or make it LDAP
> authenticated ?
Please don't put auth in front of it, since I want this for automated bots and build tools. Let's leave it open for now. It's already finding bugs for us, and I think other Moz or Moz community people will enjoy it.
Assignee | ||
Comment 26•12 years ago
Can we close this bug then ?
Reporter | ||
Comment 27•12 years ago
Indeed. Thanks again.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations
Comment 28•10 years ago
I think the server is down. Where is the right place to file a bug for this?
Comment 29•10 years ago
Ali: This server was turned off as part of the general decommissioning of Mozilla Labs. Both David Humphrey and Jon Buckley confirmed that the site was no longer needed. Are you running a project that needs this service?
Comment 30•10 years ago
(In reply to C. Liang [:cyliang] from comment #29)

I'm not using it personally, but I have published a grunt plugin that uses this server to validate documents. I think people are using the plugin, and I just got a report that the server is down. Should I point them to another server, or what is the right action here?
Comment 31•10 years ago
Whoops, my bad. Before turning off the server, I updated the node module to use the old server, but I didn't rev the version number with the new address. I've published the new version, so it's working now. Sorry :aali and :cyliang!
Comment 32•9 years ago
As reported on GitHub [1], https://html5.validator.nu has been out of order since yesterday (502 Bad Gateway).

[1] https://github.com/mozilla/html5-lint/issues/22
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard