Closed Bug 763804 Opened 12 years ago Closed 12 years ago

Mozilla validator.nu instance for linting HTML

Categories

(Infrastructure & Operations Graveyard :: WebOps: Labs, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: humph, Assigned: gozer)

Attachments

(1 file)

I want to run an instance of the HTML5 Validator (http://about.validator.nu/#src) for use in MoFo software project build systems.  I want to be able to do HTML linting in our build systems for Popcorn, Popcorn Maker, etc. and thought that we should run an instance instead of spamming http://validator.nu/ with requests all the time.

According to the setup info, this wants Ubuntu with Python, Mercurial, Subversion, JDK5 or 6, and a bunch of deps:

"commons-codec-1.4/commons-codec-1.4.jar",
"commons-httpclient-3.1/commons-httpclient-3.1.jar",
"commons-logging-1.1.1/commons-logging-1.1.1.jar",
"commons-logging-1.1.1/commons-logging-adapters-1.1.1.jar",
"commons-logging-1.1.1/commons-logging-api-1.1.1.jar",
"icu4j-charsets-4_4_2.jar",
"icu4j-4_4_2.jar",
"iri-0.5/lib/iri.jar",
"jetty-6.1.26/lib/servlet-api-2.5-20081211.jar",
"jetty-6.1.26/lib/jetty-6.1.26.jar",
"jetty-6.1.26/lib/jetty-util-6.1.26.jar",
"jetty-6.1.26/lib/ext/jetty-ajp-6.1.26.jar",
"apache-log4j-1.2.15/log4j-1.2.15.jar",
"rhino1_7R1/js.jar",
"xerces-2_9_1/xercesImpl.jar",
"xerces-2_9_1/xml-apis.jar",
"slf4j-1.5.2/slf4j-log4j12-1.5.2.jar",
"commons-fileupload-1.2.1/lib/commons-fileupload-1.2.1.jar",
"isorelax.jar",
"mozilla/intl/chardet/java/dist/lib/chardet.jar",
"saxon9.jar"

I'm not clear on hardware requirements for the box, and have CC'ed Henri in case he has suggestions.  It won't get pummeled with traffic, but given that we'll use it in our build systems and automation scripts, it will get used fairly often.

Calling it something like validator.mozillalabs.org or something would be useful.  It also seems to run on port 8888, but that might be configurable.  We won't be sending any sensitive/private HTML to the service, so HTTP is fine.
Best to try with a low-end config, and see if we need to bump it up.

    Distro: Ubuntu
    Memory: 512M
    CPU count: 1
    Disk size: 8G is the minimum, say if you need more.
    Project name: Validator
    Project owner: whatever humph uses
    Needs https access from the internet Y
        Need other ports besides 80/https: N
    Needs a DNS domain name (i.e. myproject.org) validator.mozillalabs.com
    Good to know what installed services you'll be running:
        Java
    Does the project have a home page explaining what it is? 
        http://validator.nu/
Assignee: server-ops-labs → gozer
Status: NEW → ASSIGNED
(In reply to David Humphrey (:humph) from comment #0)
> I want to run an instance of the HTML5 Validator
> (http://about.validator.nu/#src) for use in MoFo software project build
> systems.  I want to be able to do HTML linting in our build systems for
> Popcorn, Popcorn Maker, etc. and thought that we should run an instance
> instead of spamming http://validator.nu/ with requests all the time.

If you want HTML5 linting, it is probably the best to configure the Labs instance like http://html5.validator.nu/ rather than http://validator.nu/. This way, it will default to HTML5 and you don't need to pass the API parameters for requesting HTML5.
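
For build tools, a request against such an instance might look like the following minimal sketch. It assumes the Web service accepts the document as the POST body with an HTML Content-Type and that `out=json` selects a JSON response containing a `messages` list, as on validator.nu. The function names and the default URL are placeholders, not part of the validator itself.

```python
import json
import urllib.request

# Assumed endpoint; substitute the hostname of your own instance.
VALIDATOR_URL = "https://html5.validator.nu/"


def build_request(html_bytes, base_url=VALIDATOR_URL):
    """Build a POST request that carries the document as the request body."""
    return urllib.request.Request(
        base_url + "?out=json",
        data=html_bytes,
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


def count_errors(response_text):
    """Count the messages of type "error" in a JSON response body."""
    messages = json.loads(response_text).get("messages", [])
    return sum(1 for m in messages if m.get("type") == "error")


def validate(html_bytes, base_url=VALIDATOR_URL):
    """Send the document and return the error count (needs network access)."""
    with urllib.request.urlopen(build_request(html_bytes, base_url)) as resp:
        return count_errors(resp.read().decode("utf-8"))
```

A build step could call `validate(open("index.html", "rb").read())` and fail the build when the result is non-zero.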

> According to the setup info, this wants Ubuntu with Python, Mercurial,
> Subversion, JDK5 or 6, and a bunch of deps:

JDK 5 is legacy software at this point. I recommend running OpenJDK 6. OpenJDK 7 most likely works, but I haven't tested it.

I have tested the software only on Ubuntu and Mac OS X, but there's no reason to believe that it wouldn't run on any Linux distribution, so it probably makes sense to use whatever distribution works best for server operations and not be too focused on the fact that I've been using Ubuntu.

> Calling it something like validator.mozillalabs.org or something would be
> useful.  It also seems to run on port 8888, but that might be configurable. 
> We won't be sending any sensitive/private HTML to the service, so HTTP is
> fine.

A non-root process can't bind to port 80, you're not supposed to start the validator as root, and on Linux elevating the binding privileges of a multi-threaded Java process doesn't really work. This is why it makes sense to make the Java process bind to a non-privileged port and to use iptables in the kernel to redirect port 80 to whatever non-privileged port the Java process bound itself to.
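
The redirect described above can be sketched as follows. This only assembles the iptables command line; actually applying it requires root on the target machine. The helper name is hypothetical, and the port numbers are the ones used in this bug (80 public, 8888 for the Java process).

```python
def redirect_rule(public_port=80, validator_port=8888):
    """Return the iptables argv that redirects public_port to validator_port."""
    return [
        "iptables", "-t", "nat", "-A", "PREROUTING",
        "-p", "tcp", "--dport", str(public_port),
        "-j", "REDIRECT", "--to-ports", str(validator_port),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(redirect_rule()))
```

A rule in the PREROUTING chain applies to connections arriving from other hosts, not to connections made from the machine itself, so local smoke tests should still use the non-privileged port directly.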

(In reply to David Ascher (:davida) from comment #1)
> Best to try with a low-end config, and see if we need to bump it up.
> 
>     Distro: Ubuntu
>     Memory: 512M
>     CPU count: 1

http://html5.validator.nu/ runs on a VM with 512 MB of RAM and a single CPU, so those specs should be okay.

When the validator is configured to do only HTML5 validation and the user's ability to choose a schema is disabled, there's no Schematron to be run, so the validator should perform pretty well. This is the configuration of http://html5.validator.nu/. On the other hand, if the validator is configured to allow arbitrary user-provided schemas or the legacy HTML 4 schema, it is easy to cause a denial of service by Schematron. This is why http://validator.nu/ occasionally suffers from overloading even when only a single host on the Internet spams it.

>     Needs https access from the internet Y
>         Need other ports besides 80/https: N

Note that unless you patch the validator to only accept input uploaded to it via the Web service interface, the validator can make outbound connections. The content of these connections can be exposed to whoever makes an HTTP request to the validator (indirectly via error messages or directly via "Show Source"). Therefore, it's important to run the validator outside Mozilla's trusted network if users are allowed to connect to the validator from outside Mozilla's trusted network or, alternatively, the validator needs to be patched never to make outbound connections and work only with content uploaded to it via the Web service interface.

My deployment process goes like this:

First, I build the validator on my desktop development machine by running the following command:
python build/build.py all

This pulls fresh code from the repositories, compiles it and runs a local instance for testing. Then I connect to the local instance for quick smoke testing. Then I press return in the terminal to terminate the process.

Then I run a deployment shell script roughly like this (I've omitted the part that deploys http://validator.nu/ since you probably want to deploy just a duplicate of http://html5.validator.nu/; I've changed the hostnames in the example):

#!/bin/sh
python build/build.py --no-self-update --control-port=9999 --port=8888 --log4j=/home/validator/log4j.properties  --promiscuous-ssl=on --genericpath=unguessable.bogus.host.name/ --html5path=validator.mozillalabs.com/ --heap=-170 script
python build/build.py --no-self-update tar
python build/build.py --no-self-update --control-port=9999 --scp-target=validator@validator.mozillalabs.com:/home/validator deploy

The first line generates a shell script that will launch the validator with control port 9999 and HTTP port 8888. The control port is used for shutting down the validator, so when the next deployment happens the next instance will connect to port 9999 of the old instance to get rid of the old instance. There's also the path to the log4j configuration on the target machine, a flag that tells the validator to turn off TLS certificate validation for outbound connections, settings for what paths the generic validator and the HTML5 validator are bound to and a flag that tells the script to configure the JVM such that the JVM heap is the total memory on the target machine minus 170 MB.
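
The heap arithmetic behind `--heap=-170` can be illustrated with a small sketch. How build.py actually derives the value is not shown in this bug, so the functions below are assumptions that simply subtract the reserve from MemTotal as reported in /proc/meminfo.

```python
def total_memory_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def heap_mb(meminfo_text, reserve_mb=170):
    """Heap size in MB: total RAM minus the reserve left for everything else."""
    return total_memory_kb(meminfo_text) // 1024 - reserve_mb
```

On a machine reporting `MemTotal: 524288 kB` (512 MB), this leaves 342 MB for the JVM heap.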

At present, the validator always gets configured with the path for the HTML5-only facet and the generic facet. There's no way to turn the generic facet off altogether right now, so someone malicious could trigger resource-consuming Schematron-based validation by spoofing the Host header in the request. I should probably fix this at some point. For now, I suggest binding the generic facet to an unguessable bogus hostname.

The second line generates tarballs of all the files on the build system that need to be copied to the deployment system.

The third line copies the tarballs to the deployment machine via scp (public key-based authentication assumed) and creates an empty file called "DEPLOY" in the target directory also.

On the deployment virtual machine, I have a shell script like this running as a Cron job frequently. The Cron job frequency determines how quickly after the upload of a new version the new version gets run.

#!/bin/sh

VALIDATOR_HOME=/home/validator

cd "$VALIDATOR_HOME"

pwd
echo "About to check DEPLOY"
if [ -f "DEPLOY" ]; then
  echo "Will deploy"
  rm DEPLOY
  # fall through
elif [ -f "HEAP" ]; then
  echo "Will examine heap"
  NEWHEAP=$(grep MemTotal /proc/meminfo | awk '{print $2}')
  OLDHEAP=$(cat HEAP)
  echo "Old: $OLDHEAP"
  echo "New: $NEWHEAP"
  # Quote both sides so an empty HEAP file doesn't break the test
  if [ "$NEWHEAP" != "$OLDHEAP" ]; then
    echo "$NEWHEAP" > "HEAP"
    echo "Will deploy with new heap"
  else
    echo "Exiting"
    exit
  fi
else
  echo "Exiting"
  exit
fi

echo "End game"
if [ -f "deps.tar.gz" ]; then
  tar zxf deps.tar.gz
fi
if [ -f "jars.tar.gz" ]; then
  tar zxf jars.tar.gz
fi
if [ -f "run-validator.sh" ]; then
  exec sh run-validator.sh
fi

If a file called "DEPLOY" exists in the validator home directory, this script extracts the two tar files that contain the new version and then executes run-validator.sh which was generated on the development machine in the deployment phase and got uploaded together with the two tar files. (Note that deps.tar.gz is re-uploaded every time even when it doesn't need to be. The system currently doesn't have any checks to see if it needs to re-upload that tarball.)

If a file called "DEPLOY" doesn't exist in the validator home directory, but a file called "HEAP" exists, the script will compare the current total RAM of the machine (MemTotal in /proc/meminfo) with the size recorded in the file called "HEAP". If the sizes differ, the script will redeploy the validator. This is useful when running on a virtual machine whose RAM size can change while the virtual machine is running. Since the Cron job keeps polling the RAM size, no manual action is needed after changing the RAM size of the virtual machine. Note that the heap size for the JVM was defined relative to the total RAM size of the deployment machine.
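
The branches of the Cron script can be summarized in a few lines. This is only an illustrative restatement of the decision logic, not a replacement for the shell script; the function name and the returned labels are made up for the example.

```python
def cron_action(deploy_exists, recorded_ram, current_ram):
    """Mirror the branches of the Cron script.

    recorded_ram is the content of the HEAP file, or None if the file
    does not exist; current_ram is MemTotal from /proc/meminfo.
    """
    if deploy_exists:
        return "deploy"
    if recorded_ram is not None and recorded_ram != current_ram:
        return "deploy-with-new-heap"
    return "exit"
```

Note that the DEPLOY marker takes precedence: when it is present, the HEAP file is not consulted at all.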
Oh and it's essential to block access to the control port (in my example scripts above 9999) from outside localhost. Otherwise, anyone can ask the validator to terminate.
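
One possible way to do this is an iptables rule that drops traffic to the control port unless it arrives on the loopback interface. The sketch below only assembles the command line (applying it requires root); the helper name is hypothetical and 9999 is the control port from the example scripts.

```python
def control_port_block_rule(control_port=9999):
    """Return the iptables argv that drops non-loopback traffic to the control port."""
    return [
        "iptables", "-A", "INPUT", "!", "-i", "lo",
        "-p", "tcp", "--dport", str(control_port),
        "-j", "DROP",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(control_port_block_rule()))
```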
Philippe, can we do this soon?
(In reply to David Humphrey (:humph) from comment #4)
> Philippe, can we do this soon?

I can start setting up a VM this week.
VM will be called:
 validator1.vm.labs.scl3.mozilla.com
Running:

$> python build/build.py all

[...]

"hg" pull --update -R htmlparser http://hg.mozilla.org/projects/htmlparser/
pulling from http://hg.mozilla.org/projects/htmlparser/
searching for changes
no changes found
http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar
Traceback (most recent call last):
  File "build/build.py", line 1081, in <module>
    downloadDependencies()
  File "build/build.py", line 822, in downloadDependencies
    downloadDependency(url, md5sum)
  File "build/build.py", line 816, in downloadDependency
    fetchUrlTo(url, path, md5sum)
  File "build/build.py", line 688, in fetchUrlTo
    f = urllib2.urlopen(url)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
From the docs:

"
mkdir checker
cd checker
hg clone https://bitbucket.org/validator/build build
python build/build.py all
python build/build.py all

(Yes, the last line is there twice intentionally. Running the script twice tends to fix a ClassCastException on the first run.)
"

Does a second `python build/build.py all` fix it?
No, it dies the same way each time.

http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar is 404, looks like it
moved to

http://downloads.sourceforge.net/project/junit/junit/4.4/junit-4.4.jar

Made the following patch to build/build.py

diff -r ad87a79e8de3 build.py
--- a/build.py	Thu Apr 05 18:32:33 2012 +0900
+++ b/build.py	Mon Jul 23 17:20:22 2012 -0700
@@ -98,7 +98,7 @@
   ("http://archive.apache.org/dist/commons/fileupload/binaries/commons-fileupload-1.2.1-bin.zip", "975100c3f74604c0c22f68629874f868"),
   ("http://archive.apache.org/dist/ant/binaries/apache-ant-1.7.0-bin.zip" , "ac30ce5b07b0018d65203fbc680968f5"),
   ("http://surfnet.dl.sourceforge.net/sourceforge/iso-relax/isorelax.20041111.zip" , "10381903828d30e36252910679fcbab6"),
-  ("http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar", "f852bbb2bbe0471cef8e5b833cb36078"),
+  ("http://downloads.sourceforge.net/project/junit/junit/4.4/junit-4.4.jar", "f852bbb2bbe0471cef8e5b833cb36078"),
   ("http://kent.dl.sourceforge.net/sourceforge/jchardet/chardet.zip", "4091d24451ee9a840933bce34b9e3a55"),
   ("http://kent.dl.sourceforge.net/sourceforge/saxon/saxonb9-1-0-2j.zip", "9e649eec59103593fb75befaa28e1f3d"),
 ]

And things seem to be progressing better.
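
Each entry in the dependency list pairs a URL with an MD5 sum, so a file fetched from a relocated URL can be checked against the unchanged checksum. A minimal sketch of such a check follows; the function name is illustrative and is not taken from build.py.

```python
import hashlib


def md5_matches(data, expected_md5):
    """Return True if the MD5 hex digest of data equals expected_md5."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()
```

For the junit entry above, the bytes downloaded from the new URL would be compared against "f852bbb2bbe0471cef8e5b833cb36078".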
I ran build/build.py all and the error was

Could not find the main class: nu.validator.servlet.Main. Program will exit.

Full log attached
The first error is:
./util/src/nu/validator/xml/PrudentHttpEntityResolver.java:50: package org.apache.log4j does not exist
which means that the log4j dependency was not found.

Does
./dependencies/apache-log4j-1.2.15/log4j-1.2.15.jar 
exist?

I guess I should go over the dependency list again. It's annoying that upstreams don't have stable URLs for these things, so the autodownloader in build.py breaks. (And no, Maven is not the answer. That would be jumping from the frying pan into the fire.)
I just tested a clean install of the validator and log4j download WFM.
The VM was missing 'ant', so building log4j didn't work. I installed ant, then nuked
dependencies/apache-log4j* and re-ran build/build.py all.

It worked this time. I got an instance of the validator up and running.
Deployment scripts and cron jobs done as well.

All that's missing at this point is the public facing access. For now, the VM
can be accessed/tested from inside the VPN at

http://validator1.vm.labs.scl3.mozilla.com:8888/
It's somewhat live right now over here:

https://validator.mozillalabs.com/

One thing of note, the "More Options..." link exposes the supposedly 'hidden' hostname, unfortunately.
Awesome.  Question, what about supporting this via http?  I want to use http://about.validator.nu/html5check.py with this, and it's not happy with https.
(In reply to David Humphrey (:humph) from comment #16)
> Awesome.  Question, what about supporting this via http?  I want to use
> http://about.validator.nu/html5check.py with this, and it's not happy with
> https.

Can you try now? http://validator.mozillalabs.com/ redirects to https://validator.mozillalabs.com/.

Or do you mean it needs to be available over http:// directly?
That code wants http, though maybe it can be hacked to fix this.  How hard is it to serve over http?
It's just the default setting for labs sites. I can easily change it.
Actually, I think I've fixed it so it can just use https.  I'll confirm in a bit.
http://validator.mozillalabs.com/ doesn't redirect anymore.

Warning: the https:// variant used to send an HSTS header, so your browser will most likely insist on switching over to https for a day.
(In reply to David Humphrey (:humph) from comment #20)
> Actually, I think I've fixed it so it can just use https.  I'll confirm in a
> bit.

If that's the case, please let me know so I can switch things back to the way they were.
The HTTPS one is working perfectly.  Thanks so much, Philippe and Henri.

Question: I want to tell various people within Mozilla about this, so they can use it, too.  What's the best channel so as to not have it become a public box (or maybe that's OK too)?
(In reply to David Humphrey (:humph) from comment #23)
> The HTTPS one is working perfectly.  Thanks so much, Philippe and Henri.

Great, I've undone my change, and now http:// access will forward to https://

> Question: I want to tell various people within Mozilla about this, so they
> can use, too. 

That's fine.

> What's the best channel so as to not have it become a public
> box (or maybe that's OK too)?

Good point, at the moment, it's effectively a public service, from an access control point of view. Should I restrict it somehow, or make it LDAP authenticated ?
> Good point, at the moment, it's effectively a public service, from an access
> control point of view. Should I restrict it somehow, or make it LDAP
> authenticated ?

Please don't put auth in front of it, since I want this for automated bots and build tools.  Let's leave it open for now.  It's already finding bugs for us, and I think other Moz or Moz community people will enjoy it.
Can we close this bug then ?
Indeed.  Thanks again.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
I think the server is down. Where is the right place to file a bug for this?
Ali:

This server was turned off as part of the general decommissioning of Mozilla Labs.  Both David Humphrey and Jon Buckley confirmed that the site was no longer needed.

Are you running a project that needs this service?
(In reply to C. Liang [:cyliang] from comment #29)

I'm not using that personally, but I have published a grunt plugin that uses this server to validate documents. I think people are using it, and I just got a report that the server is down.

I'm not sure whether I should point them to another server, or what the right action is here.
Whoops, my bad. Before turning off the server, I updated the node module to use the old server, but I didn't rev the version number with the new address. I've published the new version, so it's working now.

Sorry :aali and :cyliang!
As reported on GitHub[1], https://html5.validator.nu has been out of order since yesterday. (502 Bad Gateway)

[1] https://github.com/mozilla/html5-lint/issues/22
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard