Create tests to verify cfx sdocs output

RESOLVED FIXED

Status

P2
normal
RESOLVED FIXED
8 years ago
7 years ago

People

(Reporter: wbamberg, Assigned: wbamberg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

8 years ago
As a response to bug 629922, we should have tests for the output of cfx sdocs. This could run cfx sdocs, and for all links in the output verify that there is a page on the end of the link.
I did a quick search for python web-spidering tools, and asked Ian Bicking,
and here were a few contenders:

  - twill: http://twill.idyll.org/
  - http://pypi.python.org/pypi/linkchecker
  - http://arthurdejong.org/webcheck/
  - http://pypi.python.org/pypi/spider.py/0.5
  - http://pypi.python.org/pypi/spydey/0.3
    - uses httplib2.googlecode.com

Ian points out that the 'lxml' module is probably the best
HTML-parser/link-extractor tool around, and a number of those tools use it.
Most of those tools are aimed at the command line, and/or have some
inconvenient dependencies which we wouldn't want to add to the SDK, but we
might be able to extract some ideas or techniques from them.

Running 'spydey' against the 'cfx docs' server made me realize that our docs
isn't actually very static: I don't think the HTML of the front page has
links to any of the other pages (it's all being driven by scripts). So we
need some way to enumerate the links that need to be crawled. Maybe we should
add a secondary page (/all-links.html?) that contains links to enough pages
to bootstrap the crawl.
(Assignee)

Comment 2

8 years ago
Thanks Brian. I will have a look at these.

> Running 'spydey' against the 'cfx docs' server made me realize that our docs
> isn't actually very static: I don't think the HTML of the front page has
> links to any of the other pages (it's all being driven by scripts). So we
> need some way to enumerate the links that need to be crawled. Maybe we should
> add a secondary page (/all-links.html?) that contains links to enough pages
> to bootstrap the crawl.

Well, after bug 560008 lands they will be much, much more static, so this ought to get easier.
Oh, also BeautifulSoup is good, http://www.crummy.com/software/BeautifulSoup , for parsing HTML to find the links.
also our own Ian Bicking wrote http://blog.ianbicking.org/2010/04/02/webtest-http-testing/ , which can run in the same process as the WSGI app it's testing. I'm not sure if it has any scan-for-links features, though, but maybe we could glue it to a parser for that.
Priority: -- → P3
Hardware: x86 → All
Target Milestone: --- → 1.0
(automatic reprioritization of 1.0 bugs)
Priority: P3 → P2
Target Milestone: 1.0 → 1.1
(Pushing all open bugs to the --- milestone for the new triage system)
Target Milestone: 1.1 → ---
(Assignee)

Comment 7

7 years ago
Created attachment 575105 [details] [diff] [review]
Test links in documentation

Here's a patch that generates the doc set, and looks through all the HTML files looking at the href attribute in each <a> element, and validates it by:
- checking it isn't a file:// URL
- if it's a relative URL, adding the baseurl to it and checking there's a file at the end of the resulting absolute URL

This test found a few bugs, which are also fixed in the patch.

First, just running HTMLParser on the code found a few problems. The only interesting one is the presence of "<" and ">" in the "author" key in package.json files, which we're now escaping using cgi.escape. I'm not sure whether we should use that for all the input files, but I was a bit worried about possible side-effects of that, so I'm only doing it in one place. Really, I think we should remove the "author" field from the package pages, as it's not really maintained, and we have a "Credits" page which works better.

Second, there are a few broken links. The most interesting are the e10s guide and API reference documents, which I decided to remove on the understanding that they are no longer at all accurate.

Finally, although this test runs fine on Windows, I am seeing (only on Windows) the intermittent "directory is not empty" error from shutil.rmtree, which is tracked in bug 659478. I don't think it occurs any more often with this patch. I also don't know what's causing that error. My best guess is that a file handle is sometimes not getting closed in time, but I did try going through generate_static_docs and explicitly closing files, but that did not make it go away. Maybe I missed one. I'll have another look tomorrow.
Attachment #575105 - Flags: review?(warner-bugzilla)
Comment on attachment 575105 [details] [diff] [review]
Test links in documentation

looks great! The only suggestion I could think of in get_base_url() would be to go the other way: strip any leading "/", then always add one back in:

  base_url_path = get_base_url_path().lstrip("/")
  return "file://"+"/"+"/".join(base_url_path.split(os.sep))+"/"

but I'm not sure that's a whole lot better.

You might also consider building some of these strings fewer times: get_base_url() is going to be called for every href in the whole tree, and will always return the same value. (you might call it once in Generate_Docs_Tests and then pass the value into your Link_Checker() constructor).

It might also be time to rename these test methods: they're more than a mere smoketest now. Maybe just test_generate_static_docs and test_generate_docs ?

But it looks perfectly fine as-is.. r+!
Attachment #575105 - Flags: review?(warner-bugzilla) → review+
(Assignee)

Updated

7 years ago
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.