Removed Sphinx documentation output files are not removed from web server

NEW
Unassigned

Status

defect
3 months ago
a month ago

People

(Reporter: ato, Unassigned)

Tracking

(Blocks 2 bugs)

Trunk

Firefox Tracking Flags

(Not tracked)

Reporter

Description

3 months ago

In https://searchfox.org/mozilla-central/diff/8411b140ec1d9a6272d4e18b6b600ed7587ea2c0/testing/marionette/moz.build#19
the location of the Marionette server documentation was moved from
/testing/marionette/marionette to /testing/marionette.

However,
https://firefox-source-docs.mozilla.org/testing/marionette/marionette/index.html
is still published, now in addition to
https://firefox-source-docs.mozilla.org/testing/marionette/index.html.

I guess one way to work around this would be to replace the old doc with a dummy page that has nothing but a meta refresh redirect:

<meta http-equiv="refresh" content="0; url=http://firefox-source-docs.com/newpage" />

But this will result in a ton of these files scattered around the tree and might cause issues with using the back button. Ideally we would have an in-tree "redirects" file that developers could modify that gets read by the webserver. But if setting that up dynamically is too complicated I guess we could read the redirects file at doc build time and generate those "meta refresh" files on the fly. At least then we wouldn't need to check them into the tree.
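
To make the build-time idea concrete, here is a rough sketch of generating the "meta refresh" stubs from a redirects file. Everything about the file is an assumption for illustration: the one-pair-per-line "OLD NEW" format, the function names, and the choice to treat both paths as relative to the docs root. Nothing like this exists in the tree today as far as this bug shows.

```python
import html
from pathlib import Path

# Minimal stub page: an immediate client-side redirect plus a canonical
# link so crawlers can tell which URL is the current one.
REFRESH_TEMPLATE = (
    "<!DOCTYPE html>\n"
    '<meta charset="utf-8">\n'
    '<meta http-equiv="refresh" content="0; url={target}">\n'
    '<link rel="canonical" href="{target}">\n'
)


def parse_redirects(text):
    """Parse a hypothetical redirects file with one 'OLD NEW' pair per line.

    Blank lines and lines starting with '#' are ignored.
    """
    redirects = {}
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 'OLD NEW', got {line!r}")
        old, new = parts
        redirects[old] = new
    return redirects


def write_redirect_stubs(redirects, out_dir):
    """Write one stub per redirect into the generated docs directory."""
    out_dir = Path(out_dir)
    written = []
    for old, new in redirects.items():
        stub = out_dir / old
        if stub.exists():
            # Never overwrite a page that the docs build really produced.
            continue
        stub.parent.mkdir(parents=True, exist_ok=True)
        target = html.escape("/" + new.lstrip("/"), quote=True)
        stub.write_text(REFRESH_TEMPLATE.format(target=target), encoding="utf-8")
        written.append(stub)
    return written
```

Running this after the Sphinx build and before the upload would keep the stubs out of the tree, since only the redirects file itself would be checked in.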

Sorry, this comment is tangentially related to the issue at hand, but doesn't actually address the root problem.

Reporter

Comment 2

3 months ago

I agree it would be nice to not break links, but the immediate
problem is that search engines don’t invalidate out-of-date content.

Both purging the directory and a redirection system would solve
that. I don’t know which one is the best fix.

I imagine we'll want the ability to do both: purge for docs that don't exist anymore, redirect for docs that simply moved. Though if we use server redirects instead of meta refresh, then we'd want to just purge everything anyway. And if we go the meta refresh route, then the old docs will still exist and don't need to be purged.

The upshot being, we should just always purge everything that doesn't exist in the generated docs.

See Also: → 1527363

Since non-existent docs should be purged no matter what, let's keep this bug focused on removing old files. I filed bug 1527363 to track the redirect feature.

Reporter

Updated

3 months ago
Summary: Old documents are not removed → Removed Sphinx documentation output files are not removed from web server

For an instance of that for Marionette see bug 1531068 comment 8. It already causes some confusion.

So it turns out there isn't a web server: firefox-source-docs is an entirely static site hosted in S3. There's a DocUp task that lives in-tree and is responsible for pushing the generated HTML files to the proper S3 bucket.

The code that is responsible for pushing to S3 lives here:
https://searchfox.org/mozilla-central/source/tools/docs/moztreedocs/upload.py

It uses Amazon's boto3 library:
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

Looks like there are APIs to delete objects:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html?highlight=upload_fileobj#S3.Bucket.delete_objects

There's also an API to list objects (though limited to 1000 per call):
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html?highlight=upload_fileobj#S3.Client.list_objects_v2

So I guess we could list all the objects in the bucket, check whether each one still exists in the freshly generated docs, and delete it if it doesn't. This seems a bit inefficient, but I guess it's not a huge deal. Tom Prince mentioned that there might be files on the server from other non-central branches (keyed by Firefox version), so we'll need to be careful not to accidentally delete files that are meant for a different branch. It looks like the list_objects API can handle this with its Prefix parameter.
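
A minimal sketch of that list-then-delete approach, assuming a boto3 S3 client is passed in (for example `boto3.client("s3")`). The function names, the bucket and prefix values, and the dry-run default are illustrative and not taken from upload.py. It relies only on `list_objects_v2` (with `Prefix` and `ContinuationToken`) and `delete_objects`, and it batches deletions because `delete_objects` accepts at most 1000 keys per request.

```python
def list_keys(client, bucket, prefix):
    """Yield every object key under `prefix`, following continuation tokens.

    list_objects_v2 returns at most 1000 keys per call, so keep asking
    until the response is no longer marked as truncated.
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            return
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def purge_stale(client, bucket, prefix, built_keys, dry_run=True):
    """Delete objects under `prefix` that the current build did not produce.

    `built_keys` are the full S3 keys (prefix included) of the files that
    were just generated. Returns the sorted list of stale keys.
    """
    # Require a trailing slash so that a prefix such as "main/" can never
    # match a sibling like "mainline/", and so that an empty prefix
    # (the whole bucket) is rejected outright.
    if not prefix.endswith("/"):
        raise ValueError("prefix must end with '/' to stay inside one branch")
    built = set(built_keys)
    stale = sorted(k for k in list_keys(client, bucket, prefix) if k not in built)
    if not dry_run:
        # delete_objects takes at most 1000 keys per request.
        for start in range(0, len(stale), 1000):
            batch = stale[start:start + 1000]
            client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
            )
    return stale
```

Defaulting to a dry run seems prudent here: the returned list can be logged by the upload task and inspected before anything is actually removed.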

Alternatively, the AWS CLI seems to have a sync --delete command:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Maybe we could look into calling that instead of using boto3, but that might be tricky if we have files from all branches in the same namespace.
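
For comparison, a sketch of what calling the CLI from the upload script might look like. The helper names, bucket, and prefix are hypothetical; `--delete` and `--dryrun` are real `aws s3 sync` options. Scoping the destination to a per-branch prefix is the assumption that keeps `--delete` from touching other branches' files.

```python
import subprocess


def sync_command(local_dir, bucket, prefix, dry_run=True):
    """Build an `aws s3 sync --delete` command limited to one branch prefix."""
    prefix = prefix.strip("/")
    if not prefix:
        # Syncing to the bucket root with --delete would remove every
        # other branch's docs, so refuse to build that command.
        raise ValueError("a non-empty branch prefix is required")
    cmd = [
        "aws", "s3", "sync",
        str(local_dir),
        f"s3://{bucket}/{prefix}/",
        "--delete",  # remove remote files that are missing locally
    ]
    if dry_run:
        cmd.append("--dryrun")  # only print what would be done
    return cmd


def run_sync(local_dir, bucket, prefix, dry_run=True):
    """Run the sync; raises CalledProcessError if the CLI exits non-zero."""
    return subprocess.run(sync_command(local_dir, bucket, prefix, dry_run), check=True)
```

This would add a dependency on the AWS CLI being installed wherever the DocUp task runs, which boto3 alone does not need.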

In either case (aws s3 sync or boto) the idea would be to list files under a branch-specific prefix, and then delete any files which are no longer part of the docs build. A further idea might be to replace such files with a redirect object (it's possible to set a Location header on an S3 object) to the root directory of that branch, so that users don't see mysterious 404 pages.
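
A sketch of the redirect-object idea using boto3's `put_object` with its `WebsiteRedirectLocation` parameter, which S3 stores as `x-amz-website-redirect-location` metadata. Two assumptions to flag: S3 only honours this metadata when the bucket is served through its static website endpoint (this bug doesn't say how firefox-source-docs is fronted), and the function names and example paths below are made up for illustration.

```python
def put_redirect(client, bucket, key, target):
    """Replace `key` with an empty object that redirects to `target`."""
    # S3 requires the redirect location to be an absolute URL or a
    # path starting with "/" inside the same bucket.
    if not target.startswith(("/", "http://", "https://")):
        raise ValueError("target must start with '/', 'http://' or 'https://'")
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=b"",
        ContentType="text/html",
        WebsiteRedirectLocation=target,
    )


def redirect_stale_to_root(client, bucket, stale_keys, branch_root):
    """Point every stale key at the root of its branch instead of deleting it.

    `branch_root` is a hypothetical path such as "/main/index.html".
    Returns the number of redirect objects written.
    """
    count = 0
    for key in stale_keys:
        put_redirect(client, bucket, key, branch_root)
        count += 1
    return count
```

Since the redirect objects live under the same prefix as real docs, a later purge pass would need some way to tell them apart from stale files, for instance by checking object size or metadata.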

Andrew, would this make a good mentored bug? It is GSoC / Outreachy season, after all..

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #7)

> In either case (aws s3 sync or boto) the idea would be to list files under a branch-specific prefix, and then delete any files which are no longer part of the docs build. A further idea might be to replace such files with a redirect object (it's possible to set a Location header on an S3 object) to the root directory of that branch, so that users don't see mysterious 404 pages.

Great to hear we can specify redirects! Bug 1527363 will track this. I'd propose we solve this bug first and only keep the focus on deleting non-existent files. Then we can use the other bug to add a "redirects" file or moz.build variable to the tree. Developers can then choose whether or not to let their deleted docs get purged or redirected.

> Andrew, would this make a good mentored bug? It is GSoC / Outreachy season, after all..

Yes, I actually have a GSoC proposal centered around improving our doc generation (which I'll now expand the scope of to include doc upload). Though this is something I think I'd want to reserve for whoever ends up being selected. So maybe I'll keep my name off the mentor field for now.

Blocks: 1527363
See Also: 1527363