Closed Bug 1839929 Opened 11 months ago Closed 10 months ago

Need to update SiteMap extension to upload xml files to S3 and to do nightly as a cron task

Categories

(bugzilla.mozilla.org :: Extensions, defect)

Tracking

RESOLVED FIXED

People

(Reporter: dkl, Assigned: dkl)

References

Details

Attachments

(1 file)

Currently, when the sitemap xml files are out of date, the code regenerates them dynamically when a client accesses page.cgi?id=sitemap/sitemap.xml. In the past this was not an issue: the initial load was just slower, and each load after that was quick.

Today, after I fixed a different bug where Mojolicious was blocking the xml files, the worker is now killed off before the new xml sitemap files can be generated.

I suggest we instead make a command-line script that generates the xml.gz files and cron the script once a day in the background. Once a day should be fine IMO.

Then we just make page.cgi?id=sitemap/sitemap.xml load the pre-generated files quickly and not dynamically generate them.
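
Rough sketch of what the nightly generation script could look like. This is only an illustration in Python; the real SiteMap extension is written in Perl, and the bug ID source, output path, and filenames here are made-up placeholders.

  #!/usr/bin/env python3
  # Illustration only: generate a gzipped sitemap file offline so the web
  # heads never have to build it on demand. The bug ID source, output path,
  # and filenames are placeholders, not the real configuration.
  import gzip

  SITEMAP_DIR = "/var/lib/bmo/sitemap"   # placeholder output location
  BASE_URL = "https://bugzilla.mozilla.org"

  def fetch_public_bug_ids():
      # Placeholder: the real script would query the database for bugs
      # that are publicly visible.
      return range(1, 50001)

  def write_sitemap(bug_ids, index):
      urls = "\n".join(
          f"  <url><loc>{BASE_URL}/show_bug.cgi?id={bug_id}</loc></url>"
          for bug_id in bug_ids
      )
      xml = (
          '<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
          f"{urls}\n</urlset>\n"
      )
      path = f"{SITEMAP_DIR}/sitemap{index}.xml.gz"
      with gzip.open(path, "wt", encoding="utf-8") as fh:
          fh.write(xml)
      return path

  if __name__ == "__main__":
      # Intended to be run nightly from cron, e.g.: 0 3 * * * generate_sitemaps.py
      write_sitemap(fetch_public_bug_ids(), 1)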

OK, this is not going to work either. We used to store the gzipped xml sitemap files on the shared /data filesystem accessed by all AWS webheads. That was before the migration to GCP, in which we did away with the shared filesystem, and it will not be coming back.

So we basically have 4 options that I can see:

  1. Migrate the gzipped xml files, etc. into the database itself. This is the easiest, but the files can be quite large and I like to keep the DB size down.
  2. Generate the files once per day using a cron task and upload them to Amazon S3, like we do with other files that used to reside in /data/. dlactin said we can create a public S3 bucket that is read-only, and I can put links to the S3 files directly in the sitemap.xml file. The cron task would then use client keys to be able to upload the files each night (see the sketch after this list).
  3. dlactin mentioned having a read-only volume in k8s that is mapped into each web head and contains the generated files. The cron task that runs once a day would have write access to the volume to create the files. We are already doing something similar with Lando.
  4. Turn off support for the sitemap index files in robots.txt and disallow Google, etc. from indexing bugs in BMO.
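
For #2, here is a rough sketch of what the nightly upload could look like (Python/boto3 for illustration only; the bucket name, key prefix, and local paths are placeholders, not the real configuration):

  #!/usr/bin/env python3
  # Illustration only: push the nightly-generated sitemap files to a public,
  # read-only S3 bucket and build a sitemap index that points at the S3 URLs.
  # The bucket name, key prefix, and local paths are placeholders.
  import boto3

  BUCKET = "bmo-sitemap-public"                        # placeholder bucket
  PREFIX = "sitemap"
  PUBLIC_BASE = f"https://{BUCKET}.s3.amazonaws.com/{PREFIX}"

  def upload_sitemaps(paths):
      # Credentials come from the usual boto3 sources (the "client keys"
      # mentioned above, env vars, or an instance role).
      s3 = boto3.client("s3")
      for path in paths:
          key = f"{PREFIX}/{path.rsplit('/', 1)[-1]}"
          s3.upload_file(
              path, BUCKET, key,
              ExtraArgs={"ContentType": "application/xml",
                         "ContentEncoding": "gzip"},
          )

  def build_index(filenames):
      # The sitemap index served by page.cgi?id=sitemap/sitemap.xml would
      # contain direct links to the S3-hosted files.
      entries = "\n".join(
          f"  <sitemap><loc>{PUBLIC_BASE}/{name}</loc></sitemap>"
          for name in filenames
      )
      return (
          '<?xml version="1.0" encoding="UTF-8"?>\n'
          '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
          f"{entries}\n</sitemapindex>\n"
      )

  if __name__ == "__main__":
      files = ["/var/lib/bmo/sitemap/sitemap1.xml.gz"]  # placeholder paths
      upload_sitemaps(files)
      print(build_index(["sitemap1.xml.gz"]))

With this approach the web heads only serve the small pre-built index; the gzipped files themselves are fetched from S3 by the crawlers.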

#3 would be the least amount of work if it works as dlactin proposes. Minimal changes would be needed to move the code so that it runs as a cron task each night. The web heads would just serve the files directly as they used to.

#2 would not be a whole lot of effort either, and it keeps with our goal of no longer needing a shared filesystem like we had before the migration to GCP. But it is slightly more work than #3.

Which path do you think would be best?

Flags: needinfo?(glob)

I am not opposed to just removing the support for indexing bugs in BMO either. No one has really seemed to care too much. We would just need to remove the "Google" tab from the query.cgi page, which is not giving any recent results anyway.

Let's go with #2.

Having bugs discoverable via a normal Google search has value; people probably didn't notice because they were still seeing results and didn't realise recent bugs weren't returned.

I prefer #2 over #3 as I see benefits in using a consistent approach to solving the same problem.

Flags: needinfo?(glob)
Depends on: 1840148
Summary: Need to update SiteMap extension to include a script that can be ran as cron job instead of on-demand → Need to update SiteMap extension to upload xml files to S3 and to do nightly as a cron task
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED