Closed Bug 1839929 Opened 11 months ago Closed 10 months ago

Need to update SiteMap extension to upload xml files to S3 and to do nightly as a cron task

Categories

(bugzilla.mozilla.org :: Extensions, defect)

Tracking

RESOLVED FIXED

People

(Reporter: dkl, Assigned: dkl)

References

Details

Attachments

(1 file)

Currently, when the sitemap xml files are out of date, the code regenerates them dynamically when a client accesses page.cgi?id=sitemap/sitemap.xml. In the past this was not an issue: the initial load was just slower, and each load after that was quick.

Today, after I fixed a different bug where Mojolicious was blocking the xml files, the worker is now killed off before the new xml sitemap files can be generated.

I suggest we instead make a command-line script that generates the xml.gz files and cron the script once a day in the background. Once a day should be fine IMO.

Then we just make page.cgi?id=sitemap/sitemap.xml load the pre-generated files quickly and not dynamically generate them.
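
Rough sketch of what the nightly generation script could look like. This is only an illustration in Python; the real SiteMap extension is written in Perl, and the bug ID source, output path, and filenames here are made-up placeholders.

  #!/usr/bin/env python3
  # Illustration only: generate a gzipped sitemap file offline so the web
  # heads never have to build it on demand. The bug ID source, output path,
  # and filenames are placeholders, not the real configuration.
  import gzip

  SITEMAP_DIR = "/var/lib/bmo/sitemap"   # placeholder output location
  BASE_URL = "https://bugzilla.mozilla.org"

  def fetch_public_bug_ids():
      # Placeholder: the real script would query the database for bugs
      # that are publicly visible.
      return range(1, 50001)

  def write_sitemap(bug_ids, index):
      urls = "\n".join(
          f"  <url><loc>{BASE_URL}/show_bug.cgi?id={bug_id}</loc></url>"
          for bug_id in bug_ids
      )
      xml = (
          '<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
          f"{urls}\n</urlset>\n"
      )
      path = f"{SITEMAP_DIR}/sitemap{index}.xml.gz"
      with gzip.open(path, "wt", encoding="utf-8") as fh:
          fh.write(xml)
      return path

  if __name__ == "__main__":
      # Intended to be run nightly from cron, e.g.: 0 3 * * * generate_sitemaps.py
      write_sitemap(fetch_public_bug_ids(), 1)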

OK, this is not going to work either. We used to store the gzipped xml sitemap files on the shared /data filesystem accessed by all AWS webheads. That was before the migration to GCP, in which we did away with the shared filesystem, and it will not be coming back.

So we basically have 4 options that I can see:

  1. Migrate the gzipped xml files, etc. into the database itself. This is the easiest, but the files can be quite large and I like to keep the DB size down.
  2. Generate the files once per day using a cron task and upload them to Amazon S3, like we do with other files that used to reside in /data/. dlactin said we can create a public S3 bucket that is read-only, and I can put links to the S3 files directly in the sitemap.xml file. The cron task would then use client keys to be able to upload the files each night (see the sketch after this list).
  3. dlactin mentioned having a read-only volume in k8s that is mapped into each web head and contains the generated files. The cron task that runs once a day would have write access to the volume to create the files. We are already doing something similar with Lando.
  4. Turn off support for the sitemap index files in robots.txt and disallow Google, etc. from indexing bugs in BMO.
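
For #2, here is a rough sketch of what the nightly upload could look like (Python/boto3 for illustration only; the bucket name, key prefix, and local paths are placeholders, not the real configuration):

  #!/usr/bin/env python3
  # Illustration only: push the nightly-generated sitemap files to a public,
  # read-only S3 bucket and build a sitemap index that points at the S3 URLs.
  # The bucket name, key prefix, and local paths are placeholders.
  import boto3

  BUCKET = "bmo-sitemap-public"                        # placeholder bucket
  PREFIX = "sitemap"
  PUBLIC_BASE = f"https://{BUCKET}.s3.amazonaws.com/{PREFIX}"

  def upload_sitemaps(paths):
      # Credentials come from the usual boto3 sources (the "client keys"
      # mentioned above, env vars, or an instance role).
      s3 = boto3.client("s3")
      for path in paths:
          key = f"{PREFIX}/{path.rsplit('/', 1)[-1]}"
          s3.upload_file(
              path, BUCKET, key,
              ExtraArgs={"ContentType": "application/xml",
                         "ContentEncoding": "gzip"},
          )

  def build_index(filenames):
      # The sitemap index served by page.cgi?id=sitemap/sitemap.xml would
      # contain direct links to the S3-hosted files.
      entries = "\n".join(
          f"  <sitemap><loc>{PUBLIC_BASE}/{name}</loc></sitemap>"
          for name in filenames
      )
      return (
          '<?xml version="1.0" encoding="UTF-8"?>\n'
          '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
          f"{entries}\n</sitemapindex>\n"
      )

  if __name__ == "__main__":
      files = ["/var/lib/bmo/sitemap/sitemap1.xml.gz"]  # placeholder paths
      upload_sitemaps(files)
      print(build_index(["sitemap1.xml.gz"]))

With this approach the web heads only serve the small pre-built index; the gzipped files themselves are fetched from S3 by the crawlers.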

#3 would be the least amount of work if it works as dlactin proposes. Minimal changes would be needed to move the code so that it runs as a cron task each night. The web heads would just serve the files directly as they used to.

#2 would not be a whole lot of effort either, and it keeps with our goal of no longer needing a shared filesystem like we had before the migration to GCP. But it is slightly more work than #3.

Which path do you think would be best?

Flags: needinfo?(glob)

I am not opposed to just removing the support for indexing bugs in BMO either. No one has really seemed to care too much. We would just need to remove the "Google" tab from the query.cgi page, which is not giving any recent results anyway.

Let's go with #2.

Having bugs discoverable via a normal Google search has value; people probably didn't notice because they were still seeing results and didn't realise recent bugs weren't returned.

I prefer #2 over #3 as I see benefits in using a consistent approach to solving the same problem.

Flags: needinfo?(glob)
Depends on: 1840148
Summary: Need to update SiteMap extension to include a script that can be ran as cron job instead of on-demand → Need to update SiteMap extension to upload xml files to S3 and to do nightly as a cron task
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED