Closed Bug 1395681 Opened 7 years ago Closed 7 years ago

cache .json files on disk for ftpscraper

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Attachments

(1 file)

An ftpscraper run does hundreds of HTTP GET requests against archive.mozilla.org. That's awesome. It also takes a lot of time, and many of those GET requests fetch data that never changes.

There have been occasions where the .json files *did* change due to fixing bugs in the build system or something like that. Those are very rare. We shouldn't worry about those happening.

For local dev environments, that kind of sucks because it means it takes 5-10 minutes for ftpscraper to pull down data you need for your db. If you do that multiple times a day, that's a lot of time.

This bug covers looking into whether it helps to cache .json files on disk, so that ftpscraper pulls the data from disk instead of going over the network.

This is related to bug #1250638, except that bug was about using CacheControl. With CacheControl, ftpscraper still does all the HTTP GET requests, but some of them end up with less data in the response. That effort was abandoned because it didn't seem like good value.

This bug covers pulling the data from disk if it exists and skipping the HTTP GET altogether.
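
A minimal sketch of that idea, in Python 3: check the disk first, and only do the HTTP GET on a miss. The names here (`get_json`, `cache_path`, the `fetch` callable) and the choice of a SHA-256 hash of the URL as the file name are assumptions for illustration, not the actual ftpscraper code.

```python
import hashlib
import json
import os


def cache_path(cachedir, url):
    """Map a URL to a file name inside cachedir (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cachedir, digest + ".json")


def get_json(url, fetch, cachedir=None):
    """Return parsed JSON for url, reading from the disk cache when possible.

    fetch is any callable that takes a URL and returns the response body as
    text. cachedir=None disables disk caching entirely.
    """
    if cachedir:
        path = cache_path(cachedir, url)
        if os.path.exists(path):
            # Cache hit: no network request at all.
            with open(path, "r", encoding="utf-8") as fp:
                return json.load(fp)

    # Cache miss (or caching disabled): go over the network.
    body = fetch(url)
    # Parse before writing so an invalid response never lands in the cache.
    data = json.loads(body)

    if cachedir:
        os.makedirs(cachedir, exist_ok=True)
        with open(path, "w", encoding="utf-8") as fp:
            fp.write(body)
    return data
```

Passing `fetch` in keeps the sketch independent of any particular HTTP library. Note that a cached file is never refreshed here, which matches the assumption above that the .json files almost never change; deleting the cache directory is the way to force a re-download.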

I'm doing a lot of "make dockersetup" (which wipes the db) followed by "make dockerupdatedata", which ploddingly goes through and downloads all the ftpscraper data and then does some other things.

It's slowwww. In my experience and on my machine, it takes between 6 and 9 minutes.

Timings are done via:

    make dockersetup && time make dockerupdatedata

I tweaked docker/run_setup_postgres.sh to skip the "do you really want to delete the db?" question.

ftpscraper is network heavy, so timings depend on the quality of the network, how fast archive.mozilla.org responds, and probably karma, too.

A couple of run timings from today:

* 7m13s
* 8m15s

I modified ftpscraper a bit so that it caches .json files on disk. That covers about 600 HTTP GET requests. After ftpscraper has cached things, run times are less slow:

* 1m46s
* 1m42s
* 3m45s
* 1m39s
* 1m47s

Seems like the theory is sound.
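
To put a rough number on it, here is a small sketch that converts the timings above to seconds and compares them. The helper name `to_seconds` is made up for this example, and with only two uncached and five cached runs this is a ballpark figure, not a benchmark.

```python
import re
from statistics import mean, median


def to_seconds(timing):
    """Convert a string like "7m13s" to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", timing)
    if match is None:
        raise ValueError("unrecognized timing: %r" % timing)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


# Timings copied from the runs listed above.
uncached = [to_seconds(t) for t in ["7m13s", "8m15s"]]
cached = [to_seconds(t) for t in ["1m46s", "1m42s", "3m45s", "1m39s", "1m47s"]]

print("uncached mean: %.0fs" % mean(uncached))
print("cached mean:   %.0fs" % mean(cached))
print("cached median: %.0fs" % median(cached))
print("speedup (mean): %.1fx" % (mean(uncached) / mean(cached)))
```

That works out to a mean of about 464s uncached versus about 128s cached, so roughly 3.6x faster. The cached median (106s) is lower than the cached mean because of the single 3m45s run.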

The size of the cache on disk is about 2.5 MB. I'm putting it in a .cache directory, which ends up in the repository root on the host, so it's easier to persist across runs and easier for the developer to manipulate.
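
For anyone who wants to check the cache size on their own machine, a small sketch like this sums the file sizes under a directory. The function name `dir_size_bytes` is made up for this example, and `.cache` is assumed to be relative to the repository root.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__":
    cachedir = ".cache"  # assumed location, relative to the repository root
    print("%.1f MB" % (dir_size_bytes(cachedir) / 1e6))
```

If the directory doesn't exist yet, `os.walk` yields nothing and this reports 0.0 MB.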

Disk caching is off by default, so it doesn't affect server environments at all. I'll tweak the run scripts to run with caching on in the local dev environment.
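
As a sketch of the "off by default" behavior: the option has no value unless someone sets it, and caching only kicks in when it's set. This uses `argparse` purely as a stand-in; the `--cachedir` flag and the helper names are assumptions for illustration and not how Socorro's configuration is actually wired up.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a hypothetical ftpscraper-style runner."""
    parser = argparse.ArgumentParser(description="ftpscraper-style runner (sketch)")
    parser.add_argument(
        "--cachedir",
        default=None,  # no directory configured means no disk caching
        help="directory for caching .json files; omit to disable disk caching",
    )
    return parser.parse_args(argv)


def caching_enabled(args):
    """Disk caching is on only when a cache directory was configured."""
    return bool(args.cachedir)
```

A server environment would run without the option and never touch the disk cache, while a local dev run script would pass something like `--cachedir .cache`.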

Assignee: nobody → willkg
Status: NEW → ASSIGNED

Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/068948305872c0cc0efdfeb589824174e339ac9a
fixes bug 1395681 - cache .json files on disk for ftpscraper (#3956)

* fixes bug 1395681 - cache .json files on disk for ftpscraper

This adds a cachedir configuration property letting you specify a directory on
disk to cache .json files. It defaults to "don't cache on disk", but this
enables the disk cache for the local dev environment.

* Remove printing "hit"

Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED