Closed Bug 1395681 Opened 7 years ago Closed 7 years ago

cache .json files on disk for ftpscraper

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(1 file)

pr 3956 - add disk cache for .json files for ftpscraper, 7 years ago, Will Kahn-Greene [:willkg] ET needinfo? me, 53 bytes, text/x-github-pull-request

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Description

7 years ago

An ftpscraper run does hundreds of HTTP GET requests against archive.mozilla.org. That's awesome, but it also takes a lot of time, and many of those GET requests fetch data that never changes.

There have been occasions where the .json files *did* change due to fixing bugs in the build system or something like that. Those are very rare. We shouldn't worry about those happening.

For local dev environments, that kind of sucks because it means it takes 5-10 minutes for ftpscraper to pull down data you need for your db. If you do that multiple times a day, that's a lot of time.

This bug covers looking into whether it helps to cache .json files on disk. That way, instead of going over the network, ftpscraper pulls the data from disk.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 1

7 years ago

This is related to bug #1250638, but that bug was about using CacheControl. With CacheControl, ftpscraper still does all the HTTP GET requests, but some of them end up with less data in the response. That effort was abandoned because it didn't seem like good value.

This bug covers pulling the data from disk if it exists and skipping the HTTP GET altogether.
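A minimal sketch of that idea, not the code from the pull request. The names `cache_path` and `get_json_text`, the hashed file names, and the injectable `fetch` callable are all assumptions made for illustration. It only caches URLs ending in `.json`, and it does nothing when no cache directory is given.

```python
import hashlib
import os


def cache_path(cachedir, url):
    """Return a file path inside cachedir for the given url (hypothetical layout)."""
    # Hash the url so the file name is safe on any filesystem.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cachedir, digest + ".json")


def get_json_text(url, fetch, cachedir=None):
    """Return the response text for url, preferring a cached copy on disk.

    fetch is any callable that takes a url and returns the response text;
    in real use it would do the HTTP GET.
    """
    # Caching is off unless a cachedir is given, and only .json urls qualify.
    if cachedir is None or not url.endswith(".json"):
        return fetch(url)

    path = cache_path(cachedir, url)
    if os.path.exists(path):
        # Cache hit: skip the HTTP GET altogether.
        with open(path, "r", encoding="utf-8") as fp:
            return fp.read()

    # Cache miss: go over the network once, then save the result.
    text = fetch(url)
    os.makedirs(cachedir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)
    return text
```

Note that a cache like this never expires anything, so if a .json file does change upstream, the developer has to delete the cached copy by hand.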

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 2

7 years ago

I'm doing a lot of "make dockersetup" (which wipes the db) followed by "make dockerupdatedata" which ploddingly goes through and downloads all the ftpscraper data and then some other things.

It's slowwww. In my experience and on my machine, it takes between 6 minutes and 9 minutes. 

Timings are done via:

    make dockersetup && time make dockerupdatedata

I tweaked docker/run_setup_postgres.sh to skip the "do you really want to delete the db?" question.

ftpscraper is network heavy, so timings are dependent on quality of the network and how fast archive.mozilla.org responds and probably karma, too.

Couple of run timings from today:

* 7m13s
* 8m15s

I modified ftpscraper a bit so that it caches .json files on disk. That covers about 600 HTTP GET requests. After ftpscraper has cached things, run times are less slow:

* 1m46s
* 1m42s
* 3m45s
* 1m39s
* 1m47s

Seems like the theory is sound.
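To put rough numbers on the timings above, here is a small sketch that converts them to seconds and compares the two sets. The helper name `to_seconds` is made up for this example, and with only two uncached and five cached runs the result is a ballpark figure, not a benchmark.

```python
import statistics


def to_seconds(timing):
    """Convert a timing like '7m13s' to a number of seconds."""
    minutes, seconds = timing.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# Timings copied from the lists above.
uncached = [to_seconds(t) for t in ["7m13s", "8m15s"]]
cached = [to_seconds(t) for t in ["1m46s", "1m42s", "3m45s", "1m39s", "1m47s"]]

mean_uncached = statistics.mean(uncached)
mean_cached = statistics.mean(cached)
# The median is less sensitive to the one slow cached run (3m45s).
median_cached = statistics.median(cached)

# Ratio of the means: how many times faster a cached run was on average.
speedup = mean_uncached / mean_cached

print("mean uncached: %.1fs" % mean_uncached)
print("mean cached: %.1fs, median cached: %ss" % (mean_cached, median_cached))
print("speedup: %.1fx" % speedup)
```

On these numbers the cached runs come out about 3.6 times faster on average.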

The size of the cache on disk is about 2.5 MB. I'm putting it in a .cache directory in the repository root on the host so it's easier to persist across runs and easier for the developer to manipulate.

Disk caching is off by default, so it doesn't affect server environments at all. I'll tweak the run scripts to run with caching on in the local dev environment.
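Since the cache is just a directory on the host, its size is easy to check. A sketch, assuming a hypothetical helper name `cache_size_bytes`; the `.cache` path in the usage line matches the directory mentioned above.

```python
import os


def cache_size_bytes(cachedir):
    """Return the total size in bytes of all files under cachedir.

    Returns 0 if the directory does not exist, because os.walk yields nothing.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(cachedir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__":
    # Run from the repository root to report the size of the disk cache.
    print("%.2f MB" % (cache_size_bytes(".cache") / (1024 * 1024)))
```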

Assignee: nobody → willkg

Status: NEW → ASSIGNED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 3

7 years ago

Attached file pr 3956 - add disk cache for .json files for ftpscraper

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 4

7 years ago

The .json files it's caching are ones like this:

http://archive.mozilla.org/pub/firefox/nightly/2017/08/2017-08-05-10-03-34-mozilla-central/firefox-57.0a1.en-US.linux-i686.json

[github robot]

Comment 5

7 years ago

Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/068948305872c0cc0efdfeb589824174e339ac9a
fixes bug 1395681 - cache .json files on disk for ftpscraper (#3956)

* fixes bug 1395681 - cache .json files on disk for ftpscraper

This adds a cachedir configuration property letting you specify a directory on
disk to cache .json files. It defaults to "don't cache on disk", but this
enables the disk cache for the local dev environment.

* Remove printing "hit"

[github robot]

Updated

7 years ago

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Categories

(Socorro :: General, task)

Security

(public)