Closed
Bug 1007326
Opened 11 years ago
Closed 10 years ago
Mapreduce jobs should use cached data by default
Categories
(Webtools Graveyard :: Telemetry Server, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: benjamin, Unassigned)
Details
When prototyping mapreduce jobs, I frequently make mistakes in the script. Re-running the script currently re-fetches everything (both the listings and the actual data) from S3.
I've been told that the --local-only flag can be used to prevent this, but the incantation to do that is far from obvious. For instance:
First run: python -m mapreduce.job ... --data-dir=/mnt/telex/data --work-dir=/mnt/telex/work
It appears that --data-dir is unused and the cached data goes into /mnt/telex/work/cache
So if you want to re-run the same job, you have to pass --data-dir=/mnt/telex/work/cache --work-dir=/mnt/telex/work --local-only
Maybe I don't understand what --data-dir is for, but it seems that we ought to use --data-dir for the cache all the time, and use the cache by default. If you really don't want to use the cache, we can add a --no-cache option.
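A minimal sketch of the option handling proposed above, assuming argparse. The parser is hypothetical and is not the actual mapreduce.job code: --data-dir, --work-dir and --local-only are the flags mentioned in this bug, --no-cache is the proposed opt-out, and the use_cache attribute name is invented for illustration.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration; not the real mapreduce.job parser.
    parser = argparse.ArgumentParser(prog="mapreduce.job")
    parser.add_argument("--data-dir", required=True,
                        help="local directory that holds (and caches) input data")
    parser.add_argument("--work-dir", required=True,
                        help="directory for byproducts of the job")
    parser.add_argument("--local-only", action="store_true",
                        help="never consult S3; use only what is in --data-dir")
    # Proposed opt-out: the cache is used unless this flag is given.
    parser.add_argument("--no-cache", dest="use_cache", action="store_false",
                        help="re-fetch everything instead of reusing --data-dir")
    return parser


args = build_parser().parse_args(
    ["--data-dir=/mnt/telex/data", "--work-dir=/mnt/telex/work"])
print(args.use_cache, args.local_only)  # True False
```

With this shape, a plain re-run reuses the cache, and only an explicit --no-cache forces a fresh download.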
mreid, do you agree or object to this?
Flags: needinfo?(mreid)
Comment 1•11 years ago
The original idea was that "--data-dir" specifies an arbitrary local data set that will be filtered the same way as the remote data. "--work-dir" is intended as a place to store the byproducts of the MR job.
The data fetched from S3 was considered a byproduct, and during development it made sense to keep things separated (primarily for testing).
The way jobs typically work now, there's really no such thing as a "local data set", and the only thing that flag is used for is to point at the data you already downloaded.
So yes, I agree that the data should be cached locally in --data-dir, and then the "--local-only" flag would simply decide whether or not to consult S3.
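The behaviour agreed on here could look roughly like the sketch below. It is an assumption about the shape of the fix, not the telemetry-server implementation: local_path_for is a made-up helper, and fetch_from_s3 is a hypothetical callable standing in for whatever actually downloads an object. Caching of the S3 listings is not covered.

```python
import os


def local_path_for(key, data_dir, fetch_from_s3, local_only=False):
    """Return the local path of `key`, downloading it only when needed.

    `fetch_from_s3(key, dest)` is a hypothetical callable, not a real API.
    """
    dest = os.path.join(data_dir, key)
    if os.path.exists(dest):
        # Already cached in --data-dir: reuse it without touching S3.
        return dest
    if local_only:
        # --local-only: S3 is never consulted, so a cache miss is an error.
        raise FileNotFoundError(dest)
    # Cache miss: download into --data-dir so the next run finds it there.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    fetch_from_s3(key, dest)
    return dest
```

Under this sketch, the first run fills --data-dir and every later run with the same --data-dir skips the download, with or without --local-only.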
I don't think we even need a "--no-cache" option at this point. It would make sense if we could clean up the data as we go, in order to avoid running out of disk space, but that would involve a larger change to how data is distributed among processes. That would be an improvement too, but should probably be a separate bug.
Flags: needinfo?(mreid)
Comment 2•10 years ago (Reporter)
Spark makes life better.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Updated•7 years ago
Product: Webtools → Webtools Graveyard