Bug 1121597 Opened 9 years ago Closed 9 years ago

Reload Impression, App and Error data from S3 into DDFS

Categories

(Content Services Graveyard :: Tiles: Ops, defect)

Type: defect
Priority: Not set
Severity: normal
Points: 3

Tracking

(Not tracked)

Status: RESOLVED FIXED
Iteration: 38.2 - 9 Feb

People

(Reporter: tspurway, Assigned: tspurway)

Details

(Whiteboard: .008)

We will need all of the data we have, from Jan 13th back to wherever the data starts (assuming a 7-day window). Note that we will have to delete all of the data from the 13th and then reload it, as it is a partial day.

We should follow these steps to ensure consistency:

- all data transmitted must be tagged with a DDFS 'processed: ...' prefix

- delete the data from the 13th; the order is important, as incoming:-prefixed tags must be deleted first:
    - ddfs rm incoming:app:2015-01-13 incoming:error:2015-01-13 incoming:impression:2015-01-13
    - ddfs rm processed:app:2015-01-13 processed:error:2015-01-13 processed:impression:2015-01-13

- load app, error, and impression data for the period <beginning> .. 2015-01-13 inclusive (use processed: as the DDFS prefix for ALL tags); see the sketch after this list
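
For illustration, a minimal sketch of the delete-then-reload sequence using disco's Python DDFS client (the DDFS class and its delete/push methods are assumed from the disco API; the window start and the files_for() helper are hypothetical placeholders):

    from datetime import date, timedelta
    from disco.ddfs import DDFS

    ddfs = DDFS()

    # 1. Remove the partial day's tags: incoming: first, then processed:
    for prefix in ("incoming", "processed"):
        for kind in ("app", "error", "impression"):
            ddfs.delete("%s:%s:2015-01-13" % (prefix, kind))

    # 2. Reload every day in the window under processed: tags ONLY
    day = date(2015, 1, 7)          # assumed <beginning> of the 7-day window
    while day <= date(2015, 1, 13):
        for kind in ("app", "error", "impression"):
            tag = "processed:%s:%s" % (kind, day.isoformat())
            # files_for() is a hypothetical helper returning the blobs
            # already fetched from S3 for that kind/day
            ddfs.push(tag, files_for(kind, day))
        day += timedelta(days=1)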

I can do this job as an Inferno task rather than burdening ops with a bunch of data transfers. I will need read access to the relevant S3 buckets and their naming conventions, :relud.
Flags: needinfo?(dthornton)
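
For context, a rough sketch of what a map function like s3_import_map might do (the real infernyx implementation isn't reproduced here; the contract of yielding (key, value) pairs matches disco's classic worker as seen in the traceback below, while the shape of 'entry' and the bucket/key handling are assumptions):

    import boto

    def s3_import_map(entry, params):
        # 'entry' is assumed to name an (s3_bucket, s3_key) pair to import
        bucket_name, key_name = entry
        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)   # the call that 403s below
        key = bucket.get_key(key_name)
        # emit one record per line of the object's contents
        for line in key.get_contents_as_string().splitlines():
            yield key_name, line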
The infernyx host has been granted s3:* permission on tiles-incoming-prod-us-west-2 and all objects in it. These are broad permissions because this is temporary; please only list and read objects from the bucket.

The file structure is currently "<md5>-<app|impression>-<YYYY>.<MM>.<DD>"
Flags: needinfo?(dthornton)
The file structure will be changing soon, though: https://bugzilla.mozilla.org/show_bug.cgi?id=1121694
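
Given that naming convention, a hedged sketch of listing one day's objects (bucket name from the comment above; standard boto listing calls; the regex is inferred from the stated layout and may need adjusting once bug 1121694 lands):

    import re
    import boto

    # matches "<md5>-<app|impression>-<YYYY>.<MM>.<DD>"
    KEY_RE = re.compile(r"^[0-9a-f]{32}-(app|impression)-(\d{4})\.(\d{2})\.(\d{2})$")

    def keys_for_day(bucket_name, yyyy, mm, dd):
        """Yield the S3 key names whose embedded date matches the given day."""
        bucket = boto.connect_s3().get_bucket(bucket_name)
        wanted = ("%04d" % yyyy, "%02d" % mm, "%02d" % dd)
        for key in bucket.list():
            m = KEY_RE.match(key.name)
            if m and m.groups()[1:] == wanted:
                yield key.name

    # e.g. list(keys_for_day("tiles-incoming-prod-us-west-2", 2015, 1, 13))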
Assignee: nobody → tspurway
relud: I would like to run an Inferno job to load all of the data (> 78,000 blobs). I am getting an error that seems to indicate the disco slave nodes don't have access to S3. Could we widen the (temporary) permissions to include the disco slaves?

FATAL: [map:0] Traceback (most recent call last):
  File "/usr/var/disco/data/ip-172-31-26-122/30/bulk_load@58d:a9e7c:c0b62/usr/lib/python2.7/site-packages/disco/worker/__init__.py", line 340, in main
    job.worker.start(task, job, **jobargs)
  File "/usr/var/disco/data/ip-172-31-26-122/30/bulk_load@58d:a9e7c:c0b62/usr/lib/python2.7/site-packages/disco/worker/__init__.py", line 303, in start
    self.run(task, job, **jobargs)
  File "/usr/var/disco/data/ip-172-31-26-122/30/bulk_load@58d:a9e7c:c0b62/usr/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 328, in run
    getattr(self, task.stage)(task, params)
  File "/usr/var/disco/data/ip-172-31-26-122/30/bulk_load@58d:a9e7c:c0b62/usr/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 341, in map
    for key, val in self['map'](entry, params):
  File "infernyx/s3import.py", line 23, in s3_import_map
  File "/usr/lib/python2.7/site-packages/boto/s3/connection.py", line 502, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/lib/python2.7/site-packages/boto/s3/connection.py", line 535, in head_bucket
    raise err

S3ResponseError: S3ResponseError: 403 Forbidden
Flags: needinfo?(dthornton)
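
A minimal check that could be run from a slave node to confirm whether this is an IAM problem (standard boto calls; the bucket name comes from the earlier comment):

    from __future__ import print_function
    import boto
    from boto.exception import S3ResponseError

    try:
        bucket = boto.connect_s3().get_bucket("tiles-incoming-prod-us-west-2")
        # head_bucket succeeded, so this node can see the bucket
        print("OK, first key:", next(iter(bucket.list()), None))
    except S3ResponseError as err:
        # A 403 here means this node's credentials lack the temporary grant
        print("Access denied:", err.status, err.reason)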
The disco slave nodes have been granted the same S3 access as the infernyx host.
Flags: needinfo?(dthornton)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Iteration: --- → 38.2 - 9 Feb
Points: --- → 3
Whiteboard: .008