Need reasonable amount of test-run data in new prod ES db

RESOLVED FIXED

Status

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: mcote, Assigned: dmaher)

Details

Reporter

Description

6 years ago
Bug 860256 concerns populating the dev instance, but we will need a reasonable amount of historical test-run data (the "logs" index) in the new production database before we switch OrangeFactor over to it.

TBPL has been writing star data to the new system for a while now, but the logparser has only been writing to the old system.  Without this data, we can't calculate the frequency of oranges (oranges per test run, aka the Orange Factor).

Bug 869648 and bug 869652 will make the logparser write to all three systems (old prod, new prod, dev) in case we need to switch back to the old db for some reason, but the new systems will only have data from the point at which we deploy those changes.

There are several options, none of which are great:

1. Copy the raw ES data over.  According to bug 860256, this is effectively impossible in this situation.
2. Perform a scroll re-index from one ES instance directly to another.  Also according to bug 860256, this is possible but can be "quirky".
3. Scrape old data from the FTP server.  This will require updating the scraping script, will take some amount of time, and will have a limited history (I think only one month for mozilla-inbound).
4. Wait for some time to build up a reasonable amount of historical data.  Not great as we lose existing history.

I think 2 is really our only decent option.  We should get a rough estimate of how long this will take.  I imagine we should be able to leave the logparser running and writing to both locations, as long as we stop the copy before the first new entry (although overwriting new entries might be fine).
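To make option 2 a bit more concrete, here is a minimal sketch of the copy loop behind a scroll re-index. It is not the tool that would actually be used: `fetch_page` and `send_bulk` are hypothetical callables standing in for "get the next page of scroll hits from the source" and "POST a bulk body to the target", and the hit shape (`_id`, `_type`, `_source`) is assumed. The point it illustrates is that reusing each document's `_id` lets a re-sent document overwrite the existing copy rather than duplicate it.

```python
import json
from typing import Callable, Iterable, List


def bulk_body(hits: Iterable[dict], target_index: str) -> str:
    """Build a newline-delimited bulk request body from a page of hits."""
    lines: List[str] = []
    for hit in hits:
        # Reusing the source _id means a re-sent document replaces the
        # stored copy instead of creating a second one.
        action = {"index": {"_index": target_index,
                            "_type": hit["_type"],
                            "_id": hit["_id"]}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(hit["_source"], sort_keys=True))
    return ("\n".join(lines) + "\n") if lines else ""


def reindex(fetch_page: Callable[[], List[dict]],
            send_bulk: Callable[[str], None],
            target_index: str) -> int:
    """Copy pages of hits until the source returns an empty page."""
    copied = 0
    while True:
        hits = fetch_page()
        if not hits:
            break
        send_bulk(bulk_body(hits, target_index))
        copied += len(hits)
    return copied


# In-memory stand-ins for the two clusters (hypothetical data).
pages = iter([[{"_id": "a", "_type": "t", "_source": {"n": 1}}], []])
sent: List[str] = []
total = reindex(lambda: next(pages), sent.append, "logs")
```

Keeping the transport behind two callables means the loop can be exercised without touching either cluster before it is pointed at real endpoints.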

In any case, we'll need to fix bug 869648 and bug 869652 first.
Reporter

Comment 1

6 years ago
phrawzty, does this seem reasonable?  Can we schedule some of your time to help us out here, after we've got the log parser writing to both locations?  Is there some way you could get a rough estimate of how long the re-indexing will take?
Flags: needinfo?(dmaher)
Assignee

Comment 2

6 years ago
(In reply to Mark Côté ( :mcote ) from comment #1)
> phrawzty, does this seem reasonable?  Can we schedule some of your time to
> help us out here, after we've got the log parser writing to both locations?

Sure, I'd be happy to help - note that I'm based out of UTC+2, so any real-time collaboration would need to happen during the "morning" in the negative time zones. :)
 
> Is there some way you could get a rough estimate of how long the re-indexing
> will take?

There are two elements to consider: resource overhead, and the size of the index in question.  Concerning the former, we've got lots of bandwidth and hardware available, so that leaves only the latter - how large an index are we talking about?
Flags: needinfo?(dmaher) → needinfo?(mcote)
Comment 3

6 years ago
(In reply to Daniel Maher [:phrawzty] from comment #2)
> how large an index are we talking about ?

The 'logs' index of http://elasticsearch1.metrics.scl3.mozilla.com:9200/

index: {
    primary_size: 30.3gb
    primary_size_in_bytes: 32563220464
    size: 60.7gb
    size_in_bytes: 65241155023
}
docs: {
    num_docs: 11075389
    max_doc: 11075389
    deleted_docs: 0
}
Assignee

Comment 4

6 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #3)
> index: {
>     primary_size: 30.3gb
>     primary_size_in_bytes: 32563220464
>     size: 60.7gb
>     size_in_bytes: 65241155023

Wow, that's a good-sized index, and it will consume the lion's share of the disk space available to the development ES cluster.  Shouldn't be a problem for now, though if anybody else asks for a development index of that size... well, we'll cross that bridge when we get there. :)

As for how long it would take to import, the VLANs in question likely have 100 Mbps available to them, so 30 GB would take about 40 minutes.  Tack on some overhead time and we could estimate an hour or so.  Of course, if there is no rate limiting in play, then it would be significantly faster - this *may* be the case.
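For reference, the arithmetic behind that estimate can be sketched as below. It assumes decimal units (1 GB = 10^9 bytes, 100 Mbps = 10^8 bits per second) and a link that is fully saturated for the whole transfer; the function name is made up for illustration.

```python
def transfer_minutes(size_bytes: float, link_bits_per_second: float) -> float:
    """Minutes needed to move size_bytes over a fully used link."""
    return size_bytes * 8 / link_bits_per_second / 60


# Round figure used in the estimate: 30 GB over 100 Mbps.
rough = transfer_minutes(30e9, 100e6)

# Same link, using primary_size_in_bytes from the index stats.
exact = transfer_minutes(32563220464, 100e6)

print(f"rough: {rough:.1f} min, from stats: {exact:.1f} min")
```

This only bounds the network transfer. It says nothing about the time the source spends reading documents or the target spends indexing them.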

A flow will need to be opened in order to allow this operation to occur.  I'll file a bug now.
Flags: needinfo?(mcote)
Comment 5

6 years ago
We can set TTLs on the development index docs, since we'll only need the last 4 weeks or so for dev work.
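As a rough illustration of what that could look like: Elasticsearch releases of this era supported a `_ttl` field in the type mapping, with `enabled` and `default` settings (the feature was later deprecated and removed, so this does not apply to current versions). The sketch below only builds the mapping body. The type name `example_type` is a placeholder, and `28d` is an assumed way of writing "4 weeks".

```python
import json


def ttl_mapping(doc_type: str, default_ttl: str = "28d") -> dict:
    """Mapping body that enables _ttl with a default expiry for one type."""
    return {doc_type: {"_ttl": {"enabled": True, "default": default_ttl}}}


# "example_type" is hypothetical; the real type name is not given in this bug.
body = ttl_mapping("example_type")
print(json.dumps(body, indent=2))
```

A default TTL of this kind would only affect documents indexed after the mapping is in place, so it would need to be set on the dev index before any data is copied in.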
Assignee

Updated

6 years ago
Depends on: 871533
Reporter

Comment 6

6 years ago
Great, thanks!  A few clarifications: 

* This particular bug is about transferring data to the new production system (elasticsearch-zlb.webapp.scl3.mozilla.com:9200).  For this system, we need the full "logs" index, at least until last Saturday (see below).  We will *also* need data copied over to our development server (elasticsearch-zlb.dev.vlan81.phx.mozilla.com:9200), but only for the last few weeks.  That's filed separately, as bug 860256.  This bug takes priority, though, as we could always use the new prod system as a source for the dev system, or even just wait a few weeks to fill it up.

* The logparser (source of the "logs" index) started writing to all three ES systems sometime on Friday (PDT).  So we only need data up to that point.  There doesn't appear to be a specific time stamp in the documents to delimit exactly when we started writing to the new systems, though, so I guess up to 12:00 am PDT on Saturday would be safe, as long as we make sure to update existing documents (as opposed to duplicating them).

* The TTLs apply only to the dev server (bug 860256).  For production (this bug), we'll want to keep them around for the foreseeable future.

* Since I didn't think to mention it in comment #0, the source ES db cluster is buildbot-es.metrics.scl3.mozilla.com, just to be clear.
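To illustrate why an overlapping copy should be harmless when existing documents are updated rather than duplicated, here is a small in-memory sketch. The target is modelled as a plain dict keyed by document `_id`, which is only a stand-in for the real index. The date is a placeholder Saturday picked for illustration and is not taken from this bug; the only thing it demonstrates is that midnight PDT corresponds to 07:00 UTC.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")

# Placeholder Saturday, for illustration only; not taken from this bug.
cutoff = datetime(2013, 5, 18, 0, 0, tzinfo=PDT)
cutoff_utc = cutoff.astimezone(timezone.utc)  # midnight PDT is 07:00 UTC


def copy_with_overwrite(target: dict, source_docs: list) -> dict:
    """Write each source document into the target keyed by its _id."""
    for doc in source_docs:
        target[doc["_id"]] = doc["_source"]
    return target


# The target already holds "b", written by the logparser after the switch.
target = {"b": {"result": "pass"}}
source = [
    {"_id": "a", "_source": {"result": "fail"}},
    {"_id": "b", "_source": {"result": "pass"}},  # overlaps with the target
]
copy_with_overwrite(target, source)
```

Because writes are keyed on `_id`, the overlapping document ends up stored once, so a generous cutoff costs some redundant writes but does not inflate the document count.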

Finally, I'm in UTC-4, so I could be around at 2:00 or 3:00 pm UTC+2, although edmorley is in a closer timezone to you, so perhaps you could coordinate with him when we hit the go button. :)  We won't be flipping OrangeFactor to the new cluster until we're sure everything went smoothly, so hopefully there won't be much for us to do during the actual import.

If that all makes sense, let's schedule a time!
Reporter

Comment 7

6 years ago
Also wanted to mention that, if it makes things easier for you, we can disable the logparser for the duration of the import, which will stop all writes to the "logs" index on all systems.  When we reenable it, the logparser should be able to catch up fairly quickly, assuming the outage isn't super long.
Reporter

Comment 8

6 years ago
Could I get an update on when we can do this migration, now that the dependent bugs have been fixed?
Flags: needinfo?(dmaher)
Assignee

Comment 9

6 years ago
The migration is currently scheduled for today; however, according to :mcote, this may need to be re-scheduled.  One of us will update this bug either way.
Flags: needinfo?(dmaher)
Assignee

Comment 10

6 years ago
Reindex operation commenced at "Tue Jun 11 08:06:51 PDT 2013".
Assignee

Comment 11

6 years ago
The operation is complete:

real    630m23.271s
user    427m17.049s
sys     1m38.765s

The new index is named "orangefactor_logs" but is aliased as "logs" for your convenience.  Please test it rigorously and let me know if everything seems correct.
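For comparison with the earlier network-only estimate, the timing above can be turned into an approximate throughput. This sketch assumes the document count and primary size from the index stats earlier in this bug still roughly held when the reindex ran, which may not be exact since the source index kept growing.

```python
import re


def parse_time_output(value: str) -> float:
    """Convert a `time`-style duration such as '630m23.271s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


elapsed = parse_time_output("630m23.271s")  # the "real" line above

# Figures from the earlier index stats (assumed still representative).
docs_per_second = 11075389 / elapsed
megabits_per_second = 32563220464 * 8 / elapsed / 1e6

print(f"{elapsed / 3600:.1f} h, {docs_per_second:.0f} docs/s, "
      f"{megabits_per_second:.1f} Mbit/s")
```

Under those assumptions the effective rate is well below a 100 Mbps link, which suggests the scroll and indexing work, rather than the network, dominated the run time.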
Status: NEW → ASSIGNED
Assignee

Comment 12

6 years ago
Closing this bug.  Should you require further assistance please don't hesitate to let me know.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee: nobody → dmaher
Product: Testing → Tree Management