Closed Bug 1181817 Opened 10 years ago Closed 10 years ago

Export OrangeFactor star data to public ES cluster

Categories

(Tree Management Graveyard :: OrangeFactor, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Assigned: ekyle)

Details

David Huang is going to be doing some number crunching on our intermittent failures in order to help generate methods that will identify a window in which intermittents were introduced, as well as detect when the frequency of intermittents changes in a significant way. He'll start by using star data in OrangeFactor, but this is currently not very easy to get at for someone not in the VPN. We should export a snapshot of this data to a public ES instance.
Alternatively, given that OF is now on its own ES cluster and that there is nothing confidential in there, could we just add a shim to allow read-only access? (Similar to what I believe is done for Kyle's ES instance?)
Yes, that's a good idea. Kyle, do you have an ES read-only proxy we could already use for this?
Yes, esFrontLine is a simple app that blocks everything but queries: https://github.com/klahnakoski/esFrontLine
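The idea of a proxy that "blocks everything but queries" can be illustrated with a minimal sketch. This is not esFrontLine's actual code; the function name, the allowed methods, and the rule of matching on a trailing `_search` path segment are all assumptions made for illustration.

```python
# Hypothetical sketch of a read-only filter; NOT esFrontLine's actual logic.
# Assumption: a "query" is any GET or POST whose path ends in "_search".
ALLOWED_METHODS = {"GET", "POST"}  # Elasticsearch accepts search bodies on both


def is_read_only_query(method: str, path: str) -> bool:
    """Return True only for requests that look like Elasticsearch searches."""
    if method.upper() not in ALLOWED_METHODS:
        return False
    # Drop any query string, then look at the last non-empty path segment.
    segments = [s for s in path.split("?", 1)[0].split("/") if s]
    return bool(segments) and segments[-1] == "_search"
```

Under these assumptions, a search such as `GET /bugs/_search` would pass, while a `DELETE /bugs` or a `POST /bugs/_delete_by_query` would be rejected before reaching the cluster.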
(In reply to Kyle Lahnakoski [:ekyle] from comment #3)
> Yes, esFrontLine is a simple app that blocks everything but queries
>
> https://github.com/klahnakoski/esFrontLine

Perfect, thanks!
Kyle is going to take this. Kyle, the ES instance is of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200, and the relevant index is 'bugs'.
Assignee: jgriffin → klahnakoski
I have a script [1] that replicates the Orange Factor ES cluster to the ActiveData cluster. You can query it [2] just like any other index:

> {"from":"orange_factor"}

[1] https://github.com/klahnakoski/esReplicate
[2] http://activedata.allizom.org/tools/query.html#query_id=CCDYhosR

The initial fill has only just started. I will keep this bug open while I verify the final contents and ensure the ongoing replication works too.
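For anyone scripting this rather than using the query tool, the query document above could be built and wrapped in an HTTP request roughly as follows. This is a sketch under assumptions: the `limit` key is inferred from the "limit clause" mentioned later in this bug, the helper names are hypothetical, and the service endpoint is not stated in this bug, so it is left as a parameter.

```python
import json
import urllib.request
from typing import Optional


def build_query(table: str, limit: Optional[int] = None) -> str:
    """Serialize a query such as {"from": "orange_factor"} to JSON."""
    query = {"from": table}
    if limit is not None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        query["limit"] = limit  # assumed key name, based on the "limit clause"
    return json.dumps(query)


def make_request(endpoint: str, payload: str) -> urllib.request.Request:
    """Wrap the JSON payload in a POST request; nothing is sent here."""
    return urllib.request.Request(
        endpoint,  # hypothetical: the query endpoint is not given in this bug
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The resulting request object could then be passed to `urllib.request.urlopen` once the correct endpoint is known.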
Thanks Kyle. The query is working great! Right now there are only 1000 rows so I can't test it, but I was just wondering about the "up to 3000 rows shown" message. Does that mean even when I set the limit clause to >3000 in my query I'd still only get 3000 rows from one query at a time?
Sorry there are only 1000 rows, but those rows are updated quite frequently! :) I looked into this last night and I could not see the problem. I am away today, but I will be back in the evening to fix this.
Yesterday's data is in there. I will look into why it is not kept up to date, and fix that too.
I have verified that ongoing replication is working; the document count in ActiveData [1] matches the count in the original.

[1] http://activedata.allizom.org/tools/query.html#query_id=sJXn2+jR
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
David, I do not know what your data needs are, and would like to know more. If you are looking to simply download all the raw data, you must pull it in suitably sized chunks in the short term. In the long term I can look into supporting long-running streams of data that can deliver the volumes you require without breaking the service. OrangeFactor is small, so you may be able to pull it all in a single query, but there are bigger datasets.
(In reply to Kyle Lahnakoski [:ekyle] from comment #11)
> David,
>
> I do not know what your data needs are, and would like to know more. If you
> are looking to simply download all the raw data, you must pull it in
> suitably sized chunks in the short term. In the long term I can look into
> supporting long-running streams of data that can deliver the volumes you
> require without breaking the service.
>
> OrangeFactor is small, so you may be able to pull it all in a single query,
> but there are bigger datasets.

Kyle,

Thanks a lot. Yes, I do want to download all of the raw data, if possible. For OF, with 472493 instances (currently), I can definitely pull the data in chunks and collate them. I will probably need to consider other datasets. If they're so large that I can't download all the instances, I can still work with a subset of them. A lot of the work we do assumes that not all data is available or that the data volume is too large, so we do a lot of sampling.

At the moment I am finalizing my thesis for submission at the end of this month. After that I'll definitely keep in close contact with you, if that's okay with you. I would love to talk to you about my data needs and all.
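Pulling the data in chunks and collating them could look something like the sketch below. It only plans row ranges and merges results. How each range maps onto an actual query (by offset, by date, or by some other key) is not specified in this bug, so that step is left out, and the chunk size of 10000 is an arbitrary assumption.

```python
from typing import Iterable, Iterator, List, Tuple


def chunk_ranges(total: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that together cover `total` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, total, size):
        yield start, min(start + size, total)


def collate(chunks: Iterable[List[dict]]) -> List[dict]:
    """Concatenate per-chunk result lists into a single list, in order."""
    rows: List[dict] = []
    for chunk in chunks:
        rows.extend(chunk)
    return rows


# With the 472493 instances mentioned above and a hypothetical chunk size:
ranges = list(chunk_ranges(472493, 10000))
print(len(ranges), ranges[0], ranges[-1])
```

Each planned range would be fetched with a separate query, and the results passed to `collate` in order.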
Sounds good! I look forward to helping.
Product: Tree Management → Tree Management Graveyard