Bug 1181817
Opened 10 years ago
Closed 10 years ago
Export OrangeFactor star data to public ES cluster
Categories: Tree Management Graveyard :: OrangeFactor, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jgriffin, Assigned: ekyle
David Huang is going to be doing some number crunching on our intermittent failures in order to help generate methods that will identify a window in which intermittents were introduced, as well as detect when the frequency of intermittents changes in a significant way.
He'll start by using star data in OrangeFactor, but that data is currently hard to reach for anyone outside the VPN. We should export a snapshot of it to a public ES instance.
Comment 1•10 years ago
Alternatively, given that OF is now on its own ES cluster and there is nothing confidential in there, could we just add a shim to allow read-only access? (Similar to what I believe is done for Kyle's ES instance?)
Comment 2•10 years ago (Reporter)
Yes, that's a good idea. Kyle, do you have an ES read-only proxy we could already use for this?
Comment 3•10 years ago (Assignee)
Yes, esFrontLine is a simple app that blocks everything but queries
https://github.com/klahnakoski/esFrontLine
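To illustrate what "blocks everything but queries" could mean in practice, here is a minimal sketch of the kind of request filter a read-only proxy might apply. This is not esFrontLine's actual code; the function name `is_read_only_query` and the exact path rules are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Elasticsearch accepts search bodies over both GET and POST.
ALLOWED_METHODS = {"GET", "POST"}


def is_read_only_query(method: str, path: str) -> bool:
    """Return True only for requests that look like plain index searches.

    Hypothetical rule set: allow /<index>/_search and /<index>/<type>/_search,
    and reject every other method or endpoint.
    """
    if method.upper() not in ALLOWED_METHODS:
        return False
    # Drop the query string and split the path into non-empty segments.
    segments = [s for s in urlsplit(path).path.split("/") if s]
    if len(segments) not in (2, 3) or segments[-1] != "_search":
        return False
    # Index and type names must not be underscore-prefixed special names.
    return not any(s.startswith("_") for s in segments[:-1])
```

A proxy built this way would forward a request only when the check passes and answer everything else with an error, so writes, deletes, and admin endpoints never reach the cluster.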
Comment 4•10 years ago (Reporter)
(In reply to Kyle Lahnakoski [:ekyle] from comment #3)
> Yes, esFrontLine is a simple app that blocks everything but queries
>
> https://github.com/klahnakoski/esFrontLine
Perfect, thanks!
Comment 5•10 years ago (Reporter)
Kyle is going to take this. Kyle, the ES instance is of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200, and the relevant index is 'bugs'.
Assignee: jgriffin → klahnakoski
Comment 6•10 years ago (Assignee)
I have a script [1] that replicates the Orange Factor ES cluster to the ActiveData cluster. You can query it [2] just like any other index:
> {"from":"orange_factor"}
[1] https://github.com/klahnakoski/esReplicate
[2] http://activedata.allizom.org/tools/query.html#query_id=CCDYhosR
The initial fill has only just started. I will keep this bug open while I verify the final contents, and ensure the ongoing replication works too.
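For anyone who wants to run the same query from a script rather than the query tool, a rough sketch follows. It assumes the service accepts a JSON query via HTTP POST at a `/query` path on the same host; that endpoint, the `limit` value, and the helper names `build_query` and `run_query` are assumptions, not something confirmed in this bug.

```python
import json
import urllib.request

# Assumption: the query service accepts POSTed JSON at this URL.
ACTIVEDATA_URL = "http://activedata.allizom.org/query"


def build_query(table: str, limit: int = 1000) -> bytes:
    """Encode a query such as {"from": "orange_factor"} with a row limit."""
    return json.dumps({"from": table, "limit": limit}).encode("utf-8")


def run_query(table: str, limit: int = 1000, url: str = ACTIVEDATA_URL) -> dict:
    """POST the query and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_query(table, limit),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query("orange_factor")` would send the same `{"from":"orange_factor"}` query shown above, with an explicit limit added.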
Comment 7•10 years ago
Thanks Kyle. The query is working great!
Right now there are only 1000 rows so I can't test it, but I was just wondering about the "up to 3000 rows shown" message.
Does that mean even when I set the limit clause to >3000 in my query I'd still only get 3000 rows from one query at a time?
Comment 8•10 years ago (Assignee)
Sorry there are only 1000 rows, but those rows are updated quite frequently! :)
I looked into this last night and I could not see the problem. I am away today, but I will be back in the evening to fix this.
Comment 9•10 years ago (Assignee)
Yesterday's data is in there. I will look into why it is not kept up to date, and fix that too.
Comment 10•10 years ago (Assignee)
I have verified that the ongoing replication is working; the document count in ActiveData[1] matches the count in the original.
[1] http://activedata.allizom.org/tools/query.html#query_id=sJXn2+jR
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 11•10 years ago (Assignee)
David,
I do not know what your data needs are, and would like to know more. If you are looking to simply download all the raw data, you must pull it in suitably sized chunks in the short term. In the long term, I can look into supporting long-running streams of data that can deliver the volumes you require without breaking the service.
OrangeFactor is small, so you may be able to pull it all in a single query, but there are bigger datasets.
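Pulling the data in suitably sized chunks and collating them could look roughly like the sketch below. It is deliberately independent of any particular query API: `fetch_chunk` is a hypothetical callable supplied by the caller that returns up to `size` rows starting at `offset`, and how it maps onto a real query (for example a range filter on a sortable field) is left as an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def pull_in_chunks(
    fetch_chunk: Callable[[int, int], List[Row]],
    chunk_size: int = 10000,
) -> List[Row]:
    """Collect all rows by repeatedly asking for the next chunk.

    fetch_chunk(offset, size) is a caller-supplied (hypothetical) function.
    A chunk shorter than chunk_size signals that the data is exhausted.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rows: List[Row] = []
    offset = 0
    while True:
        chunk = fetch_chunk(offset, chunk_size)
        rows.extend(chunk)
        if len(chunk) < chunk_size:
            return rows
        offset += len(chunk)
```

Keeping each request bounded by `chunk_size` is what keeps any single query small; the total number of requests grows with the dataset instead.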
Comment 12•10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #11)
> David,
>
> I do not know what your data needs are, and would like to know more. If you
> are looking to simply download all the raw data, you must pull it in
> suitably sized chunks in the short term. In the long term I can look into
> supporting long-running streams of data that can deliver the volumes you
> require without breaking the service.
>
> OrangeFactor is small, so you may be able to pull it all in a single query,
> but there are bigger datasets.
Kyle, Thanks a lot.
Yes I do want to download all of the raw data, if possible.
For OF, with 472493 instances (currently), I can definitely pull the data in chunks and collate them.
I probably will need to consider other datasets.
If they're so large that I can't download all the instances, I can still work with a subset. A lot of the work we do assumes that not all of the data is available or that the data volume is too large, so we do a lot of sampling.
At the moment I am finalizing my thesis for submission at the end of this month. After that I'll definitely keep in close contact with you, if that's okay with you. I would love to talk to you about my data needs.
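As one illustration of sampling when a dataset is too large to hold in full, the sketch below uses reservoir sampling (Algorithm R), which draws a uniform random sample of `k` rows from a stream of unknown length in a single pass. This is a generic technique offered as an example, not necessarily the sampling method used in the work described above; the function name `reservoir_sample` is made up for this sketch.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    sample: List[T] = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            j = rng.randint(0, index)  # both endpoints inclusive
            if j < k:
                sample[j] = row
    return sample
```

Because it needs only one pass and `k` rows of memory, this kind of sampler can be fed directly from chunked downloads without storing the whole dataset.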
Comment 13•10 years ago (Assignee)
Sounds good! I look forward to helping.
Updated•5 years ago
Product: Tree Management → Tree Management Graveyard