Closed Bug 1063324 Opened 5 years ago Closed 5 years ago

Short term solution for putting Talos data on public ES cluster (please discuss)

Categories

(Datazilla Graveyard :: Metrics, defect)

x86_64
Windows 7
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Unassigned)

Attachments

(1 file)

We are sending out alerts when page-level regressions in Talos performance tests are detected, but developers have no good way of viewing supporting charts.  Datazilla can show the charts, but it has become unresponsive trying to deal with the large amount of Talos data.  TreeHerder is expected to be months away from presenting this level of detail.  Graph server is fast, but does not drill down to the page level and will require dev work to do so.

There are my Talos charts[1]: the UI is limited, and the page loads slowly, but it works.  The issue is that this web page requires an ES cluster, and the one we use is behind the VPN.  If we can find a publicly accessible machine to host an ES cluster[2], we can point an ETL instance at it to fill it, and we will have charts.

Is there a low effort, or expedient, way to show the page-level performance results?

[1] http://people.mozilla.org/~klahnakoski/talos/Alert-Results.html#sampleMax=2014-07-20&sampleMin=2014-06-20&test=RegExp&branch=Fx-Team&os=win.6.2.9200&platform=x86_64 (Requires MozillaVPN)
[2] Cluster requires 8Gig ram, 1T disk
See Also: → 1020537
I remember jonasfj has some experience with dashboards; maybe he can give some suggestions.
I'm assuming the data used is not confidential, since the end product (charts, etc.) is going to be public?
(In reply to Jonathan Lin [:jlin] from comment #1)
> I remember jonasfj has some experience with dashboards; maybe he can
> give some suggestions.

He did, but since we just need something to use until Tree herder starts showing graphs (AFAIK the current estimate by jeads is not before mid Q4), starting another project just for that sounds like an unwise use of resources IMO, if it's ready at all by the time Tree herder could be used.

Kyle already has a system which works better than DataZilla (these days), and it seems all we need is to make it publicly available. The HW requirements are minimal, the load is expected to be very low (I'm guessing not more than 100 hits a day), IT can put it such that it's public, and Kyle estimates the move to a new system as 1 day's work.

We've been struggling with the inability to access the data since about April. jeads tried a few things with datazilla and none have worked so far, and he's (rightly) focused on tree herder.

(In reply to Jonathan Lin [:jlin] from comment #2)
> I'm assuming the data used is not confidential, since the end product
> (charts, etc.) is going to be public?

Correct. Both the input data and the outputs are not confidential, and the inputs are already public. It just makes the data browsable like graph server does and DataZilla should.
Sorry I did not include fubar!
:-) It got to me through IRC.

Kyle, can you clear up some confusion for me?

What systems/services need to access the public ES bugs cluster?
Can we not do something like esFrontline, and have a simple proxy in front of the ES cluster?
I know we're looking to get Talos data off that cluster; how is this request different than bug 1020537 ?
fubar,

> What systems/services need to access the public ES bugs cluster?
Generally people working on performance issues will need access.  Public is best for greatest audience.  

> Can we not do something like esFrontline, and have a simple proxy in front of the ES cluster?
It would be exactly like the config for BZ, with esFrontline to protect it from changes, and some backdoor for the ETL to load the data.

> I know we're looking to get Talos data off that cluster; how is this request different than bug 1020537?
Bug 1020537 is the correct long term solution.  This bug is to explore how we can show this data while that is being set up.  The BZ ES cluster took time and convincing for very few resources, and rightly so, for stability and long term cost minimization.
There is ES-as-a-service, like qbox, but I need a credit card before I even try[1].  The good thing about these services is they have a security model to deal with ES.


[1] https://qbox.io/dashboard/payment-method
(In reply to Kyle Lahnakoski [:ekyle] from comment #7)
> There is ES-as-a-service, like qbox, but I need a credit card before I even
> try[1].  The good thing about these services is they have a security model
> to deal with ES.
> 
> 
> [1] https://qbox.io/dashboard/payment-method

With the kind of stuff we need, what kind of expense are we looking at if we'll be using it for 6 months or so? E.g. would it be more than $1000? more than $100? etc.
Avi,

For the size (100G disk), ES hosting appears to be around $500/month (+/- $100/month depending on vendor and drive type) :(  AWS is about the same.
So for 6 months, which I think is what we need, up to about $4000.
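
As a quick check of that figure, the monthly quote above ($500, give or take $100) can be multiplied out over six months. This is only a sketch of the arithmetic; the function name is made up for illustration.

```python
def hosting_cost_range(monthly_usd, spread_usd, months):
    """Return (low, high) total cost for a monthly price with a +/- spread."""
    low = (monthly_usd - spread_usd) * months
    high = (monthly_usd + spread_usd) * months
    return low, high


# Figures quoted in the thread: $500/month, +/- $100/month, for 6 months.
low, high = hosting_cost_range(500, 100, 6)
print(f"6 months: ${low} to ${high}")
```

With those inputs the range tops out at $3600, so "up to about $4000" is a rounded-up ceiling.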

Wouldn't it be easier to just get a normal laptop from servicedesk and put it on a publicly visible network?

I don't pretend to see the picture better than IT or the ateam guys, but since this is clearly a stop-gap approach, we should probably not over engineer it.

I.e. set it up quickly, see how useful it is, and in the unlikely case it won't be enough (IT/bandwidth/setup wise), then let's re-discuss this IMO.
I have an old clean machine here at home, and I installed ElasticSearch, esFrontline, and the ETL script.  I also added this new cluster as a possible data source for my people page charts [1].  

There are complications:
a) The ETL on my clean machine is pure public, so has no access to push date right now (the existing ETL does a request to the Datazilla backend database for this).  This means the charts are sorted by date received, not by push date.  This will not be a problem for much longer.
b) The received date/push date difference may cause failure: Since the dashboard will use VPN data (if you have access to it), it contains logic to switch between the two dates; which is not fully tested.
c) I have recently added a suite-level "SUMMARY" statistic (geometric mean of all test results in a suite): This necessitated a format change to the URL, which breaks the email links, for now.
d) There are easy solutions to the severe slowness of this dashboard.
e) This is on my home internet connection, which is quite terrible at times.  Opening the debugger (F12) will hint whether the problem is a bug, or just Kyles-bad-internet-connection.


[1] http://people.mozilla.org/~klahnakoski/talos/Alert-Talos.html#sampleMax=2014-09-07&sampleMin=2014-05-08&os=linux.Ubuntu+12.04&branch=Mozilla-Inbound&platform=x86_64&suite=a11yr.SUMMARY
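
For reference, the suite-level "SUMMARY" in (c) is described as the geometric mean of all test results in a suite. A minimal sketch of that calculation follows; the function name and sample values are illustrative assumptions and are not taken from the ETL code.

```python
import math


def suite_summary(test_results):
    """Geometric mean of positive test results, computed as exp(mean(log(x)))."""
    if not test_results:
        raise ValueError("suite has no test results")
    if any(x <= 0 for x in test_results):
        raise ValueError("geometric mean needs strictly positive values")
    log_sum = sum(math.log(x) for x in test_results)
    return math.exp(log_sum / len(test_results))


# Hypothetical per-test results for one suite (made-up numbers).
print(suite_summary([2.0, 8.0]))
```

Working in log space avoids overflow when a suite has many tests, and unlike an arithmetic mean it keeps one slow page from dominating the summary.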
Didn't quite understand if it should work outside the VPN right now. I understood that the only issue would be submission dates instead of push dates, but when I tried it, I don't see data, there are some errors displayed at the bottom of the page, and the browser console also shows errors which start with these:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://klahnakoski-es.corp.tor1.mozilla.com:9200/talos/test_results/_mapping. This can be fixed by moving the resource to the same domain or enabling CORS. _mapping

"04:00:53 - call to http://klahnakoski-es.corp.tor1.mozilla.com:9200/talos/test_results/_mapping has failed" aLog.js:24
"Error with ESQuery
	File ESQuery.js?1410224447361, line 405, in ESQuery.prototype.run
	File thread.js?1410224447361, line 207, in Thread_prototype_resume
	File thread.js?1410224447361, line 188, in Thread_prototype_resume/retval
	File Rest.js?1410224447361, line 41, in Rest.send/ajaxParam.error
	File Rest.js?1410224447361, line 92, in Rest.send/request.onreadystatechange
caused by Error while calling http://67.55.30.33:9201/talos/test_results/_search
caused by Bad response (503)
caused by "{\"error\":\"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed]\",\"status\":503}"
" aLog.js:56

"04:00:54 - Uncaught Error in thread: 
  Error with ESQuery
	File ESQuery.js?1410224447361, line 405, in ESQuery.prototype.run
	File thread.js?1410224447361, line 207, in Thread_prototype_resume
	File thread.js?1410224447361, line 188, in Thread_prototype_resume/retval
	File Rest.js?1410224447361, line 41, in Rest.send/ajaxParam.error
	File Rest.js?1410224447361, line 92, in Rest.send/request.onreadystatechange
caused by Error while calling http://67.55.30.33:9201/talos/test_results/_search
caused by Bad response (503)
caused by "{\"error\":\"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed]\",\"status\":503}"
"
Avi,

It should work, but it is apparent by your suffering I failed to get all the parts working together. It seems my machine had only allocated 1G of memory to the ES process and has been throwing OutOfMemoryExceptions.

I will work on these issues today.
It seems my machine simply does not have enough memory.
One issue I have run into with *charts* is that I cannot view the list of tests in something like cart or tart.  This makes it hard to see what wins/losses we have for a given patch.  Likewise there is no metadata when I mouse over.

I am not recommending we fix that, it is just something we need to know up front.

The issue here is the data: can we put the data somewhere else internal and have the front end on an EC2 machine?
(In reply to Joel Maher (:jmaher) from comment #15)
> One issue I have run into with *charts* is that I cannot view the list of
> tests in something like cart or tart. ...

Not sure I get it, isn't this the whole point of this effort? I.e. Graphserver can't show the data for the sub-tests, and datazilla can - but it's not working, so you suggested Kyle's data browser as a temporary solution for this task, did you not?

I can only assume that *charts* is _not_ what Kyle is trying to set up (and so I don't quite understand why you mentioned it..), otherwise, if it can't show the sub-tests, then it's only as good as graph server, is it not?

So, assuming that Kyle's system _can_ show sub tests, here's a summary:

A few days ago Kyle talked to someone at IT and was told that in order to be publicly visible it "has to be connected to a different netblock".

I talked to jlin just now and he mentioned that different netblock means only at a data center, and not at a mozilla office, and that setting up a system at a datacenter is a considerable effort, probably beyond what we expected for this bug.

jlin also says that there's no easy solution for the hosting other than putting it at someone's home or renting some cloud service. I.e. public hosting at mozilla is not straightforward.

If Kyle is willing to host the system at his place for a while (to evaluate how useful it is), then jlin has a laptop with 16G ram and 500G HDD which he can give Kyle for this.

Does this sound like a plan? Kyle?

For the longer term hosting until Tree Herder's graphs are working (most hopefully not into 2015), I think we should use cloud hosting like Kyle mentioned at comment 9.
> Does this sound like a plan? Kyle?
It sounds like a plan. 

On the subject of my machine:  I discovered I had a 32 bit version of java, which was preventing me from utilizing more memory.  I now have it using 2G, and we may be able to go to three, so maybe it is enough to not OoM.

For Joel:  I believe the issue is the giant-list-of-tests in the UI.  The current version on my page has a "Suite" selector, which shows the tests, and allows you to select SUMMARY.  BEWARE!  Use only the test filter or the suite filter NOT BOTH. (Have one set to "All")
I am referring to charts which is linked in the regression alert emails.  I can view subtests, but it is not like datazilla where I can view all the sub test results on a single page- the navigation is more difficult to do as this is a simple (yet effective) tool.

A few things to consider:
1) we need to have a solution in place for 6-12 months that hosts this.  Even if treeherder is showing detailed high resolution data, we still need this project for alerts
2) getting this set up on a laptop in Kyle's home is probably not a lot of effort, but might not be the best route forward even if it is the fastest, since we need a solution for 1 year
3) going the cloud route is more expensive, but it is more sustainable in the longer term and sets us up for ensuring our code is clean and tooling is done correctly.
If this system is needed for more than 6 months, then yes, it is advisable to set this up properly. Just a reminder that ekyle also has a bug open with IT for proper storage of the alerts ES data in a data centre in bug 1020537.
I received a laptop from jlin, which has 4x more memory than the dusty old box I am using now.  The alert links should work; but if they fail again, I will put the ES on this laptop.
I think it's slightly better now, but still has errors. I clicked one of the dzAlerts links and it shows the dates and some other info, it shows some points at the "Talos test results" graph area, but it keeps loading and spawns some errors.

Maybe you want to post here a sample link which should work, and a screenshot of what it should look like after it completes loading?


Here's the link I followed: http://people.mozilla.org/~klahnakoski/talos/Alert-Results.html#sampleMin=2014-09-02&platform=x86&sampleMax=2014-09-09&branch=Firefox-Non-PGO&test=2-customize-exit.half.TART&os=win.6.1.7601

And here are the first messages+errors at the browser console (note that, like comment 12, it starts with an XSS error):

"04:51:54 - start Get parts of Product 01:51:54" aLog.js:24

"04:51:54 - start Get parts of Branch 01:51:54" aLog.js:24

"04:51:54 - start Get parts of Platform 01:51:54" aLog.js:24

"04:51:54 - start Get parts of TestOnly 01:51:54" aLog.js:24

"04:51:54 - start Get parts of Test 01:51:54" aLog.js:24

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://klahnakoski-es.corp.tor1.mozilla.com:9200/talos/test_results/_mapping. This can be fixed by moving the resource to the same domain or enabling CORS. _mapping

"04:51:55 - call to http://klahnakoski-es.corp.tor1.mozilla.com:9200/talos/test_results/_mapping has failed" aLog.js:24

"04:51:57 - done Get parts of Branch 01:51:57 (2second)" aLog.js:24

"04:51:57 - done Get parts of Platform 01:51:57 (2second)" aLog.js:24

"04:51:57 - done Get parts of Product 01:51:57 (2second)" aLog.js:24

"04:51:57 - done Get parts of Platform 01:51:57 (2second)" aLog.js:24

"Error with ESQuery
	File ESQuery.js, line 405, in ESQuery.prototype.run
	File thread.js, line 207, in Thread_prototype_resume
	File thread.js, line 188, in Thread_prototype_resume/retval
	File Rest.js, line 41, in Rest.send/ajaxParam.error
	File Rest.js, line 92, in Rest.send/request.onreadystatechange
caused by Error while calling http://67.55.30.33:9201/talos/test_results/_search
caused by Bad response (500)
caused by "{\"error\":\"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][1]: QueryPhaseExecutionException[[talos20140907_130351][1]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_21250297201410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][0]: QueryPhaseExecutionException[[talos20140907_130351][0]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_306141941410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][2]: QueryPhaseExecutionException[[talos20140907_130351][2]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_8441948351410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }]\",\"status\":500}"
" aLog.js:56

"04:51:57 - Uncaught Error in thread: 
  Error with ESQuery
	File ESQuery.js, line 405, in ESQuery.prototype.run
	File thread.js, line 207, in Thread_prototype_resume
	File thread.js, line 188, in Thread_prototype_resume/retval
	File Rest.js, line 41, in Rest.send/ajaxParam.error
	File Rest.js, line 92, in Rest.send/request.onreadystatechange
caused by Error while calling http://67.55.30.33:9201/talos/test_results/_search
caused by Bad response (500)
caused by "{\"error\":\"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][1]: QueryPhaseExecutionException[[talos20140907_130351][1]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_21250297201410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][0]: QueryPhaseExecutionException[[talos20140907_130351][0]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_306141941410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][2]: QueryPhaseExecutionException[[talos20140907_130351][2]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_8441948351410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }]\",\"status\":500}"
" aLog.js:24

"Can not setup PartitionFilter
	File PartitionFilter.js, line 69, in convertToTreeLater/<
	File thread.js, line 209, in Thread_prototype_resume
	File thread.js, line 188, in Thread_prototype_resume/retval
	File thread.js, line 358, in build/Thread.join/gen<
	File thread.js, line 393, in Thread_join_resume
	File thread.js, line 207, in Thread_prototype_resume
	File thread.js, line 188, in Thread_prototype_resume/retval
	File Rest.js, line 41, in Rest.send/ajaxParam.error
	File Rest.js, line 92, in Rest.send/request.onreadystatechange
caused by Error with ESQuery
	File ESQuery.js, line 405, in ESQuery.prototype.run
	File thread.js, line 207, in Thread_prototype_resume
	File thread.js, line 188, in Thread_prototype_resume/retval
	File Rest.js, line 41, in Rest.send/ajaxParam.error
	File Rest.js, line 92, in Rest.send/request.onreadystatechange
caused by Error while calling http://67.55.30.33:9201/talos/test_results/_search
caused by Bad response (500)
caused by "{\"error\":\"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][1]: QueryPhaseExecutionException[[talos20140907_130351][1]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_21250297201410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][0]: QueryPhaseExecutionException[[talos20140907_130351][0]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_306141941410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }{[-q748ucBQZKXx-Bpn7oIhg][talos20140907_130351][2]: QueryPhaseExecutionException[[talos20140907_130351][2]: query[filtered(ConstantScore(+*:*))->cache(_type:test_results)],from[0],size[0]: Query Failed [Failed to execute main query]]; nested: VerifyError[(class: ASMAccessorImpl_8441948351410486716760, method: getKnownEgressType signature: ()Ljava/lang/Class;) Illegal type in constant pool]; }]\",\"status\":500}"
" aLog.js:56
Avi,

Thank you for testing this!  The XSS errors are a side effect of the HTTP error response not including the "access-control-allow-origin" header.  I do not know if the browser should be demanding this header when errors are occurring.  In the case of XSS errors contacting http://klahnakoski-es.corp.tor1.mozilla.com:9200, I have no control over those headers; Mozilla's network is sending the HTTP error.  In the case of http://67.55.30.33:9201, I can probably update esFrontline to set the headers during errors too.  I believe this is a browser bug[1].

The page will eventually stop loading, as shown in the attachment.  The query to count all suites and all tests is consuming an inordinate amount of resources, and there are multiple solutions to fix this.  

This dashboard is low priority right now, but please file bugs[2] for the most egregious UI concerns, and I will do my best to improve it.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1043545
[2] https://bugzilla.mozilla.org/enter_bug.cgi?product=Datazilla&component=Metrics&blocked=962378
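
To illustrate the idea of setting the header during errors too: the sketch below wraps a WSGI application so that every response, including a 503, carries "Access-Control-Allow-Origin". It is not esFrontline's actual code; `allow_any_origin` and `failing_backend` are hypothetical names, and the wildcard origin is an assumption that only suits public, read-only data.

```python
def allow_any_origin(app):
    """Wrap a WSGI app so every response, errors included, carries a CORS header."""
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Drop any existing value, then add ours regardless of status code.
            headers = [(name, value) for name, value in headers
                       if name.lower() != "access-control-allow-origin"]
            headers.append(("Access-Control-Allow-Origin", "*"))
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped


def failing_backend(environ, start_response):
    """Hypothetical stand-in for a proxy whose upstream ES query failed."""
    start_response("503 Service Unavailable",
                   [("Content-Type", "application/json")])
    return [b'{"error": "all shards failed", "status": 503}']


proxy = allow_any_origin(failing_backend)
```

With the header present on the error response, the browser can hand the real status and body to the page script instead of reporting a blocked cross-origin request.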
Avi,

Finally, on the issue of java.lang.VerifyError:  This needs more investigation.  I do get the same error, and I must investigate what's happening.

In summary, it is a rough tool.
Thanks Kyle!

I can confirm I see what you see in your screenshot, and that after it shows the errors and keeps spinning, it eventually stops spinning.

I also tried choosing different tests and dates-range, and it seems to also work (quick-ish even!), and seems to handle errors reasonably well (e.g. when unselecting the last selected something it selects "All" automatically which usually ends in error - "too many results", but then selecting something shows it correctly, yay!).

I did try to go back a bit with the dates ranges. Did I deduce correctly that the data starts around July 8th?

Another issue I noticed is that when browsing tests via the suite menu, it only shows up to 20 first tests of the suite (so I couldn't choose e.g. newtab-open-preload-no.error.TART).

When I tried to choose this test via the "Tests" menu, it displayed "no results" at the bottom (dates range June 1st - September 30th).

With the current status of the charts page, what should I click and in what order if I want to view the newtab-open-preload-no.error.TART values over some dates range?

Thanks again for making it accessible!
Avi,


> I did try to go back a bit with the dates ranges. Did I deduce correctly
> that the data starts around July 8th?

The data starts at test result 6,000,000, which is probably around July 8.  We can go back further if you need.

> 
> Another issue I noticed is that when browsing tests via the suite menu, it
> only shows up to 20 first tests of the suite (so I couldn't choose e.g.
> newtab-open-preload-no.error.TART).

Thank you for noticing the 20-limit:  That is a default to prevent too many items showing.  It is fixed now:  You can see all tests in a Suite (reload).

> 
> When I tried to choose this test via the "Tests" menu, it displayed "no
> results" at the bottom (dates range June 1st - September 30th).

The Suite selection and Test selection are ANDed together: Be sure the Suite==All if you select a Test.  I know this is pathological; The Suite is much easier to navigate, but the old email alerts point to this page using "test" url parameter; keeping the Test drop down was the quickest way to keep this page functional for those links (for now).
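
A small sketch of why combining the two selectors can come back empty when they are ANDed together. The record layout and field names here are assumptions for illustration, not the dashboard's real schema.

```python
def matches(record, suite="All", test="All"):
    """A selector left at "All" matches everything; the two selectors are ANDed."""
    suite_ok = suite == "All" or record["suite"] == suite
    test_ok = test == "All" or record["test"] == test
    return suite_ok and test_ok


# Made-up records; only the test name below appears in this thread.
records = [
    {"suite": "tart", "test": "newtab-open-preload-no.error.TART"},
    {"suite": "a11yr", "test": "SUMMARY"},
]

wanted = "newtab-open-preload-no.error.TART"

# Suite left at "All": the test selector alone finds the record.
print([r for r in records if matches(r, test=wanted)])

# A different suite is still selected: the AND leaves nothing to show.
print([r for r in records if matches(r, suite="a11yr", test=wanted)])
```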
(In reply to Kyle Lahnakoski [:ekyle] from comment #26)
> > I did try to go back a bit with the dates ranges. Did I deduce correctly
> > that the data starts around July 8th?
> 
> The data starts at test result 6,000,000, which is probably around July 8.
> We can go back further if you need.

Would greatly appreciate data since beginning of April. Thanks.
(In reply to Avi Halachmi (:avih) from comment #27)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #26)
> > The data starts at test result 6,000,000, which is probably around July 8.
> > We can go back further if you need.
> 
> Would greatly appreciate data since beginning of April. Thanks.

Any estimation when could such thing happen?
Flags: needinfo?(klahnakoski)
Sorry for the delay.  I have updated the settings to pull everything starting at test 5,000,000 (instead of 6,000,000).  We will see later today if that goes back far enough.

I am still using the old machine, and not the bigger laptop, so we will see if the increased data breaks it.
Flags: needinfo?(klahnakoski)
Large machine is now in use, it is catching up.  The left-side of charts show the date it has got to.  It should be done by tomorrow morning.
The new larger cluster has been working for a while now.  But there seems to be little interest.

The data goes back to Sep 9 only, because I neglected to load all the way back to May.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED