Closed
Bug 1121507
Opened 10 years ago
Closed 10 years ago
Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade
Categories
(Tree Management Graveyard :: OrangeFactor, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rwatson, Assigned: emorley)
The MOC is getting alerts for orangefactor1.dmz.phx1.mozilla.com: it is running low on disk space (10% free). Can this be looked at?
Assignee
Comment 1•10 years ago
[root@orangefactor1.dmz.phx1 logparser]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00 39091656 33149620 3956552 90% /
tmpfs 961348 12 961336 1% /dev/shm
/dev/sda1 99150 73630 20400 79% /boot
Virtually all of that is under:
/home/webtools/apps/logparser/incoming-logs
[root@orangefactor1.dmz.phx1 logparser]# du -sh *
76K bin
5.8M finished-logs
4.0K include
30G incoming-logs
11M lib
0 lib64
4.0K logparser.pid
4.0K pending_jobs.txt
4.0K pulse.ini
210M savelogs.err
0 savelogs.log
592K savelogs.out
688K savelogs.out.1
672K savelogs.out.2
980K savelogs.out.3
27M src
Assignee
Comment 2•10 years ago
savelogs.err contains lots of:
2015-01-14 07:48:08,301 - BuildLogMonitor - ERROR - Unknown exception type
Traceback (most recent call last):
File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
lp.parseFiles()
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
self.postResultsToElasticSearch(testdata)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
self._post_testgroup_to_elasticsearch(data)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
testgroup.submit()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
self._generate_testrun()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
doc_type = [self.doc_type])
File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
doc_types=self.doc_type)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 793, in count
return self._query_call("_count", query, indexes, doc_types, **query_params)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 246, in _query_call
response = self._send_request('GET', path, body, querystring_args)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/es.py", line 218, in _send_request
raise_if_error(response.status, decoded)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.15.0-py2.6.egg/pyes/convert_errors.py", line 57, in raise_if_error
raise pyes.exceptions.ElasticSearchException("Unknown exception type", status, result)
ElasticSearchException: Unknown exception type
and
2015-01-14 07:48:06,402 - BuildLogMonitor - ERROR - global name 'json' is not defined
Traceback (most recent call last):
File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
lp.parseFiles()
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
self.postResultsToElasticSearch(testdata)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
self._post_buildlog_to_elasticsearch(data)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
AutologBuildLog(index='logs', server=server, **postdata).submit()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 87, in submit
build_id = self._add_doc(json, self.doc_type, self.id)
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 57, in _add_doc
raise Exception(json.dumps(result))
NameError: global name 'json' is not defined
-> Presumably the logparser is failing, so we've ended up with a backlog of unparsed logs taking up 30G on disk?
Assignee
Comment 3•10 years ago
Yeah, finished-logs is empty (it's purged periodically [1]) and has a last-modified date of Jan 7th (presumably the last time we parsed/purged logs).
And incoming logs has some really old logs there:
[root@orangefactor1.dmz.phx1 incoming-logs]# ls -altr | head
total 30566200
-rw-r--r-- 1 webtools webtools 47710 Dec 4 14:15 mozilla-inbound-ubuntu32_vm-linux-opt-1417725651-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools 145470 Dec 4 14:42 b2g-inbound-win7-ix-win32-debug-1417725708-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools 27301 Dec 5 08:52 mozilla-inbound-mountainlion-macosx64-opt-1417791709-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools 47427 Dec 5 13:47 mozilla-inbound-ubuntu64_vm-linux64-opt-1417814675-mochitest-e10s-4.txt.gz
-rw-r--r-- 1 webtools webtools 27734 Dec 5 14:17 mozilla-inbound-ubuntu64_vm-linux64-opt-1417816652-mochitest-browser-chrome-2.txt.gz
-rw-r--r-- 1 webtools webtools 633480 Dec 5 15:38 mozilla-inbound-win7-ix-win32-debug-1417817912-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools 244373 Dec 5 15:39 fx-team-ubuntu32_vm-linux-opt-1417818155-mochitest-3.txt.gz
-rw-r--r-- 1 webtools webtools 149176 Dec 5 15:39 mozilla-inbound-win8_64-win64-debug-1417816652-mochitest-gl.txt.gz
-rw-r--r-- 1 webtools webtools 29591 Dec 5 15:39 mozilla-inbound-ubuntu64-asan_vm-linux64-asan-opt-1417819050-mochitest-browser-chrome-2.txt.gz
I'm confused why we have the odd old log from December (presumably we're not importing/cleaning those up for some reason; I'll check the error logs once we sort this other problem out), but most of the contents of incoming-logs are from Jan 7th onwards.
[1] [root@orangefactor1.dmz.phx1 incoming-logs]# crontab -l | grep 'finished-logs'
0 1 * * * rm -rf /home/webtools/apps/logparser/finished-logs/*
Assignee
Comment 4•10 years ago
With bug 1121517 fixed, that exception became:
Traceback (most recent call last):
File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
lp.parseFiles()
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
self.postResultsToElasticSearch(testdata)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 198, in postResultsToElasticSearch
self._post_buildlog_to_elasticsearch(data)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 190, in _post_buildlog_to_elasticsearch
AutologBuildLog(index='logs', server=server, **postdata).submit()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 88, in submit
build_id = self._add_doc(json, self.doc_type, self.id)
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line 58, in _add_doc
raise Exception(json.dumps(result))
Exception: {"_type": "buildlogs", "_id": "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true, "_version": 1, "_index": "logs"}
There's also still the "ElasticSearchException: Unknown exception type" (comment 2).
Assignee
Comment 5•10 years ago
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #20)
> January 7 sounds great to me!
The quote above is from "Bug 1071044 - Upgrade ElasticSearch PHX1 Development cluster to 1.2.x", which matches the timeline here - even though that upgrade was only supposed to affect dev, not prod.
@cyliang, were any other changes made that might have affected the use of elasticsearch-zlb.webapp.scl3.mozilla.com?
Flags: needinfo?(cliang)
Assignee
Updated•10 years ago
OS: Mac OS X → All
Hardware: x86 → All
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions
Assignee
Comment 6•10 years ago
Since this bug was filed, free disk space has already dropped from 10% to 9%, so I've purged all logs older than three days from incoming-logs to buy us some time. The OrangeFactor scores from Jan 7th onwards look no different from the days before, so it seems unlikely we'd need to re-parse those logs anyway (I presume the failures must occur after the insert into ES that we need for OF).
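(For reference, the exact purge command isn't recorded in this bug; a rough Python equivalent of that kind of mtime-based cleanup - illustrative only, not the actual command used - would be:)
import os
import time

# Path from comment 1; the three-day threshold matches the purge described above.
INCOMING = '/home/webtools/apps/logparser/incoming-logs'
CUTOFF = time.time() - 3 * 24 * 60 * 60

for name in os.listdir(INCOMING):
    path = os.path.join(INCOMING, name)
    # incoming-logs only contains regular .txt.gz files; skip anything else defensively.
    if os.path.isfile(path) and os.path.getmtime(path) < CUTOFF:
        os.remove(path)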
Comment 7•10 years ago
There were no changes made to the production ES cluster. The only production modification I had to make was to update the script that is used to copy indexes from production to the development ES instances in PHX.
Would it be possible to clarify which ES cluster and which index you're trying to write to? Based on the mention of a "logs" index, I'm assuming that you're writing to the cluster in SCL3. There is both a "logs" index and an "of_logs" index.
The only errors that I'm seeing on the SCL3 instance have to do with bzcache. These occur before January 4th, though, so I don't think they are related.
Flags: needinfo?(cliang)
Assignee
Comment 8•10 years ago
(In reply to Ed Morley [:edmorley] from comment #4)
> "/home/webtools/apps/logparser/src/mozautolog/mozautolog/buildlog.py", line
> 58, in _add_doc
> raise Exception(json.dumps(result))
> Exception: {"_type": "buildlogs", "_id":
> "1421253183-b89c420c8b7069254a54e82828667c1db2e6913c", "created": true,
> "_version": 1, "_index": "logs"}
That line corresponds to:
if not 'ok' in result or not result['ok'] or not '_id' in result:
raise Exception(json.dumps(result))
Looking at the response, it appears successful; there's just no "ok" key in the response any more. A bit of digging found this upstream ElasticSearch change (which removed 'ok', since it's redundant given the HTTP response code):
https://github.com/elasticsearch/elasticsearch/commit/d23f640cd1488727c8337a888527d662c6135fac
However, that change was only supposed to have landed in ES versions above v1.0 (see the tags on the GitHub commit), yet elasticsearch-zlb.webapp.scl3.mozilla.com is on 0.90.x:
[root@orangefactor1.dmz.phx1 logparser]# curl elasticsearch-zlb.webapp.scl3.mozilla.com:9200
...
"name" : "elasticsearch1_scl3",
"version" : {
"number" : "0.90.10",
Grepping for ES server references (in case we're using the wrong instance):
./logparser/src/mozautolog/mozautolog/buildlog.py: self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'
./logparser/src/mozautolog/mozautolog/mozautolog.py: self.server = 'elasticsearch-zlb.webapp.scl3.mozilla.com:9200'
./logparser/src/logparser/scripts/savelogs-daemon.sh: start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} > savelogs.log 2>&1
./logparser/src/logparser/scripts/savelogs.sh: nohup ${SAVELOGS} --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} > ${ENV}/savelogs.log 2>&1 &
./logparser/bin/savelogs-gen.sh: nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &
./logparser/bin/savelogs-daemon-gen.sh: start-stop-daemon --start --user ${USER} --chuid ${USER}:${USER} --pidfile ${PIDFILE} --nicelevel 10 --make-pidfile --background --exec ${SAVELOGS} -- --es --durable --es-server=elasticsearch1.metrics.scl3.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pulse-cfg=${PULSECFG} > savelogs.log 2>&1
Now elasticsearch1.metrics.scl3 is decommissioned, as I understand it, and either way it doesn't respond from the orangefactor1 box, so the scripts that reference it must be unused. That seems to be confirmed by:
[root@orangefactor1.dmz.phx1 logparser]# egrep 'savelogs.*sh' /etc/init.d/logparser
DAEMON="$BASEDIR/bin/savelogs-gen.sh"
So we don't appear to be inadvertently using dev or another v1.0+ instance, yet we're still seeing the behaviour from the upstream ES change that supposedly only landed in v1.0+?
Bizarre. Let's just start checking for "created" rather than "ok" for now anyway.
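A minimal sketch of that adjusted check (the shape of the fix rather than the exact patch applied; the helper name is made up, and the response format is the 1.x-style one from comment 4):
import json

def check_index_response(result):
    # ES removed the 'ok' field from index responses upstream; newer responses
    # carry 'created' (plus the HTTP status code) instead, so accept either.
    succeeded = result.get('ok') or result.get('created')
    if not succeeded or '_id' not in result:
        raise Exception(json.dumps(result))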
Assignee
Comment 9•10 years ago
(In reply to Ed Morley [:edmorley] from comment #4)
> There's also still the "ElasticSearchException: Unknown exception type"
> (comment 2).
We're currently using pyes v0.16 (was v0.15 until bug 1121517 comment ), but v0.19+ gives much better output for this exception:
https://github.com/aparo/pyes/commit/b61b4be45804abbec6fdd0f43e6127d3643e1e8e
However there are lots of breaking changes, so I've backported that fix by hand to logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py temporarily.
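In effect (paraphrasing from the tracebacks in comment 2 and below, not the literal pyes diff), the fallback raise in convert_errors.py goes from:
# Before (pyes 0.15/0.16): the fallback error hides the actual ES response.
raise pyes.exceptions.ElasticSearchException("Unknown exception type", status, result)
to:
# After (backported from pyes 0.19+): include the HTTP status and body in the message.
raise pyes.exceptions.ElasticSearchException(u"Unknown exception type: %d, %s" % (status, result), status, result)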
This revealed:
Traceback (most recent call last):
File "/home/webtools/apps/logparser/src/logparser/logparser/savelogs.py", line 64, in parse
lp.parseFiles()
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 106, in parseFiles
self.postResultsToElasticSearch(testdata)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 196, in postResultsToElasticSearch
self._post_testgroup_to_elasticsearch(data)
File "/home/webtools/apps/logparser/src/logparser/logparser/logparser.py", line 183, in _post_testgroup_to_elasticsearch
testgroup.submit()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 80, in submit
self._generate_testrun()
File "/home/webtools/apps/logparser/src/mozautolog/mozautolog/esautolog.py", line 58, in _generate_testrun
doc_type = [self.doc_type])
File "/home/webtools/apps/logparser/src/mozautoeslib/mozautoeslib/eslib.py", line 186, in query
doc_types=self.doc_type)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 893, in count
return self._query_call("_count", query, indexes, doc_types, **query_params)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 251, in _query_call
response = self._send_request('GET', path, body, querystring_args)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/es.py", line 223, in _send_request
raise_if_error(response.status, decoded)
File "/home/webtools/apps/logparser/lib/python2.6/site-packages/pyes-0.16.0-py2.6.egg/pyes/convert_errors.py", line 58, in raise_if_error
raise pyes.exceptions.ElasticSearchException(u"Unknown exception type: %d, %s" % (status, result), status, result)
ElasticSearchException: Unknown exception type: 400, {u'count': 0, u'_shards': {u'successful': 0, u'failed': 5, u'total': 5, u'failures': [{u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][0] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 0}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][2] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 2}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][1] ]; nested: RemoteTransportException[[elasticsearch2_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 1}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][4] ]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 4}, {u'index': u'logs', u'reason': u'BroadcastShardOperationFailedException[[logs][3] ]; nested: RemoteTransportException[[elasticsearch1_dev_phx1][inet[/SNIP:9300]][count/s]]; nested: QueryParsingException[[logs] request does not support [bool]]; ', u'shard': 3}]}}
I've removed the IP addresses, but even without them it's clear that we're accessing elasticsearch2_dev_phx1.
Re-reading comment 8 shows that I missed:
./logparser/bin/savelogs-gen.sh: nohup ${SAVELOGS} --es --durable --es-server=elasticsearch-zlb.webapp.scl3.mozilla.com:9200 --es-server=elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200 --savedir=${OUTPUT} --outputdir=${FINISHEDLOGS} --outputlog=${OUTPUTLOG} --errorlog=${ERRORLOG} --pidfile=${PIDFILE} --pulse-cfg=${PULSECFG} > ${ENV}/savelogs.log 2>&1 &
Looking at the shell history shows:
436 2014-10-09 11:45:46 vi bin/savelogs-gen.sh
437 2014-10-09 11:46:34 vi bin/savelogs-
438 2014-10-09 11:46:36 vi bin/savelogs-daemon-gen.sh
Jonathan, was something being tested then? I'm presuming we're not intentionally using the dev ES instance? (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)
Flags: needinfo?(jgriffin)
Assignee
Comment 10•10 years ago
(In reply to Ed Morley [:edmorley] from comment #9)
> Jonathan, was something being tested then? I'm presuming we're not
> intentionally using the dev ES instance?
> (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com)
Or rather: I don't think logparser should be submitting to both prod and dev ES at all, seeing as (as I understand it) there is already a script for mirroring from prod to dev. Either way, at least we now know that these failures are only for the submission to dev, which we can switch off for now.
I've removed the additional "--es-server" param specifying the dev ES instance from savelogs-gen.sh. Whilst I was there, I fixed the out-of-date ES instance listed in the unused savelogs-daemon-gen.sh. I've also cleared the contents of incoming-logs (since it was only the submission to dev that failed) and restarted the logparser service.
All seems good now: no errors in savelogs.err, seemingly normal output in savelogs.out, and logs appearing in finished-logs/.
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jgriffin)
Resolution: --- → FIXED
Summary: Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions → Low disk on orangefactor1.dmz.phx1.mozilla.com due to logparser exceptions after dev ES upgrade
Assignee
Comment 11•10 years ago
Meant to say: I'm looking forward to when we can get OrangeFactor v2 created and using Treeherder's API, so we can switch all of this off! The use of customised, not-in-the-repo scripts that pass config params on the command line makes this type of issue much harder than it needs to be. Even a not-checked-in settings.local.py in the repo is more discoverable, since an |hg status --ignored| surfaces it pretty easily :-)
Comment 12•10 years ago
That is *awesome* detective work.
Comment 13•10 years ago
Thanks Ed. At some point in the past, the production instance wasn't being mirrored to dev, thus the double posting.
Updated•5 years ago
Product: Tree Management → Tree Management Graveyard