Bug 1300341 (Closed)
Opened 9 years ago
Closed 9 years ago
Hive query started failing for Socorro
Categories: Data & BI Services Team :: DI: Hadoop (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: peterbe, Assigned: mpressman
You can see on https://crash-stats.mozilla.com/crontabber-state/ that our cron job that does the Hive query for ADI has started failing.
It hasn't worked for roughly three days.
Here is the full error:
"'Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.security.AccessControlException: Permission denied: user=socorro, access=WRITE, inode=\"/tmp/hive-hive\":hive:supergroup:drwxr-xr-x\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5607)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5589)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5563)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3685)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3655)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3629)\\n\\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:741)\\n\\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)\\n\\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\\n\\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)\\n\\tat java.security.AccessController.doPrivileged(Native Method)\\n\\tat javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)\\n\\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)\\n'"
Updated•9 years ago
Group: metrics-private
Comment 1•9 years ago
It stopped working on Sept 1st, I think.
Comment 2•9 years ago
I set a public warning message that will need to be cleared.
Comment 3•9 years ago
September 1 was when we upgraded Cloudera to version 5.1.3. I'll work on fixing this.
Comment 4•9 years ago
So, I'm seeing this in /tmp:
drwxr-xr-x - hive supergroup 0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x - socorro supergroup 0 2016-09-01 08:00 /tmp/hive-hive-socorro
While we can make /tmp/hive-hive world/group writable, would it be possible to use /tmp/hive-hive-socorro, which seems designed for the purpose?
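If we do go the world/group-writable route, here is a minimal sketch of the change, assuming superuser access to a node with the HDFS client configured (the exact mode bits are an assumption on my part):
"""
# Hypothetical sketch only: loosen permissions on the shared scratch
# directory so non-'hive' users can write to it. Requires HDFS superuser
# rights on a node with the hdfs CLI available.
import subprocess

# 1777 (world-writable plus the sticky bit) mirrors what HDFS commonly
# uses for shared /tmp-style paths; the exact mode is an assumption.
subprocess.check_call(["hdfs", "dfs", "-chmod", "1777", "/tmp/hive-hive"])
"""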
Comment 5•9 years ago
From the error:
user=socorro, access=WRITE, inode="/tmp/hive-hive":hive:supergroup:drwxr-xr-x
I see that you are using a user named 'socorro' to write into /tmp/hive-hive.
If we check HDFS permissions we see that:
drwxr-xr-x - hive supergroup 0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x - socorro supergroup 0 2016-09-01 08:00 /tmp/hive-hive-socorro
It is logical that, as user socorro, you won't be able to write into /tmp/hive-hive, since as shown it is owned by user 'hive'. I am sure you can write to /tmp/hive-hive-socorro without issues.
Without knowing exactly which user you run this as, or the exact statement of the job you execute, I think this may be related to how HiveServer2 now manages impersonation. From the CDH documentation:
" HiveServer2 Impersonation allows users to execute queries and access HDFS files as the connected user.
If you do not enable impersonation, HiveServer2 by default executes all Hive tasks as the user ID that starts the Hive server"
Source: http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/cdh5ig_filesystem_perm.html
If the behavior changed for you without any modifications on your side, it may be because the job was previously not being submitted as user 'socorro' but was instead falling back to the default user 'hive'. Please check whether you can change the directory to /tmp/hive-hive-socorro instead; if not, try using hive as the userid, or ask us to make whatever permission changes to the filesystem you think are needed.
Comment 6•9 years ago
The Socorro code that connects to Hive (https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py) uses the pyhs2 driver: https://pypi.python.org/pypi/pyhs2
This runs on a standalone node in PHX and connects to Hive over the network; there's no Hadoop or Java code running on the client as far as I can tell.
I've made a reduced test case that can be run standalone (assuming pyhs2 is installed):
"""
import pyhs2
hive = pyhs2.connect(
host="HOST_NAME_HERE",
port=10000,
authMechanism="PLAIN",
user="socorro",
password="ignore",
database="default",
# 30 minutes (ms)
timeout=1800000
)
cur = hive.cursor()
cur.execute("select ds from v2_raw_logs limit 1")
for row in cur:
print row
"""
I've run this on the same node that the Socorro cron job runs on, and pyhs2 throws the same exception as :peterbe pasted in comment 0.
Sheeri, I don't see any way in pyhs2 to set things like the temp directory that Hive uses (or any direct access to HDFS from this node, that I know of) - is there a way for us to do this, or does it need to be set on the Hive side?
Flags: needinfo?(scabral)
Hi,
I'm looking at a few articles about this and would like your help running a test. Please try that same reduced test again, but this time make sure to run the following query first:
"set hive.exec.scratchdir=/tmp/hive-socorro" (in the form of cur.execute(<here>))
then continue the rest of your test and let me know the outputs, the time of execution, and the username used to execute the python script.
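In case it helps, here is the reduced test from comment 6 with that SET statement added, as a sketch (host name is a placeholder as before; I'm assuming pyhs2 passes SET statements through execute() unchanged):
"""
import pyhs2

# Same connection parameters as the reduced test in comment 6.
hive = pyhs2.connect(
    host="HOST_NAME_HERE",
    port=10000,
    authMechanism="PLAIN",
    user="socorro",
    password="ignore",
    database="default",
    timeout=1800000  # 30 minutes (ms)
)
cur = hive.cursor()
# Point this session's scratch directory at a path the 'socorro' user
# can write to, per the suggestion above.
cur.execute("set hive.exec.scratchdir=/tmp/hive-socorro")
cur.execute("select ds from v2_raw_logs limit 1")
for row in cur:
    print row
"""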
Sources:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-TemporaryFolders
https://issues.apache.org/jira/browse/HIVE-6602
http://stackoverflow.com/questions/21370431/how-to-access-hive-via-python
Nicolas Parducci
Pythian SRE team
I have made some minor modifications that may help out. Please run both the same test in the same way you already had and also a new test forcing the scratchdir to a particular value as described above, and let me know the results for both cases.
Comment 10•9 years ago
Thanks for the suggestion, PythianSRC. I just re-ran the original test with no changes and it doesn't seem to be working:
pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask'
We're also getting this same error on production now:
https://crash-stats.mozilla.com/crontabber-state/#failing
Comment 11•9 years ago
Can you please provide the following outputs:
- output of the 'id' command, to list which user/group you are executing your code under
- 'cat <script-name.py>', to see the exact code you are executing
- 'env', to see the predefined shell environment variables you may be picking up
- 'date', to know the exact time at which you are executing your script
- './script-name.py', to get the output of executing the script
Once you provide those outputs I can continue to analyze. The reason I am asking is that, from my side, I am seeing the following error:
File file:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib-0.10.0-cdh4.7.0.jar does not exist
Apparently either you or Hive2 still has a reference to an old jar. I want to make sure this is the error correlating to your execution, and that you are not referencing that old jar directly or indirectly, so we can confirm the error is on the server side.
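For convenience, a sketch that would capture all of the requested outputs in one file ("fetch_adi_test.py" is a placeholder for whatever the reduced test script is actually named):
"""
# Hypothetical helper: capture the diagnostics requested above in one file.
import subprocess

commands = [
    ["id"],
    ["cat", "fetch_adi_test.py"],
    ["env"],
    ["date"],
    ["python", "fetch_adi_test.py"],
]

with open("diagnostics.txt", "w") as out:
    for cmd in commands:
        out.write("$ %s\n" % " ".join(cmd))
        out.flush()  # keep the header line ordered before the command output
        subprocess.call(cmd, stdout=out, stderr=subprocess.STDOUT)
"""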
Comment 12•9 years ago
You can disregard the previous comment for now.
I have analyzed further and found that the reference to the old library is hardcoded in Hive's configuration. At some point this must have been hardcoded, and the upgrade won't touch advanced hardcoded values, hence this error.
I have already informed Sheeri about correcting this and am coordinating with her, as changing it will require a restart of the Hive services. Will keep you updated.
Comment 13•9 years ago
Hi,
We have restarted Hive and corrected that hardcoded configuration issue. Please test again and let us know how it goes.
Regards,
Nicolas Parducci
Pythian SRE team
Comment 14•9 years ago
At the time of writing, the job has been running for 42 minutes without an error:
https://crash-stats.mozilla.com/crontabber-state/
Comment 15•9 years ago
It worked! Now the backfills are running for the dependent jobs. We can close this shortly, once everything has been tested properly.
Comment 16•9 years ago
It works.
https://crash-stats.mozilla.com/crashes-per-day/?p=Firefox
Yay!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
Flags: needinfo?(scabral)