Hive query started failing for Socorro

RESOLVED FIXED

(Reporter: peterbe, Assigned: mpressman)


You can see on https://crash-stats.mozilla.com/crontabber-state/ that our cron job that does the Hive query for ADI has started failing. 

It hasn't worked for roughly 3 days.

Here is the full error:


"'Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.security.AccessControlException: Permission denied: user=socorro, access=WRITE, inode=\"/tmp/hive-hive\":hive:supergroup:drwxr-xr-x\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5607)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5589)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5563)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3685)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3655)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3629)\\n\\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:741)\\n\\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)\\n\\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\\n\\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)\\n\\tat java.security.AccessController.doPrivileged(Native Method)\\n\\tat javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)\\n\\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)\\n'"
}
Group: metrics-private
It stopped working on Sept 1st, I think.
Blocks: 1300457
I set a public warning message that will need to be cleared.
9/1 was when we upgraded Cloudera to version 5.1.3. I'll work on fixing this.
So, I'm seeing this in /tmp:

drwxr-xr-x   - hive        supergroup          0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x   - socorro     supergroup          0 2016-09-01 08:00 /tmp/hive-hive-socorro

While we can make /tmp/hive-hive world/group writable, would it be possible to use /tmp/hive-hive-socorro, which seems designed for the purpose?
From the error:
user=socorro, access=WRITE, inode="/tmp/hive-hive":hive:supergroup:drwxr-xr-x

I see that you are using a user named 'socorro' to write into /tmp/hive-hive.

Checking the HDFS permissions, we see:
drwxr-xr-x   - hive        supergroup          0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x   - socorro     supergroup          0 2016-09-01 08:00 /tmp/hive-hive-socorro

It makes sense that as user socorro you cannot write into /tmp/hive-hive, since the listing shows it is owned by user 'hive'. I am sure you can write to /tmp/hive-hive-socorro without issues.

Without knowing exactly which user you run this as, or the exact statement of the job you execute, I think this may be related to how HiveServer2 now manages impersonation. From the CDH documentation:
"HiveServer2 impersonation allows users to execute queries and access HDFS files as the connected user.
If you do not enable impersonation, HiveServer2 by default executes all Hive tasks as the user ID that starts the Hive server."
Source: http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/cdh5ig_filesystem_perm.html

If the behavior changed without any modifications on your side, it may be because the job was previously not treated as submitted by user 'socorro' but instead fell back to the default user 'hive'. Please check whether you can change the directory to /tmp/hive-hive-socorro instead; if not, try using 'hive' as the user ID, or ask us to make whatever permission changes you need to the filesystem.
The Socorro code that connects to Hive (https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py) uses the pyhs2 driver: https://pypi.python.org/pypi/pyhs2

This runs on a standalone node in PHX and connects to Hive over the network; there's no Hadoop or Java code running on the client as far as I can tell.

I've made a reduced test case that can be run standalone (assuming pyhs2 is installed):

"""
import pyhs2

hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
  # 30 minutes (ms)
  timeout=1800000
)

cur = hive.cursor()
cur.execute("select ds from v2_raw_logs limit 1")

for row in cur:
  print row
"""

I've run this on the same node that the Socorro cron job runs on, and pyhs2 throws the same exception as :peterbe pasted in comment 0.

Sheeri, I don't see any way in pyhs2 to set things like the temp directory that Hive uses (or any direct access to HDFS from this node, as far as I know) - is there a way for us to do this, or does it need to be set on the Hive side?
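
For reference, a minimal untested sketch of poking at this from the client: I'm assuming Hive echoes a property's current value when "set <property>" is executed without assigning one (as it does in beeline), and that session-level "set" statements can be issued through the same pyhs2 cursor used for queries:

"""
import pyhs2

# Untested sketch: reuse the reduced test's connection parameters and
# issue a session-level "set" statement through the same cursor API.
hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
)

cur = hive.cursor()

# Running "set <property>" with no value makes Hive echo the current
# value, so this should reveal whether impersonation (doAs) is on:
cur.execute("set hive.server2.enable.doAs")
for row in cur:
  print(row)  # expected: something like hive.server2.enable.doAs=true
"""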
Flags: needinfo?(scabral)
Hi,

I'm looking at a few articles about this and would like your help running a test. Please try the same reduced test again, but this time make sure to run the following statement first:

"set hive.exec.scratchdir=/tmp/hive-socorro" in the form of cur.execute(<here>)

Then continue with the rest of your test and let me know the output, the time of execution, and the username used to execute the Python script.

Sources:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-TemporaryFolders
https://issues.apache.org/jira/browse/HIVE-6602
http://stackoverflow.com/questions/21370431/how-to-access-hive-via-python

Nicolas Parducci
Pythian SRE team
Correction, that should probably be:
/tmp/hive-hive-socorro
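
A minimal sketch of the modified test with the corrected path (same assumptions and connection parameters as the original reduced test above; HOST_NAME_HERE still needs filling in):

"""
import pyhs2

hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
  timeout=1800000
)

cur = hive.cursor()

# Point this session's scratch directory at the socorro-owned path
# before running anything else, so Hive never touches /tmp/hive-hive.
cur.execute("set hive.exec.scratchdir=/tmp/hive-hive-socorro")

# Then the original reduced test:
cur.execute("select ds from v2_raw_logs limit 1")
for row in cur:
  print(row)
"""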
I have also made some minor modifications that may help. Please run both the same test in the same way you already had it, and a new test forcing the scratchdir to a particular value, as described and sketched above, and let me know the results for both cases.
Thanks for the suggestion, Pythian SRE. I just re-ran the original test with no changes and it doesn't seem to be working:

pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask'

We're also getting this same error on production now too:

https://crash-stats.mozilla.com/crontabber-state/#failing
Can you please provide the following outputs:

- output of the 'id' command, to list which user/group you are executing your code under
- 'cat <script-name.py>', to show the exact code you are executing
- 'env', to show the environment variables your shell may be picking up
- the 'date' command, to know the exact time at which you are executing your script
- './script-name.py', to capture the output of executing the script

Once you provide those outputs I can continue the analysis. The reason I am asking is that on my side I am seeing the following error:
File file:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib-0.10.0-cdh4.7.0.jar does not exist

Apparently either your code or HiveServer2 still has a reference to an old jar. I want to confirm that this is the error correlating to your execution, and that you are not referencing that old jar directly or indirectly, so we can be sure the error is on the server side.
You can disregard the previous comment for now.

I have analyzed further and found that the old library path is hardcoded in Hive's configuration. At some point this must have been entered as an advanced configuration value, and the upgrade does not touch hardcoded values, hence this error.

I have informed Sheeri about correcting this and am coordinating with her, since changing it will require a restart of the Hive services. Will keep you updated.
Hi,

We have restarted Hive and corrected that hardcoded configuration issue. Please test again and let us know how it goes.

Regards,

Nicolas Parducci
Pythian SRE team
At the time of writing, the query has been running for 42 minutes without an error:
https://crash-stats.mozilla.com/crontabber-state/
It worked! The backfills for the dependent jobs are now running. We can close this shortly, once everything has been tested properly.
It works.
https://crash-stats.mozilla.com/crashes-per-day/?p=Firefox
Yay!
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Flags: needinfo?(scabral)