Hive query started failing for Socorro

RESOLVED FIXED

(Reporter: peterbe, Assigned: mpressman)


You can see on https://crash-stats.mozilla.com/crontabber-state/ that our cron job that does the Hive query for ADI has started failing. 

It hasn't worked for roughly 3 days.

Here is the full error:


"'Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.security.AccessControlException: Permission denied: user=socorro, access=WRITE, inode=\"/tmp/hive-hive\":hive:supergroup:drwxr-xr-x\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5607)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5589)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5563)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3685)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3655)\\n\\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3629)\\n\\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:741)\\n\\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)\\n\\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\\n\\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)\\n\\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)\\n\\tat java.security.AccessController.doPrivileged(Native Method)\\n\\tat javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)\\n\\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)\\n'"
}
Group: metrics-private
It stopped working on Sept 1st, I think.
Blocks: 1300457
I set a public warning message that will need to be cleared.
9/1 was when we upgraded Cloudera to version 5.1.3. I'll work on fixing this.
So, I'm seeing this in /tmp:

drwxr-xr-x   - hive        supergroup          0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x   - socorro     supergroup          0 2016-09-01 08:00 /tmp/hive-hive-socorro

While we can make /tmp/hive-hive world/group writable, would it be possible to use /tmp/hive-hive-socorro, which seems designed for the purpose?
From the error:
user=socorro, access=WRITE, inode="/tmp/hive-hive":hive:supergroup:drwxr-xr-x

I see that you are using a user named 'socorro' to write into /tmp/hive-hive.

Checking the HDFS permissions, we see:
drwxr-xr-x   - hive        supergroup          0 2015-09-14 20:47 /tmp/hive-hive
drwxr-xr-x   - socorro     supergroup          0 2016-09-01 08:00 /tmp/hive-hive-socorro

It makes sense that as user socorro you cannot write into /tmp/hive-hive, since the listing shows it is owned by user 'hive'. I am sure you can write to /tmp/hive-hive-socorro without issues.

Without knowing exactly which user you run this as, or the exact statement of the job you execute, I think this may be related to how HiveServer2 now manages impersonation. From the CDH documentation:
"HiveServer2 impersonation allows users to execute queries and access HDFS files as the connected user.
If you do not enable impersonation, HiveServer2 by default executes all Hive tasks as the user ID that starts the Hive server."
Source: http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/cdh5ig_filesystem_perm.html

If the behavior changed without any modifications on your side, it may be because the job was previously not treated as submitted by user 'socorro' but instead fell back to the default user 'hive'. Please check whether you can change the directory to /tmp/hive-hive-socorro instead; if not, try using 'hive' as the user ID, or ask us to make whatever permission changes you need to the filesystem.
The Socorro code that connects to Hive (https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py) uses the pyhs2 driver: https://pypi.python.org/pypi/pyhs2

This runs on a standalone node in PHX and connects to Hive over the network; there's no Hadoop or Java code running on the client as far as I can tell.

I've made a reduced test case that can be run standalone (assuming pyhs2 is installed):

"""
import pyhs2

hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
  # 30 minutes (ms)
  timeout=1800000
)

cur = hive.cursor()
cur.execute("select ds from v2_raw_logs limit 1")

for row in cur:
  print row
"""

I've run this on the same node that the Socorro cron job runs on, and pyhs2 throws the same exception as :peterbe pasted in comment 0.

Sheeri, I don't see any way in pyhs2 to set things like the temp directory that Hive uses (or any direct access to HDFS from this node, as far as I know) - is there a way for us to do this, or does it need to be set on the Hive side?
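
For reference, a minimal untested sketch of poking at this from the client: I'm assuming Hive echoes a property's current value when "set <property>" is executed without assigning one (as it does in beeline), and that session-level "set" statements can be issued through the same pyhs2 cursor used for queries:

"""
import pyhs2

# Untested sketch: reuse the reduced test's connection parameters and
# issue a session-level "set" statement through the same cursor API.
hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
)

cur = hive.cursor()

# Running "set <property>" with no value makes Hive echo the current
# value, so this should reveal whether impersonation (doAs) is on:
cur.execute("set hive.server2.enable.doAs")
for row in cur:
  print(row)  # expected: something like hive.server2.enable.doAs=true
"""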
Flags: needinfo?(scabral)
Hi,

I'm looking at a few articles about this and would like your help running a test. Please try the same reduced test again, but this time make sure to run the following statement first:

"set hive.exec.scratchdir=/tmp/hive-socorro" in the form of cur.execute(<here>)

Then continue with the rest of your test and let me know the output, the time of execution, and the username used to execute the Python script.

Sources:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-TemporaryFolders
https://issues.apache.org/jira/browse/HIVE-6602
http://stackoverflow.com/questions/21370431/how-to-access-hive-via-python

Nicolas Parducci
Pythian SRE team
Correction, that should probably be:
/tmp/hive-hive-socorro
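
A minimal sketch of the modified test with the corrected path (same assumptions and connection parameters as the original reduced test above; HOST_NAME_HERE still needs filling in):

"""
import pyhs2

hive = pyhs2.connect(
  host="HOST_NAME_HERE",
  port=10000,
  authMechanism="PLAIN",
  user="socorro",
  password="ignore",
  database="default",
  timeout=1800000
)

cur = hive.cursor()

# Point this session's scratch directory at the socorro-owned path
# before running anything else, so Hive never touches /tmp/hive-hive.
cur.execute("set hive.exec.scratchdir=/tmp/hive-hive-socorro")

# Then the original reduced test:
cur.execute("select ds from v2_raw_logs limit 1")
for row in cur:
  print(row)
"""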
I have also made some minor modifications that may help. Please run both the same test in the same way you already had it, and a new test forcing the scratchdir to a particular value, as described and sketched above, and let me know the results for both cases.
Thanks for the suggestion, Pythian SRE. I just re-ran the original test with no changes and it doesn't seem to be working:

pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask'

We're also getting this same error on production now too:

https://crash-stats.mozilla.com/crontabber-state/#failing
Can you please provide the following outputs:

- output of the 'id' command, to list which user/group you are executing your code under
- 'cat <script-name.py>', to show the exact code you are executing
- 'env', to show the environment variables your shell may be picking up
- the 'date' command, to know the exact time at which you are executing your script
- './script-name.py', to capture the output of executing the script

Once you provide those outputs I can continue the analysis. The reason I am asking is that on my side I am seeing the following error:
File file:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib-0.10.0-cdh4.7.0.jar does not exist

Apparently either your code or HiveServer2 still has a reference to an old jar. I want to confirm that this is the error correlating to your execution, and that you are not referencing that old jar directly or indirectly, so we can be sure the error is on the server side.
You can disregard the previous comment for now.

I have analyzed further and found that the old library path is hardcoded in Hive's configuration. At some point this must have been entered as an advanced configuration value, and the upgrade does not touch hardcoded values, hence this error.

I have informed Sheeri about correcting this and am coordinating with her, since changing it will require a restart of the Hive services. Will keep you updated.
Hi,

We have restarted Hive and corrected that hardcoded configuration issue. Please test again and let us know how it goes.

Regards,

Nicolas Parducci
Pythian SRE team
At the time of writing, the query has been running for 42 minutes without an error:
https://crash-stats.mozilla.com/crontabber-state/
It worked! The backfills for the dependent jobs are now running. We can close this shortly, once everything has been tested properly.
It works.
https://crash-stats.mozilla.com/crashes-per-day/?p=Firefox
Yay!
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Flags: needinfo?(scabral)