An Example of How to Solve Problems with Open Source: Connecting Excel to Spark SQL

Today I tried to connect Excel 2013 to Spark SQL. It was an interesting experience: digging into the source code in order to understand how everything works.

I have Spark 1.3.1 (the latest release at the time of writing), running as a standalone cluster on a single-machine HDP 2.1 sandbox. The configuration is as follows:

conf/spark-env.sh:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_YARN_QUEUE=default
export SPARK_EXECUTOR_MEMORY=512M
export HIVE_SERVER2_THRIFT_PORT=10001

Note that HIVE_SERVER2_THRIFT_PORT=10001 makes the Spark Thrift server listen on port 10001, so it does not clash with Hive's own HiveServer2 on the default port 10000.

conf/spark-defaults.conf:

spark.master spark://sandbox.hortonworks.com:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://sandbox.hortonworks.com:8020/user/pdi/spark/logs

conf/slaves:

sandbox.hortonworks.com

The file hive-site.xml was copied from /etc/hive/conf to Spark's conf directory. Then the cluster was started with sbin/start-all.sh, and the logs show that there are two workers – everything seems fine.
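For reference, these steps look roughly like this (a sketch – I am assuming Spark is unpacked in /usr/local/spark; adjust the path for your installation):

# copy Hive's configuration so Spark SQL can reach the Hive metastore
cp /etc/hive/conf/hive-site.xml /usr/local/spark/conf/

# start the standalone master and the workers listed in conf/slaves
cd /usr/local/spark
sbin/start-all.sh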

Then I launched sbin/start-thriftserver.sh.
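(Side note: instead of setting HIVE_SERVER2_THRIFT_PORT in conf/spark-env.sh, the Spark SQL documentation also allows passing the port directly on the command line – the script forwards --hiveconf options to HiveServer2. A sketch:)

sbin/start-thriftserver.sh \
  --master spark://sandbox.hortonworks.com:7077 \
  --hiveconf hive.server2.thrift.port=10001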

As we can see, one more log file has been added:

and in this log file we can see the following:

Then I switched to Windows. There I installed the Microsoft Hive ODBC Driver (https://www.microsoft.com/en-us/download/details.aspx?id=40886), and now I am trying to set up the DSN.

When I try to connect to Hive's own HiveServer2 (port 10000), I see this:

But I would also like to connect to Spark SQL, and when connecting to port 10001, I run into problems:

So, the first thing to do is to eliminate the external components (I mean Windows) and try to reproduce the problem with Spark's own beeline JDBC client, using the command

bin/beeline

and then

beeline> !connect jdbc:hive2://sandbox:10001

scan complete in 5ms
Connecting to jdbc:hive2://sandbox:10001
Enter username for jdbc:hive2://sandbox:10001: pdi
Enter password for jdbc:hive2://sandbox:10001: ***
Connected to: Spark SQL (version 1.3.1)
Driver: Hive JDBC (version 0.13.0.2.1.1.0-385)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://sandbox:10001>

So, I am connected.

In my Hive I have a table logsjava, so if I select from it with

select * from logsjava limit 10;

I get the result:

This means that Spark SQL is working fine: the Thrift server is functioning, it consults the Hive metastore to find the data, and it returns the result to the beeline client (which is connected to it via JDBC). Nice.

But now let us look at what happens if I run

0: jdbc:hive2://sandbox:10001> show databases;
Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook) (state=,code=0)
0: jdbc:hive2://sandbox:10001>

Not a very nice picture, is it?

We get a Java exception – the same one we got when we tried to connect from the Windows ODBC driver. The conclusion: the client side is not guilty; it is worth looking at the logs of the Spark SQL Thrift server.

Now let us open the file logs/spark-pdi-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-sandbox.hortonworks.com.out (in Spark's directory) and see what happens:

hive.exec.pre.hooks Class not found:org.apache.hadoop.hive.ql.hooks.ATSHook
FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.hadoop.hive.ql.hooks.ATSHook)
java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.hooks.ATSHook
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.hadoop.hive.ql.hooks.HookUtils.getHooks(HookUtils.java:59)
    at org.apache.hadoop.hive.ql.Driver.getHooks(Driver.java:1172)
    at org.apache.hadoop.hive.ql.Driver.getHooks(Driver.java:1156)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1206)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:318)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:290)
    at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
    at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:101)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:164)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

I got the source code of Hive from https://github.com/apache/hive/tree/branch-0.13 (matching my version, which I determined using the hive --version command), and opened the code of HookUtils.getHooks.
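Paraphrased, the method is simple enough to sketch (this is a sketch of the idea, not a verbatim copy of the Hive 0.13 source): it reads the comma-separated class names from the given configuration property and instantiates each one via Class.forName – exactly the call where our ClassNotFoundException is thrown:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.common.JavaUtils;   // Hive's own classes
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.hooks.Hook;

// A paraphrased sketch of org.apache.hadoop.hive.ql.hooks.HookUtils.getHooks
// in Hive 0.13 -- not a verbatim copy of the source.
public static <T extends Hook> List<T> getHooks(HiveConf conf,
    HiveConf.ConfVars hookConfVar, Class<T> clazz) throws Exception {
  List<T> hooks = new ArrayList<T>();
  // e.g. "org.apache.hadoop.hive.ql.hooks.ATSHook" for hive.exec.pre.hooks
  String csHooks = conf.getVar(hookConfVar);
  if (csHooks == null || csHooks.trim().isEmpty()) {
    return hooks; // no hooks configured, nothing to load
  }
  for (String hookClass : csHooks.split(",")) {
    // This Class.forName is where ClassNotFoundException(ATSHook)
    // comes from when the hook class is not on the classpath.
    T hook = clazz.cast(Class.forName(hookClass.trim(), true,
        JavaUtils.getClassLoader()).newInstance());
    hooks.add(hook);
  }
  return hooks;
}

(If you want to follow along, clone the repo and check out branch-0.13; the method lives in ql/src/java/org/apache/hadoop/hive/ql/hooks/HookUtils.java.)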

Spark's Thrift server takes its Hive configuration from conf/hive-site.xml, and if we open this file, we can see this:

<property>
  <name>hive.exec.pre.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
</property>

Now the task is to find out: what is this hook, do we really need it, and can we unplug it? And if we cannot, can we make Spark's Thrift server see the jar with this hook on its classpath?
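Before answering, it is easy to check whether the class is present in any jar on the sandbox at all. A quick sketch with standard tools – I am assuming the usual HDP location /usr/lib/hive/lib for Hive's jars:

# print every Hive jar that contains the ATSHook class
# (on this sandbox the loop prints nothing, which already hints at the answer)
for j in /usr/lib/hive/lib/*.jar; do
  unzip -l "$j" 2>/dev/null | grep -q 'ql/hooks/ATSHook.class' && echo "$j"
done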

First of all, I must say that the file ATSHook.java is NOT found in the Hive sources we have just downloaded – as of version 0.13 (the one bundled with HDP 2.1). However, googling turned up a Stack Overflow page (http://stackoverflow.com/questions/24316492/unable-to-configure-hive-exec-hooks-due-to-missing-jar) which says that this hook became available in the 0.14 release. And that is really so: downloading the branch https://github.com/apache/hive/tree/branch-0.14 shows that the sources contain the file

hive-branch-0.14\ql\src\java\org\apache\hadoop\hive\ql\hooks\ATSHook.java

which does the job described in the javadocs (https://hive.apache.org/javadocs/r1.0.0/api/ql/org/apache/hadoop/hive/ql/hooks/ATSHook.html):

“ATSHook sends query + plan info to Yarn App Timeline Server.”

NICE! I am so happy that I now know what this hook does, but the obvious next question is: HOW DID THIS HOOK GET INTO THE CONFIGURATION FILE HIVE-SITE.XML, IF AT THE TIME OF THE 0.13 RELEASE IT DID NOT EXIST???

The answer is this: when the guys from Hortonworks created the sandbox virtual machine, they configured Hive by hand instead of applying a standard installation procedure.

This is what I call "WELCOME TO OPEN SOURCE, GUYS!".

The solution: open the file conf/hive-site.xml and, in (!) three places, disable this block of XML:

<property>
  <name>hive.exec.failure.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
</property>

either by wrapping it in XML comments or by removing it altogether.
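For example, you can locate all the occurrences with grep (run from Spark's directory):

grep -n ATSHook conf/hive-site.xml

and neutralize each one by wrapping it in an XML comment:

<!--
<property>
  <name>hive.exec.failure.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
</property>
-->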

Then restart the Thrift server by running sbin/stop-thriftserver.sh followed by sbin/start-thriftserver.sh.
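In shell terms, the restart plus a quick check can look roughly like this (beeline accepts the connection string, user and query on the command line):

sbin/stop-thriftserver.sh
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://sandbox:10001 -n pdi -e "show databases;"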

Now let us connect with beeline again and run the show databases command:

And the same from Excel on Windows:
