Spark and Hive 3

A tale of two metastores

Recently I experienced a cluster upgrade to the next major version of hadoop. Many component versions have been updated. In particular hive moved from 1.2.x to 3.x line. This introduces many changes as hive tries to behave more and more like a traditional database. In particular, hive introduced ACID tables.

By default Hive tries to optimize (i.e. compact and sort) the data automatically so users do not need to worry about this. Additionally, they do support UPSERTS as a slowly changing dimension, which could be suited for GDPR deletions. Unfortunately, spark as of 2.3 or 2.4 does not support reading these tables. In fact, distributors like Cloudera have introduced a second warehouse only used by spark - for a reason.

If you still want to keep everyone using the same warehouse, setting --conf spark.hadoop.metastore.catalog.default=hive gets the job done.

Being able to see or potentially read the data does not mean everything works fine.

The first catch is when one of the tables is ACID spark will only be able to read it corrently after a full major compaction. Otherwise, Spark might receive outdated data as it does not understand the delta files. The second and more subtle is: by not using LLAP’s ranger plugin the columnar ranger policies do not apply.

Both spark and hive remain interoperable using:

  • external tables (the traditional approach)
  • HiveWarehouseConnector
By not using the HiveWareouseConnector the data is not traversing Hive and advanced columnar restrictions in Ranger are not applied. Only regular Ranger HDFS policies remain.

Using the HiveWarehouseConnector is slower than the old file access methodology. This also holds true for Databricks delta, where directly writing to parquet would be faster but not ACID. But in addition managed tables automatically manage the size of the partitions and collect statistics - which might be beneficial overall.

Contrary to what is found in most of the documentation about the warehouse connector it can work with both LLAP and Tez, though there are some restrictions in case of the Tez execution path.

The warehouse connector is not a drop-in replacement. I.e. sparks SQL optimization is broken at the boundaries:

hwx.sql("SELECT * from db.table).count

might run out of memory, whereas sparks native counterpart will work just fine.

summary

Spark and Hive have drifted apart and do not play well together in HDP 3.x.

By separating read and write paths you get the best of both worlds: Spark can write without LLAP for maximum performance to an external table. On the read path by applying the HiveWarehouseConnector data is transferred via LLAP for added security.

Avatar
Georg Heiler
PhD candidate & data scientist

My research interests include large geo-spatial time and network data analytics.

Related