Thursday, July 20, 2017

Securing Apache Hive - part I

This is the first post in a series of articles on securing Apache Hive. In this article we will look at installing Apache Hive and running some queries on data stored in HDFS. We will not consider any security requirements in this post, but the test deployment will be used in future posts in this series on authenticating and authorizing access to Hive.

1) Install and configure Apache Hadoop

The first step is to install and configure Apache Hadoop. Please follow section 1 of this earlier tutorial for information on how to do this. In addition, we need to configure two extra properties in 'etc/hadoop/core-site.xml':
  • hadoop.proxyuser.$user.groups: *
  • hadoop.proxyuser.$user.hosts: localhost
where "$user" above should be replaced with the user that is going to run the hive server below. As we are not using authentication in this tutorial, this allows the $user to impersonate the "anonymous" user, who will connect to Hive via beeline and run some queries.

Once HDFS has started, we need to create some directories for use by Apache Hive, and change the permissions appropriately:
  • bin/hadoop fs -mkdir -p /user/hive/warehouse /tmp
  • bin/hadoop fs -chmod g+w /user/hive/warehouse /tmp
  • bin/hadoop fs -mkdir /data
The "/data" directory will hold a file which represents the output of a map-reduce job. For the purposes of this tutorial, we will use a sample output of the canonical "Word Count" map-reduce job on some text. The file consists of two columns separated by a tab character, where the left column is the word, and the right column is the total count associated with that word in the original document.

I've uploaded such a sample output here. Download it and upload it to the HDFS data directory:
  • bin/hadoop fs -put output.txt /data
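To verify that the upload succeeded, we can list the directory and inspect the file contents (the exact output will depend on your file):
  • bin/hadoop fs -ls /data
  • bin/hadoop fs -cat /data/output.txt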
2) Install and configure Apache Hive

Now we will install and configure Apache Hive. Download and extract Apache Hive (2.1.1 was used for the purposes of this tutorial). Set the "HADOOP_HOME" environment variable to point to the Apache Hadoop installation directory above. Next, configure the metastore and start HiveServer2:
  • bin/schematool -dbType derby -initSchema
  • bin/hiveserver2
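If HADOOP_HOME is not already set, it can be exported before running the commands above, where the path below is just a placeholder for your own installation directory:
  • export HADOOP_HOME=/path/to/hadoop
Note also that, with the default configuration, schematool creates the embedded Derby metastore in a 'metastore_db' directory under the current working directory, so hiveserver2 should be started from the same directory.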
In a separate window, we will start beeline to connect to HiveServer2, where $user is the user who is running Hadoop (this is necessary as we are going to create some data in HDFS, and would not otherwise have the correct permissions):
  • bin/beeline -u jdbc:hive2://localhost:10000 -n $user
Once we are connected, create a Hive table called "words" and load the map-reduce output data into it:
  • CREATE TABLE words (word STRING, count INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
  • LOAD DATA INPATH '/data/output.txt' INTO TABLE words;
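Note that "LOAD DATA INPATH" moves (rather than copies) the file from '/data/output.txt' into Hive's warehouse directory. As a quick sanity check that the data was loaded, count the rows in the new table:
  • select count(*) from words;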
Now we can run some queries on the data as the anonymous user. Log out of beeline, then log back in without specifying a user (so that we connect as "anonymous"), and run some queries via:
  • bin/beeline -u jdbc:hive2://localhost:10000
  • select * from words where word == 'Dare';
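If all is well, the query above returns the row matching 'Dare'. Other queries can be run in the same way, for example to see the ten most frequent words (the backquotes around the column name avoid any clash with the built-in count function):
  • select * from words order by `count` desc limit 10;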
