Friday, January 26, 2018

Securing Apache Sqoop - part I

This is the first in a series of posts on how to secure Apache Sqoop. Apache Sqoop is a tool for bulk data transfer, mainly between HDFS and relational databases, but it also supports other systems such as Apache Kafka. In this post we will look at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. Subsequent posts will show how to authorize this data transfer using both Apache Ranger and Apache Sentry.

Note that we will only use Sqoop 2 (current version 1.99.7), as this is the only version that both Sentry and Ranger support. However, this version is not (yet) recommended for production deployment.

1) Set up Apache Hadoop and Apache Kafka

First we will set up Apache Hadoop and Apache Kafka. The use-case is that we want to transfer a file from HDFS (/data/LICENSE.txt) to a Kafka topic (test). Follow part (1) of an earlier tutorial I wrote about installing Apache Hadoop. A change is also required in 'etc/hadoop/core-site.xml' (in addition to the "fs.defaultFS" setting that is configured in the earlier tutorial).
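Sqoop 2 submits work on behalf of other users, so Hadoop must be configured to trust the operating-system user running the Sqoop server as a proxy user. A minimal sketch of the extra 'core-site.xml' entries, assuming that user is called "sqoop2" (substitute your own username), is:

  <!-- "sqoop2" below is a placeholder for the user that runs the Sqoop server -->
  <property>
    <name>hadoop.proxyuser.sqoop2.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.sqoop2.groups</name>
    <value>*</value>
  </property>

The "*" values let that user act from any host and on behalf of any group; a production deployment would restrict both.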

Make sure that LICENSE.txt is uploaded to the /data directory as outlined in the tutorial. Now we will set up Apache Kafka. Download Apache Kafka and extract it (1.0.0 was used for the purposes of this tutorial). Start ZooKeeper with:
  • bin/zookeeper-server-start.sh config/zookeeper.properties
then start the broker and create a "test" topic with:
  • bin/kafka-server-start.sh config/server.properties
  • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Finally let's set up a consumer for the "test" topic:
  • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties
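Before bringing Sqoop into the picture, you can sanity-check the topic from a second terminal with the console producer (in Kafka 1.0.0 the producer script takes a --broker-list argument rather than --bootstrap-server); anything you type should appear in the consumer window:
  • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test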
2) Set up Apache Sqoop

Download Apache Sqoop and extract it (1.99.7 was used for the purposes of this tutorial).

2.a) Configure + start Sqoop

Before starting Sqoop, edit 'conf/sqoop.properties' and change the following property to point instead to the Hadoop configuration directory (e.g. /path.to.hadoop/etc/hadoop):
  • org.apache.sqoop.submission.engine.mapreduce.configuration.directory
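After the edit, the line in 'conf/sqoop.properties' should look something like the following (the path shown is illustrative; use your own Hadoop configuration directory):
  • org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/path.to.hadoop/etc/hadoop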
Then configure and start Apache Sqoop with the following commands:
  • export HADOOP_HOME=path to Hadoop home
  • bin/sqoop2-tool upgrade
  • bin/sqoop2-tool verify
  • bin/sqoop2-server start (run "bin/sqoop2-server stop" later to shut the server down)
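If the verify step passes and the server starts cleanly, it should be listening on port 12000 (the Sqoop 2 default). Assuming the default port and the standard REST context path, a quick check from another terminal is:
  • curl http://localhost:12000/sqoop/version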
2.b) Configure links/job in Sqoop

Now that Sqoop has started we need to configure it to transfer data from HDFS to Kafka. Start the Shell via:
  • bin/sqoop2-shell
"show connector" lists the connectors that are available. We first need to configure a link for the HDFS connector:
  • create link -connector hdfs-connector
  • Name: HDFS
  • URI: hdfs://localhost:9000
  • Conf directory: Path to Hadoop conf directory
Similarly, for the Kafka connector:
  • create link -connector kafka-connector
  • Name: KAFKA
  • Kafka brokers: localhost:9092
  • Zookeeper quorum: localhost:2181
"show link" shows the links we've just created. Now we need to create a job from the HDFS link to the Kafka link as follows (accepting the default values if they are not specified below):
  • create job -f HDFS -t KAFKA
  • Name: testjob
  • Input Directory: /data
  • Topic: test
We can see the job we've created with "show job". Now let's start the job:
  • start job -name testjob 
You should see the content of the HDFS "/data" directory (i.e. LICENSE.txt) appear in the window of the Kafka "test" consumer, showing that Sqoop has transferred the data from HDFS to Kafka.
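While the job is running (or once it has finished) you can also check its progress from the Sqoop shell, e.g.:
  • status job -name testjob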
