Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or Hive tables. You can also query and join data in HDFS or Hive with other database-resident data and, if required, load that data into the database using SQL. For an Oracle user, HDFS files and Hive tables are hidden behind external tables.

This article describes how to install and use Oracle SQL Connector for HDFS. I only cover the integration with HDFS; another article describes how to further configure the SQL connector to integrate with Hive.

Minimum Requirements

We assume that we have a Hadoop cluster as well as another server with Oracle Database already installed. The supported Hadoop distributions are Cloudera CDH3 and CDH4 or Apache Hadoop 1.0 (formerly 0.20.2). The same version of Hadoop must be installed on the Hadoop cluster and on the Oracle server. Also, the minimum Oracle Database release is 11g release 2 (11.2.0.2 or 11.2.0.3) for Linux. In our case, all the servers run CentOS 6.4 with CDH4 and Oracle 11.2.0.

Make sure to disable SELinux both on the Hadoop nodes and on the Oracle server. Using "sed", the command sed -i.bck 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux will make a backup and update the configuration in place. This change takes effect after the next reboot.

Prepare the Oracle database

Database creation

The Oracle system must already be installed on your system.

We will create a database "BIG". The following settings are applicable to a development node on a virtual machine. Using the graphical Oracle client: select the "Data Warehouse" model; the global database name and SID are both "BIG"; deactivate Oracle Enterprise Manager; use a database password; the storage location is on the filesystem; we don't activate archiving; set the SGA memory size to 2048 MB and the PGA memory size to 1024 MB; leave the rest empty. We save the database model under the name BIG and generate the database creation scripts inside $ORACLE_BASE/admin/BIG/scripts.
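For reference, a roughly equivalent non-interactive creation with "dbca" in silent mode could look like the sketch below; the template name, passwords, memory size and datafile destination are assumptions to adapt to your environment.

    # Non-interactive equivalent of the DBCA wizard (values are examples)
    dbca -silent -createDatabase \
      -templateName Data_Warehouse.dbc \
      -gdbName BIG -sid BIG \
      -sysPassword XXXXXX -systemPassword XXXXXX \
      -emConfiguration NONE \
      -totalMemory 3072 \
      -datafileDestination $ORACLE_BASE/oradata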

Environment

To facilitate the connection with the sqlplus CLI client, set the ORACLE_SID environment variable to the Oracle instance ID (SID).

Also, your PATH variable should reflect the Oracle installation: export PATH=$PATH:$ORACLE_HOME/bin:$ORACLE_HOME/OPatch

The values may differ based on your Oracle installation. In my installation, all the variables are defined inside “~/.bash_profile”.
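A minimal "~/.bash_profile" could look like the sketch below; the ORACLE_BASE and ORACLE_HOME paths are assumptions taken from a typical single-instance 11.2 installation.

    # Oracle environment (paths are examples, adjust to your installation)
    export ORACLE_BASE=/u01/app/oracle
    export ORACLE_HOME=$ORACLE_BASE/product/11.2.0/dbhome_1
    export ORACLE_SID=BIG
    export PATH=$PATH:$ORACLE_HOME/bin:$ORACLE_HOME/OPatch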

User creation

The database should be started. If it isn't, refer to the "Starting the Database" section below.

You can now log in with the Oracle client as sysdba (user sys) without a password and create a user "BIG" with the "XXXXXX" password and the "CONNECT" and "RESOURCE" roles.
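For example, from the Oracle server (the password is the one used throughout this article, adapt as needed):

    # Create the BIG user with the CONNECT and RESOURCE roles
    sqlplus / as sysdba <<'EOF'
    CREATE USER BIG IDENTIFIED BY XXXXXX;
    GRANT CONNECT, RESOURCE TO BIG;
    EOF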

Alternatively, we could also connect to the database with a remote TOra client. Using Instant Client, set the username to "system" and leave the schema empty to log in.

Here, we have granted all privileges to our user; you should be more restrictive. Oracle requires that the Hadoop user be granted "CREATE SESSION", "CREATE TABLE", "EXECUTE" on the UTL_FILE PL/SQL package, as well as "READ" and "EXECUTE" on the "OSCH_BIN_PATH" directory created during the installation of Oracle SQL Connector for HDFS.
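A more restrictive set of grants, matching the requirements listed above, could look like this (the OSCH_BIN_PATH directory object is created later, in the "Configure the SQL connector on your Oracle server" section):

    # Minimal privileges for the user accessing HDFS data
    sqlplus / as sysdba <<'EOF'
    GRANT CREATE SESSION TO BIG;
    GRANT CREATE TABLE TO BIG;
    GRANT EXECUTE ON UTL_FILE TO BIG;
    GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO BIG;
    EOF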

SQL Connector installation

From the official documentation:

Oracle SQL Connector for HDFS is installed and configured on the system where Oracle Database runs. If Hive tables are used as data sources, then Oracle SQL Connector for HDFS must also be installed and running on the system where Hive is installed.

You must install Hadoop on the Oracle Database system and minimally configure it for Hadoop client use only. The installation procedure is described below.

Install Hadoop

This step only needs to be executed on your Oracle server. Only the Hadoop client is required on this server, and its configuration must match that of your Hadoop cluster.

One easy way is to use the yum repository. In this example we deploy the Cloudera repository, but Hortonworks works the same way.
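For CDH4 on CentOS 6, registering the repository could look like this; the URL below points at Cloudera's published CDH4 repository definition and should be verified for your distribution and version.

    # Register the Cloudera CDH4 yum repository (URL to verify for your version)
    sudo curl -o /etc/yum.repos.d/cloudera-cdh4.repo \
      http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo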

Now that the repository is available, we install the hadoop-client package and apply your cluster configuration. Note that the "mapred-site.xml" and "yarn-site.xml" configuration files are also relevant and need to be imported.
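A sketch of this step; the NameNode hostname and the configuration location are assumptions:

    # Install the Hadoop client packages
    sudo yum install -y hadoop-client
    # Import the cluster configuration from one of its nodes (hostname is an example)
    sudo scp namenode:/etc/hadoop/conf/*.xml /etc/hadoop/conf/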

Ensure that the Oracle Database system has access to HDFS. Run the following command as your Oracle Database account; you should see the directory listing of "/user".
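For example:

    # Run as the account that owns the Oracle installation (e.g. oracle)
    hadoop fs -ls /user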

Download and install the SQL connector

This step must be run on your Oracle server and, if using Hive, on your Hadoop cluster nodes as well.

The download must be done manually since it requires accepting the OTN license agreement. Place the resulting file "oraosch-2.1.0.zip" into "/tmp/oraosch-2.1.0.zip".
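The archive can then be extracted on the Oracle server; the target directory below is an assumption, and the outer zip is expected to contain an "orahdfs" archive:

    # Extract the connector (target path is an example)
    cd /tmp && unzip oraosch-2.1.0.zip
    sudo unzip /tmp/orahdfs-2.1.0.zip -d /usr/share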

Configure the SQL connector on your Oracle server

Edit the "./bin/hdfs_stream" file inside your "orahdfs" folder and set the OSCH_HOME variable to the correct path.
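For example, assuming the connector was extracted under "/usr/share", the setting could look like this:

    # In /usr/share/orahdfs-2.1.0/bin/hdfs_stream (path is an example)
    export OSCH_HOME=/usr/share/orahdfs-2.1.0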

Update your environment variables (vi ~/.bash_profile):
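The following is a sketch; the installation path is an assumption, and HADOOP_CLASSPATH must include the connector's "jlib" directory so that the "ExternalTable" tool can be run.

    # Oracle SQL Connector for HDFS environment (path is an example)
    export OSCH_HOME=/usr/share/orahdfs-2.1.0
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*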

Create a database directory object for the orahdfs-<version>/bin directory where hdfs_stream resides. In our example:
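Assuming the connector lives under "/usr/share/orahdfs-2.1.0":

    # Create the OSCH_BIN_PATH directory object and let our user use it
    sqlplus / as sysdba <<'EOF'
    CREATE OR REPLACE DIRECTORY OSCH_BIN_PATH AS '/usr/share/orahdfs-2.1.0/bin';
    GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO BIG;
    EOF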

Configure the SQL connector on your Hadoop cluster

Contrary to what the official documentation says, this step is not only required if you plan to use Hive: you should also perform it if you wish to declare new external tables from the Hadoop cluster (with the "ExternalTable" command described below).
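On each node from which you intend to run the "ExternalTable" command, a setup similar to the one on the Oracle server could be used; the paths below are assumptions:

    # On a Hadoop cluster node (paths are examples)
    sudo unzip /tmp/orahdfs-2.1.0.zip -d /usr/share
    export OSCH_HOME=/usr/share/orahdfs-2.1.0
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*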

Creating a table referencing an HDFS file

First, we need to create an external directory and grant our user access to it.
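For example, using a hypothetical directory name and filesystem path:

    # Directory that will hold the external table location files
    mkdir -p /home/oracle/big_dir
    sqlplus / as sysdba <<'EOF'
    CREATE OR REPLACE DIRECTORY BIG_DIR AS '/home/oracle/big_dir';
    GRANT READ, WRITE ON DIRECTORY BIG_DIR TO BIG;
    EOF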

Now we can declare the Hadoop data file as a table.
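This is done with the "ExternalTable" tool shipped in the connector's "orahdfs.jar"; the configuration file path below matches the "hdfs.xml" file shown next.

    # Create the Oracle external table describing the HDFS data
    hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
      oracle.hadoop.exttab.ExternalTable \
      -conf /tmp/hdfs.xml \
      -createTable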

Where the content of the “hdfs.xml” file is:
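The following is a minimal sketch: the table name, column count, HDFS path and connection settings are placeholders to adapt, while the property names are those used by Oracle SQL Connector for HDFS.

    <?xml version="1.0"?>
    <configuration>
      <!-- Name of the external table to create in the BIG schema (example name) -->
      <property>
        <name>oracle.hadoop.exttab.tableName</name>
        <value>BIG_HDFS_EXT</value>
      </property>
      <!-- HDFS files holding the data (path is an example) -->
      <property>
        <name>oracle.hadoop.exttab.dataPaths</name>
        <value>/user/big/data/*.csv</value>
      </property>
      <!-- Delimited text source with, here, 3 columns -->
      <property>
        <name>oracle.hadoop.exttab.sourceType</name>
        <value>text</value>
      </property>
      <property>
        <name>oracle.hadoop.exttab.columnCount</name>
        <value>3</value>
      </property>
      <!-- Oracle directory object created above -->
      <property>
        <name>oracle.hadoop.exttab.defaultDirectory</name>
        <value>BIG_DIR</value>
      </property>
      <!-- Database connection (host and service name are examples) -->
      <property>
        <name>oracle.hadoop.connection.url</name>
        <value>jdbc:oracle:thin:@//oracle-server:1521/BIG</value>
      </property>
      <property>
        <name>oracle.hadoop.connection.user</name>
        <value>BIG</value>
      </property>
    </configuration>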

You can test the connection with the following query.
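For example, using the hypothetical table name from the "hdfs.xml" above:

    # Query the external table backed by HDFS data
    sqlplus BIG/XXXXXX <<'EOF'
    SELECT COUNT(*) FROM BIG_HDFS_EXT;
    EOF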

Additional notes

Starting the Database
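A minimal sketch to start the listener and the instance:

    # Start the listener and the BIG instance
    lsnrctl start
    export ORACLE_SID=BIG
    sqlplus / as sysdba <<'EOF'
    STARTUP
    EOF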

Disable Oracle Database Vault

Database Vault needs to be disabled or you will encounter this error: "KUP-04094: preprocessing cannot be performed if Database Vault is installed". The procedure to disable it is described below.

  1. Check if Oracle Database Vault Is Enabled or Disabled:
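A query such as the following shows whether the option is enabled ("TRUE") or disabled:

    sqlplus / as sysdba <<'EOF'
    SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
    EOF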

  2. Stop all Oracle services
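For example:

    # Stop the database and the listener
    sqlplus / as sysdba <<'EOF'
    SHUTDOWN IMMEDIATE
    EOF
    lsnrctl stop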

If you have installed Enterprise Manager Database Control, then also stop it:
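For example:

    # Stop Database Control
    emctl stop dbconsole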

  3. Relink the Oracle binary with Database Vault turned off:
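On Linux this is done with the "make" targets below, run as the Oracle software owner.

    # Turn the Database Vault option off and relink the oracle binary
    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk dv_off ioracle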

  4. Restart Oracle services:
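For example:

    # Restart the listener, the database and, if used, Database Control
    lsnrctl start
    sqlplus / as sysdba <<'EOF'
    STARTUP
    EOF
    emctl start dbconsole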