Testing the Oracle SQL Connector for Hadoop HDFS

Testing the Oracle SQL Connector for Hadoop HDFS

By David WORMS

Jul 15, 2013

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL. For an Oracle user, HDFS files and the Hive tables are hidden behind external tables.

This article describes how to install and use the Oracle SQL Connector for Hadoop. I am only covering the integration with HDFS. Another article describes how to further configure the SQL connector to integrate with Hive.

Minimum Requirements

We assume that we have a Hadoop cluster as well as another server with Oracle already installed. The supported Hadoop distributions are Cloudera CDH3 and CDH4 or Apache Hadoop 1.0 (formerly 0.20.2). The same version of Hadoop must be installed on the Hadoop cluster and the Oracle server. Also, the minimal Oracle Database release must be 11g release 2 (11.2.0.2 or 11.2.0.3) for Linux. In our case, all the servers are running CentOs 6.4. with CDH4 and Oracle 11.2.0.

Make sure to disable SeLinux both for Hadoop and for Oracle. Using “sed”, the command sed -i.bck 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux will make a backup and update the configuration in place. This change will be applied after the next reboot.

Prepare the Oracle database

Database creation

The Oracle system must already be installed on your system.

We will create a database “BIG”. The following settings are applyable to a development node on a virtual machine. Using the graphical Oracle client: select the “Data Warehouse” model; the global database name and SID are both “BIG”; deactivate Oracle Enterprise Manager; use a database password; the storage location is on the filesytem; we don’t activate archiving; set the memory SGA size as 2048 Mo and the memory PGA size as 1024 Mo; leave the rest empty. We save the database model with name BIG and we generate the database creation script inside $ORACLE_BASE/admin/BIG/scripts.

Environment

To facilitate the connection with the sqlpluscli client, set the ORACLE_SIDto the Oracle instance ID (SID).

echo "export ORACLE_SID=BIG" >> ~/.bash_profile
. ~/.bash_profile
env | grep ORA
ORACLE_SID=BIG
ORACLE_BASE=/home/oracle/app/oracle
ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1

Also, you PATHvariable should reflect the Oracle installation export PATH=$PATH:$ORACLEHOME/bin:$ORACLEHOME/OPatch

The values may differ based on your Oracle installation. In my installation, all the variables are defined inside ”~/.bash_profile”.

User creation

The database should be started. If it isn’t, you could refer to the “Starting the Database” section below.

You can now login within the Oracle client with privelege sysdba(user sys) without a password and create a user “BIG” with the “XXXXXX” password and the roles “CONNECT” and “RESSOURCE”.

sqlplus "/ as sysdba"

SQL*Plus: Release 11.2.0.3.0 Production on Fri May 17 12:07:56 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to :
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options

SQL> show user;
USER is "SYS"

SQL> CREATE USER "BIG" IDENTIFIED BY "toto123" PROFILE "DEFAULT" DEFAULT TABLESPACE "USERS" TEMPORARY TABLESPACE "TEMP";
User created.

SQL> GRANT "CONNECT", "RESOURCE" to "BIG";
Authorization for privilege (GRANT) accepted.

Alternatively, we could also connect to the database with a remote Tora client. Using “instant client”, set the username as “system” and leave the schema empty to login.

Here, we have granted all privileges to our user. You should be more restrictive. Oracle requires the haddop user “CREATE SESSION”, “CREATE TABLE”, “EXECUTE” on the UTLFILE PL/SQL package as well as “READ” and “EXECUTE” on the “OSCHBINPATH” directory created during the installation of [Oracle SQL Connector](http://docs.oracle.com/cd/E3723101/doc.20/e36961/sqlch.htm) for HDFS.

SQL Connector installation

From the offication documentation:

Oracle SQL Connector for HDFS is installed and configured on the system where Oracle Database runs. It can also run on the system where Oracle Database runs. If Hive tables are used as data sources, then Oracle SQL Connector for HDFS must also be installed and running on the system where Hive is installed.

You must install Hadoop on the Oracle Database system and minimally configure it for Hadoop client use only. The installation procedure is described below.

Install Hadoop

This step must only be executed on your Oracle server. Only the Hadoop client is required on this server and the configuration must match the one of your Hadoop cluster.

One easy way is to use the yum repository. In this example we deploy the Cloudera repository but Hortonworks works the same way.

# Yum repository
cat > /etc/yum.repos.d/cloudera-cdh4.repo <<REPO
[fusion_builder_container hundred_percent="yes" overflow="visible"][fusion_builder_row][fusion_builder_column type="1_1" background_position="left top" background_color="" border_size="" border_color="" border_style="solid" spacing="yes" background_image="" background_repeat="no-repeat" padding="" margin_top="0px" margin_bottom="0px" class="" id="" animation_type="" animation_speed="0.3" animation_direction="left" hide_on_mobile="no" center_content="no" min_height="none"][cloudera-cdh4]
name = Cloudera CDH, Version 4
baseurl = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
REPO
yum update

Now that the repository is available, we install the Hadoop client package and apply your cluster configuration. Note, the “mapred-site.xml” and “yarn-site.xml” configuration files are relevant and we need to import them.

yum install hadoop-client
# Import your Hadoop configuration
node=a_hadoop_hostname
scp root@${node}:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml
scp root@${node}:/etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml

Ensure that the Oracle Database has access to HDFS. Run the following command using your Oracle Database account and you should see the directory listing of “/user”.

hadoop fs -ls /user

Download and install the SQL connector

The step must be run on your Oracle server and, if using Hive, on your Hadoop cluster nodes as well.

Download must be done manually since it requires acception the OTN license agreement. Place the resulting file “oraosch-2.1.0.zip” into “/tmp/oraosch-2.1.0.zip”.

cd /tmp
unzip oraosch-2.1.0.zip
unzip orahdfs-2.1.0.zip
mv orahdfs-2.1.0 /usr/lib/

Configure the SQL connector on your Oracle server

Edit the “./bin/hdfsstream” file inside your “orahdfs” folder and set the `OSCHHOME`variable with the correct path.

Update your environment variables (vi ~/.bash_profile):

export OSCH_HOME=/usr/lib/orahdfs-2.1.0
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"

Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In our example:

SQL> CREATE OR REPLACE DIRECTORY osch_bin_path AS '/home/oracle/orahdfs-2.1.0/bin';
SQL> GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO big;

Configure the SQL connector on your Hadoop cluster

Contrary to what the official documentation say, you shouldn’t only perform this step if you plan on using Hive. You should also do it if you wish to declare new external tables from the Hadoop cluster (the “ExternalTable” command described below).

Creating a table referencing an HDFS file

First, we need to create an external directory grant access to the our user.

SQL> CREATE OR REPLACE DIRECTORY ext_hdfs_test_dir AS '/home/oracle/ext_hdfs_test_dir';
SQL> GRANT READ, WRITE ON DIRECTORY ext_hdfs_test_dir TO big;

Now we can declare the Hadoop data file as a table.

hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hdfs.xml -createTable

Where the content of the “hdfs.xml” file is:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//100.100.100.52:1521/BIG</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>BIG</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>text</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>BIG.EXT_HDFS_TEST</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>ext_hdfs_test_dir</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/user/big/oracle/hdfs.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>ID_MI,ID_MT,HOST,TS,VAL_MOYENNE,VAL_MIN,VAL_MAX,FIABILITY</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>,</value>
  </property>
</configuration>

You can test the connection with the following query.

SQL> SELECT COUNT(*) FROM big.ext_hive_test;

Additional notes

Starting the Database

lsnrctl start
sqlplus "/ as sysdba"
...
SQL> startup

Disable Oracle Database Vault

Database vault needs to be disabled or you will encounter this error: “KUP-04094: preprocessing cannot be performed if Database Vault is installed”. The procedure to disable this setting is described below.

  1. Check if Oracle Database Vault Is Enabled or Disabled:
sqlplus sys as sysdba
Enter password: password

SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
TRUE
  1. Stop all Oracle services
SQL> SHUTDOWN IMMEDIATE
SQL> EXIT

If you have installed the enterprise manager database control, then:

$ emctl stop dbconsole
$ lsnrctl stop [listener_name]
  1. Relink:
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dv_off ioracle
  1. Restart Oracle services:
lsnrctl start
sqlplus "/ as sysdba"

SQL> startup
SQL> SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
FALSE

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.