Adaltas

Testing the Oracle SQL Connector for Hadoop HDFS

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL. For an Oracle user, HDFS files and the Hive tables are hidden behind external tables.

This article describes how to install and use the Oracle SQL Connector for Hadoop. I am only covering the integration with HDFS. Another article describes how to further configure the SQL connector to integrate with Hive.

Minimum Requirements

We assume that we have a Hadoop cluster as well as another server with Oracle already installed. The supported Hadoop distributions are Cloudera CDH3 and CDH4 or Apache Hadoop 1.0 (formerly 0.20.2). The same version of Hadoop must be installed on the Hadoop cluster and the Oracle server. Also, the minimal Oracle Database release must be 11g release 2 (11.2.0.2 or 11.2.0.3) for Linux. In our case, all the servers are running CentOs 6.4. with CDH4 and Oracle 11.2.0.

Make sure to disable SeLinux both for Hadoop and for Oracle. Using “sed”, the command sed -i.bck 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux will make a backup and update the configuration in place. This change will be apply after the next reboot.

Prepare the Oracle database

Database creation

The Oracle system must already be installed on your system.

We will create a database “BIG”. The following settings are applyable to a development node on a virtual machine. Using the graphical Oracle client: select the “Data Warehouse” model; the global database name and SID are both “BIG”; desactivate Oracle Enterprise Manager; use a database password; the storage location is on the filesytem; we don’t activate archiving; set the memory SGA size as 2048 Mo and the memory PGA size as 1024 Mo; leave the rest empty. We save the database model with name BIG and we generate the database creation script inside $ORACLE_BASE/admin/BIG/scripts.

Environment

To facilitate the connection with the sqlplus cli client, set the ORACLE_SID to the Oracle instance ID (SID).

1
2
echo "export ORACLE_SID=BIG" >> ~/.bash_profile
. ~/.bash_profile
1
2
3
4
env | grep ORA
ORACLE_SID=BIG
ORACLE_BASE=/home/oracle/app/oracle
ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1

Also, you PATH variable should reflect the Oracle installation export PATH=$PATH:$ORACLE_HOME/bin:$ORACLE_HOME/OPatch

The values may differ based on your Oracle installation. In my installation, all the variables are defined inside “~/.bash_profile”.

User creation

The database should be started. If it isn’t, you could refer to the “Starting the Database” section below.

You can now login within the Oracle client with privelege sysdba (user sys) without a password and create a user “BIG” with the “XXXXXX” password and the roles “CONNECT” and “RESSOURCE”.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
sqlplus "/ as sysdba"

SQL*Plus: Release 11.2.0.3.0 Production on Fri May 17 12:07:56 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to :
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options

SQL> show user;
USER is "SYS"

SQL> CREATE USER "BIG" IDENTIFIED BY "toto123" PROFILE "DEFAULT" DEFAULT TABLESPACE "USERS" TEMPORARY TABLESPACE "TEMP";
User created.

SQL> GRANT "CONNECT", "RESOURCE" to "BIG";
Authorization for privilege (GRANT) accepted.

Alternatively, we could also connect to the database with a remote Tora client. Using “instant client”, set the username as “system” and leave the schema empty to login.

Here, we have granted all privileges to our user. You should be more restrictive. Oracle require the haddop user “CREATE SESSION”, “CREATE TABLE”, “EXECUTE” on the UTL_FILE PL/SQL package as well as “READ” and “EXECUTE” on the “OSCH_BIN_PATH” directory created during the installation of Oracle SQL Connector for HDFS.

SQL Connector installation

From the offication documentation:

Oracle SQL Connector for HDFS is installed and configured on the system where Oracle Database runs. It can also run on the system where Oracle Database runs. If Hive tables are used as data sources, then Oracle SQL Connector for HDFS must also be installed and running on the system where Hive is installed.

You must install Hadoop on the Oracle Database system and minimally configure it for Hadoop client use only. The installation procedure is described below.

Install Hadoop

This step must only be executed on your Oracle server. Only the Hadoop client is required on this server and the configuration must match the one of your Hadoop cluster.

One easy way is to use the yum repository. In this example we deploy the Cloudera repository but Hortonworks work the same way.

1
2
3
4
5
6
7
8
9
# Yum repository
cat > /etc/yum.repos.d/cloudera-cdh4.repo <<REPO
[cloudera-cdh4]
name = Cloudera CDH, Version 4
baseurl = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
REPO
yum update

One the repository available, we install the hadoop client package and apply your cluster configuration. Note, the “mapred-site.xml” and “yarn-site.xml” configuration files are relevant and we don’t import them.

1
2
3
4
5
yum install hadoop-client
# Import your Hadoop configuration
node=a_hadoop_hostname
scp root@${node}:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml
scp root@${node}:/etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml

Ensure that the Oracle Database has access to HDFS. Run the following command using your Oracle Database account and you should see the directory listing of “/user”.

1
hadoop fs -ls /user

Download and install the SQL connector

The step must be run on your Oracle server and, if using Hive, on your Hadoop cluster nodes as well.

Download must be done manually since it requires acception the OTN license agreement. Place the resulting file “oraosch-2.1.0.zip” into “/tmp/oraosch-2.1.0.zip”.

1
2
3
4
cd /tmp
unzip oraosch-2.1.0.zip
unzip orahdfs-2.1.0.zip
mv orahdfs-2.1.0 /usr/lib/

Configure the SQL connector on your Oracle server

Edit the “./bin/hdfs_stream” file inside your “orahdfs” folder and set the OSCH_HOME variable with the correct path.

Update your environment variables (vi ~/.bash_profile):

1
2
export OSCH_HOME=/usr/lib/orahdfs-2.1.0
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"

Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In our example:

1
2
SQL> CREATE OR REPLACE DIRECTORY osch_bin_path AS '/home/oracle/orahdfs-2.1.0/bin';
SQL> GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO big;

Configure the SQL connector on your Hadoop cluster

Contrary to what the official documentation say, you shouldn’t only perform this step if you plan on using Hive. You could also do it if you wish to the declare new external table from the Hadoop cluster (the “ExternalTable” command described below).

Creating a table referencing a HDFS file

First, we need to create an external directory grant access to the our user.

1
2
SQL> CREATE OR REPLACE DIRECTORY ext_hdfs_test_dir AS '/home/oracle/ext_hdfs_test_dir';
SQL> GRANT READ, WRITE ON DIRECTORY ext_hdfs_test_dir TO big;

Now we can declare the Hadoop data file as a table.

1
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hdfs.xml -createTable

Where the content of the “hdfs.xml” file is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@//100.100.100.52:1521/BIG</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>BIG</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>text</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>BIG.EXT_HDFS_TEST</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>ext_hdfs_test_dir</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/user/big/oracle/hdfs.csv</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.columnNames</name>
    <value>ID_MI,ID_MT,HOST,TS,VAL_MOYENNE,VAL_MIN,VAL_MAX,FIABILITY</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.fieldTerminator</name>
    <value>,</value>
  </property>
</configuration>

You can test the connection with the following query.

1
SQL> SELECT COUNT(*) FROM big.ext_hive_test;

Additional notes

Starting the Database

1
2
3
4
lsnrctl start
sqlplus "/ as sysdba"
...
SQL> startup

Disable Oracle Database Vault

Database vault need to be disabled or you will encounter this error: “KUP-04094 : preprocessing cannot be performed if Database Vault is installed”. The procedure to disable this setting is described below.

  1. Check if Oracle Database Vault Is Enabled or Disabled:
1
2
3
4
5
6
7
sqlplus sys as sysdba
Enter password: password

SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
TRUE
  1. Stop all Oracle services
1
2
SQL> SHUTDOWN IMMEDIATE
SQL> EXIT

If you have installed the enterprise manager database control, then:

1
2
$ emctl stop dbconsole
$ lsnrctl stop [listener_name]
  1. Relink :
1
2
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dv_off ioracle
  1. Restart Oracle services:
1
2
3
4
5
6
7
8
lsnrctl start
sqlplus "/ as sysdba"

SQL> startup
SQL> SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
FALSE

Comments