In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with HDFS and a third one explaining how to install and use the Oracle SQL Connector with Hive. Those last two articles raised some questions for which I have recompiled the notes below.
The documentation says:
The Oracle external table is not a “live” Hive table. When changes are made to a Hive table, you must use the ExternalTable tool to either republish the data or create a new external table.
I wasn’t sure how to interpret this. Particularly the part “changes are mode to a Hive table” and “republish the data”. Does “Hive table” refer to the data or the schema.
With the former, it could imply that the data is transferred and cached on the Oracle server. After all, this is what “republish the data” seems to say? Really, this looks very strange, much more like an import tool. It presents the advantage of greater query speed and the disadvantage of running a command every time the data source is updated. But wait, this is the job of the Oracle Loader, isn’t it?
With the later, it could means that the Oracle database uses (partially) the Hive metastore on each request. But why not using the metastore information to get the file location ?
Create and publish
Just a quick reminder, using the Oracle SQL Loader, you integrate an HDFS file and a Hive table with the class “ExternalTable”. You use the “createTable” argument when you first declare the input source. Later, you use the “publish” argument to update the information.
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hive.xml -createTable
Is the data cached/copy somewhere else
In my first test, I run two count queries on the same Hive table consisting of one file. The only difference between the two queries is that the dataset contains a different amount of records. The second query reflects the correct count, meaning that the data source is read on each query.
How is Oracle using the Hive metastore
In my second test, I run two count queries on the same Hive table consisting of one file. The data source file is the same, I’ve only renamed it from “000000_0” to “000000_1”. The result is very interesting as it prints an error. Here’s the output:
SQL> select count(*) from oracle.ext_testdataload_test;
select count(*) from oracle.ext_testdataload_test
ERREUR a la ligne 1 :
ORA-12801: erreur signalee dans le serveur de requete parallele P000
ORA-29913: erreur d'execution de l'appel
ORA-29400: erreur de cartouche de donnees
KUP-04095: la commande de preprocesseur
/home/oracle/orahdfs-2.1.0/bin/hdfs_stream a rencontre une erreur "OSCH: Error
File does not exist: /user/oracle/db/testdataload/000000_0
Of course, such a query is perfectly valid in Hive. Hive doesn’t store the filename in its metastore, only the folder in which they are stored. Running the command
hive -e "use oracle; select count(*) from ext_testdataload_test;" will print the correct count.
So it turns out that Oracle uses Hive for the schema definition but not to determine the data location.
Where is Oracle storing the file information
With some digging, you should find a file inside your external directory named like “osch-20130527011437-5756-1” with content similar to the one below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<productName>Oracle SQL Connector for HDFS Release 2.1.0 - Production</productName>
<uri_list_item size="613539" compressionCodec="">hdfs://noe003fn.noe.edf.fr:8020/user/big/omnivision/db/hiveload/000000_0</uri_list_item>
If you have played with my second test, you can modify the
uri_list_item property manually to reflect your HDFS file location and the count query will return the correct count.
This file doesn’t store any information about the schema. Instead, the schema is fetched on every query and dump into the 2 generated “*.log” files (whose contents are the same!).
This is awkward, on each request, information is fetched from the metastore but Oracle doesn’t take advantage of the metastore storing the file location. Instead, it forces you to run the “publish” command each time data is updated.
In a real life workflow, the Hive tables you would like to publish to Oracle are probably the result of long and complex MapReduce programs. Those programs are scheduled to update the content of the final Hive tables by overwriting the original dataset or by adding new files or partitions.
The first test show that no further operation is needed if the HDFS file is named the same. However, you can not rely on this behavior. The second test suggests that you should always republish the table information to Oracle.
Using the “ExternalTable” command, change the argument “-createTable” to “-publish”. The final command looks like:
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hive.xml -publish