In the past few days, I’ve published three related articles: a first one covering the options for integrating Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with HDFS, and a third one explaining how to install and use the Oracle SQL Connector with Hive. Those last two articles raised a few questions, which I have compiled into the notes below.

The documentation says:

The Oracle external table is not a “live” Hive table. When changes are made to a Hive table, you must use the ExternalTable tool to either republish the data or create a new external table.

I wasn’t sure how to interpret this, particularly the parts “changes are made to a Hive table” and “republish the data”. Does “Hive table” refer to the data or to the schema?

If it refers to the data, it could imply that the data is transferred and cached on the Oracle server; after all, isn’t that what “republish the data” seems to say? That would be very strange, much more like an import tool: it would have the advantage of greater query speed and the disadvantage of requiring a command to be run every time the data source is updated. But wait, isn’t that the job of the Oracle Loader for Hadoop?

If it refers to the schema, it could mean that the Oracle database (partially) queries the Hive metastore on each request. But then, why not use the metastore information to get the file location?

Create and publish

Just a quick reminder: with the Oracle SQL Connector, you integrate an HDFS file or a Hive table through the “ExternalTable” class. You use the “createTable” argument when you first declare the input source and, later, the “publish” argument to update the information.
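As a sketch, and assuming placeholder values for the connection settings, directory and table names (adapt every -D property to your environment), the creation step for a Hive source looks something like this:

# Sketch only: all -D values are placeholders; the Hive libraries must also be on the Hadoop classpath
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
    oracle.hadoop.exttab.ExternalTable \
    -D oracle.hadoop.exttab.sourceType=hive \
    -D oracle.hadoop.exttab.hive.databaseName=my_hive_db \
    -D oracle.hadoop.exttab.hive.tableName=my_hive_table \
    -D oracle.hadoop.exttab.tableName=MY_EXT_TABLE \
    -D oracle.hadoop.exttab.defaultDirectory=MY_EXT_DIR \
    -D oracle.hadoop.exttab.connection.url=jdbc:oracle:thin:@//dbhost:1521/orcl \
    -D oracle.hadoop.exttab.connection.user=MY_USER \
    -createTable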

Is the data cached/copied somewhere else

In my first test, I ran two count queries on the same Hive table consisting of one file. The only difference between the two queries is that the dataset contained a different number of records at the time of each run. The second query reflects the correct count, meaning that the data source is read on each query.
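For reference, the count on the Oracle side is just a regular query against the external table; something along these lines, where the credentials and the external table name are placeholders:

# Placeholder credentials and table name; the external table is queried like any other Oracle table
echo "SELECT COUNT(*) FROM my_ext_table;" | sqlplus -s my_user/my_password@//dbhost:1521/orcl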

How is Oracle using the Hive metastore

In my second test, I ran two count queries on the same Hive table, again consisting of one file. The data source file is the same; I only renamed it from “000000_0” to “000000_1”. The result is very interesting: instead of a count, the second query returns an error.

Of course, such a query is perfectly valid in Hive: Hive doesn’t store file names in its metastore, only the folder in which they are stored. Running the command hive -e "use oracle; select count(*) from ext_testdataload_test;" will print the correct count.

So it turns out that Oracle uses Hive for the schema definition but not to determine the data location.

Where is Oracle storing the file information

With some digging, you should find a file inside your external directory named something like “osch-20130527011437-5756-1”, with content similar to the one below:
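The exact listing depends on your cluster; schematically, it is a small XML document whose uri_list holds one uri_list_item per HDFS file, along these lines (host, path, size and dates are placeholders):

<?xml version="1.0"?>
<locationFile>
  <header>
    <version>1.0</version>
    <fileName>osch-20130527011437-5756-1</fileName>
    <createDate>2013-05-27T01:14:37</createDate>
    <publishDate>2013-05-27T01:14:37</publishDate>
  </header>
  <uri_list>
    <uri_list_item size="1024" compressionCodec="">
      hdfs://namenode:8020/user/hive/warehouse/my_hive_db.db/my_hive_table/000000_0
    </uri_list_item>
  </uri_list>
</locationFile>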

If you have played along with my second test, you can manually modify the uri_list_item value to reflect the new HDFS file location, and the count query will return the correct count again.

This file doesn’t store any information about the schema. Instead, the schema is fetched on every query and dumped into the two generated “*.log” files (whose contents are identical!).

This is awkward: on each request, information is fetched from the metastore, yet Oracle doesn’t take advantage of the fact that the metastore also stores the file location. Instead, it forces you to run the “publish” command every time the data is updated.

Conclusion

In a real-life workflow, the Hive tables you would like to publish to Oracle are probably the result of long and complex MapReduce programs. Those programs are scheduled to update the content of the final Hive tables by overwriting the original dataset or by adding new files or partitions.

The first test shows that no further operation is needed as long as the HDFS files keep the same names. However, you cannot rely on this behavior: the second test suggests that you should always republish the table information to Oracle.

Using the “ExternalTable” tool, change the “-createTable” argument to “-publish”. The final command looks like:
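(Again, every property value below is the same placeholder as in the creation sketch above.)

# Sketch only: re-run the tool with -publish to refresh the location files
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
    oracle.hadoop.exttab.ExternalTable \
    -D oracle.hadoop.exttab.sourceType=hive \
    -D oracle.hadoop.exttab.hive.databaseName=my_hive_db \
    -D oracle.hadoop.exttab.hive.tableName=my_hive_table \
    -D oracle.hadoop.exttab.tableName=MY_EXT_TABLE \
    -D oracle.hadoop.exttab.connection.url=jdbc:oracle:thin:@//dbhost:1521/orcl \
    -D oracle.hadoop.exttab.connection.user=MY_USER \
    -publish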