Oracle and Hive, how data are published?
In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with HDFS and a third one explaining how to install and use the Oracle SQL Connector with Hive. Those last two articles raised some questions for which I have recompiled the notes below.
The documentation says:
The Oracle external table is not a “live” Hive table. When changes are made to a Hive table, you must use the ExternalTable tool to either republish the data or create a new external table.
I wasn’t sure how to interpret this. Particularly the part “changes are made to a Hive table” and “republish the data”. Does “Hive table” refer to the data or the schema.
With the former, it could imply that the data is transferred and cached on the Oracle server. After all, this is what “republish the data” seems to say? Really, this looks very strange, much more like an import tool. It presents the advantage of greater query speed and the disadvantage of running a command every time the data source is updated. But wait, this is the job of the Oracle Loader, isn’t it?
With the later, it could means that the Oracle database uses (partially) the Hive metastore on each request. But why not using the metastore information to get the file location?
Just a quick reminder, using the Oracle SQL Loader, you integrate an HDFS file and a Hive table with the class “ExternalTable”. You use the “createTable” argument when you first declare the input source. Later, you use the “publish” argument to update the information.
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hive.xml -createTable
In my first test, I run two count queries on the same Hive table consisting of one file. The only difference between the two queries is that the dataset contains a different amount of records. The second query reflects the correct count, meaning that the data source is read on each query.
In my second test, I run two count queries on the same Hive table consisting of one file. The data source file is the same, I’ve only renamed it from “000000_0” to “000000_1”. The result is very interesting as it prints an error. Here’s the output:
SQL> select count(*) from oracle.ext_testdataload_test; select count(*) from oracle.ext_testdataload_test * ERREUR a la ligne 1 : ORA-12801: erreur signalee dans le serveur de requete parallele P000 ORA-29913: erreur d'execution de l'appel ORA-29400: erreur de cartouche de donnees KUP-04095: la commande de preprocesseur /home/oracle/orahdfs-2.1.0/bin/hdfs_stream a rencontre une erreur "OSCH: Error processing URI hdfs://myhost:8020/user/oracle/db/testdataload/000000_0 File does not exist: /user/oracle/db/testdataload/000000_0 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes (FSNamesystem.java:1301) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNames ystem.java:1254) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesyst em.java:1227) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesyst em.java:1209) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameN odeRpcServer.java:393) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.g etBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
Of course, such a query is perfectly valid in Hive. Hive doesn’t store the filename in its metastore, only the folder in which they are stored. Running the command
hive -e "use oracle; select count(*) from ext_testdataload_test;" will print the correct count.
So it turns out that Oracle uses Hive for the schema definition but not to determine the data location.
With some digging, you should find a file inside your external directory named like “osch-20130527011437-5756-1” with content similar to the one below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <locationFile> <header> <version>1.0</version> <fileName>osch-20130527011437-5756-1</fileName> <createDate>2013-05-24T22:18:22</createDate> <publishDate>2013-05-27T01:14:37</publishDate> <productName>Oracle SQL Connector for HDFS Release 2.1.0 - Production</productName> <productVersion>2.1.0</productVersion> </header> <uri_list> <uri_list_item size="613539" compressionCodec="">hdfs://noe003fn.noe.edf.fr:8020/user/big/omnivision/db/hiveload/000000_0</uri_list_item> </uri_list> </locationFile>
If you have played with my second test, you can modify the
uri_list_item property manually to reflect your HDFS file location and the count query will return the correct count.
This file doesn’t store any information about the schema. Instead, the schema is fetched on every query and dump into the 2 generated “*.log” files (whose contents are the same!).
This is awkward, on each request, information is fetched from the metastore but Oracle doesn’t take advantage of the metastore storing the file location. Instead, it forces you to run the “publish” command each time data is updated.
In a real life workflow, the Hive tables you would like to publish to Oracle are probably the result of long and complex MapReduce programs. Those programs are scheduled to update the content of the final Hive tables by overwriting the original dataset or by adding new files or partitions.
The first test show that no further operation is needed if the HDFS file is named the same. However, you can not rely on this behavior. The second test suggests that you should always republish the table information to Oracle.
Using the “ExternalTable” command, change the argument “-createTable” to “-publish”. The final command looks like:
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hive.xml -publish