Options to connect and integrate Hadoop with Oracle
I will list the different tools and libraries available to developers for integrating Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in more detail in a follow-up article.
To summarize, we have Sqoop, originally from Cloudera and now an Apache project; a Sqoop plug-in from Quest Software; and the Oracle Big Data Connectors, a family of four distinct products: Oracle Loader for Hadoop (OLH), Oracle SQL Connector for HDFS, Oracle R Connector for Hadoop and Oracle Data Integrator Application Adapter for Hadoop.
Apache Sqoop describes itself as a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. In that sense, it is by no means limited to Oracle.
I have personally used Sqoop intensively to import full databases from Oracle. Here’s my experience. Overall, it works great, but the databases I imported were never larger than 1 GB, which is relatively insignificant in terms of load testing. The mapping between Oracle and Hive is OK, with a few glitches: the RAW data type isn’t handled. To bypass this limitation, you can use the `--columns` option to filter out those columns, use the `--map-column-hive` option to force the cast to a string type, or supply your own SQL import statement to filter or transform the unhandled columns. I also remember an issue with columns padded with spaces, but I can’t recall the exact situation. Note as well that, compared to the import options offered by Sqoop, the export options are more limited and don’t come with Hive integration.
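The three workarounds above can be sketched as follows. This is a minimal sketch, not a tested recipe: the connection string, the `DOCUMENTS` table and its RAW `CHECKSUM` column are hypothetical placeholders to adapt to your environment.

```shell
# Workaround 1: skip the unhandled RAW column entirely with --columns.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table DOCUMENTS \
  --columns "ID,TITLE,CREATED_AT" \
  --hive-import

# Workaround 2: keep the column but force its Hive type to STRING.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table DOCUMENTS \
  --map-column-hive CHECKSUM=STRING \
  --hive-import

# Workaround 3: transform the column yourself in a free-form query.
# Note: --query requires the literal $CONDITIONS token and either
# --split-by or a single mapper (-m 1).
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --query 'SELECT id, title, RAWTOHEX(checksum) AS checksum FROM documents WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/hive/warehouse/documents
```

`RAWTOHEX` is a standard Oracle function, so the third variant pushes the conversion down to the database before the data ever reaches Hadoop.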
Quest, editor of the famous Toad product family, provides a plug-in for Apache Sqoop that lets you transfer data bi-directionally between Oracle and Hadoop. It is meant to compete with the native Oracle support present in Sqoop and advertises itself as more than five times faster. It is a free, open-source plug-in to Sqoop licensed under the Apache 2.0 license. More information is available in this blog post.
The Oracle Big Data Connectors suite comprises four products. With the Oracle SQL Connector for HDFS, the user writes Oracle SQL queries against an external table that references files stored in HDFS. With the Oracle Data Integrator Application Adapter for Hadoop, the user works with graphical tools that generate HiveQL, which in turn generates native MapReduce programs.
Oracle Loader for Hadoop uses MapReduce processing to format and load data efficiently into Oracle Database for analysis. The Oracle SQL Connector for HDFS, for its part, automatically creates an OSCH-enabled Oracle Database external table that references files stored in HDFS; full SQL access allows users to join data in HDFS with data stored in Oracle Database, apply analytic functions, load data into tables, and more.
- On-Line Load Option:
Reducer nodes connect to the database for load, using JDBC or direct path load options
- Off-Line Load Option:
Reducer nodes write Oracle Data Pump binary files or delimited text files for loading into the database.
- Load Balancing:
“Perfect Balance” distributes work evenly to all reducers.
- Input Formats:
Supports multiple input formats: delimited text files, regular expressions, Oracle NoSQL Database, Avro, Hive tables or custom inputs.
Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility. It accepts the generic command-line options supported by the Hadoop Tool interface. The OLH driver uses Sqoop to perform operations that Oracle Loader for Hadoop does not support.
- Online database mode:
The data is loaded into the database using either a JDBC output format or an OCI Direct Path output format. The OCI Direct Path output format performs a high performance direct path load of the target table. The JDBC output format performs a conventional path load.
- Offline database mode:
The reducer nodes create binary or text format output files. The Data Pump output format creates binary format files that are ready to be loaded into an Oracle database using an external table and the ORACLE_DATAPUMP access driver. The Delimited Text output format creates text files in delimited record format. (This is usually called comma separated value (CSV) format when the delimiter is a comma.) These text files are ready to be loaded into an Oracle database using an external table and the ORACLE_LOADER access driver. The files can also be loaded using the SQL*Loader utility.
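A typical OLH invocation, as a sketch: the paths and the configuration file name are placeholders, and the exact property names should be checked against the OLH documentation for your version. The output format selected in the job configuration is what decides between the online and offline modes described above.

```shell
# Online modes use a JDBC or OCI Direct Path output format, e.g.
#   oracle.hadoop.loader.lib.output.JDBCOutputFormat
#   oracle.hadoop.loader.lib.output.OCIOutputFormat
# Offline modes write Data Pump or delimited text files, e.g.
#   oracle.hadoop.loader.lib.output.DataPumpOutputFormat
#   oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat
# The chosen class, the target table and the input format are all set
# in the -conf XML file.
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf /path/to/myloader-conf.xml \
  -libjars $OLH_HOME/jlib/oraloader.jar
```

With an offline output format, the resulting Data Pump or text files are then exposed to the database through an external table using the `ORACLE_DATAPUMP` or `ORACLE_LOADER` access driver, or loaded with SQL*Loader.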
- Direct SQL Access:
Query Hive tables and files in HDFS directly from Oracle Database.
- Parallel Query:
Fast, efficient parallel query of data in HDFS.
- Automatic Table Creation:
Easily generate Oracle external tables to access data in HDFS.
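The "automatic table creation" feature is driven by a command-line tool shipped with the connector. The sketch below assumes a hypothetical configuration file naming the target table and the HDFS files; paths are placeholders.

```shell
# Generate the OSCH-enabled external table from a configuration file
# that describes the HDFS data and the desired Oracle table.
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
  -conf /path/to/my-exttab-conf.xml \
  -createTable

# Once created, the external table is queried from Oracle like any
# other table, e.g.:  SELECT COUNT(*) FROM my_hdfs_ext_table;
```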
As previously written, more information is available in another blog post.
- Interactive R access to HDFS:
Manipulate and explore data in HDFS using R functions; using simple R functions, move data between HDFS and R, Oracle Database and the user’s local file system
- R integration with Hadoop:
Leverage map-reduce programming paradigm in the familiar context of R without having to learn Hadoop concepts; Introduces new analytic techniques implemented in R, Java and natively in Hadoop; Native support for Hive tables
The Oracle Data Integrator Application Adapter for Hadoop simplifies data integration between Oracle Data Integrator (ODI) and Hadoop through an easy-to-use interface. ODI generates optimized HiveQL, which in turn generates native MapReduce programs that are executed on the Hadoop cluster.
- Optimized for Developer Productivity:
Familiar ODI graphical user interface; End-to-End coordination of Hadoop jobs; Map-Reduce jobs created and orchestrated by ODI.
- Native Integration with Hadoop:
Native integration with Hadoop using Hive; Ability to represent Hive metadata within ODI; Transformations and filtering occur directly in Hadoop; Transformations written in SQL-like HiveQL
- Optimized for Performance:
Optimized Hadoop ODI knowledge modules; High Performance load to Oracle Database using ODI with Oracle Loader for Hadoop; Ability to configure and execute Oracle Loader for Hadoop and Oracle SQL Connector for HDFS.