Accelerating query processing with materialized views in Apache Hive

Accelerating query processing with materialized views in Apache Hive

Do you like our work......we hire!

Never miss our publications about Open Source, big data and distributed systems, low frequency of one email every two months.

The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk ”Accelerating query processing with materialized views in Apache Hive” about it. This article covers the main principle of this feature, gives some examples and the improvements that are in the roadmap.

Not a view not a table, meet the Materialized view

According to Wikipedia, a SQL View is the result set of a stored query on the data. Let’s say you have a lot of different tables that you are constantly requesting, using always the same joins, filters and aggregations. With a view, you could simplify access to those datasets while providing more meaning to the end user. It avoids repeating the same complex queries and eases schema evolution.

For example, an application needs access to a products dataset with the product owner and the total number of order for each product. Such queries would need to join the User and Order tables with the Product table. A view would mask the complexity of the schema to the end users by only providing one table with custom and dedicated ACLs.

However such views in Hive used to be virtual and implied huge and slow queries. Instead, you could create an intermediate table to store the results of your query, but such operations require changing your access patterns and has the challenge of making sure the data in the table stays fresh.

We can identify four main types of optimization:

  • Change data’s physical properties (distribute, sort).
  • Filter or partition rows.
  • Denormalization.
  • Preaggregation.

The goal of Materialized views (MV) is to improve the speed of queries while requiring zero maintenance operations.

The main features are:

  • Storing the result of a query just like a table (the storage can be in Hive or Druid).
  • The definition of the MV is used to rewrite query and requires no change in your previous patterns.
  • The freshness of the data is ensured by the system.
  • A simple insert in the table is very efficient since it does not require rebuilding the view.

Examples

Let’s see a few examples in order to illustrate the rewriting capabilities.

Denormalization and filtering

Denormalization is the operation of grouping two or more tables into one bigger table. Basically it removes the need of a heavy JOIN operation.

Here are two tables describing employees and departments:

empIdempNamedeptIDempSalaryempHireDate
0Frodo10200002018-03-03
1Sam10223002016-11-11
2Gimli20423002016-02-13
3Galadriel30500002015-05-28
4Legolas30720002015-05-01

And

deptIddeptName
10Accounting
20Workers
30HR

To get all employees recruited in HR during the first semester of 2016 we would use a request looking like this:

SELECT 
  employees.empId, 
  employees.empName 
FROM employees 
JOIN departments ON 
  employees.deptId = departments.deptId AND 
  departments.deptName = 'HR' AND 
  employees.empHireDate >= '2016-01-01' AND 
  employees.empHireDate <= '2016-07-01';

Instead, we’ll create a Materialized view:

CREATE MATERIALIZED VIEW mv AS 
  SELECT 
    employees.empId as empId, 
    employees.empName as empName,
    departments.deptId as deptId, 
    departments.deptName as deptName,
    employees.empHireDate as empHireDate 
  FROM employees, departments 
  WHERE 
     employees.deptId = departments.deptId AND 
     employees.empHireDate >= '2016-01-01' AND 
     employees.empHireDate <= '2016-12-31';

It is denormalized as all the informations are located in one table and filtered to only store the employees hired in 2016.

Our select requests is now simplified to:

SELECT 
  empId, 
  empName 
FROM mv 
WHERE 
  deptName = 'HR' AND 
  empHireDate <= '2016-07-01'

Grouping

For this example we’ll use a table showing the activity of our users on a GIT repository. Every entry shows the added and removed lines in a commit:

commitIdcommitDatecommitAuthorcommitLinesAddedcommitLinesRemoved
0b42f0b2018-01-02 18:04:33Linus1233
d346b3d2018-01-03 17:33:46Linus2234
7151de52018-01-03 18:24:21Richard1311
...............

Using a Materialized view we can group the commits by day and store only the aggregation on the lines added which is very efficient if we store the view in Druid.

CREATE MATERIALIZED VIEW mv_commits_by_day
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
  SELECT
    floor(time to day),
    commitAuthor as author,
    sum(commitLinesAdded) as linesAdded,
    sum(commitLinesRemoved) as linesRemoved
  FROM commits
  GROUP BY floor(time to day), author

Now our query can be much more efficient. Before rewriting:

SELECT
  floor(time to month),
  sum(commitLinesAdded)
FROM commits
GROUP BY floor(time to month)

And after the rewrite, the query generates a much lighter job, the rows being already grouped:

SELECT
  floor(time to month),
  sum(linesAdded)
FROM mv_commits_by_day
GROUP BY floor(time to month)

Enabling Materialized view rewriting

In order to enable query rewriting using Materialized views this global property is needed: SET hive.materializedview.rewriting=true;.

The user can then select which view is enabled for rewriting: ALTER MATERIALIZED VIEW mv ENABLE|DISABLE REWRITE.

Query rewriting is enable by default for all materialized views.

Rebuilding a Materialized view

Once the MV has been created, it is populated with the data present in the tables. But after data has been modified in the source table, it is not automatically reflected in the MV. Hence, The MV needs to be rebuilt using the command:

ALTER MATERIALIZED VIEW mv REBUILD;

The optimizer will always attempt an incremental rebuild instead of a full one. An incremental rebuild can only be made when new data has been inserted.

In the case of DELETE or UPDATE in the table, a full rebuild will be applied.

Stale data

Data is considered stale if it is the value from the most recent version committed to the original data source. When a MV has not yet been synchronized, the view is considered stale. The system is aware of the status of the MV and will only use the data combined with the fresh data in the original table.

Using the parameter hive.materializedview.rewriting.time.window we can override the default behaviour of ignoring stale data. This parameter is a time window after which outdated materialized views become invalid for automatic query rewriting.

Roadmap

Many improvements are planned:

  • Improving the rewriting algorithm inside Apache Calcite
  • Control distribution of data inside the view (SORT BY, CLUSTER BY, DISTRIBUTE BY)
  • Supports UPDATE/DELETE in incremental rebuild of the view

Conclusion

This new feature of Hive looks very promising. Its ability to integrate in nearly any workflow with little to no impact makes it very interesting in a wide variety of use cases. The need for a full rebuild after an update is still a very limiting factor but improvements are on the roadmap.

Materialized Views should be made available in Hive 3.0.

Share this article

Canada - Morocco - France

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.

Support Ukrain