In the lifecycle of a data warehouse in production, there are a variety of tasks that need to be executed on a recurring basis. To name a few concrete examples, scheduled tasks can be related to data ingestion (inserting data from a stream into a transactional table every 10 minutes), query performance (refreshing a materialized view used for BI reporting every hour), or warehouse maintenance (executing replication from one cluster to another on a daily basis).
Until now, executing queries on a recurring basis was possible using OS-level schedulers such as cron or, more commonly, a workflow engine such as Apache Oozie or Apache Airflow. While these tools offer many built-in features that go well beyond those of a basic scheduler, our goal was to provide a simple yet powerful and secure way to create, manage, and monitor scheduled jobs that is fully integrated with Apache Hive.
To fulfill that purpose, Apache Hive recently introduced a new feature called scheduled queries. Using SQL statements, users can schedule Hive queries to run on a recurring basis, monitor their progress, and optionally disable a query schedule.
In a nutshell, every scheduled query in Hive consists of (i) a unique name identifying the schedule, (ii) the actual SQL statement to be executed, and (iii) the schedule at which the query should be executed, defined by a Quartz cron expression. In addition, a scheduled query belongs to a namespace, i.e., a collection of HiveServer2 instances that are responsible for executing the query.
Quartz cron expressions are expressive and flexible. For instance, they can describe simple schedules such as every 10 minutes, but also an execution happening at 10am on the first Sunday of the month in January and February of 2021 and 2022. For simple use cases and user convenience, the scheduled queries implementation also provides syntax to describe common schedules in an easily comprehensible format, e.g., every 20 minutes or every day at '3:25:00'.
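As an illustration, the same schedule can be expressed either with the convenience syntax or with an explicit Quartz cron expression (the schedule names and the trivial query below are hypothetical placeholders):

CREATE SCHEDULED QUERY sq_every_10 EVERY 10 MINUTES AS SELECT 1;

-- equivalent, using a Quartz cron expression
-- (fields: seconds, minutes, hours, day-of-month, month, day-of-week, year)
CREATE SCHEDULED QUERY sq_cron CRON '0 */10 * * * ? *' AS SELECT 1;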
Scheduled queries are stored in Hive’s metastore. HiveServer2 periodically polls the metastore to retrieve scheduled queries that are due to be executed. In case we are running multiple HiveServer2 instances within a single deployment, the metastore guarantees that only one of them will execute a certain scheduled query at any given time.
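For instance, whether a given HiveServer2 instance participates in executing scheduled queries, and which namespace it serves, can be controlled in hive-site.xml. The sketch below uses two configuration properties documented in the Hive wiki; the values shown are the defaults, so check the documentation for your Hive version:

<property>
  <name>hive.scheduled.queries.executor.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.scheduled.queries.namespace</name>
  <value>hive</value>
</property>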
Users can create, alter, and drop scheduled queries using dedicated SQL statements. The scheduled queries, the status of ongoing and previously executed statements, and other additional information are stored in Hive's metastore and are easily accessible through SQL queries. Notably, fine-grained authorization is provided through integration with Apache Ranger.
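For completeness, a sketch of the lifecycle statements (the schedule name my_schedule is a hypothetical placeholder; depending on configuration, a newly created scheduled query may start out disabled):

-- temporarily stop a schedule from being triggered
ALTER SCHEDULED QUERY my_schedule DISABLED;
-- re-enable it
ALTER SCHEDULED QUERY my_schedule ENABLED;
-- remove it entirely
DROP SCHEDULED QUERY my_schedule;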
Although only Hive is supported at the moment, the metastore APIs have been designed to allow managing and executing scheduled queries for other systems as well. This could enable other engines to add support for query schedules natively in the near future, e.g., Apache Impala or Apache Spark.
The following sections show some use cases of this new functionality.
Periodical materialized view rebuild
Materialized views play a central role in data warehouses – with automatic rewriting they can be used to transparently enhance query performance. However, materialized view contents need to be refreshed in case new data is added to the underlying tables.
Consider the following sample database schema:
CREATE TABLE emps (
  empid INTEGER,
  deptno INTEGER,
  name VARCHAR(256),
  salary FLOAT,
  hire_date TIMESTAMP);

CREATE TABLE depts (
  deptno INTEGER,
  deptname VARCHAR(256),
  locationid INTEGER);
Assume we have a lot of queries accessing the employees hired during the last year as well as the department they belong to.
We could create a materialized view with information about the most recently hired employees:
CREATE MATERIALIZED VIEW mv_recently_hired AS
SELECT empid, name, deptname, hire_date
FROM emps
JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2020-01-01 00:00:00';
This materialized view will be matched and used for queries like the following:
SELECT empid, name
FROM emps
JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2020-03-01 00:00:00' AND deptname = 'finance';
Now assume that new employees are hired and thus new records are added to the `emps` table. The materialized view contents become outdated, so we need to execute a rebuild statement to refresh them:
ALTER MATERIALIZED VIEW mv_recently_hired REBUILD;
Instead of doing this manually every time the table contents have changed, we can create a scheduled query that will invoke the rebuild statement periodically:
CREATE SCHEDULED QUERY scheduled_rebuild EVERY 10 MINUTES AS ALTER MATERIALIZED VIEW mv_recently_hired REBUILD;
In particular, the statement above will execute a rebuild every 10 minutes. It is important to note that if a materialized view can be rebuilt incrementally, this operation will be a no-op unless there are changes to the input tables.
If we want to bypass the scheduler system and execute a scheduled query immediately, e.g., for testing purposes, we can do it by executing the following statement:
ALTER SCHEDULED QUERY scheduled_rebuild EXECUTE;
In addition, a scheduled query definition can be modified after its creation by using alter statements. For instance, assume that we want to change the frequency of the query created above. We could execute the following statement:
ALTER SCHEDULED QUERY scheduled_rebuild EVERY 20 MINUTES;
Information about a scheduled query
Once a scheduled query is created, its details can be accessed in the `scheduled_queries` table in Hive’s information schema.
SELECT * FROM information_schema.scheduled_queries WHERE schedule_name = 'scheduled_rebuild';
As of today, the query shows the following information about the scheduled query.
| Column | Description |
| --- | --- |
| scheduled_query_id | Unique numeric identifier for a scheduled query. |
| schedule_name | Name of the scheduled query. |
| enabled | Whether the scheduled query is currently enabled or not. |
| cluster_namespace | Namespace that the scheduled query belongs to. |
| schedule | Schedule described as a Quartz cron expression. |
| user | Owner of the scheduled query. |
| query | SQL query to be executed. |
| next_execution | When the next execution of this scheduled query is due. |
Monitoring scheduled queries
In addition to accessing the details about a certain scheduled query, the information schema can be used to monitor the most recent scheduled query executions.
SELECT * FROM information_schema.scheduled_executions;
The query will show the following information for each row. The retention period for this information in Hive’s metastore RDBMS is configurable.
| Column | Description |
| --- | --- |
| scheduled_execution_id | Unique numeric identifier for a scheduled query execution. |
| schedule_name | Name of the scheduled query associated with this execution. |
| executor_query_id | Query ID assigned to the execution by HiveServer2. |
| state | Current state of the execution (e.g., whether it is started, executing, finished, failed, or timed out). |
| start_time | Start time of the execution. |
| end_time | End time of the execution. |
| elapsed | Difference between start and end time. |
| error_message | If the scheduled query failed, the error message associated with its failure. |
| last_update_time | Time of the last update of the query status by HiveServer2. |
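For instance, a simple way to use this table is to inspect the most recent executions of a given schedule and surface any failures (here reusing the scheduled_rebuild schedule from the materialized view example):

SELECT scheduled_execution_id, state, start_time, end_time, error_message
FROM information_schema.scheduled_executions
WHERE schedule_name = 'scheduled_rebuild'
ORDER BY scheduled_execution_id DESC
LIMIT 5;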
Scheduled queries in Cloudera Data Warehouse (CDW)
It is important to note that scheduled queries work seamlessly with the exciting new features introduced in CDW such as auto-scaling and auto-suspend. If a cluster is suspended when a scheduled query execution is due, HiveServer2 will still retrieve the query and execute it, automatically spinning up nodes as required. Similarly, if a query needs more resources when it is due to be executed, we will automatically add more nodes to the cluster, following the policies defined for the virtual warehouse. In addition, when the query finishes, if the virtual warehouse is idle, it will suspend the cluster nodes automatically. In essence, a user simply needs to create a scheduled query and the rest will be taken care of by CDW.
Future work for scheduled queries
Scheduled queries are an exciting new feature that can be useful in numerous cases. As we mentioned above, other interesting examples include data ingestion and cluster replication, which we will cover in more detail in upcoming blog posts.
In addition, the entire system is orchestrated over SQL, providing a well-defined, robust, and secure foundation for any management UI to integrate a visual interface that makes it easy to create, manage, and monitor scheduled queries.
Where to go from here?
Scheduled queries are available in Cloudera Data Warehouse. Thus, you can simply create your own warehouse to try out this new feature!
If you are interested in learning more about scheduled queries, you can find some additional information in the Apache Hive wiki.