Natively Query Your Delta Lake With Scala, Java, and Python

Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark™ APIs. The project has been deployed at thousands of organizations and processes more exabytes of data each week, becoming an indispensable pillar in data and AI architectures. More than 75% of the data scanned on the Databricks Platform is on Delta Lake!

In addition to Apache Spark, Delta Lake has integrations with Amazon Redshift, Redshift Spectrum, Athena, Presto, Hive, and more; you can find more information in the Delta Lake Integrations. For this blog post, we will discuss the most recent release of the Delta Standalone Reader and the Delta Rust API that allows you to query your Delta Lake with Scala, Java, and Python without Apache Spark.

Delta Standalone Reader

The Delta Standalone Reader (DSR) is a JVM library that allows you to read Delta Lake tables without the need to use Apache Spark; i.e. it can be used by any application that cannot run Spark. The motivation behind creating DSR is to enable better integrations with other systems such as Presto, Athena, Redshift Spectrum, Snowflake, and Apache Hive. For Apache Hive, we rewrote it using DSR to get rid of the embedded Spark in the new release.

To use DSR using sbt include delta-standalone as well as hadoop-client and parquet-hadoop.

libraryDependencies ++= Seq(
"io.delta" %% "delta-standalone" % "0.2.0",
"org.apache.hadoop" % "hadoop-client" % "2.7.2",
"org.apache.parquet" % "parquet-hadoop" % "1.10.1")

Using DSR to query your Delta Lake table

Below are some examples of how to query your Delta Lake table in Java.

Reading the Metadata

After importing the necessary libraries, you can determine the table version and associated metadata (number of files, size, etc.) as noted below.

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.data.CloseableIterator;
import io.delta.standalone.data.RowRecord;

import org.apache.hadoop.conf.Configuration;

DeltaLog log = DeltaLog.forTable(new Configuration(), "[DELTA LOG LOCATION]");

// Returns the current snapshot
log.snapshot();

// Returns the version 1 snapshot
log.getSnapshotForVersionAsOf(1);

// Returns the snapshot version
log.snapshot.getVersion();

// Returns the number of data files
log.snapshot.getAllFiles().size();

Reading the Delta Table

To query the table, open a snapshot and then iterate through the table as noted below.

// Create a closeable iterator
CloseableIterator iter = snapshot.open();

RowRecord row = null;
int numRows = 0;

// Schema of Delta table is {long, long, string}
while (iter.hasNext()) {
row = iter.next();
      numRows++;

      Long c1 = row.isNullAt("c1") ? null : row.getLong("c1");
      Long c2 = row.isNullAt("c2") ? null : row.getLong("c2");
      String c3 = row.getString("c3");
      System.out.println(c1 + " " + c2 + " " + c3);
}

// Sample output
175 0 foo-1
176 1 foo-0
177 2 foo-1
178 3 foo-0
179 4 foo-1

Requirements

DSR has the following requirements:

  • JDK 8 or above
  • Scala 2.11 or Scala 2.12
  • Dependencies on parquet-hadoop and hadoop-client

For more information, please refer to the Java API docs or Delta Standalone Reader wiki.

Delta Rust API

delta.rs is an experimental interface to Delta Lake for Rust. This library provides low-level access to Delta tables and is intended to be used with data processing frameworks like datafusion, ballista, rust-dataframe, and vega. It can also act as the basis for native bindings in other languages such as Python, Ruby, or Golang.

QP Hou and R. Tyler Croy at Scribd use Delta Lake to enable the world’s largest digital library are the initial creators of this API. The Delta Rust API has quickly gained traction in the community with a special callout of community-driven Azure support within weeks after the initial release.

How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library.

Reading the Metadata (Cargo)

You can use the API or CLI to inspect the files of your Delta Lake table as well as provide the metadata information; below are sample commands using the CLI via cargo. Once the 0.2.0 release of delta.rs has been published, `cargo install deltalake` will provide the delta-inspect binary.

To inspect the files, check out the source and use delta-inspect files:

❯ cargo run --bin delta-inspect files ./tests/data/delta-0.2.0

part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet
part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet
part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet

To inspect the metadata, use delta-inspect info:

❯ cargo run --bin delta-inspect info ./tests/data/delta-0.2.0
DeltaTable(./tests/data/delta-0.2.0)
version: 3
metadata: GUID=22ef18ba-191c-4c36-a606-3dad5cdf3830, name=None, description=None, partitionColumns=[], configuration={}
min_version: read=1, write=2
files count: 3

Reading the Metadata (Python)

You can also use the delta.rs to query Delta Lake using Python via the delta.rs Python bindings.

To obtain the Delta Lake version and files, use the .version() and .files() methods respectively.

from deltalake import DeltaTable
dt = DeltaTable("../rust/tests/data/delta-0.2.0")

# Get the Delta Lake Table version
dt.version()

# Example Output
3

# List the Delta Lake table files
dt.files()

# Example Output
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet', 'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet', 'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']

Reading the Delta Table (Python)

To read a Delta table using the delta.rs Python bindings, you will need to convert the Delta table into a PyArrow Table and Pandas Dataframe.

# Import Delta Table
from deltalake import DeltaTable

# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")

# Create a Pandas Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = dt.to_pyarrow_table().to_pandas()
# Query the Pandas table
Df

# Example output
0	5
1	7
2	9

You can also use Time Travel and load a previous version of the Delta table by specifying the version number by using the load_version method.

# Load version 2 of the table
dt.load_version(2)

Notes

Currently, you can also query your Delta Lake table through delta.rs using Python and Ruby, but the underlying Rust APIs should be straightforward to integrate into Golang or other languages too.. Refer to delta.rs for more information. There’s lots of opportunity to contribute to Delta.rs, so be sure to check out the open issues! https://github.com/delta-io/delta.rs/issues

Discussion

We’d like to thank Scott Sandre and the Delta Lake Engineering team for creating the Delta Standalone Reader and QP Hou and R. Tyler Croy for creating the Delta Rust API. Try out the Delta Standalone Reader and Delta Rust API today – no Spark required!

Join us in the Delta Lake community through our Public Slack Channel (Register here | Log in here) or Public Mailing list.

Try Databricks for free. Get started today.

The post Natively Query Your Delta Lake With Scala, Java, and Python appeared first on Databricks.

Source: Databricks

Leave a Reply

Your email address will not be published.


*