What is Google Cloud Dataflow?

Google Cloud Dataflow is a simple, flexible, and powerful system you can use to perform data processing tasks of any size.

Cloud Dataflow consists of two major components:

A set of SDKs that you use to define data processing jobs by writing programs. The Dataflow SDKs feature a unique programming model that simplifies the mechanics of large-scale cloud data processing.
A Google Cloud Platform managed service. The Dataflow service ties together and fully manages several different Google Cloud Platform technologies, such as Google Compute Engine, Google Cloud Storage, and BigQuery, to execute data processing jobs on Google Cloud Platform resources.

The two components work together: you use the Dataflow SDKs to define jobs, and the Dataflow service executes them.

See Getting Started for instructions on how to begin using Cloud Dataflow. You can set up a Dataflow-enabled Cloud Platform project, create a dependency on the Dataflow SDK for Java, and run some sample pipelines on the Dataflow service.

What Can I Use Dataflow For?

You can use Cloud Dataflow for nearly any kind of data processing task, including both batch and streaming data processing. The Dataflow SDKs provide a unified data model that can represent a data set of any size, including an unbounded (effectively infinite) data set from a continuously updating data source such as Google Cloud Pub/Sub. The Dataflow managed service can run both batch and streaming jobs.
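To make the idea of one model for bounded and unbounded data more concrete, here is a minimal sketch in plain Java. It does not use the Dataflow SDK; the class `UnifiedModelSketch` and the method `normalize` are hypothetical names chosen for illustration. The point is only that the same transform logic can be written once and applied to a finite source or an endless one.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: plain java.util.stream, not the Dataflow SDK.
class UnifiedModelSketch {
    // One transform definition, written against a stream of elements so that it
    // does not depend on whether the source is finite or endless.
    static Stream<String> normalize(Stream<String> events) {
        return events.map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .map(String::toUpperCase);
    }

    public static void main(String[] args) {
        // Bounded input: a fixed set of elements, as in a batch job.
        List<String> batch = normalize(Stream.of(" a ", "", "b"))
                .collect(Collectors.toList());

        // Unbounded input: an endless generator, as in a streaming job. The demo
        // keeps only the first three results so that the program terminates.
        Stream<String> endless = Stream.iterate(1, n -> n + 1)
                .map(n -> " event-" + n + " ");
        List<String> streamed = normalize(endless)
                .limit(3)
                .collect(Collectors.toList());

        System.out.println(batch);
        System.out.println(streamed);
    }
}
```

A real streaming job would never call `limit`; it is here only so the example finishes.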

In particular, Dataflow excels at high-volume computation, where the steps in your job need to process an amount of data that exceeds the memory capacity of a cost-effective cluster. Dataflow is particularly useful for "embarrassingly parallel" data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel.
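The decomposition described above can be sketched in plain Java. This is not how the Dataflow service splits work internally, and it does not use the Dataflow SDK; `ParallelBundlesSketch`, `toBundles`, `sumOfSquares`, and `process` are hypothetical names. The sketch splits the input into bundles, processes each bundle without reference to any other, and then combines the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative only: plain Java, not the Dataflow SDK.
class ParallelBundlesSketch {
    // Split the input into fixed-size bundles that can be processed independently.
    static List<List<Integer>> toBundles(List<Integer> data, int bundleSize) {
        if (bundleSize <= 0) {
            throw new IllegalArgumentException("bundleSize must be positive");
        }
        List<List<Integer>> bundles = new ArrayList<>();
        for (int start = 0; start < data.size(); start += bundleSize) {
            int end = Math.min(start + bundleSize, data.size());
            bundles.add(data.subList(start, end));
        }
        return bundles;
    }

    // Per-bundle work: no bundle needs to know about any other bundle.
    static long sumOfSquares(List<Integer> bundle) {
        return bundle.stream().mapToLong(x -> (long) x * x).sum();
    }

    // Process the bundles in parallel, then combine the partial results.
    static long process(List<Integer> data, int bundleSize) {
        return toBundles(data, bundleSize).parallelStream()
                .mapToLong(ParallelBundlesSketch::sumOfSquares)
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 10).boxed()
                .collect(Collectors.toList());
        System.out.println(process(data, 3));
    }
}
```

Because the per-bundle function has no shared state, the result does not depend on the bundle size or on the order in which bundles finish.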

You can also use Dataflow for "Extract, Transform, and Load" tasks. These tasks are useful for moving data between different storage media, transforming data into a more desirable format, or loading data onto a new storage system.

Dataflow SDKs

The Dataflow SDKs provide a simple and elegant programming model to express your data processing jobs. Each job is represented by a data processing pipeline that you create by writing a program with a Dataflow SDK. Each pipeline is an independent entity that reads some input data, performs some transforms on that data to gain useful or actionable intelligence about it, and produces some resulting output data. A pipeline's transforms might include filtering, grouping, comparing, or joining data.
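The read, transform, and output shape of a pipeline can be sketched with the Java standard library alone. This is not Dataflow SDK code, and the names `PipelineShapeSketch` and `countWords` are hypothetical; the in-memory list stands in for input data and `System.out` stands in for an output sink. The transform steps shown are a filter and a grouping, two of the kinds mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: plain java.util.stream, not the Dataflow SDK.
class PipelineShapeSketch {
    // Transform steps: split lines into words, filter out empty tokens,
    // then group equal words together and count each group.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Read: a small in-memory list stands in for the input data.
        List<String> input = Arrays.asList("to be or not", "to be");

        // Transform: filtering and grouping.
        Map<String, Long> counts = countWords(input);

        // Write: printing stands in for the output sink.
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```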

The Dataflow SDKs provide several useful abstractions that allow you to think about your data processing pipeline in a simple, logical way. Dataflow simplifies the mechanics of large-scale parallel data processing, freeing you from the need to manage orchestration details such as partitioning your data and coordinating individual workers.

Note: There is currently only one Dataflow SDK, the Dataflow SDK for Java. More SDKs are planned for future releases.

Dataflow SDK Concepts

Key concepts in the Dataflow SDKs include:

Simple data representation. Dataflow SDKs use a specialized collection class, called PCollection, to represent your pipeline data. This class can represent data sets of virtually unlimited size, including bounded and unbounded data collections.
Powerful data transforms. Dataflow SDKs provide several core data transforms that you can apply to your data. These transforms, called PTransforms, are generic frameworks that apply functions that you provide across an entire data set, using the features of the Dataflow service to execute each transform in the most efficient way.
I/O APIs for a variety of data formats. Dataflow SDKs provide APIs that let your pipeline read and write data to and from a variety of formats and storage technologies. Your pipeline can read text files, Avro files, BigQuery tables, and more.

See the programming model documentation to learn more about how Dataflow implements these concepts.

Open Source SDKs

Google has released the Dataflow SDK for Java as open source, available on GitHub. This will help foster an ecosystem of open source projects based on the Dataflow model. The benefits of an open source Dataflow ecosystem include:

Increased visibility. The Dataflow SDK source code provides insight into how Dataflow programs interact with the managed Dataflow service that Google provides.
Support for third-party pipeline runners. Releasing the Dataflow SDKs as open source makes it easier for others to provide their own services to run pipelines defined using the Dataflow SDKs. These runners might target different runtime environments or enable Dataflow pipelines to be run on premises, using non-Google Cloud Platform services.
Transform modeling. The pre-written transforms in the Dataflow SDKs can provide models or design patterns that the open source community can use to contribute their own transforms to the ecosystem.

What is the Dataflow Service?

The Cloud Dataflow service fully manages Google Cloud Platform resources to execute your data processing tasks. The Dataflow service ties together a number of Cloud Platform technologies, including:

Google Compute Engine VMs, to provide job workers.
Google Cloud Storage, for reading and writing data.
Google BigQuery, for reading and writing data.

When using the Dataflow service, there's no need to shard or partition your data by hand; the service's optimization and resource management systems automatically break your job down for the most efficient execution on the Cloud Platform resources you've allocated.

Service Features

Dynamic Optimization: The Dataflow service provides dynamic optimization of Cloud Platform resources to execute your data processing jobs. When you build a dataflow, the Dataflow service constructs a directed graph of your job and optimizes the graph for the most efficient execution.
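As a rough illustration of what it means to represent a job as a directed graph, the sketch below models steps as nodes and data flow as edges, then derives an order in which the steps could run. It is plain Java, not the Dataflow service's actual graph representation or optimizer, which this page does not describe; `JobGraphSketch` and its methods are hypothetical names, and the ordering uses Kahn's topological sort as a stand-in for whatever analysis the service performs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy job graph, not the Dataflow service's internals.
class JobGraphSketch {
    // Adjacency list: step name -> names of the steps that consume its output.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addStep(String name) {
        edges.computeIfAbsent(name, k -> new ArrayList<>());
    }

    void connect(String from, String to) {
        addStep(from);
        addStep(to);
        edges.get(from).add(to);
    }

    // Kahn's algorithm: repeatedly emit a step whose inputs have all been emitted.
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String step : edges.keySet()) {
            pendingInputs.put(step, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                pendingInputs.merge(target, 1, Integer::sum);
            }
        }

        Deque<String> ready = new ArrayDeque<>();
        pendingInputs.forEach((step, n) -> { if (n == 0) ready.add(step); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String next : edges.get(step)) {
                if (pendingInputs.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("job graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        JobGraphSketch graph = new JobGraphSketch();
        graph.connect("read", "parse");
        graph.connect("parse", "filter");
        graph.connect("parse", "count");
        graph.connect("filter", "write");
        graph.connect("count", "write");
        System.out.println(graph.executionOrder());
    }
}
```

Once a job is in this form, steps with no path between them (here, "filter" and "count") are visibly independent, which is the kind of structure an optimizer can exploit.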

Resource Management: The Dataflow service fully manages Cloud Platform technologies to run your job. This includes spinning up and tearing down Compute Engine resources, collecting logs, and communicating with Cloud Storage technologies.

Job Monitoring: The Dataflow service includes a monitoring interface built into the Google Developers Console. The Dataflow monitoring interface shows the different stages of your data processing pipeline and lets you see how data moves through those stages as the job progresses.

Native I/O Adapters for Cloud Storage Technologies: The Dataflow service has built-in support for reading data from, and writing data to, Cloud Platform storage systems such as Cloud Storage and BigQuery. This makes it easy to build a data processing pipeline that works with your data in Cloud Platform.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.