This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

Releases: GoogleCloudPlatform/DataflowPythonSDK

Future Releases

31 May 17:16

All future releases will be announced in the Release Notes: Dataflow SDK for Python, and releases will be available on PyPI.

See README.md for more information.

Version 0.2.7

14 Jun 06:22

The 0.2.7 release includes the following changes:

  • Introduce OperationCounters.should_sample for sampling-based size estimation.
  • Implement fixed sharding in TextFileSink.
  • Use multiple file-rename threads in the finalize_write method.
  • Retry idempotent I/O operations on GCS timeout.

Version 0.2.6

10 Jun 16:01

The 0.2.6 release includes the following changes:

  • Allow Pipeline objects to be used in Python with statements (see the sketch after this list).
  • Several bug fixes.
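
A minimal sketch of the with-statement support follows. It is illustrative only: the import path (google.cloud.dataflow), the DirectPipelineRunner name, and the label-first transform arguments are assumptions based on the 0.2.x SDK, and the labels and data are placeholders.

    # Sketch only: the import path, runner name, and label-first transform
    # arguments follow the 0.2.x SDK and may differ in your installed version.
    import google.cloud.dataflow as df

    with df.Pipeline('DirectPipelineRunner') as p:
        (p
         | df.Create('create', ['hello', 'world'])
         | df.Map('upper', lambda word: word.upper()))
    # Exiting the with block is expected to run the pipeline, so no explicit
    # p.run() call is needed.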

Version 0.2.5

31 May 21:55

The 0.2.5 release includes the following changes:

  • Support for creating custom sources and reading them with DirectRunner and DataflowRunner.
  • DiskCachedPipelineRunner as a disk-backed alternative to DirectRunner.
  • Ignore undeclared side outputs of DoFns in the cloud executor.
  • Fix a pickling issue when the Seaborn package is loaded.
  • Enable gzip compression in the text file sink.

Version 0.2.4

11 May 22:12

The 0.2.4 release includes the following changes:

  • Support for large iterable side inputs.
  • Enable all supported counter types.
  • Modify --requirements_file behavior to cache packages locally (see the sketch after this list).
  • Support for non-native TextFileSink.
  • Several fixes.
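
As an illustration of passing --requirements_file as a pipeline option, here is a minimal sketch; the import path, the argv-style Pipeline constructor, the runner name, and every project ID and bucket path are assumptions or placeholders, not values from the release.

    # Sketch only: the import path, the argv-style Pipeline constructor, the
    # runner name, and every ID and path below are placeholders.
    import google.cloud.dataflow as df

    p = df.Pipeline(argv=[
        '--runner', 'DataflowPipelineRunner',
        '--project', 'my-gcp-project',
        '--staging_location', 'gs://my-bucket/staging',
        '--temp_location', 'gs://my-bucket/temp',
        # Packages listed in requirements.txt are downloaded to a local cache
        # and then staged for the workers (the caching behavior added in 0.2.4).
        '--requirements_file', 'requirements.txt',
    ])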

Version 0.2.3

19 Apr 23:30

The 0.2.3 release includes several fixes:

  • Removed the version pin for the google-apitools package.
  • Removed the version pin for the oauth2client package.
  • Better interoperability with the gcloud package.
  • Raise the correct exception for failures in DoFn start/finish methods.

Version 0.2.2

01 Apr 00:48

The 0.2.2 release includes the following changes:

  • Improved memory footprint for DirectPipelineRunner.
  • Multiple bug fixes (BigQuerySink schema handling for record field types, clearer error messages for missing files, etc.).
  • Several performance improvements (Cythonized some modules, reduced debug logging, etc.).
  • New example using more complex BigQuery schemas.

This release supports only batch execution; streaming is not available yet.
Batch execution can run locally (for development and testing) or on Google Cloud using the Cloud Dataflow service. Running on Google Cloud requires whitelisting using this form.

Version 0.2.1

21 Mar 18:39

The 0.2.1 release includes the following changes:

  • Optimized performance in the following areas:
    • Logging
    • Shuffle writing
    • Coder usage
    • Compilation of some worker modules with Cython
  • Changed the default behavior for cloud execution: instead of downloading the SDK from a Cloud Storage bucket, you now download the SDK as a tarball from GitHub, and jobs run on the Dataflow service use the same SDK version you have downloaded locally. You can use the --sdk_location pipeline option to override this behavior and provide an explicit tarball location (a Cloud Storage path or URL); see the sketch after this list.
  • Fixed several pickling issues related to how Dataflow serializes user functions and data.
  • Fixed several worker lease expiration issues experienced when processing large datasets.
  • Improved validation to detect common errors, such as access issues and invalid parameter combinations, much earlier.
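
The sketch below illustrates overriding the default behavior with --sdk_location; the import path, the argv-style Pipeline constructor, the runner name, and every path and ID shown are assumptions or placeholders, not values from the release.

    # Sketch only: points --sdk_location at an explicit SDK tarball instead of
    # the default GitHub download; all names and paths are placeholders.
    import google.cloud.dataflow as df

    p = df.Pipeline(argv=[
        '--runner', 'DataflowPipelineRunner',
        '--project', 'my-gcp-project',
        '--staging_location', 'gs://my-bucket/staging',
        '--temp_location', 'gs://my-bucket/temp',
        '--sdk_location', 'gs://my-bucket/sdk/dataflow-python-sdk.tar.gz',
    ])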

Version 0.2.0

03 Mar 07:25

Initial release of the open-source Dataflow SDK for Python.