fastparquet

A Python interface to the Parquet file format.

Introduction

The Parquet format is a common binary data store, used particularly in the Hadoop/big-data sphere. It provides several advantages relevant to big-data processing, including:

  1. columnar storage: only read the data of interest

  2. efficient binary packing

  3. choice of compression algorithms and encodings

  4. splitting of data into multiple files, allowing for parallel processing

  5. a range of logical types

  6. statistics stored in metadata, allowing unneeded chunks to be skipped

  7. data partitioning using the directory structure
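
As a concrete sketch of points 1 and 6, fastparquet lets a read select only certain columns and skip row groups whose metadata statistics rule them out. The file name and column names below are assumptions for illustration only:

```python
import fastparquet

# Open the dataset; only the footer metadata is parsed at this point.
pf = fastparquet.ParquetFile("data.parquet")  # hypothetical file

# Load just two columns, and skip any row group whose min/max
# statistics show it cannot contain rows with year >= 2020.
df = pf.to_pandas(columns=["a", "b"], filters=[("year", ">=", 2020)])
```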

Since it was developed as part of the Hadoop ecosystem, Parquet’s reference implementation is written in Java. This package aims to provide a performant library to read and write Parquet files from Python, without any need for a Python-Java bridge. This will make the Parquet format an ideal storage mechanism for Python-based big data workflows.

The tabular nature of Parquet is a good fit for Pandas data-frame objects, and fastparquet deals exclusively with conversion between data-frames and Parquet.
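
For example, a minimal round trip between a data-frame and a Parquet file might look like the following sketch (the file name is illustrative):

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Write the data-frame to a single Parquet file.
fastparquet.write("example.parquet", df)

# Read it back into a new data-frame.
df2 = fastparquet.ParquetFile("example.parquet").to_pandas()
```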

Highlights

The original outline plan for this project can be found upstream.

Briefly, some features of interest:

  1. read and write Parquet files, in single-file or multi-file format

  2. choice of compression per-column and various optimized encoding schemes; ability to choose row divisions and partitioning on write (see the sketch after this list)

  3. acceleration of both reading and writing using Cython, with performance competitive with other frameworks

  4. ability to read from and write to arbitrary file-like objects, allowing interoperability with fsspec filesystems and others

  5. can be called from dask, to enable parallel reading and writing of Parquet files, possibly distributed across a cluster
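
As a rough sketch of item 2 above, the write call accepts per-column compression, explicit row divisions, and directory partitioning; the data, column names, and sizes here are assumptions, not a definitive recipe:

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({
    "value": range(1000),
    "part": ["x"] * 500 + ["y"] * 500,
})

# Per-column compression (gzip for 'value', snappy for everything else),
# one row group per 250 rows, and one directory per distinct 'part'
# value (partitioning requires the multi-file 'hive' scheme).
fastparquet.write(
    "out_dir",
    df,
    compression={"value": "gzip", "_default": "snappy"},
    row_group_offsets=250,
    file_scheme="hive",
    partition_on=["part"],
)
```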

Caveats, Known Issues

Please see the Release Notes. Versions 0.6.0 and 0.7.0 added a large number of new features and enhancements, so please read that page carefully: the changes may affect you!

Fastparquet is a free and open-source project. We welcome contributions in the form of bug reports, documentation, code, design proposals, and more. This page provides resources on how best to contribute.

Bug reports

Please file an issue on GitHub.

Relation to Other Projects

  1. parquet-python is the original pure-Python Parquet quick-look utility which was the inspiration for fastparquet. It has continued development, but is not aimed at big-data vectorised loading as fastparquet is.

  2. Apache Arrow and its Python API define an in-memory data representation, and can read/write Parquet, including conversion to pandas. It is the “other” engine available within Dask and Pandas, and gives good performance and a range of options. If you are using Arrow anyway, you probably want to use its Parquet interface.

  3. PySpark, a Python API to the Spark engine, interfaces Python commands with a Java/Scala execution core, and thereby gives Python programmers access to the Parquet format. Spark is used in some tests and some test files were produced by Spark.

  4. fastparquet lives within the dask ecosystem, and although it is useful by itself, it is designed to work well with dask for parallel execution, as well as related libraries such as s3fs for pythonic access to Amazon S3.
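
As an illustration of item 4, dask can use fastparquet as its Parquet engine for parallel reads, including from remote storage via s3fs. A minimal sketch, assuming dask and s3fs are installed; the S3 path and column name are placeholders:

```python
import dask.dataframe as dd

# Each file (or row group) becomes a task, decoded in parallel by fastparquet.
ddf = dd.read_parquet("s3://bucket/path/", engine="fastparquet")

# Operations are lazy; compute() triggers the parallel read.
result = ddf["value"].mean().compute()
```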
