Quickstart

You may already be using fastparquet via the Dask or Pandas APIs. The options available with engine="fastparquet" are essentially the same as those given here and in our API docs.
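
For example, using the Pandas API directly (a minimal sketch; the file names reuse those from the examples below):

import pandas as pd

# read and write Parquet files through pandas, delegating to fastparquet
df = pd.read_parquet('myfile.parq', engine='fastparquet')
df.to_parquet('outfile.parq', engine='fastparquet')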

Reading

To open and read the contents of a Parquet file:

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()

The Pandas data-frame, df, will contain all columns in the target file, and all row-groups concatenated together. If the data is a multi-file collection, such as generated by Hadoop, the filename to supply is either the directory name or the “_metadata” file contained therein; these are handled transparently.
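
For example, either of the following would open such a collection (the directory name mydata is hypothetical):

from fastparquet import ParquetFile

# open a multi-file data set by its directory ...
pf = ParquetFile('mydata/')
# ... or by its consolidated metadata file
pf = ParquetFile('mydata/_metadata')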

One may wish to investigate the meta-data associated with the data before loading, for example, to choose which row-groups and columns to load. The properties columns, count, dtypes and statistics are available to assist with this, and a summary is given by info. In addition, if the data is in a hierarchical, directory-partitioned structure, then the property cats specifies the possible values of each partitioning field. You can get a deeper view of the parquet schema with print(pf.schema).
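
A minimal sketch of such an inspection, using the attributes listed above (note that in some fastparquet versions count is a method, pf.count(), rather than a property):

from fastparquet import ParquetFile

pf = ParquetFile('myfile.parq')
print(pf.columns)     # column names in the file
print(pf.count)       # total number of rows
print(pf.dtypes)      # mapping of column name to pandas dtype
print(pf.statistics)  # per-row-group min/max/null-count statistics
print(pf.info)        # summary of the above
print(pf.cats)        # partitioning values, for directory-partitioned data
print(pf.schema)      # the full parquet schema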

You may specify which columns to load, which of those to keep as categoricals (if the data uses dictionary encoding), and which column to use as the pandas index. By selecting columns, we only access parts of the file, and efficiently skip columns that are not of interest.

df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
# or
df2 = pf.to_pandas(['col1', 'col2'], categories={'col1': 12})

where the second version specifies the number of expected labels for that column. If the data originated from pandas, the categories will already be known.

Furthermore, row-groups can be skipped by providing a list of filters. There is no need to return the filtering column as a column in the data-frame. Note that a row-group is skipped only if none of its data at all meets the specified requirements.

df3 = pf.to_pandas(['col1', 'col2'], filters=[('col3', 'in', [1, 2, 3, 4])])

(new in 0.7.0: row-level filtering)

Writing

To create a single Parquet file from a dataframe:

from fastparquet import write
write('outfile.parq', df)

The function write provides a number of options. The default is to produce a single output file with row-groups of up to 50 million rows, with plain encoding and no compression. The performance will therefore be similar to simple binary packing, such as numpy.save, for numerical columns.

Further options that may be of interest are:

  1. the compression algorithm (typically “snappy”, which is fast but not especially space-efficient), which can vary by column

  2. the row-group splits to apply, which may lead to efficiencies on loading, if some row-groups can be skipped. Statistics (min/max) are calculated for each column in each row-group on the fly.

  3. multi-file saving, enabled with file_scheme="hive" or file_scheme="drill": directory-tree-partitioned output with a single metadata file and several data files, one or more per leaf directory. The values used for partitioning are encoded into the paths of the data files.

  4. appending to existing data sets with append=True, which adds new row-groups. For the specific case of “hive”-partitioned data with one file per partition, append="overwrite" is also allowed; this replaces partitions of the data for which new data are given, but leaves other existing partitions untouched (see the sketch after the example below).

write('outdir.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')
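
A sketch of directory-partitioned writing and appending, as described in points 3 and 4 above (the partitioning column 'col4' and the second data-frame df_new are hypothetical; partition_on names the columns whose values form the directory tree):

from fastparquet import write

# write a directory tree partitioned by the values of 'col4'
write('outdir.parq', df, file_scheme='hive', partition_on=['col4'])

# later, add new row-groups from another data-frame with the same schema;
# append='overwrite' would instead replace partitions for which new data are given
write('outdir.parq', df_new, file_scheme='hive', partition_on=['col4'], append=True)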