Quickstart

Reading

To open and read the contents of a Parquet file:

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()

The Pandas data-frame, df, will contain all columns in the target file, with all row-groups concatenated together. If the data is a multi-file collection, such as is generated by Hadoop, the filename to supply is either the directory name or the “_metadata” file contained therein; these are handled transparently.

One may wish to investigate the meta-data associated with the data before loading, for example, to choose which row-groups and columns to load. The properties columns, count, dtypes and statistics are available to assist with this. In addition, if the data is in a hierarchical directory-partitioned structure, then the property cats specifies the possible values of each partitioning field.
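
As a quick sketch of how these properties might be inspected (the exact form of the output varies between fastparquet versions, and in some releases count is a method rather than a property):

pf = ParquetFile('myfile.parq')
pf.columns      # list of column names in the file
pf.dtypes       # mapping of column name to dtype
pf.count        # total number of rows across all row-groups
pf.statistics   # per-row-group min/max and null-count information
pf.cats         # possible values of each partitioning field, if any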

You may specify which columns to load, which of those to keep as categoricals (if the data uses dictionary encoding), and which column to use as the pandas index. By selecting columns, we only access parts of the file, and efficiently skip columns that are not of interest.

df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])
# or
df2 = pf.to_pandas(['col1', 'col2'], categories={'col1': 12})

where the second version specifies the number of expected labels for that column.
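
The index keyword selects which column becomes the data-frame's index; as an illustrative sketch, reusing the same hypothetical column names:

df2 = pf.to_pandas(['col1', 'col2'], index='col1')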

Furthermore, row-groups can be skipped by providing a list of filters. The filtering column need not be among the columns returned in the data-frame. Note that only row-groups in which no data at all meets the specified requirements will be skipped; this is row-group-level filtering, not row-by-row filtering.

df3 = pf.to_pandas(['col1', 'col2'], filters=[('col3', 'in', [1, 2, 3, 4])])
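
Several conditions can be supplied at once. As a sketch (the precise rules for how conditions are combined are described in the API docs; col3 is again a hypothetical column), a row-group is skipped if its statistics show it cannot satisfy the conditions:

df4 = pf.to_pandas(['col1', 'col2'],
                   filters=[('col3', '>=', 0), ('col3', '<', 10)])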

Writing

To create a single Parquet file from a dataframe:

from fastparquet import write
write('outfile.parq', df)

The function write provides a number of options. The default is to produce a single output file with row-groups of up to 50 million rows, plain encoding and no compression. The performance will therefore be similar to simple binary packing, such as numpy.save, for numerical columns.

Further options that may be of interest are:

  • the compression algorithm (typically “snappy”: fast, but not especially space-efficient), which can vary by column
  • the row-group splits to apply, which may lead to efficiencies on loading, if some row-groups can be skipped. Statistics (min/max) are calculated for each column in each row-group on the fly.
  • multi-file saving can be enabled with the file_scheme keyword: hive-style output is a directory with a single metadata file and several data-files.
  • the options has_nulls, fixed_text and write_index affect efficiency; see the API docs. An example combining several of these options is shown below.
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')
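
The hive-style output written above is a multi-file collection, so it can be read back as described in the Reading section, by passing the directory name:

pf2 = ParquetFile('outfile2.parq')  # the directory written by file_scheme='hive'
df_back = pf2.to_pandas()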