API

ParquetFile(fn[, verify, open_with, root, sep]) The metadata of a parquet file or collection
ParquetFile.to_pandas([columns, categories, …]) Read data from parquet into a Pandas dataframe.
ParquetFile.iter_row_groups([columns, …]) Iterate over row-groups, reading each into a Pandas dataframe.
ParquetFile.info Some metadata details
write(filename, data[, row_group_offsets, …]) Write a Pandas DataFrame to filename in Parquet format
class fastparquet.ParquetFile(fn, verify=False, open_with=<function default_open>, root=False, sep=None)[source]

The metadata of a parquet file or collection

Reads the metadata (row-groups and schema definition) and provides methods to extract the data from the files.

Note that when reading parquet files partitioned using directories (i.e. using the hive/drill scheme), an attempt is made to coerce the partition values to a number, datetime or timedelta. Fastparquet cannot read a hive/drill parquet file with partition names which coerce to the same value, such as “0.7” and “.7”.

Parameters:
fn: path/URL string or list of paths

Location of the data. If a directory, will attempt to read a file “_metadata” within that directory. If a list of paths, will assume that they make up a single parquet data set. This parameter can also be any file-like object, in which case this must be a single-file dataset.

verify: bool [False]

test file start/end byte markers

open_with: function

With the signature func(path, mode), returns a context manager that evaluates to a file open for reading. Defaults to the built-in open.

root: str

If passing a list of files, the top directory of the data-set may be ambiguous for partitioning where the uppermost field has only one value. Use this to specify the data-set root directory, if required.
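
As an illustrative sketch of constructing a ParquetFile (the paths below are hypothetical, not part of the API):

from fastparquet import ParquetFile

# A single data file
pf = ParquetFile('mydata.parquet')

# A directory written in the hive scheme; the "_metadata" file inside
# it will be located and read
pf = ParquetFile('mydata/')

# An explicit list of files forming one data-set; root pins the
# data-set's top directory when partition folders would be ambiguous
pf = ParquetFile(['mydata/part.0.parquet', 'mydata/part.1.parquet'],
                 root='mydata')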

Attributes:
cats: dict

Columns derived from hive/drill directory information, with known values for each column.

categories: list

Columns marked as categorical in the extra metadata (meaning the data must have come from pandas).

columns: list of str

The data columns available

count: int

Total number of rows

dtypes: dict

Expected output types for each column

file_scheme: str

‘simple’: all row groups are within the same file; ‘hive’: all row groups are in other files; ‘mixed’: row groups in this file and others too; ‘empty’: no row groups at all.

info: dict

Combination of some of the other attributes

key_value_metadata: list

Additional information about this data’s origin, e.g., pandas description.

row_groups: list

Thrift objects for each row group

schema: schema.SchemaHelper

Print this for a representation of the column structure.

selfmade: bool

If this file was created by fastparquet

statistics: dict

Max/min/count of each column chunk
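
A brief sketch of inspecting these attributes on an opened ParquetFile (the path is hypothetical):

import fastparquet

pf = fastparquet.ParquetFile('mydata/_metadata')

pf.columns      # data column names
pf.dtypes       # expected output dtype for each column
pf.count        # total number of rows
pf.statistics   # max/min/count of each column chunk
pf.cats         # partition columns and their known values
print(pf.info)  # a combined summary of several of the above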

Methods

filter_row_groups(filters) Select row groups using set of filters
grab_cats(columns[, row_group_index]) Read dictionaries of first row_group
iter_row_groups([columns, categories, …]) Iterate over row-groups, reading each into a Pandas dataframe.
read_row_group(rg, columns, categories[, …]) Access row-group in a file and read some columns into a data-frame.
read_row_group_file(rg, columns, categories) Open file for reading, and process it as a row-group
to_pandas([columns, categories, filters, index]) Read data from parquet into a Pandas dataframe.
check_categories  
pre_allocate  
row_group_filename  
ParquetFile.to_pandas(columns=None, categories=None, filters=[], index=None)[source]

Read data from parquet into a Pandas dataframe.

Parameters:
columns: list of names or `None`

Columns to load (see ParquetFile.columns). Any columns in the data not in this list will be ignored. If None, read all columns.

categories: list, dict or `None`

If a column is encoded using dictionary encoding in every row-group and its name is also in this list, it will generate a Pandas Category-type column, potentially saving memory and time. If a dict {col: int}, the value indicates the number of categories, so that the optimal data-dtype can be allocated. If None, will be set automatically if the data was written from pandas.

filters: list of tuples

Filters out (i.e., does not read) some of the row-groups; this is not row-level filtering. Filter syntax: [(column, op, val), …], where op is one of ==, >, >=, <, <=, !=, in, not in.

index: string or list of strings or False or None

Column(s) to assign to the (multi-)index. If None, the index is inferred from the metadata (if this was originally pandas data); if the metadata does not exist or index is False, the index is a simple sequential range of integers.

Returns:
Pandas data-frame
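
An illustrative sketch of these options together (the path, column names and filter values are hypothetical):

import fastparquet

pf = fastparquet.ParquetFile('mydata/_metadata')

# Load two columns, skip row-groups that cannot contain year >= 2020
# (row-group level filtering, not row level), and use a plain integer index
df = pf.to_pandas(columns=['name', 'value'],
                  filters=[('year', '>=', 2020)],
                  index=False)

# Request a categorical column, hinting roughly 100 distinct values
df = pf.to_pandas(categories={'name': 100})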
ParquetFile.iter_row_groups(columns=None, categories=None, filters=[], index=None)[source]

Iterate over row-groups, reading each into a Pandas dataframe.

Parameters:
columns: list of names or `None`

Columns to load (see ParquetFile.columns). Any columns in the data not in this list will be ignored. If None, read all columns.

categories: list, dict or `None`

If a column is encoded using dictionary encoding in every row-group and its name is also in this list, it will generate a Pandas Category-type column, potentially saving memory and time. If a dict {col: int}, the value indicates the number of categories, so that the optimal data-dtype can be allocated. If None, will be set automatically if the data was written by fastparquet.

filters: list of tuples

Filters out (i.e., does not read) some of the row-groups; this is not row-level filtering. Filter syntax: [(column, op, val), …], where op is one of ==, >, >=, <, <=, !=, in, not in.

index: string or list of strings or False or None

Column(s) to assign to the (multi-)index. If None, the index is inferred from the metadata (if this was originally pandas data); if the metadata does not exist or index is False, the index is a simple sequential range of integers.

assign: dict {cols: array}

Pre-allocated memory to write to. If None, will allocate memory here.

Returns:
Generator yielding one Pandas data-frame per row-group
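
A minimal sketch of streaming over row-groups to aggregate without loading the whole data-set at once (the path and column name are hypothetical):

import fastparquet

pf = fastparquet.ParquetFile('mydata/_metadata')

total = 0
for df in pf.iter_row_groups(columns=['value']):
    # each df contains exactly one row-group
    total += df['value'].sum()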
fastparquet.write(filename, data, row_group_offsets=50000000, compression=None, file_scheme='simple', open_with=<function default_open>, mkdirs=<function default_mkdirs>, has_nulls=True, write_index=None, partition_on=[], fixed_text=None, append=False, object_encoding='infer', times='int64')[source]

Write a Pandas DataFrame to filename in Parquet format

Parameters:
filename: string

Parquet collection to write to, either a single file (if file_scheme is simple) or a directory containing the metadata and data-files.

data: pandas dataframe

The table to write

row_group_offsets: int or list of ints

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; if a list, the explicit index values to start new row groups.

compression: str, dict

Compression to apply to each column, e.g., GZIP or SNAPPY, or a dict like {"col1": "SNAPPY", "col2": None} to specify per-column compression types. In both cases, the compressor settings would be the underlying compressor defaults. To pass arguments to the underlying compressor, each dict entry should itself be a dictionary:

{
    "col1": {
        "type": "LZ4",
        "args": {
            "compression_level": 6,
            "content_checksum": True
        }
    },
    "col2": {
        "type": "SNAPPY",
        "args": None
    },
    "_default": {
        "type": "GZIP",
        "args": None
    }
}

where "type" specifies the compression type to use, and "args" specifies a dict that will be turned into keyword arguments for the compressor. If the dictionary contains a “_default” entry, this will be used for any columns not explicitly specified in the dictionary.

file_scheme: ‘simple’|’hive’

If simple: everything goes in a single file. If hive: each row group is in a separate file, and a separate file (called “_metadata”) contains the metadata.

open_with: function

When called as f(path, mode), returns an open file-like object

mkdirs: function

When called with a path/URL, creates any necessary directories to make that location writable, e.g., os.makedirs. This is not necessary if using the simple file scheme.

has_nulls: bool, ‘infer’ or list of strings

Whether columns can have nulls. If a list of strings, those given columns will be marked as “optional” in the metadata, and include null definition blocks on disk. Some data types (floats and times) can instead use the sentinel values NaN and NaT, which are not the same as NULL in parquet, but functionally act the same in many cases, particularly if converting back to pandas later. A value of ‘infer’ will assume nulls for object columns and no nulls otherwise.

write_index: boolean

Whether or not to write the index to a separate column. By default we write the index if it is not 0, 1, …, n.

partition_on: list of column names

Passed to groupby in order to split data within each row-group, producing a structured directory tree. Note: as with pandas, null values will be dropped. Ignored if file_scheme is simple.

fixed_text: {column: int length} or None

For bytes or str columns, values will be converted to fixed-length strings of the given length for the given columns before writing, potentially providing a large speed boost. The length applies to the binary representation after conversion for utf8, json or bson.

append: bool (False)

If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

object_encoding: str or {col: type}

For object columns, this gives the data type, so that the values can be encoded to bytes. Possible values are bytes|utf8|json|bson|bool|int|int32, where bytes is assumed if not specified (i.e., no conversion). The special value ‘infer’ will cause the type to be guessed from the first ten non-null values.

times: ‘int64’ (default), or ‘int96’:

In “int64” mode, datetimes are written as 8-byte integers with microsecond resolution; in “int96” mode, they are written as 12-byte blocks, with the first 8 bytes giving nanoseconds within the day and the next 4 bytes the Julian day. ‘int96’ mode is included only for compatibility.

Examples

>>> fastparquet.write('myfile.parquet', df)  # doctest: +SKIP
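
A slightly fuller, illustrative sketch: writing a hive-scheme data-set with directory partitioning and per-column compression, then appending a further row-group. The dataframe, column names and output directory are hypothetical.

>>> import pandas as pd  # doctest: +SKIP
>>> df = pd.DataFrame({'year': [2020, 2021], 'value': [1.5, 2.5]})  # doctest: +SKIP
>>> fastparquet.write('mydata', df, file_scheme='hive',
...                   partition_on=['year'],
...                   compression={'value': 'GZIP'})  # doctest: +SKIP
>>> fastparquet.write('mydata', df, file_scheme='hive',
...                   partition_on=['year'], append=True)  # doctest: +SKIP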