API

`ParquetFile`(fn[, verify, open_with, root, ...])	The metadata of a parquet file or collection
`ParquetFile.to_pandas`([columns, categories, ...])	Read data from parquet into a Pandas dataframe.
`ParquetFile.iter_row_groups`([filters])	Iterate a dataset by row-groups
`ParquetFile.info`	Dataset summary
`ParquetFile.head`(nrows, **kwargs)	Get the first nrows of data
`ParquetFile.count`([filters, row_filter])	Total number of rows
`ParquetFile.__getitem__`(item)	Select among the row-groups using integer/slicing
`write`(filename, data[, row_group_offsets, ...])	Write pandas dataframe to filename with parquet format.

class fastparquet.ParquetFile(fn, verify=False, open_with=<built-in function open>, root=False, sep=None, fs=None, pandas_nulls=True, dtypes=None)

The metadata of a parquet file or collection

Reads the metadata (row-groups and schema definition) and provides methods to extract the data from the files.

Note that when reading parquet files partitioned using directories (i.e. using the hive/drill scheme), an attempt is made to coerce the partition values to a number, datetime or timedelta. Fastparquet cannot read a hive/drill parquet file with partition names which coerce to the same value, such as “0.7” and “.7”.
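The collision described above can be illustrated with a small sketch. The coercion helper below is a simplified, hypothetical stand-in (it only tries int and float, not datetime or timedelta) and is not fastparquet's own code; it only shows why “0.7” and “.7” cannot coexist as partition names.

```python
def coerce_partition_value(text):
    """Simplified, hypothetical coercion: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def find_collisions(names):
    """Group partition name strings by coerced value; keep groups with more than one name."""
    groups = {}
    for name in names:
        groups.setdefault(coerce_partition_value(name), []).append(name)
    return {value: group for value, group in groups.items() if len(group) > 1}


# "0.7" and ".7" are distinct directory names but coerce to the same float.
collisions = find_collisions(["0.7", ".7", "1.5", "abc"])
print(collisions)
```

Running a check like this over the directory names before writing a partitioned dataset is one way to catch such names early.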

Parameters:

fn: path/URL string or list of paths: Location of the data. If a directory, will attempt to read a file “_metadata” within that directory. If a list of paths, will assume that they make up a single parquet data set. This parameter can also be any file-like object, in which case this must be a single-file dataset.
verify: bool [False]: test file start/end byte markers
open_with: function: With the signature func(path, mode), returns a context which evaluates to a file open for reading. Defaults to the built-in open.
root: str: If passing a list of files, the top directory of the data-set may be ambiguous for partitioning where the upmost field has only one value. Use this to specify the dataset root directory, if required.
fs: fsspec-compatible filesystem: You can use this instead of open_with (otherwise, it will be inferred)
pandas_nulls: bool (True): If True, columns that are int or bool in parquet, but have nulls, will become pandas nullable types (UInt, Int, boolean). If False (the only behaviour prior to v0.7.0), both kinds will be cast to float, and nulls will be NaN unless pandas metadata indicates that the original datatypes were nullable. Pandas nullable types were introduced in v1.0.0, but were still marked as experimental in v1.3.0.

Attributes:

fn: path/URL: Of ‘_metadata’ file.
basepath: path/URL: Of directory containing files of parquet dataset.
cats: dict: Columns derived from hive/drill directory information, with known values for each column.
categories: list: Columns marked as categorical in the extra metadata (meaning the data must have come from pandas).
columns: list of str: The data columns available
count: int: Total number of rows
dtypes: dict: Expected output types for each column
file_scheme: str: ‘simple’: all row groups are within the same file; ‘hive’: all row groups are in other files; ‘mixed’: row groups in this file and others too; ‘empty’: no row groups at all.
info: dict: Combination of some of the other attributes
key_value_metadata: dict: Additional information about this data’s origin, e.g., pandas description, and custom metadata defined by user.
row_groups: list: Thrift objects for each row group
schema: schema.SchemaHelper: print this for a representation of the column structure
selfmade: bool: If this file was created by fastparquet
statistics: dict: Max/min/count of each column chunk
fs: fsspec-compatible filesystem: You can use this instead of open_with (otherwise, it will be inferred)

Methods

`count`([filters, row_filter])	Total number of rows
`head`(nrows, **kwargs)	Get the first nrows of data
`iter_row_groups`([filters])	Iterate a dataset by row-groups
`read_row_group_file`(rg, columns, categories)	Open file for reading, and process it as a row-group
`remove_row_groups`(rgs[, sort_pnames, ...])	Remove list of row groups from disk.
`to_pandas`([columns, categories, filters, ...])	Read data from parquet into a Pandas dataframe.
`write_row_groups`(data[, row_group_offsets, ...])	Write data as new row groups to disk, with optional sorting.

check_categories
pre_allocate
row_group_filename

property columns

Column names

count(filters=None, row_filter=False)

Total number of rows

filters and row_filter have the same meaning as in to_pandas. Unless both are given, this method will not need to decode any data.

head(nrows, **kwargs)

Get the first nrows of data

This will load the whole of the first valid row-group for the given columns.

kwargs can include things like columns, filters, etc., with the same semantics as to_pandas(). If filters are applied, it may happen that data is so reduced that ‘nrows’ is not ensured (fewer rows).

returns: dataframe

property info

Dataset summary

iter_row_groups(filters=None, **kwargs)

Iterate a dataset by row-groups

If filters is given, omits row-groups that fail the filter (saving execution time).

Returns:

Generator yielding one Pandas data-frame per row-group.
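As a conceptual sketch of how whole row-groups can be omitted, the snippet below checks a single filter tuple against per-row-group min/max values. The statistics layout and the helper name are hypothetical and chosen for illustration only; this is not fastparquet's implementation.

```python
def row_group_may_match(stats, column, op, value):
    """Return False only when min/max statistics prove that no row can pass the filter."""
    if column not in stats:
        return True  # no statistics available: the row-group cannot be excluded
    low, high = stats[column]["min"], stats[column]["max"]
    if op in ("==", "="):
        return low <= value <= high
    if op == ">":
        return high > value
    if op == ">=":
        return high >= value
    if op == "<":
        return low < value
    if op == "<=":
        return low <= value
    return True  # other operators: keep the row-group to stay on the safe side


# Hypothetical statistics for three row-groups of a column "x".
row_group_stats = [
    {"x": {"min": 0, "max": 9}},
    {"x": {"min": 10, "max": 19}},
    {"x": {"min": 20, "max": 29}},
]

# Only row-groups that might contain x > 12 would need to be read.
kept = [i for i, s in enumerate(row_group_stats) if row_group_may_match(s, "x", ">", 12)]
print(kept)
```

A row-group that survives this check may still contain rows that do not satisfy the filter, which is why filtering at this level is coarser than row-wise filtering.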

read_row_group_file(rg, columns, categories, index=None, assign=None, partition_meta=None, row_filter=False, infile=None)

Open file for reading, and process it as a row-group

assign is None if this method is called directly (not from to_pandas), in which case we return the resultant dataframe

row_filter can be:

False (don’t do row filtering)

a list of filters (do filtering here for this one row-group; only makes sense if assign=None)

bool array with a size equal to the number of rows in this group and the length of the assign arrays

remove_row_groups(rgs, sort_pnames: bool = False, write_fmd: bool = True, open_with=<built-in function open>, remove_with=None)

Remove list of row groups from disk. ParquetFile metadata are updated accordingly. This method cannot be applied if file scheme is simple.

to_pandas(columns=None, categories=None, filters=[], index=None, row_filter=False, dtypes=None)

Read data from parquet into a Pandas dataframe.

Parameters:

columns: list of names or `None`: Columns to load (see ParquetFile.columns). Any columns in the data not in this list will be ignored. If None, read all columns.
categories: list, dict or `None`: If a column is encoded using dictionary encoding in every row-group and its name is also in this list, it will generate a Pandas Category-type column, potentially saving memory and time. If a dict {col: int}, the value indicates the number of categories, so that the optimal data-dtype can be allocated. If None, will automatically set if the data was written from pandas.
filters: list of list of tuples or list of tuples: To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, =, >, >=, <, <=, !=, in, not in] The innermost tuples are transposed into a set of filters applied through an AND operation. The outer list combines these sets of filters through an OR operation. A single list of tuples can also be used, meaning that no OR operation between set of filters is to be conducted.
index: string or list of strings or False or None: Column(s) to assign to the (multi-)index. If None, index is inferred from the metadata (if this was originally pandas data); if the metadata does not exist or index is False, index is simple sequential integers.
row_filter: bool or boolean ndarray: Whether filters are applied to whole row-groups (False, default) or row-wise (True, experimental). The latter requires two passes of any row group that may contain valid rows, but can be much more memory-efficient, especially if the filter columns are not required in the output. If a boolean array, it is applied as a custom row filter. In this case, the ‘filters’ parameter is ignored, and the length of the array has to be equal to the total number of rows.

Returns:

Pandas data-frame
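The AND/OR structure of the filters argument can be sketched in plain Python. The evaluator below applies the filter syntax row by row to dictionaries, which corresponds to the row-wise reading of the filters; the function name and sample rows are made up for illustration and this is not how fastparquet evaluates filters internally.

```python
import operator

# Operators listed in the filter syntax, mapped to Python callables.
OPS = {
    "==": operator.eq, "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def row_matches(row, filters):
    """Inner tuples are AND-ed together; the outer list OR-s the groups."""
    if not filters:
        return True
    if isinstance(filters[0], tuple):
        filters = [filters]  # a single list of tuples means one AND group
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in filters
    )


rows = [{"x": 1, "y": "a"}, {"x": 5, "y": "b"}, {"x": 9, "y": "c"}]

# (x > 3 AND y in ["b"]) OR (x == 1)
or_filters = [[("x", ">", 3), ("y", "in", ["b"])], [("x", "==", 1)]]
selected_or = [r["x"] for r in rows if row_matches(r, or_filters)]

# x > 3 AND y != "b"
and_filters = [("x", ">", 3), ("y", "!=", "b")]
selected_and = [r["x"] for r in rows if row_matches(r, and_filters)]

print(selected_or, selected_and)
```

With row_filter left at its default of False, the same expressions are applied to whole row-groups instead, so rows that fail the filter can still appear in the output.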

write_row_groups(data, row_group_offsets=None, sort_key=None, sort_pnames: bool = False, compression=None, write_fmd: bool = True, open_with=<built-in function open>, mkdirs=None, stats='auto')

Write data as new row groups to disk, with optional sorting.

Parameters:

data: pandas dataframe or iterable of pandas dataframe: Data to add to existing parquet dataset. Only columns are written to disk. Row index is not kept. If a dataframe, columns are checked against parquet file schema.
row_group_offsets: int or list of int: If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; If a list, the explicit index values to start new row groups; If None, set to 50_000_000.
sort_key: function, default None: Sorting function used as key parameter for row_groups.sort() function. This function is called once new row groups have been added to list of existing ones. If not provided, new row groups are only appended to existing ones and the updated list of row groups is not sorted.
sort_pnames: bool, default False: Align name of part files with position of the 1st row group they contain. Only used if file_scheme of parquet file is set to hive or drill.
compression: str or dict, default None: Compression to apply to each column, e.g. GZIP or SNAPPY or a dict like {"col1": "SNAPPY", "col2": None} to specify per column compression types. By default, do not compress. Please review the full description of this parameter in the write docstring.
write_fmd: bool, default True: Write updated common metadata to disk.
open_with: function: When called as f(path, mode), returns an open file-like object.
mkdirs: function: When called with a path/URL, creates any necessary directories to make that location writable, e.g., os.makedirs. This is not necessary if using the simple file scheme.
stats: True|False|list of str|“auto”: Whether to calculate and write summary statistics. If True, do it for every column; if False, never do; if a list of str, do it only for those specified columns. “auto” (the default in the signature above) means True for any int/float or timestamp column, False otherwise.
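To make the row_group_offsets rule concrete, here is a minimal sketch of one way an integer target size could be turned into evenly sized row-group start positions. The helper name and the exact rounding are assumptions made for illustration; the library's own arithmetic may differ in detail.

```python
def row_group_starts(n_rows, row_group_offsets=None):
    """Return the row index at which each row-group starts (illustrative only)."""
    if row_group_offsets is None:
        row_group_offsets = 50_000_000  # documented default
    if isinstance(row_group_offsets, int):
        # Number of groups needed so that none exceeds the requested size ...
        n_groups = max((n_rows - 1) // row_group_offsets + 1, 1)
        # ... then spread the rows evenly, which rounds the group size down.
        chunk = max((n_rows - 1) // n_groups + 1, 1)
        return list(range(0, n_rows, chunk))
    # A list is taken as the explicit start positions.
    return list(row_group_offsets)


print(row_group_starts(10, 6))          # a target of 6 rows becomes two groups of 5
print(row_group_starts(10, [0, 3, 7]))  # explicit starts are used as given
```

In this sketch a request for 6 rows per group on a 10-row table yields two groups of 5 rather than one of 6 and one of 4, which is the "about the same size" behaviour the parameter description refers to.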

fastparquet.write(filename, data, row_group_offsets=None, compression=None, file_scheme='simple', open_with=<built-in function open>, mkdirs=None, has_nulls=True, write_index=None, partition_on=[], fixed_text=None, append=False, object_encoding='infer', times='int64', custom_metadata=None, stats='auto')

Write pandas dataframe to filename with parquet format.

Parameters:

filename: str or pathlib.Path

Parquet collection to write to, either a single file (if file_scheme is simple) or a directory containing the metadata and data-files.

data: pandas dataframe

The table to write.

row_group_offsets: int or list of int

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; If a list, the explicit index values to start new row groups; If None, set to 50_000_000. When the data is partitioned, the final row-group size can be reduced significantly further by the partitioning, which occurs as a subsequent step.

compression: str, dict

compression to apply to each column, e.g. GZIP or SNAPPY or a dict like {"col1": "SNAPPY", "col2": None} to specify per column compression types. In both cases, the compressor settings would be the underlying compressor defaults. To pass arguments to the underlying compressor, each dict entry should itself be a dictionary:

{
    "col1": {
        "type": "LZ4",
        "args": {
            "mode": "high_compression",
            "compression": 9
        }
    },
    "col2": {
        "type": "SNAPPY",
        "args": None
    },
    "_default": {
        "type": "GZIP",
        "args": None
    }
}

where "type" specifies the compression type to use, and "args" specifies a dict that will be turned into keyword arguments for the compressor. If the dictionary contains a “_default” entry, this will be used for any columns not explicitly specified in the dictionary.

file_scheme: ‘simple’|’hive’|’drill’

If simple: all goes in a single file. If hive or drill: each row group is in a separate file, and a separate file (called “_metadata”) contains the metadata.

open_with: function

When called as f(path, mode), returns an open file-like object.

mkdirs: function

When called with a path/URL, creates any necessary directories to make that location writable, e.g., os.makedirs. This is not necessary if using the simple file scheme.

has_nulls: bool, ‘infer’ or list of strings

Whether columns can have nulls. If a list of strings, those given columns will be marked as “optional” in the metadata, and include null definition blocks on disk. Some data types (floats and times) can instead use the sentinel values NaN and NaT, which are not the same as NULL in parquet, but functionally act the same in many cases, particularly if converting back to pandas later. A value of ‘infer’ will assume nulls for object columns and not otherwise. Ignored if appending to an existing parquet data-set.

write_index: boolean

Whether or not to write the index to a separate column. By default we write the index if it is not 0, 1, …, n. Ignored if appending to an existing parquet data-set.

partition_on: string or list of string

Column names passed to groupby in order to split data within each row-group, producing a structured directory tree. Note: as with pandas, null values will be dropped. Ignored if file_scheme is simple. Checked when appending to an existing parquet dataset that requested partition column names match those of existing parquet data-set.

fixed_text: {column: int length} or None

For bytes or str columns, values will be converted to fixed-length strings of the given length for the given columns before writing, potentially providing a large speed boost. The length applies to the binary representation after conversion for utf8, json or bson. Ignored if appending to an existing parquet dataset.

append: bool (False) or ‘overwrite’

If False, construct data-set from scratch; if True, add new row-group(s) to existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

If ‘overwrite’, existing partitions will be replaced in-place, where the given data has any rows within a given partition. To use this, the existing dataset had to be written with these other parameters set to specific values, or will raise ValueError:

file_scheme='hive'

partition_on set to at least one column name.

object_encoding: str or {col: type}

For object columns, this gives the data type, so that the values can be encoded to bytes. Possible values are bytes|utf8|json|bson|bool|int|int32|decimal, where bytes is assumed if not specified (i.e., no conversion). The special value ‘infer’ will cause the type to be guessed from the first ten non-null values. The decimal.Decimal type is a valid choice, but will result in float encoding with possible loss of accuracy. Ignored if appending to an existing parquet data-set.

times: ‘int64’ (default), or ‘int96’

In “int64” mode, datetimes are written as 8-byte integers, us resolution; in “int96” mode, they are written as 12-byte blocks, with the first 8 bytes as ns within the day, the next 4 bytes the julian day. ‘int96’ mode is included only for compatibility. Ignored if appending to an existing parquet data-set.

custom_metadata: dict

Key-value metadata to write. Ignored if appending to an existing parquet data-set.

stats: True|False|list(str)|”auto”

Whether to calculate and write summary statistics. If True, do it for every column; if False, never do; and if a list of str, do it only for those specified columns. “auto” (default) means True for any int/float or timestamp column, False otherwise.
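The two layouts named under times can be sketched with the standard struct module. This is an illustration of the byte layouts as described, not fastparquet's encoder. It assumes little-endian byte order, naive datetimes counted from 1970-01-01, and the Julian day number 2440588 for that date; the function names are made up for this example.

```python
import struct
from datetime import datetime

# Assumption: Julian day number of 1970-01-01, used to convert from Unix days.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
UNIX_EPOCH = datetime(1970, 1, 1)


def encode_int64_us(ts):
    """8-byte integer holding microseconds since the Unix epoch."""
    delta = ts - UNIX_EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return struct.pack("<q", micros)


def encode_int96(ts):
    """12-byte block: 8 bytes of nanoseconds within the day, then 4 bytes of Julian day."""
    delta = ts - UNIX_EPOCH
    nanos_in_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qi", nanos_in_day, julian_day)


stamp = datetime(1970, 1, 2, 0, 0, 1)
print(len(encode_int64_us(stamp)), len(encode_int96(stamp)))
```

The int96 form spends four extra bytes per value on the day number, which is part of why it is kept only for compatibility.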

Examples

>>> fastparquet.write('myfile.parquet', df)  
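The per-column compression specification accepted by write can be explored without writing any file. The helper below is a hypothetical sketch of how a spec with a “_default” entry resolves to a (type, args) pair for each column; it follows the parameter description above and is not the library's own resolution code.

```python
def resolve_compression(spec, column):
    """Return (type, args) for a column under the documented spec forms (illustrative)."""
    if spec is None or isinstance(spec, str):
        return spec, None  # one setting for every column, compressor defaults
    # An explicit entry wins, even if it is None; otherwise fall back to "_default".
    entry = spec.get(column, spec.get("_default"))
    if isinstance(entry, dict):
        return entry["type"], entry.get("args")
    return entry, None


compression = {
    "col1": {"type": "LZ4", "args": {"mode": "high_compression", "compression": 9}},
    "col2": {"type": "SNAPPY", "args": None},
    "_default": {"type": "GZIP", "args": None},
}

for name in ("col1", "col2", "col3"):
    print(name, resolve_compression(compression, name))
```

Here col3 is not listed, so it picks up the “_default” entry, while col1 carries its own keyword arguments for the compressor.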

fastparquet.update_file_custom_metadata(path: str, custom_metadata: dict, is_metadata_file: bool = None)

Update metadata in a file without rewriting the data portion, if it is a data file.

This function updates only the user key-value metadata, not the whole metadata of a parquet file. The update strategy depends on whether a key in the new custom metadata already exists in the custom metadata stored in the thrift object, and on its value.

If not found in existing, it is added.

If found in existing, it is updated.

If its value is None, it is not added, and if found in existing, it is removed from existing.
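The three rules above amount to a dictionary merge in which None acts as a deletion marker. The following sketch applies them to plain dicts; the function name is hypothetical and the real function operates on the thrift metadata inside the file rather than on a Python dict.

```python
def merge_custom_metadata(existing, new):
    """Apply the add / update / remove-on-None rules to a copy of `existing`."""
    merged = dict(existing)
    for key, value in new.items():
        if value is None:
            merged.pop(key, None)  # not added; removed if it was already present
        else:
            merged[key] = value    # added if missing, updated otherwise
    return merged


existing = {"owner": "alice", "stage": "raw"}
new = {"stage": "clean", "reviewed": "yes", "owner": None, "unused": None}

print(merge_custom_metadata(existing, new))
```

Keys absent from the new metadata are left untouched, so an update only needs to list the entries that change.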

Parameters:

path: str

Local path to file.

custom_metadata: dict

Key-value metadata to update in thrift object. The values must be strings or binary. To pass a dictionary, serialize it as a JSON string, then encode it to binary.

is_metadata_file: bool, default None

Define whether the target file is a pure metadata file or a parquet data file. If None, it is inferred from the file name:

if ending with ‘_metadata’, it assumes file is a metadata file.

otherwise, it assumes it is a parquet data file.
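Two of the parameter rules above can be sketched with the standard library: preparing a dictionary value as JSON-encoded bytes, and inferring is_metadata_file from the file name when it is left as None. Both helper names are made up for this example, and UTF-8 is an assumed encoding.

```python
import json


def encode_metadata_value(value):
    """Strings and bytes pass through; anything else is serialized to JSON bytes."""
    if isinstance(value, (str, bytes)):
        return value
    return json.dumps(value).encode("utf-8")  # assumption: UTF-8 encoding


def looks_like_metadata_file(path, is_metadata_file=None):
    """Mirror the documented default: infer from the file name when None."""
    if is_metadata_file is None:
        return path.endswith("_metadata")
    return is_metadata_file


custom_metadata = {
    "owner": "data-team",
    "settings": encode_metadata_value({"version": 2}),
}

print(custom_metadata)
print(looks_like_metadata_file("dataset/_metadata"))
print(looks_like_metadata_file("dataset/part.0.parquet"))
```

A dictionary prepared this way satisfies the requirement that every value be a string or binary before it is passed as custom_metadata.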

Notes

This method does not work for remote files.