API

Dataframe

`DataFrame`(dsk, name, meta, divisions)	Parallel Pandas DataFrame
`DataFrame.add`(self, other[, axis, level, …])	Get Addition of dataframe and other, element-wise (binary operator add).
`DataFrame.append`(self, other[, …])	Append rows of other to the end of caller, returning a new object.
`DataFrame.apply`(self, func[, axis, …])	Parallel version of pandas.DataFrame.apply
`DataFrame.assign`(self, **kwargs)	Assign new columns to a DataFrame.
`DataFrame.astype`(self, dtype)	Cast a pandas object to a specified dtype `dtype`.
`DataFrame.categorize`(df[, columns, index, …])	Convert columns of the DataFrame to category dtype.
`DataFrame.columns`
`DataFrame.compute`(self, **kwargs)	Compute this dask collection
`DataFrame.corr`(self[, method, min_periods, …])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrame.count`(self[, axis, split_every])	Count non-NA cells for each column or row.
`DataFrame.cov`(self[, min_periods, split_every])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrame.cummax`(self[, axis, skipna, out])	Return cumulative maximum over a DataFrame or Series axis.
`DataFrame.cummin`(self[, axis, skipna, out])	Return cumulative minimum over a DataFrame or Series axis.
`DataFrame.cumprod`(self[, axis, skipna, …])	Return cumulative product over a DataFrame or Series axis.
`DataFrame.cumsum`(self[, axis, skipna, …])	Return cumulative sum over a DataFrame or Series axis.
`DataFrame.describe`(self[, split_every, …])	Generate descriptive statistics.
`DataFrame.div`(self, other[, axis, level, …])	Get Floating division of dataframe and other, element-wise (binary operator truediv).
`DataFrame.drop`(self[, labels, axis, …])	Drop specified labels from rows or columns.
`DataFrame.drop_duplicates`(self[, subset, …])	Return DataFrame with duplicate rows removed.
`DataFrame.dropna`(self[, how, subset, thresh])	Remove missing values.
`DataFrame.dtypes`	Return data types
`DataFrame.explode`(self, column)	Transform each element of a list-like to a row, replicating index values.
`DataFrame.fillna`(self[, value, method, …])	Fill NA/NaN values using the specified method.
`DataFrame.floordiv`(self, other[, axis, …])	Get Integer division of dataframe and other, element-wise (binary operator floordiv).
`DataFrame.get_partition`(self, n)	Get a dask DataFrame/Series representing the nth partition.
`DataFrame.groupby`(self[, by])	Group DataFrame using a mapper or by a Series of columns.
`DataFrame.head`(self[, n, npartitions, compute])	First n rows of the dataset
`DataFrame.iloc`	Purely integer-location based indexing for selection by position.
`DataFrame.index`	Return dask Index instance
`DataFrame.isna`(self)	Detect missing values.
`DataFrame.isnull`(self)	Detect missing values.
`DataFrame.iterrows`(self)	Iterate over DataFrame rows as (index, Series) pairs.
`DataFrame.itertuples`(self[, index, name])	Iterate over DataFrame rows as namedtuples.
`DataFrame.join`(self, other[, on, how, …])	Join columns of another DataFrame.
`DataFrame.known_divisions`	Whether divisions are already known
`DataFrame.loc`	Purely label-location based indexer for selection by label.
`DataFrame.map_partitions`(self, func, *args, …)	Apply Python function on each DataFrame partition.
`DataFrame.mask`(self, cond[, other])	Replace values where the condition is True.
`DataFrame.max`(self[, axis, skipna, …])	Return the maximum of the values for the requested axis.
`DataFrame.mean`(self[, axis, skipna, …])	Return the mean of the values for the requested axis.
`DataFrame.memory_usage`(self[, index, deep])	Return the memory usage of each column in bytes.
`DataFrame.memory_usage_per_partition`(self[, …])	Return the memory usage of each partition
`DataFrame.merge`(self, right[, how, on, …])	Merge the DataFrame with another DataFrame
`DataFrame.min`(self[, axis, skipna, …])	Return the minimum of the values for the requested axis.
`DataFrame.mod`(self, other[, axis, level, …])	Get Modulo of dataframe and other, element-wise (binary operator mod).
`DataFrame.mul`(self, other[, axis, level, …])	Get Multiplication of dataframe and other, element-wise (binary operator mul).
`DataFrame.ndim`	Return dimensionality
`DataFrame.nlargest`(self[, n, columns, …])	Return the first n rows ordered by columns in descending order.
`DataFrame.npartitions`	Return number of partitions
`DataFrame.partitions`	Slice dataframe by partitions
`DataFrame.pop`(self, item)	Return item and drop from frame.
`DataFrame.pow`(self, other[, axis, level, …])	Get Exponential power of dataframe and other, element-wise (binary operator pow).
`DataFrame.prod`(self[, axis, skipna, …])	Return the product of the values for the requested axis.
`DataFrame.quantile`(self[, q, axis, method])	Approximate row-wise and precise column-wise quantiles of DataFrame
`DataFrame.query`(self, expr, **kwargs)	Filter dataframe with complex expression
`DataFrame.radd`(self, other[, axis, level, …])	Get Addition of dataframe and other, element-wise (binary operator radd).
`DataFrame.random_split`(self, frac[, …])	Pseudorandomly split dataframe into different pieces row-wise
`DataFrame.rdiv`(self, other[, axis, level, …])	Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
`DataFrame.rename`(self[, index, columns])	Alter axes labels.
`DataFrame.repartition`(self[, divisions, …])	Repartition dataframe along new divisions
`DataFrame.replace`(self[, to_replace, value, …])	Replace values given in to_replace with value.
`DataFrame.reset_index`(self[, drop])	Reset the index to the default index.
`DataFrame.rfloordiv`(self, other[, axis, …])	Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
`DataFrame.rmod`(self, other[, axis, level, …])	Get Modulo of dataframe and other, element-wise (binary operator rmod).
`DataFrame.rmul`(self, other[, axis, level, …])	Get Multiplication of dataframe and other, element-wise (binary operator rmul).
`DataFrame.rpow`(self, other[, axis, level, …])	Get Exponential power of dataframe and other, element-wise (binary operator rpow).
`DataFrame.rsub`(self, other[, axis, level, …])	Get Subtraction of dataframe and other, element-wise (binary operator rsub).
`DataFrame.rtruediv`(self, other[, axis, …])	Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
`DataFrame.sample`(self[, n, frac, replace, …])	Random sample of items
`DataFrame.set_index`(self, other[, drop, …])	Set the DataFrame index (row labels) using an existing column.
`DataFrame.shape`	Return a tuple representing the dimensionality of the DataFrame.
`DataFrame.std`(self[, axis, skipna, ddof, …])	Return sample standard deviation over requested axis.
`DataFrame.sub`(self, other[, axis, level, …])	Get Subtraction of dataframe and other, element-wise (binary operator sub).
`DataFrame.sum`(self[, axis, skipna, …])	Return the sum of the values for the requested axis.
`DataFrame.tail`(self[, n, compute])	Last n rows of the dataset
`DataFrame.to_bag`(self[, index])	Create Dask Bag from a Dask DataFrame
`DataFrame.to_csv`(self, filename, **kwargs)	Store Dask DataFrame to CSV files
`DataFrame.to_dask_array`(self[, lengths])	Convert a dask DataFrame to a dask array.
`DataFrame.to_delayed`(self[, optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`DataFrame.to_hdf`(self, path_or_buf, key[, …])	Store Dask Dataframe to Hierarchical Data Format (HDF) files
`DataFrame.to_json`(self, filename, *args, …)	See dd.to_json docstring for more information
`DataFrame.to_parquet`(self, path, *args, …)	Store Dask.dataframe to Parquet files
`DataFrame.to_records`(self[, index, lengths])	Create Dask Array from a Dask Dataframe
`DataFrame.to_sql`(self, name, uri[, schema, …])	See dd.to_sql docstring for more information
`DataFrame.truediv`(self, other[, axis, …])	Get Floating division of dataframe and other, element-wise (binary operator truediv).
`DataFrame.values`	Return a dask.array of the values of this dataframe
`DataFrame.var`(self[, axis, skipna, ddof, …])	Return unbiased variance over requested axis.
`DataFrame.visualize`(self[, filename, …])	Render the computation of this object’s task graph using graphviz.
`DataFrame.where`(self, cond[, other])	Replace values where the condition is False.

Series

`Series`(dsk, name, meta, divisions)	Parallel Pandas Series
`Series.add`(self, other[, level, fill_value, …])	Return Addition of series and other, element-wise (binary operator add).
`Series.align`(self, other[, join, axis, …])	Align two objects on their axes with the specified join method.
`Series.all`(self[, axis, skipna, …])	Return whether all elements are True, potentially over an axis.
`Series.any`(self[, axis, skipna, …])	Return whether any element is True, potentially over an axis.
`Series.append`(self, other[, …])	Concatenate two or more Series.
`Series.apply`(self, func[, convert_dtype, …])	Parallel version of pandas.Series.apply
`Series.astype`(self, dtype)	Cast a pandas object to a specified dtype `dtype`.
`Series.autocorr`(self[, lag, split_every])	Compute the lag-N autocorrelation.
`Series.between`(self, left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Series.bfill`(self[, axis, limit])	Synonym for `DataFrame.fillna()` with `method='bfill'`.
`Series.cat`
`Series.clear_divisions`(self)	Forget division information
`Series.clip`(self[, lower, upper, out])	Trim values at input threshold(s).
`Series.clip_lower`(self, threshold)
`Series.clip_upper`(self, threshold)
`Series.compute`(self, **kwargs)	Compute this dask collection
`Series.copy`(self)	Make a copy of the dataframe
`Series.corr`(self, other[, method, …])	Compute correlation with other Series, excluding missing values.
`Series.count`(self[, split_every])	Return number of non-NA/null observations in the Series.
`Series.cov`(self, other[, min_periods, …])	Compute covariance with Series, excluding missing values.
`Series.cummax`(self[, axis, skipna, out])	Return cumulative maximum over a DataFrame or Series axis.
`Series.cummin`(self[, axis, skipna, out])	Return cumulative minimum over a DataFrame or Series axis.
`Series.cumprod`(self[, axis, skipna, dtype, out])	Return cumulative product over a DataFrame or Series axis.
`Series.cumsum`(self[, axis, skipna, dtype, out])	Return cumulative sum over a DataFrame or Series axis.
`Series.describe`(self[, split_every, …])	Generate descriptive statistics.
`Series.diff`(self[, periods, axis])	First discrete difference of element.
`Series.div`(self, other[, level, fill_value, …])	Return Floating division of series and other, element-wise (binary operator truediv).
`Series.drop_duplicates`(self[, subset, …])	Return DataFrame with duplicate rows removed.
`Series.dropna`(self)	Return a new Series with missing values removed.
`Series.dt`	Namespace of datetime methods
`Series.dtype`	Return data type
`Series.eq`(self, other[, level, fill_value, axis])	Return Equal to of series and other, element-wise (binary operator eq).
`Series.explode`(self)	Transform each element of a list-like to a row, replicating the index values.
`Series.ffill`(self[, axis, limit])	Synonym for `DataFrame.fillna()` with `method='ffill'`.
`Series.fillna`(self[, value, method, limit, axis])	Fill NA/NaN values using the specified method.
`Series.first`(self, offset)	Method to subset initial periods of time series data based on a date offset.
`Series.floordiv`(self, other[, level, …])	Return Integer division of series and other, element-wise (binary operator floordiv).
`Series.ge`(self, other[, level, fill_value, axis])	Return Greater than or equal to of series and other, element-wise (binary operator ge).
`Series.get_partition`(self, n)	Get a dask DataFrame/Series representing the nth partition.
`Series.groupby`(self[, by])	Group Series using a mapper or by a Series of columns.
`Series.gt`(self, other[, level, fill_value, axis])	Return Greater than of series and other, element-wise (binary operator gt).
`Series.head`(self[, n, npartitions, compute])	First n rows of the dataset
`Series.idxmax`(self[, axis, skipna, split_every])	Return index of first occurrence of maximum over requested axis.
`Series.idxmin`(self[, axis, skipna, split_every])	Return index of first occurrence of minimum over requested axis.
`Series.isin`(self, values)	Check whether values are contained in Series.
`Series.isna`(self)	Detect missing values.
`Series.isnull`(self)	Detect missing values.
`Series.iteritems`(self)	Lazily iterate over (index, value) tuples.
`Series.known_divisions`	Whether divisions are already known
`Series.last`(self, offset)	Method to subset final periods of time series data based on a date offset.
`Series.le`(self, other[, level, fill_value, axis])	Return Less than or equal to of series and other, element-wise (binary operator le).
`Series.loc`	Purely label-location based indexer for selection by label.
`Series.lt`(self, other[, level, fill_value, axis])	Return Less than of series and other, element-wise (binary operator lt).
`Series.map`(self, arg[, na_action, meta])	Map values of Series according to input correspondence.
`Series.map_overlap`(self, func, before, …)	Apply a function to each partition, sharing rows with adjacent partitions.
`Series.map_partitions`(self, func, *args, …)	Apply Python function on each DataFrame partition.
`Series.mask`(self, cond[, other])	Replace values where the condition is True.
`Series.max`(self[, axis, skipna, …])	Return the maximum of the values for the requested axis.
`Series.mean`(self[, axis, skipna, …])	Return the mean of the values for the requested axis.
`Series.memory_usage`(self[, index, deep])	Return the memory usage of the Series.
`Series.memory_usage_per_partition`(self[, …])	Return the memory usage of each partition
`Series.min`(self[, axis, skipna, …])	Return the minimum of the values for the requested axis.
`Series.mod`(self, other[, level, fill_value, …])	Return Modulo of series and other, element-wise (binary operator mod).
`Series.mul`(self, other[, level, fill_value, …])	Return Multiplication of series and other, element-wise (binary operator mul).
`Series.nbytes`	Number of bytes
`Series.ndim`	Return dimensionality
`Series.ne`(self, other[, level, fill_value, axis])	Return Not equal to of series and other, element-wise (binary operator ne).
`Series.nlargest`(self[, n, split_every])	Return the largest n elements.
`Series.notnull`(self)	Detect existing (non-missing) values.
`Series.nsmallest`(self[, n, split_every])	Return the smallest n elements.
`Series.nunique`(self[, split_every])	Return number of unique elements in the object.
`Series.nunique_approx`(self[, split_every])	Approximate number of unique rows.
`Series.persist`(self, **kwargs)	Persist this dask collection into memory
`Series.pipe`(self, func, *args, **kwargs)	Apply func(self, *args, **kwargs).
`Series.pow`(self, other[, level, fill_value, …])	Return Exponential power of series and other, element-wise (binary operator pow).
`Series.prod`(self[, axis, skipna, …])	Return the product of the values for the requested axis.
`Series.quantile`(self[, q, method])	Approximate quantiles of Series
`Series.radd`(self, other[, level, …])	Return Addition of series and other, element-wise (binary operator radd).
`Series.random_split`(self, frac[, …])	Pseudorandomly split dataframe into different pieces row-wise
`Series.rdiv`(self, other[, level, …])	Return Floating division of series and other, element-wise (binary operator rtruediv).
`Series.reduction`(self, chunk[, aggregate, …])	Generic row-wise reductions.
`Series.repartition`(self[, divisions, …])	Repartition dataframe along new divisions
`Series.replace`(self[, to_replace, value, regex])	Replace values given in to_replace with value.
`Series.rename`(self[, index, inplace, …])	Alter Series index labels or name
`Series.resample`(self, rule[, closed, label])	Resample time-series data.
`Series.reset_index`(self[, drop])	Reset the index to the default index.
`Series.rolling`(self, window[, min_periods, …])	Provides rolling transformations.
`Series.round`(self[, decimals])	Round each value in a Series to the given number of decimals.
`Series.sample`(self[, n, frac, replace, …])	Random sample of items
`Series.sem`(self[, axis, skipna, ddof, …])	Return unbiased standard error of the mean over requested axis.
`Series.shape`	Return a tuple representing the dimensionality of a Series.
`Series.shift`(self[, periods, freq, axis])	Shift index by desired number of periods with an optional time freq.
`Series.size`	Size of the Series or DataFrame as a Delayed object.
`Series.std`(self[, axis, skipna, ddof, …])	Return sample standard deviation over requested axis.
`Series.str`	Namespace for string methods
`Series.sub`(self, other[, level, fill_value, …])	Return Subtraction of series and other, element-wise (binary operator sub).
`Series.sum`(self[, axis, skipna, …])	Return the sum of the values for the requested axis.
`Series.to_bag`(self[, index])	Create a Dask Bag from a Series
`Series.to_csv`(self, filename, **kwargs)	Store Dask DataFrame to CSV files
`Series.to_dask_array`(self[, lengths])	Convert a dask DataFrame to a dask array.
`Series.to_delayed`(self[, optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Series.to_frame`(self[, name])	Convert Series to DataFrame.
`Series.to_hdf`(self, path_or_buf, key[, …])	Store Dask Dataframe to Hierarchical Data Format (HDF) files
`Series.to_string`(self[, max_rows])	Render a string representation of the Series.
`Series.to_timestamp`(self[, freq, how, axis])	Cast to DatetimeIndex of timestamps, at beginning of period.
`Series.truediv`(self, other[, level, …])	Return Floating division of series and other, element-wise (binary operator truediv).
`Series.unique`(self[, split_every, split_out])	Return Series of unique values in the object.
`Series.value_counts`(self[, sort, ascending, …])	Return a Series containing counts of unique values.
`Series.values`	Return a dask.array of the values of this dataframe
`Series.var`(self[, axis, skipna, ddof, …])	Return unbiased variance over requested axis.
`Series.visualize`(self[, filename, format, …])	Render the computation of this object’s task graph using graphviz.
`Series.where`(self, cond[, other])	Replace values where the condition is False.

Groupby Operations

`DataFrameGroupBy.aggregate`(self, arg[, …])	Aggregate using one or more operations over the specified axis.
`DataFrameGroupBy.apply`(self, func, *args, …)	Parallel version of pandas GroupBy.apply
`DataFrameGroupBy.count`(self[, split_every, …])	Compute count of group, excluding missing values.
`DataFrameGroupBy.cumcount`(self[, axis])	Number each item in each group from 0 to the length of that group - 1.
`DataFrameGroupBy.cumprod`(self[, axis])	Cumulative product for each group.
`DataFrameGroupBy.cumsum`(self[, axis])	Cumulative sum for each group.
`DataFrameGroupBy.get_group`(self, key)	Construct DataFrame from group with provided name.
`DataFrameGroupBy.max`(self[, split_every, …])	Compute max of group values.
`DataFrameGroupBy.mean`(self[, split_every, …])	Compute mean of groups, excluding missing values.
`DataFrameGroupBy.min`(self[, split_every, …])	Compute min of group values.
`DataFrameGroupBy.size`(self[, split_every, …])	Compute group sizes.
`DataFrameGroupBy.std`(self[, ddof, …])	Compute standard deviation of groups, excluding missing values.
`DataFrameGroupBy.sum`(self[, split_every, …])	Compute sum of group values.
`DataFrameGroupBy.var`(self[, ddof, …])	Compute variance of groups, excluding missing values.
`DataFrameGroupBy.cov`(self[, ddof, …])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrameGroupBy.corr`(self[, ddof, …])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrameGroupBy.first`(self[, split_every, …])	Compute first of group values.
`DataFrameGroupBy.last`(self[, split_every, …])	Compute last of group values.
`DataFrameGroupBy.idxmin`(self[, split_every, …])	Return index of first occurrence of minimum over requested axis.
`DataFrameGroupBy.idxmax`(self[, split_every, …])	Return index of first occurrence of maximum over requested axis.

`SeriesGroupBy.aggregate`(self, arg[, …])	Aggregate using one or more operations over the specified axis.
`SeriesGroupBy.apply`(self, func, *args, …)	Parallel version of pandas GroupBy.apply
`SeriesGroupBy.count`(self[, split_every, …])	Compute count of group, excluding missing values.
`SeriesGroupBy.cumcount`(self[, axis])	Number each item in each group from 0 to the length of that group - 1.
`SeriesGroupBy.cumprod`(self[, axis])	Cumulative product for each group.
`SeriesGroupBy.cumsum`(self[, axis])	Cumulative sum for each group.
`SeriesGroupBy.get_group`(self, key)	Construct DataFrame from group with provided name.
`SeriesGroupBy.max`(self[, split_every, split_out])	Compute max of group values.
`SeriesGroupBy.mean`(self[, split_every, …])	Compute mean of groups, excluding missing values.
`SeriesGroupBy.min`(self[, split_every, split_out])	Compute min of group values.
`SeriesGroupBy.nunique`(self[, split_every, …])	Return number of unique elements in the group.
`SeriesGroupBy.size`(self[, split_every, …])	Compute group sizes.
`SeriesGroupBy.std`(self[, ddof, split_every, …])	Compute standard deviation of groups, excluding missing values.
`SeriesGroupBy.sum`(self[, split_every, …])	Compute sum of group values.
`SeriesGroupBy.var`(self[, ddof, split_every, …])	Compute variance of groups, excluding missing values.
`SeriesGroupBy.first`(self[, split_every, …])	Compute first of group values.
`SeriesGroupBy.last`(self[, split_every, …])	Compute last of group values.
`SeriesGroupBy.idxmin`(self[, split_every, …])	Return index of first occurrence of minimum over requested axis.
`SeriesGroupBy.idxmax`(self[, split_every, …])	Return index of first occurrence of maximum over requested axis.

`Aggregation`(name, chunk, agg[, finalize])	User defined groupby-aggregation.

Rolling Operations

`rolling.map_overlap`(func, df, before, after, …)	Apply a function to each partition, sharing rows with adjacent partitions.
`Series.rolling`(self, window[, min_periods, …])	Provides rolling transformations.
`DataFrame.rolling`(self, window[, …])	Provides rolling transformations.

`Rolling.apply`(self, func[, raw, engine, …])	The rolling function’s apply function.
`Rolling.count`(self)	The rolling count of any non-NaN observations inside the window.
`Rolling.kurt`(self)	Calculate unbiased rolling kurtosis.
`Rolling.max`(self)	Calculate the rolling maximum.
`Rolling.mean`(self)	Calculate the rolling mean of the values.
`Rolling.median`(self)	Calculate the rolling median.
`Rolling.min`(self)	Calculate the rolling minimum.
`Rolling.quantile`(self, quantile)	Calculate the rolling quantile.
`Rolling.skew`(self)	Unbiased rolling skewness.
`Rolling.std`(self[, ddof])	Calculate rolling standard deviation.
`Rolling.sum`(self)	Calculate rolling sum of given DataFrame or Series.
`Rolling.var`(self[, ddof])	Calculate unbiased rolling variance.

Create DataFrames

`read_csv`(urlpath[, blocksize, …])	Read CSV files into a Dask.DataFrame
`read_table`(urlpath[, blocksize, …])	Read delimited files into a Dask.DataFrame
`read_fwf`(urlpath[, blocksize, …])	Read fixed-width files into a Dask.DataFrame
`read_parquet`(path[, columns, filters, …])	Read a Parquet file into a Dask DataFrame
`read_hdf`(pattern, key[, start, stop, …])	Read HDF files into a Dask DataFrame
`read_json`(url_path[, orient, lines, …])	Create a dataframe from a set of JSON files
`read_orc`(path[, columns, storage_options])	Read dataframe from ORC file(s)
`read_sql_table`(table, uri, index_col[, …])	Create dataframe from an SQL table.
`from_array`(x[, chunksize, columns, meta])	Read any sliceable array into a Dask Dataframe
`from_bcolz`(x[, chunksize, categorize, …])	Read BColz CTable into a Dask Dataframe
`from_dask_array`(x[, columns, index, meta])	Create a Dask DataFrame from a Dask Array.
`from_delayed`(dfs[, meta, divisions, prefix, …])	Create Dask DataFrame from many Dask Delayed objects
`from_pandas`(data[, npartitions, chunksize, …])	Construct a Dask DataFrame from a Pandas DataFrame
`dask.bag.core.Bag.to_dataframe`(self[, meta, …])	Create Dask Dataframe from a Dask Bag.

Store DataFrames

`to_csv`(df, filename[, single_file, …])	Store Dask DataFrame to CSV files
`to_parquet`(df, path[, engine, compression, …])	Store Dask.dataframe to Parquet files
`to_hdf`(df, path, key[, mode, append, …])	Store Dask Dataframe to Hierarchical Data Format (HDF) files
`to_records`(df)	Create Dask Array from a Dask Dataframe
`to_sql`(df, name, uri[, schema, index_label, …])	Store Dask Dataframe to a SQL table
`to_bag`(df[, index])	Create Dask Bag from a Dask DataFrame
`to_json`(df, url_path[, orient, lines, …])	Write dataframe into JSON text files

Convert DataFrames

`DataFrame.to_dask_array`(self[, lengths])	Convert a dask DataFrame to a dask array.
`DataFrame.to_delayed`(self[, optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.

Reshape DataFrames

`get_dummies`(data[, prefix, prefix_sep, …])	Convert categorical variable into dummy/indicator variables.
`pivot_table`(df[, index, columns, values, …])	Create a spreadsheet-style pivot table as a DataFrame.
`melt`(frame[, id_vars, value_vars, var_name, …])	Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

DataFrame Methods

class dask.dataframe.DataFrame(dsk, name, meta, divisions)

Parallel Pandas DataFrame

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters

dsk: dict: The dask graph to compute this DataFrame
name: str: The key prefix that specifies which keys in the dask comprise this particular DataFrame
meta: pandas.DataFrame: An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
divisions: tuple of index values: Values along which we partition our blocks on the index

abs(self)

Return a Series/DataFrame with absolute numeric value of each element.

This docstring was copied from pandas.core.frame.DataFrame.abs.

Some inconsistencies with the Dask version may exist.

This function only applies to elements that are all numeric.

Returns

abs: Series/DataFrame containing the absolute value of each element.

See also

numpy.absolute: Calculate the absolute value element-wise.

Notes

For complex inputs of the form a + bj, such as 1.2 + 1j, the absolute value is \(\sqrt{ a^2 + b^2 }\).
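
The formula can be checked by hand with plain Python. This is only a sketch of the arithmetic for the value 1.2 + 1j, and the variable names are illustrative.

```python
import math

z = 1.2 + 1j
a, b = z.real, z.imag  # real part a = 1.2, imaginary part b = 1.0

# sqrt(a^2 + b^2) = sqrt(1.44 + 1.0) = sqrt(2.44)
magnitude = math.sqrt(a**2 + b**2)

print(round(magnitude, 5))              # 1.56205
print(math.isclose(magnitude, abs(z)))  # True
```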

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])  
>>> s.abs()  
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])  
>>> s.abs()  
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])  
>>> s.abs()  
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({  
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

add(self, other, axis='columns', level=None, fill_value=None)

Get Addition of dataframe and other, element-wise (binary operator add).

This docstring was copied from pandas.core.frame.DataFrame.add.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe + other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is radd.

One of the flexible wrappers (add, sub, mul, div, floordiv, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result as the method.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

align(self, other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method.

This docstring was copied from pandas.core.frame.DataFrame.align.

Some inconsistencies with the Dask version may exist.

Join method is specified for each axis Index.

Parameters

otherDataFrame or Series

join{‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axisallowed axis of the other object, default None

Align on index (0), columns (1), or both (None).

levelint or level name, default None (Not supported in Dask)

Broadcast across a level, matching Index values on the passed MultiIndex level.

copybool, default True (Not supported in Dask)

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_valuescalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)

Method to use for filling holes in reindexed Series:

pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.

limitint, default None (Not supported in Dask)

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

fill_axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Filling axis, method and limit.

broadcast_axis{0 or ‘index’, 1 or ‘columns’}, default None (Not supported in Dask)

Broadcast values along this axis, if aligning two objects of different dimensions.

Returns

(left, right)(DataFrame, type of other): Aligned objects.
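
The copied docstring has no example, so the following is a minimal sketch in plain pandas. The frames `left` and `right` are made up for illustration. It aligns two frames on the index only, once with an outer join and a fill value and once with an inner join. The Dask signature above lists the same `join`, `axis`, and `fill_value` arguments.

```python
import pandas as pd

# Two small frames whose indexes only partly overlap (illustrative data).
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20]}, index=["y", "z"])

# Outer join on the index (axis=0): both results get the union of the labels,
# and positions missing from either side are filled with fill_value.
left_outer, right_outer = left.align(right, join="outer", axis=0, fill_value=0)

# Inner join keeps only the labels present in both frames.
left_inner, right_inner = left.align(right, join="inner", axis=0)

print(left_outer)
print(right_outer)
print(list(left_inner.index))
```

Because only the index is aligned here, each result keeps its own columns (`a` on the left, `b` on the right).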

all(self, axis=None, skipna=True, split_every=False, out=None)¶

Return whether all elements are True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.all.

Some inconsistencies with the Dask version may exist.

Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).

Parameters

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame: If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

Series.all: Return True if all elements are True.
DataFrame.any: Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()  
True
>>> pd.Series([True, False]).all()  
False
>>> pd.Series([]).all()  
True
>>> pd.Series([np.nan]).all()  
True
>>> pd.Series([np.nan]).all(skipna=False)  
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  
>>> df  
   col1   col2
0  True   True
1  True  False

Default behaviour checks if column-wise values all return True.

>>> df.all()  
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if row-wise values all return True.

>>> df.all(axis='columns')  
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)  
False

any(self, axis=None, skipna=True, split_every=False, out=None)¶

Return whether any element is True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.any.

Some inconsistencies with the Dask version may exist.

Returns False unless there is at least one element within a Series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame: If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

numpy.any: Numpy version of this method.
Series.any: Return whether any element is True.
Series.all: Return whether all elements are True.
DataFrame.any: Return whether any element is True over requested axis.
DataFrame.all: Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()  
False
>>> pd.Series([True, False]).any()  
True
>>> pd.Series([]).any()  
False
>>> pd.Series([np.nan]).any()  
False
>>> pd.Series([np.nan]).any(skipna=False)  
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})  
>>> df  
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()  
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})  
>>> df  
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')  
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})  
>>> df  
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')  
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()  
Series([], dtype: bool)

append(self, other, interleave_partitions=False)¶

Append rows of other to the end of caller, returning a new object.

This docstring was copied from pandas.core.frame.DataFrame.append.

Some inconsistencies with the Dask version may exist.

Columns in other that are not in the caller are added as new columns.

Parameters

otherDataFrame or Series/dict-like object, or list of these: The data to append.
ignore_indexbool, default False (Not supported in Dask): If True, do not use the index labels.
verify_integritybool, default False (Not supported in Dask): If True, raise ValueError on creating index with duplicates.
sortbool, default False (Not supported in Dask): Sort columns if the columns of self and other are not aligned.

New in version 0.23.0.

Changed in version 1.0.0: Changed to not sort by default.

Returns

DataFrame

See also

concat: General function to concatenate DataFrame or Series objects.

Notes

If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  
>>> df  
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))  
>>> df.append(df2)  
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)  
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.

Less efficient:

>>> df = pd.DataFrame(columns=['A'])  
>>> for i in range(5):  
...     df = df.append({'A': i}, ignore_index=True)
>>> df  
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],  
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
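
The advice in the Notes (collect rows in a list, then concatenate once with the original DataFrame) can be sketched as follows. This is a plain pandas illustration; the names `base` and `new_rows` are hypothetical.

```python
import pandas as pd

base = pd.DataFrame({"A": [0, 1]})

# Collect the new rows in a plain Python list first, instead of
# growing the DataFrame one row at a time.
new_rows = [{"A": i} for i in range(2, 5)]

# Build one DataFrame from the list and concatenate a single time.
result = pd.concat([base, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```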

apply(self, func, axis=0, broadcast=None, raw=False, reduce=None, args=(), meta='__no_default__', **kwds)¶

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

Only axis=1 is supported (and must be specified explicitly).
The user should provide output metadata via the meta keyword.

Parameters

funcfunction

Function to apply to each column/row

axis{0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’: apply function to each column (NOT SUPPORTED)
1 or ‘columns’: apply function to each row

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

argstuple

Positional arguments to pass to function in addition to the array/series

Additional keyword arguments will be passed as keywords to the function

Returns

appliedSeries or DataFrame

See also

dask.DataFrame.map_partitions

Examples

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Apply a function row-wise, passing in extra arguments via args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)  

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)

applymap(self, func, meta='__no_default__')¶

Apply a function to a Dataframe elementwise.

This docstring was copied from pandas.core.frame.DataFrame.applymap.

Some inconsistencies with the Dask version may exist.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters

funccallable: Python function, returns a single value from a single value.

Returns

DataFrame: Transformed DataFrame.

See also

DataFrame.apply: Apply a function along input axis of DataFrame.

Notes

In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.

Examples

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])  
>>> df  
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.applymap(lambda x: len(str(x)))  
   0  1
0  3  4
1  5  5

Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)  
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it’s better to avoid applymap in that case.

>>> df ** 2  
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

assign(self, **kwargs)¶

Assign new columns to a DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.assign.

Some inconsistencies with the Dask version may exist.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters

**kwargsdict of {str: callable or Series}: The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns

DataFrame: A new DataFrame with the new columns in addition to all the existing columns.

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Changed in version 0.23.0: Keyword argument order is maintained.

Examples

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},  
...                   index=['Portland', 'Berkeley'])
>>> df  
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)  
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)  
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,  
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15

astype(self, dtype)¶

Cast a pandas object to the specified dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.

Some inconsistencies with the Dask version may exist.

Parameters

dtypedata type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copybool, default True (Not supported in Dask)

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors{‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Control raising of exceptions on invalid data for provided dtype.

raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.

Returns

castedsame type as caller

See also

to_datetime: Convert argument to datetime.
to_timedelta: Convert argument to timedelta.
to_numeric: Convert argument to a numeric type.
numpy.ndarray.astype: Cast a numpy array to a specified type.

Examples

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}  
>>> df = pd.DataFrame(data=d)  
>>> df.dtypes  
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes  
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes  
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = pd.api.types.CategoricalDtype(  
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])  
>>> s2 = s1.astype('int64', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64

bfill(self, axis=None, limit=None)¶

Synonym for DataFrame.fillna() with method='bfill'.

This docstring was copied from pandas.core.frame.DataFrame.bfill.

Some inconsistencies with the Dask version may exist.

Returns

DataFrame or None: Object with missing values filled, or None if inplace=True.
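
The copied docstring has no example. As a minimal pandas sketch (the frame is made up for illustration), backward filling replaces each missing value with the next valid value after it, and `limit` caps how many consecutive missing values are filled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, np.nan, 3.0, np.nan]})

# Each NaN takes the next valid value below it. The trailing NaN has
# no later value to copy from, so it stays missing.
filled = df.bfill()

# limit=1 fills at most one consecutive NaN per gap.
limited = df.bfill(limit=1)

print(filled)
print(limited)
```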

categorize(df, columns=None, index=None, split_every=None, **kwargs)¶

Convert columns of the DataFrame to category dtype.

Parameters

columnslist, optional: A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.
indexbool, optional: Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.
split_everyint, optional: Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.
kwargs: Keyword arguments are passed on to compute.

clear_divisions(self)¶

Forget division information.

clip(self, lower=None, upper=None, out=None)¶

Trim values at input threshold(s).

This docstring was copied from pandas.core.frame.DataFrame.clip.

Some inconsistencies with the Dask version may exist.

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters

lowerfloat or array_like, default None: Minimum threshold value. All values below this threshold will be set to it.
upperfloat or array_like, default None: Maximum threshold value. All values above this threshold will be set to it.
axisint or str axis name, optional (Not supported in Dask): Align object with lower and upper along the given axis.
inplacebool, default False (Not supported in Dask): Whether to perform the operation in place on the data.

New in version 0.21.0.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns

Series or DataFrame: Same type as calling object with the values outside the clip boundaries replaced.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  
>>> df = pd.DataFrame(data)  
>>> df  
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)  
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])  
>>> t  
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)  
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

combine(self, other, func, fill_value=None, overwrite=True)¶

Perform column-wise combine with another DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.combine.

Some inconsistencies with the Dask version may exist.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters

otherDataFrame: The DataFrame to merge column-wise.
funcfunction: Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two DataFrames column by column.
fill_valuescalar value, default None: The value to fill NaNs with prior to passing any column to the merge func.
overwritebool, default True: If True, columns in self that do not exist in other will be overwritten with NaNs.

Returns

DataFrame: Combination of the provided DataFrames.

See also

DataFrame.combine_first: Combine two DataFrame objects and default to non-null values in frame calling the method.

Examples

Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2  
>>> df1.combine(df2, take_smaller)  
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine(df2, np.minimum)  
   A  B
0  1  2
1  0  3

Using fill_value fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine(df2, take_smaller, fill_value=-5)  
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None is preserved

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})  
>>> df1.combine(df2, take_smaller, fill_value=-5)  
   A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of overwrite and the behavior when the axes differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])  
>>> df1.combine(df2, take_smaller)  
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0

>>> df1.combine(df2, take_smaller, overwrite=False)  
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])  
>>> df2.combine(df1, take_smaller)  
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN

>>> df2.combine(df1, take_smaller, overwrite=False)  
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 1.0
2  NaN  3.0 1.0

combine_first(self, other)¶

Update null elements with value in the same location in other.

This docstring was copied from pandas.core.frame.DataFrame.combine_first.

Some inconsistencies with the Dask version may exist.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters

otherDataFrame: Provided DataFrame to use to fill null values.

Returns

DataFrame

See also

DataFrame.combine: Perform series-wise operation on two DataFrames using a given function.

Examples

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine_first(df2)  
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in other

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})  
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])  
>>> df1.combine_first(df2)  
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0

compute(self, **kwargs)¶

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters

schedulerstring, optional: Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graphbool, optional: If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs: Extra keywords to forward to the scheduler function.

See also

dask.base.compute

copy(self)¶

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data

corr(self, method='pearson', min_periods=None, split_every=False)¶

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Parameters

method{‘pearson’, ‘kendall’, ‘spearman’} or callable

Method of correlation:

pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

New in version 0.24.0.

min_periodsint, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

Returns

DataFrame: Correlation matrix.

See also

DataFrame.corrwith
Series.corr

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
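
The effect of `min_periods` is not shown above. The following pandas sketch uses made-up data in which columns `a` and `c` share only two complete rows, so requiring three observations per pair turns that entry into NaN while better-covered pairs are unaffected.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "c": [np.nan, np.nan, 5.0, 3.0, 2.0, 1.0],
})

# Without a threshold, every pair with enough overlapping rows to
# compute a coefficient gets one ('a' and 'c' overlap in two rows).
default = df.corr()

# Require at least three overlapping observations per column pair.
strict = df.corr(min_periods=3)

print(default)
print(strict)
```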

count(self, axis=None, split_every=False)¶

Count non-NA cells for each column or row.

This docstring was copied from pandas.core.frame.DataFrame.count.

Some inconsistencies with the Dask version may exist.

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
levelint or str, optional (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.
numeric_onlybool, default False (Not supported in Dask): Include only float, int or boolean data.

Returns

Series or DataFrame: For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.

See also

Series.count: Number of non-NA elements in a Series.
DataFrame.shape: Number of DataFrame rows and columns (including NA elements).
DataFrame.isna: Boolean same-sized DataFrame showing places of NA elements.

Examples

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":  
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df  
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()  
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')  
0    3
1    2
2    3
3    3
4    3
dtype: int64

Counts for one level of a MultiIndex:

>>> df.set_index(["Person", "Single"]).count(level="Person")  
        Age
Person
John      2
Lewis     1
Myla      1

cov(self, min_periods=None, split_every=False)¶

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters

min_periodsint, optional: Minimum number of observations required per pair of columns to have a valid result.

Returns

DataFrame: The covariance matrix of the series of the DataFrame.

See also

Series.cov: Compute covariance with another Series.
core.window.EWM.cov: Exponential weighted sample covariance.
core.window.Expanding.cov: Expanding sample covariance.
core.window.Rolling.cov: Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  
...                   columns=['dogs', 'cats'])
>>> df.cov()  
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(1000, 5),  
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(20, 3),  
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  
>>> df.loc[df.index[5:10], 'b'] = np.nan  
>>> df.cov(min_periods=12)  
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cummax(self, axis=None, skipna=True, out=None)¶

Return cumulative maximum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummax.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.max: Similar functionality but ignores NaN values.
DataFrame.max: Return the maximum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()  
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()  
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
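The "See also" entry for core.window.Expanding.max notes that it ignores NaN values. A short pandas sketch of that difference, reusing the Series from the examples above:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, np.nan, 5, -1, 0])

# cummax leaves NaN in place at the position where it occurs.
running = s.cummax()

# An expanding window skips the NaN and carries the previous maximum forward.
expanding = s.expanding().max()

print(pd.DataFrame({"cummax": running, "expanding_max": expanding}))
```

The two results differ only at index 1, where cummax reports NaN and the expanding maximum reports 2.0.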

cummin(self, axis=None, skipna=True, out=None)

Return cumulative minimum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummin.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.min: Similar functionality but ignores NaN values.
DataFrame.min: Return the minimum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()  
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()  
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

cumprod(self, axis=None, skipna=True, dtype=None, out=None)

Return cumulative product over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumprod.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.prod: Similar functionality but ignores NaN values.
DataFrame.prod: Return the product over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()  
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()  
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0

cumsum(self, axis=None, skipna=True, dtype=None, out=None)

Return cumulative sum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumsum.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.sum: Similar functionality but ignores NaN values.
DataFrame.sum: Return the sum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()  
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()  
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)  
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0

describe(self, split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)

Generate descriptive statistics.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters

percentiles : list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include : ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
None (default) : The result will include all numeric columns.

exclude : list-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
None (default) : The result will exclude nothing.

Returns

Series or DataFrame: Summary statistics of the Series or Dataframe provided.

See also

DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()  
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),  
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
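None of the examples above pass the percentiles parameter. A short pandas sketch of it, on an illustrative Series of the integers 1 through 10:

```python
import pandas as pd

s = pd.Series(range(1, 11))

# Request the 10th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.9])

print(summary)
```

The 50% row still appears in the output even though it was not requested, because the median is always included.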

diff(self, periods=1, axis=0)

First discrete difference of element.

This docstring was copied from pandas.core.frame.DataFrame.diff.

Some inconsistencies with the Dask version may exist.

Note

Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean-specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).

Parameters

periods : int, default 1: Periods to shift for calculating difference, accepts negative values.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0: Take difference over rows (0) or columns (1).

Returns

DataFrame

See also

Series.diff: First discrete difference for a Series.
DataFrame.pct_change: Percent change over given number of periods.
DataFrame.shift: Shift index by desired number of periods with an optional time freq.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub().

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],  
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df  
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36

>>> df.diff()  
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)  
    a    b     c
0 NaN  0.0   0.0
1 NaN -1.0   3.0
2 NaN -1.0   7.0
3 NaN -1.0  13.0
4 NaN  0.0  20.0
5 NaN  2.0  28.0

Difference with 3rd previous row

>>> df.diff(periods=3)  
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)  
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
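The description above defines diff as each element minus the element a given number of periods earlier. For numeric columns this can be checked against an explicit shift in pandas; the frame below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [1, 1, 2, 3, 5, 8]})
periods = 2

via_diff = df.diff(periods=periods)

# Same computation written out: subtract the frame shifted down by `periods` rows.
via_shift = df - df.shift(periods)

print(via_diff)
print(via_diff.equals(via_shift))
```

The first `periods` rows are NaN in both results because there is no earlier row to subtract.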

div(self, other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.frame.DataFrame.div.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other : scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value : float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

divide(self, other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.frame.DataFrame.divide.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other : scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value : float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
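The examples in this entry call div rather than divide. A quick pandas check, using the same small frame, that divide, div, and the / operator give the same result:

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4],
                   "degrees": [360, 180, 360]},
                  index=["circle", "triangle", "rectangle"])

by_divide = df.divide(10)
by_div = df.div(10)
by_operator = df / 10

print(by_divide)
print(by_divide.equals(by_div), by_divide.equals(by_operator))
```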

drop(self, labels=None, axis=0, columns=None, errors='raise')

Drop specified labels from rows or columns.

This docstring was copied from pandas.core.frame.DataFrame.drop.

Some inconsistencies with the Dask version may exist.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters

labels : single label or list-like: Index or column labels to drop.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0: Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index : single label or list-like (Not supported in Dask): Alternative to specifying axis (labels, axis=0 is equivalent to index=labels). New in version 0.21.0.
columns : single label or list-like: Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels). New in version 0.21.0.
level : int or level name, optional (Not supported in Dask): For MultiIndex, level from which the labels will be removed.
inplace : bool, default False (Not supported in Dask): If True, do operation inplace and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’: If ‘ignore’, suppress error and only existing labels are dropped.

Returns

DataFrame: DataFrame without the removed index or column labels.

Raises

KeyError: If any of the labels is not found in the selected axis.

See also

DataFrame.loc: Label-location based indexer for selection by label.
DataFrame.dropna: Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates: Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop: Return Series with specified index labels removed.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),  
...                   columns=['A', 'B', 'C', 'D'])
>>> df  
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)  
   A   D
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['B', 'C'])  
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index

>>> df.drop([0, 1])  
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],  
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],  
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df  
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

>>> df.drop(index='cow', columns='small')  
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

>>> df.drop(index='length', level=1)  
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
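The errors parameter and the KeyError listed under Raises are not exercised in the examples above. A minimal pandas sketch, where the label 'Z' is a deliberately nonexistent column chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=["A", "B", "C", "D"])

# 'Z' does not exist; errors='ignore' drops only the labels that are present.
kept = df.drop(columns=["B", "Z"], errors="ignore")

# With the default errors='raise', the same call raises KeyError.
try:
    df.drop(columns=["B", "Z"])
    raised = False
except KeyError:
    raised = True

print(list(kept.columns), raised)
```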

drop_duplicates(self, subset=None, split_every=None, split_out=1, ignore_index=False, **kwargs)

Return DataFrame with duplicate rows removed.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Considering certain columns is optional. Indexes, including time indexes, are ignored.

Parameters

subset : column label or sequence of labels, optional: Only consider certain columns for identifying duplicates, by default use all of the columns.
keep : {‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask): Determines which duplicates (if any) to keep.

‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

inplace : bool, default False (Not supported in Dask): Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False: If True, the resulting axis will be labeled 0, 1, …, n - 1. New in version 1.0.0.

Returns

DataFrame: DataFrame with duplicates removed or None if inplace=True.
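This entry has no examples, so here is a minimal pandas sketch of subset and ignore_index, assuming pandas 1.0 or later for ignore_index. The frame and its column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack"],
    "rating": [4.0, 4.0, 3.5, 5.0],
})

# All columns are compared: only row 1 repeats an earlier row.
all_cols = df.drop_duplicates()

# Only 'brand' is compared, and the surviving rows are relabeled 0..n-1.
by_brand = df.drop_duplicates(subset=["brand"], ignore_index=True)

print(all_cols)
print(by_brand)
```

The original index labels (0, 2, 3) survive in the first result, while the second result is renumbered from 0.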

dropna(self, how='any', subset=None, thresh=None)

Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.

Some inconsistencies with the Dask version may exist.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Determine if rows or columns which contain missing values are removed.

0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.

Changed in version 1.0.0: Passing a tuple or list to drop on multiple axes is no longer supported. Only a single axis is allowed.

how : {‘any’, ‘all’}, default ‘any’

Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.

thresh : int, optional

Require that many non-NA values.

subset : array-like, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace : bool, default False (Not supported in Dask)

If True, do operation inplace and return None.

Returns

DataFrame: DataFrame with NA entries dropped from it.

See also

DataFrame.isna: Indicate missing values.
DataFrame.notna: Indicate existing (non-missing) values.
DataFrame.fillna: Replace missing values.
Series.dropna: Drop missing values.
Index.dropna: Drop missing indices.

Examples

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],  
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df  
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()  
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')  
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')  
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)  
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'born'])  
       name        toy       born
1    Batman  Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)  
>>> df  
     name        toy       born
1  Batman  Batmobile 1940-04-25

property dtypes¶: Return data types

eq(self, other, axis='columns', level=None)¶

Get Equal to of dataframe and other, element-wise (binary operator eq).

This docstring was copied from pandas.core.frame.DataFrame.eq.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
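
The note that NaN values are considered different can be surprising when a frame is compared with itself. A small pandas sketch (the frame and variable names are made up for illustration) shows the effect and one possible way to treat two missing cells as matching:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cost": [250.0, np.nan]}, index=["A", "B"])

# Comparing the frame with itself: the NaN cell is not equal to itself.
same = df.eq(df)

# To count two missing cells as a match, combine eq with isna explicitly.
same_or_both_missing = df.eq(df) | (df.isna() & df.isna())
```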

eval(self, expr, inplace=None, **kwargs)¶

Evaluate a string describing operations on DataFrame columns.

This docstring was copied from pandas.core.frame.DataFrame.eval.

Some inconsistencies with the Dask version may exist.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

Parameters

exprstr: The expression string to evaluate.
inplacebool, default False: If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
**kwargs: See the documentation for eval() for complete details on the keyword arguments accepted by this method.

Returns

ndarray, scalar, or pandas object: The result of the evaluation.

See also

DataFrame.query: Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign: Can evaluate an expression or function to create new values for a column.
eval: Evaluate a Python expression as a string using various backends.

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})  
>>> df  
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')  
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')  
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df  
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use inplace=True to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)  
>>> df  
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
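
Given the code-injection warning above, it may be preferable to avoid string expressions when the expression comes from outside the program. As a hedged sketch in plain pandas, the same derived column can be built with assign, which takes ordinary Python objects rather than a string:

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})

# Expression supplied as a string and parsed by eval.
via_eval = df.eval("C = A + B")

# Equivalent column built from ordinary Python objects, no string parsing.
via_assign = df.assign(C=df["A"] + df["B"])
```

Both calls return a new frame and leave `df` with its original two columns.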

explode(self, column)¶

Transform each element of a list-like to a row, replicating index values.

This docstring was copied from pandas.core.frame.DataFrame.explode.

Some inconsistencies with the Dask version may exist.

New in version 0.25.0.

Parameters

columnstr or tuple: Column to explode.

Returns

DataFrame: Exploded lists to rows of the subset columns; index will be duplicated for these rows.

Raises

ValueError: If columns of the frame are not unique.

See also

DataFrame.unstack: Pivot a level of the (necessarily hierarchical) index labels.
DataFrame.melt: Unpivot a DataFrame from wide format to long format.
Series.explode: Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged. Empty list-likes will result in a np.nan for that row.

Examples

>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})  
>>> df  
           A  B
0  [1, 2, 3]  1
1        foo  1
2         []  1
3     [3, 4]  1

>>> df.explode('A')  
     A  B
0    1  1
0    2  1
0    3  1
1  foo  1
2  NaN  1
3    3  1
3    4  1
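
Since explode replicates index values, the resulting labels are no longer unique. A short pandas sketch (variable names are illustrative) of inspecting the repeated labels and, if desired, renumbering the rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [[1, 2, 3], "foo", [], [3, 4]], "B": 1})

exploded = df.explode("A")

# The original row labels are repeated once per list element.
repeated_labels = exploded.index.tolist()

# Optionally replace them with a fresh 0..n-1 index.
renumbered = exploded.reset_index(drop=True)
```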

ffill(self, axis=None, limit=None)¶

Synonym for DataFrame.fillna() with method='ffill'.

This docstring was copied from pandas.core.frame.DataFrame.ffill.

Some inconsistencies with the Dask version may exist.

Returns

DataFrame or None: Object with missing values filled or None if inplace=True.
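
The copied docstring has no example, so here is a minimal pandas sketch of forward filling, with and without limit. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 3.0, np.nan, np.nan]})

# Propagate the last valid value forward; the leading NaN has nothing to copy.
filled = df.ffill()

# Fill at most one consecutive missing value after each valid one.
limited = df.ffill(limit=1)
```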

fillna(self, value=None, method=None, limit=None, axis=None)¶

Fill NA/NaN values using the specified method.

This docstring was copied from pandas.core.frame.DataFrame.fillna.

Some inconsistencies with the Dask version may exist.

Parameters

valuescalar, dict, Series, or DataFrame: Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None: Method to use for filling holes in reindexed Series. pad / ffill: propagate the last valid observation forward to the next valid one. backfill / bfill: use the next valid observation to fill the gap.
axis{0 or ‘index’, 1 or ‘columns’}: Axis along which to fill missing values.
inplacebool, default False (Not supported in Dask): If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limitint, default None: If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcastdict, default is None (Not supported in Dask): A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns

DataFrame or None: Object with missing values filled or None if inplace=True.

See also

interpolate: Fill NaN values using interpolation.
reindex: Conform object to new index.
asfreq: Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}  
>>> df.fillna(value=values)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
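
The value parameter also accepts a Series, matched to columns by label. As an illustrative pandas sketch (not taken from the original docstring), filling each column with its own mean looks like this; a column whose mean is itself NaN stays unfilled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# One fill value per column, indexed by column label.
col_means = df.mean()

# Column C is entirely missing, so its mean is NaN and nothing is filled there.
filled = df.fillna(col_means)
```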

first(self, offset)¶

Method to subset initial periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.first.

Some inconsistencies with the Dask version may exist.

Parameters

offsetstr, DateOffset, dateutil.relativedelta

Returns

subsetsame type as caller

Raises

TypeError: If the index is not a DatetimeIndex

See also

last: Select final periods of time series based on a date offset.
at_time: Select values at a particular time of the day.
between_time: Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')  
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
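
The calendar-window behaviour can be reproduced with an explicit boolean mask, which may help make the cut-off visible. This is an illustrative equivalent for the '3D' case only, not a description of how first is implemented.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# The window covers three calendar days counted from the first timestamp.
window_end = ts.index[0] + pd.Timedelta(days=3)  # 2018-04-12

# 2018-04-13 falls outside the window even though it is the third observation.
first_three_days = ts.loc[ts.index < window_end]
```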

floordiv(self, other, axis='columns', level=None, fill_value=None)¶

Get Integer division of dataframe and other, element-wise (binary operator floordiv).

This docstring was copied from pandas.core.frame.DataFrame.floordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results as the method.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
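
The shared examples above cover add, sub, mul and div but not floordiv itself. A small pandas sketch, reusing the same frame with a made-up divisor frame, shows integer division, the effect of fill_value, and the reverse version:

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4], "degrees": [360, 180, 360]},
                  index=["circle", "triangle", "rectangle"])

# Same result as df // 7.
by_scalar = df.floordiv(7)

# `other` has no 'degrees' column; fill_value=1 stands in for the missing divisor.
other = pd.DataFrame({"angles": [1, 2, 3]},
                     index=["circle", "triangle", "rectangle"])
with_fill = df.floordiv(other, fill_value=1)

# Reverse version: computes 720 // df instead of df // 720.
reversed_div = df[["degrees"]].rfloordiv(720)
```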

ge(self, other, axis='columns', level=None)¶

Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).

This docstring was copied from pandas.core.frame.DataFrame.ge.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False

get_partition(self, n)¶: Get a dask DataFrame/Series representing the nth partition.

groupby(self, by=None, **kwargs)¶

Group DataFrame using a mapper or by a Series of columns.

This docstring was copied from pandas.core.frame.DataFrame.groupby.

Some inconsistencies with the Dask version may exist.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters

bymapping, function, label, or list of labels: Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask): Split along rows (0) or columns (1).
levelint, level name, or sequence of such, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default True (Not supported in Dask): For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sortbool, default True (Not supported in Dask): Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
group_keysbool, default True (Not supported in Dask): When calling apply, add group keys to index to identify pieces.
squeezebool, default False (Not supported in Dask): Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
observedbool, default False (Not supported in Dask): This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

New in version 0.23.0.

Returns

DataFrameGroupBy: Returns a groupby object that contains information about the groups.

See also

resample: Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more.

Examples

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',  
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df  
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()  
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Hierarchical Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],  
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},  
...                   index=index)
>>> df  
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()  
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()  
         Max Speed
Type
Captive      210.0
Wild         185.0

gt(self, other, axis='columns', level=None)¶

Get Greater than of dataframe and other, element-wise (binary operator gt).

This docstring was copied from pandas.core.frame.DataFrame.gt.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False

head(self, n=5, npartitions=1, compute=True)¶

First n rows of the dataset

Parameters

nint, optional: The number of rows to return. Default is 5.
npartitionsint, optional: Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.
computebool, optional: Whether to compute the result, default is True.

idxmax(self, axis=None, skipna=True, split_every=False)¶

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of maxima along the specified axis.

Raises

ValueError: If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.
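
The axis wording can be easy to misread, so here is a small pandas sketch (data borrowed from the comparison examples elsewhere on this page) showing what each axis returns:

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100], "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# axis=0 (default): for each column, the row label holding the maximum.
by_column = df.idxmax()

# axis=1: for each row, the column label holding the maximum.
by_row = df.idxmax(axis=1)
```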

idxmin(self, axis=None, skipna=True, split_every=False)¶

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of minima along the specified axis.

Raises

ValueError: If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.
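
A brief pandas sketch of the two behaviours described above: ties resolve to the first occurrence, and missing values are skipped by default. The frame is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 1], "b": [np.nan, 5.0, 2.0]},
                  index=["x", "y", "z"])

# Column 'a' has its minimum twice; the first label ('y') is returned.
# Column 'b' starts with NaN, which is ignored because skipna=True.
lowest = df.idxmin()
```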

property iloc¶

Purely integer-location based indexing for selection by position.

Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.

See Indexing into Dask DataFrames for more.

Examples

>>> df.iloc[:, [2, 0, 1]]  

property index¶: Return dask Index instance

info(self, buf=None, verbose=False, memory_usage=False)¶: Concise summary of a Dask DataFrame.

isin(self, values)¶

Whether each element in the DataFrame is contained in values.

This docstring was copied from pandas.core.frame.DataFrame.isin.

Some inconsistencies with the Dask version may exist.

Parameters

valuesiterable, Series, DataFrame or dict: The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns

DataFrame: DataFrame of booleans showing whether each element in the DataFrame is contained in values.

See also

DataFrame.eq: Equality test for DataFrame.
Series.isin: Equivalent method on Series.
Series.str.contains: Test if pattern or regex is contained within a string of a Series or Index.

Examples

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},  
...                   index=['falcon', 'dog'])
>>> df  
        num_legs  num_wings
falcon         2          2
dog            4          0

When values is a list check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings)

>>> df.isin([0, 2])  
        num_legs  num_wings
falcon      True       True
dog        False       True

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})  
        num_legs  num_wings
falcon     False      False
dog        False       True

When values is a Series or DataFrame the index and column must match. Note that ‘falcon’ matches on both columns because its values in other are the same, while ‘dog’ does not appear in the index of other and therefore matches nowhere.

>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},  
...                      index=['spider', 'falcon'])
>>> df.isin(other)  
        num_legs  num_wings
falcon      True       True
dog        False      False
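
A common use of the boolean result is row filtering. The following pandas sketch, with illustrative variable names, keeps rows that contain a given value anywhere, and then excludes rows by a single column:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])

# Rows where at least one cell equals 0.
has_zero = df[df.isin([0]).any(axis=1)]

# Rows whose num_legs is NOT in the given list.
not_four_legged = df[~df["num_legs"].isin([4])]
```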

isna(self)¶

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isna.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull: Alias of isna.
DataFrame.notna: Boolean inverse of isna.
DataFrame.dropna: Omit axes labels with missing values.
isna: Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()  
0    False
1    False
2     True
dtype: bool
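
Because the mask is boolean, summing it counts the missing cells. The sketch below uses plain pandas with illustrative column names (age, toy). A Dask DataFrame is expected to accept the same chain followed by .compute(), but that is an assumption and is not shown here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'toy': [None, 'Batmobile', 'Joker']})

# True counts as 1 when summed, so this gives the number of NA cells per column.
missing_per_column = df.isna().sum()

# Select the rows that contain at least one NA value.
rows_with_na = df[df.isna().any(axis=1)]
```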

isnull(self)¶

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, are mapped to True. Everything else is mapped to False. Values such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull: Alias of isna.
DataFrame.notna: Boolean inverse of isna.
DataFrame.dropna: Omit axes labels with missing values.
isna: Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()  
0    False
1    False
2     True
dtype: bool

items(self)¶

Iterate over (column name, Series) pairs.

This docstring was copied from pandas.core.frame.DataFrame.items.

Some inconsistencies with the Dask version may exist.

Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.

Yields

labelobject: The column names for the DataFrame being iterated over.
contentSeries: The column entries belonging to each label, as a Series.

See also

DataFrame.iterrows: Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples: Iterate over DataFrame rows as namedtuples of the values.

Examples

>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],  
...                   'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df  
        species   population
panda   bear      1864
polar   bear      22000
koala   marsupial 80000
>>> for label, content in df.items():  
...     print('label:', label)
...     print('content:', content, sep='\n')
...
label: species
content:
panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content:
panda     1864
polar    22000
koala    80000
Name: population, dtype: int64

iterrows(self)¶

Iterate over DataFrame rows as (index, Series) pairs.

This docstring was copied from pandas.core.frame.DataFrame.iterrows.

Some inconsistencies with the Dask version may exist.

Yields

indexlabel or tuple of label: The index of the row. A tuple for a MultiIndex.
dataSeries: The data of the row as a Series.
itgenerator: A generator that iterates over the rows of the frame.

See also

DataFrame.itertuples: Iterate over DataFrame rows as namedtuples of the values.
DataFrame.items: Iterate over (column name, Series) pairs.

Notes

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
```
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])  
>>> row = next(df.iterrows())[1]  
>>> row  
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)  
float64
>>> print(df['int'].dtype)  
int64
```
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
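
The dtype difference between the two iterators can be checked directly. This is a minimal pandas sketch; the column names n and w are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 1.5]], columns=['n', 'w'])

# iterrows packs each row into one Series, so the integer is upcast to float64.
_, series_row = next(df.iterrows())

# itertuples keeps each column's value separate, so the integer is not upcast.
tuple_row = next(df.itertuples(index=False))

print(series_row.dtype)     # dtype shared by the whole row Series
print(type(tuple_row.n))    # type of the integer field in the namedtuple
```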

itertuples(self, index=True, name='Pandas')¶

Iterate over DataFrame rows as namedtuples.

This docstring was copied from pandas.core.frame.DataFrame.itertuples.

Some inconsistencies with the Dask version may exist.

Parameters

indexbool, default True: If True, return the index as the first element of the tuple.
namestr or None, default “Pandas”: The name of the returned namedtuples or None to return regular tuples.

Returns

iterator: An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.

See also

DataFrame.iterrows: Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.items: Iterate over (column name, Series) pairs.

Notes

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. On Python versions < 3.7, regular tuples are returned for DataFrames with a large number of columns (>254).

Examples

>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},  
...                   index=['dog', 'hawk'])
>>> df  
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():  
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)

By setting the index parameter to False we can remove the index as the first element of the tuple:

>>> for row in df.itertuples(index=False):  
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)

With the name parameter set we set a custom name for the yielded namedtuples:

>>> for row in df.itertuples(name='Animal'):  
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)

join(self, other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)¶

Join columns of another DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.join.

Some inconsistencies with the Dask version may exist.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters

otherDataFrame, Series, or list of DataFrame

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.

onstr, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

left: use calling frame’s index (or column if on is specified)
right: use other’s index.
outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

lsuffixstr, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffixstr, default ‘’

Suffix to use from right frame’s overlapping columns.

sortbool, default False (Not supported in Dask)

Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

Returns

DataFrame: A dataframe containing columns from both the caller and other.

See also

DataFrame.merge: For column(s)-on-column(s) operations.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.

Support for specifying index levels as the on parameter was added in version 0.23.0.

Examples

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],  
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

>>> df  
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],  
...                       'B': ['B0', 'B1', 'B2']})

>>> other  
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')  
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.

>>> df.set_index('key').join(other.set_index('key'))  
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.

>>> df.join(other.set_index('key'), on='key')  
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
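
The examples above all use the default how='left'. The following plain-pandas sketch contrasts the other join types on two small frames whose indexes only partly overlap. The frame names left and right are illustrative, and Dask-specific arguments such as npartitions and shuffle are not exercised.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=['K1', 'K2', 'K3'])

inner = left.join(right, how='inner')        # keys present in both frames
outer = left.join(right, how='outer')        # union of keys, sorted
right_join = left.join(right, how='right')   # keys taken from `right`
```

Keys that exist on only one side are filled with NaN in the columns coming from the other frame.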

property known_divisions¶: Whether divisions are already known

last(self, offset)¶

Method to subset final periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.last.

Some inconsistencies with the Dask version may exist.

Parameters

offsetstr, DateOffset, dateutil.relativedelta

Returns

subsetsame type as caller

Raises

TypeError: If the index is not a DatetimeIndex

See also

first: Select initial periods of time series based on a date offset.
at_time: Select values at a particular time of the day.
between_time: Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')  
            A
2018-04-13  3
2018-04-15  4

Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore the data for 2018-04-11 was not returned.
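
The calendar-window behaviour can be reproduced by hand with a cutoff relative to the final timestamp. This is an illustrative pandas sketch of the semantics, not necessarily how pandas or Dask implement last.

```python
import pandas as pd

i = pd.date_range('2018-04-09', periods=4, freq='2D')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)

# Keep rows strictly later than (final timestamp - 3 days), i.e. after 2018-04-12.
cutoff = ts.index[-1] - pd.Timedelta(days=3)
last_three_days = ts[ts.index > cutoff]
```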

le(self, other, axis='columns', level=None)¶

Get Less than or equal to of dataframe and other, element-wise (binary operator le).

This docstring was copied from pandas.core.frame.DataFrame.le.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
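
The shared examples above mostly exercise eq, ne and gt. As a complement, here is a minimal pandas sketch that uses le itself, first with a scalar and then with one threshold per row. The Series name limits is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])

# Scalar comparison: the method and the <= operator give the same result.
at_most_150 = df.le(150)
same_as_operator = at_most_150.equals(df <= 150)

# Row-wise thresholds: one limit per index label, broadcast across the columns.
limits = pd.Series([200, 200, 100], index=['A', 'B', 'C'])
within_limit = df.le(limits, axis='index')
```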

property loc¶

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  

lt(self, other, axis='columns', level=None)¶

Get Less than of dataframe and other, element-wise (binary operator lt).

This docstring was copied from pandas.core.frame.DataFrame.lt.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
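
For lt specifically, a short pandas sketch: compare every column against the revenue column row by row, then use the Series form to filter rows. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])

# axis='index' aligns the Series with the row labels instead of the column names.
below_revenue = df.lt(df['revenue'], axis='index')

# Series.lt yields a boolean mask that selects rows where cost < revenue.
profitable = df[df['cost'].lt(df['revenue'])]
```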

map_overlap(self, func, before, after, *args, **kwargs)¶

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters

funcfunction: Function applied to each partition.
beforeint: The number of rows to prepend to partition i from the end of partition i - 1.
afterint: The number of rows to append to partition i from the beginning of partition i + 1.
args, kwargs: Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
metapd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
Apply func to each partition, passing in any extra args and kwargs if provided.
Trim before rows from the beginning of all but the first partition.
Trim after rows from the end of all but the last partition.

Note that the index and divisions are assumed to remain unchanged.
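
The five steps above can be emulated on a list of plain pandas partitions. The helper emulate_map_overlap below is hypothetical and exists only to illustrate the prepend, append, apply and trim sequence; it is not Dask's implementation and ignores divisions and meta.

```python
import pandas as pd

def emulate_map_overlap(partitions, func, before, after):
    """Hypothetical helper: run the documented steps on pandas partitions."""
    results = []
    last = len(partitions) - 1
    for i, part in enumerate(partitions):
        pieces = []
        # Step 1: prepend rows from the end of the previous partition.
        n_before = 0
        if i > 0 and before > 0:
            head = partitions[i - 1].tail(before)
            n_before = len(head)
            pieces.append(head)
        pieces.append(part)
        # Step 2: append rows from the beginning of the next partition.
        n_after = 0
        if i < last and after > 0:
            tail = partitions[i + 1].head(after)
            n_after = len(tail)
            pieces.append(tail)
        # Step 3: apply the function to the extended partition.
        out = func(pd.concat(pieces))
        # Steps 4 and 5: trim the borrowed rows again.
        results.append(out.iloc[n_before:len(out) - n_after])
    return pd.concat(results)

df = pd.DataFrame({'x': [1, 2, 4, 7, 11], 'y': [1., 2., 3., 4., 5.]})
parts = [df.iloc[:3], df.iloc[3:]]

overlapped = emulate_map_overlap(parts, lambda p: p.rolling(2).sum(), 2, 0)
whole = df.rolling(2).sum()
```

With enough overlap, the partitioned result matches the rolling sum computed on the unpartitioned frame.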

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a pd.Timedelta for time-based windows.

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
Freq: D, dtype: float64

map_partitions(self, func, *args, **kwargs)¶

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters

funcfunction: Function applied to each partition.
args, kwargs: Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain Scalar, Delayed or regular python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function.
metapd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=ddf)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  

mask(self, cond, other=nan)¶

Replace values where the condition is True.

This docstring was copied from pandas.core.frame.DataFrame.mask.

Some inconsistencies with the Dask version may exist.

Parameters

condbool Series/DataFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

‘raise’ : allow exceptions to be raised.
‘ignore’ : suppress exceptions. On error return original object.

try_castbool, default False (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Returns

Same type as caller

See also

DataFrame.where(): Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
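
Most of the examples above call where. As a complement, here is a minimal pandas sketch that calls mask directly with a scalar other, replacing negative entries with 0. The column values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, 2, -3], 'B': [4, -5, 6]})

# Entries where the condition is True are replaced; all others are kept.
clipped = df.mask(df < 0, 0)

# The same result expressed with where and the inverted condition.
same = clipped.equals(df.where(df >= 0, 0))
```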

max(self, axis=None, skipna=True, split_every=False, out=None)¶

Return the maximum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.max.

Some inconsistencies with the Dask version may exist.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters

axis{index (0), columns (1)}: Axis for the function to be applied on.
skipnabool, default True: Exclude NA/null values when computing the result.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()  
8

Max using level names, as well as indices.

>>> s.max(level='blooded')  
blooded
warm    4
cold    8
Name: legs, dtype: int64

>>> s.max(level=0)  
blooded
warm    4
cold    8
Name: legs, dtype: int64

mean(self, axis=None, skipna=True, split_every=False, dtype=None, out=None)¶

Return the mean of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.mean.

Some inconsistencies with the Dask version may exist.

Parameters

axis{index (0), columns (1)}: Axis for the function to be applied on.
skipnabool, default True: Exclude NA/null values when computing the result.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

melt(self, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)¶

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters

frameDataFrame
id_varstuple, list, or ndarray, optional: Column(s) to use as identifier variables.
value_varstuple, list, or ndarray, optional: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_namescalar: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_namescalar, default ‘value’: Name to use for the ‘value’ column.
col_levelint or string, optional: If columns are a MultiIndex then use this level to melt.

Returns

DataFrame: Unpivoted DataFrame.

See also

pandas.DataFrame.melt

memory_usage(self, index=True, deep=False)¶

Return the memory usage of each column in bytes.

This docstring was copied from pandas.core.frame.DataFrame.memory_usage.

Some inconsistencies with the Dask version may exist.

The memory usage can optionally include the contribution of the index and elements of object dtype.

This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.

Parameters

indexbool, default True: Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.
deepbool, default False: If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns

Series: A Series whose index is the original column names and whose values are the memory usage of each column in bytes.

See also

numpy.ndarray.nbytes: Total bytes consumed by the elements of an ndarray.
Series.memory_usage: Bytes consumed by a Series.
Categorical: Memory-efficient array for string values with many repeated values.
DataFrame.info: Concise summary of a DataFrame.

Examples

>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']  
>>> data = dict([(t, np.ones(shape=5000).astype(t))  
...              for t in dtypes])
>>> df = pd.DataFrame(data)  
>>> df.head()  
   int64  float64            complex128  object  bool
0      1      1.0    1.000000+0.000000j       1  True
1      1      1.0    1.000000+0.000000j       1  True
2      1      1.0    1.000000+0.000000j       1  True
3      1      1.0    1.000000+0.000000j       1  True
4      1      1.0    1.000000+0.000000j       1  True

>>> df.memory_usage()  
Index           128
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64

>>> df.memory_usage(index=False)  
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64

The memory footprint of object dtype columns is ignored by default:

>>> df.memory_usage(deep=True)  
Index            128
int64          40000
float64        40000
complex128     80000
object        160000
bool            5000
dtype: int64

Use a Categorical for efficient storage of an object-dtype column with many repeated values.

>>> df['object'].astype('category').memory_usage(deep=True)  
5216

memory_usage_per_partition(self, index=True, deep=False)¶

Return the memory usage of each partition

Parameters

indexbool, default True: Specifies whether to include the memory usage of the index in returned Series.
deepbool, default False: If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns

Series: A Series whose index is the partition number and whose values are the memory usage of each partition in bytes.

merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)¶

Merge the DataFrame with another DataFrame

This will merge the two datasets, either on the indices, a certain column in each dataset or the index in one dataset and the column in another.

Parameters

right: dask.dataframe.DataFrame

how{‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘inner’

How to handle the operation of the two objects:

left: use calling frame’s index (or column if on is specified)
right: use other frame’s index
outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling frame’s index

onlabel or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_onlabel or list, or array-like

Column to join on in the left DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

right_onlabel or list, or array-like

Column to join on in the right DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

left_indexboolean, default False

Use the index from the left DataFrame as the join key.

right_indexboolean, default False

Use the index from the right DataFrame as the join key.

suffixes2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively

indicatorboolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.

npartitions: int or None, optional

The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If None then npartitions = max(lhs.npartitions, rhs.npartitions). Default is None.

shuffle: {‘disk’, ‘tasks’}, optional

Either 'disk' for single-node operation or 'tasks' for distributed operation. If not specified, this is inferred from your current scheduler.

Notes

There are three ways to join dataframes:

Joining on indices. In this case the divisions are aligned using the function dask.dataframe.multi.align_partitions. Afterwards, each partition is merged with the pandas merge function.
Joining one on index and one on column. In this case the divisions of the dataframe merged by index (\(d_i\)) are used to divide the dataframe merged by column (\(d_c\)), using dask.dataframe.multi.rearrange_by_divisions. The merged dataframe (\(d_m\)) then has exactly the same divisions as \(d_i\). This can lead to issues if you merge multiple rows from \(d_c\) to one row in \(d_i\).
Joining both on columns. In this case a hash join is performed using dask.dataframe.multi.hash_join.

min(self, axis=None, skipna=True, split_every=False, out=None)¶

Return the minimum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.min.

Some inconsistencies with the Dask version may exist.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters

axis{index (0), columns (1)}: Axis for the function to be applied on.
skipnabool, default True: Exclude NA/null values when computing the result.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()  
0

Min using level names, as well as indices.

>>> s.min(level='blooded')  
blooded
warm    2
cold    0
Name: legs, dtype: int64

>>> s.min(level=0)  
blooded
warm    2
cold    0
Name: legs, dtype: int64

mod(self, other, axis='columns', level=None, fill_value=None)¶

Get Modulo of dataframe and other, element-wise (binary operator mod).

This docstring was copied from pandas.core.frame.DataFrame.mod.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

mode(self, dropna=True, split_every=False)¶

Get the mode(s) of each element along the selected axis.

This docstring was copied from pandas.core.frame.DataFrame.mode.

Some inconsistencies with the Dask version may exist.

The mode of a set of values is the value that appears most often. It can be multiple values.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

The axis to iterate over while searching for the mode:

0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.

numeric_onlybool, default False (Not supported in Dask)

If True, only apply to numeric columns.

dropnabool, default True

Don’t consider counts of NaN/NaT.

New in version 0.24.0.

Returns

DataFrame: The modes of each column or row.

See also

Series.mode: Return the highest frequency value in a Series.
Series.value_counts: Return the counts of values in a Series.

Examples

>>> df = pd.DataFrame([('bird', 2, 2),  
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df  
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN

By default, missing values are not considered, and the modes of wings are both 0 and 2. The second row of species and legs contains NaN because those columns have only one mode, but the DataFrame has two rows.

>>> df.mode()  
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0

With dropna=False, NaN values are considered and can be the mode (as for wings).

>>> df.mode(dropna=False)  
  species  legs  wings
0    bird     2    NaN

Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.

>>> df.mode(numeric_only=True)  
   legs  wings
0   2.0    0.0
1   NaN    2.0

To compute the mode over columns and not rows, use the axis parameter:

>>> df.mode(axis='columns', numeric_only=True)  
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN

mul(self, other, axis='columns', level=None, fill_value=None)¶

Get Multiplication of dataframe and other, element-wise (binary operator mul).

This docstring was copied from pandas.core.frame.DataFrame.mul.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

property ndim¶: Return dimensionality

ne(self, other, axis='columns', level=None)¶

Get Not equal to of dataframe and other, element-wise (binary operator ne).

This docstring was copied from pandas.core.frame.DataFrame.ne.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

DataFrame of bool: Result of the comparison.

See also

DataFrame.eq: Compare DataFrames for equality elementwise.
DataFrame.ne: Compare DataFrames for inequality elementwise.
DataFrame.le: Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt: Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge: Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt: Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False

nlargest(self, n=5, columns=None, split_every=None)¶

Return the first n rows ordered by columns in descending order.

This docstring was copied from pandas.core.frame.DataFrame.nlargest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Parameters

nint

Number of rows to return.

columnslabel or list of labels

Column label(s) to order by.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

first : prioritize the first occurrence(s)
last : prioritize the last occurrence(s)
all : do not drop any duplicates, even if it means selecting more than n items.

New in version 0.24.0.

Returns

DataFrame: The first n rows ordered by the given columns in descending order.

See also

DataFrame.nsmallest: Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values: Sort DataFrame by the values.
DataFrame.head: Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nlargest to select the three rows having the largest values in column “population”.

>>> df.nlargest(3, 'population')  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

When using keep='last', ties are resolved in reverse order:

>>> df.nlargest(3, 'population', keep='last')  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

When using keep='all', all duplicate items are maintained:

>>> df.nlargest(3, 'population', keep='all')  
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN

To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nlargest(3, ['population', 'GDP'])  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

notnull(self)¶

Detect existing (non-missing) values.

This docstring was copied from pandas.core.frame.DataFrame.notnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

See also

DataFrame.notnull: Alias of notna.
DataFrame.isna: Boolean inverse of notna.
DataFrame.dropna: Omit axes labels with missing values.
notna: Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()  
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()  
0     True
1     True
2    False
dtype: bool

property npartitions¶: Return number of partitions

nsmallest(self, n=5, columns=None, split_every=None)¶

Return the first n rows ordered by columns in ascending order.

This docstring was copied from pandas.core.frame.DataFrame.nsmallest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.

Parameters

nint

Number of items to retrieve.

columnslist or str

Column name or names to order by.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

first : take the first occurrence.
last : take the last occurrence.
all : do not drop any duplicates, even if it means selecting more than n items.

New in version 0.24.0.

Returns

DataFrame

See also

DataFrame.nlargest: Return the first n rows ordered by columns in descending order.
DataFrame.sort_values: Sort DataFrame by the values.
DataFrame.head: Return the first n rows without re-ordering.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.

>>> df.nsmallest(3, 'population')  
          population  GDP alpha-2
Nauru          11300  182      NR
Tuvalu         11300   38      TV
Anguilla       11300  311      AI

When using keep='last', ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')  
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru          11300  182      NR

When using keep='all', all duplicate items are maintained:

>>> df.nsmallest(3, 'population', keep='all')  
          population  GDP alpha-2
Nauru          11300  182      NR
Tuvalu         11300   38      TV
Anguilla       11300  311      AI

To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nsmallest(3, ['population', 'GDP'])  
          population  GDP alpha-2
Tuvalu         11300   38      TV
Nauru          11300  182      NR
Anguilla       11300  311      AI
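
On partitioned data, the same rows can be found without a full sort. The sketch below is a plain-Python illustration of that idea, not Dask's actual implementation: it keeps the n smallest rows of each partition and then picks the n smallest among those candidates. The helper name nsmallest_partitioned and the tuple-based rows are assumptions made for the example.

```python
import heapq


def nsmallest_partitioned(partitions, n, key):
    """Return the n smallest rows across all partitions (hypothetical helper)."""
    candidates = []
    for part in partitions:
        # Step 1: only the n smallest rows of a partition can matter globally.
        candidates.extend(heapq.nsmallest(n, part, key=key))
    # Step 2: select the n smallest rows among the per-partition candidates.
    return heapq.nsmallest(n, candidates, key=key)


rows = [("Italy", 59000000), ("France", 65000000), ("Malta", 434000),
        ("Maldives", 434000), ("Brunei", 434000), ("Iceland", 337000),
        ("Nauru", 11300), ("Tuvalu", 11300), ("Anguilla", 11300)]
partitions = [rows[0:3], rows[3:6], rows[6:9]]


def by_population(row):
    return row[1]


result = nsmallest_partitioned(partitions, 3, by_population)
# Reference answer: sort everything, then take the head.
expected = sorted(rows, key=by_population)[:3]
```

Both approaches return the same three rows here; the partitioned version never has to sort more than one partition's worth of data at a time.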

nunique_approx(self, split_every=None)¶

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters

split_every: int, optional: Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns

a float representing the approximate number of elements
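
The quoted error figure is consistent with the usual HyperLogLog rule of thumb, where the typical relative error is about 1.04 / sqrt(m) for m registers. The short calculation below is only a sanity check of that figure; the choice of 16 register bits (65536 registers) is an assumption made because it reproduces the number above, and hll_standard_error is a hypothetical helper name.

```python
import math


def hll_standard_error(register_bits):
    """Typical relative error of HyperLogLog with 2**register_bits registers."""
    registers = 2 ** register_bits
    return 1.04 / math.sqrt(registers)


# Assumption: 16 register bits, i.e. 65536 registers.
error_percent = 100 * hll_standard_error(16)
print(f"{error_percent:.3f}%")  # prints 0.406%
```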

property partitions¶

Slice dataframe by partitions

This allows partition-wise slicing of a Dask DataFrame. You can use normal NumPy-style slicing, but rather than slicing elements of an array you slice along partitions. For example, df.partitions[:5] produces a new Dask DataFrame of the first five partitions.

Returns

A Dask DataFrame

Examples

>>> df.partitions[0]  
>>> df.partitions[:3]  
>>> df.partitions[::10]  
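
To make the slicing semantics concrete, here is a toy model written in plain Python. It is a sketch under stated assumptions rather than Dask code: PartitionedFrame and _PartitionIndexer are hypothetical classes, and each partition is just a list. The point it illustrates is that indexing selects whole partitions and always yields a new collection, even when a single integer is used.

```python
class PartitionedFrame:
    """Toy stand-in for a partitioned collection (hypothetical)."""

    def __init__(self, parts):
        self.parts = list(parts)

    @property
    def npartitions(self):
        return len(self.parts)

    @property
    def partitions(self):
        return _PartitionIndexer(self)


class _PartitionIndexer:
    """Supports frame.partitions[...] with integers and slices."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, index):
        selected = self._frame.parts[index]
        if isinstance(index, int):
            # A single integer picks one partition; wrap it so the result
            # is still a collection rather than a bare partition.
            selected = [selected]
        return PartitionedFrame(selected)


# 20 partitions holding two values each.
frame = PartitionedFrame([[i, i + 1] for i in range(0, 40, 2)])
first = frame.partitions[0]
head = frame.partitions[:3]
strided = frame.partitions[::10]
```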

persist(self, **kwargs)¶

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The behavior of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as the dask.distributed scheduler does, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters

scheduler: string, optional: Which scheduler to use, such as “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph: bool, optional: If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
**kwargs: Extra keywords to forward to the scheduler function.

Returns

New dask collections backed by in-memory data

See also

dask.base.persist
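
The difference between the two scheduler behaviours described above can be pictured with the standard library alone. The following is an analogy using concurrent.futures, not Dask's scheduler machinery: persist_blocking and persist_async are hypothetical helpers, and a dict of callables stands in for a task graph.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def persist_blocking(tasks):
    """Run every task now; the result maps keys to concrete values."""
    return {key: task() for key, task in tasks.items()}


def persist_async(tasks, executor):
    """Submit every task and return at once; the result maps keys to futures."""
    return {key: executor.submit(task) for key, task in tasks.items()}


tasks = {"a": lambda: 1 + 1, "b": lambda: 2 * 3}

# Blocking scheduler: the call only returns once the values exist.
concrete = persist_blocking(tasks)

# Asynchronous scheduler: the call returns immediately with futures
# that are computed in the background.
with ThreadPoolExecutor(max_workers=2) as executor:
    pending = persist_async(tasks, executor)
    all_futures = all(isinstance(f, Future) for f in pending.values())
    resolved = {key: f.result() for key, f in pending.items()}
```

In both cases the keys of the graph are unchanged; only what they point to differs (finished values versus futures).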

pipe(self, func, *args, **kwargs)¶

Apply func(self, *args, **kwargs).

This docstring was copied from pandas.core.frame.DataFrame.pipe.

Some inconsistencies with the Dask version may exist.

Parameters

func: function: Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.
args: iterable, optional: Positional arguments passed into func.
kwargs: mapping, optional: A dictionary of keyword arguments passed into func.

Returns

object: the return type of func.

See also

DataFrame.apply
DataFrame.applymap
Series.map

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
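
The two calling conventions can be sketched in a few lines of plain Python. This is a minimal illustration of the dispatch logic described above, not the pandas or Dask source; the Frame class and the toy functions h, g and f are assumptions introduced for the example.

```python
class Frame:
    """Minimal container with a pandas-style pipe (illustrative only)."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        if isinstance(func, tuple):
            # (callable, data_keyword): pass self under that keyword.
            func, data_keyword = func
            if data_keyword in kwargs:
                raise ValueError(
                    f"{data_keyword} is both the pipe target and a keyword argument"
                )
            kwargs[data_keyword] = self
            return func(*args, **kwargs)
        # Plain callable: pass self as the first positional argument.
        return func(self, *args, **kwargs)


def h(frame):
    return Frame(v + 1 for v in frame.values)


def g(frame, arg1):
    return Frame(v * arg1 for v in frame.values)


def f(arg1, arg2, arg3):
    # The data arrives through the keyword ``arg2``.
    return Frame(v - arg1 + arg3 for v in arg2.values)


df = Frame([1, 2, 3])
nested = f(arg1=2, arg2=g(h(df), arg1=2), arg3=10)
chained = df.pipe(h).pipe(g, arg1=2).pipe((f, "arg2"), arg1=2, arg3=10)
```

The nested call and the chained call compute the same thing; the chained form simply reads in execution order.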

pivot_table(self, index=None, columns=None, values=None, aggfunc='mean')¶

Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, values and aggfunc must all be scalar.

Parameters

values: scalar: column to aggregate
index: scalar: column to be index
columns: scalar: column to be columns
aggfunc: {‘mean’, ‘sum’, ‘count’}, default ‘mean’

Returns

table: DataFrame
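
What the three supported aggregations compute can be shown with a small plain-Python sketch. It models rows as dicts and is not how Dask implements the method; the function pivot_table below, the AGGREGATIONS table and the sample data are all assumptions for illustration.

```python
from collections import defaultdict

AGGREGATIONS = {
    "mean": lambda values: sum(values) / len(values),
    "sum": sum,
    "count": len,
}


def pivot_table(rows, index, columns, values, aggfunc="mean"):
    """Group rows by (index, columns) and aggregate the values column."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[index], row[columns])].append(row[values])

    aggregate = AGGREGATIONS[aggfunc]
    table = defaultdict(dict)
    for (row_label, column_label), cell_values in groups.items():
        table[row_label][column_label] = aggregate(cell_values)
    return dict(table)


rows = [
    {"store": "north", "fruit": "apple", "sales": 10},
    {"store": "north", "fruit": "apple", "sales": 30},
    {"store": "north", "fruit": "pear", "sales": 5},
    {"store": "south", "fruit": "apple", "sales": 7},
]
mean_table = pivot_table(rows, index="store", columns="fruit", values="sales")
count_table = pivot_table(rows, "store", "fruit", "sales", aggfunc="count")
```

In this sketch the output columns are discovered while scanning the data. A lazy, partitioned collection has to know them up front, which is why the target column must have category dtype.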

pop(self, item)¶

Return item and drop from frame. Raise KeyError if not found.

This docstring was copied from pandas.core.frame.DataFrame.pop.

Some inconsistencies with the Dask version may exist.

Parameters

item: str: Label of column to be popped.

Returns

Series

Examples

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),  
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df  
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

>>> df.pop('class')  
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df  
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

pow(self, other, axis='columns', level=None, fill_value=None)¶

Get Exponential power of dataframe and other, element-wise (binary operator pow).

This docstring was copied from pandas.core.frame.DataFrame.pow.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
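
The examples above are shared by all the flexible arithmetic wrappers and do not call pow itself. As a hedged illustration of how fill_value and label alignment interact for pow, the sketch below models a single column as a dict from label to value, with None standing in for NaN. It is a simplification written for this page, not the pandas or Dask implementation, and flex_pow is a hypothetical name.

```python
def flex_pow(left, right, fill_value=None):
    """Element-wise left ** right over the union of labels (None = missing)."""
    result = {}
    # Mismatched labels are unioned together.
    for label in sorted(set(left) | set(right)):
        base = left.get(label)
        exponent = right.get(label)
        if base is None and exponent is None:
            # Missing on both sides stays missing, even with fill_value.
            result[label] = None
            continue
        if fill_value is not None:
            base = fill_value if base is None else base
            exponent = fill_value if exponent is None else exponent
        if base is None or exponent is None:
            result[label] = None
        else:
            result[label] = base ** exponent
    return result


left = {"circle": 2, "triangle": 3, "square": None}
right = {"circle": 3, "rectangle": 2, "square": None}

without_fill = flex_pow(left, right)
with_fill = flex_pow(left, right, fill_value=1)
```

Without fill_value only “circle” has data on both sides; with fill_value=1 the one-sided labels are computed as 1 ** 2 and 3 ** 1, while “square” remains missing.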

prod(self, axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶

Return the product of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters

axis: {index (0), columns (1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
level: int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only: bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count: int, default 0: The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0

>>> pd.Series([np.nan]).prod(min_count=1)  
nan
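
The interaction between skipna and min_count shown above can be restated as a few lines of plain Python. This is a sketch of the documented semantics for a flat list of floats, not the pandas implementation; the function prod here is a hypothetical stand-in.

```python
import math


def prod(values, skipna=True, min_count=0):
    """Product of values; NaN if fewer than min_count valid values remain."""
    values = list(values)
    if skipna:
        # Drop missing values before counting and multiplying.
        values = [v for v in values if not math.isnan(v)]
    if len(values) < min_count:
        return math.nan
    # The empty product is 1.0 by convention.
    return math.prod(values, start=1.0)


empty_default = prod([])
empty_strict = prod([], min_count=1)
all_na_default = prod([math.nan])
all_na_strict = prod([math.nan], min_count=1)
mixed = prod([2.0, math.nan, 3.0])
```

Because NaN values are removed first, an all-NA input looks exactly like an empty one by the time min_count is checked.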

quantile(self, q=0.5, axis=0, method='default')¶

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters

q: list/array of floats, default 0.5 (50%): Iterable of numbers ranging from 0 to 1 for the desired quantiles
axis: {0, 1, ‘index’, ‘columns’} (default 0): 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
method: {‘default’, ‘tdigest’, ‘dask’}, optional: What method to use. By default will use dask’s internal custom algorithm ('dask'). If set to 'tdigest' will use tdigest for floats and ints and fall back to 'dask' otherwise.

query(self, expr, **kwargs)¶

Filter dataframe with complex expression

Blocked version of pd.DataFrame.query

This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr, which will use multiple threads itself. We recommend that you set numexpr to use a single thread:

import numexpr
numexpr.set_num_threads(1)

See also

pandas.DataFrame.query

radd(self, other, axis='columns', level=None, fill_value=None)¶

Get Addition of dataframe and other, element-wise (binary operator radd).

This docstring was copied from pandas.core.frame.DataFrame.radd.

Some inconsistencies with the Dask version may exist.

Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, add.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

random_split(self, frac, random_state=None, shuffle=False)¶

Pseudorandomly split dataframe into different pieces row-wise

Parameters

frac: list: List of floats that should sum to one.
random_state: int or np.random.RandomState: If int, create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.
shuffle: bool, default False: If set to True, the dataframe is shuffled (within partition) before the split.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
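
One common way to implement such a split is to draw a uniform random number per row and assign the row to the piece whose cumulative fraction range contains it. The sketch below shows that approach in plain Python under stated assumptions: it uses random.Random rather than a NumPy RandomState, works on a flat sequence instead of partitions, and random_split here is a hypothetical stand-in rather than Dask's code. Note that with this approach the piece sizes only approximate the requested fractions.

```python
import random
from itertools import accumulate


def random_split(rows, frac, random_state=None):
    """Pseudorandomly assign each row to one of len(frac) pieces."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("frac should sum to 1")
    rng = random.Random(random_state)
    boundaries = list(accumulate(frac))  # e.g. [0.8, 0.9, 1.0]
    pieces = [[] for _ in frac]
    for row in rows:
        draw = rng.random()
        # Pick the first piece whose cumulative boundary exceeds the draw.
        # If rounding leaves the last boundary just below 1, the loop ends
        # without a break and the row falls into the last piece.
        for position, upper in enumerate(boundaries):
            if draw < upper:
                break
        pieces[position].append(row)
    return pieces


a, b, c = random_split(range(1000), [0.8, 0.1, 0.1], random_state=123)
```

Reusing the same random_state reproduces the same assignment, and every row lands in exactly one piece.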

rdiv(self, other, axis='columns', level=None, fill_value=None)¶

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.frame.DataFrame.rdiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

reduction(self, chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶

Generic row-wise reductions.

Parameters

chunk: callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate: callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

scalar: Input is a Series, with one row per partition.
Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine: callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token: str, optional

The name to use for the output keys.

split_every: int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs: dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs: dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs: dict, optional

Keyword arguments to pass on to combine only.

**kwargs:

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
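
How chunk, combine, aggregate and split_every fit together can be sketched without Dask. The code below is a simplified model written for illustration: partitions are plain lists, intermediate results are passed around as Python lists instead of concatenated pandas objects, and tree_reduce is a hypothetical helper rather than part of the Dask API.

```python
def tree_reduce(partitions, chunk, aggregate=None, combine=None, split_every=8):
    """Apply chunk per partition, then reduce the results in a tree."""
    aggregate = aggregate or chunk   # aggregate defaults to chunk
    combine = combine or aggregate   # combine defaults to aggregate
    if split_every is not False and split_every < 2:
        raise ValueError("split_every must be False or an integer >= 2")

    results = [chunk(part) for part in partitions]
    if split_every is not False:
        # Merge groups of split_every intermediate results until few enough
        # remain for the final aggregation.
        while len(results) > split_every:
            results = [combine(results[i:i + split_every])
                       for i in range(0, len(results), split_every)]
    return aggregate(results)


values = list(range(50))
partitions = [values[i:i + 13] for i in range(0, 50, 13)]  # 4 partitions

# Count rows: count within each partition, then sum the counts.
row_count = tree_reduce(partitions, chunk=len, aggregate=sum, split_every=2)


def sum_and_count(part):
    return (len(part), sum(part))


def add_pairs(pairs):
    return (sum(p[0] for p in pairs), sum(p[1] for p in pairs))


# split_every=False: no tree, all intermediates go straight to aggregate.
count, total = tree_reduce(partitions, sum_and_count,
                           aggregate=add_pairs, split_every=False)
```

The results match the examples above: 50 rows, and a sum of 1225 for the values 0 to 49.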

rename(self, index=None, columns=None)¶

Alter axes labels.

This docstring was copied from pandas.core.frame.DataFrame.rename.

Some inconsistencies with the Dask version may exist.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters

mapper: dict-like or function (Not supported in Dask): Dict-like or functions transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
index: dict-like or function (Not supported in Dask): Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).
columns: dict-like or function: Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).
axis: int or str (Not supported in Dask): Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
copy: bool, default True (Not supported in Dask): Also copy underlying data.
inplace: bool, default False (Not supported in Dask): Whether to return a new DataFrame. If True then value of copy is ignored.
level: int or level name, default None (Not supported in Dask): In case of a MultiIndex, only rename labels in the specified level.
errors: {‘ignore’, ‘raise’}, default ‘ignore’ (Not supported in Dask): If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns

DataFrame: DataFrame with the renamed axis labels.

Raises

KeyError: If any of the labels is not found in the selected axis and “errors=’raise’”.

See also

DataFrame.rename_axis: Set the name of the axis.

Examples

DataFrame.rename supports two calling conventions

(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Rename columns using a mapping:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  
>>> df.rename(columns={"A": "a", "B": "c"})  
   a  c
0  1  4
1  2  5
2  3  6

Rename index using a mapping:

>>> df.rename(index={0: "x", 1: "y", 2: "z"})  
   A  B
x  1  4
y  2  5
z  3  6

Cast index labels to a different type:

>>> df.index  
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index  
Index(['0', '1', '2'], dtype='object')

>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")  
Traceback (most recent call last):
KeyError: ['C'] not found in axis

Using axis-style parameters

>>> df.rename(str.lower, axis='columns')  
   a  b
0  1  4
1  2  5
2  3  6

>>> df.rename({1: 2, 2: 4}, axis='index')  
   A  B
0  1  4
2  2  5
4  3  6
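
The mapping rules described in this docstring (labels missing from a dict are left as-is, extra keys are ignored unless errors='raise', and a function is applied to every label) can be summarised in a short plain-Python sketch. It operates on a list of labels and is only an illustration of the pandas semantics copied above; rename_labels is a hypothetical helper.

```python
def rename_labels(labels, mapper, errors="ignore"):
    """Return new labels after applying a dict-like or callable mapper."""
    if callable(mapper):
        return [mapper(label) for label in labels]
    missing = [key for key in mapper if key not in labels]
    if missing and errors == "raise":
        raise KeyError(f"{missing} not found in axis")
    # Labels without an entry in the mapping are kept unchanged.
    return [mapper.get(label, label) for label in labels]


columns = ["A", "B"]
renamed = rename_labels(columns, {"A": "a", "B": "c"})
partial = rename_labels(columns, {"A": "a", "C": "c"})
lowered = rename_labels(columns, str.lower)
```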

repartition(self, divisions=None, npartitions=None, partition_size=None, freq=None, force=False)¶

Repartition dataframe along new divisions

Parameters

divisions: list, optional: List of partitions to be used. Only used if npartitions and partition_size aren’t specified. For convenience, if given an integer this will defer to npartitions, and if given a string it will defer to partition_size (see below).
npartitions: int, optional: Number of partitions of output. Only used if partition_size isn’t specified.
partition_size: int or string, optional: Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified, npartitions and divisions will be ignored.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.

freq: str, pd.Timedelta: A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.
force: bool, default False: Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  
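
The argument rules above (an integer or string passed as divisions is redirected, and exactly one strategy must remain) can be expressed as a small validation routine. The following is a sketch of that logic under stated assumptions, not Dask's source: resolve_repartition_args is a hypothetical helper, and it returns the chosen strategy instead of repartitioning anything.

```python
def resolve_repartition_args(divisions=None, npartitions=None,
                             partition_size=None, freq=None):
    """Return (strategy_name, value) for the single strategy requested."""
    # For convenience, an integer means npartitions and a string means
    # partition_size.
    if isinstance(divisions, int):
        npartitions, divisions = divisions, None
    elif isinstance(divisions, str):
        partition_size, divisions = divisions, None

    candidates = (
        ("divisions", divisions),
        ("npartitions", npartitions),
        ("partition_size", partition_size),
        ("freq", freq),
    )
    given = [(name, value) for name, value in candidates if value is not None]
    if len(given) != 1:
        raise ValueError(
            "Specify exactly one of divisions, npartitions, "
            "partition_size, or freq"
        )
    return given[0]


by_count = resolve_repartition_args(npartitions=10)
by_int = resolve_repartition_args(divisions=10)
by_size = resolve_repartition_args(divisions="100MB")
by_divisions = resolve_repartition_args(divisions=[0, 5, 10, 20])
```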

replace(self, to_replace=None, value=None, regex=False)¶

Replace values given in to_replace with value.

This docstring was copied from pandas.core.frame.DataFrame.replace.

Some inconsistencies with the Dask version may exist.

Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters

to_replace: str, regex, list, dict, Series, int, float, or None

How to find the values that will be replaced.

numeric, str or regex:
- numeric: numeric values equal to to_replace will be replaced with value
- str: string exactly matching to_replace will be replaced with value
- regex: regexes matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if regex=True then all of the strings in both lists will be interpreted as regexes; otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
- For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
- For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
- This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

value: scalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplace: bool, default False (Not supported in Dask)

If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.

limit: int, default None (Not supported in Dask)

Maximum size gap to forward or backward fill.

regex: bool or same types as to_replace, default False

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method: {‘pad’, ‘ffill’, ‘bfill’, None} (Not supported in Dask)

The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

Returns

DataFrame: Object after replacement.

Raises

AssertionError

If regex is not a bool and to_replace is not None.

TypeError

If to_replace is a dict and value is not a list, dict, ndarray, or Series
If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced

ValueError

If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna: Fill NA values.
DataFrame.where: Replace values based on boolean condition.
Series.str.replace: Simple string replacement.

Notes

Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
When a dict is used as the to_replace value, the keys in the dict act as the to_replace part and the values in the dict act as the value parameter.
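The note that regular expressions only substitute on strings can be illustrated with a small pandas sketch. The column names `as_float` and `as_text` and the replacement string are made up for the illustration.

```python
import pandas as pd

# The same numbers stored once as floats and once as strings.
df = pd.DataFrame({"as_float": [1.5, 2.5], "as_text": ["1.5", "2.5"]})

# The pattern matches the text "1.5"; the float column is not touched.
replaced = df.replace(to_replace=r"^1\.5$", value="hit", regex=True)
print(replaced)
```

Only the string column changes; to replace the float value itself, a numeric to_replace (without regex) would be needed.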

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([0, 1, 2, 3, 4])  
>>> s.replace(0, 5)  
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)  
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)  
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])  
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')  
0    0
1    3
2    3
3    3
4    4
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})  
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)  
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})  
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)  
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)  
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')  
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})  
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')  
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace parameter must match the data type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],  
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises  
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a TypeError because one of the dict keys is not of the correct type for replacement.

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  

When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})  
0      10
1    None
2    None
3       b
4    None
dtype: object

When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):

>>> s.replace('a', None)  
0    10
1    10
2    10
3     b
4     b
dtype: object

resample(self, rule, closed=None, label=None)¶

Resample time-series data.

This docstring was copied from pandas.core.frame.DataFrame.resample.

Some inconsistencies with the Dask version may exist.

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters

rule: DateOffset, Timedelta or str: The offset string or object representing target conversion.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask): Which axis to use for up- or down-sampling. For Series this will default to 0, i.e. along the rows. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed: {‘right’, ‘left’}, default None: Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label: {‘right’, ‘left’}, default None: Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention: {‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’ (Not supported in Dask): For PeriodIndex only, controls whether to use the start or end of rule.
kind: {‘timestamp’, ‘period’}, optional, default None (Not supported in Dask): Pass ‘timestamp’ to convert the resulting index to a DatetimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
loffset: timedelta, default None (Not supported in Dask): Adjust the resampled time labels.
base: int, default 0 (Not supported in Dask): For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.
on: str, optional (Not supported in Dask): For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
level: str or int, optional (Not supported in Dask): For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

Returns

Resampler object

See also

groupby: Group by mapping, function, label, or list of labels.
Series.resample: Resample a Series.
DataFrame.resample: Resample a DataFrame.

Notes

See the pandas time series user guide for more.

To learn more about the offset strings, see the offset aliases section of that guide.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):  
...     return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  
...                                                   freq='Q',
...                                                   periods=4))
>>> q  
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df = pd.DataFrame(d)  
>>> df['week_starting'] = pd.date_range('01/01/2018',  
...                                     periods=8,
...                                     freq='W')
>>> df  
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df2 = pd.DataFrame(d2,  
...                    index=pd.MultiIndex.from_product([days,
...                                                     ['morning',
...                                                      'afternoon']]
...                                                     ))
>>> df2  
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

reset_index(self, drop=False)¶

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters

drop: boolean, default False: Do not try to insert index into dataframe columns.

rfloordiv(self, other, axis='columns', level=None, fill_value=None)¶

Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).

This docstring was copied from pandas.core.frame.DataFrame.rfloordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other // dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is floordiv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar; the operator version returns the same result as the method.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
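The examples above are shared by the flexible arithmetic wrappers and do not call rfloordiv itself. A minimal pandas sketch of the reversed operand order follows; the values are made up and avoid zeros so that division by zero does not come into play.

```python
import pandas as pd

df = pd.DataFrame({"angles": [3, 4], "degrees": [180, 360]},
                  index=["triangle", "rectangle"])

# rfloordiv puts the frame on the right-hand side: 1000 // df
reverse = df.rfloordiv(1000)

# floordiv keeps the frame on the left-hand side: df // 1000
forward = df.floordiv(1000)

print(reverse)
print(forward)
```

Because the Dask docstring is copied from pandas, the same call on a Dask DataFrame followed by compute() is expected to behave alike, although that is an assumption not checked here.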

rmod(self, other, axis='columns', level=None, fill_value=None)¶

Get Modulo of dataframe and other, element-wise (binary operator rmod).

This docstring was copied from pandas.core.frame.DataFrame.rmod.

Some inconsistencies with the Dask version may exist.

Equivalent to other % dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is mod.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar; the operator version returns the same result as the method.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
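None of the shared examples above calls rmod directly. A short pandas sketch with made-up values shows how the reversed form differs from mod:

```python
import pandas as pd

df = pd.DataFrame({"angles": [3, 4], "degrees": [180, 360]},
                  index=["triangle", "rectangle"])

# rmod puts the frame on the right-hand side: 1000 % df
reverse = df.rmod(1000)

# mod keeps the frame on the left-hand side: df % 1000
forward = df.mod(1000)

print(reverse)
print(forward)
```

Here every value in df is smaller than 1000, so the forward version returns the frame unchanged while the reversed version returns the remainders of 1000.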

rmul(self, other, axis='columns', level=None, fill_value=None)¶

Get Multiplication of dataframe and other, element-wise (binary operator rmul).

This docstring was copied from pandas.core.frame.DataFrame.rmul.

Some inconsistencies with the Dask version may exist.

Equivalent to other * dataframe, but with support for substituting a fill_value for missing data in one of the inputs. The non-reversed counterpart is mul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: {0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level: int or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar; the operator version returns the same result as the method.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
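As a small pandas sketch of rmul itself, the following uses made-up values; `other` is a hypothetical second frame that lacks the degrees column, so fill_value supplies the missing operand.

```python
import pandas as pd

df = pd.DataFrame({"angles": [3, 4], "degrees": [180, 360]},
                  index=["triangle", "rectangle"])

# Multiplication commutes, so rmul(2) matches 2 * df.
doubled = df.rmul(2)

# A frame without a 'degrees' column; fill_value=1 stands in for the
# missing entries, leaving the 'degrees' values numerically unchanged.
other = pd.DataFrame({"angles": [10, 10]}, index=["triangle", "rectangle"])
filled = df.rmul(other, fill_value=1)

print(doubled)
print(filled)
```

Without fill_value, the degrees column of the second result would be NaN, as in the df * other example above.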

rolling(self, window, min_periods=None, center=False, win_type=None, axis=0)¶

Provides rolling transformations.

Parameters

window: int, str, offset: Size of the moving window. This is the number of observations used for calculating the statistic. When not using a DatetimeIndex, the window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.

Changed in version 0.15.0: Now accepts offsets and string offset aliases
min_periods: int, default None: Minimum number of observations in window required to have a value (otherwise result is NA).
center: boolean, default False: Set the labels at the center of the window.
win_type: string, default None: Provide a window type. The recognized window types are identical to pandas.
axis: int, default 0

Returns

a Rolling object on which to call a method to compute a statistic

round(self, decimals=0)¶

Round a DataFrame to a variable number of decimal places.

This docstring was copied from pandas.core.frame.DataFrame.round.

Some inconsistencies with the Dask version may exist.

Parameters

decimals: int, dict, Series: Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
*args: Additional keywords have no effect but might be accepted for compatibility with numpy.
**kwargs: Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns

DataFrame: A DataFrame with the affected columns rounded to the specified number of decimal places.

See also

numpy.around: Round a numpy array to the given number of decimals.
Series.round: Round a Series to the given number of decimals.

Examples

>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],  
...                   columns=['dogs', 'cats'])
>>> df  
    dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer each column is rounded to the same number of decimal places

>>> df.round(1)  
    dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value

>>> df.round({'dogs': 1, 'cats': 0})  
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value

>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])  
>>> df.round(decimals)  
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

rpow(self, other, axis='columns', level=None, fill_value=None)¶

Get Exponential power of dataframe and other, element-wise (binary operator rpow).

This docstring was copied from pandas.core.frame.DataFrame.rpow.

Some inconsistencies with the Dask version may exist.

Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, pow.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
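The copied examples do not exercise rpow itself. A small pandas sketch of the reversed operand order, reusing the `angles` column from the examples above:

```python
import pandas as pd

df = pd.DataFrame({"angles": [0, 3, 4]},
                  index=["circle", "triangle", "rectangle"])

# rpow puts `other` on the left: 2 ** df
reversed_pow = df.rpow(2)

# pow keeps the DataFrame on the left: df ** 2
forward_pow = df.pow(2)
```

Here `reversed_pow` holds 2 raised to each angle, while `forward_pow` holds each angle squared.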

rsub(self, other, axis='columns', level=None, fill_value=None)¶

Get Subtraction of dataframe and other, element-wise (binary operator rsub).

This docstring was copied from pandas.core.frame.DataFrame.rsub.

Some inconsistencies with the Dask version may exist.

Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, sub.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

rtruediv(self, other, axis='columns', level=None, fill_value=None)¶

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.frame.DataFrame.rtruediv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
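A short pandas sketch of how fill_value interacts with the reversed division; the column name `a` and the values are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 4.0]})

# 8 / df: the missing entry stays missing.
plain = df.rtruediv(8)

# The missing entry in df is replaced by 2 before dividing, giving 8 / 2.
filled = df.rtruediv(8, fill_value=2)
```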

sample(self, n=None, frac=None, replace=False, random_state=None)¶

Random sample of items

Parameters

nint, optional: Number of items to return. Not supported by Dask; use frac instead.
fracfloat, optional: Fraction of axis items to return.
replaceboolean, optional: Sample with or without replacement. Default = False.
random_stateint or np.random.RandomState: If an int, a new RandomState is created with this value as the seed. Otherwise the sample is drawn from the passed RandomState.

See also

DataFrame.random_split
pandas.DataFrame.sample

select_dtypes(self, include=None, exclude=None)¶

Return a subset of the DataFrame’s columns based on the column dtypes.

This docstring was copied from pandas.core.frame.DataFrame.select_dtypes.

Some inconsistencies with the Dask version may exist.

Parameters

include, excludescalar or list-like: A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

Returns

DataFrame: The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises

ValueError

If both of include and exclude are empty
If include and exclude have overlapping elements
If any kind of string dtype is passed in.

Notes

To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that this will return all object dtype columns
See the numpy dtype hierarchy
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
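The selectors listed in these notes can be sketched as follows, using pandas and an illustrative frame (the column names are assumptions made for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],
    "b": [True, False],
    "c": [1.0, 2.0],
    "t": pd.to_datetime(["2000-01-01", "2000-01-02"]),
})

# 'number' matches integer and float columns, but not booleans.
numeric = df.select_dtypes(include="number")

# 'datetime' matches datetime64 columns.
datetimes = df.select_dtypes(include="datetime")

# exclude returns everything that does not match.
non_numeric = df.select_dtypes(exclude="number")
```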

Examples

>>> df = pd.DataFrame({'a': [1, 2] * 3,  
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df  
        a      b  c
0       1   True  1.0
1       2  False  2.0
2       1   True  1.0
3       2  False  2.0
4       1   True  1.0
5       2  False  2.0

>>> df.select_dtypes(include='bool')  
       b
0   True
1  False
2   True
3  False
4   True
5  False

>>> df.select_dtypes(include=['float64'])  
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0

>>> df.select_dtypes(exclude=['int'])  
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0

sem(self, axis=None, skipna=None, ddof=1, split_every=False)¶

Return unbiased standard error of the mean over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sem.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters

axis{index (0), columns (1)}
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddofint, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)

set_index(self, other, drop=True, sorted=False, npartitions=None, divisions=None, inplace=False, **kwargs)¶

Set the DataFrame index (row labels) using an existing column.

This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost: sorting a parallel dataset requires expensive shuffles. Often we set_index once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.

This function operates exactly like pandas.set_index except with different performance costs (dask dataframe set_index is much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.

In some cases we can alleviate those costs. For example, if your dataset is already sorted then we can avoid making many small pieces, and if you know good values on which to split the new index column then we can avoid the initial pass over the data. If your new index is a datetime index and your data is already sorted by day, then this entire operation can be done for free. You can control these options with the following parameters.

Parameters

df: Dask DataFrame
index: string or Dask Series
npartitions: int, None, or ‘auto’: The ideal number of output partitions. If None use the same as the input. If ‘auto’ then decide by memory use.
shuffle: string, optional: Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.
sorted: bool, optional: If the index column is already sorted in increasing order. Defaults to False
divisions: list, optional: Known values on which to separate index values of the partitions (see https://docs.dask.org/en/latest/dataframe-design.html#partitions). Defaults to computing this with a single pass over the data. Note that if sorted=True, specified divisions are assumed to match the existing partitions in the data. If sorted=False, you should leave divisions empty and call repartition after set_index.
inplace: bool, optional: Modifying the DataFrame in place is not supported by Dask. Defaults to False.
compute: bool: Whether or not to trigger an immediate computation. Defaults to False. Note that even if you set compute=False, an immediate computation will still be triggered if divisions is None.

Examples

>>> df2 = df.set_index('x')  
>>> df2 = df.set_index(df.x)
>>> df2 = df.set_index(df.timestamp, sorted=True)

A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is separated

>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)  

property shape¶

Return a tuple representing the dimensionality of the DataFrame.

The number of rows is a Delayed result. The number of columns is a concrete integer.

Examples

>>> df.shape
(Delayed('int-07f06075-5ecc-4d77-817e-63c69a9188a8'), 2)

shift(self, periods=1, freq=None, axis=0)¶

Shift index by desired number of periods with an optional time freq.

This docstring was copied from pandas.core.frame.DataFrame.shift.

Some inconsistencies with the Dask version may exist.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq.

Parameters

periodsint: Number of periods to shift. Can be positive or negative.
freqDateOffset, tseries.offsets, timedelta, or str, optional: Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
axis{0 or ‘index’, 1 or ‘columns’, None}, default None: Shift direction.
fill_valueobject, optional (Not supported in Dask): The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.

Changed in version 0.24.0.

Returns

DataFrame: Copy of input object, shifted.

See also

Index.shift: Shift values of Index.
DatetimeIndex.shift: Shift values of DatetimeIndex.
PeriodIndex.shift: Shift values of PeriodIndex.
tshift: Shift the time index, using the index’s frequency if available.

Examples

>>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],  
...                    'Col2': [13, 23, 18, 33, 48],
...                    'Col3': [17, 27, 22, 37, 52]})

>>> df.shift(periods=3)  
   Col1  Col2  Col3
0   NaN   NaN   NaN
1   NaN   NaN   NaN
2   NaN   NaN   NaN
3  10.0  13.0  17.0
4  20.0  23.0  27.0

>>> df.shift(periods=1, axis='columns')  
   Col1  Col2  Col3
0   NaN  10.0  13.0
1   NaN  20.0  23.0
2   NaN  15.0  18.0
3   NaN  30.0  33.0
4   NaN  45.0  48.0

>>> df.shift(periods=3, fill_value=0)  
   Col1  Col2  Col3
0     0     0     0
1     0     0     0
2     0     0     0
3    10    13    17
4    20    23    27

shuffle(self, on, npartitions=None, max_branch=None, shuffle=None, ignore_index=False, compute=None)¶

Rearrange DataFrame into new partitions

Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.

Parameters

on: str, list of str, or Series, Index, or DataFrame: Column(s) or index to be used to map rows to output partitions
npartitions: int, optional: Number of partitions of output. Partition count will not be changed by default.
max_branch: int, optional: The maximum number of splits per input partition. Used within the staged shuffling algorithm.
shuffle: {‘disk’, ‘tasks’}, optional: Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.
ignore_index: bool, default False: Ignore index during shuffle. If True, performance may improve, but index values will not be preserved.
compute: bool: Whether or not to trigger an immediate computation. Defaults to False.

Notes

This does not preserve a meaningful index/partitioning scheme. This is not deterministic if done in parallel.

Examples

>>> df = df.shuffle(df.columns[0])  

property size¶

Size of the Series or DataFrame as a Delayed object.

Examples

>>> series.size  
dd.Scalar<size-ag..., dtype=int64>

squeeze(self, axis=None)¶

Squeeze 1 dimensional axis objects into scalars.

This docstring was copied from pandas.core.frame.DataFrame.squeeze.

Some inconsistencies with the Dask version may exist.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters

axis{0 or ‘index’, 1 or ‘columns’, None}, default None: A specific axis to squeeze. By default, all length-1 axes are squeezed.

Returns

DataFrame, Series, or scalar: The projection after squeezing axis or all the axes.

See also

Series.iloc: Integer-location based indexing for selecting scalars.
DataFrame.iloc: Integer-location based indexing for selecting Series.
Series.to_frame: Inverse of DataFrame.squeeze for a single-column DataFrame.

Examples

>>> primes = pd.Series([2, 3, 5, 7])  

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]  
>>> even_primes  
0    2
dtype: int64

>>> even_primes.squeeze()  
2

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]  
>>> odd_primes  
1    3
2    5
3    7
dtype: int64

>>> odd_primes.squeeze()  
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  
>>> df  
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[['a']]  
>>> df_a  
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze('columns')  
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ['a']]  
>>> df_0a  
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze('rows')  
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()  
1

std(self, axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶

Return sample standard deviation over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.std.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters

axis{index (0), columns (1)}
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddofint, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)

sub(self, other, axis='columns', level=None, fill_value=None)¶

Get Subtraction of dataframe and other, element-wise (binary operator sub).

This docstring was copied from pandas.core.frame.DataFrame.sub.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

sum(self, axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶

Return the sum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sum.

Some inconsistencies with the Dask version may exist.

This is equivalent to the method numpy.sum.

Parameters

axis{index (0), columns (1)}: Axis for the function to be applied on.
skipnabool, default True: Exclude NA/null values when computing the result.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_countint, default 0: The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()  
14

Sum using level names, as well as indices.

>>> s.sum(level='blooded')  
blooded
warm    6
cold    8
Name: legs, dtype: int64

>>> s.sum(level=0)  
blooded
warm    6
cold    8
Name: legs, dtype: int64

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([]).sum()  # min_count=0 is the default  
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([]).sum(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()  
0.0

>>> pd.Series([np.nan]).sum(min_count=1)  
nan
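The interaction between skipna and min_count shown above can be restated as a small plain-Python sketch. The helper `sum_with_min_count` is a hypothetical name used only for illustration; it models the documented rule and is not how pandas or Dask implement it.

```python
import math


def sum_with_min_count(values, skipna=True, min_count=0):
    """Hypothetical helper modelling the documented skipna/min_count rule."""
    # Keep only the non-NA values.
    valid = [v for v in values if not math.isnan(v)]
    # With skipna=False, a single NA makes the whole result NA.
    if not skipna and len(valid) < len(values):
        return math.nan
    # Fewer valid values than min_count also gives NA.
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))


empty_default = sum_with_min_count([])
empty_strict = sum_with_min_count([], min_count=1)
all_na_default = sum_with_min_count([math.nan])
all_na_strict = sum_with_min_count([math.nan], min_count=1)
mixed = sum_with_min_count([1.0, math.nan, 2.0], min_count=2)
```

Under this model an empty input and an all-NA input behave identically, which is the point the examples above make.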

tail(self, n=5, compute=True)¶

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
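The consequence of this caveat can be seen with a toy model in which partitions are plain Python lists. The function `tail_of_last_partition` and the sample data are assumptions made for illustration; this is not Dask's implementation.

```python
def tail_of_last_partition(partitions, n=5):
    """Toy model: take the last n rows from the final partition only."""
    return partitions[-1][-n:]


# Ten rows split unevenly across three partitions (hypothetical data).
partitions = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
rows = tail_of_last_partition(partitions, n=5)
```

Here the last partition holds only two rows, so asking for five rows yields two. If the final partition may be short, it can help to repartition first or to request fewer rows.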

to_bag(self, index=False)¶

Create Dask Bag from a Dask DataFrame

Parameters

indexbool, optional: If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

Examples

>>> bag = df.to_bag()  

to_csv(self, filename, **kwargs)¶

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, …

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))

>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'

>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  

/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...
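The ordering requirement on name_function is easy to violate with plain integers converted to strings, because '10' sorts before '2'. A minimal sketch of a zero-padded alternative follows; `padded_name` and its `width` parameter are hypothetical names chosen for this example.

```python
def padded_name(i, width=4):
    """Hypothetical name_function: zero-pad so string order matches integer order."""
    return str(i).zfill(width)


indices = list(range(12))
plain = [str(i) for i in indices]
padded = [padded_name(i) for i in indices]

# True only when the strings sort in the same order as the partition indices.
plain_is_ordered = plain == sorted(plain)
padded_is_ordered = padded == sorted(padded)
```

With twelve partitions the unpadded names fall out of order while the padded names keep it. ISO-formatted dates, as in the example above, also sort correctly.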

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 

Parameters

dfdask.DataFrame: Data to save
filenamestring: Path glob indicating the naming scheme for the output files
single_filebool, default False: Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file. Note that not all filesystems support the append mode and thus the single file mode, especially on cloud storage systems such as S3 or GCS. A warning will be issued when writing to a file that is not backed by a local filesystem.
encodingstring, optional: A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
modestr: Python write mode, default ‘w’
name_functioncallable, default None: Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions. Not supported when single_file is True.
compressionstring, optional: a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
computebool: If true, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.
storage_optionsdict: Parameters passed on to the backend filesystem class.
header_first_partition_onlyboolean, default None: If set to True, only write the header row in the first output file. By default, headers are written to all partitions under the multiple file mode (single_file is False) and written only once under the single file mode (single_file is True). It must not be False under the single file mode.
compute_kwargsdict, optional: Options to be passed in to the compute method
kwargsdict, optional: Additional parameters to pass to pd.DataFrame.to_csv()

Returns

The names of the files written, if they were computed right away.
Otherwise, the delayed tasks associated with writing the files.

Raises

ValueError: If header_first_partition_only is set to False or name_function is specified when single_file is True.

to_dask_array(self, lengths=None)¶

Convert a dask DataFrame to a dask array.

Parameters

lengthsbool or Sequence of ints, optional

How to determine the chunk sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.

True : immediately compute the length of each partition
Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.

to_delayed(self, optimize_graph=True)¶

Convert into a list of dask.delayed objects, one per partition.

Parameters

optimize_graphbool, optional: If True [default], the graph is optimized before converting into dask.delayed objects.

See also

dask.dataframe.from_delayed

Examples

>>> partitions = df.to_delayed()  

to_hdf(self, path_or_buf, key, mode='a', append=False, **kwargs)¶

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.
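The substitution rule for * can be modelled in a few lines of plain Python. The helpers `expand_pattern` and `by_day` are hypothetical and only mimic the described naming behaviour; they do not call Dask.

```python
from datetime import date, timedelta


def expand_pattern(pattern, npartitions, name_function=str):
    """Hypothetical helper: replace '*' with one name per partition index."""
    return [pattern.replace("*", name_function(i)) for i in range(npartitions)]


def by_day(i):
    """Hypothetical name_function mapping a partition index to an ISO date."""
    return str(date(2000, 1, 1) + timedelta(days=i))


default_names = expand_pattern("output-*.hdf", 3)
dated_names = expand_pattern("*.hdf", 2, by_day)
```

The same model applies when the * appears in the key rather than in the filename.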

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters

pathstring, pathlib.Path: Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
keystring: Datapath within the files. May contain a * to denote many locations
name_functionfunction: A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)
computebool: Whether or not to execute immediately. If False then this returns a dask.Delayed value.
lockLock, optional: Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
schedulerstring: The scheduler to use, like “threads” or “processes”
**other: See pandas.to_hdf for more information

Returns

filenameslist: Returned if compute is True. List of file names that each partition is saved to.
delayeddask.Delayed: Returned if compute is False. Delayed object to execute to_hdf when computed.

See also

read_hdf
to_parquet

Examples

Save data to a single file:

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))

>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 

to_html(self, max_rows=5)¶

Render a DataFrame as an HTML table.

This docstring was copied from pandas.core.frame.DataFrame.to_html.

Some inconsistencies with the Dask version may exist.

Parameters

bufstr, Path or StringIO-like, optional, default None (Not supported in Dask)

Buffer to write to. If None, the output is returned as a string.

columnssequence, optional, default None (Not supported in Dask)

The subset of columns to write. Writes all columns by default.

col_spacestr or int, optional (Not supported in Dask)

The minimum width of each column in CSS length units. An int is assumed to be px units.

New in version 0.25.0: Ability to use str.

headerbool, optional (Not supported in Dask)

Whether to print column labels, default True.

indexbool, optional, default True (Not supported in Dask)

Whether to print index (row) labels.

na_repstr, optional, default ‘NaN’ (Not supported in Dask)

String representation of NAN to use.

formatterslist, tuple or dict of one-param. functions, optional (Not supported in Dask)

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

float_formatone-parameter function, optional, default None (Not supported in Dask)

Formatter function to apply to columns’ elements if they are floats. The result of this function must be a unicode string.

sparsifybool, optional, default True (Not supported in Dask)

Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

index_namesbool, optional, default True (Not supported in Dask)

Prints the names of the indexes.

justifystr, default None (Not supported in Dask)

How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.

max_rowsint, optional

Maximum number of rows to display in the console.

min_rowsint, optional

The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).

max_colsint, optional (Not supported in Dask)

Maximum number of columns to display in the console.

show_dimensionsbool, default False (Not supported in Dask)

Display DataFrame dimensions (number of rows by number of columns).

decimalstr, default ‘.’ (Not supported in Dask)

Character recognized as decimal separator, e.g. ‘,’ in Europe.

bold_rowsbool, default True (Not supported in Dask)

Make the row labels bold in the output.

classesstr or list or tuple, default None (Not supported in Dask)

CSS class(es) to apply to the resulting html table.

escapebool, default True (Not supported in Dask)

Convert the characters <, >, and & to HTML-safe sequences.

notebook{True, False}, default False (Not supported in Dask)

Whether the generated HTML is for IPython Notebook.

borderint (Not supported in Dask)

A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.

encodingstr, default “utf-8” (Not supported in Dask)

Set character encoding.

New in version 1.0.

table_idstr, optional (Not supported in Dask)

A css id is included in the opening <table> tag if specified.

New in version 0.23.0.

render_linksbool, default False (Not supported in Dask)

Convert URLs to HTML links.

New in version 0.24.0.

Returns

str or None: If buf is None, returns the result as a string. Otherwise returns None.

See also

to_string: Convert DataFrame to a string.

to_json(self, filename, *args, **kwargs)¶: See dd.to_json docstring for more information

to_parquet(self, path, *args, **kwargs)¶

Store Dask.dataframe to Parquet files

Parameters

dfdask.dataframe.DataFrame
pathstring or pathlib.Path: Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
engine{‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’: Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.
compressionstring or dict, optional: Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.
write_indexboolean, optional: Whether or not to write the index. Defaults to True.
appendbool, optional: If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
ignore_divisionsbool, optional: If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
partition_onlist, optional: Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles, there will be no global groupby.
storage_optionsdict, optional: Key/value pairs to be passed on to the file-system backend, if any.
write_metadata_filebool, optional: Whether to write the special “_metadata” file.
computebool, optional: If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.
compute_kwargsdict, optional: Options to be passed in to the compute method
**kwargs: Extra options to be passed on to the specific backend.

See also

read_parquet: Read parquet data to dask.dataframe

Notes

Each partition will be written to a separate file.

Examples

>>> df = dd.read_csv(...)  
>>> dd.to_parquet(df, '/path/to/output/',...)  

to_records(self, index=False, lengths=None)¶

Create Dask Array from a Dask Dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

See also

dask.dataframe._Frame.values
dask.dataframe.from_dask_array

Examples

>>> df.to_records()  
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,), chunktype=numpy.ndarray>

to_sql(self, name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False)¶: See dd.to_sql docstring for more information

to_string(self, max_rows=5)¶

Render a DataFrame to a console-friendly tabular output.

This docstring was copied from pandas.core.frame.DataFrame.to_string.

Some inconsistencies with the Dask version may exist.

Parameters

bufstr, Path or StringIO-like, optional, default None (Not supported in Dask)

Buffer to write to. If None, the output is returned as a string.

columnssequence, optional, default None (Not supported in Dask)

The subset of columns to write. Writes all columns by default.

col_spaceint, optional (Not supported in Dask)

The minimum width of each column.

headerbool or sequence, optional (Not supported in Dask)

Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.

indexbool, optional, default True (Not supported in Dask)

Whether to print index (row) labels.

na_repstr, optional, default ‘NaN’ (Not supported in Dask)

String representation of NAN to use.

formatterslist, tuple or dict of one-param. functions, optional (Not supported in Dask)

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

float_formatone-parameter function, optional, default None (Not supported in Dask)

Formatter function to apply to columns’ elements if they are floats. The result of this function must be a unicode string.

sparsifybool, optional, default True (Not supported in Dask)

Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

index_namesbool, optional, default True (Not supported in Dask)

Prints the names of the indexes.

justifystr, default None (Not supported in Dask)

How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.

max_rowsint, optional

Maximum number of rows to display in the console.

min_rowsint, optional (Not supported in Dask)

The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).

max_colsint, optional (Not supported in Dask)

Maximum number of columns to display in the console.

show_dimensionsbool, default False (Not supported in Dask)

Display DataFrame dimensions (number of rows by number of columns).

decimalstr, default ‘.’ (Not supported in Dask)

Character recognized as decimal separator, e.g. ‘,’ in Europe.

line_widthint, optional (Not supported in Dask)

Width to wrap a line in characters.

max_colwidthint, optional (Not supported in Dask)

Max width to truncate each column in characters. By default, no limit.

New in version 1.0.0.

encodingstr, default “utf-8” (Not supported in Dask)

Set character encoding.

New in version 1.0.

Returns

str or None: If buf is None, returns the result as a string. Otherwise returns None.

See also

to_html: Convert DataFrame to HTML.

Examples

>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}  
>>> df = pd.DataFrame(d)  
>>> print(df.to_string())  
   col1  col2
0     1     4
1     2     5
2     3     6

to_timestamp(self, freq=None, how='start', axis=0)¶

Cast to DatetimeIndex of timestamps, at beginning of period.

This docstring was copied from pandas.core.frame.DataFrame.to_timestamp.

Some inconsistencies with the Dask version may exist.

Parameters

freqstr, default frequency of PeriodIndex: Desired frequency.
how{‘s’, ‘e’, ‘start’, ‘end’}: Convention for converting period to timestamp; start of period vs. end.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to convert (the index by default).
copybool, default True (Not supported in Dask): If False then underlying input data is not copied.

Returns

DataFrame with DatetimeIndex
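A short pandas sketch of the how parameter follows, using a made-up monthly PeriodIndex. It shows the copied pandas behaviour; whether the Dask version accepts every option is not confirmed here.

```python
import pandas as pd

# Hypothetical monthly data indexed by period.
idx = pd.period_range("2023-01", periods=3, freq="M")
df = pd.DataFrame({"sales": [10, 20, 30]}, index=idx)

# Timestamps at the start of each period (the default convention).
start = df.to_timestamp(how="start")

# Timestamps at the end of each period.
end = df.to_timestamp(how="end")
```

The start convention places January 2023 at 2023-01-01, while the end convention places it on the last day of the month.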

truediv(self, other, axis='columns', level=None, fill_value=None)¶

Get Floating division of dataframe and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.frame.DataFrame.truediv.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}: Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label: Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns

DataFrame: Result of the arithmetic operation.

See also

DataFrame.add: Add DataFrames.
DataFrame.sub: Subtract DataFrames.
DataFrame.mul: Multiply DataFrames.
DataFrame.div: Divide DataFrames (float division).
DataFrame.truediv: Divide DataFrames (float division).
DataFrame.floordiv: Divide DataFrames (integer division).
DataFrame.mod: Calculate modulo (remainder after division).
DataFrame.pow: Calculate exponential power.

Notes

Mismatched indices will be unioned together.
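The index union and the effect of fill_value can be seen on two small pandas frames with partly overlapping indices. The frames below are invented for illustration.

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]}, index=["a", "b"])
b = pd.DataFrame({"x": [2.0, 4.0]}, index=["b", "c"])

# Rows present in only one input become NaN after alignment.
plain = a.truediv(b)

# fill_value substitutes for the side that is missing before dividing.
filled = a.truediv(b, fill_value=1.0)
```

In `plain` only row 'b' has a value (2.0 / 2.0 = 1.0). In `filled`, row 'a' becomes 1.0 / 1.0 and row 'c' becomes 1.0 / 4.0.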

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4

>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0

property values¶

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

var(self, axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶

Return unbiased variance over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.var.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters

axis{index (0), columns (1)}
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddofint, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)
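The role of ddof as the divisor adjustment N - ddof can be written out directly. The function `variance` below is a plain-Python sketch of the formula, not the Dask or pandas implementation, and the sample values are arbitrary.

```python
import statistics


def variance(values, ddof=1):
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - ddof)


data = [1.0, 2.0, 3.0, 4.0]
sample_var = variance(data, ddof=1)      # divisor N - 1
population_var = variance(data, ddof=0)  # divisor N
```

With ddof=1 the result matches the standard library's `statistics.variance`, and with ddof=0 it matches `statistics.pvariance`.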

visualize(self, filename='mydask', format=None, optimize_graph=False, **kwargs)¶

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters

filenamestr or None, optional: The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.
format{‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional: Format in which to write output file. Default is ‘png’.
optimize_graphbool, optional: If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
color: {None, ‘order’}, optional: Options to color nodes. Provide cmap= keyword for additional colormap
**kwargs: Additional keyword arguments to forward to to_graphviz.

Returns

resultIPython.display.Image, IPython.display.SVG, or None: See dask.dot.dot_graph for more information.

See also

dask.base.visualize
dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  

where(self, cond, other=nan)¶

Replace values where the condition is False.

This docstring was copied from pandas.core.frame.DataFrame.where.

Some inconsistencies with the Dask version may exist.

Parameters

condbool Series/DataFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

‘raise’ : allow exceptions to be raised.
‘ignore’ : suppress exceptions. On error return original object.

try_castbool, default False (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Returns

Same type as caller

See also

DataFrame.mask(): Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
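The parameter descriptions mention that cond and other may be callables, which the examples above do not exercise. A minimal pandas sketch follows; the data is made up, and whether the Dask version accepts callables in the same way is an assumption that has not been verified here.

```python
import pandas as pd

s = pd.Series([1, -2, 3, -4])

# Both callables receive the Series itself: keep positive entries,
# and replace the rest with their negation.
result = s.where(lambda x: x > 0, lambda x: -x)
```

For this input the call yields the absolute values, since every non-positive entry is replaced by its negation.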

Series Methods¶

class dask.dataframe.Series(dsk, name, meta, divisions)¶

Parallel Pandas Series

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters

dsk: dict: The dask graph to compute this Series
_name: str: The key prefix that specifies which keys in the dask comprise this particular Series
meta: pandas.Series: An empty pandas.Series with names, dtypes, and index matching the expected output.
divisions: tuple of index values: Values along which we partition our blocks on the index

See also

dask.dataframe.DataFrame

abs(self)¶

Return a Series/DataFrame with absolute numeric value of each element.

This docstring was copied from pandas.core.frame.DataFrame.abs.

Some inconsistencies with the Dask version may exist.

This function only applies to elements that are all numeric.

Returns

abs: Series/DataFrame containing the absolute value of each element.

See also

numpy.absolute: Calculate the absolute value element-wise.

Notes

For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{ a^2 + b^2 }\).

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])  
>>> s.abs()  
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])  
>>> s.abs()  
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])  
>>> s.abs()  
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({  
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50

add(self, other, level=None, fill_value=None, axis=0)¶

Return Addition of series and other, element-wise (binary operator add).

This docstring was copied from pandas.core.series.Series.add.

Some inconsistencies with the Dask version may exist.

Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.radd

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)  
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

align(self, other, join='outer', axis=None, fill_value=None)¶

Align two objects on their axes with the specified join method.

This docstring was copied from pandas.core.series.Series.align.

Some inconsistencies with the Dask version may exist.

Join method is specified for each axis Index.

Parameters

other: DataFrame or Series

join: {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axis: allowed axis of the other object, default None

Align on index (0), columns (1), or both (None).

level: int or level name, default None (Not supported in Dask)

Broadcast across a level, matching Index values on the passed MultiIndex level.

copy: bool, default True (Not supported in Dask)

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value: scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

method: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)

Method to use for filling holes in reindexed Series:

pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.

limit: int, default None (Not supported in Dask)

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

fill_axis: {0 or ‘index’}, default 0 (Not supported in Dask)

Filling axis, method and limit.

broadcast_axis: {0 or ‘index’}, default None (Not supported in Dask)

Broadcast values along this axis, if aligning two objects of different dimensions.

Returns

(left, right): (Series, type of other): Aligned objects.
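
The join semantics described above can be sketched as follows. This example uses plain pandas, from which the docstring was copied. It assumes the Dask method behaves the same way for the supported parameters, with each returned object needing a compute() call. The data is made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20], index=["b", "d"])

# Outer join: both results share the union of the two indexes,
# and positions missing from one input are filled with fill_value.
left, right = a.align(b, join="outer", fill_value=0)

# Inner join: both results keep only the labels present in both inputs.
left_in, right_in = a.align(b, join="inner")

print(left.to_dict())
print(right.to_dict())
print(left_in.to_dict(), right_in.to_dict())
```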

all(self, axis=None, skipna=True, split_every=False, out=None)¶

Return whether all elements are True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.all.

Some inconsistencies with the Dask version may exist.

Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).

Parameters

axis: {0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

bool_only: bool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

level: int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargs: any, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame: If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

Series.all: Return True if all elements are True.
DataFrame.any: Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()  
True
>>> pd.Series([True, False]).all()  
False
>>> pd.Series([]).all()  
True
>>> pd.Series([np.nan]).all()  
True
>>> pd.Series([np.nan]).all(skipna=False)  
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  
>>> df  
   col1   col2
0  True   True
1  True  False

Default behaviour checks if column-wise values all return True.

>>> df.all()  
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if row-wise values all return True.

>>> df.all(axis='columns')  
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)  
False

any(self, axis=None, skipna=True, split_every=False, out=None)¶

Return whether any element is True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.any.

Some inconsistencies with the Dask version may exist.

Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters

axis: {0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

bool_only: bool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

level: int or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

**kwargs: any, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame: If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

numpy.any: Numpy version of this method.
Series.any: Return whether any element is True.
Series.all: Return whether all elements are True.
DataFrame.any: Return whether any element is True over requested axis.
DataFrame.all: Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()  
False
>>> pd.Series([True, False]).any()  
True
>>> pd.Series([]).any()  
False
>>> pd.Series([np.nan]).any()  
False
>>> pd.Series([np.nan]).any(skipna=False)  
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})  
>>> df  
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()  
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})  
>>> df  
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')  
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})  
>>> df  
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')  
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()  
Series([], dtype: bool)

append(self, other, interleave_partitions=False)¶

Concatenate two or more Series.

This docstring was copied from pandas.core.series.Series.append.

Some inconsistencies with the Dask version may exist.

Parameters

to_append: Series or list/tuple of Series (Not supported in Dask): Series to append with self.
ignore_index: bool, default False (Not supported in Dask): If True, do not use the index labels.
verify_integrity: bool, default False (Not supported in Dask): If True, raise Exception on creating index with duplicates.

Returns

Series: Concatenated Series.

See also

concat: General function to concatenate DataFrame or Series objects.

Notes

Iteratively appending to a Series can be more computationally intensive than a single concatenate. A better solution is to append values to a list and then concatenate the list with the original Series all at once.

Examples

>>> s1 = pd.Series([1, 2, 3])  
>>> s2 = pd.Series([4, 5, 6])  
>>> s3 = pd.Series([4, 5, 6], index=[3, 4, 5])  
>>> s1.append(s2)  
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64

>>> s1.append(s3)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With ignore_index set to True:

>>> s1.append(s2, ignore_index=True)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With verify_integrity set to True:

>>> s1.append(s2, verify_integrity=True)  
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: [0, 1, 2]
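
The advice in the Notes above (collect pieces in a list, then concatenate once) might look like this in plain pandas. The series contents are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
pieces = [pd.Series([4, 5]), pd.Series([6])]

# Collect the pieces first, then concatenate once instead of
# appending inside a loop.
combined = pd.concat([s1] + pieces, ignore_index=True)
print(combined.tolist())
```

A single concatenation copies the data once, whereas repeated appends copy the growing result on every iteration.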

apply(self, func, convert_dtype=True, meta='__no_default__', args=(), **kwds)¶

Parallel version of pandas.Series.apply

Parameters

func: function: Function to apply
convert_dtype: boolean, default True: Try to find better dtype for elementwise function results. If False, leave as dtype=object.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
args: tuple: Positional arguments to pass to function in addition to the value.
**kwds: Additional keyword arguments will be passed as keywords to the function.

Returns

applied: Series or DataFrame if func returns a Series.

See also

dask.dataframe.Series.map_partitions

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = pd.Series(range(5), name='x')
>>> ds = dd.from_pandas(s, npartitions=2)

Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:

>>> def myadd(x, a, b=1):
...     return x + a + b
>>> res = ds.apply(myadd, args=(2,), b=1.5)  

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ds.apply(lambda x: x + 1, meta=ds)

astype(self, dtype)¶

Cast a pandas object to a specified dtype dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.

Some inconsistencies with the Dask version may exist.

Parameters

dtype: data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copy: bool, default True (Not supported in Dask)

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors: {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Control raising of exceptions on invalid data for provided dtype.

raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.

Returns

casted: same type as caller

See also

to_datetime: Convert argument to datetime.
to_timedelta: Convert argument to timedelta.
to_numeric: Convert argument to a numeric type.
numpy.ndarray.astype: Cast a numpy array to a specified type.

Examples

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}  
>>> df = pd.DataFrame(data=d)  
>>> df.dtypes  
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes  
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes  
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = pd.api.types.CategoricalDtype(  
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])  
>>> s2 = s1.astype('int64', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64

autocorr(self, lag=1, split_every=False)¶

Compute the lag-N autocorrelation.

This docstring was copied from pandas.core.series.Series.autocorr.

Some inconsistencies with the Dask version may exist.

This method computes the Pearson correlation between the Series and its shifted self.

Parameters

lag: int, default 1: Number of lags to apply before performing autocorrelation.

Returns

float: The Pearson correlation between self and self.shift(lag).

See also

Series.corr: Compute the correlation between two Series.
Series.shift: Shift index by desired number of periods.
DataFrame.corr: Compute pairwise correlation of columns.
DataFrame.corrwith: Compute pairwise correlation between rows or columns of two DataFrame objects.

Notes

If the Pearson correlation is not well defined return ‘NaN’.

Examples

>>> s = pd.Series([0.25, 0.5, 0.2, -0.05])  
>>> s.autocorr()  
0.10355...
>>> s.autocorr(lag=2)  
-0.99999...

If the Pearson correlation is not well defined, then ‘NaN’ is returned.

>>> s = pd.Series([1, 0, 0, 0])  
>>> s.autocorr()  
nan
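
The definition given under Returns (the Pearson correlation between self and self.shift(lag)) can be checked directly. This sketch uses plain pandas and the same data as the first example above:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.2, -0.05])

# Lag-1 autocorrelation, computed directly and via an explicit shift.
direct = s.autocorr(lag=1)
manual = s.corr(s.shift(1))
print(direct, manual)
```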

between(self, left, right, inclusive=True)¶

Return boolean Series equivalent to left <= series <= right.

This docstring was copied from pandas.core.series.Series.between.

Some inconsistencies with the Dask version may exist.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters

left: scalar or list-like: Left boundary.
right: scalar or list-like: Right boundary.
inclusive: bool, default True: Include boundaries.

Returns

Series: Series representing whether each element is between left and right (inclusive).

See also

Series.gt: Greater than of series and other.
Series.lt: Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right)

Examples

>>> s = pd.Series([2, 0, 4, 8, np.nan])  

Boundary values are included by default:

>>> s.between(1, 4)  
0     True
1    False
2     True
3    False
4    False
dtype: bool

With inclusive set to False boundary values are excluded:

>>> s.between(1, 4, inclusive=False)  
0     True
1    False
2    False
3    False
4    False
dtype: bool

left and right can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])  
>>> s.between('Anna', 'Daniel')  
0    False
1     True
2     True
3    False
dtype: bool
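
The equivalence stated in the Notes, (left <= ser) & (ser <= right), can be verified with a short pandas sketch that relies only on the default inclusive bounds:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 0, 4, 8, np.nan])

# between() with its default (inclusive) bounds ...
via_between = s.between(1, 4)
# ... matches the explicit pair of comparisons; NaN compares as False.
via_compare = (s >= 1) & (s <= 4)
print(via_between.tolist())
```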

bfill(self, axis=None, limit=None)¶

Synonym for DataFrame.fillna() with method='bfill'.

This docstring was copied from pandas.core.frame.DataFrame.bfill.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame or None: Object with missing values filled or None if inplace=True.

clear_divisions(self)¶

Forget division information

clip(self, lower=None, upper=None, out=None)¶

Trim values at input threshold(s).

This docstring was copied from pandas.core.series.Series.clip.

Some inconsistencies with the Dask version may exist.

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters

lower: float or array_like, default None: Minimum threshold value. All values below this threshold will be set to it.
upper: float or array_like, default None: Maximum threshold value. All values above this threshold will be set to it.
axis: int or str axis name, optional (Not supported in Dask): Align object with lower and upper along the given axis.
inplace: bool, default False (Not supported in Dask): Whether to perform the operation in place on the data.

New in version 0.21.0.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns

Series or DataFrame: Same type as calling object with the values outside the clip boundaries replaced.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  
>>> df = pd.DataFrame(data)  
>>> df  
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)  
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])  
>>> t  
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)  
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

combine(self, other, func, fill_value=None)¶

Combine the Series with a Series or scalar according to func.

This docstring was copied from pandas.core.series.Series.combine.

Some inconsistencies with the Dask version may exist.

Combine the Series and other using func to perform elementwise selection for combined Series. fill_value is assumed when value is missing at some index from one of the two objects being combined.

Parameters

other: Series or scalar: The value(s) to be combined with the Series.
func: function: Function that takes two scalars as inputs and returns an element.
fill_value: scalar, optional: The value to assume when an index is missing from one Series or the other. The default specifies to use the appropriate NaN value for the underlying dtype of the Series.

Returns

Series: The result of combining the Series with the other object.

See also

Series.combine_first: Combine Series values, choosing the calling Series’ values first.

Examples

Consider 2 Datasets s1 and s2 containing highest clocked speeds of different birds.

>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})  
>>> s1  
falcon    330.0
eagle     160.0
dtype: float64
>>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})  
>>> s2  
falcon    345.0
eagle     200.0
duck       30.0
dtype: float64

Now, to combine the two datasets and view the highest speeds of the birds across the two datasets

>>> s1.combine(s2, max)  
duck        NaN
eagle     200.0
falcon    345.0
dtype: float64

In the previous example, the resulting value for duck is missing, because the maximum of a NaN and a float is a NaN. Setting fill_value=0 substitutes 0 for the missing entry, so the maximum returned is always a value present in one of the datasets.

>>> s1.combine(s2, max, fill_value=0)  
duck       30.0
eagle     200.0
falcon    345.0
dtype: float64

combine_first(self, other)¶

Combine Series values, choosing the calling Series’s values first.

This docstring was copied from pandas.core.series.Series.combine_first.

Some inconsistencies with the Dask version may exist.

Parameters

other: Series: The value(s) to be combined with the Series.

Returns

Series: The result of combining the Series with the other object.

See also

Series.combine: Perform elementwise operation on two Series using a given function.

Notes

Result index will be the union of the two indexes.

Examples

>>> s1 = pd.Series([1, np.nan])  
>>> s2 = pd.Series([3, 4])  
>>> s1.combine_first(s2)  
0    1.0
1    4.0
dtype: float64

compute(self, **kwargs)¶

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters

scheduler: string, optional: Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph: bool, optional: If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs: Extra keywords to forward to the scheduler function.

See also

dask.base.compute

copy(self)¶

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data

corr(self, other, method='pearson', min_periods=None, split_every=False)¶

Compute correlation with other Series, excluding missing values.

This docstring was copied from pandas.core.series.Series.corr.

Some inconsistencies with the Dask version may exist.

Parameters

other: Series

Series with which to compute the correlation.

method: {‘pearson’, ‘kendall’, ‘spearman’} or callable

Method used to compute correlation:

pearson : Standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: Callable with input two 1d ndarrays and returning a float.

New in version 0.24.0: Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

min_periods: int, optional

Minimum number of observations needed to have a valid result.

Returns

float: Correlation with other.

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])  
>>> s2 = pd.Series([.3, .6, .0, .1])  
>>> s1.corr(s2, method=histogram_intersection)  
0.3

count(self, split_every=False)¶

Return number of non-NA/null observations in the Series.

This docstring was copied from pandas.core.series.Series.count.

Some inconsistencies with the Dask version may exist.

Parameters

level: int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series.

Returns

int or Series (if level specified): Number of non-null values in the Series.

Examples

>>> s = pd.Series([0.0, 1.0, np.nan])  
>>> s.count()  
2

cov(self, other, min_periods=None, split_every=False)¶

Compute covariance with Series, excluding missing values.

This docstring was copied from pandas.core.series.Series.cov.

Some inconsistencies with the Dask version may exist.

Parameters

other: Series: Series with which to compute the covariance.
min_periods: int, optional: Minimum number of observations needed to have a valid result.

Returns

float: Covariance between Series and other normalized by N-1 (unbiased estimator).

Examples

>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])  
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])  
>>> s1.cov(s2)  
-0.01685762652715874
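
The N-1 normalization mentioned under Returns can be reproduced by hand. This sketch uses plain pandas with made-up values and compares the manual formula against cov:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
s2 = pd.Series([2.0, 4.0, 6.0, 8.0])

# Sum of products of deviations from the mean, divided by N - 1.
n = len(s1)
manual = ((s1 - s1.mean()) * (s2 - s2.mean())).sum() / (n - 1)
print(manual, s1.cov(s2))
```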

cummax(self, axis=None, skipna=True, out=None)¶

Return cumulative maximum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummax.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.max: Similar functionality but ignores NaN values.
DataFrame.max: Return the maximum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()  
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()  
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0

cummin(self, axis=None, skipna=True, out=None)¶

Return cumulative minimum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummin.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.min: Similar functionality but ignores NaN values.
DataFrame.min: Return the minimum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()  
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()  
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

cumprod(self, axis=None, skipna=True, dtype=None, out=None)¶

Return cumulative product over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumprod.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.prod: Similar functionality but ignores NaN values.
DataFrame.prod: Return the product over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()  
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()  
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0

cumsum(self, axis=None, skipna=True, dtype=None, out=None)¶

Return cumulative sum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumsum.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs: Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns

Series or DataFrame

See also

core.window.Expanding.sum: Similar functionality but ignores NaN values.
DataFrame.sum: Return the sum over DataFrame axis.
DataFrame.cummax: Return cumulative maximum over DataFrame axis.
DataFrame.cummin: Return cumulative minimum over DataFrame axis.
DataFrame.cumsum: Return cumulative sum over DataFrame axis.
DataFrame.cumprod: Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()  
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()  
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)  
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
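
A cumulative sum over data that is split into chunks can be assembled from per-chunk cumulative sums plus a running offset. The sketch below illustrates that idea with plain pandas; the helper name `chunked_cumsum` and the manual chunking are assumptions made for illustration, and this is not a description of how Dask implements `cumsum` internally.

```python
import pandas as pd


def chunked_cumsum(chunks):
    """Cumulative sum over a list of Series chunks, treated as one long Series."""
    results = []
    carry = 0.0  # running total of all previous chunks
    for chunk in chunks:
        local = chunk.cumsum()         # cumulative sum inside this chunk only
        results.append(local + carry)  # shift by everything seen so far
        carry += chunk.sum()           # update the running total
    return pd.concat(results)


s = pd.Series([2.0, 3.0, 5.0, -1.0, 0.0, 4.0])
chunks = [s.iloc[:2], s.iloc[2:4], s.iloc[4:]]  # three hand-made "partitions"

result = chunked_cumsum(chunks)
print(result.tolist())
```

The chunked result should match `s.cumsum()` on the unsplit Series, since each chunk only needs the total of the chunks before it.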

describe(self, split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)¶

Generate descriptive statistics.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters

percentileslist-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types, submit numpy.number. To limit it instead to object columns, submit numpy.object_ (or the builtin object; the numpy.object alias was removed in NumPy 1.24). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.

excludelist-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types, submit numpy.number. To exclude object columns, submit numpy.object_ (or the builtin object; the numpy.object alias was removed in NumPy 1.24). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.

Returns

Series or DataFrame: Summary statistics of the Series or Dataframe provided.

See also

DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()  
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),  
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
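
The `percentiles` parameter described above is not exercised by the examples, so here is a small pandas sketch of it. The Series values are arbitrary illustration data, and the Dask-specific `split_every` and `percentiles_method` arguments are deliberately left out because this reference does not describe them.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

print(summary.index.tolist())
print(summary["10%"], summary["90%"])
```

The percentile rows are labelled with the requested values expressed as percentages, in between `min` and `max`.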

diff(self, periods=1, axis=0)¶

First discrete difference of element.

This docstring was copied from pandas.core.frame.DataFrame.diff.

Some inconsistencies with the Dask version may exist.

Note

Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean-specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).

Parameters

periodsint, default 1: Periods to shift for calculating difference, accepts negative values.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: Take difference over rows (0) or columns (1).

Returns

DataFrame

See also

Series.diff: First discrete difference for a Series.
DataFrame.pct_change: Percent change over given number of periods.
DataFrame.shift: Shift index by desired number of periods with an optional time freq.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub().

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],  
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df  
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36

>>> df.diff()  
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)  
    a    b     c
0 NaN  0.0   0.0
1 NaN -1.0   3.0
2 NaN -1.0   7.0
3 NaN -1.0  13.0
4 NaN  0.0  20.0
5 NaN  2.0  28.0

Difference with 3rd previous row

>>> df.diff(periods=3)  
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)  
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
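
For numeric columns, the difference with the previous row can also be written as the frame minus a shifted copy of itself, which makes the link to `DataFrame.shift` in the See also list concrete. The following is a minimal pandas sketch with made-up data.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, 1, 2, 3]})

via_diff = df.diff(periods=1)
via_shift = df - df.shift(periods=1)  # subtract the previous row by hand

print(via_diff)
print(via_shift)
```

Both frames have NaN in the first row, because there is no previous row to subtract.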

div(self, other, level=None, fill_value=None, axis=0)¶

Return Floating division of series and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.series.Series.div.

Some inconsistencies with the Dask version may exist.

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rtruediv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
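
To see what `fill_value` changes relative to the plain `/` operator, the sketch below compares the two on a pair of aligned Series. The data and the choice of `fill_value=1` are assumptions picked so that no division by zero occurs.

```python
import numpy as np
import pandas as pd

a = pd.Series([6.0, 8.0, np.nan], index=['x', 'y', 'z'])
b = pd.Series([3.0, np.nan, 2.0], index=['x', 'y', 'z'])

plain = a / b                    # NaN wherever either side is missing
filled = a.div(b, fill_value=1)  # a missing value on one side is treated as 1

print(plain.tolist())
print(filled.tolist())
```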

divide(self, other, level=None, fill_value=None, axis=0)¶

Return Floating division of series and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.series.Series.divide.

Some inconsistencies with the Dask version may exist.

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rtruediv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64

drop_duplicates(self, subset=None, split_every=None, split_out=1, ignore_index=False, **kwargs)¶

Return DataFrame with duplicate rows removed.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Considering certain columns is optional. Indexes, including time indexes, are ignored.

Parameters

subsetcolumn label or sequence of labels, optional: Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask): Determines which duplicates (if any) to keep: ‘first’ drops duplicates except for the first occurrence, ‘last’ drops duplicates except for the last occurrence, and False drops all duplicates.
inplacebool, default False (Not supported in Dask): Whether to drop duplicates in place or to return a copy.
ignore_indexbool, default False: If True, the resulting axis will be labeled 0, 1, …, n - 1.

New in version 1.0.0.

Returns

DataFrame: DataFrame with duplicates removed or None if inplace=True.
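
This entry has no examples, so here is a short pandas sketch of the `subset` parameter. The column names and values are invented for illustration; `keep` and `inplace` are not used because the parameter list above marks them as not supported in Dask.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum', 'Yum', 'Yum', 'Indo'],
    'style': ['cup', 'cup', 'pack', 'cup'],
    'rating': [4.0, 4.0, 3.5, 5.0],
})

all_columns = df.drop_duplicates()               # rows must match in every column
by_brand = df.drop_duplicates(subset=['brand'])  # only 'brand' is compared

print(all_columns.index.tolist())
print(by_brand.index.tolist())
```

The surviving rows keep their original index labels unless `ignore_index=True` is passed.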

dropna(self)¶

Return a new Series with missing values removed.

This docstring was copied from pandas.core.series.Series.dropna.

Some inconsistencies with the Dask version may exist.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters

axis{0 or ‘index’}, default 0 (Not supported in Dask): There is only one axis to drop values from.
inplacebool, default False (Not supported in Dask): If True, do operation inplace and return None.
howstr, optional (Not supported in Dask): Not in use. Kept for compatibility.

Returns

Series: Series with NA entries dropped from it.

See also

Series.isna: Indicate missing values.
Series.notna: Indicate existing (non-missing) values.
Series.fillna: Replace missing values.
DataFrame.dropna: Drop rows or columns which contain NA values.
Index.dropna: Drop missing indices.

Examples

>>> ser = pd.Series([1., 2., np.nan])  
>>> ser  
0    1.0
1    2.0
2    NaN
dtype: float64

Drop NA values from a Series.

>>> ser.dropna()  
0    1.0
1    2.0
dtype: float64

Keep the Series with valid entries in the same variable.

>>> ser.dropna(inplace=True)  
>>> ser  
0    1.0
1    2.0
dtype: float64

Empty strings are not considered NA values. None is considered an NA value.

>>> ser = pd.Series([np.nan, 2, pd.NaT, '', None, 'I stay'])  
>>> ser  
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()  
1         2
3
5    I stay
dtype: object
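
Because the parameter list above marks `inplace` as not supported in Dask, rebinding the result is the pattern that works in both libraries. A minimal sketch, shown here with pandas:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan])

# Rebind the name to the filtered result instead of passing inplace=True.
ser = ser.dropna()

print(ser.tolist())
```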

dt¶: Namespace of datetime methods

property dtype¶: Return data type

eq(self, other, level=None, fill_value=None, axis=0)¶

Return element-wise equality of series and other (binary operator eq).

This docstring was copied from pandas.core.series.Series.eq.

Some inconsistencies with the Dask version may exist.

Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.
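
This entry has no examples, so the following pandas sketch shows how `fill_value` affects the comparison. The data is invented; note that a position missing on both sides stays missing, and a missing value never compares equal.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, np.nan, np.nan], index=['x', 'y', 'z'])

plain = a.eq(b)                  # NaN on either side compares as False
filled = a.eq(b, fill_value=2)   # 'y' becomes 2 == 2; 'z' is missing on both sides

print(plain.tolist())
print(filled.tolist())
```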

explode(self)¶

Transform each element of a list-like to a row, replicating the index values.

This docstring was copied from pandas.core.series.Series.explode.

Some inconsistencies with the Dask version may exist.

New in version 0.25.0.

Returns

Series: Exploded lists to rows; index will be duplicated for these rows.

See also

Series.str.split: Split string values on specified separator.
Series.unstack: Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.
DataFrame.melt: Unpivot a DataFrame from wide format to long format.
DataFrame.explode: Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged. Empty list-likes will result in a np.nan for that row.

Examples

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])  
>>> s  
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object

>>> s.explode()  
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object

ffill(self, axis=None, limit=None)¶

Synonym for DataFrame.fillna() with method='ffill'.

This docstring was copied from pandas.core.frame.DataFrame.ffill.

Some inconsistencies with the Dask version may exist.

Returns

DataFrame or None: Object with missing values filled or None if inplace=True.
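
This entry has no examples, so here is a small pandas sketch of forward filling with and without `limit`. The frame is made-up illustration data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan, 4.0],
                   'B': [np.nan, 2.0, np.nan, np.nan]})

filled = df.ffill()           # carry the last valid value forward in each column
limited = df.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(filled)
print(limited)
```

A leading NaN, such as the first value of column B, stays missing because there is no earlier value to propagate.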

fillna(self, value=None, method=None, limit=None, axis=None)¶

Fill NA/NaN values using the specified method.

This docstring was copied from pandas.core.frame.DataFrame.fillna.

Some inconsistencies with the Dask version may exist.

Parameters

valuescalar, dict, Series, or DataFrame: Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None: Method to use for filling holes in a reindexed Series. pad / ffill: propagate the last valid observation forward to the next valid one. backfill / bfill: use the next valid observation to fill the gap.
axis{0 or ‘index’, 1 or ‘columns’}: Axis along which to fill missing values.
inplacebool, default False (Not supported in Dask): If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limitint, default None: If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcastdict, default is None (Not supported in Dask): A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns

DataFrame or None: Object with missing values filled or None if inplace=True.

See also

interpolate: Fill NaN values using interpolation.
reindex: Conform object to new index.
asfreq: Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}  
>>> df.fillna(value=values)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
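
The `value` parameter also accepts a Series keyed by column name. A common use is filling each column with its own mean, sketched below with pandas on invented data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 4.0, 6.0]})

column_means = df.mean()            # Series indexed by column name; NaN is skipped
filled = df.fillna(column_means)    # each column is filled with its own mean

print(filled)
```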

first(self, offset)¶

Method to subset initial periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.first.

Some inconsistencies with the Dask version may exist.

Parameters

offsetstr, DateOffset, dateutil.relativedelta

Returns

subset: same type as caller

Raises

TypeError: If the index is not a DatetimeIndex

See also

last: Select final periods of time series based on a date offset.
at_time: Select values at a particular time of the day.
between_time: Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')  
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
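
The calendar-based behaviour in the example above can be reproduced with an explicit cutoff on the index, which makes it clearer why 2018-04-13 is excluded. This is a sketch that matches the '3D' example only; the strict less-than comparison is an assumption chosen to reproduce that output, not a statement about how `first` handles every kind of offset.

```python
import pandas as pd

i = pd.date_range('2018-04-09', periods=4, freq='2D')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)

# Three calendar days after the first timestamp: 2018-04-12.
cutoff = ts.index[0] + pd.Timedelta(days=3)
subset = ts[ts.index < cutoff]

print(subset)
```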

floordiv(self, other, level=None, fill_value=None, axis=0)¶

Return Integer division of series and other, element-wise (binary operator floordiv).

This docstring was copied from pandas.core.series.Series.floordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rfloordiv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)  
a    1.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64

ge(self, other, level=None, fill_value=None, axis=0)¶

Return element-wise greater-than-or-equal comparison of series and other (binary operator ge).

This docstring was copied from pandas.core.series.Series.ge.

Some inconsistencies with the Dask version may exist.

Equivalent to series >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

get_partition(self, n)¶: Get a dask DataFrame/Series representing the nth partition.

groupby(self, by=None, **kwargs)¶

Group Series using a mapper or by a Series of columns.

This docstring was copied from pandas.core.series.Series.groupby.

Some inconsistencies with the Dask version may exist.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters

bymapping, function, label, or list of labels: Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask): Split along rows (0) or columns (1).
levelint, level name, or sequence of such, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default True (Not supported in Dask): For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sortbool, default True (Not supported in Dask): Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
group_keysbool, default True (Not supported in Dask): When calling apply, add group keys to index to identify pieces.
squeezebool, default False (Not supported in Dask): Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
observedbool, default False (Not supported in Dask): This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

New in version 0.23.0.

Returns

SeriesGroupBy: Returns a groupby object that contains information about the groups.

See also

resample: Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more.

Examples

>>> ser = pd.Series([390., 350., 30., 20.],  
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed")
>>> ser  
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", "b"]).mean()  
a    210.0
b    185.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()  
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()  
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

Grouping by Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],  
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  
>>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed")  
>>> ser  
Animal  Type
Falcon  Captive    390.0
        Wild       350.0
Parrot  Captive     30.0
        Wild        20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()  
Animal
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level="Type").mean()  
Type
Captive    210.0
Wild       185.0
Name: Max Speed, dtype: float64
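
The split, apply, and combine steps mentioned in the description can be spelled out by hand and compared with the built-in result. The sketch below uses pandas and the same sample data as the examples; the variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series([390., 350., 30., 20.],
                index=['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                name='Max Speed')

# Split: one sub-Series per distinct index label.
pieces = {key: ser[ser.index == key] for key in ser.index.unique()}

# Apply: reduce each piece independently.
means = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name=ser.name)

grouped = ser.groupby(level=0).mean()
print(manual.to_dict())
print(grouped.to_dict())
```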

gt(self, other, level=None, fill_value=None, axis=0)¶

Return element-wise greater-than comparison of series and other (binary operator gt).

This docstring was copied from pandas.core.series.Series.gt.

Some inconsistencies with the Dask version may exist.

Equivalent to series > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

head(self, n=5, npartitions=1, compute=True)¶

First n rows of the dataset

Parameters

nint, optional: The number of rows to return. Default is 5.
npartitionsint, optional: Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.
computebool, optional: Whether to compute the result, default is True.
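
The interaction between `n` and `npartitions` can be illustrated without Dask at all. The sketch below models partitions as plain lists of rows; `head_sketch` is a hypothetical helper written for this illustration and is not Dask's implementation.

```python
import warnings


def head_sketch(partitions, n=5, npartitions=1):
    """Return up to n rows from the first `npartitions` partitions.

    `partitions` is a hypothetical stand-in: a list of row lists.
    """
    if npartitions == -1:  # -1 means "look at every partition"
        npartitions = len(partitions)
    rows = []
    for part in partitions[:npartitions]:
        rows.extend(part)
    if len(rows) < n:  # fewer rows found than requested
        warnings.warn(f"only {len(rows)} of {n} requested rows were found")
    return rows[:n]


partitions = [[1, 2, 3], [4, 5, 6], [7, 8]]

two_parts = head_sketch(partitions, n=5, npartitions=2)
all_parts = head_sketch(partitions, n=7, npartitions=-1)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    short = head_sketch(partitions, n=5)  # first partition only has 3 rows

print(two_parts, all_parts, short, len(caught))
```

With the default `npartitions=1`, asking for more rows than the first partition holds returns what was found and emits a warning, which mirrors the behaviour described for the `npartitions` parameter.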

idxmax(self, axis=None, skipna=True, split_every=False)¶

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of maxima along the specified axis.

Raises

ValueError: If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.
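
This entry has no examples, so here is a small pandas sketch of both axes. The table is invented illustration data, and the Dask-specific `split_every` argument is left at its default.

```python
import pandas as pd

df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
                   'co2': [37.2, 19.66, 1712.0]},
                  index=['Pork', 'Wheat', 'Beef'])

per_column = df.idxmax()      # row label of the maximum in each column
per_row = df.idxmax(axis=1)   # column label of the maximum in each row

print(per_column.to_dict())
print(per_row.to_dict())
```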

idxmin(self, axis=None, skipna=True, split_every=False)¶

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of minima along the specified axis.

Raises

ValueError: If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

property index¶: Return dask Index instance

isin(self, values)¶

Check whether values are contained in Series.

This docstring was copied from pandas.core.series.Series.isin.

Some inconsistencies with the Dask version may exist.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.

Parameters

valuesset or list-like: The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Returns

Series: Series of booleans indicating if each element is in values.

Raises

TypeError: If values is a string

See also

DataFrame.isin: Equivalent method on DataFrame.

Examples

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',  
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])  
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])  
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

isna(self)¶

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isna.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull: Alias of isna.
DataFrame.notna: Boolean inverse of isna.
DataFrame.dropna: Omit axes labels with missing values.
isna: Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.nan],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()  
0    False
1    False
2     True
dtype: bool

isnull(self)¶

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull: Alias of isna.
DataFrame.notna: Boolean inverse of isna.
DataFrame.dropna: Omit axes labels with missing values.
isna: Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.isna()  
0    False
1    False
2     True
dtype: bool

iteritems(self)¶

Lazily iterate over (index, value) tuples.

This docstring was copied from pandas.core.series.Series.iteritems.

Some inconsistencies with the Dask version may exist.

This method returns an iterable of (index, value) tuples. This is convenient if you want to create a lazy iterator.

Returns

iterable: Iterable of tuples containing the (index, value) pairs from a Series.

See also

DataFrame.items: Iterate over (column name, Series) pairs.
DataFrame.iterrows: Iterate over DataFrame rows as (index, Series) pairs.

Examples

>>> s = pd.Series(['A', 'B', 'C'])  
>>> for index, value in s.items():  
...     print(f"Index : {index}, Value : {value}")
Index : 0, Value : A
Index : 1, Value : B
Index : 2, Value : C

property known_divisions¶: Whether divisions are already known

last(self, offset)¶

Method to subset final periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.last.

Some inconsistencies with the Dask version may exist.

Parameters

offset : str, DateOffset, dateutil.relativedelta

Returns

subset : same type as caller

Raises

TypeError: If the index is not a DatetimeIndex

See also

first: Select initial periods of time series based on a date offset.
at_time: Select values at a particular time of the day.
between_time: Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')  
            A
2018-04-13  3
2018-04-15  4

Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
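
The calendar-based cutoff can be made explicit with plain index arithmetic. This is only a sketch of the selection described above for a fixed-size offset such as '3D', written with pandas alone; it is not the library's own implementation:

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# The window is anchored at the last index label, not at the last 3 rows.
cutoff = ts.index[-1] - pd.Timedelta("3D")   # 2018-04-12
last_3_days = ts[ts.index > cutoff]          # keeps 2018-04-13 and 2018-04-15
print(last_3_days)
```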

le(self, other, level=None, fill_value=None, axis=0)¶

Return Less than or equal to of series and other, element-wise (binary operator le).

This docstring was copied from pandas.core.series.Series.le.

Some inconsistencies with the Dask version may exist.

Equivalent to series <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other : Series or scalar value
fill_value : None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level : int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

property loc¶

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  

lt(self, other, level=None, fill_value=None, axis=0)¶

Return Less than of series and other, element-wise (binary operator lt).

This docstring was copied from pandas.core.series.Series.lt.

Some inconsistencies with the Dask version may exist.

Equivalent to series < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other : Series or scalar value
fill_value : None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level : int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

map(self, arg, na_action=None, meta='__no_default__')¶

Map values of Series according to input correspondence.

This docstring was copied from pandas.core.series.Series.map.

Some inconsistencies with the Dask version may exist.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters

arg : function, collections.abc.Mapping subclass or Series: Mapping correspondence.
na_action : {None, ‘ignore’}, default None: If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

Series: Same index as caller.

See also

Series.apply: For applying more complex functions on a Series.
DataFrame.apply: Apply a function row-/column-wise.
DataFrame.applymap: Apply a function elementwise on a whole DataFrame.

Notes

When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.

Examples

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])  
>>> s  
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

map accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless the dict has a default value (e.g. defaultdict):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})  
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)  
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can be used:

>>> s.map('I am a {}'.format, na_action='ignore')  
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object

map_overlap(self, func, before, after, *args, **kwargs)¶

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters

func : function: Function applied to each partition.
before : int: The number of rows to prepend to partition i from the end of partition i - 1.
after : int: The number of rows to append to partition i from the beginning of partition i + 1.
args, kwargs: Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
Apply func to each partition, passing in any extra args and kwargs if provided.
Trim before rows from the beginning of all but the first partition.
Trim after rows from the end of all but the last partition.

Note that the index and divisions are assumed to remain unchanged.
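
The five steps above can be sketched in plain Python, with lists standing in for partitions. This is an illustration of the idea only, not Dask's implementation: map_overlap_sketch and rolling_sum_2 are hypothetical helpers, and the sketch assumes every neighbouring partition holds at least before (or after) rows.

```python
def map_overlap_sketch(partitions, func, before, after):
    """Apply func to each partition padded with rows from its neighbours."""
    results = []
    last = len(partitions) - 1
    for i, part in enumerate(partitions):
        # Steps 1 and 2: borrow rows from the previous and next partitions.
        head = partitions[i - 1][-before:] if i > 0 and before else []
        tail = partitions[i + 1][:after] if i < last and after else []
        # Step 3: apply the function to the padded partition.
        out = func(head + part + tail)
        # Steps 4 and 5: trim the borrowed rows again.
        results.append(out[len(head):len(out) - len(tail)])
    return results


def rolling_sum_2(rows):
    """Trailing window of size 2; None where the window is incomplete."""
    return [None if j == 0 else rows[j - 1] + rows[j]
            for j in range(len(rows))]


parts = [[1, 2, 4], [7, 11]]
overlapped = map_overlap_sketch(parts, rolling_sum_2, before=1, after=0)
print(overlapped)
```

Without the borrowed row, the first element of the second partition would have no left neighbour and would come out as missing.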

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a pd.Timedelta for time-based windows.

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
Freq: D, dtype: float64

map_partitions(self, func, *args, **kwargs)¶

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters

func : function: Function applied to each partition.
args, kwargs: Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain Scalar, Delayed or regular python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=ddf)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  

mask(self, cond, other=nan)¶

Replace values where the condition is True.

This docstring was copied from pandas.core.frame.DataFrame.mask.

Some inconsistencies with the Dask version may exist.

Parameters

cond : bool Series/DataFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

other : scalar, Series/DataFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplace : bool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axis : int, default None (Not supported in Dask)

Alignment axis if needed.

level : int, default None (Not supported in Dask)

Alignment level if needed.

errors : str, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

‘raise’ : allow exceptions to be raised.
‘ignore’ : suppress exceptions. On error return original object.

try_cast : bool, default False (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Returns

Same type as caller

See also

DataFrame.where(): Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

max(self, axis=None, skipna=True, split_every=False, out=None)¶

Return the maximum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.max.

Some inconsistencies with the Dask version may exist.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters

axis : {index (0), columns (1)}: Axis for the function to be applied on.
skipna : bool, default True: Exclude NA/null values when computing the result.
level : int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only : bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.max()  
8

Max using level names, as well as indices.

>>> s.max(level='blooded')  
blooded
warm    4
cold    8
Name: legs, dtype: int64

>>> s.max(level=0)  
blooded
warm    4
cold    8
Name: legs, dtype: int64

mean(self, axis=None, skipna=True, split_every=False, dtype=None, out=None)¶

Return the mean of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.mean.

Some inconsistencies with the Dask version may exist.

Parameters

axis : {index (0), columns (1)}: Axis for the function to be applied on.
skipna : bool, default True: Exclude NA/null values when computing the result.
level : int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only : bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

memory_usage(self, index=True, deep=False)¶

Return the memory usage of the Series.

This docstring was copied from pandas.core.series.Series.memory_usage.

Some inconsistencies with the Dask version may exist.

The memory usage can optionally include the contribution of the index and of elements of object dtype.

Parameters

index : bool, default True: Specifies whether to include the memory usage of the Series index.
deep : bool, default False: If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.

Returns

int: Bytes of memory consumed.

See also

numpy.ndarray.nbytes: Total bytes consumed by the elements of the array.
DataFrame.memory_usage: Bytes consumed by a DataFrame.

Examples

>>> s = pd.Series(range(3))  
>>> s.memory_usage()  
152

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)  
24

The memory footprint of object values is ignored by default:

>>> s = pd.Series(["a", "b"])  
>>> s.values  
array(['a', 'b'], dtype=object)
>>> s.memory_usage()  
144
>>> s.memory_usage(deep=True)  
260

memory_usage_per_partition(self, index=True, deep=False)¶

Return the memory usage of each partition

Parameters

index : bool, default True: Specifies whether to include the memory usage of the index in returned Series.
deep : bool, default False: If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns

Series: A Series whose index is the partition number and whose values are the memory usage of each partition in bytes.

min(self, axis=None, skipna=True, split_every=False, out=None)¶

Return the minimum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.min.

Some inconsistencies with the Dask version may exist.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters

axis : {index (0), columns (1)}: Axis for the function to be applied on.
skipna : bool, default True: Exclude NA/null values when computing the result.
level : int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only : bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.min()  
0

Min using level names, as well as indices.

>>> s.min(level='blooded')  
blooded
warm    2
cold    0
Name: legs, dtype: int64

>>> s.min(level=0)  
blooded
warm    2
cold    0
Name: legs, dtype: int64

mod(self, other, level=None, fill_value=None, axis=0)¶

Return Modulo of series and other, element-wise (binary operator mod).

This docstring was copied from pandas.core.series.Series.mod.

Some inconsistencies with the Dask version may exist.

Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other : Series or scalar value
fill_value : None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level : int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rmod

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)  
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64

mode(self, dropna=True, split_every=False)¶

Return the mode(s) of the dataset.

This docstring was copied from pandas.core.series.Series.mode.

Some inconsistencies with the Dask version may exist.

Always returns Series even if only one value is returned.

Parameters

dropna : bool, default True: Don’t consider counts of NaN/NaT.

New in version 0.24.0.

Returns

Series: Modes of the Series in sorted order.

mul(self, other, level=None, fill_value=None, axis=0)¶

Return Multiplication of series and other, element-wise (binary operator mul).

This docstring was copied from pandas.core.series.Series.mul.

Some inconsistencies with the Dask version may exist.

Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other : Series or scalar value
fill_value : None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level : int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rmul

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)  
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

property nbytes¶: Number of bytes

property ndim¶: Return dimensionality

ne(self, other, level=None, fill_value=None, axis=0)¶

Return Not equal to of series and other, element-wise (binary operator ne).

This docstring was copied from pandas.core.series.Series.ne.

Some inconsistencies with the Dask version may exist.

Equivalent to series != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other : Series or scalar value
fill_value : None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level : int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

nlargest(self, n=5, split_every=None)¶

Return the largest n elements.

This docstring was copied from pandas.core.series.Series.nlargest.

Some inconsistencies with the Dask version may exist.

Parameters

n : int, default 5

Return this many descending sorted values.

keep : {‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

When there are duplicate values that cannot all fit in a Series of n elements:

first : return the first n occurrences in order of appearance.
last : return the last n occurrences in reverse order of appearance.
all : keep all occurrences. This can result in a Series of size larger than n.

Returns

Series: The n largest values in the Series, sorted in decreasing order.

See also

Series.nsmallest: Get the n smallest elements.
Series.sort_values: Sort Series by values.
Series.head: Return the first n rows.

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,  
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Monserat": 5200}
>>> s = pd.Series(countries_population)  
>>> s  
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Monserat        5200
dtype: int64

The n largest elements where n=5 by default.

>>> s.nlargest()  
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.

>>> s.nlargest(3)  
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.

>>> s.nlargest(3, keep='last')  
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')  
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

notnull(self)¶

Detect existing (non-missing) values.

This docstring was copied from pandas.core.frame.DataFrame.notnull.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns

DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

See also

DataFrame.notnull: Alias of notna.
DataFrame.isna: Boolean inverse of notna.
DataFrame.dropna: Omit axes labels with missing values.
notna: Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker

>>> df.notna()  
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64

>>> ser.notna()  
0     True
1     True
2    False
dtype: bool

property npartitions¶: Return number of partitions

nsmallest(self, n=5, split_every=None)¶

Return the smallest n elements.

This docstring was copied from pandas.core.series.Series.nsmallest.

Some inconsistencies with the Dask version may exist.

Parameters

n (int, default 5)

Return this many ascending sorted values.

keep ({‘first’, ‘last’, ‘all’}, default ‘first’) (Not supported in Dask)

When there are duplicate values that cannot all fit in a Series of n elements:

first: return the first n occurrences in order of appearance.
last: return the last n occurrences in reverse order of appearance.
all: keep all occurrences. This can result in a Series of size larger than n.

Returns

Series: The n smallest values in the Series, sorted in increasing order.

See also

Series.nlargest: Get the n largest elements.
Series.sort_values: Sort Series by values.
Series.head: Return the first n rows.

Notes

Faster than .sort_values().head(n) for small n relative to the size of the Series object.

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,  
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Monserat": 5200}
>>> s = pd.Series(countries_population)  
>>> s  
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Monserat        5200
dtype: int64

The n smallest elements where n=5 by default.

>>> s.nsmallest()  
Monserat      5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)  
Monserat     5200
Nauru       11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')  
Monserat     5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')  
Monserat     5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64

nunique(self, split_every=None)¶

Return number of unique elements in the object.

This docstring was copied from pandas.core.series.Series.nunique.

Some inconsistencies with the Dask version may exist.

Excludes NA values by default.

Parameters

dropna (bool, default True) (Not supported in Dask): Don’t include NaN in the count.

Returns

int

See also

DataFrame.nunique: Method nunique for DataFrame.
Series.count: Count non-NA/null observations in the Series.

Examples

>>> s = pd.Series([1, 3, 5, 7, 7])  
>>> s  
0    1
1    3
2    5
3    7
4    7
dtype: int64

>>> s.nunique()  
4

nunique_approx(self, split_every=None)¶

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters

split_every (int, optional): Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns

a float representing the approximate number of elements

property partitions¶

Slice dataframe by partitions

This allows partition-wise slicing of a Dask DataFrame. You can use normal NumPy-style slicing, but instead of slicing elements of the array you slice along partitions. For example, df.partitions[:5] produces a new Dask DataFrame containing the first five partitions.

Returns

A Dask DataFrame

Examples

>>> df.partitions[0]  
>>> df.partitions[:3]  
>>> df.partitions[::10]  

persist(self, **kwargs)¶

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The behavior of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as the dask.distributed scheduler does, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters

scheduler (string, optional): Which scheduler to use, such as “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph (bool, optional): If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
**kwargs: Extra keywords to forward to the scheduler function.

Returns

New dask collections backed by in-memory data

See also

dask.base.persist

pipe(self, func, *args, **kwargs)¶

Apply func(self, *args, **kwargs).

This docstring was copied from pandas.core.frame.DataFrame.pipe.

Some inconsistencies with the Dask version may exist.

Parameters

func (function): Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.
args (iterable, optional): Positional arguments passed into func.
kwargs (mapping, optional): A dictionary of keyword arguments passed into func.

Returns

object: the return type of func.

See also

DataFrame.apply
DataFrame.applymap
Series.map

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )

pow(self, other, level=None, fill_value=None, axis=0)¶

Return Exponential power of series and other, element-wise (binary operator pow).

This docstring was copied from pandas.core.series.Series.pow.

Some inconsistencies with the Dask version may exist.

Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other (Series or scalar value)
fill_value (None or float value, default None (NaN)): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level (int or name): Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rpow

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)  
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64

prod(self, axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶

Return the product of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters

axis ({index (0), columns (1)}): Axis for the function to be applied on.
skipna (bool, default True): Exclude NA/null values when computing the result.
level (int or level name, default None) (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count (int, default 0): The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([]).prod()  
1.0

This can be controlled with the min_count parameter

>>> pd.Series([]).prod(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0

>>> pd.Series([np.nan]).prod(min_count=1)  
nan

quantile(self, q=0.5, method='default')¶

Approximate quantiles of Series

Parameters

q (list/array of floats, default 0.5 (50%)): Iterable of numbers ranging from 0 to 1 for the desired quantiles.
method ({‘default’, ‘tdigest’, ‘dask’}, optional): What method to use. By default will use dask’s internal custom algorithm ('dask'). If set to 'tdigest' will use tdigest for floats and ints and fall back to 'dask' otherwise.

radd(self, other, level=None, fill_value=None, axis=0)¶

Return Addition of series and other, element-wise (binary operator radd).

This docstring was copied from pandas.core.series.Series.radd.

Some inconsistencies with the Dask version may exist.

Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other (Series or scalar value)
fill_value (None or float value, default None (NaN)): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level (int or name): Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.add

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)  
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

random_split(self, frac, random_state=None, shuffle=False)¶

Pseudorandomly split dataframe into different pieces row-wise

Parameters

frac (list): List of floats that should sum to one.
random_state (int or np.random.RandomState): If int, create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.
shuffle (bool, default False): If set to True, the dataframe is shuffled (within partition) before the split.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  

rdiv(self, other, level=None, fill_value=None, axis=0)¶

Return Floating division of series and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.series.Series.rdiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other (Series or scalar value)
fill_value (None or float value, default None (NaN)): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level (int or name): Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.truediv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64

reduction(self, chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶

Generic row-wise reductions.

Parameters

chunk (callable)

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate (callable, optional)

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

scalar: Input is a Series, with one row per partition.
Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine (callable, optional)

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta (pd.DataFrame, pd.Series, dict, iterable, tuple, optional)

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token (str, optional)

The name to use for the output keys.

split_every (int, optional)

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs (dict, optional)

Keyword arguments to pass on to chunk only.

aggregate_kwargs (dict, optional)

Keyword arguments to pass on to aggregate only.

combine_kwargs (dict, optional)

Keyword arguments to pass on to combine only.

**kwargs

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725

rename(self, index=None, inplace=False, sorted_index=False)¶

Alter Series index labels or name

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

Parameters

index (scalar, hashable sequence, dict-like or callable, optional): If dict-like or callable, the transformation is applied to the index. Scalar or hashable sequence-like will alter the Series.name attribute.
inplace (boolean, default False): Whether to return a new Series or modify this one inplace.
sorted_index (bool, default False): If true, the output Series will have known divisions inferred from the input series and the transformation. Ignored for non-callable/dict-like index or when the input series has unknown divisions. Note that this may only be set to True if you know that the transformed index is monotonically increasing. Dask will check that transformed divisions are monotonic, but cannot check all the values between divisions, so incorrectly setting this can result in bugs.

Returns

renamed (Series)

See also

pandas.Series.rename

repartition(self, divisions=None, npartitions=None, partition_size=None, freq=None, force=False)¶

Repartition dataframe along new divisions

Parameters

divisions (list, optional): List of partitions to be used. Only used if npartitions and partition_size aren’t specified. For convenience, if given an integer this will defer to npartitions, and if given a string it will defer to partition_size (see below).
npartitions (int, optional): Number of partitions of output. Only used if partition_size isn’t specified.
partition_size (int or string, optional): Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified, npartitions and divisions will be ignored.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.

freq (str, pd.Timedelta): A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.
force (bool, default False): Allows the expansion of the existing divisions. If False, then the lower and upper bounds of the new divisions must be the same as those of the old divisions.

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  

replace(self, to_replace=None, value=None, regex=False)¶

Replace values given in to_replace with value.

This docstring was copied from pandas.core.frame.DataFrame.replace.

Some inconsistencies with the Dask version may exist.

Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters

to_replace (str, regex, list, dict, Series, int, float, or None)

How to find the values that will be replaced.

numeric, str or regex:
- numeric: numeric values equal to to_replace will be replaced with value
- str: string exactly matching to_replace will be replaced with value
- regex: regexs matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
- For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
- For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
- This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

value (scalar, dict, list, str, regex, default None)

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplace (bool, default False) (Not supported in Dask)

If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.

limit (int, default None) (Not supported in Dask)

Maximum size gap to forward or backward fill.

regex (bool or same types as to_replace, default False)

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method ({‘pad’, ‘ffill’, ‘bfill’, None}) (Not supported in Dask)

The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

Returns

DataFrame: Object after replacement.

Raises

AssertionError

If regex is not a bool and to_replace is not None.

TypeError

If to_replace is a dict and value is not a list, dict, ndarray, or Series
If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
When replacing multiple bool or datetime64 objects and the arguments to to_replace does not match the type of the value being replaced

ValueError

If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna: Fill NA values.
DataFrame.where: Replace values based on boolean condition.
Series.str.replace: Simple string replacement.

Notes

Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([0, 1, 2, 3, 4])  
>>> s.replace(0, 5)  
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)  
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)  
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])  
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')  
0    0
1    3
2    3
3    3
4    4
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})  
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)  
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})  
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)  
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)  
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')  
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})  
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')  
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace parameter must match the data type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],  
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises  
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a TypeError because one of the dict keys is not of the correct type for replacement.

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  

When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})  
0      10
1    None
2    None
3       b
4    None
dtype: object

When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):

>>> s.replace('a', None)  
0    10
1    10
2    10
3     b
4     b
dtype: object

resample(self, rule, closed=None, label=None)¶

Resample time-series data.

This docstring was copied from pandas.core.frame.DataFrame.resample.

Some inconsistencies with the Dask version may exist.

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters

rule (DateOffset, Timedelta or str): The offset string or object representing target conversion.
axis ({0 or ‘index’, 1 or ‘columns’}, default 0) (Not supported in Dask): Which axis to use for up- or down-sampling. For Series this will default to 0, i.e. along the rows. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed ({‘right’, ‘left’}, default None): Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label ({‘right’, ‘left’}, default None): Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention ({‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’) (Not supported in Dask): For PeriodIndex only, controls whether to use the start or end of rule.
kind ({‘timestamp’, ‘period’}, optional, default None) (Not supported in Dask): Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
loffset (timedelta, default None) (Not supported in Dask): Adjust the resampled time labels.
base (int, default 0) (Not supported in Dask): For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.
on (str, optional) (Not supported in Dask): For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
level (str or int, optional) (Not supported in Dask): For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

Returns

Resampler object

See also

groupby: Group by mapping, function, label, or list of labels.
Series.resample: Resample a Series.
DataFrame.resample: Resample a DataFrame.

Notes

See the user guide for more.

To learn more about the offset strings, please see this link.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply.

>>> def custom_resampler(array_like):  
...     return np.sum(array_like) + 5
...
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  
...                                                   freq='Q',
...                                                   periods=4))
>>> q  
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df = pd.DataFrame(d)  
>>> df['week_starting'] = pd.date_range('01/01/2018',  
...                                     periods=8,
...                                     freq='W')
>>> df  
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df2 = pd.DataFrame(d2,  
...                    index=pd.MultiIndex.from_product([days,
...                                                     ['morning',
...                                                      'afternoon']]
...                                                     ))
>>> df2  
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

reset_index(self, drop=False)¶

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters

drop: boolean, default False: Do not try to insert index into dataframe columns.
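
The per-partition restart described above can be reproduced with plain pandas. This is only an emulation under stated assumptions: the two frames in `partitions` are hypothetical stand-ins for Dask partitions, and no Dask call is made.

```python
import pandas as pd

# Hypothetical stand-ins for two partitions of a Dask DataFrame.
partitions = [
    pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"]),
    pd.DataFrame({"x": [40, 50]}, index=["d", "e"]),
]

# Resetting each partition on its own restarts the index at 0 every time,
# because no partition knows how many rows precede it.
reset = pd.concat([part.reset_index(drop=True) for part in partitions])

print(list(reset.index))
```

The combined index contains repeated labels, so code that needs a unique running index has to build one separately.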

rfloordiv(self, other, level=None, fill_value=None, axis=0)¶

Return Integer division of series and other, element-wise (binary operator rfloordiv).

This docstring was copied from pandas.core.series.Series.rfloordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.floordiv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)  
a    1.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
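
The example above calls floordiv, the non-reflected form. As a small sketch of how the reflected form swaps the operands, shown with plain pandas (the docstring's origin) rather than a Dask collection:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0], index=["a", "b", "c"])

# Reflected form: the Series is the divisor, so this computes 8 // s.
reflected = s.rfloordiv(8)

# Non-reflected form for comparison: this computes s // 8.
forward = s.floordiv(8)

print(reflected.tolist())
print(forward.tolist())
```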

rmod(self, other, level=None, fill_value=None, axis=0)¶

Return Modulo of series and other, element-wise (binary operator rmod).

This docstring was copied from pandas.core.series.Series.rmod.

Some inconsistencies with the Dask version may exist.

Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.mod

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)  
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64

rmul(self, other, level=None, fill_value=None, axis=0)¶

Return Multiplication of series and other, element-wise (binary operator rmul).

This docstring was copied from pandas.core.series.Series.rmul.

Some inconsistencies with the Dask version may exist.

Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.mul

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)  
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

rolling(self, window, min_periods=None, center=False, win_type=None, axis=0)¶

Provides rolling transformations.

Parameters

window: int, str, offset: Size of the moving window. This is the number of observations used for calculating the statistic. When not using a DatetimeIndex, the window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.

Changed in version 0.15.0: Now accepts offsets and string offset aliases.
min_periods: int, default None: Minimum number of observations in window required to have a value (otherwise result is NA).
center: boolean, default False: Set the labels at the center of the window.
win_type: string, default None: Provide a window type. The recognized window types are identical to pandas.
axis: int, default 0

Returns

a Rolling object on which to call a method to compute a statistic
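
As a rough illustration of how window, min_periods and center interact, here is a sketch using pandas rolling on a small Series. It assumes the Dask method mirrors pandas for these arguments on data that fits in a single partition; the sample values are made up.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a value is produced only once the window holds 3 observations.
full = s.rolling(window=3).mean()

# min_periods=1 emits a value as soon as one observation is available.
partial = s.rolling(window=3, min_periods=1).mean()

# center=True labels each window by its middle element.
centered = s.rolling(window=3, center=True).mean()

print(full.tolist())
print(partial.tolist())
print(centered.tolist())
```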

round(self, decimals=0)¶

Round each value in a Series to the given number of decimals.

This docstring was copied from pandas.core.series.Series.round.

Some inconsistencies with the Dask version may exist.

Parameters

decimals: int, default 0: Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns

Series: Rounded values of the Series.

See also

numpy.around: Round values of an np.array.
DataFrame.round: Round values of a DataFrame.

Examples

>>> s = pd.Series([0.1, 1.3, 2.7])  
>>> s.round()  
0    0.0
1    1.0
2    3.0
dtype: float64
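
A negative decimals value is not covered by the example above. The following pandas sketch, with made-up values, rounds to one decimal place and then to the nearest hundred:

```python
import pandas as pd

s = pd.Series([1234.567, 15.0, 960.0])

# Positive decimals: keep one digit after the decimal point.
one_place = s.round(1)

# Negative decimals: round two positions to the left of the decimal point.
hundreds = s.round(-2)

print(one_place.tolist())
print(hundreds.tolist())
```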

rpow(self, other, level=None, fill_value=None, axis=0)¶

Return Exponential power of series and other, element-wise (binary operator rpow).

This docstring was copied from pandas.core.series.Series.rpow.

Some inconsistencies with the Dask version may exist.

Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.pow

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)  
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64

rsub(self, other, level=None, fill_value=None, axis=0)¶

Return Subtraction of series and other, element-wise (binary operator rsub).

This docstring was copied from pandas.core.series.Series.rsub.

Some inconsistencies with the Dask version may exist.

Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.sub

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)  
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
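
The example above calls subtract (a - b). A sketch of the reflected call on the same data, using plain pandas, shows the signs flipping while fill_value behaves the same way:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 1.0, 1.0, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1.0, np.nan, 1.0, np.nan], index=["a", "b", "d", "e"])

# a.rsub(b) computes b - a. A value missing on one side only is replaced
# by 0 before subtracting; 'e' is missing on both sides and stays NaN.
result = a.rsub(b, fill_value=0)

print(result)
```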

rtruediv(self, other, level=None, fill_value=None, axis=0)¶

Return Floating division of series and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.series.Series.rtruediv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.truediv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64

sample(self, n=None, frac=None, replace=False, random_state=None)¶

Random sample of items

Parameters

n: int, optional: Number of items to return. Not supported by Dask; use frac instead.
frac: float, optional: Fraction of axis items to return.
replace: boolean, optional: Sample with or without replacement. Default = False.
random_state: int or np.random.RandomState: If int, we create a new RandomState with this as the seed. Otherwise, we draw from the passed RandomState.

See also

DataFrame.random_split
pandas.DataFrame.sample
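
A minimal sketch of sampling by fraction with a fixed seed, using pandas.DataFrame.sample (listed above) on made-up data. With a Dask DataFrame the same frac and random_state arguments apply, while n is not available.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# frac=0.3 keeps 30% of the rows; a fixed seed makes the draw repeatable.
first = df.sample(frac=0.3, random_state=42)
second = df.sample(frac=0.3, random_state=42)

print(len(first))
print(first.equals(second))
```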

sem(self, axis=None, skipna=None, ddof=1, split_every=False)¶

Return unbiased standard error of the mean over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sem.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters

axis: {index (0), columns (1)}
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level: int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only: bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)
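
As a quick check of what sem computes, the sketch below compares pandas sem with the sample standard deviation divided by the square root of N. The values are made up, and pandas is used because the docstring is copied from it.

```python
import math
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 6.0])

# Standard error of the mean: sample standard deviation (ddof=1) over sqrt(N).
manual = s.std(ddof=1) / math.sqrt(len(s))

print(s.sem(), manual)
```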

property shape¶

Return a tuple representing the dimensionality of a Series.

The single element of the tuple is a Delayed result.

Examples

>>> series.shape  
# (dd.Scalar<size-ag..., dtype=int64>,)

shift(self, periods=1, freq=None, axis=0)¶

Shift index by desired number of periods with an optional time freq.

This docstring was copied from pandas.core.frame.DataFrame.shift.

Some inconsistencies with the Dask version may exist.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq.

Parameters

periods: int: Number of periods to shift. Can be positive or negative.
freq: DateOffset, tseries.offsets, timedelta, or str, optional: Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
axis: {0 or ‘index’, 1 or ‘columns’, None}, default None: Shift direction.
fill_value: object, optional (Not supported in Dask): The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.

Changed in version 0.24.0.

Returns

DataFrame: Copy of input object, shifted.

See also

Index.shift: Shift values of Index.
DatetimeIndex.shift: Shift values of DatetimeIndex.
PeriodIndex.shift: Shift values of PeriodIndex.
tshift: Shift the time index, using the index’s frequency if available.

Examples

>>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],  
...                    'Col2': [13, 23, 18, 33, 48],
...                    'Col3': [17, 27, 22, 37, 52]})

>>> df.shift(periods=3)  
   Col1  Col2  Col3
0   NaN   NaN   NaN
1   NaN   NaN   NaN
2   NaN   NaN   NaN
3  10.0  13.0  17.0
4  20.0  23.0  27.0

>>> df.shift(periods=1, axis='columns')  
   Col1  Col2  Col3
0   NaN  10.0  13.0
1   NaN  20.0  23.0
2   NaN  15.0  18.0
3   NaN  30.0  33.0
4   NaN  45.0  48.0

>>> df.shift(periods=3, fill_value=0)  
   Col1  Col2  Col3
0     0     0     0
1     0     0     0
2     0     0     0
3    10    13    17
4    20    23    27
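
None of the examples above passes freq. The pandas sketch below, on a made-up daily Series, contrasts the two modes: without freq the values move against a fixed index, and with freq the index itself moves while the values stay attached.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D")
s = pd.Series([10, 20, 30], index=idx)

# Without freq: values move down one row and the first slot becomes NaN.
by_position = s.shift(1)

# With freq: the index moves forward one day and no value is lost.
by_freq = s.shift(1, freq="D")

print(by_position)
print(by_freq)
```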

shuffle(self, on, npartitions=None, max_branch=None, shuffle=None, ignore_index=False, compute=None)¶

Rearrange DataFrame into new partitions

Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.

Parameters

on: str, list of str, or Series, Index, or DataFrame: Column(s) or index to be used to map rows to output partitions
npartitions: int, optional: Number of partitions of output. Partition count will not be changed by default.
max_branch: int, optional: The maximum number of splits per input partition. Used within the staged shuffling algorithm.
shuffle: {‘disk’, ‘tasks’}, optional: Either 'disk' for single-node operation or 'tasks' for distributed operation. If not given, it is inferred from your current scheduler.
ignore_index: bool, default False: Ignore index during shuffle. If True, performance may improve, but index values will not be preserved.
compute: bool: Whether or not to trigger an immediate computation. Defaults to False.

Notes

This does not preserve a meaningful index/partitioning scheme. This is not deterministic if done in parallel.

Examples

>>> df = df.shuffle(df.columns[0])  
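
The idea of mapping rows to output partitions by hashing can be sketched with pandas alone. This is an illustration of the principle, not Dask's actual implementation: the column name `key`, the partition count, and the modulo step are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "c", "b", "a"], "value": range(6)})
npartitions = 2

# Hash each key and reduce it modulo the partition count. Equal keys hash to
# the same value, so they always receive the same partition number.
hashes = pd.util.hash_pandas_object(df["key"], index=False)
partition_of_row = (hashes % npartitions).astype(int)

# Count how many distinct partitions each key was assigned to.
partitions_per_key = partition_of_row.groupby(df["key"]).nunique()

print(partitions_per_key.to_dict())
```

Every key lands in exactly one partition, which is the property the description above relies on.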

property size¶

Size of the Series or DataFrame as a Delayed object.

Examples

>>> series.size  
dd.Scalar<size-ag..., dtype=int64>

squeeze(self)¶

Squeeze 1 dimensional axis objects into scalars.

This docstring was copied from pandas.core.series.Series.squeeze.

Some inconsistencies with the Dask version may exist.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’, None}, default None (Not supported in Dask): A specific axis to squeeze. By default, all length-1 axes are squeezed.

Returns

DataFrame, Series, or scalar: The projection after squeezing axis or all the axes.

See also

Series.iloc: Integer-location based indexing for selecting scalars.
DataFrame.iloc: Integer-location based indexing for selecting Series.
Series.to_frame: Inverse of DataFrame.squeeze for a single-column DataFrame.

Examples

>>> primes = pd.Series([2, 3, 5, 7])  

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]  
>>> even_primes  
0    2
dtype: int64

>>> even_primes.squeeze()  
2

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]  
>>> odd_primes  
1    3
2    5
3    7
dtype: int64

>>> odd_primes.squeeze()  
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  
>>> df  
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[['a']]  
>>> df_a  
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze('columns')  
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ['a']]  
>>> df_0a  
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze('rows')  
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()  
1

std(self, axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶

Return sample standard deviation over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.std.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters

axis: {index (0), columns (1)}
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level: int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only: bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)
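
To make the N - ddof divisor concrete, this pandas sketch computes the standard deviation of four made-up values both ways. The sum of squared deviations is 5.0, so the two results are sqrt(5/3) and sqrt(5/4).

```python
import math
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Sum of squared deviations from the mean (2.5): 2.25 + 0.25 + 0.25 + 2.25.
squared_dev = ((s - s.mean()) ** 2).sum()

sample_std = s.std()            # ddof=1: divide by N - 1 = 3
population_std = s.std(ddof=0)  # ddof=0: divide by N = 4

print(sample_std, population_std)
```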

str¶: Namespace for string methods

sub(self, other, level=None, fill_value=None, axis=0)¶

Return Subtraction of series and other, element-wise (binary operator sub).

This docstring was copied from pandas.core.series.Series.sub.

Some inconsistencies with the Dask version may exist.

Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

other: Series or scalar value
fill_value: None or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rsub

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)  
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64

sum(self, axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶

Return the sum of the values for the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sum.

Some inconsistencies with the Dask version may exist.

This is equivalent to the method numpy.sum.

Parameters

axis: {index (0), columns (1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
level: int or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only: bool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count: int, default 0: The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
**kwargs: Additional keyword arguments to be passed to the function.

Returns

Series or DataFrame (if level specified)

See also

Series.sum: Return the sum.
Series.min: Return the minimum.
Series.max: Return the maximum.
Series.idxmin: Return the index of the minimum.
Series.idxmax: Return the index of the maximum.
DataFrame.sum: Return the sum over the requested axis.
DataFrame.min: Return the minimum over the requested axis.
DataFrame.max: Return the maximum over the requested axis.
DataFrame.idxmin: Return the index of the minimum over the requested axis.
DataFrame.idxmax: Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

>>> s.sum()  
14

Sum using level names, as well as indices.

>>> s.sum(level='blooded')  
blooded
warm    6
cold    8
Name: legs, dtype: int64

>>> s.sum(level=0)  
blooded
warm    6
cold    8
Name: legs, dtype: int64

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([]).sum()  # min_count=0 is the default  
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([]).sum(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()  
0.0

>>> pd.Series([np.nan]).sum(min_count=1)  
nan

tail(self, n=5, compute=True)¶

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
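
The consequence of this caveat can be emulated with plain pandas. In the sketch below the frames in `partitions` are hypothetical stand-ins for Dask partitions, and the last one is deliberately shorter than n.

```python
import pandas as pd

# Hypothetical stand-ins for the partitions of a Dask DataFrame.
partitions = [
    pd.DataFrame({"x": range(8)}),
    pd.DataFrame({"x": [8, 9]}),
]

# Looking only at the last partition yields 2 rows, although 5 were
# requested and the whole dataset holds 10.
last_only = partitions[-1].tail(5)

# Looking at the full dataset yields the 5 rows one would expect.
full = pd.concat(partitions, ignore_index=True).tail(5)

print(len(last_only), len(full))
```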

to_bag(self, index=False)¶: Create a Dask Bag from a Series

to_csv(self, filename, **kwargs)¶

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, …

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))

>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'

>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  

/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 

Parameters

df: dask.DataFrame: Data to save
filename: string: Path glob indicating the naming scheme for the output files
single_file: bool, default False: Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file. Note that not all filesystems support the append mode and thus the single file mode, especially on cloud storage systems such as S3 or GCS. A warning will be issued when writing to a file that is not backed by a local filesystem.
encoding: string, optional: A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
mode: str: Python write mode, default ‘w’
name_function: callable, default None: Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions. Not supported when single_file is True.
compression: string, optional: A string representing the compression to use in the output file. Allowed values are ‘gzip’, ‘bz2’, ‘xz’. Only used when the first argument is a filename.
compute: bool: If True, executes immediately. If False, returns a set of delayed objects, which can be computed at a later time.
storage_options: dict: Parameters passed on to the backend filesystem class.
header_first_partition_only: boolean, default None: If set to True, only write the header row in the first output file. By default, headers are written to all partitions under the multiple file mode (single_file is False) and written only once under the single file mode (single_file is True). It must not be False under the single file mode.
compute_kwargs: dict, optional: Options to be passed in to the compute method
kwargs: dict, optional: Additional parameters to pass to pd.DataFrame.to_csv()

Returns

The names of the files written, if they were computed right away.
Otherwise, the delayed tasks associated with writing the files.

Raises

ValueError: If header_first_partition_only is set to False or name_function is specified when single_file is True.
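
The requirement that name_function preserve partition order is easiest to see with more than ten partitions. The sketch below uses only the standard library; `padded_name` and `expand` are hypothetical helpers that emulate the replacement of the asterisk and are not part of Dask.

```python
def padded_name(i):
    """Hypothetical name_function: zero-pad so names sort like the integers."""
    return f"{i:03d}"


def expand(globstring, name_function, npartitions):
    """Emulate replacing '*' with one name per partition (not Dask's own code)."""
    return [globstring.replace("*", name_function(i)) for i in range(npartitions)]


unpadded = expand("export-*.csv", str, 12)
padded = expand("export-*.csv", padded_name, 12)

# With plain integers, 'export-10.csv' sorts before 'export-2.csv'.
print(sorted(unpadded) == unpadded)
# With zero padding, lexicographic order matches partition order.
print(sorted(padded) == padded)
```

Zero padding is one simple way to keep the order; ISO-formatted dates, as in the name example above, have the same property.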

to_dask_array(self, lengths=None)¶

Convert a dask DataFrame to a dask array.

Parameters

lengths: bool or Sequence of ints, optional

How to determine the chunks sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.

True : immediately compute the length of each partition
Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.

to_delayed(self, optimize_graph=True)¶

Convert into a list of dask.delayed objects, one per partition.

Parameters

optimize_graph: bool, optional: If True [default], the graph is optimized before converting into dask.delayed objects.

See also

dask.dataframe.from_delayed

Examples

>>> partitions = df.to_delayed()  

to_frame(self, name=None)¶

Convert Series to DataFrame.

This docstring was copied from pandas.core.series.Series.to_frame.

Some inconsistencies with the Dask version may exist.

Parameters

name: object, default None: The passed name should substitute for the series name (if it has one).

Returns

DataFrame: DataFrame representation of Series.

Examples

>>> s = pd.Series(["a", "b", "c"],  
...               name="vals")
>>> s.to_frame()  
  vals
0    a
1    b
2    c

to_hdf(self, path_or_buf, key, mode='a', append=False, **kwargs)¶

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters

pathstring, pathlib.Path: Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
keystring: Datapath within the files. May contain a * to denote many locations
name_functionfunction: A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)
computebool: Whether or not to execute immediately. If False then this returns a dask.Delayed value.
lockLock, optional: Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
schedulerstring: The scheduler to use, like “threads” or “processes”
**other: See pandas.to_hdf for more information

Returns

filenameslist: Returned if compute is True. List of file names that each partition is saved to.
delayeddask.Delayed: Returned if compute is False. Delayed object to execute to_hdf when computed.

See also

read_hdf
to_parquet

Examples

Save data to a single file:

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))

>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 

to_json(self, filename, *args, **kwargs)¶: See dd.to_json docstring for more information

to_sql(self, name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False)¶: See dd.to_sql docstring for more information

to_string(self, max_rows=5)¶

Render a string representation of the Series.

This docstring was copied from pandas.core.series.Series.to_string.

Some inconsistencies with the Dask version may exist.

Parameters

bufStringIO-like, optional (Not supported in Dask): Buffer to write to.
na_repstr, optional (Not supported in Dask): String representation of NaN to use, default ‘NaN’.
float_formatone-parameter function, optional (Not supported in Dask): Formatter function to apply to columns’ elements if they are floats, default None.
headerbool, default True (Not supported in Dask): Add the Series header (index name).
indexbool, optional (Not supported in Dask): Add index (row) labels, default True.
lengthbool, default False (Not supported in Dask): Add the Series length.
dtypebool, default False (Not supported in Dask): Add the Series dtype.
namebool, default False (Not supported in Dask): Add the Series name if not None.
max_rowsint, optional: Maximum number of rows to show before truncating. If None, show all.
min_rowsint, optional (Not supported in Dask): The number of rows to display in a truncated repr (when number of rows is above max_rows).

Returns

str or None: String representation of Series if buf=None, otherwise None.

to_timestamp(self, freq=None, how='start', axis=0)¶

Cast to DatetimeIndex of timestamps, at beginning of period.

This docstring was copied from pandas.core.frame.DataFrame.to_timestamp.

Some inconsistencies with the Dask version may exist.

Parameters

freqstr, default frequency of PeriodIndex: Desired frequency.
how{‘s’, ‘e’, ‘start’, ‘end’}: Convention for converting period to timestamp; start of period vs. end.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to convert (the index by default).
copybool, default True (Not supported in Dask): If False then underlying input data is not copied.

Returns

DataFrame with DatetimeIndex

truediv(self, other, level=None, fill_value=None, axis=0)¶

Return Floating division of series and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.series.Series.truediv.

Some inconsistencies with the Dask version may exist.

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters

otherSeries or scalar value
fill_valueNone or float value, default None (NaN): Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing.
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns

Series: The result of the operation.

See also

Series.rtruediv

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64

unique(self, split_every=None, split_out=1)¶

Return Series of unique values in the object. Includes NA values.

Returns

uniquesSeries

value_counts(self, sort=None, ascending=False, dropna=None, split_every=None, split_out=1)¶

Return a Series containing counts of unique values.

This docstring was copied from pandas.core.series.Series.value_counts.

Some inconsistencies with the Dask version may exist.

Note: dropna is only supported in pandas >= 1.1.0, in which case it defaults to True.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters

normalizebool, default False (Not supported in Dask): If True then the object returned will contain the relative frequencies of the unique values.
sortbool, default True: Sort by frequencies.
ascendingbool, default False: Sort in ascending order.
binsint, optional (Not supported in Dask): Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
dropnabool, default True: Don’t include counts of NaN.

Returns

Series

See also

Series.count: Number of non-NA elements in a Series.
DataFrame.count: Number of non-NA elements in a DataFrame.

Examples

>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])  
>>> index.value_counts()  
3.0    2
4.0    1
2.0    1
1.0    1
dtype: int64

With normalize set to True, returns the relative frequency by dividing all values by the sum of values.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])  
>>> s.value_counts(normalize=True)  
3.0    0.4
4.0    0.2
2.0    0.2
1.0    0.2
dtype: float64

bins

Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique occurrences of values, divide the index into the specified number of half-open bins.

>>> s.value_counts(bins=3)  
(2.0, 3.0]      2
(0.996, 2.0]    2
(3.0, 4.0]      1
dtype: int64

dropna

With dropna set to False we can also see NaN index values.

>>> s.value_counts(dropna=False)  
3.0    2
NaN    1
4.0    1
2.0    1
1.0    1
dtype: int64

property values¶

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

var(self, axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶

Return unbiased variance over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.var.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters

axis{index (0), columns (1)}
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
levelint or level name, default None (Not supported in Dask): If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddofint, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_onlybool, default None (Not supported in Dask): Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns

Series or DataFrame (if level specified)
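The effect of ddof on the divisor can be checked by hand. The sketch below uses plain pandas and made-up values for brevity; the ddof argument is spelled the same way in the signature above.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

sample_var = df.var(ddof=1)["x"]       # divisor N - 1 = 3
population_var = df.var(ddof=0)["x"]   # divisor N - 0 = 4

# The same quantity computed directly from the definition.
deviations = df["x"] - df["x"].mean()
by_hand = (deviations ** 2).sum() / (len(df) - 1)
```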

visualize(self, filename='mydask', format=None, optimize_graph=False, **kwargs)¶

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters

filenamestr or None, optional: The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.
format{‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional: Format in which to write output file. Default is ‘png’.
optimize_graphbool, optional: If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
color: {None, ‘order’}, optional: Options to color nodes. Provide cmap= keyword for additional colormap
**kwargs: Additional keyword arguments to forward to to_graphviz.

Returns

resultIPython.display.Image, IPython.display.SVG, or None: See dask.dot.dot_graph for more information.

See also

dask.base.visualize
dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  

where(self, cond, other=nan)¶

Replace values where the condition is False.

This docstring was copied from pandas.core.frame.DataFrame.where.

Some inconsistencies with the Dask version may exist.

Parameters

condbool Series/DataFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

‘raise’ : allow exceptions to be raised.
‘ignore’ : suppress exceptions. On error return original object.

try_castbool, default False (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Returns

Same type as caller

See also

DataFrame.mask(): Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
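The Parameters section notes that cond and other may both be callables. A small pandas sketch of that form follows (the docstring is copied from pandas, so whether the Dask version accepts callables is not confirmed here):

```python
import pandas as pd

s = pd.Series(range(5))

# cond is evaluated on s and keeps the even values;
# other is also evaluated on s and supplies the replacements.
result = s.where(lambda x: x % 2 == 0, lambda x: -x)
```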

DataFrameGroupBy¶

class dask.dataframe.groupby.DataFrameGroupBy(df, by=None, slice=None, group_keys=True, dropna=None, sort=None)¶

agg(self, arg, split_every=None, split_out=1)¶

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.agg.

Some inconsistencies with the Dask version may exist.

Parameters

funcfunction, str, list or dict (Not supported in Dask)

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.DataFrame.groupby.apply
pandas.DataFrame.groupby.transform
pandas.DataFrame.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})

>>> df  
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183

The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.

aggregate(self, arg, split_every=None, split_out=1)¶

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.DataFrameGroupBy.aggregate.

Some inconsistencies with the Dask version may exist.

Parameters

funcfunction, str, list or dict (Not supported in Dask)

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.DataFrame.groupby.apply
pandas.DataFrame.groupby.transform
pandas.DataFrame.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})

>>> df  
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(  
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1 -1.956929
2      3 -0.322183

The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

See Named aggregation for more.

apply(self, func, *args, **kwargs)¶

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters

func: function: Function to apply
args, kwargsScalar, Delayed or object: Arguments and keywords to pass to the function.
metapd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

appliedSeries or DataFrame depending on columns keyword

corr(self, ddof=1, split_every=None, split_out=1)¶

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)
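The identity above can be verified numerically. The sketch below uses plain pandas on a single made-up group, so it illustrates only the formula and not Dask's internal implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 1.0, 4.0, 3.0]})

cov = df.cov()   # pairwise covariance, normalized by N - 1
std = df.std()   # per-column standard deviation, same normalization

# corr(X, Y) = cov(X, Y) / (std_x * std_y), applied to every pair at once.
manual = cov / np.outer(std, std)
```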

Parameters

method{‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)

Method of correlation:

pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

New in version 0.24.0.

min_periodsint, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

Returns

DataFrame: Correlation matrix.

See also

DataFrame.corrwith
Series.corr

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0

count(self, split_every=None, split_out=1)¶

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Count of values within each group.

See also

Series.groupby
DataFrame.groupby

cov(self, ddof=1, split_every=None, split_out=1, std=False)¶

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Groupby covariance is accomplished by

Computing intermediate values for sum, count, and the product of all columns: a b c -> a*a, a*b, b*b, b*c, c*c.
The values are then aggregated and the final covariance value is calculated: cov(X, Y) = X*Y - Xbar * Ybar
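The two steps can be sketched with NumPy for a single pair of columns split into two chunks. The closed-form expression used at the end is the standard sums-and-products identity for sample covariance; it is an assumption that this matches Dask's internal computation exactly.

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0, 1.0])
y = np.array([2.0, 3.0, 0.0, 1.0])

def partial(xs, ys):
    # Step 1: intermediate count, sums, and sum of products for one chunk.
    return np.array([len(xs), xs.sum(), ys.sum(), (xs * ys).sum()])

# Step 2: aggregate the intermediates across chunks, then finish.
n, sum_x, sum_y, sum_xy = partial(x[:2], y[:2]) + partial(x[2:], y[2:])
ddof = 1
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - ddof)
```

The inputs are the dogs and cats columns from the Examples section, so the result agrees with the off-diagonal entry shown there.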

When std is True, the correlation is calculated instead.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters

min_periodsint, optional (Not supported in Dask): Minimum number of observations required per pair of columns to have a valid result.

Returns

DataFrame: The covariance matrix of the series of the DataFrame.

See also

Series.cov: Compute covariance with another Series.
core.window.EWM.cov: Exponential weighted sample covariance.
core.window.Expanding.cov: Expanding sample covariance.
core.window.Rolling.cov: Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  
...                   columns=['dogs', 'cats'])
>>> df.cov()  
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(1000, 5),  
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(20, 3),  
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  
>>> df.loc[df.index[5:10], 'b'] = np.nan  
>>> df.cov(min_periods=12)  
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cumcount(self, axis=None)¶

Number each item in each group from 0 to the length of that group - 1.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.

Some inconsistencies with the Dask version may exist.

Essentially this is equivalent to

>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  

Parameters

ascendingbool, default True (Not supported in Dask): If False, number in reverse, from length of group - 1 to 0.

Returns

Series: Sequence number of each element within each group.

See also

ngroup: Number the groups themselves.

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  
...                   columns=['A'])
>>> df  
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64

cumprod(self, axis=0)¶

Cumulative product for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame

See also

Series.groupby
DataFrame.groupby

cumsum(self, axis=0)¶

Cumulative sum for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame

See also

Series.groupby
DataFrame.groupby

first(self, split_every=None, split_out=1)¶

Compute first of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed first of values within each group.

get_group(self, key)¶

Construct DataFrame from group with provided name.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.

Some inconsistencies with the Dask version may exist.

Parameters

nameobject (Not supported in Dask): The name of the group to get as a DataFrame.
objDataFrame, default None (Not supported in Dask): The DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used.

Returns

groupsame type as obj

idxmax(self, split_every=None, split_out=1, axis=None, skipna=True)¶

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of maxima along the specified axis.

Raises

ValueError

If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(self, split_every=None, split_out=1, axis=None, skipna=True)¶

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipnabool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of minima along the specified axis.

Raises

ValueError

If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

last(self, split_every=None, split_out=1)¶

Compute last of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed last of values within each group.

max(self, split_every=None, split_out=1)¶

Compute max of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed max of values within each group.

mean(self, split_every=None, split_out=1)¶

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.

Some inconsistencies with the Dask version may exist.

Returns

pandas.Series or pandas.DataFrame

See also

Series.groupby
DataFrame.groupby

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  
A
1    3.0
2    4.0
Name: B, dtype: float64

min(self, split_every=None, split_out=1)¶

Compute min of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed min of values within each group.

prod(self, split_every=None, split_out=1, min_count=None)¶

Compute prod of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed prod of values within each group.

size(self, split_every=None, split_out=1)¶

Compute group sizes.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.

Some inconsistencies with the Dask version may exist.

Returns

Series: Number of rows in each group.

See also

Series.groupby
DataFrame.groupby

std(self, ddof=1, split_every=None, split_out=1)¶

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters

ddofint, default 1: Degrees of freedom.

Returns

Series or DataFrame: Standard deviation of values within each group.

See also

Series.groupby
DataFrame.groupby

sum(self, split_every=None, split_out=1, min_count=None)¶

Compute sum of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed sum of values within each group.

transform(self, func, *args, **kwargs)¶

Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters

func: function: Function to apply
args, kwargs: Scalar, Delayed or object: Arguments and keywords to pass to the function.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

applied: Series or DataFrame depending on columns keyword

var(self, ddof=1, split_every=None, split_out=1)¶

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters

ddof: int, default 1: Degrees of freedom.

Returns

Series or DataFrame: Variance of values within each group.

See also

Series.groupby
DataFrame.groupby

SeriesGroupBy¶

class dask.dataframe.groupby.SeriesGroupBy(df, by=None, slice=None, **kwargs)¶

agg(self, arg, split_every=None, split_out=1)¶

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.agg.

Some inconsistencies with the Dask version may exist.

Parameters

func: function, str, list or dict (Not supported in Dask)

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

scalar : when Series.agg is called with a single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.Series.groupby.apply
pandas.Series.groupby.transform
pandas.Series.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])  

>>> s  
0    1
1    2
2    3
3    4
dtype: int64

>>> s.groupby([1, 1, 2, 2]).min()  
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg('min')  
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4

aggregate(self, arg, split_every=None, split_out=1)¶

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.aggregate.

Some inconsistencies with the Dask version may exist.

Parameters

func: function, str, list or dict (Not supported in Dask)

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

scalar : when Series.agg is called with a single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

pandas.Series.groupby.apply
pandas.Series.groupby.transform
pandas.Series.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])  

>>> s  
0    1
1    2
2    3
3    4
dtype: int64

>>> s.groupby([1, 1, 2, 2]).min()  
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg('min')  
1    1
2    3
dtype: int64

>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  
   min  max
1    1    2
2    3    4

The output column names can be controlled by passing the desired column names and aggregations as keyword arguments.

>>> s.groupby([1, 1, 2, 2]).agg(  
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4

apply(self, func, *args, **kwargs)¶

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters

func: function: Function to apply
args, kwargs: Scalar, Delayed or object: Arguments and keywords to pass to the function.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

applied: Series or DataFrame depending on columns keyword

corr(self, ddof=1, split_every=None, split_out=1)¶

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)
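
The identity above can be checked numerically for a single group. This is a small NumPy sketch with made-up values. It only illustrates the formula and is not how Dask evaluates it internally.

```python
import numpy as np

# Made-up observations for two columns within one group.
x = np.array([0.2, 0.0, 0.6, 0.2])
y = np.array([0.3, 0.6, 0.0, 0.1])

# corr(X, Y) = cov(X, Y) / (std_x * std_y), using the same ddof throughout.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
```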

Parameters

method: {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)

Method of correlation:

pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

New in version 0.24.0.

min_periods: int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

Returns

DataFrame: Correlation matrix.

See also

DataFrame.corrwith
Series.corr

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0

count(self, split_every=None, split_out=1)¶

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Count of values within each group.

See also

Series.groupby
DataFrame.groupby

cov(self, ddof=1, split_every=None, split_out=1, std=False)¶

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Groupby covariance is accomplished by

Computing intermediate values for sum, count, and the product of all columns: a b c -> a*a, a*b, b*b, b*c, c*c.
The values are then aggregated and the final covariance value is calculated: cov(X, Y) = X*Y - Xbar * Ybar

When std is True, the correlation is calculated instead of the covariance.
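
The two steps above follow a map-reduce pattern. Here is a sketch of that pattern for one pair of columns in one group, in plain NumPy. The helper names chunk_stats and finalize_cov are hypothetical, missing values are ignored, and the finalizer uses the standard sample-covariance identity rather than Dask's actual code.

```python
import numpy as np

def chunk_stats(x, y):
    """Per-chunk intermediates: count, sum of x, sum of y, sum of x*y."""
    return np.array([len(x), x.sum(), y.sum(), (x * y).sum()])

def finalize_cov(stats, ddof=1):
    """Combine the aggregated intermediates into a covariance estimate."""
    n, sx, sy, sxy = stats
    return (sxy - sx * sy / n) / (n - ddof)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

# Split into three "partitions", reduce each one, then add the intermediates.
chunks = zip(np.array_split(x, 3), np.array_split(y, 3))
total = sum(chunk_stats(cx, cy) for cx, cy in chunks)

cov_xy = finalize_cov(total)
```

Only four numbers per chunk have to be kept, which is what lets the aggregation run across partitions.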

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters

min_periods: int, optional (Not supported in Dask): Minimum number of observations required per pair of columns to have a valid result.

Returns

DataFrame: The covariance matrix of the series of the DataFrame.

See also

Series.cov: Compute covariance with another Series.
core.window.EWM.cov: Exponential weighted sample covariance.
core.window.Expanding.cov: Expanding sample covariance.
core.window.Rolling.cov: Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  
...                   columns=['dogs', 'cats'])
>>> df.cov()  
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(1000, 5),  
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(20, 3),  
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  
>>> df.loc[df.index[5:10], 'b'] = np.nan  
>>> df.cov(min_periods=12)  
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cumcount(self, axis=None)¶

Number each item in each group from 0 to the length of that group - 1.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.

Some inconsistencies with the Dask version may exist.

Essentially this is equivalent to

>>> self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  

Parameters

ascending: bool, default True (Not supported in Dask): If False, number in reverse, from length of group - 1 to 0.

Returns

Series: Sequence number of each element within each group.

See also

ngroup: Number the groups themselves.

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  
...                   columns=['A'])
>>> df  
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64

cumprod(self, axis=0)¶

Cumulative product for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame

See also

Series.groupby
DataFrame.groupby

cumsum(self, axis=0)¶

Cumulative sum for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame

See also

Series.groupby
DataFrame.groupby

first(self, split_every=None, split_out=1)¶

Compute first of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed first of values within each group.

get_group(self, key)¶

Construct DataFrame from group with provided name.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.

Some inconsistencies with the Dask version may exist.

Parameters

name: object (Not supported in Dask): The name of the group to get as a DataFrame.
obj: DataFrame, default None (Not supported in Dask): The DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used.

Returns

group: same type as obj

idxmax(self, split_every=None, split_out=1, axis=None, skipna=True)¶

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of maxima along the specified axis.

Raises

ValueError

If the row/column is empty

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(self, split_every=None, split_out=1, axis=None, skipna=True)¶

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters

axis: {0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns

Series: Indexes of minima along the specified axis.

Raises

ValueError

If the row/column is empty

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

last(self, split_every=None, split_out=1)¶

Compute last of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed last of values within each group.

max(self, split_every=None, split_out=1)¶

Compute max of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed max of values within each group.

mean(self, split_every=None, split_out=1)¶

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.

Some inconsistencies with the Dask version may exist.

Returns

pandas.Series or pandas.DataFrame

See also

Series.groupby
DataFrame.groupby

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only a particular column in the group.

>>> df.groupby('A')['B'].mean()  
A
1    3.0
2    4.0
Name: B, dtype: float64

min(self, split_every=None, split_out=1)¶

Compute min of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed min of values within each group.

nunique(self, split_every=None, split_out=1)¶

Return number of unique elements in the group.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.nunique.

Some inconsistencies with the Dask version may exist.

Returns

Series: Number of unique values within each group.

Examples

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}  
>>> df = pd.DataFrame(data=d)  
>>> ddf = dd.from_pandas(df, 2)  
>>> ddf.groupby(['col1']).col2.nunique().compute()  

prod(self, split_every=None, split_out=1, min_count=None)¶

Compute prod of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed prod of values within each group.

size(self, split_every=None, split_out=1)¶

Compute group sizes.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.

Some inconsistencies with the Dask version may exist.

Returns

Series: Number of rows in each group.

See also

Series.groupby
DataFrame.groupby

std(self, ddof=1, split_every=None, split_out=1)¶

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters

ddof: int, default 1: Degrees of freedom.

Returns

Series or DataFrame: Standard deviation of values within each group.

See also

Series.groupby
DataFrame.groupby

sum(self, split_every=None, split_out=1, min_count=None)¶

Compute sum of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed sum of values within each group.

transform(self, func, *args, **kwargs)¶

Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters

func: function: Function to apply
args, kwargs: Scalar, Delayed or object: Arguments and keywords to pass to the function.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

applied: Series or DataFrame depending on columns keyword

unique(self, split_every=None, split_out=1)¶

Return unique values of Series object.

This docstring was copied from pandas.core.groupby.generic.SeriesGroupBy.unique.

Some inconsistencies with the Dask version may exist.

Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.

Returns

ndarray or ExtensionArray: The unique values returned as a NumPy array. See Notes.

See also

unique: Top-level unique method for any 1-d array-like object.
Index.unique: Return Index with unique values from an Index object.

Notes

Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new ExtensionArray of that type with just the unique values is returned. This includes:

Categorical
Period
Datetime with Timezone
Interval
Sparse
IntegerNA

See Examples section.

Examples

>>> pd.Series([2, 1, 3, 3], name='A').unique()  
array([2, 1, 3])

>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()  
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')  
...            for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

An unordered Categorical will return categories in the order of appearance.

>>> pd.Series(pd.Categorical(list('baabc'))).unique()  
[b, a, c]
Categories (3, object): [b, a, c]

An ordered Categorical preserves the category ordering.

>>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'),  
...                          ordered=True)).unique()
[b, a, c]
Categories (3, object): [a < b < c]

var(self, ddof=1, split_every=None, split_out=1)¶

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters

ddof: int, default 1: Degrees of freedom.

Returns

Series or DataFrame: Variance of values within each group.

See also

Series.groupby
DataFrame.groupby

Custom Aggregation¶

class dask.dataframe.groupby.Aggregation(name, chunk, agg, finalize=None)¶

User defined groupby-aggregation.

This class allows users to define their own custom aggregation in terms of operations on Pandas dataframes in a map-reduce style. You need to specify what operation to do on each chunk of data, how to combine those chunks of data together, and then how to finalize the result.

See Aggregate for more.

Parameters

name: str: The name of the aggregation. It should be unique, since intermediate results will be identified by this name.
chunk: callable: A function that will be called with the grouped column of each partition. It can return either a single series or a tuple of series. The index has to be equal to the groups.
agg: callable: A function that will be called to aggregate the results of each chunk. Again, the argument(s) will be grouped series. If chunk returned a tuple, agg will be called with all of them as individual positional arguments.
finalize: callable: An optional finalizer that will be called with the results from the aggregation.

Examples

We could implement sum as follows:

>>> custom_sum = dd.Aggregation(
...     name='custom_sum',
...     chunk=lambda s: s.sum(),
...     agg=lambda s0: s0.sum()
... )  
>>> df.groupby('g').agg(custom_sum)  

We can implement mean as follows:

>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),
...     agg=lambda count, sum: (count.sum(), sum.sum()),
...     finalize=lambda count, sum: sum / count,
... )  
>>> df.groupby('g').agg(custom_mean)  

Though of course, both of these are built-in and so you don’t need to implement them yourself.

Storage and Conversion¶

dask.dataframe.read_csv(urlpath, blocksize='default', lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)¶

Read CSV files into a Dask.DataFrame

This parallelizes the pandas.read_csv() function in the following ways:

It supports loading many files at once using globstrings:

>>> df = dd.read_csv('myfiles.*.csv')

In some cases it can break up large files:

>>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks  

It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:

>>> df = dd.read_csv('s3://bucket/myfiles.*.csv')  
>>> df = dd.read_csv('hdfs:///myfiles.*.csv')  
>>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')  

Internally dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv() for more information on available keyword arguments.

Parameters

urlpath: string or list: Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
blocksize: str, int or None, optional: Number of bytes by which to cut up larger files. The default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
sample: int, optional: Number of bytes to use when determining dtypes.
assume_missing: bool, optional: If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
storage_options: dict, optional: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
include_path_column: bool or str, optional: Whether or not to include the path to each particular file. If True, a new column called path is added to the dataframe. If str, sets the new column name. Default is False.
**kwargs: Extra keyword arguments to forward to pandas.read_csv().

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

dask.dataframe.read_table(urlpath, blocksize='default', lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)¶

Read delimited files into a Dask.DataFrame

This parallelizes the pandas.read_table() function in the following ways:

It supports loading many files at once using globstrings:

>>> df = dd.read_table('myfiles.*.csv')

In some cases it can break up large files:

>>> df = dd.read_table('largefile.csv', blocksize=25e6)  # 25MB chunks  

It can read delimited files from external resources (e.g. S3, HDFS) by providing a URL:

>>> df = dd.read_table('s3://bucket/myfiles.*.csv')  
>>> df = dd.read_table('hdfs:///myfiles.*.csv')  
>>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv')  

Internally dd.read_table uses pandas.read_table() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_table() for more information on available keyword arguments.

Parameters

urlpath: string or list: Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
blocksize: str, int or None, optional: Number of bytes by which to cut up larger files. The default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
sample: int, optional: Number of bytes to use when determining dtypes.
assume_missing: bool, optional: If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
storage_options: dict, optional: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
include_path_column: bool or str, optional: Whether or not to include the path to each particular file. If True, a new column called path is added to the dataframe. If str, sets the new column name. Default is False.
**kwargs: Extra keyword arguments to forward to pandas.read_table().

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a delimited file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

dask.dataframe.read_fwf(urlpath, blocksize='default', lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)¶

Read fixed-width files into a Dask.DataFrame

This parallelizes the pandas.read_fwf() function in the following ways:

It supports loading many files at once using globstrings:
```
>>> df = dd.read_fwf('myfiles.*.csv')  
```

In some cases it can break up large files:

>>> df = dd.read_fwf('largefile.csv', blocksize=25e6)  # 25MB chunks  

It can read fixed-width files from external resources (e.g. S3, HDFS) by providing a URL:

>>> df = dd.read_fwf('s3://bucket/myfiles.*.csv')  
>>> df = dd.read_fwf('hdfs:///myfiles.*.csv')  
>>> df = dd.read_fwf('hdfs://namenode.example.com/myfiles.*.csv')  

Internally dd.read_fwf uses pandas.read_fwf() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_fwf() for more information on available keyword arguments.

Parameters

urlpath: string or list: Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
blocksize: str, int or None, optional: Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
sample: int, optional: Number of bytes to use when determining dtypes.
assume_missing: bool, optional: If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
storage_options: dict, optional: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
include_path_column: bool or str, optional: Whether or not to include the path to each particular file. If True a new column is added to the dataframe called path. If str, sets new column name. Default is False.
**kwargs: Extra keyword arguments to forward to pandas.read_fwf().

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a fixed-width file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.
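
The blocksize parameter accepts either a byte count or a size string. As a small sketch of how the two forms relate, dask.utils.parse_bytes converts size strings to integers; whether read_fwf calls this helper internally in a given Dask version is an assumption here.

```python
from dask.utils import parse_bytes

# Decimal units: "MB" means 10**6 bytes.
blocksize_int = 64000000
blocksize_from_str = parse_bytes("64MB")

# Binary units: "MiB" means 2**20 bytes.
binary_size = parse_bytes("256 MiB")
```

The distinction between "MB" and "MiB" matters when a block size has to line up with a limit expressed in powers of two.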

dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', gather_statistics=None, split_row_groups=None, chunksize=None, **kwargs)¶

Read a Parquet file into a Dask DataFrame

This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.

Parameters

path: string or list

Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

columns: string, list or None (default)

Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.

filters: Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]]

List of filters to apply, like [[('x', '=', 0), ...], ...]. This implements partition-level (hive) filtering only, i.e., to prevent the loading of some row-groups and/or files.

Predicates can be expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are combined with an AND conjunction into a larger predicate. The outer-most list then combines all of the combined filters with an OR disjunction.

Predicates can also be expressed as a List[Tuple]. These are evaluated as an AND conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.

index: string, list, False or None (default)

Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata (if present). Use False to read all fields as columns.

categories: list, dict or None

For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.

storage_options: dict

Key/value pairs to be passed on to the file-system backend, if any.

engine: {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet reader library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.

gather_statistics: bool or None (default)

Gather the statistics for each dataset partition. By default, this will only be done if the _metadata file is available. Otherwise, statistics will only be gathered if True, because the footer of every file will be parsed (which is very slow on some systems).

split_row_groups: bool or int

Default is True if a _metadata file is available or if the dataset is composed of a single file (otherwise default is False). If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer). Only the “pyarrow” engine supports this argument.

chunksize: int, str

The target task partition size. If set, consecutive row-groups from the same file will be aggregated into the same output partition until the aggregate size reaches this value.

**kwargs: dict (of dicts)

Passthrough keyword arguments for the read backend. The top-level keys correspond to the appropriate operation type, and the second level corresponds to the kwargs that will be passed on to the underlying pyarrow or fastparquet function. Supported top-level keys: ‘dataset’ (for opening a pyarrow dataset), ‘file’ (for opening a fastparquet ParquetFile), ‘read’ (for the backend read function), ‘arrow_to_pandas’ (for controlling the arguments passed to pyarrow.Table.to_pandas()).

See also

to_parquet

Examples

>>> df = dd.read_parquet('s3://bucket/my-parquet-data')  
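
The boolean structure of the filters argument can be sketched in plain Python. The helper below is hypothetical and only models the AND/OR semantics of the two accepted shapes against a single record; read_parquet itself applies filters at the partition level (row-groups and files), not row by row, and the operators an engine accepts may differ from the set assumed here.

```python
import operator

# Comparison operators assumed for this illustration only.
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def matches(record, filters):
    """Return True if ``record`` (a dict) satisfies ``filters``."""
    # A flat List[Tuple] is treated as a single AND group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # Inner tuples are ANDed together; the outer list ORs the groups.
    return any(
        all(OPS[op](record[col], val) for col, op, val in group)
        for group in filters
    )


and_only = [("x", ">", 0), ("y", "=", "a")]
dnf = [[("x", ">", 0), ("y", "=", "a")], [("x", "=", -1)]]

r1 = matches({"x": 5, "y": "a"}, and_only)  # both predicates hold
r2 = matches({"x": 5, "y": "b"}, and_only)  # second predicate fails
r3 = matches({"x": -1, "y": "b"}, dnf)      # second OR group holds
```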

dask.dataframe.read_orc(path, columns=None, storage_options=None)¶

Read dataframe from ORC file(s)

Parameters

path: str or list(str): Location of file(s), which can be a full URL with protocol specifier, and may include glob character if a single string.
columns: None or list(str): Columns to load. If None, loads all.
storage_options: None or dict: Further parameters to pass to the bytes backend.

Returns

Dask.DataFrame (even if there is only one column)

Examples

>>> df = dd.read_orc('https://github.com/apache/orc/raw/'
...                  'master/examples/demo-11-zlib.orc')  

dask.dataframe.read_hdf(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='a')¶

Read HDF files into a Dask DataFrame

Read hdf files into a dask dataframe. This function is like pandas.read_hdf, except it can read from a single large file, or from multiple files, or from multiple keys from the same file.

Parameters

pattern: string, pathlib.Path, list

File pattern (string), pathlib.Path, buffer to read from, or list of file paths. Can contain wildcards.

key: group identifier in the store. Can contain wildcards.

start: optional, integer (defaults to 0), row number to start at

stop: optional, integer (defaults to None, the last row), row number to stop at

columns: list of columns, optional

If not None, limits the returned columns to this list (default is None).

chunksize: positive integer, optional

Maximal number of rows per partition (default is 1000000).

sorted_index: boolean, optional

Option to specify whether or not the input hdf files have a sorted index (default is False).

lock: boolean, optional

Option to use a lock to prevent concurrency issues (default is True).

mode: {‘a’, ‘r’, ‘r+’}, default ‘a’. Mode to use when opening file(s).

‘r’: Read-only; no data can be modified.
‘a’: Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
‘r+’: It is similar to ‘a’, but the file must already exist.

Returns

dask.DataFrame

Examples

Load single file

>>> dd.read_hdf('myfile.1.hdf5', '/x')  

Load multiple files

>>> dd.read_hdf('myfile.*.hdf5', '/x')  

>>> dd.read_hdf(['myfile.1.hdf5', 'myfile.2.hdf5'], '/x')  

Load multiple datasets

>>> dd.read_hdf('myfile.1.hdf5', '/*')  

dask.dataframe.read_json(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', meta=None, engine=pd.read_json, **kwargs)¶

Create a dataframe from a set of JSON files

This utilises pandas.read_json(), and most parameters are passed through - see its docstring.

Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.

Parameters

url_path: str, list of str: Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
encoding, errors: The text encoding to use, e.g., “utf-8”, and how to respond to errors in the conversion (see bytes.decode()).
orient, lines, kwargs: passed to pandas; if not specified, lines=True when orient='records', False otherwise.
storage_options: dict: Passed to backend file-system implementation
blocksize: None or int: If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
sample: int: Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant if using blocksize.
compression: string or None: String like ‘gzip’ or ‘xz’.
engine: function object, default pd.read_json: The underlying function that dask will use to read JSON files. By default, this will be the pandas JSON reader (pd.read_json).
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns

dask.DataFrame

Examples

Load single file

>>> dd.read_json('myfile.1.json')  

Load multiple files

>>> dd.read_json('myfile.*.json')  

>>> dd.read_json(['myfile.1.json', 'myfile.2.json'])  

Load large line-delimited JSON files using partitions of approx 256MB size

>>> dd.read_json('data/file*.json', blocksize=2**28)
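
To see why line-delimited JSON can be split into blocks while other layouts cannot, here is a simplified standard-library model of cutting a byte stream at newline boundaries. The function split_on_newlines is hypothetical and is not Dask's implementation; it only illustrates that every block ends on a record boundary, so each block can be parsed on its own.

```python
import json

records = [{"id": i, "name": f"user{i}"} for i in range(6)]
# orient='records' with lines=True: one JSON object per line.
data = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")


def split_on_newlines(data, blocksize):
    """Cut ``data`` into blocks of about ``blocksize`` bytes ending on a newline."""
    blocks, start = [], 0
    while start < len(data):
        end = min(start + blocksize, len(data))
        # Extend the block to the next newline so no record is split.
        newline = data.find(b"\n", end - 1)
        end = len(data) if newline == -1 else newline + 1
        blocks.append(data[start:end])
        start = end
    return blocks


blocks = split_on_newlines(data, blocksize=60)
# Each block parses independently, line by line.
parsed = [json.loads(line) for block in blocks for line in block.splitlines()]
```

A single JSON array spanning the whole file has no such safe cut points, which is consistent with the requirement of blocksize=None for the other orient options.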

dask.dataframe.read_sql_table(table, uri, index_col, divisions=None, npartitions=None, limits=None, columns=None, bytes_per_chunk='256 MiB', head_rows=5, schema=None, meta=None, engine_kwargs=None, **kwargs)¶

Create dataframe from an SQL table.

If neither divisions nor npartitions is given, the memory footprint of the first few rows will be determined, and partitions of size ~256MB will be used.

Parameters

table: string or sqlalchemy expression

Select columns from here.

uri: string

Full sqlalchemy URI for the database connection

index_col: string

Column which becomes the index, and defines the partitioning. Should be an indexed column in the SQL server, and any orderable type. If the type is number or time, then partition boundaries can be inferred from npartitions or bytes_per_chunk; otherwise you must supply explicit divisions=. index_col can also be a function that returns a value, e.g., index_col=sql.func.abs(sql.column("value")).label("abs(value)"), or index_col=cast(sql.column("id"), types.BigInteger).label("id") to convert the text field id to BigInteger.

Note that sql, cast, and types come from the sqlalchemy module.

Labeling columns created by functions or arithmetic operations is required.

divisions: sequence

Values of the index column to split the table by. If given, this will override npartitions and bytes_per_chunk. The divisions are the value boundaries of the index column used to define the partitions. For example, divisions=list('acegikmoqsuwz') could be used to partition a string column lexicographically into 12 partitions, with the implicit assumption that each partition contains similar numbers of records.

npartitions: int

Number of partitions, if divisions is not given. Will split the values of the index column linearly between limits, if given, or the column max/min. The index column must be numeric or time for this to work.

limits: 2-tuple or None

Manually give upper and lower range of values for use with npartitions; if None, first fetches max/min from the DB. Upper limit, if given, is inclusive.

columns: list of strings or None

Which columns to select; if None, gets all; can include sqlalchemy functions, e.g., sql.func.abs(sql.column('value')).label('abs(value)'). Labeling columns created by functions or arithmetic operations is recommended.

bytes_per_chunk: str, int

If both divisions and npartitions are None, this is the target size of each partition, in bytes.

head_rows: int

How many rows to load for inferring the data-types, unless passing meta

meta: empty DataFrame or None

If provided, do not attempt to infer dtypes, but use these, coercing all chunks on load

schema: str or None

If using a table name, pass this to sqlalchemy to select which DB schema to use within the URI connection

engine_kwargs: dict or None

Specific db engine parameters for sqlalchemy

kwargs: dict

Additional parameters to pass to pd.read_sql()

Returns

dask.dataframe

Examples

>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                  npartitions=10, index_col='id')  
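
The relationship between npartitions, limits, and divisions can be modelled with a few lines of plain Python. The helper linear_divisions is hypothetical and only approximates the "split linearly between limits" behaviour described above; the boundaries Dask actually computes (for example for datetime columns) may differ.

```python
def linear_divisions(lower, upper, npartitions):
    """Return npartitions + 1 evenly spaced boundaries from lower to upper."""
    step = (upper - lower) / npartitions
    return [lower + i * step for i in range(npartitions)] + [upper]


# Four partitions over an index that ranges from 0 to 1000.
numeric = linear_divisions(0, 1000, npartitions=4)

# Explicit divisions for a string index: 13 boundaries give 12 partitions.
letters = list("acegikmoqsuwz")
n_letter_partitions = len(letters) - 1
```

In both cases the number of partitions is one less than the number of boundary values.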

dask.dataframe.from_array(x, chunksize=50000, columns=None, meta=None)¶

Read any sliceable array into a Dask Dataframe

Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax

x[50000:100000]

and have 2 dimensions:

x.ndim == 2

or have a record dtype:

x.dtype == [('name', 'O'), ('balance', 'i8')]

Parameters

x: array_like
chunksize: int, optional: The number of rows per partition to use.
columns: list or string, optional: list of column names if DataFrame, single string if Series
meta: object, optional: An optional meta parameter can be passed for dask to specify the concrete dataframe type to use for partitions of the Dask dataframe. By default, pandas DataFrame is used.

Returns

dask.DataFrame or dask.Series: A dask DataFrame/Series

dask.dataframe.from_pandas(data, npartitions=None, chunksize=None, sort=True, name=None)¶

Construct a Dask DataFrame from a Pandas DataFrame

This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel.

Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.

Parameters

data: pandas.DataFrame or pandas.Series: The DataFrame/Series with which to construct a Dask DataFrame/Series
npartitions: int, optional: The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.
chunksize: int, optional: The number of rows per index partition to use.
sort: bool: Sort input first to obtain cleanly divided partitions or don’t sort and don’t get cleanly divided partitions
name: string, optional: An optional keyname for the dataframe. Defaults to hashing the input

Returns

dask.DataFrame or dask.Series: A dask DataFrame/Series partitioned along the index

Raises

TypeError: If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array: Construct a dask.DataFrame from an array that has record dtype
read_csv: Construct a dask.DataFrame from a CSV file

Examples

>>> import pandas as pd
>>> from dask.dataframe import from_pandas
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))

dask.dataframe.from_bcolz(x, chunksize=None, categorize=True, index=None, lock=<threading.Lock object>, **kwargs)¶

Read BColz CTable into a Dask Dataframe

BColz is a fast on-disk compressed column store with careful attention given to compression. https://bcolz.readthedocs.io/en/latest/

Parameters

x: bcolz.ctable
chunksize: int, optional: The size (rows) of blocks to pull out from ctable.
categorize: bool, defaults to True: Automatically categorize all string dtypes
index: string, optional: Column to make the index
lock: bool or Lock: Lock to use when reading or False for no lock (not-thread-safe)

See also

from_array: more generic function not optimized for bcolz

dask.dataframe.from_dask_array(x, columns=None, index=None, meta=None)¶

Create a Dask DataFrame from a Dask Array.

Converts a 2d array into a DataFrame and a 1d array into a Series.

Parameters

x: da.Array

columns: list or string

list of column names if DataFrame, single string if Series

index: dask.dataframe.Index, optional

An optional dask Index to use for the output Series or DataFrame.

The default output index depends on whether x has any unknown chunks. If there are any unknown chunks, the output has None for all the divisions (one per chunk). If all the chunks are known, a default index with known divisions is created.

Specifying index can be useful if you’re conforming a Dask Array to an existing dask Series or DataFrame, and you would like the indices to match.

meta: object, optional

An optional meta parameter can be passed for dask to specify the concrete dataframe type to be returned. By default, pandas DataFrame is used.

See also

dask.bag.to_dataframe: from dask.bag
dask.dataframe._Frame.values: Reverse conversion
dask.dataframe._Frame.to_records: Reverse conversion

Examples

>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0

dask.dataframe.from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', verify_meta=True)¶

Create Dask DataFrame from many Dask Delayed objects

Parameters

dfs: list of Delayed: An iterable of dask.delayed.Delayed objects, such as come from dask.delayed. These comprise the individual partitions of the resulting dataframe.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
divisions: tuple, str, optional: Partition boundaries along the index. For the tuple form, see https://docs.dask.org/en/latest/dataframe-design.html#partitions; for the string ‘sorted’, the delayed values are computed to find the index values, assuming that the indexes are mutually sorted. If None, index information is not used.
prefix: str, optional: Prefix to prepend to the keys.
verify_meta: bool, optional: If True check that the partitions have consistent metadata, defaults to True.

dask.dataframe.to_records(df)¶

Create Dask Array from a Dask Dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

See also

dask.dataframe._Frame.values
dask.dataframe.from_dask_array

Examples

>>> df.to_records()  
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,), chunktype=numpy.ndarray>

dask.dataframe.to_csv(df, filename, single_file=False, encoding='utf-8', mode='wt', name_function=None, compression=None, compute=True, scheduler=None, storage_options=None, header_first_partition_only=None, compute_kwargs=None, **kwargs)¶

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, …

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))

>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'

>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  

/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 
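
The requirement that name_function preserve partition order can be checked without Dask. The expand helper below is hypothetical and only mimics the substitution of * described above; it shows that plain integers stop sorting correctly once there are ten or more partitions, while zero-padded names keep their order.

```python
def expand(globstring, name_function, npartitions):
    """Replace ``*`` with name_function(i) for each partition index."""
    return [globstring.replace("*", name_function(i)) for i in range(npartitions)]


def padded(i):
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{i:03d}"


plain_names = expand("export-*.csv", str, 12)
padded_names = expand("export-*.csv", padded, 12)

# 'export-10.csv' sorts before 'export-2.csv', so plain names lose their order.
plain_sorted = sorted(plain_names) == plain_names
padded_sorted = sorted(padded_names) == padded_names
```

This matters when the files are later read back with a globstring, since matched paths are typically processed in sorted order.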

Parameters

df: dask.DataFrame: Data to save
filename: string: Path glob indicating the naming scheme for the output files
single_file: bool, default False: Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file. Note that not all filesystems support the append mode and thus the single file mode, especially on cloud storage systems such as S3 or GCS. A warning will be issued when writing to a file that is not backed by a local filesystem.
encoding: string, optional: A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
mode: str: Python write mode, default ‘wt’
name_function: callable, default None: Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions. Not supported when single_file is True.
compression: string, optional: A string representing the compression to use in the output file. Allowed values are ‘gzip’, ‘bz2’, ‘xz’; only used when the first argument is a filename.
compute: bool: If True, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.
storage_options: dict: Parameters passed on to the backend filesystem class.
header_first_partition_only: boolean, default None: If set to True, only write the header row in the first output file. By default, headers are written to all partitions under the multiple file mode (single_file is False) and written only once under the single file mode (single_file is True). It must not be False under the single file mode.
compute_kwargs: dict, optional: Options to be passed in to the compute method
kwargs: dict, optional: Additional parameters to pass to pd.DataFrame.to_csv()

Returns

The names of the files written, if they were computed right away.
If not, the delayed tasks associated with the writing of the files.

Raises

ValueError: If header_first_partition_only is set to False or name_function is specified when single_file is True.

dask.dataframe.to_bag(df, index=False)¶

Create Dask Bag from a Dask DataFrame

Parameters

index: bool, optional: If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

Examples

>>> bag = df.to_bag()  

dask.dataframe.to_hdf(df, path, key, mode='a', append=False, scheduler=None, name_function=None, compute=True, lock=None, dask_kwargs={}, **kwargs)¶

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters

path: string, pathlib.Path: Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.
key: string: Datapath within the files. May contain a * to denote many locations.
name_function: function: A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)
compute: bool: Whether or not to execute immediately. If False then this returns a dask.Delayed value.
lock: Lock, optional: Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
scheduler: string: The scheduler to use, like “threads” or “processes”
**other: See pandas.to_hdf for more information

Returns

filenames: list: Returned if compute is True. List of file names that each partition is saved to.
delayed: dask.Delayed: Returned if compute is False. Delayed object to execute to_hdf when computed.

See also

read_hdf
to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))

>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 

dask.dataframe.to_parquet(df, path, engine='auto', compression='default', write_index=True, append=False, ignore_divisions=False, partition_on=None, storage_options=None, write_metadata_file=True, compute=True, compute_kwargs=None, **kwargs)¶

Store Dask.dataframe to Parquet files

Parameters

df: dask.dataframe.DataFrame
path: string or pathlib.Path: Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
engine: {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’: Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.
compression: string or dict, optional: Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.
write_index: boolean, optional: Whether or not to write the index. Defaults to True.
append: bool, optional: If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
ignore_divisions: bool, optional: If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
partition_on: list, optional: Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles; there will be no global groupby.
storage_options: dict, optional: Key/value pairs to be passed on to the file-system backend, if any.
write_metadata_file: bool, optional: Whether to write the special “_metadata” file.
compute: bool, optional: If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.
compute_kwargs: dict, optional: Options to be passed in to the compute method
**kwargs: Extra options to be passed on to the specific backend.

See also

read_parquet: Read parquet data to dask.dataframe

Notes

Each partition will be written to a separate file.

Examples

>>> df = dd.read_csv(...)  
>>> dd.to_parquet(df, '/path/to/output/',...)  

dask.dataframe.to_json(df, url_path, orient='records', lines=None, storage_options=None, compute=True, encoding='utf-8', errors='strict', compression=None, compute_kwargs=None, **kwargs)¶

Write dataframe into JSON text files

This utilises pandas.DataFrame.to_json(), and most parameters are passed through - see its docstring.

Differences: orient is ‘records’ by default, with lines=True; this produces the kind of JSON output that is most common in big-data applications, and which can be chunked when reading (see read_json()).

Parameters

df: dask.DataFrame: Data to save
url_path: str, list of str: Location to write to. If a string, and there is more than one partition in df, it should include a glob character to expand into a set of file names, or provide a name_function= parameter. Supports protocol specifications such as "s3://".
encoding, errors:: The text encoding to use, e.g., “utf-8”, and how to respond to errors in the conversion (see str.encode()).
orient, lines, kwargs: passed to pandas; if not specified, lines=True when orient=’records’, False otherwise.
storage_options: dict: Passed to backend file-system implementation
compute: bool: If true, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.
compute_kwargsdict, optional: Options to be passed in to the compute method
compressionstring or None: String like ‘gzip’ or ‘xz’.

dask.dataframe.to_sql(df, name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False)¶

Store Dask Dataframe to a SQL table

An empty table is created based on the “meta” DataFrame (and conforming to the caller’s “if_exists” preference), and then each block calls pd.DataFrame.to_sql (with if_exists=”append”).

Databases supported by SQLAlchemy [1] are supported. Tables can be newly created, appended to, or overwritten.

Parameters

namestr

Name of SQL table.

uristring

Full sqlalchemy URI for the database connection

schemastr, optional

Specify the schema (if database flavor supports this). If None, use default schema.

if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’

How to behave if the table already exists.

fail: Raise a ValueError.
replace: Drop the table before inserting new values.
append: Insert new values to the existing table.

indexbool, default True

Write DataFrame index as a column. Uses index_label as the column name in the table.

index_labelstr or sequence, default None

Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksizeint, optional

Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

dtypedict or scalar, optional

Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

method{None, ‘multi’, callable}, optional

Controls the SQL insertion clause used:

None : Uses standard SQL INSERT clause (one per row).
‘multi’: Pass multiple values in a single INSERT clause.
callable with signature (pd_table, conn, keys, data_iter).

Details and a sample callable implementation can be found in the section insert method.

computebool, default True

When true, call dask.compute and perform the load into SQL; otherwise, return a Dask object (or array of per-block objects when parallel=True).

parallelbool, default False

When true, have each block append itself to the DB table concurrently. This can result in DB rows being in a different order than the source DataFrame’s corresponding rows. When false, load each block into the SQL DB in sequence.

Raises

ValueError: When the table already exists and if_exists is ‘fail’ (the default).

See also

read_sql: Read a DataFrame from a table.

Notes

Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone.

New in version 0.24.0.

References

1: https://docs.sqlalchemy.org
2: https://www.python.org/dev/peps/pep-0249/

Examples

Create a table from scratch with 4 rows.

>>> import pandas as pd
>>> df = pd.DataFrame([ {'i':i, 's':str(i)*2 } for i in range(4) ])
>>> from dask.dataframe import from_pandas
>>> ddf = from_pandas(df, npartitions=2)
>>> ddf  
Dask DataFrame Structure:
                   i       s
npartitions=2
0              int64  object
2                ...     ...
3                ...     ...
Dask Name: from_pandas, 2 tasks

>>> from dask.utils import tmpfile
>>> from sqlalchemy import create_engine
>>> with tmpfile() as f:
...     db = 'sqlite:///%s' % f
...     ddf.to_sql('test', db)
...     engine = create_engine(db, echo=False)
...     result = engine.execute("SELECT * FROM test").fetchall()
>>> result
[(0, 0, '00'), (1, 1, '11'), (2, 2, '22'), (3, 3, '33')]

Rolling¶

dask.dataframe.rolling.map_overlap(func, df, before, after, *args, **kwargs)¶

Apply a function to each partition, sharing rows with adjacent partitions.

Parameters

funcfunction: Function applied to each partition.
dfdd.DataFrame, dd.Series
beforeint or timedelta: The rows to prepend to partition i from the end of partition i - 1.
afterint or timedelta: The rows to append to partition i from the beginning of partition i + 1.
args, kwargs :: Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

See also

dd.DataFrame.map_overlap

Resampling¶

class dask.dataframe.tseries.resample.Resampler(obj, rule, **kwargs)¶

Class for resampling timeseries data.

This class is commonly encountered when using obj.resample(...), which returns Resampler objects.

Parameters

objDask DataFrame or Series: Data to be resampled.
rulestr, tuple, datetime.timedelta, DateOffset or None: The offset string or object representing the target conversion.
kwargsoptional: Keyword arguments passed to underlying pandas resampling function.

Returns

Resampler instance of the appropriate type

agg(self, agg_funcs, *args, **kwargs)¶

Aggregate using one or more operations over the specified axis.

This docstring was copied from pandas.core.resample.Resampler.agg.

Some inconsistencies with the Dask version may exist.

Parameters

funcfunction, str, list or dict (Not supported in Dask)

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns

scalar, Series or DataFrame

The return can be:

scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

See also

DataFrame.groupby.aggregate
DataFrame.resample.transform
DataFrame.aggregate

Notes

agg is an alias for aggregate. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1,2,3,4,5],
...               index=pd.date_range('20130101', periods=5,freq='s'))
>>> s
2013-01-01 00:00:00    1
2013-01-01 00:00:01    2
2013-01-01 00:00:02    3
2013-01-01 00:00:03    4
2013-01-01 00:00:04    5
Freq: S, dtype: int64

>>> r = s.resample('2s')
>>> r
DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left,
                        label=left, convention=start, base=0]

>>> r.agg(np.sum)  
2013-01-01 00:00:00    3
2013-01-01 00:00:02    7
2013-01-01 00:00:04    5
Freq: 2S, dtype: int64

>>> r.agg(['sum','mean','max'])  
                     sum  mean  max
2013-01-01 00:00:00    3   1.5    2
2013-01-01 00:00:02    7   3.5    4
2013-01-01 00:00:04    5   5.0    5

>>> r.agg({'result' : lambda x: x.mean() / x.std(),
...        'total' : np.sum})
                     total    result
2013-01-01 00:00:00      3  2.121320
2013-01-01 00:00:02      7  4.949747
2013-01-01 00:00:04      5       NaN

count(self)¶

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.count.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Count of values within each group.

See also

Series.groupby
DataFrame.groupby

first(self)¶

Compute first of group values.

This docstring was copied from pandas.core.resample.Resampler.first.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed first of values within each group.

last(self)¶

Compute last of group values.

This docstring was copied from pandas.core.resample.Resampler.last.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed last of values within each group.

max(self)¶

Compute max of group values.

This docstring was copied from pandas.core.resample.Resampler.max.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed max of values within each group.

mean(self)¶

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.mean.

Some inconsistencies with the Dask version may exist.

Returns

pandas.Series or pandas.DataFrame

See also

Series.groupby
DataFrame.groupby

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  
       C
A B
1 2.0  2
  4.0  1
2 3.0  1
  5.0  2

Groupby one column and return the mean of only particular column in the group.

>>> df.groupby('A')['B'].mean()  
A
1    3.0
2    4.0
Name: B, dtype: float64

median(self)¶

Compute median of groups, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.median.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex

Returns

Series or DataFrame: Median of values within each group.

See also

Series.groupby
DataFrame.groupby

min(self)¶

Compute min of group values.

This docstring was copied from pandas.core.resample.Resampler.min.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed min of values within each group.

nunique(self)¶

Return number of unique elements in the group.

This docstring was copied from pandas.core.resample.Resampler.nunique.

Some inconsistencies with the Dask version may exist.

Returns

Series: Number of unique values within each group.

ohlc(self)¶

Compute open, high, low and close values of a group, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.ohlc.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex

Returns

DataFrame: Open, high, low and close values within each group.

See also

Series.groupby
DataFrame.groupby

prod(self)¶

Compute prod of group values.

This docstring was copied from pandas.core.resample.Resampler.prod.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed prod of values within each group.

quantile(self)¶

Return value at the given quantile.

This docstring was copied from pandas.core.resample.Resampler.quantile.

Some inconsistencies with the Dask version may exist.

New in version 0.24.0.

Parameters

qfloat or array-like, default 0.5 (50% quantile) (Not supported in Dask)

Returns

DataFrame or Series: Quantile of values within each group.

See also

Series.quantile
DataFrame.quantile
DataFrameGroupBy.quantile

sem(self)¶

Compute standard error of the mean of groups, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.sem.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters

ddofint, default 1: Degrees of freedom.

Returns

Series or DataFrame: Standard error of the mean of values within each group.

See also

Series.groupby
DataFrame.groupby

size(self)¶

Compute group sizes.

This docstring was copied from pandas.core.resample.Resampler.size.

Some inconsistencies with the Dask version may exist.

Returns

Series: Number of rows in each group.

See also

Series.groupby
DataFrame.groupby

std(self)¶

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.std.

Some inconsistencies with the Dask version may exist.

Parameters

ddofint, default 1 (Not supported in Dask): Degrees of freedom.

Returns

DataFrame or Series: Standard deviation of values within each group.

sum(self)¶

Compute sum of group values.

This docstring was copied from pandas.core.resample.Resampler.sum.

Some inconsistencies with the Dask version may exist.

Returns

Series or DataFrame: Computed sum of values within each group.

var(self)¶

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.resample.Resampler.var.

Some inconsistencies with the Dask version may exist.

Parameters

ddofint, default 1 (Not supported in Dask): Degrees of freedom.

Returns

DataFrame or Series: Variance of values within each group.

Dask Metadata¶

dask.dataframe.utils.make_meta(arg, *args, **kwargs)¶

Create an empty pandas object containing the desired metadata.

Parameters

xdict, tuple, list, pd.Series, pd.DataFrame, pd.Index, dtype, scalar: To create a DataFrame, provide a dict mapping of {name: dtype}, or an iterable of (name, dtype) tuples. To create a Series, provide a tuple of (name, dtype). If a pandas object, names, dtypes, and index should match the desired output. If a dtype or scalar, a scalar of the same dtype is returned.
indexpd.Index, optional: Any pandas index to use in the metadata. If none provided, a RangeIndex will be used.

Examples

>>> make_meta([('a', 'i8'), ('b', 'O')])
Empty DataFrame
Columns: [a, b]
Index: []
>>> make_meta(('a', 'f8'))
Series([], Name: a, dtype: float64)
>>> make_meta('i8')
1

Other functions¶

dask.dataframe.compute(*args, **kwargs)¶

Compute several dask collections at once.

Parameters

argsobject: Any number of objects. If it is a dask object, it’s computed and the result is returned. By default, python builtin collections are also traversed to look for dask objects (for more information see the traverse keyword). Non-dask arguments are passed through unchanged.
traversebool, optional: By default dask traverses builtin python collections looking for dask objects passed to compute. For large collections this can be expensive. If none of the arguments contain any dask objects, set traverse=False to avoid doing this traversal.
schedulerstring, optional: Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graphbool, optional: If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs: Extra keywords to forward to the scheduler function.

Examples

>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)

By default, dask objects inside python collections will also be computed:

>>> compute({'a': a, 'b': b, 'c': 1})  
({'a': 45, 'b': 4.5, 'c': 1},)

dask.dataframe.map_partitions(func, *args, meta='__no_default__', enforce_metadata=True, transform_divisions=True, **kwargs)¶

Apply Python function on each DataFrame partition.

Parameters

funcfunction: Function applied to each partition.
args, kwargs :: Arguments and keywords to pass to the function. At least one of the args should be a Dask.dataframe. Arguments and keywords may contain Scalar, Delayed or regular python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function.
enforce_metadatabool: Whether or not to enforce the structure of the metadata at runtime. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work or types don’t match.
metapd.DataFrame, pd.Series, dict, iterable, tuple, optional: An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

dask.dataframe.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)¶

Convert argument to datetime.

Parameters

argint, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like

The object to convert to a datetime.

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaT.
If ‘ignore’, then invalid parsing will return the input.

dayfirstbool, default False

Specify a date parse order if arg is str or its list-likes. If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10. Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior).

yearfirstbool, default False

Specify a date parse order if arg is str or its list-likes.

If True parses dates with the year first, eg 10/11/12 is parsed as 2010-11-12.
If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).

Warning: yearfirst=True is not strict, but will prefer to parse with year first (this is a known bug, based on dateutil behavior).

utcbool, default None

Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).

formatstr, default None

The strftime format used to parse time, eg “%d/%m/%Y”. Note that “%f” will parse all the way up to nanoseconds. See the strftime documentation for more information on choices: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.

exactbool, True by default

Behaves as:

If True, require an exact format match.
If False, allow the format to match anywhere in the target string.

unitstr, default ‘ns’

The unit of arg (D, s, ms, us, ns) when arg is an integer or float number. The value is interpreted relative to origin. For example, with unit=’ms’ and origin=’unix’ (the default), arg is treated as the number of milliseconds since the start of the unix epoch.

infer_datetime_formatbool, default False

If True and no format is given, attempt to infer the format of the datetime strings, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

originscalar, default ‘unix’

Define the reference date. The numeric values would be parsed as number of units (defined by unit) since this reference date.

If ‘unix’ (or POSIX) time; origin is set to 1970-01-01.
If ‘julian’, unit must be ‘D’, and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC.
If Timestamp convertible, origin is set to Timestamp identified by origin.

cachebool, default True

If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. The cache is only used when there are at least 50 values. The presence of out-of-bounds values will render the cache unusable and may slow down parsing.

New in version 0.23.0.

Changed in version 0.25.0: changed default value from False to True.

Returns

datetime

If parsing succeeded. Return type depends on input:

list-like: DatetimeIndex
Series: Series of datetime64 dtype
scalar: Timestamp

In case when it is not possible to return designated types (e.g. when any element of input is before Timestamp.min or after Timestamp.max) return will have datetime.datetime type (or corresponding array/Series).

See also

DataFrame.astype: Cast argument to a specified dtype.
to_timedelta: Convert argument to timedelta.
convert_dtypes: Convert dtypes.

Examples

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’] or plurals of the same.

>>> df = pd.DataFrame({'year': [2015, 2016],
...                    'month': [2, 3],
...                    'day': [4, 5]})
>>> pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

If a date does not meet the timestamp limitations, passing errors=’ignore’ will return the original input instead of raising any exception.

Passing errors=’coerce’ will force an out-of-bounds date to NaT, in addition to forcing non-dates (or non-parseable dates) to NaT.

>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT

Passing infer_datetime_format=True can often speed up parsing if the input is not exactly in ISO8601 format but is in a regular format.

>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
>>> s.head()
0    3/11/2000
1    3/12/2000
2    3/13/2000
3    3/11/2000
4    3/12/2000
dtype: object

>>> %timeit pd.to_datetime(s, infer_datetime_format=True)  
100 loops, best of 3: 10.4 ms per loop

>>> %timeit pd.to_datetime(s, infer_datetime_format=False)  
1 loop, best of 3: 471 ms per loop

Using a unix epoch time

>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')

Warning

For float arg, precision rounding might happen. To prevent unexpected behavior use a fixed-width exact type.

Using a non-unix epoch origin

>>> pd.to_datetime([1, 2, 3], unit='D',
...                origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

dask.dataframe.to_numeric(arg, errors='raise', meta=None)¶

Convert argument to a numeric type.

This docstring was copied from pandas.to_numeric.

Some inconsistencies with the Dask version may exist.

Return type depends on input. Delayed if scalar, otherwise same as input. For errors, only “raise” and “coerce” are allowed.

The default return dtype is float64 or int64 depending on the data supplied. Use the downcast parameter to obtain other dtypes.

Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of ndarray, if numbers smaller than -9223372036854775808 (np.iinfo(np.int64).min) or larger than 18446744073709551615 (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an ndarray. These warnings apply similarly to Series since it internally leverages ndarray.

Parameters

argscalar, list, tuple, 1-d array, or Series

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

If ‘raise’, then invalid parsing will raise an exception.
If ‘coerce’, then invalid parsing will be set as NaN.
If ‘ignore’, then invalid parsing will return the input.

downcast{‘integer’, ‘signed’, ‘unsigned’, ‘float’}, default None (Not supported in Dask)

If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:

‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)
‘unsigned’: smallest unsigned int dtype (min.: np.uint8)
‘float’: smallest float dtype (min.: np.float32)

As this behaviour is separate from the core conversion to numeric values, any errors raised during the downcasting will be surfaced regardless of the value of the ‘errors’ input.

In addition, downcasting will only occur if the size of the resulting data’s dtype is strictly larger than the dtype it is to be cast to, so if none of the dtypes checked satisfy that specification, no downcasting will be performed on the data.

Returns

retnumeric if parsing succeeded.: Return type depends on input. Series if Series, otherwise ndarray.

See also

DataFrame.astype: Cast argument to a specified dtype.
to_datetime: Convert argument to datetime.
to_timedelta: Convert argument to timedelta.
numpy.ndarray.astype: Cast a numpy array to a specified type.
convert_dtypes: Convert dtypes.

Examples

Take separate series and convert to numeric, coercing when told to

>>> s = pd.Series(['1.0', '2', -3])  
>>> pd.to_numeric(s)  
0    1.0
1    2.0
2   -3.0
dtype: float64
>>> pd.to_numeric(s, downcast='float')  
0    1.0
1    2.0
2   -3.0
dtype: float32
>>> pd.to_numeric(s, downcast='signed')  
0    1
1    2
2   -3
dtype: int8
>>> s = pd.Series(['apple', '1.0', '2', -3])  
>>> pd.to_numeric(s, errors='ignore')  
0    apple
1      1.0
2        2
3       -3
dtype: object
>>> pd.to_numeric(s, errors='coerce')  
0    NaN
1    1.0
2    2.0
3   -3.0
dtype: float64

dask.dataframe.multi.concat(dfs, axis=0, join='outer', interleave_partitions=False, ignore_unknown_divisions=False, **kwargs)¶

Concatenate DataFrames along rows.

When axis=0 (default), concatenate DataFrames row-wise:
- If all divisions are known and ordered, concatenate DataFrames keeping divisions. When divisions are not ordered, specifying interleave_partitions=True allows the DataFrames to be concatenated division by division.
- If any division is unknown, concatenate DataFrames, resetting the divisions to unknown (None).
When axis=1, concatenate DataFrames column-wise:
- Allowed if all divisions are known.
- If any division is unknown, a ValueError is raised.

Parameters

dfslist: List of dask.DataFrames to be concatenated
axis{0, 1, ‘index’, ‘columns’}, default 0: The axis to concatenate along
join{‘inner’, ‘outer’}, default ‘outer’: How to handle indexes on other axis
interleave_partitionsbool, default False: Whether to concatenate DataFrames ignoring the order of their divisions. If True, the DataFrames are concatenated division by division.
ignore_unknown_divisionsbool, default False: By default a warning is raised if any input has unknown divisions. Set to True to disable this warning.

Notes

This differs from pd.concat when concatenating Categoricals with different categories. Pandas currently coerces those to objects before concatenating. Coercing to objects is very expensive for large arrays, so dask preserves the Categoricals by taking the union of the categories.

Examples

If all divisions are known and ordered, divisions are kept.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>

Unable to concatenate if divisions are not ordered.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])                               
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order

Specify interleave_partitions=True to ignore the division order.

>>> dd.concat([a, b], interleave_partitions=True)   
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>

If any division is unknown, the resulting divisions will be unknown.

>>> a                                               
dd.DataFrame<x, divisions=(None, None)>
>>> b                                               
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(None, None, None, None)>

By default concatenating with unknown divisions will raise a warning. Set ignore_unknown_divisions=True to disable this:

>>> dd.concat([a, b], ignore_unknown_divisions=True)
dd.DataFrame<concat-..., divisions=(None, None, None, None)>

Different categoricals are unioned

>>> dd.concat([
...     dd.from_pandas(pd.Series(['a', 'b'], dtype='category'), 1),
...     dd.from_pandas(pd.Series(['a', 'c'], dtype='category'), 1),
... ], interleave_partitions=True).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

dask.dataframe.multi.merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None) → 'DataFrame'¶

Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters

leftDataFrame

rightDataFrame or named Series

Object to merge with.

how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

Type of merge to be performed.

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

onlabel or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_onlabel or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_onlabel or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_indexbool, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_indexbool, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index.

sortbool, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

suffixestuple of (str, str), default (‘_x’, ‘_y’)

Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).

copybool, default True

If False, avoid copy if possible.

indicatorbool or str, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

validate : str, optional

If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 0.21.0.

Returns

DataFrame: A DataFrame of the two merged objects.

See also

merge_ordered: Merge with optional filling/interpolation.
merge_asof: Merge on nearest keys.
DataFrame.join: Similar method using indices.

Notes

Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.

Examples

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1
    lkey value
0   foo      1
1   bar      2
2   baz      3
3   foo      5
>>> df2
    rkey value
0   foo      5
1   bar      6
2   baz      7
3   foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
    Index(['value'], dtype='object')
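
The indicator and validate parameters described above are not covered by the examples. The following is a minimal sketch using plain pandas, assuming a pandas version that provides both parameters and pd.errors.MergeError; the frames and the names key, lval, and rval are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: the right frame has a duplicated key ("c").
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "c"], "rval": [10, 20, 30]})

# indicator=True adds a categorical "_merge" column naming each row's source.
merged = left.merge(right, on="key", how="outer", indicator=True)
counts = merged["_merge"].astype(str).value_counts().to_dict()

# validate="one_to_one" requires unique keys on both sides, so the
# duplicated "c" in the right frame should raise a MergeError.
try:
    left.merge(right, on="key", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

# validate="one_to_many" only requires unique keys on the left side.
checked = left.merge(right, on="key", validate="one_to_many")
```

Here counts summarizes how many rows came from each source, and one_to_one_ok records whether the uniqueness check passed.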

dask.dataframe.multi.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches: bool = True, direction: str = 'backward') → 'DataFrame'¶

Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

The default is “backward”, which matches the behavior of versions below 0.20.0. The direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”.

Optionally match on equivalent keys with ‘by’ before searching with ‘on’.
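
The three search directions can be illustrated with a small pure-Python sketch built on the standard-library bisect module. This is not how pandas or Dask implement the merge; the helper asof_lookup is hypothetical, and the tie-breaking rule for “nearest” (prefer the backward match when both are equally close) is an assumption of this sketch.

```python
from bisect import bisect_left, bisect_right


def asof_lookup(right_keys, key, direction="backward", allow_exact_matches=True):
    """Return the position in sorted right_keys matched to key, or None."""

    def backward():
        # Last position whose key is <= key (or < key without exact matches).
        find = bisect_right if allow_exact_matches else bisect_left
        pos = find(right_keys, key) - 1
        return pos if pos >= 0 else None

    def forward():
        # First position whose key is >= key (or > key without exact matches).
        find = bisect_left if allow_exact_matches else bisect_right
        pos = find(right_keys, key)
        return pos if pos < len(right_keys) else None

    if direction == "backward":
        return backward()
    if direction == "forward":
        return forward()
    if direction == "nearest":
        b, f = backward(), forward()
        if b is None or f is None:
            return b if f is None else f
        # Assumption: ties go to the backward match.
        return b if key - right_keys[b] <= right_keys[f] - key else f
    raise ValueError(f"unknown direction: {direction!r}")


def asof_values(left_keys, right_keys, **kwargs):
    """Map each left key to the matched right key (None when unmatched)."""
    positions = [asof_lookup(right_keys, k, **kwargs) for k in left_keys]
    return [None if p is None else right_keys[p] for p in positions]


left_keys = [1, 5, 10]
right_keys = [1, 2, 3, 6, 7]
backward_result = asof_values(left_keys, right_keys)
forward_result = asof_values(left_keys, right_keys, direction="forward")
nearest_result = asof_values(left_keys, right_keys, direction="nearest")
strict_result = asof_values(left_keys, right_keys, allow_exact_matches=False)
```

The keys used here are the same as in the Examples section further down, so the matched values can be compared against the right_val columns shown there.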

Parameters

left : DataFrame

right : DataFrame

on : label

Field name to join on. Must be found in both DataFrames. The data MUST be ordered. Furthermore, this must be a numeric column, such as datetimelike, integer, or float. Either on or left_on/right_on must be given.

left_on : label

Field name to join on in left DataFrame.

right_on : label

Field name to join on in right DataFrame.

left_index : bool

Use the index of the left DataFrame as the join key.

right_index : bool

Use the index of the right DataFrame as the join key.

by : column name or list of column names

Match on these columns before performing merge operation.

left_by : column name

Field names to match on in the left DataFrame.

right_by : column name

Field names to match on in the right DataFrame.

suffixes : 2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively.

tolerance : int or Timedelta, optional, default None

Select asof tolerance within this range; must be compatible with the merge index.

allow_exact_matches : bool, default True

If True, allow matching with the same ‘on’ value (i.e., less-than-or-equal-to / greater-than-or-equal-to).
If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than).

direction : ‘backward’ (default), ‘forward’, or ‘nearest’

Whether to search for prior, subsequent, or closest matches.

Returns

merged : DataFrame

See also

merge
merge_ordered

Examples

>>> left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
>>> left
    a left_val
0   1        a
1   5        b
2  10        c

>>> right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
...                       'right_val': [1, 2, 3, 6, 7]})
>>> right
   a  right_val
0  1          1
1  2          2
2  3          3
3  6          6
4  7          7

>>> pd.merge_asof(left, right, on='a')
    a left_val  right_val
0   1        a          1
1   5        b          3
2  10        c          7

>>> pd.merge_asof(left, right, on='a', allow_exact_matches=False)
    a left_val  right_val
0   1        a        NaN
1   5        b        3.0
2  10        c        7.0

>>> pd.merge_asof(left, right, on='a', direction='forward')
    a left_val  right_val
0   1        a        1.0
1   5        b        6.0
2  10        c        NaN

>>> pd.merge_asof(left, right, on='a', direction='nearest')
    a left_val  right_val
0   1        a          1
1   5        b          6
2  10        c          7

We can use indexed DataFrames as well.

>>> left = pd.DataFrame({'left_val': ['a', 'b', 'c']}, index=[1, 5, 10])
>>> left
   left_val
1         a
5         b
10        c

>>> right = pd.DataFrame({'right_val': [1, 2, 3, 6, 7]},
...                      index=[1, 2, 3, 6, 7])
>>> right
   right_val
1          1
2          2
3          3
6          6
7          7

>>> pd.merge_asof(left, right, left_index=True, right_index=True)
   left_val  right_val
1         a          1
5         b          3
10        c          7

Here is a real-world time-series example:

>>> quotes
                     time ticker     bid     ask
2016-05-25 13:30:00.023   GOOG  720.50  720.93
2016-05-25 13:30:00.023   MSFT   51.95   51.96
2016-05-25 13:30:00.030   MSFT   51.97   51.98
2016-05-25 13:30:00.041   MSFT   51.99   52.00
2016-05-25 13:30:00.048   GOOG  720.50  720.93
2016-05-25 13:30:00.049   AAPL   97.99   98.01
2016-05-25 13:30:00.072   GOOG  720.50  720.88
2016-05-25 13:30:00.075   MSFT   52.01   52.03

>>> trades
                     time ticker   price  quantity
2016-05-25 13:30:00.023   MSFT   51.95        75
2016-05-25 13:30:00.038   MSFT   51.95       155
2016-05-25 13:30:00.048   GOOG  720.77       100
2016-05-25 13:30:00.048   GOOG  720.92       100
2016-05-25 13:30:00.048   AAPL   98.00       100

By default, each trade is matched with the most recent quote at or before its time:

>>> pd.merge_asof(trades, quotes,
...                       on='time',
...                       by='ticker')
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

Restrict matches to quotes within 2ms of the trade time:

>>> pd.merge_asof(trades, quotes,
...                       on='time',
...                       by='ticker',
...                       tolerance=pd.Timedelta('2ms'))
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

Restrict matches to quotes within 10ms of the trade time and exclude exact matches on time. Note that earlier quotes still propagate forward:

>>> pd.merge_asof(trades, quotes,
...                       on='time',
...                       by='ticker',
...                       tolerance=pd.Timedelta('10ms'),
...                       allow_exact_matches=False)
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100     NaN     NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

dask.dataframe.reshape.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=<class 'numpy.uint8'>, **kwargs)¶

Convert categorical variable into dummy/indicator variables.

Data must have category dtype to infer result’s columns.

Parameters

data (Series or DataFrame): For Series, the dtype must be categorical. For DataFrame, at least one column must be categorical.
prefix (string, list of strings, or dict of strings, default None): String to prepend to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep (string, default ‘_’): Separator/delimiter to use when a prefix is applied. Or pass a list or dictionary as with prefix.
dummy_na (bool, default False): Add a column to indicate NaNs; if False, NaNs are ignored.
columns (list-like, default None): Column names in the DataFrame to be encoded. If columns is None, then all the columns with category dtype will be converted.
sparse (bool, default False): Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.

New in version 0.18.2.
drop_first (bool, default False): Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype (dtype, default np.uint8): Data type for new columns. Only a single dtype is allowed. Only valid if pandas is 0.23.0 or newer.

New in version 0.18.2.

Returns

dummies (DataFrame)

See also

pandas.get_dummies

Examples

Dask’s version only works with Categorical data, as this is the only way to know the output shape without computing all the data.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(list('abca')), npartitions=2)
>>> dd.get_dummies(s)
Traceback (most recent call last):
    ...
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported...

With categorical data:

>>> s = dd.from_pandas(pd.Series(list('abca'), dtype='category'), npartitions=2)
>>> dd.get_dummies(s)  
Dask DataFrame Structure:
                   a      b      c
npartitions=2
0              uint8  uint8  uint8
2                ...    ...    ...
3                ...    ...    ...
Dask Name: get_dummies, 4 tasks
>>> dd.get_dummies(s).compute()  
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

dask.dataframe.reshape.pivot_table(df, index=None, columns=None, values=None, aggfunc='mean')¶

Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, and aggfunc must all be scalar. values can be scalar or list-like.

Parameters

df (DataFrame)
index (scalar): column to be index
columns (scalar): column to be columns
values (scalar or list(scalar)): column(s) to aggregate
aggfunc ({‘mean’, ‘sum’, ‘count’}, default ‘mean’)

Returns

table (DataFrame)

See also

pandas.DataFrame.pivot_table

dask.dataframe.reshape.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)¶

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters

frame (DataFrame)
id_vars (tuple, list, or ndarray, optional): Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional): Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (scalar): Name to use for the ‘variable’ column. If None, uses frame.columns.name or ‘variable’.
value_name (scalar, default ‘value’): Name to use for the ‘value’ column.
col_level (int or string, optional): If columns are a MultiIndex, then use this level to melt.

Returns

DataFrame: Unpivoted DataFrame.

See also

pandas.DataFrame.melt
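
Examples

The wide-to-long reshaping can be sketched with plain pandas, whose melt this function mirrors. The frame and the names id, height, weight, measure, and reading are made up for illustration. The Dask version is expected to produce the same rows, though possibly in a different order across partitions.

```python
import pandas as pd

# Hypothetical wide frame: one identifier column, two measured columns.
wide = pd.DataFrame({
    "id": ["a", "b"],
    "height": [1.0, 2.0],
    "weight": [3.0, 4.0],
})

# Each (id, measured column) pair becomes one row in the long frame.
long = pd.melt(wide, id_vars=["id"], value_vars=["height", "weight"],
               var_name="measure", value_name="reading")
```

The result has one row per identifier and measured column, with the original column name stored in measure and its value in reading.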