Indexing into Dask DataFrames¶
Dask DataFrame supports some of Pandas’ indexing behavior.
Purely integer-location based indexing for selection by position. |
|
Purely label-location based indexer for selection by label. |
Label-based Indexing¶
Just like Pandas, Dask DataFrame supports label-based indexing with the .loc
accessor for selecting rows or columns, and __getitem__
(square brackets)
for selecting just columns.
Note
To select rows, the DataFrame’s divisions must be known (see Internal Design and Best Practices for more information.)
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
... index=['a', 'b', 'c'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf
Dask DataFrame Structure:
A B
npartitions=1
a int64 int64
c ... ...
Dask Name: from_pandas, 1 tasks
Selecting columns:
>>> ddf[['B', 'A']]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: getitem, 2 tasks
Selecting a single column reduces to a Dask Series:
>>> ddf['A']
Dask Series Structure:
npartitions=1
a int64
c ...
Name: A, dtype: int64
Dask Name: getitem, 2 tasks
Slicing rows and (optionally) columns with .loc
:
>>> ddf.loc[['b', 'c'], ['A']]
Dask DataFrame Structure:
A
npartitions=1
b int64
c ...
Dask Name: loc, 2 tasks
>>> ddf.loc[df["A"] > 1, ["B"]]
Dask DataFrame Structure:
B
npartitions=1
a int64
c ...
Dask Name: try_loc, 2 tasks
>>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
Dask DataFrame Structure:
B
npartitions=1
a int64
c ...
Dask Name: try_loc, 2 tasks
Dask DataFrame supports Pandas’ partial-string indexing:
>>> ts = dd.demo.make_timeseries()
>>> ts
Dask DataFrame Structure:
id name x y
npartitions=11
2000-01-31 int64 object float64 float64
2000-02-29 ... ... ... ...
... ... ... ... ...
2000-11-30 ... ... ... ...
2000-12-31 ... ... ... ...
Dask Name: make-timeseries, 11 tasks
>>> ts.loc['2000-02-12']
Dask DataFrame Structure:
id name x y
npartitions=1
2000-02-12 00:00:00.000000000 int64 object float64 float64
2000-02-12 23:59:59.999999999 ... ... ... ...
Dask Name: loc, 12 tasks
Positional Indexing¶
Dask DataFrame does not track the length of partitions, making positional
indexing with .iloc
inefficient for selecting rows. DataFrame.iloc()
only supports indexers where the row indexer is slice(None)
(which :
is
a shorthand for.)
>>> ddf.iloc[:, [1, 0]]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: iloc, 2 tasks
Trying to select specific rows with iloc
will raise an exception:
>>> ddf.iloc[[0, 2], [1]]
Traceback (most recent call last)
File "<stdin>", line 1, in <module>
ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.