Serialization¶
When we communicate data between computers we first convert that data into a sequence of bytes that can be communicated across the network. Choices made in serialization can affect performance and security.
The standard Python solution to this, Pickle, is often but not always the right solution. Dask uses a number of different serialization schemes in different situations. These are extensible to allow users to control in sensitive situations and also to enable library developers to plug in more performant serialization solutions.
This document first describes Dask’s default solution for serialization and then discusses ways to control and extend that serialiation.
Defaults¶
There are three kinds of messages passed through the Dask network:
Small administrative messages like “Worker A has finished task X” or “I’m running out of memory”. These are always serialized with msgpack.
Movement of program data, such as Numpy arrays and Pandas dataframes. This uses a combination of pickle and custom serializers and is the topic of the next section
Computational tasks like
f(x)
that are defined and serialized on client processes and deserialized and run on worker processes. These are serialized using a fixed scheme decided on by those libraries. Today this is a combination of pickle and cloudpickle.
Serialization families¶
Use¶
For the movement of program data (item 2 above) we can use a few different families of serializers. By default the following families are built in:
Pickle and cloudpickle
Msgpack
Custom per-type serializers that come with Dask for the special serialization of important classes of data like Numpy arrays
You can choose which families you want to use to serialize data and to deserialize data when you create a Client
from dask.distributed import Client
client = Client('tcp://scheduler-address:8786',
serializers=['dask', 'pickle'],
deserializers=['dask', 'msgpack'])
This can be useful if, for example, you are sensitive about receiving Pickle-serialized data for security reasons.
Dask uses the serializers ['dask', 'pickle']
by default, trying to use dask
custom serializers (described below) if they work and then falling back to
pickle/cloudpickle.
Extend¶
These families can be extended by creating two functions, dumps and loads,
which return and consume a msgpack-encodable header, and a list of byte-like
objects. These must then be included in the distributed.protocol.serialize
dictionary with an appropriate name. Here is the definition of
pickle_dumps
and pickle_loads
to serve as an example.
import pickle
def pickle_dumps(x):
header = {'serializer': 'pickle'}
frames = [pickle.dumps(x)]
return header, frames
def pickle_loads(header, frames):
if len(frames) > 1: # this may be cut up for network reasons
frame = ''.join(frames)
else:
frame = frames[0]
return pickle.loads(frame)
from distributed.protocol.serialize import register_serialization_family
register_serialization_family('pickle', pickle_dumps, pickle_loads)
After this the name 'pickle'
can be used in the serializers=
and
deserializers=
keywords in Client
and other parts of Dask.
Communication Context¶
Note
This is an experimental feature and may change without notice
Dask Comms can provide additional context to
serialization family functions if they provide a context=
keyword.
This allows serialization to behave differently according to how it is being
used.
def my_dumps(x, context=None):
if context and 'recipient' in context:
# check if we're sending to the same host or not
The context depends on the kind of communication. For example when sending over TCP, the address of the sender (us) and the recipient are available in a dictionary.
>>> context
{'sender': 'tcp://127.0.0.1:1234', 'recipient': 'tcp://127.0.0.1:5678'}
Other comms may provide other information.
Dask Serialization Family¶
Use¶
Dask maintains its own custom serialization family that special cases a few important types, like Numpy arrays. These serializers either operate more efficiently than Pickle, or serialize types that Pickle can not handle.
You don’t need to do anything special to use this family of serializers. It is on by default (along with pickle). Note that Dask custom serializers may use pickle internally in some cases. It should not be considered more secure.
Extend¶
|
Single Dispatch for dask_serialize |
|
Single Dispatch for dask_deserialize |
As with serialization families in general, the Dask family in particular is also extensible. This is a good way to support custom serialization of a single type of object. The method is similar, you create serialize and deserialize function that create and consume a header and frames, and then register them with Dask.
class Human:
def __init__(self, name):
self.name = name
from distributed.protocol import dask_serialize, dask_deserialize
@dask_serialize.register(Human)
def serialize(human: Human) -> Tuple[Dict, List[bytes]]:
header = {}
frames = [human.name.encode()]
return header, frames
@dask_deserialize.register(Human)
def deserialize(header: Dict, frames: List[bytes]) -> Human:
return Human(frames[0].decode())
Traverse attributes¶
|
Register dask_(de)serialize to traverse through __dict__ |
A common case is that your object just wraps Numpy arrays or other objects that
Dask already serializes well. For example, Scikit-Learn estimators mostly
surround Numpy arrays with a bit of extra metadata. In these cases you can
register your class for custom Dask serialization with the
register_generic
function.
API¶
|
Convert object to a header and list of bytestrings |
|
Convert serialized header and list of bytestrings back to a Python object |
|
Single Dispatch for dask_serialize |
|
Single Dispatch for dask_deserialize |
|
Register dask_(de)serialize to traverse through __dict__ |
-
distributed.protocol.serialize.
serialize
(x, serializers=None, on_error='message', context=None)[source]¶ Convert object to a header and list of bytestrings
This takes in an arbitrary Python object and returns a msgpack serializable header and a list of bytes or memoryview objects.
The serialization protocols to use are configurable: a list of names define the set of serializers to use, in order. These names are keys in the
serializer_registry
dict (e.g., ‘pickle’, ‘msgpack’), which maps to the de/serialize functions. The name ‘dask’ is special, and will use the per-class serialization methods.None
gives the default list['dask', 'pickle']
.- Returns
- header: dictionary containing any msgpack-serializable metadata
- frames: list of bytes or memoryviews, commonly of length one
See also
deserialize
Convert header and frames back to object
to_serialize
Mark that data in a message should be serialized
register_serialization
Register custom serialization functions
Examples
>>> serialize(1) ({}, [b'\x80\x04\x95\x03\x00\x00\x00\x00\x00\x00\x00K\x01.'])
>>> serialize(b'123') # some special types get custom treatment ({'type': 'builtins.bytes'}, [b'123'])
>>> deserialize(*serialize(1)) 1
-
distributed.protocol.serialize.
deserialize
(header, frames, deserializers=None)[source]¶ Convert serialized header and list of bytestrings back to a Python object
- Parameters
- header: dict
- frames: list of bytes
- deserializersOptional[Dict[str, Tuple[Callable, Callable, bool]]]
An optional dict mapping a name to a (de)serializer. See dask_serialize and dask_deserialize for more.
See also
-
distributed.protocol.serialize.
dask_serialize
(arg, *args, **kwargs)¶ Single Dispatch for dask_serialize
-
distributed.protocol.serialize.
dask_deserialize
(arg, *args, **kwargs)¶ Single Dispatch for dask_deserialize
-
distributed.protocol.serialize.
register_generic
(cls)[source]¶ Register dask_(de)serialize to traverse through __dict__
Normally when registering new classes for Dask’s custom serialization you need to manage headers and frames, which can be tedious. If all you want to do is traverse through your object and apply serialize to all of your object’s attributes then this function may provide an easier path.
This registers a class for the custom Dask serialization family. It serializes it by traversing through its __dict__ of attributes and applying
serialize
anddeserialize
recursively. It collects a set of frames and keeps small attributes in the header. Deserialization reverses this process.This is a good idea if the following hold:
Most of the bytes of your object are composed of data types that Dask’s custom serializtion already handles well, like Numpy arrays.
Your object doesn’t require any special constructor logic, other than object.__new__(cls)
See also
Examples
>>> import sklearn.base >>> from distributed.protocol import register_generic >>> register_generic(sklearn.base.BaseEstimator)