Setup
=====

This page describes various ways to set up Dask on different hardware, either
locally on your own machine or on a distributed cluster.  If you are just
getting started, then you can skip this page entirely: Dask does not require
any setup if you only want to use it on a single computer.

Dask has two families of task schedulers:

1.  **Single-machine scheduler**: This scheduler provides basic features on a
    local process or thread pool.  This scheduler was made first and is the
    default.  It is simple and cheap to use, although it can only be used on
    a single machine and does not scale.
2.  **Distributed scheduler**: This scheduler is more sophisticated.  It
    offers more features, but also requires a bit more effort to set up.  It
    can run locally or distributed across a cluster.

If you import Dask, set up a computation, and then call ``compute``, then you
will use the single-machine scheduler by default.  To use the
``dask.distributed`` scheduler you must set up a ``Client``:

.. code-block:: python

   import dask.dataframe as dd
   df = dd.read_csv(...)
   df.x.sum().compute()  # This uses the single-machine scheduler by default

.. code-block:: python

   from dask.distributed import Client
   client = Client(...)  # Connect to distributed cluster and override default
   df.x.sum().compute()  # This now runs on the distributed system

Note that the newer ``dask.distributed`` scheduler is often preferable, even
on single workstations.  It contains many diagnostics and features not found
in the older single-machine scheduler.  A minimal single-workstation example
is sketched at the end of this page.

The following pages explain in more detail how to set up Dask on a variety of
local and distributed hardware:

-   Single Machine:

    -   :doc:`Default Scheduler <setup/single-machine>`: The no-setup
        default.  Uses local threads or processes for larger-than-memory
        processing.
    -   :doc:`dask.distributed <setup/single-distributed>`: The
        sophistication of the newer system on a single machine.  This
        provides more advanced features while still requiring almost no
        setup.

-   Distributed computing:

    -   :doc:`Manual Setup <setup/cli>`: The command line interface to set up
        ``dask-scheduler`` and ``dask-worker`` processes.  Useful for IT or
        anyone building a deployment solution.
    -   :doc:`SSH <setup/ssh>`: Use SSH to set up Dask across an un-managed
        cluster.
    -   :doc:`High Performance Computers <setup/hpc>`: How to run Dask on
        traditional HPC environments using tools like MPI, or job schedulers
        like SLURM, SGE, TORQUE, LSF, and so on.
    -   :doc:`Kubernetes <setup/kubernetes>`: Deploy Dask with the popular
        Kubernetes resource manager using either Helm or a native deployment.
    -   `YARN / Hadoop <https://yarn.dask.org>`_: Deploy Dask on YARN
        clusters, such as are found in traditional Hadoop installations.
    -   :doc:`Python API (advanced) <setup/python-advanced>`: Create
        ``Scheduler`` and ``Worker`` objects from Python as part of a
        distributed Tornado TCP application.  This page is useful for those
        building custom frameworks.
    -   :doc:`Docker <setup/docker>` containers are available and may be
        useful in some of the solutions above.
    -   :doc:`Cloud <setup/cloud>` for current recommendations on how to
        deploy Dask and Jupyter on common cloud providers like Amazon,
        Google, or Microsoft Azure.

.. toctree::
   :maxdepth: 1
   :hidden:
   :caption: Getting Started

   setup/single-machine.rst
   setup/single-distributed.rst
   setup/cli.rst
   setup/ssh.rst
   setup/hpc.rst
   setup/kubernetes.rst
   YARN / Hadoop <https://yarn.dask.org>
   setup/python-advanced.rst
   setup/cloud.rst
   setup/adaptive.rst
   setup/docker.rst
   setup/custom-startup.rst
   setup/prometheus.rst
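To make the single-workstation recommendation above concrete, here is a
minimal sketch of using the distributed scheduler locally.  Calling
``Client()`` with no arguments starts a scheduler and worker processes on the
current machine and connects to them; the file name ``myfile.csv`` is a
placeholder for your own data.

.. code-block:: python

   import dask.dataframe as dd
   from dask.distributed import Client

   # With no arguments, Client() launches a scheduler and workers
   # on this machine and connects to them
   client = Client()

   df = dd.read_csv("myfile.csv")  # placeholder file name
   df.x.sum().compute()            # runs on the local distributed scheduler

   client.close()                  # shut down the local workers when done

Everything stays on one machine, but you gain the distributed scheduler's
diagnostics, such as its web dashboard.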