How to View Dask Dashboard When Running on a Virtual Machine

How to view Dask dashboard when running on a virtual machine?

Finally figured it out with some SSH Tunneling.

More background on problem:

  • Local machine is a Windows laptop
  • Remote server is a CentOS box

The goal is actually two-fold:

  1. Run a Jupyter Notebook containing Dask code on the remote server

  2. View the Dask dashboard for the code running in that Notebook

Here are the steps I took:

  1. For this example, the IP address of the remote server is 11.11.11.111

  2. Following some instructions for port tunneling, I configured a tunnel in PuTTY with 8001 as the Source port and localhost:8889 as the Destination

  3. After connecting to the remote server (which has 16 cores and 44.7 GB of RAM), I ran this in the PuTTY terminal (this assumes a dask-scheduler process is already listening on port 8786 of the server): dask-worker tcp://11.11.11.111:8786 --memory-limit=auto --nthreads=1 --nprocs=16 &

  4. Start Jupyter Notebook on server: jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser &

    a. After running the above command, the output shows that the Jupyter Notebook is running at http://(hostname or 127.0.0.1):8889/?token=blahblahblah

    b. Opening a browser and going to the URL above (http://hostname:8889/?token=blahblahblah, or http://localhost:8001/?token=blahblahblah through the tunnel from step 2) brings me to the Jupyter Notebook home page

  5. Create new Notebook and run following code:

    import dask.dataframe as dd
    from dask.distributed import Client
    client = Client('11.11.11.111:8786')
    print(client)

The output shows the dashboard address:

    Client
      Scheduler: tcp://11.11.11.111:8786
      Dashboard: http://11.11.11.111:36124/status

    Cluster
      Workers: 16
      Cores: 16
      Memory: 44.70 GB

Now typing http://11.11.11.111:36124/status into a browser window takes me to the Dask Dashboard.
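
As a side note, the dashboard address can also be retrieved programmatically rather than read off the printed output. A minimal sketch, assuming the same scheduler address as above:

    from dask.distributed import Client

    # Connect to the scheduler started on the remote server
    client = Client('11.11.11.111:8786')

    # dashboard_link holds the same URL shown in the client output,
    # e.g. http://11.11.11.111:36124/status
    print(client.dashboard_link)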

Access dashboard on AWS ec2 local cluster

The answer is in the comments, but I will type it out here, so that the original question looks "answered".

You need two things to connect to a port on an EC2 machine: the external IP, and access. The former is most easily found from the AWS console. For the latter, you typically need to edit the security group to add an inbound TCP rule for the port (either open to the world, or just your IP). There are other ways to do this part, depending on whether your machine is inside a VPC, has any custom gateways or routers... but if you don't know what that means, find the security group first. Both the public IP and the security group will be linked from the machine's row in the EC2 "running instances" list.
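
If you prefer to script the security group change, here is a hedged sketch using boto3; the group ID, port, and CIDR below are placeholders to replace with your own values:

import boto3

ec2 = boto3.client("ec2")

# Add an inbound TCP rule for the Dask dashboard port (8787 by default),
# restricted to a single IP. All identifiers here are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # your instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8787,
        "ToPort": 8787,
        "IpRanges": [{"CidrIp": "203.0.113.5/32"}],  # just your IP
    }],
)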

Jupyter Lab Dask Extension on AKS: Dask dashboard windows empty

If you are able to successfully navigate to the dashboard in a separate tab, then copy that same address into the text field of the Dask labextension and things should work.

Dask with docker not showing anything in dask's dashboard

I've run through the same workflow that you describe and have a couple of pointers.

In order to connect to the Dask cluster you need to create a Client object.

So before you run any code in your notebook you first need to run

from dask.distributed import Client

# The address could also be omitted entirely, because it is set in the
# `DASK_SCHEDULER_ADDRESS` environment variable.
client = Client("tcp://scheduler:8786")

Your Dask array code will then execute on the cluster. However, it is worth noting that this is such a small amount of work that it completes almost immediately, and the dashboard shows nothing for me. If I head to the profile page, though, I can see a profile of the work that was executed, so it definitely ran there.

[Screenshot: Dashboard profile plot]

If I increase the array size to da.random.random((10_000, 10_000, 10), chunks=(1000, 1000, 5)), then I see activity on the dashboard.
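
For reference, the full snippet might look like the following; the final reduction is just an illustrative way to trigger computation:

import dask.array as da

# Large enough that the work stays visible on the dashboard for a while
x = da.random.random((10_000, 10_000, 10), chunks=(1000, 1000, 5))

# Any computation will do; mean() is just an example
x.mean().compute()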

[Screenshot: Dashboard task graph]


My final comment here is that the docker-compose.yml file you are using is actually part of the build pipeline for the Dask Docker image and is not really intended for folks to use to run Dask, although it does work. You may find this simpler config easier to work with.

version: "3.1"

services:
  scheduler:
    image: daskdev/dask
    hostname: dask-scheduler
    ports:
      - "8786:8786"
      - "8787:8787"
    command: ["dask-scheduler"]

  worker:
    image: daskdev/dask
    hostname: dask-worker
    command: ["dask-worker", "tcp://scheduler:8786"]

  notebook:
    image: daskdev/dask-notebook
    hostname: notebook
    ports:
      - "8888:8888"
    environment:
      - DASK_SCHEDULER_ADDRESS=tcp://scheduler:8786
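
With this file, connecting from the notebook container needs no explicit address at all, since the DASK_SCHEDULER_ADDRESS variable set above is picked up automatically:

from dask.distributed import Client

# No address needed: Client() falls back to DASK_SCHEDULER_ADDRESS
client = Client()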

How does Dask execute code on multiple vm's in the cloud

When running on multiple machines, Dask workers must have access to all required dependencies in order to run your code.

You have labelled your question with dask-kubernetes so I'll use that as an example. By default dask-kubernetes uses the daskdev/dask Docker image to run your workers. This image contains Python and the minimal dependencies to run Dask distributed.

If your code requires an external dependency you must ensure this is installed in the image. The Dask docker image supports installing extra packages at runtime by setting either the EXTRA_APT_PACKAGES, EXTRA_CONDA_PACKAGES or EXTRA_PIP_PACKAGES environment variables.

# worker-spec.yml
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
    - image: daskdev/dask:latest
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
      name: dask
      env:
        - name: EXTRA_APT_PACKAGES
          value: packagename  # Some package to install with `apt install`
        - name: EXTRA_PIP_PACKAGES
          value: packagename  # Some package to install with `pip install`
        - name: EXTRA_CONDA_PACKAGES
          value: packagename  # Some package to install with `conda install`
      resources:
        limits:
          cpu: "2"
          memory: 6G
        requests:
          cpu: "2"
          memory: 6G

Then the cluster is created from that spec in Python:

from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml('worker-spec.yml')
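
From there, a typical continuation (sketched, not prescriptive) is to scale the cluster and attach a client so that subsequent Dask work runs on the worker pods:

from dask.distributed import Client

cluster.scale(4)          # request four worker pods
client = Client(cluster)  # route Dask computations to the cluster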

The downside of this is that packages must be installed every time a worker starts, which can make adaptive scaling slow. Alternatively, you can create your own Docker image with all your dependencies already installed and publish it to Docker Hub (a minimal Dockerfile for this is sketched after the example below). Then use that image instead in your configuration.

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
    - image: me/mycustomimage:latest
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
      name: dask
      resources:
        limits:
          cpu: "2"
          memory: 6G
        requests:
          cpu: "2"
          memory: 6G
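
For completeness, such a custom image could be built from a minimal Dockerfile along these lines; the package name is a placeholder:

# Extend the official Dask image with your own dependencies
FROM daskdev/dask:latest
RUN pip install packagename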

