Use Dask interactively as a backend

Dask can be used as a backend for the distributed execution of ROOT's RDataFrame computations. More specifically, ROOT RDataFrame supports distributed execution via the ROOT.RDF.Experimental.Distributed module.

Set up

In a notebook, once a Dask cluster has been deployed as described here, you can instantiate a Dask client connected to it with:

from dask.distributed import Client

client = Client(<Dask scheduler address>)

Alternatively, you can drag and drop the Cluster block from the left panel into the notebook. This creates the cell above automatically, with the correct values already filled in.

Using RDataFrame

If you want to use Dask to distribute RDataFrame payloads, you now need to declare a distributed RDataFrame, passing it all the necessary information:

import ROOT

df = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame(
    <name of the tree>,
    chain,  # input dataset, e.g. a list of ROOT file paths; must be defined beforehand
    npartitions=<maximum number of partitions (i.e. Dask tasks)>,
    daskclient=client,
)

In most cases you will want to declare custom functions to the ROOT interpreter in order to perform specific calculations on the data. To do this, use the ROOT.RDF.Experimental.Distributed.initialize function, which registers a function that every worker runs before starting the computation:

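# Read the header content on the client side
with open("postselection.h", "r") as header_file:
    header_code = header_file.read()

def my_initialization_function():
    # Declare the custom C++ code to the ROOT interpreter
    ROOT.gInterpreter.Declare(header_code)

ROOT.RDF.Experimental.Distributed.initialize(my_initialization_function)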

Now you are ready to book all the computations you need on the dataframe using the usual RDataFrame methods. Once all the operations are booked, trigger the distributed execution by accessing one of the booked results, as you would with a local RDataFrame. Results can also be accessed in the same way as with a local RDataFrame.
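As an illustration of the book-then-trigger pattern, here is a minimal sketch. The function name book_analysis, the column names pt and eta, the cut value and the histogram binning are all hypothetical and only serve the example; adapt them to your own tree. The function only books operations and returns the result handles, so nothing is computed until one of the results is accessed.

```python
def book_analysis(df):
    """Book a selection, a derived column, a histogram and a count.

    `df` is expected to be an RDataFrame (local or distributed).
    The column names "pt" and "eta" are hypothetical placeholders.
    """
    # Keep only events passing a (hypothetical) transverse momentum cut
    selected = df.Filter("pt > 20", "pt cut")
    # Define a new column from an existing one
    with_pt2 = selected.Define("pt2", "pt * pt")
    # Book a histogram: (name, title, nbins, xmin, xmax) model, then the column
    hist = with_pt2.Histo1D(("h_pt2", "pt squared;pt^{2};Events", 50, 0.0, 10000.0), "pt2")
    # Book a count of the selected events
    count = with_pt2.Count()
    return hist, count


# Typical usage in the notebook, assuming `df` is the distributed
# RDataFrame declared earlier (left commented out here because it
# needs ROOT and a running Dask cluster):
#
# hist, count = book_analysis(df)
# print(count.GetValue())  # accessing a result triggers the execution
# hist.Draw()
```

Booking everything before accessing any result is usually preferable, since the booked actions can then be computed together rather than in separate passes over the data.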

Using Coffea

TODO