Use Dask interactively as a backend
Dask can be used as a backend for the distributed execution of ROOT's RDataFrame computations. More specifically, ROOT RDataFrame supports distributed execution via the ROOT.RDF.Experimental.Distributed module.
Set up
In a notebook, once a Dask client has been deployed and a Dask client instatiated as described here, you can start with:
from dask.distributed import Client
client = Client(<Dask scheduler address)
In alternative, you can drag and drop the Cluster block on the left panel into the notebook. That will create the cell above automatically with the correct values already compiled.
Using RDataframe
If you want to use DASK to distribute RDataFrame payloads, you need now to declare a distributed DataFrame, passing all the necessary information
import ROOT
df = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame(
<name of the tree>,
chain,
npartitions=<maximum number of partitions (i.e. Dask tasks)>,
daskclient=client,
)
In most cases you would like to declare custom functions to the ROOT interpreter in order to perform specific calculations on data. To do this, you need to use the ROOT.RDF.Experimental.Distributed.initialize function to initialize each worker.
text_file = open("postselection.h", "r")
data = text_file.read()
def my_initialization_function():
ROOT.gInterpreter.Declare('{}'.format(data))
ROOT.RDF.Experimental.Distributed.initialize(my_initialization_function)
now you are ready to book all the computations you need to do on the dataframe using RDataFrame methods. Once all the operations are booked, trigger the distributed execution by doing some actions on booked items, as you would do using RDataFrame locally. Results can be accessed in the same way as local RDataFrame, too.
Using Coffea
TODO