Using EOEPCA with Research Platforms

A study was carried out as part of a project funded by ESA, through Telespazio Vega UK, aimed at supporting the uptake and operations of the Earth Observation Exploitation Platform Common Architecture (EOEPCA). The research was conducted by the Centre for Environmental Data Analysis (CEDA), which is part of the United Kingdom Research and Innovation / Science and Technology Facilities Council (UKRI-STFC). CEDA operates the JASMIN data and compute cluster, primarily supporting scientific research under the Natural Environment Research Council (NERC).

This study focuses on Task E4 of the contract, which involves reporting on the deployment of EOEPCA within a research platform, including its implications and findings, and Task E5, which investigates the integration of ADES with a batch processing environment to enable the scaling up of processing tasks.

For the full report, please click here.

JASMIN is mainly provided to support academic research, but extends to support the work of organisations, such as the Met Office, that have a strong collaborative link with the atmospheric research community. During its lifetime, JASMIN has supported a range of high-profile projects and datasets, including the 6th Coupled Model Intercomparison Project (CMIP6), the EU Horizon 2020 PRIMAVERA project, the ESA-CCI Open Data Portal and a multi-petabyte store of Sentinel satellite products.

As a Research Platform (RP), JASMIN is used in many modes, for example:

  • A scientist logs in via SSH and runs their own code against terabytes of climate/EO data.
  • A team of scientists develops and runs a data-processing model, generating a new product that is published to the CEDA archive and minted with a DOI.
  • An international project uses JASMIN to store data and builds tools to optimise access to that data.
  • A project provides its own web tool as an interface to existing data on JASMIN, deploying it on its own Kubernetes cluster in the JASMIN cloud and managing its own users and access rules.
  • A scientist logs into the JASMIN Notebook Service to develop a data-driven notebook that will accompany a scientific paper to explain the workflow.
  • A university runs a training event (such as a hackathon) and uses JASMIN training accounts to give 50 participants temporary access to JASMIN resources.


Overview of the JASMIN EOEPCA Developments

Deployment of EOEPCA on the JASMIN Cloud

A tenancy was created in the JASMIN cloud environment to allow the CEDA team to install, test, and develop the JASMIN EOEPCA instance. This was built on top of a Kubernetes “Cluster-as-a-Service” offering provided by the JASMIN cloud infrastructure. A vanilla EOEPCA instance was installed using the recipes provided by the EOEPCA help guides, and with expert help from the EOEPCA team.
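
The EOEPCA building blocks are distributed as Helm charts, so an install onto such a Kubernetes cluster follows the usual Helm pattern. The sketch below is illustrative only: the chart repository URL, chart name, release name, and values file are assumptions and should be taken from the EOEPCA documentation for the release being deployed.

    # Illustrative only: deploy an EOEPCA building block (e.g. the ADES) with Helm.
    # Repository URL, chart name and values file are assumptions, not a verified recipe.
    helm repo add eoepca https://eoepca.github.io/helm-charts
    helm repo update
    helm install ades eoepca/ades \
        --namespace eoepca --create-namespace \
        --values ades-values.yaml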

Integrating the Slurm Scheduler and the ADES

A key part of the work was to investigate using the ADES with a scheduling tool rather than using Kubernetes to control and execute workflows. The potential advantages of this would be:

  • Enabling the deployment of large workflows (i.e., those that might require multiple nodes and more complex compute environments).
  • The potential to re-use pre-installed software environments rather than deploying the software for each execution.
  • The possibility of connecting to pre-existing processing clusters, such as LOTUS on JASMIN, and executing workflows on them.

The work involved various components, as follows:

  1. Preparation of a command-line tool (see next section) to provide remote subsetting of example datasets held on JASMIN (e.g., ESACCI data), to be deployed and executed through the ADES.
  2. Integration of the ADES scheduling and deployment components with the Slurm scheduler, using a Workflow Execution Service (WES).
  3. Deployment of a Slurm cluster within the cloud tenancy.
  4. Integration with the TOIL tool for executing CWL workflows as Slurm jobs (see Appendix 1 for more details). This was managed using a plugin for the Zoo Project.
  5. Integration with the Singularity tool to enable Docker images to be converted into containers that can run with user (rather than root) permissions (see the sketch after this list).
  6. Testing of the integration with the prototype “snuggs” application.
  7. Integration and testing with the CEDA subsetting tool.
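
As an illustration of the Singularity integration in item 5, a Docker image can be converted into a Singularity image file and then executed entirely with the invoking user's permissions (the image used below is just a public example, not one of the project containers):

    # Convert a Docker image into a Singularity image file (SIF)
    singularity build ubuntu.sif docker://ubuntu:22.04
    # Commands inside the container run as the invoking user, not as root
    singularity exec ubuntu.sif id -un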

Details of the integration work are documented at: https://github.com/cedadev/eoepca-zoo-wes-runner/tree/main/docs

That documentation explains how TOIL is configured and deployed alongside Slurm and Singularity, enabling full integration with the ADES so that workflows are deployed and executed as follows:

  1. The job is described on the command line using the `toil-cwl-runner`, e.g.:

         toil-cwl-runner \
             --maxMemory 10Gib \
             --batchSystem slurm --singularity \
             --workDir ~/daops-work/work_dir \
             --jobStore ~/daops-work/job_store/$(uuidgen) \
             ~/daops-work/daops/app-package.cwl#daops \
             ~/daops-work/daops/cli-params.yml

  2. The TOIL runner contacts the Slurm scheduler to schedule the job.
  3. When Slurm is ready to execute the job, it invokes the instructions in the `app-package.cwl` workflow description, which defines the command-line tool signature, the processing environment, and the Dockerfile required for the application (a minimal sketch of such a file follows this list).
  4. Singularity is then invoked to pull and build the container image (or use a locally cached version).
  5. The input parameters are described in the `cli-params.yml` file, e.g.:

         ---
         area: "30,-10,65,30"
         time: "2000-01-01/2000-02-30"
         time_components: ""
         levels: "/"
         file_namer: "simple"
         output_dir: "."
         collection: "https://data.ceda.ac.uk/neodc/esacci/cloud/metadata/kerchunk/version3/L3C/ATSR2-AATSR/v3.0/ESACCI-L3C_CLOUD-CLD_PRODUCTS-ATSR2_AATSR-199506-201204-fv3.0-kr1.1.json"

  6. The TOIL runner then invokes the command, with the required inputs, inside the container, and the job is executed.
  7. Finally, the outputs are staged out to the user workspace.
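
For orientation, the sketch below shows the general shape of such an `app-package.cwl` file: a CWL `$graph` containing a Workflow (whose id matches the `#daops` fragment used on the command line) and a CommandLineTool carrying a DockerRequirement. It is a minimal, illustrative sketch only; the real tool signature, parameter names, and container image are defined in the daops repository referenced in the next section.

    cwlVersion: v1.0
    $graph:
      - class: Workflow
        id: daops                        # matches the "#daops" fragment used above
        inputs:
          collection: string
          area: string
          time: string
        outputs:
          results:
            type: Directory
            outputSource: subset/results
        steps:
          subset:
            run: "#daops_subset"
            in:
              collection: collection
              area: area
              time: time
            out: [results]
      - class: CommandLineTool
        id: daops_subset
        baseCommand: daops               # command name is an assumption
        hints:
          DockerRequirement:
            dockerPull: example/daops:latest   # illustrative image reference
        inputs:
          collection:
            type: string
            inputBinding: {position: 1}
          area:
            type: string
            inputBinding: {prefix: --area}
          time:
            type: string
            inputBinding: {prefix: --time}
        outputs:
          results:
            type: Directory
            outputBinding: {glob: .}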

A template repository is also provided as a starting point for the creation of new WES using TOIL: https://github.com/cedadev/eoepca-proc-service-template-wes
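
As an illustration of how a client (for example, the ADES-side service built from this template) interacts with the WES endpoint, the requests below follow the GA4GH WES 1.0 API that the TOIL server exposes. The host name is a placeholder, and `cli-params.json` is assumed to be a JSON rendering of the `cli-params.yml` shown above, since the WES API passes workflow parameters as JSON.

    # Submit a packaged CWL workflow to the WES endpoint (host name is a placeholder)
    curl -X POST "https://wes.example.org/ga4gh/wes/v1/runs" \
        -F "workflow_url=app-package.cwl" \
        -F "workflow_type=CWL" \
        -F "workflow_type_version=v1.0" \
        -F "workflow_params=<cli-params.json;type=application/json" \
        -F "workflow_attachment=@app-package.cwl"

    # Check the status of a submitted run using the run_id returned above
    curl "https://wes.example.org/ga4gh/wes/v1/runs/<run_id>/status"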

Development of the “daops” Subsetter and Deployment to the ADES

The “daops” library is part of the “roocs” framework developed by CEDA and DKRZ to provide subsetting and processing capabilities for climate simulation data for the Copernicus Climate Change Service Climate Data Store.

[Figure: the roocs framework]

“daops” stands for data-aware operations on climate simulations. It serves 3 main functions:

  1. Providing an aggregation layer that allows the client/user to refer to datasets (via meaningful identifiers) rather than individual files, thereby simplifying data usage.
  2. Enabling “hot-fixes” to datasets where errors exist in the source data/metadata, so that they can be corrected on the fly (at processing time).
  3. Providing a command-line tool to simplify access to basic operations such as subsetting and averaging.

In order to integrate the “daops” tool with the ADES, the following updates were made:

  1. Changes were made to the command-line tool:
    https://github.com/roocs/daops/blob/enable-kerchunk/daops/cli.py
  2. The Docker file was updated for integration with the CWL runner:
    https://github.com/roocs/daops/blob/enable-kerchunk/Dockerfile
  3. The “app-package.cwl” workflow description file was created to describe the “daops” tool to the ADES and the WES:
    https://github.com/roocs/daops/blob/enable-kerchunk/app-package.cwl
  4. A new capability was added to support remote data access using the Kerchunk file format, which describes, and enables read access to, chunks in remote files. This unit test demonstrates the capability:
    https://github.com/roocs/daops/blob/enable-kerchunk/tests/test_operations/test_subset.py#L97-L110



Appendix 1: Overview of the TOIL Integration

We tested the StreamFlow, TOIL, and Arvados CWL runners, all of which run on the head node of the Slurm cluster. The TOIL runner met our requirements as follows:

  • Able to schedule Slurm jobs automatically for a CWL workflow.
  • Can automatically convert Docker containers to Singularity containers, which is required to run containers without root privileges under Slurm.
  • Can run in server mode and provide a workflow execution service (WES) API.
  • Uses a scratch space in the HPC cluster to move outputs between CWL steps.

A particular advantage of TOIL is the built-in WES server, which meant:

  • Implementing our own API for this was not required.
  • Communication between the ADES and the Slurm cluster was significantly simplified.
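
As a rough sketch of this server mode, the built-in WES server can be started on the Slurm head node, after which clients interact with it through the GA4GH WES API. The options below are indicative only, and the `--opt` entries (used here to pass default Slurm/Singularity settings to each run) are assumptions; the TOIL documentation for the installed version is the authoritative reference.

    # Start TOIL's built-in WES server on the Slurm cluster head node
    # (flags are indicative; check "toil server --help" for the installed version)
    toil server --host 0.0.0.0 --port 8080 \
        --opt "--batchSystem=slurm" \
        --opt "--singularity"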

The figure below demonstrates how the ADES Kubernetes client can be replaced with a different (e.g., TOIL) WES client. The left-hand panel is reproduced from: https://github.com/EOEPCA/proc-ades-dev

[Figure: the ADES Kubernetes client and its replacement with a TOIL WES client]