Results of the pipeline orchestration study for SDC
Coordinates
Coordinates need cross correlation to be accurate. Cross correlation currently has to be run manually through the CASSDA GUI and takes several minutes per run.
Cross-checking the current archive against /dat/sdc/gris yields the following status: ~1000 runs are present in total, ~700 are viable for cross correlation, and ~330 could have a cross correlation but don't. A detailed overview is in cross_correlation_status_overview.csv
The IFU has another ~1000 runs without any cross correlation, and Tenerife has ~250 more runs from 2020 onward, mixed between IFU and slit, for which no cross correlation has been done.
In total, 428 correlations exist.
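For future reruns of this census, a minimal sketch of how the cross-check could be scripted. The archive root, directory layout, and the `*.cc` marker file are assumptions for illustration, not the actual SDC conventions:

```python
# Sketch of the cross-check behind cross_correlation_status_overview.csv.
# Archive root, layout, and the *.cc marker are assumptions.
import csv
from pathlib import Path

ARCHIVE = Path("/archive/gris")  # hypothetical archive root
LOCAL = Path("/dat/sdc/gris")    # data tree referenced above


def run_names(root: Path) -> set[str]:
    """Names of all run directories directly under root."""
    return {p.name for p in root.iterdir() if p.is_dir()}


def has_cross_correlation(run_dir: Path) -> bool:
    # Assumption: a finished cross correlation leaves a *.cc result file.
    return run_dir.is_dir() and any(run_dir.glob("*.cc"))


archive_runs = run_names(ARCHIVE)
local_runs = run_names(LOCAL)

with open("cross_correlation_status_overview.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "in_archive", "in_local", "has_cross_correlation"])
    for name in sorted(archive_runs | local_runs):
        writer.writerow([
            name,
            name in archive_runs,
            name in local_runs,
            has_cross_correlation(LOCAL / name),
        ])
```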
Checking Status of L2 Pipeline
Capable of using split files as input, but many manual steps are still involved; processing probably takes > 10 min per run.
Kubernetes
Requirements for jobs:
- Defining new jobs or pipeline stages shouldn't require any knowledge of Docker or the orchestration tool; it should be enough to install the tools into a Docker image/local environment and then scale the task across the cluster (see the sketch below).
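As an illustration of what that could look like: a thin Python wrapper around the Kubernetes batch API, so that pipeline authors only supply an image and a command. The image, namespace, and script names are placeholders; this is a sketch, not a planned interface.

```python
# Sketch: hide Kubernetes details behind a single submit() call.
# Image, namespace, and command below are hypothetical placeholders.
from kubernetes import client, config


def submit(name: str, image: str, command: list[str], namespace: str = "sdc") -> None:
    """Run `command` inside `image` as a one-off Kubernetes Job."""
    config.load_kube_config()  # use the local kubeconfig credentials
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=2,  # retry a failed run twice
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(name="worker", image=image, command=command)
                    ],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


# A pipeline author would then only write something like:
submit(
    name="cross-correlation-run-0042",
    image="registry.example.org/sdc/pipeline:latest",
    command=["python", "run_stage.py", "--run-id", "0042"],
)
```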
Kubeflow??
Kubeflow scales ML applications across clouds/clusters, i.e. anything that runs a Kubernetes environment. Not interesting for our orchestration problem.
Building the Docker image
Automated building of the Docker image needs SSH access to the GitLab server. This means we need to forward our SSH credentials to Docker using the --ssh build option. Setting that up isn't entirely clear yet, but there are how-tos on Medium, various blogs, and the Docker docs. Remember to set the DOCKER_BUILDKIT flag! This answer looks good!
- Getting the conda env built nicely: https://pythonspeed.com/articles/activate-conda-dockerfile/
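Putting the notes above together, a sketch of what the Dockerfile and build invocation could look like. The GitLab host, repo path, and env name are placeholders, and the `conda run` entrypoint follows the pythonspeed article above rather than any existing SDC setup.

```dockerfile
# syntax=docker/dockerfile:1
# Build with BuildKit and SSH agent forwarding (host/repo are placeholders):
#   DOCKER_BUILDKIT=1 docker build --ssh default -t sdc-pipeline .
FROM continuumio/miniconda3

RUN apt-get update && apt-get install -y --no-install-recommends git openssh-client

# Build the conda env; use "conda run" later instead of "conda activate",
# as recommended in the pythonspeed article linked above.
COPY environment.yml .
RUN conda env create -f environment.yml

# Trust the GitLab host, then clone over SSH. The agent socket is mounted
# only for this RUN step via --ssh, so no key is baked into the image.
RUN mkdir -p ~/.ssh && ssh-keyscan gitlab.example.org >> ~/.ssh/known_hosts
RUN --mount=type=ssh git clone git@gitlab.example.org:sdc/pipeline.git /opt/pipeline

# Assumes environment.yml names the env "pipeline".
ENTRYPOINT ["conda", "run", "-n", "pipeline", "python", "/opt/pipeline/run_stage.py"]
```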