Commit 8695455d authored by Drew Leonard

Design document

parent b83f90e4
%% Cell type:markdown id: tags:
Much of the following is adapted from a demonstration presented to KIS staff during the design phase of the project. That demonstration can be found [here](https://gitlab.leibniz-kis.de/ajl/python-user-api/-/blob/fido-demo/fido-demo.ipynb).
%% Cell type:markdown id: tags:
## Overview of the API design
%% Cell type:markdown id: tags:
Programmatic access to the SDC data will be provided by a custom client for SunPy's `Fido` interface. This approach will allow the user to query and download data using SunPy, which will provide consistency with users' experiences working with other solar data in Python, and compatibility with the large and growing number of available scientific Python packages. It will also minimise duplication of effort by building on the achievements of the SunPy community rather than developing an entirely new interface from scratch.
%% Cell type:markdown id: tags:
## Relationship to SunPy
%% Cell type:markdown id: tags:
`Fido` is SunPy's unified search and download interface. It allows users to query many different data sources at once using a system of attributes to specify the particular data the user is looking for. A query returns a table of results which can be inspected and sliced, such that the user can either download all of those results or a subset of them. This interface is mature and well maintained, and provides a good basis for the SDC tools. The custom client will therefore function as a backend to `Fido`, providing access to the SDC servers while abstracting that functionality away and leaving the user experience mostly unchanged.
%% Cell type:markdown id: tags:
SunPy also provides a variety of attributes in its `attrs` submodule for the purpose of defining data parameters. Those defined in SunPy core are suitable for most purposes, but where they are not sufficient to describe data provided by the SDC, custom definitions will be made available through the recommended SunPy API. This allows the workflow of querying and downloading data to remain the same, even when using custom attrs.
%% Cell type:markdown id: tags:
In the first instance the SDC tools will be developed in a stand-alone package distinct and separate from SunPy. At a later date and in consultation with the SDC and SunPy community, that package may apply to become [SunPy Affiliated](https://sunpy.org/project/affiliated), or some or all of the functionality may be merged into the core SunPy package. This eventual choice will not impact the design.
%% Cell type:markdown id: tags:
## Implications for Data Centre
%% Cell type:markdown id: tags:
So far the systems at the Data Centre appear to be very well designed overall; in particular, the quality of the FITS headers provided with the test datasets has been very high. This has made the design of the tools somewhat easier and means that very little should need to change to facilitate the development of the user API. However, a few minor points have emerged which the Data Centre should consider.
%% Cell type:markdown id: tags:
First, the current implementation of the prototype tools is such that Fido returns one result per file stored at the Data Centre, with each observation consisting of many files. Some thought should be given to whether this is the expected user experience or whether the Data Centre wishes to facilitate (or enforce) downloading only entire datasets. This decision will be important in determining how the results table is constructed and populated, which in turn may have implications for how to store the data and metadata. For example, if the client has to obtain the metadata for each of a large number of files in a given observation, this may put a greater load on the KIS servers than simply returning the single observation record.
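%% Cell type:markdown id: tags:
To make the trade-off above concrete, the following sketch collapses a per-file result list into one row per observation. The field names (`obs_id`, `filename`) and the sample values are hypothetical and are not taken from the SDC metadata schema. The point is only that a per-observation table needs a single record per dataset, whereas a per-file table grows with the number of files.

```python
from collections import defaultdict

# Hypothetical per-file records, as a per-file results table might hold them.
# Field names and values are illustrative only, not the SDC schema.
file_records = [
    {"obs_id": "obs_A", "filename": "obs_A_0001.fits"},
    {"obs_id": "obs_A", "filename": "obs_A_0002.fits"},
    {"obs_id": "obs_A", "filename": "obs_A_0003.fits"},
    {"obs_id": "obs_B", "filename": "obs_B_0001.fits"},
]


def group_by_observation(records):
    """Collapse per-file records into one summary row per observation."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["obs_id"]].append(record["filename"])
    # One row per observation: its identifier, file count and file list.
    return [
        {"obs_id": obs_id, "n_files": len(names), "filenames": names}
        for obs_id, names in grouped.items()
    ]


observation_rows = group_by_observation(file_records)
for row in observation_rows:
    print(row["obs_id"], row["n_files"])
```

Whether this grouping would happen on the server or in the client is exactly the design decision raised above; the sketch does not assume either.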
%% Cell type:markdown id: tags:
Second, in the observation used for testing, the CRVAL keyword (the reference value) changes from one file to the next, whereas CRPIX (the reference pixel) stays the same. This arrangement probably introduces a small inaccuracy when dealing with the WCS headers, because it means that the centre of the celestial projection moves during the observation. It would be more accurate to keep the reference value constant across an observation while changing the reference pixel; for most files this would mean the reference pixel lying outside the grid of the image stored in the file. The immediate practical advantage would be that the WCS header information is consistent across all files of an observation, such that the header from any one file accurately describes the coordinate system for all of them. The extent to which this distinction is a problem may depend on how the instrument works, but it is something of which the Data Centre should be aware.
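%% Cell type:markdown id: tags:
The suggested change of reference can be illustrated for a single linear axis, where the world coordinate is `CRVAL + CDELT * (pixel - CRPIX)`. The sketch below is a simplification under stated assumptions: it ignores the celestial projection and any rotation terms, and the numerical values are hypothetical rather than taken from the test observation. It shows how a per-file CRVAL can be traded for a per-file CRPIX while describing the same pixel-to-world mapping.

```python
def pixel_to_world(pixel, crval, crpix, cdelt):
    """Linear one-axis WCS: world = CRVAL + CDELT * (pixel - CRPIX)."""
    return crval + cdelt * (pixel - crpix)


def shift_reference(crval, crpix, cdelt, common_crval):
    """Return the CRPIX that gives the same linear axis with CRVAL fixed at common_crval."""
    return crpix - (crval - common_crval) / cdelt


# Hypothetical header values for three consecutive files of one observation.
cdelt = 0.5                  # world units per pixel
crpix = 1.0                  # the same reference pixel in every file
crvals = [10.0, 10.5, 11.0]  # reference value drifting from file to file
common_crval = crvals[0]     # keep the first file's reference value throughout

shifted_crpix = [shift_reference(v, crpix, cdelt, common_crval) for v in crvals]
print(shifted_crpix)  # [1.0, 0.0, -1.0]
```

With FITS pixel indices starting at 1, the shifted reference pixels of the second and third files fall outside the image grid, as described above.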
%% Cell type:markdown id: tags:
## Demonstration of API
%% Cell type:markdown id: tags:
`Fido` and `attrs` can both be imported from `sunpy.net`.
%% Cell type:code id: tags:
``` python
from sunpy.net import Fido, attrs as a
```
%% Cell type:markdown id: tags:
Attributes are used to specify particular aspects of the data, such as instrument, time range or wavelength.
%% Cell type:markdown id: tags:
To query data we pass one or more attributes to `Fido.search()`, which returns data that match those attributes. Searches can be made arbitrarily complex by including many different attributes, and also by using the `|` (OR) operator. So for example we can search for data from either of two different instruments in a given time period:
%% Cell type:code id: tags:
``` python
result = Fido.search(a.Time('2012/3/4', '2012/3/4 02:00'), a.Instrument.lyra | a.Instrument.rhessi)
```
%% Cell type:code id: tags:
``` python
result
```
%% Cell type:markdown id: tags:
Once we've selected which files we want, these can then be downloaded using `Fido.fetch()`, which takes as an argument a results table of the kind returned by `search()`.
%% Cell type:code id: tags:
``` python
files = Fido.fetch(result)
```
%% Cell type:markdown id: tags:
`fetch()` downloads these results to the location specified in the SunPy configuration (by default `/home/<user>/sunpy/data`) and returns a list-like object containing the paths of the downloaded files. Here we've passed the whole search result to download all of the files, but we could also have sliced it to download only certain files.
%% Cell type:code id: tags:
``` python
files
```
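%% Cell type:markdown id: tags:
The slicing mentioned above can be sketched as follows. The helper name `fetch_first_rows` is hypothetical and introduced only for illustration. It assumes that the search result can be indexed as `result[table_index, row_slice]` and that `Fido.fetch()` accepts a `path` argument with a `{file}` placeholder; check both against the installed SunPy version.

```python
def fetch_first_rows(result, n_rows=5, path=None):
    """Download only the first n_rows rows of the first results table."""
    # Imported lazily so that defining this helper has no side effects.
    from sunpy.net import Fido

    # The first index selects a provider's table; the second slices its rows.
    subset = result[0, :n_rows]
    if path is None:
        # Fall back to the download directory from the SunPy configuration.
        return Fido.fetch(subset)
    return Fido.fetch(subset, path=path)
```

For example, `fetch_first_rows(result, n_rows=3, path="./sdc_data/{file}")` would download the first three rows into a local `sdc_data` directory, which is an illustrative location.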
%% Cell type:markdown id: tags:
The custom client is required to provide access to the SDC servers. The client is automatically registered with Fido when it is imported, allowing the user to interact with the existing SunPy interfaces rather than directly with the custom client.
%% Cell type:code id: tags:
``` python
from sdc.client import KISClient
```
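%% Cell type:markdown id: tags:
One common way to make a client register itself on import is for a base class to record each subclass at the moment it is defined. The sketch below is a toy model of that pattern. All names in it (`ToyFido`, `ToyBaseClient`, `ToyKISClient`, `can_handle`) are hypothetical, and it is not the actual code of SunPy or of the SDC client.

```python
class ToyFido:
    """Stand-in for a unified search facade; not SunPy's actual Fido."""

    registry = {}

    @classmethod
    def search(cls, instrument):
        # Dispatch the query to every registered client that can handle it.
        return [
            client().search(instrument)
            for client in cls.registry.values()
            if client.can_handle(instrument)
        ]


class ToyBaseClient:
    """Base class whose subclasses register themselves when they are defined."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        ToyFido.registry[cls.__name__] = cls


class ToyKISClient(ToyBaseClient):
    """Defining (and so importing) this class is enough to register it."""

    instruments = {"GRIS"}

    @classmethod
    def can_handle(cls, instrument):
        return instrument in cls.instruments

    def search(self, instrument):
        return f"{type(self).__name__} handled a query for {instrument}"


print(ToyFido.search("GRIS"))
```

Because registration happens when the class body is executed, the user never has to call the client directly, which matches the workflow described above.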
%% Cell type:markdown id: tags:
Querying SDC data is then done in the same way as any other Fido query, as described above, e.g.:
%% Cell type:code id: tags:
``` python
result = Fido.search(a.Instrument("GRIS"), a.Time("2014/04/26", "2014/04/27"), a.Level(1))
```
%% Cell type:markdown id: tags:
**Note:** it is currently necessary to specify the processing level of the data you want - in this case level one. This is expected to change at a later date so that level one is the default, but it will still be possible to specify other levels.
%% Cell type:code id: tags:
``` python
result
```
%% Cell type:markdown id: tags:
As we saw earlier, we can download these files using `Fido.fetch()`. In this case, due to how the `KISClient` and the Data Centre are set up, it is possible to download either the actual FITS files containing the data, or JSON files containing the observation records. To download the data it is necessary to specify the keyword `binary=True`. This keyword is not expected to remain in the final API, which should download the data by default.
%% Cell type:code id: tags:
``` python
files = Fido.fetch(result, binary=True)
```
%% Cell type:code id: tags:
``` python
files
```
%% Cell type:markdown id: tags:
These files can now be loaded and interacted with in a variety of ways, with several options available in the scientific Python ecosystem. For a dataset of this size, the most suitable way would be to use [`ndcube`](https://docs.sunpy.org/projects/ndcube/en/stable/index.html) to load the files as a single n-dimensional coordinate-aware array. For more details on how to do this, see the [full demonstration](https://gitlab.leibniz-kis.de/ajl/python-user-api/-/blob/fido-demo/fido-demo.ipynb).