Provenance Considerations
cale Owner 6:09 PM The smallest unit one might ask for is an observation, right? Does that unit have a system-wide unique ID? Do we store a digital fingerprint of such an entity (like an MD5 sum or similar)? Do we keep older versions (with different fingerprints)?
The background of this question is: should we ever offer data via a Python (or similar) interface to remote sites, they'd probably want to build up a local cache. If an observation changes (e.g. due to changed post-processing), the user needs to be informed that there are new versions of the data they are processing. The user must have the ability to freeze data to their local cache version.
An MD5 sum would be really handy for constructing the cache and naming local files.
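A minimal sketch of how such a fingerprint could be computed and used to name local cache files. Nothing like this exists in the system yet; the function names, the choice to hash file names together with file contents, and the `<observation name>.<fingerprint>` naming scheme are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def observation_fingerprint(files):
    """Return one MD5 hex digest covering all files of an observation.

    Hypothetical helper: files are hashed in name order so the result
    does not depend on the order in which they are listed.
    """
    digest = hashlib.md5()
    for path in sorted((Path(p) for p in files), key=lambda p: p.name):
        # Include the file name so that renaming a file changes the digest.
        digest.update(path.name.encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to avoid loading large files into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, observation_name, fingerprint):
    """Build a local cache location from the name and the fingerprint."""
    return Path(cache_dir) / f"{observation_name}.{fingerprint}"
```

With this naming, two versions of the same observation end up in different cache entries, so an older version can stay on disk untouched while a newer one is downloaded next to it.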
schaffer schaffer 9:39 AM Hi Cale, basically that's right. We usually only deal with observations as units of data. There are some sub-cases where it's possible to download individual files, for instance logs, location plots or observation previews. But all of those are again tied to the observation. Actual data is currently always bundled by observation, but in principle, the DB architecture would also allow for single-file query and access.
schaffer schaffer 9:47 AM Concerning the provenance, we currently don't have any system for storing MD5 hashes of the observation or similar. Each observation does get a unique name, which is later used as the primary method of referring to the object. The name usually contains the date and instrument name as well as sufficient other characteristics to clearly distinguish the observation. There is no versioning in place as of now, but I agree that it could be useful in the future. Implementing something right now is difficult, as the requirements are far from clear, but it's definitely a point to keep in mind. Adding something like a fingerprint and versioning information on top of the current system wouldn't be very difficult and can be implemented when needed. I would suggest sticking with the approach we have now: there is currently a disclaimer stating that the data is provided as-is and may change without notice, so at least no one should expect us to stay 100% consistent. I'll open an issue on GitLab and add this conversation to the description.
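For reference when the issue is picked up, here is a rough sketch of what fingerprint and versioning information layered on top of the existing unique observation name might look like. It is only an illustration under assumptions: `ObservationRecord`, `register`, and `check_cache` are hypothetical names, and the real requirements are still open.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationRecord:
    """Hypothetical version history, keyed by the existing unique name."""
    name: str
    fingerprints: list = field(default_factory=list)  # oldest first

    @property
    def latest(self):
        """Fingerprint of the newest version, or None if none is recorded."""
        return self.fingerprints[-1] if self.fingerprints else None

    def register(self, fingerprint):
        """Record a fingerprint and return the current version number.

        A new version is appended only when the fingerprint differs from
        the latest one, e.g. after changed post-processing.
        """
        if self.latest != fingerprint:
            self.fingerprints.append(fingerprint)
        return len(self.fingerprints)


def check_cache(record, cached_fingerprint, frozen=False):
    """Compare a user's cached fingerprint with the latest known version."""
    if cached_fingerprint == record.latest:
        return "current"
    # A frozen cache deliberately stays on its older version.
    return "frozen" if frozen else "outdated"
```

A remote client could call something like `check_cache` before processing and warn the user on `"outdated"`, while a frozen cache keeps its older version on purpose.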
(edit: reformatting text)