# ESCDF

License: LGPL v3. Authors: D. Caliste, F. Corsetti, J. Minar, M. Oliveira, Y. Pouillon, T. Ruh and D. Strubbe. Hosted on Gitlab.

libescdf is a library containing tools for reading and writing massive data structures related to electronic structure calculations, following the standards defined in ESCDF - Electronic Structure Common Data Format. It is under development.

## Technical choices

### HDF5

Platform-independence, parallel I/O, ...

### Storage of the arrays

While the API may present multidimensional arrays to the user, all arrays are stored internally as one-dimensional vectors. This circumvents a strong limitation of Fortran, which before the 2008 standard allowed arrays of at most 7 dimensions, an upper bound that can easily be exceeded by variables such as wavefunctions.
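As an illustration of the flat storage described above, here is a minimal sketch of how a multidimensional index can be mapped to a position in a one-dimensional vector. The function names `flat_offset` and `flat_size` are hypothetical and not part of libescdf, and the row-major (C) ordering is an assumption, since the ordering used by the library is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Map a multidimensional index to an offset in a flat 1D buffer.
 * Assumption: row-major (C) ordering, i.e. the last index varies fastest.
 * dims[i] is the extent of dimension i, idx[i] the zero-based index. */
size_t flat_offset(size_t ndims, const size_t *dims, const size_t *idx)
{
    size_t offset = 0;
    for (size_t i = 0; i < ndims; i++) {
        assert(idx[i] < dims[i]);            /* index must be in range */
        offset = offset * dims[i] + idx[i];  /* Horner-style accumulation */
    }
    return offset;
}

/* Number of elements the flat buffer must hold. */
size_t flat_size(size_t ndims, const size_t *dims)
{
    size_t n = 1;
    for (size_t i = 0; i < ndims; i++)
        n *= dims[i];
    return n;
}
```

Because the number of dimensions is just a runtime argument, the same code works for arrays of any rank.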

## API

### Design principles

To make it as flexible as possible, the API was designed with the following points in mind:

• The host code should not have to change the way it stores the data in memory. This includes the case where the data is distributed between different MPI processes.
• The user should be allowed to write and/or read the data in any order.
• When writing a chunk of data to the file, no validation check should be performed against other previously written chunks of data. This means that the overall correctness and completeness of the data should be checked separately.

It would have been nice to have a way of preventing users from creating incorrect or incomplete files. Unfortunately, because of the dependencies between different chunks of data, the only way to achieve this would be to write a group to disk only once all the information required to perform the validation is available. This is not practical and would have required keeping track of the synchronization between the data stored in memory and the data stored on disk.

### Handle

When initializing the library, either by opening an existing ESCDF file or by creating a new one, the library returns a handle. This handle is an abstract reference to that file, and all access to a file through libescdf is done using that handle.

#### C structure

The C structure is named escdf_handle_t and contains the following data:

• hid_t file_id: the HDF5 handler for the file.
• hid_t group_id: the HDF5 handler for the root group.
• Several extra variables used for parallelization.

#### Creators and destructors

The following functions are provided:

• escdf_handle_t * escdf_open(const char *filename, const char *path) This function creates an instance of escdf_handle_t by allocating the memory and opens an existing file. Optionally, it considers the root group to be given by path if it is not NULL. Note that this will return an error if the file does not exist.

• escdf_handle_t * escdf_create(const char *filename, const char *path) This function creates an instance of escdf_handle_t by allocating the memory and creates a file. Optionally, it considers the root group to be given by path if it is not NULL.

• escdf_errno_t escdf_close(escdf_handle_t *handle) This function closes a previously opened file.
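The life cycle of a handle can be sketched with a small stand-in that uses a plain `FILE *` instead of HDF5 identifiers. Everything prefixed with `demo_` is hypothetical and only mirrors the behaviour described above (opening fails if the file is missing, creating makes a new file, closing releases the handle); it is not the libescdf implementation and it ignores the optional path argument.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for escdf_handle_t: a FILE* replaces the HDF5 ids. */
typedef struct {
    FILE *file;
} demo_handle_t;

/* Wrap an already opened file in a freshly allocated handle. */
static demo_handle_t *demo_handle_from(FILE *file)
{
    if (file == NULL)
        return NULL;                 /* opening or creating failed */
    demo_handle_t *handle = malloc(sizeof *handle);
    if (handle == NULL) {
        fclose(file);                /* do not leak the file on failure */
        return NULL;
    }
    handle->file = file;
    return handle;
}

/* Mirrors escdf_open: returns NULL if the file does not exist. */
demo_handle_t *demo_open(const char *filename)
{
    assert(filename != NULL);
    return demo_handle_from(fopen(filename, "rb"));
}

/* Mirrors escdf_create: creates the file (this stand-in truncates it). */
demo_handle_t *demo_create(const char *filename)
{
    assert(filename != NULL);
    return demo_handle_from(fopen(filename, "wb"));
}

/* Mirrors escdf_close: returns 0 on success, -1 on error. */
int demo_close(demo_handle_t *handle)
{
    if (handle == NULL)
        return -1;
    int status = (fclose(handle->file) == 0) ? 0 : -1;
    free(handle);
    return status;
}
```

Returning `NULL` from the creators keeps error handling at the call site simple: a single pointer check tells the host code whether the file is usable.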

### Groups

For each group allowed by the specifications (system, densities, etc.), there is a C structure plus several functions. These are described below. In the following we use the system group as an example.

#### C structure

The C structure is named escdf_system and contains the following data:

• hid_t group_id: the HDF5 handler for the group.
• Several variables containing the metadata. The structure should also record whether each of them has been set.
• Several booleans indicating if a given dataset is present in the file.

Furthermore, the structure is private: it is defined in the C file, and only a typedef struct escdf_system escdf_system_t declaration appears in the corresponding header file.
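The private-structure arrangement is the usual opaque-type idiom in C. The sketch below shows it in a single file, with comments marking what would go in the header and what would stay in the C file. The `demo_` names and the metadata field `number_of_species` are hypothetical placeholders, and `long long` stands in for the HDF5 `hid_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* ---- Header part: the type is opaque, its layout is hidden. ---- */
typedef struct demo_system demo_system_t;

demo_system_t *demo_system_new(void);
void demo_system_free(demo_system_t *system);
bool demo_system_has_number_of_species(const demo_system_t *system);

/* ---- C file part: the only place where the fields are visible. ---- */
struct demo_system {
    long long group_id;              /* stand-in for the HDF5 hid_t */
    unsigned int number_of_species;  /* hypothetical metadata entry */
    bool number_of_species_is_set;   /* has this metadata been set? */
    bool has_positions_dataset;      /* hypothetical dataset-presence flag */
};

demo_system_t *demo_system_new(void)
{
    /* calloc zero-initializes: nothing is set, no dataset is present. */
    demo_system_t *system = calloc(1, sizeof *system);
    if (system != NULL)
        system->group_id = -1;       /* stand-in for "no group opened yet" */
    return system;
}

void demo_system_free(demo_system_t *system)
{
    free(system);                    /* free(NULL) is a no-op */
}

bool demo_system_has_number_of_species(const demo_system_t *system)
{
    assert(system != NULL);
    return system->number_of_species_is_set;
}
```

Since callers only ever see the typedef, the fields can change without breaking code that uses the library.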

The difference between what is called metadata and what is called data is not immediately obvious from the specifications, but the way they are handled by the library is different. The metadata is stored on disk and there is a copy of it in the C structure. As for the data, it is never explicitly stored in the structure. Instead, it is always directly written/read to/from the file on disk.

#### Low-level creators and destructors

The following low-level functions are provided:

• escdf_system_t * escdf_system_new() This function takes care of creating an instance of escdf_system_t by allocating the memory and it also initializes all its contents to the default values.

• void escdf_system_free(escdf_system_t *system) This function frees all the memory associated with the instance of the structure, including the instance itself.

• escdf_errno_t escdf_system_open_group(escdf_system_t *system, escdf_handle_t *handle, const char *path) This function opens a group from the file managed by the handle. If path is NULL, the group path is system, otherwise it is system/path. Note that this will return an error if the group does not exist.

• escdf_errno_t escdf_system_create_group(escdf_system_t *system, escdf_handle_t *handle, const char *path) This function creates a group within the file managed by the handle. If path is NULL, the group path is system, otherwise it is system/path.

• escdf_errno_t escdf_system_close_group(escdf_system_t *system) This function closes the group.
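The path rule used by the open and create functions above (system when path is NULL, system/path otherwise) can be sketched as a small helper. The name `demo_group_path` and its return convention are assumptions made for this illustration; libescdf may build the path differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the group path into `buffer`:
 *   "<group>"            if subpath is NULL,
 *   "<group>/<subpath>"  otherwise.
 * Returns 0 on success and -1 if the buffer is too small. */
int demo_group_path(char *buffer, size_t size,
                    const char *group, const char *subpath)
{
    assert(buffer != NULL && group != NULL);

    /* Characters needed, including the terminating '\0'. */
    size_t needed = strlen(group) + 1;
    if (subpath != NULL)
        needed += 1 + strlen(subpath);   /* '/' plus the sub-path */
    if (needed > size)
        return -1;

    if (subpath != NULL)
        snprintf(buffer, size, "%s/%s", group, subpath);
    else
        snprintf(buffer, size, "%s", group);
    return 0;
}
```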

#### High-level creators and destructors

The library provides the following high-level creators and destructors:

• escdf_system_t * escdf_system_open(escdf_handle_t *handle) This function performs the following tasks:

1. Call escdf_system_new to create an instance of the structure.
2. Call escdf_system_open_group. Note that this function will return an error if the group does not exist.
3. Call escdf_system_read_metadata to read all the metadata from the file and store it in memory.
4. Call escdf_system_is_correct and escdf_system_is_complete to verify if the data is valid. Return an error code if not.
• escdf_system_t * escdf_system_create(escdf_handle_t *handle) This function performs the following tasks:

1. Call escdf_system_new to create an instance of the structure.
2. Call escdf_system_create_group. Note that this function will delete all previous contents of the group.
• escdf_errno_t escdf_system_close(escdf_system_t *group) This function performs the following tasks:

1. Call escdf_system_is_correct and escdf_system_is_complete.
2. Call escdf_system_close_group to close the group.
3. Call escdf_system_free to free all memory.
4. Return an error code to signal if: a) The group was complete but incorrect; b) The group was correct but incomplete; c) The group was correct and complete.

• escdf_errno_t escdf_system_read_metadata(escdf_system_t *system) This function reads all the metadata from the file on disk and stores it in memory. Note: it is the responsibility of the user to call this function whenever the contents of the file change.

• escdf_errno_t escdf_system_set_* The setters should start by writing the value to disk. Only once the value has been successfully written to the file is it copied to the structure in memory. It is recommended that metadata that only make sense when taken together be set by calling a single set function rather than by calling several different set functions.

• escdf_errno_t escdf_system_get_* Getters should simply return the values stored in memory.

• escdf_errno_t escdf_system_copy_metadata(const escdf_system_t *src, escdf_system_t *dst) This function copies the metadata from one escdf_system_t structure to another. Once done, it writes the metadata to the file of the destination group. Note that it is the responsibility of the user to modify the destination group in any way necessary to make it valid.
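The setter and getter behaviour described in this list (write to disk first, update the in-memory copy only on success, read back from memory) can be sketched as follows. All `demo_` names, the error codes and the metadata entry `number_of_species` are hypothetical; the disk write is abstracted behind a function pointer so that the example does not depend on HDF5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DEMO_SUCCESS = 0,
    DEMO_ERROR_IO,      /* writing to the file failed */
    DEMO_ERROR_UNSET    /* the metadata has not been set */
} demo_errno_t;

/* Hypothetical disk writer; returns true when the write succeeded. */
typedef bool (*demo_writer_t)(const char *name, unsigned int value);

typedef struct {
    demo_writer_t write_attribute;   /* stand-in for the HDF5 write */
    unsigned int number_of_species;  /* in-memory copy of the metadata */
    bool number_of_species_is_set;
} demo_system_t;

/* Setter: disk first, memory second. */
demo_errno_t demo_system_set_number_of_species(demo_system_t *system,
                                               unsigned int value)
{
    assert(system != NULL && system->write_attribute != NULL);
    if (!system->write_attribute("number_of_species", value))
        return DEMO_ERROR_IO;        /* in-memory copy is left untouched */
    system->number_of_species = value;
    system->number_of_species_is_set = true;
    return DEMO_SUCCESS;
}

/* Getter: only looks at the in-memory copy. */
demo_errno_t demo_system_get_number_of_species(const demo_system_t *system,
                                               unsigned int *value)
{
    assert(system != NULL && value != NULL);
    if (!system->number_of_species_is_set)
        return DEMO_ERROR_UNSET;
    *value = system->number_of_species;
    return DEMO_SUCCESS;
}

/* Two trivial writers used to exercise both paths. */
bool demo_writer_ok(const char *name, unsigned int value)
{
    (void)name; (void)value;
    return true;
}

bool demo_writer_fail(const char *name, unsigned int value)
{
    (void)name; (void)value;
    return false;
}
```

Ordering the two steps this way means the structure in memory never holds a value that failed to reach the file.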

#### Data

• escdf_system_write_* These functions should take as argument a buffer containing all or part of the data to be written to a given dataset. Any attributes of the dataset, like units, should be passed as arguments of the function.

• escdf_system_read_* These functions read all or part of the data stored in the dataset and copy it to a buffer passed as argument. Any attributes of the dataset, like units, should be returned through output arguments of the function.

Both the read and write functions should take care of any necessary data reordering to read/write in parallel.
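Reading or writing "all or part" of a dataset can be pictured as copying a contiguous chunk at a given offset of the flat dataset. The sketch below uses an in-memory array as a stand-in for the dataset on disk; the `demo_` functions and their `start`/`count` parameters are assumptions for illustration. In HDF5 itself, partial access of this kind is typically expressed with hyperslab selections.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write `count` elements from `buffer` into the dataset, starting at
 * element `start`. Returns 0 on success, -1 if the chunk does not fit. */
int demo_write_chunk(double *dataset, size_t dataset_size,
                     const double *buffer, size_t start, size_t count)
{
    assert(dataset != NULL && buffer != NULL);
    if (start > dataset_size || count > dataset_size - start)
        return -1;
    memcpy(dataset + start, buffer, count * sizeof *buffer);
    return 0;
}

/* Read `count` elements starting at element `start` into `buffer`.
 * Returns 0 on success, -1 if the requested range is out of bounds. */
int demo_read_chunk(const double *dataset, size_t dataset_size,
                    double *buffer, size_t start, size_t count)
{
    assert(dataset != NULL && buffer != NULL);
    if (start > dataset_size || count > dataset_size - start)
        return -1;
    memcpy(buffer, dataset + start, count * sizeof *buffer);
    return 0;
}
```

Each process can write its own chunk independently and in any order, which matches the design principle that the host code keeps its own data distribution.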

Coming soon...

#### Validation

There are basically two types of validation that can be performed on the content of the group: correctness and completeness. For a group to be considered as obeying the ESCDF specifications it must be both complete and correct.

The content of the group is considered to be correct if all the pieces of metadata and data that are set or present are correct. The correctness of some metadata or data may depend on the values of other metadata or data. In that case, those checks should only be performed when all the corresponding metadata and/or data are present. Note that if a mandatory piece of metadata or data is not present, the file will never be considered valid, as it will fail the completeness test.

The content of the group is said to be complete if all the attributes and datasets that the specifications say are mandatory are set or present.

Therefore, the library provides these two functions:

• bool escdf_system_is_correct(escdf_system_t *system) This function checks that all the data and metadata that are set are correct, that is, that they satisfy all the ranges, constraints, dimensions, etc. that are mentioned in the ESCDF specifications. If two pieces of data/metadata have some sort of dependence, then that dependence is only checked if both are present/set.

• bool escdf_system_is_complete(escdf_system_t *system) This function checks that all the attributes and datasets that the specifications say are mandatory are set or present.
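The two checks can be sketched on a toy group with two metadata entries. The `demo_` names, the entries `number_of_species` and `number_of_sites`, and the rules applied to them are invented for this illustration and are not taken from the ESCDF specifications. The combined `demo_system_validate` function reports an incorrect group before an incomplete one, which is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned int number_of_species;   /* hypothetical metadata */
    bool number_of_species_is_set;
    unsigned int number_of_sites;     /* hypothetical metadata */
    bool number_of_sites_is_set;
} demo_system_t;

typedef enum {
    DEMO_SUCCESS = 0,
    DEMO_ERROR_INCORRECT,
    DEMO_ERROR_INCOMPLETE
} demo_errno_t;

/* Correctness: only what is set is checked. */
bool demo_system_is_correct(const demo_system_t *s)
{
    assert(s != NULL);
    /* Individual range checks (illustrative rules). */
    if (s->number_of_species_is_set && s->number_of_species < 1)
        return false;
    if (s->number_of_sites_is_set && s->number_of_sites < 1)
        return false;
    /* Dependency check, performed only when both entries are set. */
    if (s->number_of_species_is_set && s->number_of_sites_is_set &&
        s->number_of_sites < s->number_of_species)
        return false;
    return true;
}

/* Completeness: every mandatory entry must be set. */
bool demo_system_is_complete(const demo_system_t *s)
{
    assert(s != NULL);
    return s->number_of_species_is_set && s->number_of_sites_is_set;
}

/* Combine both checks into one status, as a close function might do. */
demo_errno_t demo_system_validate(const demo_system_t *s)
{
    if (!demo_system_is_correct(s))
        return DEMO_ERROR_INCORRECT;
    if (!demo_system_is_complete(s))
        return DEMO_ERROR_INCOMPLETE;
    return DEMO_SUCCESS;
}
```

An empty group is correct but incomplete under these rules, which is why a group can be written piece by piece and validated only at the end.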