This protocol automates a repeated calculation on a predefined data set of systems and evaluates statistics over the set. The data set is described by a YAML file containing the definitions of the systems, the general setup of the calculations to be performed (most importantly the protocol to be applied to each item), and the reference values. Some data sets are provided with Cuby; user-defined data sets can be used by providing a valid path to a YAML file instead of the name of a predefined data set.
The entries in a data set can be divided into groups and individually tagged. It is possible to calculate only a part of the data set; the selection is defined by the keywords dataset_select_... and dataset_skip_....
The individual calculations can be executed in parallel to reduce the overall time.
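As a minimal sketch of a parallel run, the input below assumes that the number of concurrent calculations is set with Cuby's general cuby_threads keyword. That keyword is not named in this section, so verify it against the keyword reference before relying on it. The data set and the calculation setup are the ones used in the GMTKN55 example in this document.

```yaml
job: dataset
dataset: GMTKN_PCONF
# Assumption: cuby_threads sets how many items are calculated at once
cuby_threads: 4
calculation:
  job: energy
  interface: mopac
  method: pm6
```

The value 4 is only illustrative; a sensible choice depends on the number of available cores and on whether the external program is itself parallelized.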
The R160x6 data set contained wrong reference values and has been withdrawn from Cuby until the issue is fixed.
By default, Cuby contains the following data sets:
New, large data sets from the Non-Covalent Interactions Atlas project.
The GMTKN55 collection of data sets by S. Grimme is available in Cuby. The original data were converted automatically to the format Cuby uses; as a result, the data sets lack some conveniences such as descriptive names of the systems. The conversion was validated by comparing calculations in Cuby to the DFT results from the original paper, and in all data sets no or negligible difference was observed.
Calculation setup: All the entries in GMTKN55 (and in GMTKN30, listed below) are calculated using the reaction protocol. Because of this, the calculation setup must be provided in a separate block in the input named 'calculation' rather than at the root level. Here is an example:
job: dataset
dataset: GMTKN_PCONF
calculation:
  job: energy
  interface: mopac
  method: pm6
Although superseded by GMTKN55, the GMTKN30 data sets are also kept in Cuby for backward compatibility. These were previously named just GMTKN. Please note that data sets with the same name may use different reference data in GMTKN30 and GMTKN55. The data sets were validated against the original DFT results by Grimme (with the exception of G21EA and WATER27, for which the published data were calculated in a modified basis set). Only in the SIE11 data set is there one point (the last entry) where our result does not agree with Grimme's DFT data (but is closer to the reference).
The data set definition file may contain additional sets of reference values, such as energies calculated with other methods or the results of an energy decomposition. This may include later, more accurate recalculations of the benchmark values; unless explicitly noted, the main reference comes from the original publication where the data set was introduced. These additional data are not covered in the documentation yet but can be found in the data set files.
To use the alternative reference data, use the keyword dataset_reference.
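The following input is a hedged sketch of how dataset_reference might be used. The value alternative_reference is a placeholder, not a name known to exist in any shipped data set; the actual names of the additional reference sets must be looked up in the respective data set file. It is also assumed here that the keyword belongs at the root level together with the other dataset keywords.

```yaml
job: dataset
dataset: GMTKN_PCONF
# Placeholder name: replace with a reference set defined in the data set file
dataset_reference: alternative_reference
calculation:
  job: energy
  interface: mopac
  method: pm6
```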
Use an existing data set file (located in cuby4/data/datasets) as a template. The file can be located anywhere; just provide a valid path to it in the dataset keyword. The default data sets use geometries from Cuby's library, but geometry files can be used as well: the record 'geometry' in the data set file is treated the same way as the geometry keyword.
A simple data set calculation can be run directly on a set of geometry files by setting the dataset keyword to the value 'from_files'. Here is an example:
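For illustration, an input using a user-defined data set could look like the sketch below. The path is hypothetical, and the example assumes that the items in the custom file use a protocol that takes its setup from the root level of the input (data sets built on the reaction protocol need the separate 'calculation' block instead).

```yaml
job: dataset
# Hypothetical path to a user-defined data set file
dataset: /home/user/datasets/my_dataset.yaml
interface: mopac
method: pm6
```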
job: dataset
dataset: from_files
# Selection of geometry files to be used, shell wildcards allowed
dataset_from_files: "*.xyz"
# What protocol to use for the items
dataset_from_files_job: energy
# Optionally, reference energies can be read from a table
dataset_from_files_reference: "energies.txt"
interface: mopac
method: pm6