This protocol automates a repeated calculation on a predefined data set of systems and evaluates statistics over the set. The data set is described by a YAML file containing the definitions of the systems, the general setup of the calculations to be performed (most importantly, the protocol to be applied to each item), and the reference values. Some data sets are provided with Cuby; user-defined data sets can be used by providing a valid path to a YAML file instead of the name of a predefined data set.
The entries in a data set can be divided into groups and individually tagged. It is possible to calculate only a part of the data set; the selection is defined by the keywords dataset_select_... and dataset_skip_....
The individual calculations can be executed in parallel to reduce the overall time.
The R160x6 data set contained incorrect reference values and has been withdrawn from Cuby until the issue is fixed.
By default, Cuby contains the following data sets:
New, large data sets from the Non-Covalent Interactions Atlas project.
NCIA_D1200 | London dispersion in an extended chemical space[118] |
NCIA_D442x10 | London dispersion in an extended chemical space, 10-point dissociation curves[119] |
NCIA_HB300SPXx10 | CCSD(T)/CBS interaction energies of H-bonds featuring S, P and halogens, 10-point dissociation curves[120] |
NCIA_HB375x10 | CCSD(T)/CBS interaction energies of H-bonds and decoys, 10-point dissociation curves[121] |
NCIA_IHB100x10 | CCSD(T)/CBS interaction energies of ionic H-bonds, 10-point dissociation curves[122] |
NCIA_Rep739x5 | CCSD(T)/CBS interaction energies for repulsive contacts in extended chemical space[123] |
NCIA_SH250x10 | Sigma-hole interactions, 10-point dissociation curves[124] |
3B69 | CCSD(T)/CBS three-body energies in 23x3 trimers[1] |
3B69_dimers | All dimers from the 3B69 set of trimers[2] |
A24 | Accurate CCSD(T)/CBS interaction energies in small noncovalent complexes[3] |
Bauza2013 | Halogen, chalcogen and pnicogen bonds[4] |
Charge_transfer | CCSD(T)/CBS interaction energies in charge-transfer complexes[5][6] |
Dipoles152 | Benchmark CCSD(T)/CBS dipole moments in fixed equilibrium geometries[7] |
HB104 | Diverse set of hydrogen bonds of O and N in organic molecules[113][114] |
Ionic_H-bonds | Ionic hydrogen bonds - dissociation curves[115] |
L7 | CCSD(T) or QCISD(T) interaction energies in large noncovalent complexes[116] |
MPCONF196 | Conformation energies of peptides and macrocyclic compounds[117] |
Pecina2015 | Chalcogen and pnicogen bonds of heteroboranes[125] |
Peptide_FGG | CCSD(T)/CBS conformation energies of FGG tripeptide[126] |
Peptide_GFA | CCSD(T)/CBS conformation energies of GFA tripeptide[127] |
Peptide_GGF | CCSD(T)/CBS conformation energies of GGF tripeptide[128] |
Peptide_WG | CCSD(T)/CBS conformation energies of WG dipeptide[129] |
Peptide_WGG | CCSD(T)/CBS conformation energies of WGG tripeptide[130] |
PLFrag547 | PLFrag547 - Protein-ligand fragments[131] |
R160x6 | Repulsive intermolecular contacts in organic molecules[132] |
S12L | Interaction energies in large noncovalent complexes derived from experiment[133] |
S66 | CCSD(T)/CBS interaction energies in organic noncovalent complexes[134][135] |
S66a8 | CCSD(T)/CBS interaction energies in organic noncovalent complexes - angular displacements[136] |
S66x8 | CCSD(T)/CBS interaction energies in organic noncovalent complexes - dissociation curves[137] |
Sulfur_x8 | CCSD(T)/CBS interaction energies in complexes featuring sulfur[138] |
W4-17 | High-level theoretical atomization energies[139] |
X40 | CCSD(T)/CBS interaction energies of halogenated molecules[140] |
X40x10 | CCSD(T)/CBS interaction energies of halogenated molecules - dissociation curves[141] |
The GMTKN55 collection of data sets by S. Grimme is available in Cuby. The original data were converted automatically to the format Cuby uses; as a result, the data sets lack some conveniences such as descriptive names of the systems. The conversion was validated by comparing calculations in Cuby to the DFT results from the original paper; in all data sets, no difference or only a negligible one was observed.
GMTKN55_ACONF | Relative energies of alkane conformers[47] |
GMTKN55_ADIM6 | Interaction energies of n-alkane dimers[48] |
GMTKN55_AHB21 | Interaction energies in anion–neutral dimers[49] |
GMTKN55_AL2X6 | Dimerisation energies of AlX3 compounds[50] |
GMTKN55_ALK8 | Dissociation and other reactions of alkaline compounds[51] |
GMTKN55_ALKBDE10 | Dissociation energies in group-1 and -2 diatomics[52] |
GMTKN55_Amino20x4 | Relative energies in amino acid conformers[53] |
GMTKN55_BH76 | Barrier heights of hydrogen transfer, heavy atom transfer, nucleophilic substitution, unimolecular and association reactions[54] |
GMTKN55_BH76RC | Reaction energies of the BH76[55] |
GMTKN55_BHDIV10 | Diverse reaction barrier heights[56] |
GMTKN55_BHPERI | Barrier heights of pericyclic reactions[57] |
GMTKN55_BHROT27 | Barrier heights for rotation around single bonds[58] |
GMTKN55_BSR36 | Bond-separation reactions of saturated hydrocarbons[59] |
GMTKN55_BUT14DIOL | Relative energies in butane-1,4-diol conformers[60] |
GMTKN55_C60ISO | Relative energies between C60 isomers[61] |
GMTKN55_CARBHB12 | Hydrogen-bonded complexes between carbene analogues and H2O, NH3, or HCl[62] |
GMTKN55_CDIE20 | Double-bond isomerisation energies in cyclic systems[63] |
GMTKN55_CHB6 | Interaction energies in cation–neutral dimers[64] |
GMTKN55_DARC | Reaction energies of Diels-Alder reactions[65] |
GMTKN55_DC13 | 13 difficult cases for DFT methods[66][67] |
GMTKN55_DIPCS10 | Double-ionisation potentials of closed-shell systems[68] |
GMTKN55_FH51 | Reaction energies in various (in-)organic systems[69][70] |
GMTKN55_G21EA | Adiabatic electron affinities[71] |
GMTKN55_G21IP | Adiabatic ionization potentials[72] |
GMTKN55_G2RC | Reaction energies of selected G2/97 systems[73] |
GMTKN55_HAL59 | Binding energies in halogenated dimers (incl. halogen bonds)[74][75] |
GMTKN55_HEAVY28 | Noncovalent interaction energies between heavy element hydrides[76] |
GMTKN55_HEAVYSB11 | Dissociation energies in heavy-element compounds[77] |
GMTKN55_ICONF | Relative energies in conformers of inorganic systems[78] |
GMTKN55_IDISP | Intramolecular dispersion interactions[79] |
GMTKN55_IL16 | Interaction energies in anion–cation dimers[80] |
GMTKN55_INV24 | Inversion/racemisation barrier heights[81] |
GMTKN55_ISO34 | Isomerisation energies of small and medium-sized organic molecules[82] |
GMTKN55_ISOL24 | Isomerisation energies of large organic molecules[83][84] |
GMTKN55_MB16-43 | Decomposition energies of artificial molecules[85] |
GMTKN55_MCONF | Relative energies in melatonin conformers[86] |
GMTKN55_NBPRC | Oligomerisations and H2 fragmentations of NH3/BH3 systems, H2 activation reactions with PH3/BH3 systems[87] |
GMTKN55_PA26 | Adiabatic proton affinities (incl. of amino acids)[88][89][90] |
GMTKN55_PArel | Relative energies in protonated isomers[91] |
GMTKN55_PCONF21 | Relative energies in tri- and tetrapeptide conformers[92][93][94] |
GMTKN55_PNICO23 | Interaction energies in pnicogen-containing dimers[95] |
GMTKN55_PX13 | Proton-exchange barriers in H2O, NH3, and HF clusters[96] |
GMTKN55_RC21 | Fragmentations and rearrangements in radical cations[97] |
GMTKN55_RG18 | Interaction energies in rare-gas complexes[98] |
GMTKN55_RSE43 | Radical-stabilisation energies[99] |
GMTKN55_S22 | Binding energies of noncovalently bound dimers[100] |
GMTKN55_S66 | Binding energies of noncovalently bound dimers[101] |
GMTKN55_SCONF | Relative energies of sugar conformers[102] |
GMTKN55_SIE4x4 | Self-interaction-error related problems[103] |
GMTKN55_TAUT15 | Relative energies in tautomers[104] |
GMTKN55_UPU23 | Relative energies between RNA-backbone conformers[105][106] |
GMTKN55_W4-11 | Total atomisation energies[107] |
GMTKN55_WATER27 | Binding energies in (H2O)n, H+(H2O)n and OH-(H2O)n[108][109] |
GMTKN55_WCPT18 | Proton-transfer barriers in uncatalysed and water-catalysed reactions[110] |
GMTKN55_YBDE18 | Bond-dissociation energies in ylides[111][112] |
Calculation setup: All the entries in the GMTKN55 data sets (and in GMTKN30, listed below) are calculated using the reaction protocol. Because of this, the calculation setup must be provided in a separate block named 'calculation' in the input rather than at the root level. Here is an example:
job: dataset
dataset: GMTKN30_PCONF
calculation:
  job: energy
  interface: mopac
  method: pm6
Although superseded by GMTKN55, the GMTKN30 data sets are also kept in Cuby for backward compatibility. These were previously named just GMTKN. Please note that data sets with the same name may use different reference data in GMTKN30 and GMTKN55. The data sets were validated against the original DFT results by Grimme (with the exception of G21EA and WATER27, for which the published data were calculated in a modified basis set). Only in the SIE11 data set is there one point (the last entry) where our result does not agree with Grimme's DFT data (but is closer to the reference).
GMTKN30_ACONF | relative energies of alkane conformers[8] |
GMTKN30_ADIM6 | interaction energies of n-alkane dimers[9] |
GMTKN30_AL2X | dimerization energies of AlX3 compounds[10] |
GMTKN30_ALK6 | fragmentation and dissociation reactions of alkaline and alkaline−cation−benzene complexes[11] |
GMTKN30_BH76 | barrier heights of hydrogen transfer, heavy atom transfer, nucleophilic substitution, unimolecular, and association reactions[12][13] |
GMTKN30_BH76RC | reaction energies of the BH76 set[14][15] |
GMTKN30_BHPERI | barrier heights of pericyclic reactions[16] |
GMTKN30_BSR36 | bond separation reactions of saturated hydrocarbons[17][18] |
GMTKN30_CYCONF | relative energies of cysteine conformers[19] |
GMTKN30_DARC | reaction energies of Diels−Alder reactions[20] |
GMTKN30_DC9 | nine difficult cases for DFT[21] |
GMTKN30_G21EA | adiabatic electron affinities[22] |
GMTKN30_G21IP | adiabatic ionization potentials[23] |
GMTKN30_G2RC | reaction energies of selected G2-97 systems[24] |
GMTKN30_HEAVY28 | noncovalent interaction energies between heavy element hydrides[25] |
GMTKN30_IDISP | intramolecular dispersion interactions[26][27] |
GMTKN30_ISO34 | isomerization energies of small and medium-sized organic molecules[28] |
GMTKN30_ISOL22 | isomerization energies of large organic molecules[29] |
GMTKN30_MB08-165 | decomposition energies of artificial molecules[30][31] |
GMTKN30_NBPRC | oligomerizations and H2 fragmentations of NH3-BH3 systems; H2 activation reactions with PH3-BH3 systems[32][33] |
GMTKN30_O3ADD6 | reaction energies, barrier heights, association energies for addition of O3 to C2H4 and C2H2[34] |
GMTKN30_PA | adiabatic proton affinities[35][36] |
GMTKN30_PCONF | relative energies of phenylalanyl−glycyl−glycine tripeptide conformers[37] |
GMTKN30_RG6 | interaction energies of rare gas dimers[38] |
GMTKN30_RSE43 | radical stabilization energies[39] |
GMTKN30_S22 | binding energies of noncovalently bound dimers[40][41] |
GMTKN30_SCONF | relative energies of sugar conformers[42][43] |
GMTKN30_SIE11 | self-interaction error related problems[44] |
GMTKN30_W4-08 | atomization energies of small molecules[45] |
GMTKN30_WATER27 | binding energies of water, H+(H2O)n and OH−(H2O)n clusters[46] |
The data set definition file may contain additional sets of reference values such as energies calculated with other methods or e.g. results of an energy decomposition. This may include later, more accurate recalculations of the benchmark values – the main reference comes from the original publication where the data set was introduced (unless explicitly noted). These additional data are not covered in the documentation yet but can be found in the data set files.
To use the alternative reference data, use the keyword dataset_reference.
Use an existing data set file (located in cuby4/data/datasets) as a template. The file can be located anywhere; just provide a valid path to it in the dataset keyword. The default data sets use geometries from Cuby's library, but geometry files can be used as well: the record 'geometry' in the data set file is treated the same way as the geometry keyword.
A simple data set calculation can be run just on a bunch of geometry files by setting the dataset keyword to value 'from_files'. Here is an example:
job: dataset
dataset: from_files
# Selection of geometry files to be used, shell wildcards allowed
dataset_from_files: "*.xyz"
# What protocol to use for the items
dataset_from_files_job: energy
# Optionally, reference energies can be read from a table
dataset_from_files_reference: "energies.txt"
interface: mopac
method: pm6
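To preview which files a shell wildcard such as "*.xyz" would pick up, the pattern can be tested outside Cuby. The following Python sketch is only an illustration: the file names are hypothetical, it assumes that the pattern is expanded the way a POSIX shell would expand it, and it does not call any Cuby code.

```python
import fnmatch

def match_geometry_files(file_names, pattern):
    """Return the names matching a shell-style wildcard, sorted alphabetically."""
    # fnmatchcase is case-sensitive on every platform, like a POSIX shell
    return sorted(name for name in file_names if fnmatch.fnmatchcase(name, pattern))

# Hypothetical directory listing (not taken from the Cuby examples)
directory_listing = ["water.xyz", "benzene.xyz", "energies.txt", "notes.md"]

selected = match_geometry_files(directory_listing, "*.xyz")
print(selected)
```

Only the two .xyz files pass the filter; the reference table energies.txt is not treated as a geometry.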
The following examples, along with all other files needed to run them, can be found in the directory cuby4/protocols/dataset/examples
#===============================================================================
# Dataset example 1: Calculation on a predefined data set
#===============================================================================
job: dataset
#-------------------------------------------------------------------------------
# Dataset selection
#-------------------------------------------------------------------------------
# Predefined data set is used, only the name of the set has to be provided
dataset: A24
#-------------------------------------------------------------------------------
# Calculation setup
#-------------------------------------------------------------------------------
# Interface and method of the calculation are specified; the appropriate protocol
# (in this case interaction energy calculation) is chosen for each data set
# automatically
interface: mopac
method: pm6
Produces output:
        _______
       /\______\
      / /      /
     / / Cuby /   Dataset calculation
     \/______/

====================================================================
name                                 E      Eref     error  error(%)
--------------------------------------------------------------------
01 water ... ammonia            -3.904    -6.493     2.590    39.879
02 water dimer                  -3.922    -5.006     1.084    21.653
03 HCN dimer                    -2.537    -4.745     2.208    46.535
04 HF dimer                      3.515    -4.581     8.096   176.722
05 ammonia dimer                -2.333    -3.137     0.804    25.624
06 HF ... methane               -0.336    -1.654     1.318    79.664
07 ammonia ... methane          -0.544    -0.765     0.221    28.895
08 water ... methane            -0.505    -0.663     0.158    23.836
09 formaldehyde dimer           -3.788    -4.554     0.766    16.826
10 water ... ethene             -1.272    -2.557     1.285    50.269
11 formaldehyde ... ethene      -0.614    -1.621     1.007    62.145
12 ethyne dimer                 -0.463    -1.524     1.061    69.609
13 ammonia ... ethene           -0.756    -1.374     0.618    44.996
14 ethene dimer                 -0.307    -1.090     0.784    71.884
15 methane ... ethene           -0.176    -0.502     0.326    64.944
16 borane ... methane           -1.124    -1.485     0.360    24.280
17 methane ... ethane           -0.154    -0.827     0.673    81.353
18 methane ... ethane           -0.129    -0.607     0.478    78.711
19 methane dimer                -0.070    -0.533     0.463    86.895
20 Ar ... methane                0.758    -0.405     1.162   287.292
21 Ar ... ethene                 0.511    -0.364     0.876   240.349
22 ethene ... ethyne             0.128     0.821    -0.693   -84.379
23 ethene dimer                  0.149     0.934    -0.785   -84.047
24 ethyne dimer                  0.202     1.115    -0.913   -81.868
====================================================================
RMSE         1.951 kcal/mol
MUE          1.197 kcal/mol
--------------------------------------------------------------------
MSE          0.998 kcal/mol
min         -0.913 kcal/mol
max          8.096 kcal/mol
range        9.009 kcal/mol
min abs      0.158 kcal/mol
max abs      8.096 kcal/mol
====================================================================
RMSE       101.844 %
MUE         78.027 %
MSE         57.169 %
min        -84.379 %
max        287.292 %
range      371.671 %
min abs     16.826 %
max abs    287.292 %
====================================================================
H-bond (5)      RMSE     3.974   MSE     2.956 kcal/mol
dispersion (7)  RMSE     0.681   MSE     0.620 kcal/mol
other (9)       RMSE     0.894   MSE     0.802 kcal/mol
stack (3)       RMSE     0.802   MSE    -0.797 kcal/mol
====================================================================
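The summary statistics in this output follow from the per-item errors. The Python sketch below is not Cuby's implementation. It assumes the usual definitions: error = E - Eref (which matches the signs of the errors listed above), and RMSE, MUE and MSE as the root mean square, mean unsigned and mean signed error. The function name error_statistics is made up for this illustration, and only the first three A24 items are used as input.

```python
import math

def error_statistics(calculated, reference):
    """Summary statistics of the errors E - Eref (same units as the input)."""
    errors = [e - e_ref for e, e_ref in zip(calculated, reference)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(err * err for err in errors) / n),  # root mean square error
        "MUE": sum(abs(err) for err in errors) / n,               # mean unsigned error
        "MSE": sum(errors) / n,                                   # mean signed error
        "min": min(errors),
        "max": max(errors),
        "range": max(errors) - min(errors),
    }

# E and Eref of the first three items in the A24 output (kcal/mol)
energies = [-3.904, -3.922, -2.537]
references = [-6.493, -5.006, -4.745]

for label, value in error_statistics(energies, references).items():
    print(f"{label:6s}{value:10.3f} kcal/mol")
```

Because only three items enter the calculation, the printed values differ from the statistics of the full data set.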
#===============================================================================
# Dataset example 2: selections and plotting
#===============================================================================
job: dataset
#-------------------------------------------------------------------------------
# Dataset selection
#-------------------------------------------------------------------------------
dataset: S66x8 # Dissociation curves for the S66 data set
#-------------------------------------------------------------------------------
# Selection
#-------------------------------------------------------------------------------
# select only pi-pi dispersion-bound complexes
dataset_select_tag: "dispersion p-p"
#-------------------------------------------------------------------------------
# Plotting
#-------------------------------------------------------------------------------
# Plot the dissociation curves using gnuplot and merge the images to one file
# with four columns. This requires two external tools installed, gnuplot and
# imagemagick.
dataset_save_plots: gnuplot_tiled
dataset_plot_columns: 4
#-------------------------------------------------------------------------------
# Calculation setup
#-------------------------------------------------------------------------------
interface: mopac
method: pm6
#===============================================================================
# Dataset example 3: Custom calculation of each item
#===============================================================================
# By default, the data set contains information on what calculation protocol
# is applied to each of its items. In this example, we use the S66 data set
# where the calculated quantity is interaction energy in a fixed geometry.
# This example shows how to override that and perform a custom calculation,
# in this case optimizing the geometry of the complex with the
# tested method before the interaction energy is calculated. Additionally,
# the change of the geometry is measured as RMSD and printed.
job: dataset
dataset: S66
dataset_select_name: "^0[1-4]" # only first four items from the data set are used
# The block calculation_overwrite allows defining a custom calculation that is
# performed instead of the default one
calculation_overwrite:
  # The multistep protocol allows running the optimization followed by
  # interaction energy calculation.
  # The multistep protocol returns the result of the last calculation
  # which, in this case yields the quantity we are looking for,
  # the interaction energy.
  job: multistep
  steps: clean, opt, rmsd, int
  # Common setup for all calculations
  calculation_common:
    interface: mopac
    method: pm6
  # Cleanup: remove the old optimized geometry
  calculation_clean:
    job: shell_script
    shell_commands: "rm -f optimized.xyz"
  # Optimize geometry of each item in data set
  calculation_opt:
    job: optimize
    geometry: parent_block # The geometry is defined one level above
    opt_quality: 0.1
    optimizer: lbfgs
    optimize_print: steps_as_dots # Simplified printing of steps
  # Calculation of RMSD upon optimization
  calculation_rmsd:
    job: geometry
    geometry_action: rmsd_fit
    geometry: parent_block
    geometry2: optimized.xyz
  # Calculate interaction energy in the optimized geometry
  calculation_int:
    job: interaction
    geometry: optimized.xyz
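The value of dataset_select_name in this example, "^0[1-4]", is a regular expression. The Python sketch below shows which items such a pattern picks. It is an illustration only: it assumes that the expression is searched in item names starting with a two-digit index (as the names in the A24 output of example 1 do), and the helper select_by_name is hypothetical, not part of Cuby.

```python
import re

def select_by_name(names, pattern):
    """Keep the names in which the regular expression is found."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Item names in the style of the A24 output shown in example 1
names = [
    "01 water ... ammonia",
    "02 water dimer",
    "03 HCN dimer",
    "04 HF dimer",
    "05 ammonia dimer",
    "10 water ... ethene",
]

selected = select_by_name(names, "^0[1-4]")
print(selected)
```

The anchor ^ ties the match to the start of the name, so "10 water ... ethene" is rejected even though it contains a 0 followed by a space, and "05 ammonia dimer" fails the character class [1-4].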
#===============================================================================
# Dataset example 4: Combining multiple data sets
#===============================================================================
# While it is not possible to combine multiple data sets within the data set
# protocol, the multistep protocol can be used to achieve this. The final
# result, in this case root mean square error in the two data sets, has to be
# calculated from the output of the two steps via a user-defined expression.
# Two data sets are calculated separately using the multistep protocol
job: multistep
steps: set1, set2
# The calculation setup is the same for both steps
calculation_common:
  job: dataset
  interface: mopac
  method: pm6
calculation_set1:
  dataset: s66
calculation_set2:
  dataset: x40
# Calculating the final error in the two data sets requires the knowledge of the
# structure of the objects containing the results of the individual steps.
# In the case of RMSE, we cannot get it from the RMSEs of the two data sets, but it
# can be calculated from the sums of squares as follows:
multistep_result_expression: "((steps['set1'].errors.sumsq + steps['set2'].errors.sumsq)/(steps['set1'].errors.count + steps['set2'].errors.count))**0.5"
# The final results can have an arbitrary name which will be printed in the output
multistep_result_name: "RMSE"
# If more than one combined result is needed, they can be evaluated in a custom
# code inserted into the input using the keyword multistep_result_eval.
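Why the combined RMSE has to be built from the sums of squares can be checked with a few numbers. The following Python sketch uses made-up error values and plain lists instead of Cuby's result objects; only the arithmetic mirrors the expression given in multistep_result_expression.

```python
import math

def rmse(errors):
    """Root mean square error of one set."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def combined_rmse(*error_sets):
    """RMSE over several sets, from their sums of squares and item counts."""
    total_sumsq = sum(sum(e * e for e in errors) for errors in error_sets)
    total_count = sum(len(errors) for errors in error_sets)
    return math.sqrt(total_sumsq / total_count)

# Hypothetical errors of two data sets with different sizes
set1_errors = [1.0, -1.0]
set2_errors = [3.0, -3.0, 3.0, -3.0]

combined = combined_rmse(set1_errors, set2_errors)
naive_mean = (rmse(set1_errors) + rmse(set2_errors)) / 2

print(f"combined RMSE:            {combined:.3f}")
print(f"mean of the two RMSEs:    {naive_mean:.3f}")
print(f"RMSE of the merged lists: {rmse(set1_errors + set2_errors):.3f}")
```

The combined value agrees with the RMSE of the merged error list, whereas averaging the two RMSEs gives a different number because it ignores the set sizes and the squaring.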
#===============================================================================
# Dataset example 5: Datasets from the GMTKN database
#===============================================================================
# All the data sets from the GMTKN database are calculated using the protocol
# "reaction". This protocol is set up automatically, but its use has one
# implication: the setup for the computational method should not be provided
# at the root level of the input, but in a block "calculation".
job: dataset
dataset: GMTKN30_PCONF
# Unlike other data sets, the method is specified in a separate block:
calculation:
  job: energy
  interface: mopac
  method: pm6