The repository contains datasets and scripts related to the analysis of electron density pseudosymmetry of atoms and atom types in the MATTS2021 data bank (Jha et al., 2022; Rybicka et al., 2022), evaluated for many different local coordinate system (LCS) types and orientations.
Electron density pseudosymmetry was assigned based solely on multipole parameters (κ, κ′, Pval, and Plm) using symmetry selection rules (Kurki-Suonio, 1977), first at the level of individual LCS orientations, then per LCS type, and lastly as a final symmetry for each atom and atom type. Each symmetry, together with its associated LCS orientation, leads to a specific set of multipolar functions that should vanish, i.e. their populations (Plm values) should be equal to zero (Kurki-Suonio, 1977). The provided files include symmetry assignments, statistical summaries, custom Bash and Python scripts, visualizations of parameter distributions, and comparisons of pseudosymmetry assignments between atoms, atom types, and symmetries in MATTS2021.
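As a minimal illustration of how selection rules translate into vanishing Plm populations, the Python sketch below encodes a few standard rules for real spherical harmonics with the symmetry element referred to the local z axis. It is not taken from the repository scripts: the rule labels, function names, and data layout are hypothetical, only a small subset of symmetries is covered, and the orientation conventions (rotation axis along z, mirror plane in xz) are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# A multipolar function is identified by (l, m, sign): sign '+' marks the
# cos(m*phi)-type and '-' the sin(m*phi)-type real spherical harmonic.
Term = Tuple[int, int, str]

# Illustrative subset of selection rules (hypothetical labels).
# Each rule returns True when the function is ALLOWED by the symmetry.
RULES: Dict[str, Callable[[int, int, str], bool]] = {
    "1": lambda l, m, s: True,                          # no symmetry
    "-1": lambda l, m, s: l % 2 == 0,                   # inversion centre
    "2_z": lambda l, m, s: m % 2 == 0,                  # twofold axis along z
    "m_perp_z": lambda l, m, s: (l + m) % 2 == 0,       # mirror plane normal to z
    "m_xz": lambda l, m, s: s == "+",                   # mirror plane xz
    "mm2_z": lambda l, m, s: m % 2 == 0 and s == "+",   # 2 along z, mirrors xz and yz
    "cylindrical_z": lambda l, m, s: m == 0,            # cylindrical about z
}


def all_terms(l_max: int = 4) -> List[Term]:
    """List all (l, m, sign) functions for l = 1..l_max."""
    terms: List[Term] = []
    for l in range(1, l_max + 1):
        terms.append((l, 0, "+"))
        for m in range(1, l + 1):
            terms.append((l, m, "+"))
            terms.append((l, m, "-"))
    return terms


def vanishing_terms(symmetry: str, l_max: int = 4) -> List[Term]:
    """Functions whose populations should be zero under the given symmetry."""
    allowed = RULES[symmetry]
    return [t for t in all_terms(l_max) if not allowed(*t)]


def is_compatible(plm: Dict[Term, float], symmetry: str,
                  thresholds: Dict[Term, float], l_max: int = 4) -> bool:
    """True if every Plm that should vanish lies within its zero-value threshold.

    Missing populations are treated as 0.0 and missing thresholds as 0.0
    (both are assumptions of this sketch).
    """
    return all(
        abs(plm.get(t, 0.0)) <= thresholds.get(t, 0.0)
        for t in vanishing_terms(symmetry, l_max)
    )
```

For example, `vanishing_terms("m_perp_z", l_max=1)` returns only the `(1, 0, "+")` dipole, so an atom is compatible with that mirror plane at the dipole level when its P10 population falls within the chosen threshold.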
The refinement of the multipole model for model molecules was performed using the same set of crystal structures and the same general procedure as that used to construct the MATTS2021 data bank (Jha et al., 2022), with modifications applied at the stage of multipole model refinement when symmetry constraints were imposed on refined parameters. Two distinct datasets of refined multipolar models were created: one resulting from refinements with the original symmetry constraints used in the construction of the MATTS2021 data bank (ref-SC), and another from refinements with no symmetry constraints (ref-NSC). Removing symmetry constraints allows all Plm to be populated.
The primary data used for the determination of electron density pseudosymmetry consist of multipole model parameters calculated for atoms and atom types belonging to 14 topological subgroups. These subgroups are defined based on (a) the number of first neighbors and planarity, which together define a topological group (4n, 3n, 3p, 2p, or 1p), and (b) the chemical element (C, N, O, F, P, S, Cl, Br). Thus, the 14 topological subgroups are: 4n-C, 4n-N, 4n-P, 4n-S, 3n-N, 3p-C, 3p-N, 2p-N, 2p-O, 2p-S, 1p-F, 1p-Cl, and 1p-Br. Files specific to each topological group or subgroup include the group or subgroup name in the filename. The term “1p-halogens” refers to the combined set of 1p-F, 1p-Cl, and 1p-Br.
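The naming scheme described above can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and reading the suffix "p" as planar and "n" as non-planar is an assumption inferred from the group names.

```python
# Topological groups and elements named in the description above.
GROUPS = {"4n", "3n", "3p", "2p", "1p"}
ELEMENTS = {"C", "N", "O", "F", "P", "S", "Cl", "Br"}


def subgroup_label(n_neighbors: int, planar: bool, element: str) -> str:
    """Build a label such as '4n-C' or '3p-N' (hypothetical helper).

    Assumes 'p' marks a planar and 'n' a non-planar central atom.
    """
    group = f"{n_neighbors}{'p' if planar else 'n'}"
    if group not in GROUPS:
        raise ValueError(f"unknown topological group: {group}")
    if element not in ELEMENTS:
        raise ValueError(f"unsupported element: {element}")
    return f"{group}-{element}"
```

The helper checks only that the group and the element are each among those named in the text; it does not check that the resulting combination is one of the subgroups present in the data.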
For each subgroup, a universal atom type definition was created manually. These definitions were generalized with respect to neighbors, ring presence, and ring size; only the planarity of the central atom was explicitly defined. This approach ensured that all atoms fitting a given subgroup, including both recognized (i.e., those with corresponding atom types in MATTS2021) and unrecognized (i.e., those without such atom types) ones, were identified in model molecules. Subsequently, topological-group-specific Bash scripts (1_make-definitions-{group}.bash, provided in /scripts/LCS/) generated new definitions of universal atom types for all considered LCS orientations, producing .txt files as output. These .txt files are provided in the /universal-definitions directory.
The bankMaker utility from the DiSCaMB library (Chodkiewicz et al., 2018), a local Bash script ROTATE.bash (provided in the /scripts/LCS directory), and a dataset of refined multipolar models (ref-SC or ref-NSC) were used to calculate multipole parameters for every considered LCS orientation for each atom. For atom types, the same procedure was applied, but using the original MATTS2021 data bank atom type definitions in an unchanged order, provided in files generated by subgroup-specific local Bash scripts 2_make-little-banks-{subgroup}.bash (located in /scripts/LCS), instead of the universal atom type definitions. In both cases (atoms and atom types), all symmetries in the atom type definitions were set to “no”, preventing the enforcement of any symmetry higher than 1 and enabling all multipolar functions to be populated.
The resulting sets of multipole parameters were generated for all LCS orientations and combined into subgroup-specific files for further analysis using local Bash and Python scripts (3_get-data.bash and 4_make-csv.py, provided in /scripts/LCS). The following subgroups were included in the subsequent analysis: 4n-C, 3n-N, and 3p-N for ref-SC; and 4n-C, 4n-N, 4n-P, 4n-S, 3n-N, 3p-C, 3p-N, 2p-N, 2p-O, 2p-S, and 1p-halogens (combined 1p-F, 1p-Cl, and 1p-Br) for ref-NSC. Files for 4n-C, 3n-N, and 3p-N indicate the refinement strategy used (ref-SC or ref-NSC) in their filenames. For the remaining subgroups, this information is not indicated, as ref-NSC is the default. A detailed description of the purpose and usage of each of the aforementioned scripts is provided in the read_me file in the /scripts/LCS directory.
To assign pseudosymmetry based on the generated multipole parameters, it was necessary to determine when Plm values could be approximated as zero. For this purpose, zero-value thresholds were introduced and determined automatically using Gaussian Mixture Model (GMM) analysis (Zhuang et al., 1996) of Plm distributions. A custom Python script, gmm_threshold_analysis.py (provided in the /scripts/GMM directory), performed the analysis separately for each subgroup by automatically selecting the optimal number of Gaussian components (from 1 to 6 for each Plm parameter derived from ref-NSC atoms), calculating their weights, and identifying the dominant component. Plm parameters for which the mean of the dominant Gaussian component was greater than or equal to three times its standard deviation were excluded, as they were considered significantly different from zero. For each subgroup and LCS type, this procedure generated three outputs: histogram plots (*.png) showing Plm distributions with Gaussian fits; all_Plm_*.txt files containing statistics for the dominant Gaussian component for each Plm; and filtered_Plm_*.txt files including only the dominant Gaussian components used to define the zero-value thresholds. These files are available in the /GMM-analysis directory. A detailed description of the usage of the gmm_threshold_analysis.py script is provided in the read_me file in the /scripts/GMM directory.
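The selection step described above (pick the dominant Gaussian component, then exclude Plm parameters whose dominant component is centred significantly away from zero) can be sketched as follows. This is not the gmm_threshold_analysis.py script: the class and function names are hypothetical, the mixture is assumed to have been fitted beforehand by any GMM implementation, and comparing the absolute value of the mean with three standard deviations is an assumption about how the criterion treats negative means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """One fitted Gaussian component of a Plm distribution (hypothetical container)."""
    weight: float
    mean: float
    std: float


def dominant_component(components: List[Component]) -> Component:
    """Return the component with the largest mixture weight."""
    return max(components, key=lambda c: c.weight)


def zero_centred_dominant(components: List[Component],
                          n_sigma: float = 3.0) -> Optional[Component]:
    """Return the dominant component if it is compatible with zero, else None.

    A Plm is excluded (None) when |mean| >= n_sigma * std for the dominant
    component, i.e. when its main population differs significantly from zero.
    """
    dom = dominant_component(components)
    if abs(dom.mean) >= n_sigma * dom.std:
        return None
    return dom
```

With illustrative numbers, a fit with components (weight 0.7, mean 0.001, std 0.004) and (weight 0.3, mean 0.05, std 0.01) keeps the first component, whereas a fit dominated by (weight 0.8, mean 0.06, std 0.01) is excluded. How the retained component is converted into a numerical threshold is left out here, since the description above does not specify it.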
Finally, the generated multipole parameters and the established zero-value thresholds were used to assign electron density pseudosymmetry at the level of individual LCS orientations, per LCS type, and ultimately as a final symmetry for each atom and atom type. This stage employed numerous custom Bash and Python scripts, some of which are group- or subgroup-specific. These scripts are provided in the /scripts/PSEUDOSYMMETRY directory. A short description of the purpose of each script is given in the read_me file in the main directory, while more detailed usage instructions are provided in the read_me file in the /scripts/PSEUDOSYMMETRY directory.
The /atoms-and-atom-types-data directory contains *.ods files with atom- and atom type-level data for each subgroup (as indicated by the filenames), including multipole model parameters and pseudosymmetry assignments across all considered LCS orientations. These files were created manually by combining information obtained from the outputs of the Bash and Python scripts from /scripts/PSEUDOSYMMETRY at various stages of the analysis and interpretation. Detailed information about the contents of the *.ods files, including descriptions of each sheet and column, is provided in a read_me file in the main folder.
References:
Chodkiewicz, M. L., Migacz, S., Rudnicki, W., Makal, A., Kalinowski, J. A., Moriarty, N. W., Grosse-Kunstleve, R. W., Afonine, P. V., Adams, P. D. & Dominiak, P. M. (2018). J. Appl. Crystallogr. 51, 193–199.
Jha, K. K., Gruza, B., Sypko, A., Kumar, P., Chodkiewicz, M. L. & Dominiak, P. M. (2022). J. Chem. Inf. Model. 62, 3752–3765.
Kurki-Suonio, K. (1977). Isr. J. Chem. 16, 115–123.
Rybicka, P. M., Kulik, M., Chodkiewicz, M. L. & Dominiak, P. M. (2022). J. Chem. Inf. Model. 62, 3766–3783.
Zhuang, X., Huang, Y., Palaniappan, K. & Zhao, Y. (1996). IEEE Trans. Image Process. 5, 1293–1302.
(2025-12-31)