Subgroup discovery (SGD) is a local knowledge discovery method. It identifies subsets of a set of elements that 'stand out' with respect to some property of the elements.
How does it differ from clustering, then? My current understanding (need to check and update): compared to global methods like decision trees, regression, or compressed sensing, SGD is a local method, meaning it does not attempt to partition all elements of the parent set into subsets; rather, it tries to find subgroups that are of high quality with respect to the desired property.
It has been used to identify subgroup properties that contribute to one of the two crystal structures of 82 octet binaries (Ghiringhelli et al. 2015). SGD finds two subgroups whose elements have either a zinc blende structure or a rock salt structure (Goldsmith et al. 2017; Boley et al. 2017). For a review of SGD, see Atzmueller (2015).
New (~2024–25): GitHub Repo with a Python implementation; also see Lopez-Martinez-Carrasco et al. (2024). (I tried a Java implementation a few years ago; it was useful but a bit clunky.)
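As an illustration of what running SGD in Python looks like, here is a minimal sketch using the open-source `pysubgroup` package (my example, not necessarily the repo linked above). The calls (`ps.BinaryTarget`, `ps.create_selectors`, `ps.SubgroupDiscoveryTask`, `ps.BeamSearch`, `ps.WRAccQF`) follow its README as I remember it, so treat the exact signatures as assumptions to verify against the current docs.

```python
# Sketch: subgroup discovery with pysubgroup (pip install pysubgroup).
# Dataset loader and quality function are assumptions taken from the
# package's README; verify against the current documentation.
import pysubgroup as ps
from pysubgroup.datasets import get_titanic_data

data = get_titanic_data()                    # pandas DataFrame
target = ps.BinaryTarget('Survived', True)   # target property of interest
searchspace = ps.create_selectors(data, ignore=['Survived'])
task = ps.SubgroupDiscoveryTask(
    data, target, searchspace,
    result_set_size=5,   # keep the 5 best subgroups
    depth=2,             # conjunctions of at most 2 selectors
    qf=ps.WRAccQF())     # weighted relative accuracy quality function
result = ps.BeamSearch().execute(task)
print(result.to_dataframe())
```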
Given: Sample $S \subseteq P$, target variable $y: P \rightarrow \{a, b, c, \cdots\}$, and features $x_j: P \rightarrow X_j$
Define: Propositions $\Pi_x = \{\pi_1, \cdots, \pi_k\}$, selection language $\mathcal{L}_x = \{\sigma : \sigma(i) = \pi_{j_1}(i) \wedge \cdots \wedge \pi_{j_t}(i)\}$
Optimize: $f(Q) = \textrm{cov}(Q)^{\gamma}\, [\textrm{eff}(Q)]_{+}$ where $Q = \{i \in S : \sigma(i) = \textrm{True}\}$ (extension), $\textrm{cov}(Q) = |Q|/|S|$ (coverage), $\textrm{eff}(Q) = \frac{H_y(S) - H_y(Q)}{H_y(S)}$ (effect), and $H_y(Q) = -\sum_v p_Q(y=v) \log p_Q(y=v)$ (entropy).
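To make these definitions concrete, here is a small self-contained sketch that enumerates conjunctions of propositions and scores each extension $Q$ with $f(Q)$. The toy data, threshold propositions, and $\gamma = 0.5$ are illustrative assumptions, and the exhaustive enumeration stands in for the proper branch-and-bound or beam search used in real implementations.

```python
# Minimal sketch of the optimization above: exhaustive search over
# conjunctions of propositions, scored by f(Q) = cov(Q)^gamma * [eff(Q)]_+
# with an entropy-based effect. Data and propositions are toy assumptions.
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    """Shannon entropy H_y of a list of target values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def quality(Q_labels, S_labels, gamma=0.5):
    """f(Q) = cov(Q)^gamma * max(eff(Q), 0)."""
    if not Q_labels:
        return 0.0
    cov = len(Q_labels) / len(S_labels)
    H_S = entropy(S_labels)
    eff = (H_S - entropy(Q_labels)) / H_S if H_S > 0 else 0.0
    return cov**gamma * max(eff, 0.0)

# Toy sample S: each element has features x1, x2 and a target label y.
S = [
    {"x1": 1.0, "x2": 0.2, "y": "a"},
    {"x1": 0.9, "x2": 0.8, "y": "a"},
    {"x1": 0.1, "x2": 0.9, "y": "b"},
    {"x1": 0.2, "x2": 0.7, "y": "b"},
    {"x1": 0.8, "x2": 0.1, "y": "a"},
    {"x1": 0.3, "x2": 0.6, "y": "b"},
]
S_labels = [e["y"] for e in S]

# Propositions Pi_x: simple threshold predicates on the features.
propositions = {
    "x1>0.5": lambda e: e["x1"] > 0.5,
    "x2>0.5": lambda e: e["x2"] > 0.5,
    "x1<=0.5": lambda e: e["x1"] <= 0.5,
    "x2<=0.5": lambda e: e["x2"] <= 0.5,
}

# Selection language L_x: conjunctions of up to t propositions;
# enumerate all selectors and keep the best-scoring extension Q.
best = (0.0, None)
for t in (1, 2):
    for combo in combinations(propositions.items(), t):
        Q_labels = [e["y"] for e in S if all(p(e) for _, p in combo)]
        score = quality(Q_labels, S_labels)
        if score > best[0]:
            best = (score, " AND ".join(name for name, _ in combo))

print(f"best subgroup: {best[1]}  (f(Q) = {best[0]:.3f})")
```

On this toy sample the single proposition `x1>0.5` already isolates a pure-'a' subgroup covering half of $S$, so the search reports it with $f(Q) = (1/2)^{0.5} \approx 0.707$.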