A Tutorial on Structure Optimized Proximity Scaling (STOPS)

Thomas Rusch

2024-10-09

In this document we introduce the functionality available in stops for fitting multidimensional scaling (MDS; Borg & Groenen 2005) or proximity scaling (PS) models by utilizing the STOPS framework. We start with a short introduction to PS and the models that are available. Next, we introduce the reader to STOPS (Rusch, Mair & Hornik, 2020; Rusch, Mair & Hornik, 2023a), a method to select hyperparameters in flexible mutlidimensional scaling methods based on structures in the configuration. We show how to fit those. We also mention P-COPS as a special case and show the connection to COPS (Rusch, Mair & Hornik, 2021). For illustration throughout this document we use the smacof::kinshipdelta data set (Rosenberg & Kim, 1975) which lists percentages of how often 15 kinship terms were not grouped together by college students.

library("stops")

Proximity Scaling

For proximity scaling (PS) or multidimensional scaling (MDS) the input is typically an \(N\times N\) matrix \(\Delta^*=f(\Delta)\), a matrix of proximities with elements \(\delta^*_{ij}\), that is a function of a matrix of observed non-negative dissimilarities \(\Delta\) with elements \(\delta_{ij}\). \(\Delta^*\) usually is symmetric (but does not need to be). The main diagonal of \(\Delta\) is 0. We call a \(f: \delta_{ij} \mapsto \delta^*_{ij}\) a proximity transformation function. In the MDS literature these \(\delta_{ij}^*\) are often called dhats or disparities. The problem that proximity scaling solves is to locate an \(N \times M\) matrix \(X\) (the configuration) with row vectors \(x_i, i=1,\ldots,N\) in low-dimensional space \((\mathbb{R}^M, M \leq N)\) in such a way that transformations \(g(d_{ij}(X))\) of the fitted distances \(d_{ij}(X)=d(x_i,x_j)\)—i.e., the distance between different \(x_i, x_j\)—approximate the \(\delta^*_{ij}\) as closely as possible. We call a \(g: d_{ij}(X) \mapsto d_{ij}^*(X)\) a distance transformation function. In other words, proximity scaling means finding \(X\) so that \(d^*_{ij}(X)=g(d_{ij}(X))\approx\delta^*_{ij}=f(\delta_{ij})\). There may also be weights transformed weight \(w^*_{ij}=h(w_{ij})\) with \(h: w_{ij} \mapsto w_{ij}^*\) being a weight transformation function.

This approximation \(D^*(X)\) to the matrix \(\Delta^*\) is found by defining a fit criterion (the loss function), \(\sigma_{PS}(X)=L(\Delta^*,D^*(X))\), that is used to measure how closely \(D^*(X)\) approximates \(\Delta^*\).

The loss function used is then minimized to find the vectors \(x_1,\dots,x_N\), i.e., \[\begin{equation} \label{eq:optim} \arg \min_{X}\ \sigma_{PS}(X). \end{equation}\]

Stress Models

The MDS models we list below can be fitted with the smacofx package.

The first popular type of PS supported in stops is based on the loss function type stress (Kruskall 1964), employing a quadratic loss function.

A general formulation of a loss function based on a quadratic loss for \(i,j = 1, \dots, N\) is \[\begin{equation} \label{eq:stress} \sigma_{PS}(X)=\sigma_{stress}(X)=\sum_{i<j} w^*_{ij}\left[d^*_{ij}(X)-\delta^*_{ij}\right]^2=\sum^N_{i=1}\sum^N_{j=1} h(w_{ij})\left[g\left(d_{ij}(X)\right)-f(\delta_{ij})\right]^2 \end{equation}\]

Here, the \(w_{ij}\) are finite weights, with \(w_{ij}=0\) if the entry is missing and \(w_{ij}\neq 0\) otherwise. There are a number of optimization techniques one can use to solve this optimization problem.

The distances used is usually the Euclidian distance as the distance fitted to the points in the configuration, \[\begin{equation} \label{eq:dist} d_{ij}(X) = ||x_{i}-x_{j}||_2=\left( \sum_{m=1}^M |x_{im}-x_{jm}|^2 \right)^{1/2} \ i,j = 1, \dots, N. \end{equation}\] Typically, the norm used is the Euclidean norm, so \(p=2\). In standard MDS \(g(\cdot)=f(\cdot)=I(\cdot)\), the identity function.

This formulation enables one to express a large number of PS methods many of which are implemented in stops. In stops we allow to use specific choices for \(f(\cdot)\), \(g(\cdot)\) and \(h(\cdot)\) from the family of power transformations so one can fit the following stress models:

Strain Models

The second popular type of PS supported in stops is based on the loss function type . Here the \(\Delta^*\) are a transformation of the \(\Delta\), \(\Delta^*= f (\Delta)\) so that \(f(\cdot)=-(h\circ l)(\cdot)\) where \(l\) is any function and \(h(\cdot)\) is a double centering operation, \(h(\Delta)=\Delta-\Delta_{i.}-\Delta_{.j}+\Delta_{..}\) where \(\Delta_{i.}, \Delta_{.j}, \Delta_{..}\) are matrices consisting of the row, column and grand marginal means respectively. These then get approximated by (functions of) the inner product matrices of \(X\) \[\begin{equation} \label{eq:dist2} d_{ij}(X) = \langle x_{i},x_{j} \rangle \end{equation}\]

We can thus express classical scaling as a special case of the general PS loss with \(d_{ij}(X)\) as an inner product, \(g(\cdot) = I(\cdot)\) and \(f(\cdot)=-(h \circ I)(\cdot)\).

If we again allow power transformations for \(g(\cdot)\) and \(f(\cdot)\) one can fit the following strain models with stops

Energy Models

The third type of PS supported in stops is based on the energy type losses of pairwise attraction and repulsion between objects (Chen & Buja, 2014, Rusch, Mair, Hornik 2023a). It is related to graph drawing: if vertices are equated with objects, the edge weights are the \(\delta^*_{ij}\), the node positions in the layout is the \(X\) and the distance between nodes in the layout are \(d_{ij}(X)\).

One can express badness-of-fit as comprising a repulsion part \(\propto -\delta^*_{ij}d^*_{ij}(X)\) and an attraction part \(\propto d^{*\mu}_{ij}(X)\) (Chen & Buja, 2014). This is
\[\begin{equation} \label{eq:energy} \sigma_{PS}(X)=\sigma_{EM}(X) = \sum_{i<j} w^*_{ij} \left(a d^{*\mu}_{ij}(X) - b \delta^*_{ij}d^*_{ij}(X) \right) \end{equation}\] with \(a,b\) being constants and \(d^*_{ij}(X)=g(d_{ij}(X))\) and \(\delta^*_{ij}=f(d_{ij}(X))\). For \(\mu=2, a=1\) and \(b=2\) this is a stress model where terms depending solely on \(\delta^*_{ij}\) are disregarded for finding \(X\).

Specific instances of energy models that can be fitted in stops are:

Selecting Hyperparameters by Structure Considerations: STOPS

The flexible models from above ( power stress, power strain, approximate power stress, local MDS, Box-Cox MDS and their variants) all have hyperparameters \(\theta\); for example, the hyperparameter vector for power stress is \(\theta=(\kappa,\lambda,\rho)\). That begs the question which hyperparameters should be chosenas the chosen values have a huge impact on how the final configuration looks like (clearly a \(kappa=2\) will lead to a different configuration than \(\kappa=0.5\)). So far, we have simply assumed the analyst knows the “right” \(\theta\); indeed often (and as we did so far) they are simply chosen ad hoc.

The main contribution of the stops package and idea behind STOPS is to allow to automatically select good hyperparameters \(\theta^*\) to achieve a “structured” MDS configuration. This can be useful in a variety of contexts: to explore or generate structures, to restrict the target space, to avoid artifacts, to preserve certain types of structures and so forth.

For this, the flexible MDS loss functions as described above are augmented to include penalties for the type of structures one is aiming for. This combination of an MDS loss with a structuredness penalty is what we call “structure optimized loss” (stoploss) and the resulting method is coined “Structured Optimized Proximity Scaling” (or STOPS, Rusch, Mair & Hornik, 2023a). STOPS is related to “Cluster Optimized Proximity Scaling” (COPS; Rusch, Mair & Hornik, 2021) which allows to select optimal parameters so that the clusteredness appearance of the configuation is improved (see below).

Model

Following Rusch, Mair & Hornik (2023a) the general idea is that from given observations \(\Delta\) we look for a configuration \(X\). The \(X\) has properties with regards to its structural appearance, which we call c-structuredness for configuration-structuredness. There are different types of c-structuredness people might be interested in (say, how clustered the result is, or that dimensions are orthogonal or if there is some submanifold that the data live on). We developed indices for these types of c-structuredness that capture that essence in the configuration (see Rusch, Mair & Hornik 2023b, or Rusch, Mair & Hornik 2020 for a large list of structures and indices).

We have as part of a STOPS model a proximity scaling loss function \(\sigma_{PS}(\cdot)\) and some transformation \(f_{ij}(\delta_{ij}|\theta_f)\) and \(g_{ij}(d_{ij}|\theta_g)\) and possibly \(h_{ij}(w_{ij}|\theta_h)\). These functions are parametrized and all parameters are collected in the vector \(\theta=(\theta_f,\theta_g,\theta_h)\). \(\theta\) is finite, e.g., a transformation applied to all observations like a power transformation. For example for power stress \(\theta=(\kappa,\lambda,\rho)\) and for Box Cox MDS it is \(\theta=(\mu,\lambda,\rho)\). These transformations achieve a sort of push towards more structure, so different values for \(\theta\) will in general lead to different c-structuredness.

Let us assume we are interested in \(L\) different structural qualities of \(X\) and that we have \(L\) corresponding univariate c-structuredness indices \(I_l(X\vert \gamma)\) for the \(l=1,\dots, L\) different structures, capturing the essence of the structural appearance of the configuration with respect to the \(l\)-th structure. For example, we might be interested in both the structural appearance of how clustered the configuration is (structure 1) and how strongly linearly related the column vectors of the configuration are (structure 2). We then measure the c-structuredness of \(X\) for the two structures with an index for clusteredness and one for linear dependence respectively. The \(\gamma\) are optional metaparameters for the indices, which we assume are given and fixed; they control how c-structuredness is measured. We further assume broadly that the transformations \(f(\Delta\vert\theta_f)\) and/or \(g(D(X)\vert \theta_g)\) and/or \(h(W\vert \theta_h)\) produce different c-structuredness in \(X\) for different values of \(\theta\).

In a nutshell our proposal is to select optimal hyperparameters \(\theta^\ast\) for the scaling procedure by assessing the c-structuredness of an optimal configuration \(X^\ast\) found from a PS method for given \(\theta\) usually in combination with its badness-of-fit value. We aim at finding a \(\theta^\ast\) that, when used as transformation parameters in the PS method, will give a configuration that has high (or low) values of the c-structuredness indices. We view this as a multi-objective optimization problem, where we want to maximize/minimize different criteria (either badness-of-fit, or c-structuredness, or both) over \(\theta\). C-structuredness may this way be induced at a possible expense of fit but we control the expense amount.

To formalize this we explicitly write the building blocks of the objective function used for hyperparameter tuning via STOPS as a function of \(\theta\): Let us denote by \(X^\ast(\theta)\) the optimal solution from minimizing a badness-of-fit \(\sigma_{PS}(X \vert \theta)\) for a fixed \(\theta\), so \(X^\ast(\theta):= \arg\min_{X} \sigma_{PS}(X\vert \theta)\). Further we also have the \(L\) different univariate indices with possible metaparameters \(\gamma\), \(I_l(X^\ast(\theta)\vert \gamma)\) to be optimized for.

Specific variants of STOPS can be instantiated by defining objective functions \(\text{Stoploss}(\theta\vert v_0, \dots, v_L, \gamma)\), comprising either badness-of-fit or c-structuredness indices or both in a scalarized combination. Two variants of objective functions are of the following form:

with \(v_0 \in \mathbb{R}_{\geq 0}\) and \(v_1,\dots,v_L \in \mathbb{R}\) being the scalarization weights that determine how the individual parts (MDS loss and c-structuredness indices) are aggregated. Note that in this formulation the aggregation is a sum/product, so the weights must be negative if a higher index stands for more structure and we want more of that structure. Alternatively, the stressweight can be negative and the strucweight positive.

Numerically, the badness-of-fit function value \(\sigma_{PS}(X^\ast(\theta)\vert \theta)\) needs to be normalized to be scale-free and commensurable for comparability of different values of \(\theta\). The objective function for aSTOPS is fully compensatory, whereas for mSTOPS it ensures that a normalized badness-of-fit of \(0\) will always lead to the minimal \(\text{Stoploss}(\theta\vert v_0, \dots, v_L, \gamma)\) for a positive value of \(I_l(\cdot)\). For notational convenience, we will refer to the objective functions for STOPS variants by \(\text{Stoploss}(\theta)\) from now on.

The job is then to find \[\begin{equation} \arg\min_{\theta}\ \text{Stoploss}(\theta) \end{equation}\]

Minimizing stoploss can be quite difficult. In stops we use a nested algorithm combining optimization that internally first solves for \(X\) given \(\theta\), \(\arg\min_X \sigma_{PS}\left(X|\theta\right)\) in an inner loop, and then optimize over \(\theta\) with a metaheuristic in an outer loop. The number of iterations of the outer loop can be controlled with itmax, the number of iterations of the inner loop with itmaxps.

Optimization

Implemented metaheuristics are simulated annealing (optimmethod="SANN"), particle swarm optimization (optimmethod="pso"), DIRECT (optimmethod="DIRECT"), stochastic global optimization (optimmethod="stogo"), COBYLA (optimmethod="cobyla"), Controlled Random Search 2 with local mutation (optimmethod="crs2lm"), Improved Stochastic Ranking Evolution Strategy (optimmethod="isres"), Multi-Level Single-Linkage (optimmethod="mlsl"), Nelder-Mead (optimmethod="neldermead"), Subplex (optimmethod="sbplx"), Hooke-Jeeves Pattern Search (optimmethod="hjk"), CMA-ES (optimmethod="cmaes"), Bayesian optimization with Gaussian Process priors and Kriging (optimmethod="Kriging"), Bayesian optimization with treed Gaussian processes with jump to linear models (optimmethod="tgp") and Adaptive Luus-Jaakola Search (optimmethod="ALJ"). Defaults is “ALJ”.

So there are a lot of solvers to choose from. In our experience tgp, ALJ, Kriging and pso usually work reasonably well for relatively low values of itmax, the iterations of the outer loop (up to 20). If the data are small, then ALJ and pso (with s=5 particles) and with relatively high itmax (>50) is typically good. If the data are larger (so that solving the PS problem is becoming very costly) we want to minimize the number of outer iterations and then Bayesian optimization with tgp or Kriging is typically good for itmax of about \(10\). The return of tgp is diminishing for higher itmax, so if a higher number of itmax can be afforded pso is often better and more efficient.

C-Structuredness

Currently the following c-structuredness types are supported:

  • c-clusteredness (cclusteredness): A clustered appearance of the configuration (\(I_k\) is the normed OPTICS cordillera (COPS; Rusch et al. 2015a); 0 means no c-clusteredness, 1 means perfect c-clusteredness)
  • c-linearity (clinearity): Projections lie close to a linear subspace of the configuration (\(I_k\) is maximal multiple correlation; 0 means orthogonal, 1 means perfectly linear)
  • c-manifoldness (cmanifoldness): Projections lie on a sub manifold of the configuration (\(I_k\) is maximal correlation (Sarmanov, 1958); only available for two dimensions; 1 means perfectly smooth function)
  • c-dependence (cdependence): Random vectors of projections onto the axes are stochastically dependent (\(I_k\) is distance correlation (Szekely et al., 2007); only available for two dimensions; 0 means they are stochastically independent)
  • c-association (cassociation): Pairwise nonlinear association between dimensions (\(I_k\) is the pairwise maximal maximum information coefficient (Reshef et al. 2011), 1 means perfect functional association)
  • c-nonmonotonicity (cnonmonotonicity): Deviation from monotonicity (\(I_k\) is the pairwise maximal maximum assymmetry score (Reshef et al. 2011), the higher the less monotone)
  • c-functionality (cfunctionality): Pairwise functional, smooth, noise-free relationship between dimensions (\(I_k\) is the mean pairwise maximum edge value (Reshef et al. 2011), 1 means perfect functional association)
  • c-complexity (ccomplexity): Measures the degree of complexity of the functional relationship between any two dimensions (\(I_k\) is the pairwise maximal minimum cell number (Reshef et al. 2011), the higher the more complex)
  • c-faithfulness (cfaithfulness): How accurate is the neighbourhood of \(\Delta\) preserved in \(D\) (\(I_k\) is the adjusted Md index of Chen & Buja, 2013; note that this index deviates from the others by being a function of both \(X^*\) and \(\Delta^*\) rather than \(X^*\) alone)
  • c-regularity (cregularity): How regular are the objects arranged.
  • c-hierarchy (chierachy): Captures how well a partition/ultrametric (obtained by hclust) explains the configuration distances.
  • c-convexity (cconvexity): Measures how convex the object arrangement is.
  • c-striatedness (cstriatedness): Measures how striated the object arrangement is.
  • c-outlying (coutlying): Measures if there are many outlying points.
  • c-skinniness (cskinniness): Measures how skinny the object arrangement is.
  • c-sparsity (csparsity): Measures how sparse the object arrangement is.
  • c-stringiness (cstringiness): Measures how stringy the object arrangement is.
  • c-clumpiness (cclumpiness): Measures how clumpy the object arrangement is.
  • c-inequality (cinequlaity): Measures how unequal the object arrangement is (as it is the Pearson coefficient of variation)

See Rusch, Mair, Hornik (2020) or Rusch, Mair, Hornik (2023b) for a precise definition of each c-structuredness index.

One can also call each index with the function c_foo where foo stands for the structure, so e.g. c_stringiness(X) will give the value of c-stringiness for object X. Note that the c-structuredness functions may have additional metaparameters. For example:

data(kinshipdelta)
dis<-kinshipdelta
X<-mds(dis)$conf
c_stringiness(X)
## [1] 0.1714678
c_clumpiness(X)
## [1] 0.1857798
c_association(X,alpha=0.7,C=10)
## [1] 0.3982779
c_clusteredness(X,minpts=3)
## [1] 0.1807154

If we have a single \(I(X)=OC(X)\), the OPTICS cordillera (Rusch, Hornik & Mair 2018), and the transformations applied are power transformations and the weights for the \(I(X)\) is negative we essentially have P-COPS (see below).

MDS models for STOPS

For the MDS loss (argument loss in functions stops), the functions currently support all losses derived from powerstress, powerstrain, lmds and bcStress. We offer dedicated functions that either use workhorses that are more optimized for the problem at hand and/or restrict the parameter space for the distance/proximity transformations and thus can be faster. They are:

  • stress, smacofSym: Kruskal’s stress and power transformations \(\lambda\); Workhorse: smacofSym, Optimization over \(\theta=\lambda\)
  • smacofSphere: Kruskal’s stress and power transformations \(\lambda\) for the dissimilarities for projection onto a sphere; Workhorse smacofSphere, Optimization over \(\theta=\lambda\)
  • strain, powerstrain: Classical scaling with power transformations \(\lambda\) for the dissimilarities; Workhorse: cmdscale, Optimization over \(\theta=\lambda\)
  • sammon, sammon2: Sammon scaling with power transformations \(\lambda\) for the dissimilarities; Workhorse: sammon or smacofSym, Optimization over \(\theta=\lambda\)
  • elastic: Elastic scaling with power transformations \(\lambda\) for the dissimilarities; Workhorse: smacofSym, Optimization over \(\theta=\lambda\)
  • sstress: S-stress with power transformations \(\lambda\) for the dissimilarities; Workhorse: powerStressMin, Optimization over \(\theta=\lambda\)
  • rstress: r-stress with power transfromations of the with power transformations \(\kappa\) of the fitted distances; Workhorse: powerStressMin, Optimization over \(\theta=\kappa\)
  • powermds: MDS with powers; Workhorse: powerStressMin, Optimization over \(\theta=(\kappa,\lambda)\)
  • powersammon: Sammon scaling with power transformation for the dissimilarities and fitted distances; Workhorse: powerStressMin, Optimization over \(\theta=(\kappa,\lambda)\)
  • powerelastic: Elastic scaling with power transformation for the dissimilarities and fitted distances; Workhorse: powerStressMin, Optimization over \(\theta=(\kappa,\lambda)\)
  • powerstress: Power stress model with power transformation for the dissimilarities and fitted distances and weights; Workhorse: powerStressMin, Optimization over \(\theta=(\kappa,\lambda,\rho)\)
  • rpowerstress: restricted power stress model; Workhorse: powerStressMin, Optimization over \(\theta=(\kappa,\lambda,\rho)\)
  • apstress: Approximate power stress model; Workhorse: powerStressMin, Optimization over \(\theta=(\tau,\upsilon)\)
  • lmds: LMDS; Workhorse: lmds, Optimization over \(\theta=(\tau, k)\).
  • bcstress: Box-Cox MDS; Workhorse: bcStressMin, Optimization over \(\theta=(\mu,\lambda,\rho)\).

Usage

The objects returned from stops() contain an object of smacofP or smacofB or cmdscalex in the slots $fit. Apart from that all the objects are made so that they are compatible to methods from smacof. Accordingly, the following S3 methods are available:

Method Description
print Prints the object
summary A summary of the object
plot 2D Plots of the object
residuals Residuals
coef Model Coefficients

The syntax for fitting a stops model is rather straightforward. One has to supply the arguments dis which is a dissimilarity matrix and structures a character vector listing the c-structuredness type that should be used to augment the PS loss (see above for the types of structures and losses). The metaparameters for the structuredness indices should be given via the strucpars argument; it should be a list whose elements are again lists corresponding to each structuredness index and listing the parameters (if the default should be used the list element should be set to NULL). See the example below. The PS loss can be chosen with the argument loss. The type of aggregation for the multi-objective optimization is specified in type and can be one of additive or multiplicative. As starting value for the \(\theta\) one can supply (theta). If not this will be a scalar \(1\).

For the outer optimization loop the solver can be specified with optimmethod; per default we use "ALJ". The upper and lower box constraints for the outer loop have to be supplied as well (these need to be of the same dimension as theta). If theta was not supplied, upper and lower need to be scalar (so one value each) and will be recycled to match the length of the hyperparamater vector. If the maximum number of iterations of the outer loop needs to be changed, one can use itmax and if the inner loop maximum number of iterations (for finding the PS configuration) needs to be changed, one can use the itmaxps argument. There are additional arguments for the function and one can pass additional parameters to the fitting workhorses with ... (see ?stops for details).

A typical call looks like

stops(dis, structures = c("cclusteredness","clinearity"), loss="stress", upper=0, lower=1)

The returned object contains the MDS object which is usually a smacof or smacofP class and all S3 methods are at one’s disposal.

Let us fit an mSTOPS model that looks for a transformation of the \(\delta_{ij}\) so that a) the result has maximal c-clusteredness (which is 1 in the best case, so we set a negative weight for this structure) b) the projection onto the principal axes are nearly orthogonal (c-linearity close to 0, so we set a positive weight for this structure) c) but the projections onto the principal axes should be stochastically dependent (negative weight on c-dependence) and d) the fit of the MDS is also factored in (so we set positive weight on the MDS loss).

We first setup the structures:

structures<-c("cclusteredness","clinearity","cdependence")

featuring the structures we mentioned.

Next we set up the metaparameters for the c-structuredness indices (the \(gamma\)). This must be a list of lists. Each list element corresponds to a structure in the same ordering as in the structures argument (so first c-clusterednes, then c-linearity, then c-dependence). Each of the list elements is again a list with the named arguments that should be supplied to the structure. If there are no metaparameters for a structure, the list should be NULL. Let’s illustrate with an example. For the OPTICS cordillera (c-lcusteredness) we may want metaparameters dmax=1, epsilon=10 and minpts=2, so we need to put them in a list with named elements as list(epsilon=10,minpts=2,dmax=1.3). For c-linearity we have no parameters, so we use NULL (or list(NULL)). For the c-dependence we have a single parameter, index which we set to 1.2 in another named list as list(index=1.2). We then create a list with these list as elements:

strucpars<-list(
                list(epsilon=10,minpts=2,dmax=1.3),  # element 1: list of arguments for c-clusteredness 
                NULL,                                # element 2: argument for c-linearity (empty as it has no parameters)
                list(index=1.2)                      # element 3: list of arguments for c-dependence
                ) 

If some arguments are not listed, the default values are taken. If there’s only one structre, we don’t need a list of list but the list of metaparameters suffices.

Since we use mSTOPS and a negative weight for c-linearity and c-dependence, a c-linearity/c-dependence close to 0 will overall dominate the stoploss function with the other two criteria being more of an afterthought - in aSTOPS that would be different. We weight all of them equally (\(0.33\)).

[!!]: This is generally the approach to be chosen: We minimize the stoploss, so a c-structuredness index that should be (numerically) large needs a negative weight and a c-structuredness index that should be (numerically) small needs a positive weight.

We now run stops.

set.seed(666)
ressm<-stops(kinshipdelta,loss="stress",stressweight=1,structures=c("cclusteredness","clinearity","cdependence"),strucweight=c(-0.33,0.33,-0.33),verbose=0,strucpars=strucpars,stoptype="multiplicative",lower=0,upper=8)
ressm
## 
## Call: stops(dis = kinshipdelta, loss = "stress", structures = c("cclusteredness", 
##     "clinearity", "cdependence"), stressweight = 1, strucweight = c(-0.33, 
##     0.33, -0.33), strucpars = strucpars, lower = 0, upper = 8, 
##     verbose = 0, stoptype = "multiplicative")
## 
## Model: multiplicative STOPS with ratio stress loss function and theta parameter vector (lambda)  =  1.432493 
## 
## Number of objects: 15 
## MDS loss value: 0.06970603 
## C-Structuredness Indices: cclusteredness 0.24108469 clinearity 0.02696876 cdependence 0.20524114 
## Structure optimized loss (stoploss): 0.05705468 
## MDS loss weight: 1 c-structuredness weights: -0.33 0.33 -0.33 
## Number of iterations of ALJ optimization: 28

The print function gives us information about the model. The selected hyperparameter was \(\lambda^*=\) NA. With that and the given metaparameters of the indices we have values of 0.2410847 for c-clusteredness, 0.0269688 for c-linearity and 0.2052411 for c-dependence. Stress ($stress.m) was 0.069706 and stress-1 ($stress, the square root of raw stress) is sqrt(0.069)=0.26.

We plot the obtained configuration with the “optimal” \(\lambda^*\).

plot(ressm)

We can see that the configuration mimics the structuredness we wanted (there is clusteredness, i.e., the terms are arranged in clusters, the projection is near orthogonal but there is a stochastic dependence that is nonlinear. This is a star-shaped arrangement. (No, this is Patrick).

Let us compare this with the corresponding aSTOPS

ressa<-stops(kinshipdelta,loss="stress",stressweight=1,structures=c("cclusteredness","clinearity","cdependence"),strucweight=c(-0.33,0.33,-0.33),verbose=0,strucpars=strucpars,stoptype="additive",lower=0,upper=8)
ressa
## 
## Call: stops(dis = kinshipdelta, loss = "stress", structures = c("cclusteredness", 
##     "clinearity", "cdependence"), stressweight = 1, strucweight = c(-0.33, 
##     0.33, -0.33), strucpars = strucpars, lower = 0, upper = 8, 
##     verbose = 0, stoptype = "additive")
## 
## Model: additive STOPS with ratio stress loss function and theta parameter vector (lambda)  =  1.483125 
## 
## Number of objects: 15 
## MDS loss value: 0.07037108 
## C-Structuredness Indices: cclusteredness 0.24482385 clinearity 0.02667558 cdependence 0.20393552 
## Structure optimized loss (stoploss): -0.06891657 
## MDS loss weight: 1 c-structuredness weights: -0.33 0.33 -0.33 
## Number of iterations of ALJ optimization: 25
plot(ressa)

Not really different here.

When choosing a c-structuredness index, one needs to be clear what structure one might be interested in and how it interacts with the PS loss chosen. Consider the following example: We fit a powermds model to the kinship data and want to maximize c-association (i.e., any non-linear relationship) and c-manifoldness but minimize the c-linearity. In other words we try to find a power transformation of \(\Delta\) and \(D\) so that the objects are positioned in the configuration in such a way that the projection onto the principal axes are as close as possible to being related by a smooth but non-linear function.

We use tgp as optimization method for itmax=10 steps and restrict the search space to be between \(0\) and \(5\). The maximum number of iterations to get the PS configuration is set to itmaxps=1000.

set.seed(666)
resa<-stops(kinshipdelta,structures=c("cassociation","cmanifoldness","clinearity"),loss="powermds",theta=c(1,1),strucpars=list(NULL,NULL,NULL),strucweight=c(-0.5,-0.5,0.5),lower=c(0,0),upper=c(5,5),optimmethod="tgp",itmaxps=1000,itmax=10)
resa
## 
## Call: stops(dis = kinshipdelta, loss = "powermds", theta = c(1, 1), 
##     structures = c("cassociation", "cmanifoldness", "clinearity"), 
##     strucweight = c(-0.5, -0.5, 0.5), strucpars = list(NULL, 
##         NULL, NULL), optimmethod = "tgp", lower = c(0, 0), upper = c(5, 
##         5), itmax = 10, itmaxps = 1000)
## 
## Model: additive STOPS with ratio powermds loss function and theta parameter vector (kappa lambda)  =  0.4893293 4.298529 
## 
## Number of objects: 15 
## MDS loss value: 0.2369349 
## C-Structuredness Indices: cassociation 0.515506235 cmanifoldness 0.953516033 clinearity 0.002089034 
## Structure optimized loss (stoploss): -0.4965317 
## MDS loss weight: 1 c-structuredness weights: -0.5 -0.5 0.5 
## Number of iterations of tgp optimization: 10

We see in this model (resa) that indeed the c-manifoldness is almost at 1, which suggests we have the objects arranged close to a manifold structure. c-association is also there and clinearity is near 0, so we see that this is a linear structure. How does this relationship look like?

plot(resa)

The manifold resembles an oval and the objects are arranged in clusters close to that oval. Due to the oval shape the c-association isn’t high as that isn’t really what c-association measures (it measures non-linear curves that do not turn whereas manifoldness ignores bending).

What we can also see is that there are three clear clusters of 2 or more objects, but we didn’t select for that.

Note that it may just as well be possible to have a high c-association and no c-clusteredness at all (e.g., points lying equidistant on a smooth non-linear curve), as we did not optimize for the latter. The transformation in powermds that is optimal with respect to c-clusteredness could be different.

Indeed one can even optimize for c-clusteredness alone (setting stressweight=0) and using it as a “goodness-of-clusteredness” index (i.e., we let the \(d_{max}\) vary between the configurations and take the highest attainable normed cordillera). This time we use optimmethod="tgp" and itmax=20.

set.seed(666)
resa2<-stops(kinshipdelta,structures=c("cclusteredness"),loss="powermds",strucpars=list(list(epsilon=10,dmax=NULL,minpts=2)),strucweight=-1,stressweight=0,lower=c(0.1,0.1),upper=c(5,5),theta=c(1,1), optimmethod="tgp",itmax=20,verbose=3)
## Starting Optimization 
##  EGO (TGP) Optimization 
## Evaluating function at initial points 
## Starting Bayesian Optimization 
## Found minimum after 20  iterations at 0.2771 1.4104 with stoploss= -0.6696 and default scaling loss= 0.0612 and c-structuredness indices: cclusteredness 0.669581 . Thanks for your patience.
resa2
## 
## Call: stops(dis = kinshipdelta, loss = "powermds", theta = c(1, 1), 
##     structures = c("cclusteredness"), stressweight = 0, strucweight = -1, 
##     strucpars = list(list(epsilon = 10, dmax = NULL, minpts = 2)), 
##     optimmethod = "tgp", lower = c(0.1, 0.1), upper = c(5, 5), 
##     verbose = 3, itmax = 20)
## 
## Model: additive STOPS with ratio powermds loss function and theta parameter vector (kappa lambda)  =  0.277098 1.410424 
## 
## Number of objects: 15 
## MDS loss value: 0.06121718 
## C-Structuredness Indices: cclusteredness 0.669581 
## Structure optimized loss (stoploss): -0.669581 
## MDS loss weight: 0 c-structuredness weights: -1 
## Number of iterations of tgp optimization: 20

Then we get a projection with c-clusteredness of 0.669581. This can of course be done for an c-structuredness.

It is also possible to use the stops function for finding the hyperparameters that are optimal for minimizing the badness-of-fit for a given transformation class in the the non-augmented models specified in loss, by setting the strucweight (the weight of any c-structuredness) to 0. Then the function optimizes the MDS loss function only. We use optimmethod="pso" with 5 particles and 100 steps.

ressa<-stops(kinshipdelta,theta=1,structures=c("clinearity"),strucweight=0,loss="stress",verbose=0,upper=8,lower=0,optimmethod="pso",itmax=100)

Here the minimum badness-of-fit by subjecting the \(\delta\) to a power transformation \(\delta^*=\delta^\lambda\) is obtained with \(\lambda^*=\) and a stress-1 of 0.2652755. This strategy is similar to how the optimal scaling of dhats works within other MDS methods (e.g., in nonmetric and spline MDS).

COPS

A special STOPS model is P-COPS (Rusch, Mair & Hornik 2021). Along the lines of STOPS the overall objective function, which we call , is a weighted combination of the \(\theta-\)parametrized loss function, \(\sigma_{PS}\left(X(\theta)|\theta\right)\), and a c-clusteredness measure, the OPTICS cordillera or \(OC(X(\theta);\epsilon,k,q)\) to be optimized as a function of \(\theta\) or \[\begin{equation} \label{eq:spstress} \text{coploss}(\theta) = v_1 \cdot \sigma_{PS}\left(X(\theta) | \theta \right) - v_2 \cdot \text{OC}\left(X(\theta);\epsilon,k,q\right) \end{equation}\]

with \(v_1,v_2 \in \mathbb{R}\) controlling how much weight should be given to the scaling fit measure and the c-clusteredness.

The c-clusteredness index we use is the OPTICS cordillera which measures how clustered a configuration appears. It is based on the OPTICS algorithm that outputs an ordering together with a distance. The OPTICS cordillera is now simply an agregation of that information. Since we know what constitutes a maximally clustered result, we can derive an upper bound and normalize the index to lie between 0 and 1. If it is maximally clustered, the index gets a value of 1,and it gets 0 if all points are equidistant to their nearest neighbours (a matchstick embedding). See the vignette in the COPS package for more.

A Full Blown Analysis

We recreate parts of an example of Rusch, Mair and Hornik (2023) here. The data are 500 handwritten pen digits and we subject them to different flexible MDS methods and select hyperparameters with STOPS based on c-clusteredness and c-manifoldness. We use Sammon mapping with power transformations for the proximities, Box-Cox MDS, lMDS (all three like in the article) and also approximate power stress (different from the article). The data are available in the package via data(Pendigits500).

This is a tough data set for our implementations because it is quite large, so an individual MDS takes a while. When running STOPS we have to fit a large number of MDS, so this can take quite a long time and therefore we also time so that users can see what to expect.

# data setup
data(Pendigits500)
pendss <- Pendigits500
names(pendss)[17] <- "digit"

## setup data and the c-structuredness hyperparameters
dis <- dist(pendss[,1:16]) #only features are used to get the dissimilarity matrix, not the digit label

# parameters for OC (c-clusteredness)
q <- 2
rang <- c(0,0.6) 
minpts <- 5
eps <- 10
scale <- 3

We start with an initial solution (a Sammon mapping) and see how the c-structuredness of that result is

## initial sammon mapping solution
initsam <- sammon(dis)
## Initial stress        : 0.15233
## stress after   1 iters: 0.15079
## c-structuredness of initial solution
c1 <- cordillera::cordillera(initsam$points,minpts=minpts,q=q,epsilon=eps,rang=rang,scale=scale)
ccom1 <- c_manifoldness(initsam$points)
c1
##        raw     normed 
## 0.70519137 0.08331615
ccom1
## [1] 0.593913

So the c-clusteredness was 0.0833162 and c-manifoldness was 0.593913. This is our reference solution to see if we can improve c-structuredness.

Next we set up the structures and the parameters for the structuredness indices

# #c-structuredness metaparameters 
strucpars <- list(list(minpts=minpts,epsilon=eps,rang=rang,scale=scale), #cordillera 
                  list(NULL)  #manifoldness
                  ) 
structures <- c("cclusteredness","cmanifoldness") #structures

We use weights of \(-400\) for c-clusteredness and \(-2.5\) for c-manifoldness.

## c-structuredness weights 
strucweight <- c(-400,-2.5)

Now we have everything to run STOPS. We use Sammon mapping with powers for the dissimilarities, local MDS and Box-Cox MDS. The search space is from \(1\) to \(6\) and we use optimmethod='tgp' (one can set verbose=4 to see details of the iteration progress). We also time the calculations.

#Sammon with power transformations of proximities
set.seed(666)
timesam <- system.time(restgp <- stops(dis,theta=5.6,loss="sammon",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=1,upper=6,optimmethod="tgp",model="btgpllm",itmax=10))

#Box Cox MDS
set.seed(667)
timebc <- system.time(restgpbc <- stops(dis,theta=c(4.63,2.77,5.12),loss="bcstress",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(0.1,0.1,0.1),upper=c(6,6,6),optimmethod="tgp",model="btgpllm",itmax=10,itmaxps=1e4))

#lMDS
set.seed(669)
timelmds <- system.time(restgplmds <- stops(dis,theta=c(14.5,3),loss="lmds",weightmat=dis,structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(5,0),upper=c(50,3),optimmethod="tgp",model="btgpllm",itmax=10,itmaxps=1e4))

#approx p stress
set.seed(670)
timeaps <- system.time(restgpaps <- stops(dis,theta=c(1,1,1),loss="apstress",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(0.1,0.1,0.1),upper=c(3,3,3),optimmethod="ALJ",itmax=20,itmaxps=1e4,verbose=4))

We now plot the results. We color the digits differently and adjust all the configurations to the same plotting region via Procrustes transformation (as MDS results are invariant to translation, scaling and rotation).

#plot the digit results
colsopt <- cols1 <- colstraj <- factor(pendss[,17])
levels(colsopt) <- hcl.colors(10,"Dark 3")

#Procrustes adjustment to lmds configuration for STOPS optimized
samconf <- restgp$fit$conf
lmdsconf <- restgplmds$fit$conf
bcconf <- restgpbc$fit$conf
apsconf<-restgpaps$fit$conf
samconf <- Procrustes(lmdsconf,samconf)$Yhat
bcconf <- Procrustes(lmdsconf,bcconf)$Yhat
apsconf <- Procrustes(lmdsconf,apsconf)$Yhat

And here are the plots.

par(mfrow=c(2,2))
#Sammon
plot(samconf[,1],samconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="Sammon")
points(samconf[,1],samconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()

#lmds
plot(lmdsconf[,1],lmdsconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,main="lMDS",axes=TRUE)
points(lmdsconf[,1],lmdsconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()

#bc stress
plot(bcconf[,1],bcconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="BC-MDS")
points(bcconf[,1],bcconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()

#ap stress
plot(apsconf[,1],apsconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="Approx. POST-MDS")
points(apsconf[,1],apsconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()

We can see that Sammon mapping with power transformation shows a clusteredness that is both high and sensible. The values is 0.1500428 for Sammon. Box-Cox MDS also give s arelatively clustered result, but it seems it finds 3 clusters and that is less in line wht the ground truth. The value is 0.1302618.

restgp
## 
## Call: stops(dis = dis, loss = "sammon", theta = 5.6, structures = structures, 
##     stressweight = 1, strucweight = strucweight, strucpars = strucpars, 
##     optimmethod = "tgp", lower = 1, upper = 6, itmax = 10, model = "btgpllm")
## 
## Model: additive STOPS with ratio sammon loss function and theta parameter vector (lambda)  =  5.41942 
## 
## Number of objects: 500 
## MDS loss value: 0.4897205 
## C-Structuredness Indices: cclusteredness 0.1500428 cmanifoldness 0.9282179 
## Structure optimized loss (stoploss): -61.84794 
## MDS loss weight: 1 c-structuredness weights: -400 -2.5 
## Number of iterations of tgp optimization: 10
restgpbc
## 
## Call: stops(dis = dis, loss = "bcstress", theta = c(4.63, 2.77, 5.12), 
##     structures = structures, stressweight = 1, strucweight = strucweight, 
##     strucpars = strucpars, optimmethod = "tgp", lower = c(0.1, 
##         0.1, 0.1), upper = c(6, 6, 6), itmax = 10, itmaxps = 10000, 
##     model = "btgpllm")
## 
## Model: additive STOPS with ratio bcstress loss function and theta parameter vector (mu lambda rho)  =  4.603888 2.711544 5.411521 
## 
## Number of objects: 500 
## MDS loss value: 0.08998973 
## C-Structuredness Indices: cclusteredness 0.1302618 cmanifoldness 0.9773570 
## Structure optimized loss (stoploss): -54.45813 
## MDS loss weight: 1 c-structuredness weights: -400 -2.5 
## Number of iterations of tgp optimization: 10

Local MDS and approximate power stress give configurations that are less clustered, although the arrangement of points is still sensible even if they don’t align in clusters.

restgpbc
## 
## Call: stops(dis = dis, loss = "bcstress", theta = c(4.63, 2.77, 5.12), 
##     structures = structures, stressweight = 1, strucweight = strucweight, 
##     strucpars = strucpars, optimmethod = "tgp", lower = c(0.1, 
##         0.1, 0.1), upper = c(6, 6, 6), itmax = 10, itmaxps = 10000, 
##     model = "btgpllm")
## 
## Model: additive STOPS with ratio bcstress loss function and theta parameter vector (mu lambda rho)  =  4.603888 2.711544 5.411521 
## 
## Number of objects: 500 
## MDS loss value: 0.08998973 
## C-Structuredness Indices: cclusteredness 0.1302618 cmanifoldness 0.9773570 
## Structure optimized loss (stoploss): -54.45813 
## MDS loss weight: 1 c-structuredness weights: -400 -2.5 
## Number of iterations of tgp optimization: 10
restgpaps
## 
## Call: stops(dis = dis, loss = "apstress", theta = c(1, 1, 1), structures = structures, 
##     stressweight = 1, strucweight = strucweight, strucpars = strucpars, 
##     optimmethod = "ALJ", lower = c(0.1, 0.1, 0.1), upper = c(3, 
##         3, 3), verbose = 4, itmax = 20, itmaxps = 10000)
## 
## Model: additive STOPS with ratio apstress loss function and theta parameter vector (kappa lambda nu)  =  2.137869 1.423048 1.649153 
## 
## Number of objects: 500 
## MDS loss value: 0.04557652 
## C-Structuredness Indices: cclusteredness 0.08438802 cmanifoldness 0.79844724 
## Structure optimized loss (stoploss): -35.70575 
## MDS loss weight: 1 c-structuredness weights: -400 -2.5 
## Number of iterations of ALJ optimization: 19

This brings us to c-manifoldness which is very high for Sammon, Box-Cox MDS and lMDS. Approximate power stress MDS (POST-MDS) shows c-manifoldness that is a bit less bit still relatively high (it is less mainly due ot being more noisy around the imagined manifold). The manifolds are alos easy to see: for Sammon it seems to be reminiscent of a spiral, for approximate power stress it is a circle, for Box-Cox MDS it is a circle as well (or three spherical clusters) and for lMDS a manifold is also apparent although I don’t think it has a name.

The interesting thing here is that the objects are arrnaged along the manifolds in a way that suggests a structure of how handwritten digits are related, e.g., that 0s and 8s are related and that they are closer to 5s than to 1s. There is also a way to get from one digit to others along the manifold, e.g, in approx. POST-MDS one would get from the 6s to over the 9s to the 3s. This may give a hypothesis about a relationship amongst them.

Extras

We also briefly describe some other functions in the package stops.

The Adaptive Luus-Jaakola Algorithm

Since the inner optimization problem in STOPS models is hard and takes long, Rusch, Mair and Hornik (2021) developed a metaheuristic for the outer optimization problem that typically needs less calls to the inner minimization than SANN, albeit without the guarantees of convergence to a global minimum for non-smooth functions. It is an adaptation of the Luus-Jaakola random search (Luus & Jaakola 1973). It can used with the function ljoptim which modeled its output after optim. It needs as arguments x a starting value, fun a function to optimize, a lower and upper box constraint for the search region. By using the argument adaptive=TRUE or FALSE one can switch between our adaptive version and the original LJ algorithm. Accuracy of the optimization can be controlled with the maxit (maximum number of iterations), accd (terminates after the length of the search space is below this number ) and acc arguments (terminates if difference of two subsequent function values are below this value).

We optimize a “Wild Function” with the non-adaptive LJ version (and numerical accuracies of at least 1e-16 for accd and acc).

set.seed(210485)
fwild <- function (x) 10*sin(0.3*x)*sin(1.3*x^2) + 0.00001*x^4 + 0.2*x+80
res2<-ljoptim(50, fwild,lower=-50,upper=50,adaptive=FALSE,accd=1e-16,acc=1e-16)
res2
## $par
## [1] -15.81515
## 
## $value
## [1] 67.46773
## 
## $counts
## function gradient 
##      463       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
plot(fwild, -50, 50, n = 1000, main = "ljoptim() minimising 'wild function'")
points(res2$par,res2$value,col="red",pch=19)

Procrustes Adjustment

We also provide a procrustes adjustment to make two configurations visually comparable. The function is conf_adjust and takes two configurations conf1 the reference configuration and conf2 another configuration. It returns the adjusted versions

conf_adjust(conf1,conf2) 

References