Does principal component analysis (PCA) provide insufficient results for your data? Have you ever tried multidimensional scaling to reduce the dimensionality of your data set and detect its underlying dimensions?
What is MDS?
Multidimensional scaling (MDS) can be considered an alternative to PCA. In general, the goal of MDS is to detect underlying dimensions via the similarities or dissimilarities between the data vectors. A distance matrix Δ is built according to a given dissimilarity function δ. For each of the high-dimensional data vectors x_i, a lower-dimensional vector z_i with arbitrary coordinates is built to become a representative of x_i. MDS then tries to find new coordinates for the z_i in such a way that the distances between all z_i, which are stored in a matrix we denote as D, reflect the dissimilarities Δ as well as possible. MDS can thus be understood as a projection function f: ℝ^n → ℝ^m with m < n, where x_i ∈ ℝ^n and z_i ∈ ℝ^m.
Therefore the following equation has to be satisfied for all pairs of vectors:

d(z_i, z_j) ≈ δ(x_i, x_j)   (MDS equation)
There are multiple methods that try to achieve equality in the MDS equation. One very robust method is the scaling by majorizing a complicated function (SMACOF) algorithm. To evaluate the quality of an MDS solution one can define a so-called stress function σ. The SMACOF method iteratively approximates the derivative of σ in order to find a configuration Z that provides the minimal amount of stress. The configuration Z satisfies the MDS equation and represents the result we are looking for.
A high-performance implementation in C(++) and CUDA C
Due to the lack of an implementation powerful enough to perform reduction of large biological gene expression data sets, we decided to implement our own efficient version of SMACOF. The implementation is both memory and runtime efficient. We save half the memory compared to a naive approach by using lower triangular matrices to store the symmetric matrices. Cache optimizations and the utilization of OpenMP and CUDA yield such runtime improvements that we are capable of reducing data sets with 10,000 vectors in a few minutes rather than a few hours. Details about the implementation can be found in my master thesis, or check out the project and code yourself if you like (you can find the download link below).
Integration into VANESA
The integration into VANESA allows ad hoc analysis and powerful data exploration via information visualization. SMACOF is used to display high-dimensional data. One can then explore the underlying data using the tools offered by VANESA, resulting in images like this:


The visualized data set contains the expression levels of 9268 proteins of seven cholesteatoma patients, together with corresponding information about the cell compartments in which each protein can be found. Color-coding: green – low expression, red – high expression; magenta – cytoplasm, cyan – nucleus, blue if these colors mix. One can see that SMACOF arranges the proteins in such a way that proteins found in the same or similar cell compartments build substructures. Within these substructures a gradient builds up that extends from low (green) to high (red) expressed proteins. The data set can be downloaded here: cholcellcomp_large. More on the data set, the color-coding and the meaning of these results can be found in my master thesis.
Getting started
- make sure you have installed Armadillo on your system, or install it with: make install-armadillo
- check if the configuration in the Makefile is suitable for your hardware/system
- run make c for the C/OpenMP version or make cuda for the CUDA version
- run make run-cubeC-example, make run-cubeC-weights-example or make run-cubeCUDA-example to ensure everything works fine (if you get any errors, something is wrong with your Makefile configuration)
- try to reduce some example data sets from example-data/
Command-line parameters can be used as follows:
####################################################################
# SMACOF - Philipp D. Schubert - schubert@cebitec.uni-bielefeld.de #
####################################################################
usage: <prog> <input> <output> <disfunc> <maxiter> <epsilon> <resultdim> <metric_p> <print> [<weights>]
parameter explanation:
input - a valid path to csv file to analyse: path
output - a filename for the output file: path
disfunc - dissimilarity measure: ...
0 - euclidean
1 - cityblock
2 - minkowski
3 - correlation
4 - angularseparation
5 - wavehedges
6 - bhattacharyya
7 - soergel
8 - braycurtis
9 - divergence
10 - canberra
11 - none
maxiter - maximum number of iterations to use: a positive integer
epsilon - the maximum error value which is allowed: a positive real_t
resultdim - number of dimensions for result vectors: a positive integer
metric_p - a value to adjust the minkowski distance: a positive integer
print - be verbose: 0 or 1
weights - a valid path to csv file containing the weights
Since the program expects column vectors, you may have to transpose your data set. You can use the transpose.py script to do the job. If you project your data into two dimensions, you can create a scatter plot with the help of scatterplot.py; you may have to transpose again. If you want to create your own random data set, use gen-rand.py. Note: random data is nasty data; it takes much longer to find a configuration that satisfies the MDS equation than with real-world data. But if you want to melt down your PC – give it a try.
If you find any bug or error, please send me a mail and let me know.

Further material
My master thesis (only in German): ls_muti_scaling.pdf
NVIDIA Nsight project: smacof_openmp_cuda-1.0.tar.gz
Slides of a recent talk I gave at Bielefeld University on this very topic: smacof_in_openmp_cuda_c.pdf