Principal component analysis (PCA) provides insufficient results for your data? Ever tried multidimensional scaling to reduce the dimensions of your data set and detect underlying dimensions?

#### What is MDS?

Multidimensional scaling (*MDS*) can be considered to be an alternative to PCA. In general, the goal of MDS is to detect underlying dimensions via similarities or dissmilarities between the data vectors. A distance matrix is build according to a given dissimilarity function . For each of the high dimensional data vectors a lower dimensional vector with arbitrary coordinates is build to become a representative of . MDS now tries to achieve to find new coordinates for in such a way, that the distances between all , which are stored in a matrix we denote as , reflect the dissimilarities at best. can be understood as a projection function with , where and .

Therefore the following equation has to be satisfied for all pairs of vectors:

MDS equation

There are multiple methods that try to achieve equality for the MDS equation. One very robust method is the scaling by majorizing a converx/ complex function (SMACOF) algorithm. To evaluate the quality of an MDS solution one can define a so called stress function . The SMACOF method now iteratively approximates the derivation of in order to find a configuration that provides the minimal amount of stress. The configuration satisfies the MDS equation and represents the result we are looking for.

#### A high performance implementation in C(++) and CUDA C

Since the lack of an implementation powerful enough to perfom reduction of large biological gene expression datasets, we have decided to implement our own efficient version of SMACOF. The implementation is both, memory and runtime efficient. We could save half the memory compared to a naive approach, using lower triangular matrices to store symmetric matrices. Cache optimizations and the utilization of OpenMP and CUDA results in such runtime improvements, that we are capable to reduce datasets with 10.000 vectors in a few minutes, rather than a few hours. Details about implementation can be found in my master thesis or check out the project and code yourself if you like (you can find the download link below).

#### Integration into VANESA

The integration into VANESA allows ad hoc analysis and powerful data exploration via information visualization. SMACOF is used to display high dimensional data. One can then explore the underlying data by using tools offerd by VANESA resulting in images like this:

is a data set containing the expression levels of 9268 proteins of seven cholesteatom patients and corresponding information about the cell compartments you can find every protein in. Color-coding: green – low expression, red – high expression; magent – cytoplasma, cyan – nucleus, blue if mixing these colors. One can see that SMACOF arranges the proteins in such a way, that proteins, that can be found in same or similar cell compartments, build substructures. Within these substructures a gradient is build up that extends from low (green) to high (red) expressed proteins. The used data set can be downloaded here: cholcellcomp_large. More on , color-coding and the meaning of these results can be found in my master thesis.

#### Getting started

- make sure you have installed Armadillo on your system, or install with:
`make install-armadillo`

- check if the configuration in
`Makefile`

is suitable for your hardware/ system - run
`make c`

for C/ OpenMP version or`make cuda`

for CUDA version - run
`make run-cubeC-example`

,`make run-cubeC-weights-example`

,`make run-cubeCUDA-example`

to ensure everything works fine (if you get any errors, something is wrong with your`Makefile`

configuration - try to reduce some example data sets from
`example-data/`

Commandline parameters can be used as follows:

####################################################################

# SMACOF - Philipp D. Schubert - schubert@cebitec.uni-bielefeld.de #

####################################################################

usage: <prog> <input> <output> <disfunc> <maxiter> <epsilon> <resultdim> <metric_p> <print>

parameter explanation:

input - a valid path to csv file to analyse: path

ouput - a filename for the outputfile: path

disfunc - dissimilarity measure: ...

0 - euclidean

1 - cityblock

2 - minkowski

3 - correlation

4 - angularseperatin

5 - wavehedges

6 - bahattacharyya

7 - soergel

8 - braycurtis

9 - divergence

10 - canberra

11 - none

maxiter - maximum number of iterations to use: a positive integer

epsilon - the maximum error value which is allowed: a positive real_t

resultdim - number of dimensions for result vectors: a positive integer

minkowski_p - a value to adjust the minkowski distance: a positive integer

print - be verbose: 0 or 1

weights - a valid path to csv file containing the weights

Since the program expects column vectors, you may have to transpose your data set. You can use the `transpose.py`

script to do the job. If you project your data into two dimensions, you can create a scatterplot with help of `scatterplot.py`

, you may have to transpose again. If you want to create your own random data set, use `gen-rand.py`

to do it. Note: random data is nasty data, it takes much longer to find a configuration that satisfies the MDS equation, than real world data. But if you want to melt down your pc – give it a try.

If you find any bug or error, please send mail and let me know.

#### Further material

My master thesis (only in german): ls_muti_scaling.pdf

NVIDIA Nsight project: smacof_openmp_cuda-1.0.tar.gz

Slides of a recent talk i gave at Bielefeld University to this very topic: smacof_in_openmp_cuda_c.pdf