# Large Scale Multidimensional Scaling

Does principal component analysis (PCA) provide insufficient results for your data? Have you ever tried multidimensional scaling to reduce the dimensionality of your data set and detect underlying dimensions?

#### What is MDS?

Multidimensional scaling (MDS) can be considered an alternative to PCA. In general, the goal of MDS is to detect underlying dimensions via the similarities or dissimilarities between the data vectors. A distance matrix $\Delta$ is built according to a given dissimilarity function $\delta$. For each of the high-dimensional data vectors $x_i$, a lower-dimensional vector $\phi(x_i)$ with arbitrary initial coordinates is created as a representative of $x_i$. MDS then tries to find new coordinates for the $\phi(x_i)$ such that the distances between all $\phi(x_i)$, which are stored in a matrix we denote as $D$, reflect the dissimilarities in $\Delta$ as closely as possible. $\phi$ can be understood as a projection function with $\phi: \mathbb{R}^n \mapsto \mathbb{R}^m$, where $n,m\in\mathbb{N}$ and $m \ll n$.

Therefore the following equation has to be satisfied for all pairs of vectors:

MDS equation: $\delta(x_i, x_j) = ||\phi(x_i) - \phi(x_j)||\ \forall\ i,j$

There are multiple methods that try to achieve equality in the MDS equation. One very robust method is the scaling by majorizing a complicated function (SMACOF) algorithm. To evaluate the quality of an MDS solution, one can define a so-called stress function $\sigma(X) = \sum_{i<j} w_{ij} \left( \delta(x_i, x_j) - ||\phi(x_i) - \phi(x_j)|| \right)^2$, where the $w_{ij}$ are optional weights. The SMACOF method now iteratively minimizes a majorizing function of $\sigma$ in order to find a configuration $X^{[k]}$ that provides the minimal amount of stress. The configuration $X^{[k]}$ approximately satisfies the MDS equation and represents the result we are looking for.

#### A high performance implementation in C(++) and CUDA C

Due to the lack of an implementation powerful enough to perform reduction of large biological gene expression data sets, we decided to implement our own efficient version of SMACOF. The implementation is both memory- and runtime-efficient. Compared to a naive approach, we could save half the memory by using lower triangular matrices to store the symmetric matrices. Cache optimizations and the use of OpenMP and CUDA result in such runtime improvements that we are able to reduce data sets with 10,000 vectors in a few minutes rather than a few hours. Details about the implementation can be found in my master's thesis, or check out the project and code yourself if you like (you can find the download link below).

#### Integration into VANESA

The integration into VANESA allows ad hoc analysis and powerful data exploration via information visualization. SMACOF is used to display high-dimensional data. One can then explore the underlying data using the tools offered by VANESA, resulting in images like this:

$D^a$ is a data set containing the expression levels of 9268 proteins of seven cholesteatoma patients and corresponding information about the cell compartments in which each protein can be found. Color-coding: green – low expression, red – high expression; magenta – cytoplasm, cyan – nucleus, blue if these colors mix. One can see that SMACOF arranges the proteins in such a way that proteins found in the same or similar cell compartments form substructures. Within these substructures a gradient builds up that extends from lowly (green) to highly (red) expressed proteins. The data set $D^a$ can be downloaded here: cholcellcomp_large. More on $D^a$, the color-coding, and the meaning of these results can be found in my master's thesis.

#### Getting started

1. Make sure you have installed Armadillo on your system, or install it with `make install-armadillo`.
2. Check that the configuration in the `Makefile` is suitable for your hardware/system.
3. Run `make c` for the C/OpenMP version or `make cuda` for the CUDA version.
4. Run `make run-cubeC-example`, `make run-cubeC-weights-example`, and `make run-cubeCUDA-example` to ensure everything works fine (if you get any errors, something is wrong with your `Makefile` configuration).
5. Try to reduce some of the example data sets from `example-data/`.

Commandline parameters can be used as follows:
```
####################################################################
# SMACOF - Philipp D. Schubert - schubert@cebitec.uni-bielefeld.de #
####################################################################
usage: <prog> <input> <output> <disfunc> <maxiter> <epsilon> <resultdim> <metric_p> <print>
parameter explanation:
  input       - a valid path to csv file to analyse: path
  output      - a filename for the outputfile: path
  disfunc     - dissimilarity measure:
                  0 - euclidean
                  1 - cityblock
                  2 - minkowski
                  3 - correlation
                  4 - angularseparation
                  5 - wavehedges
                  6 - bhattacharyya
                  7 - soergel
                  8 - braycurtis
                  9 - divergence
                  10 - canberra
                  11 - none
  maxiter     - maximum number of iterations to use: a positive integer
  epsilon     - the maximum error value which is allowed: a positive real_t
  resultdim   - number of dimensions for result vectors: a positive integer
  minkowski_p - a value to adjust the minkowski distance: a positive integer
  print       - be verbose: 0 or 1
  weights     - a valid path to csv file containing the weights
```
Since the program expects column vectors, you may have to transpose your data set; you can use the transpose.py script to do the job. If you project your data into two dimensions, you can create a scatter plot with the help of scatterplot.py (you may have to transpose again). If you want to create your own random data set, use gen-rand.py. Note: random data is nasty data; it takes much longer to find a configuration that satisfies the MDS equation than it does for real-world data. But if you want to melt down your PC – give it a try.

If you find any bug or error, please send me a mail and let me know.

#### Further material

My master's thesis (only in German): ls_muti_scaling.pdf
NVIDIA Nsight project: smacof_openmp_cuda-1.0.tar.gz
Slides of a recent talk I gave at Bielefeld University on this very topic: smacof_in_openmp_cuda_c.pdf