The visualization of complex data sets is of essential importance in communicating your data products. Beyond pie charts, histograms, line graphs and other common forms of visual communication begins the reign of data sets that encompass too much information to be easily captured by these simple data displays. A typical context that abounds with complexity is found in the areas of text mining, natural language processing, and cognitive computing in general; such a complex context for data presentation is pervasive in an attempt to build a visual interface for products like semantic search engines or recommendation engines. For example, statistical models like LDA (Latent Dirichlet Allocation) enable for a thorough insight into the similarity structure across textual documents or vocabulary terms used to describe them. But as the number of pairwise similarities between terms of documents to be presented to the end users increases, the problem of effective data visualization becomes harder. In this and the following blog posts, I wish to present and discuss some of the well-known approaches to visualize difficult data sets from the viewpoint of pure ergonomics, as well as to suggest some improvements and new ideas here and there.

### A simple similarity map for iris

In this first blog on this topic, the idea is to focus on the basics of ergonomics in the visualization of similarity information. We will compare two ideas: the usage of similarity maps following a dimensionality reduction of the dataset - which can be taken as a sort of baseline idea in the visualization of similarity data - and the usage of directed graphs from higher-dimensional spaces directly, which is less common. I would like to put forward an argument in support of the later approach which I started using some time ago.

But first, let us have some data. I will use a well known, rather simple data set, merely for the purposes of the illustration here: namely, the famous iris:

# - the iris dataset

data(iris)

head(iris)

The iris dataset with its four numerical and one categorical variable is, as you will see, complicated quite enough to illustrate the main problem that I wish to discuss. Now, let’s illustrate the problem. The classical approach to assess the similarity between the different observations (rows) in this dataset would be to derive some metric from the numerical variables, perform dimensionality reduction, and then plot the data points in a lower-dimensional space. I will work with Euclidean distance and a non-linear dimensionality reduction approach by performing ordinal multidimensional scaling (MDS) with {smacof} in R:

# - distances in iris:

dIris <- dist(iris[, 1:4], method = "euclidean")

Following the computation of Euclidean distances, each observation from the dataset can be treated as if it was represented by a set of features, each feature standing for a distance from the observation under discussion to all other observations in the dataset (including itself, where the distance is zero). The source data were four-dimensional (four continuous variables were used to represent each observation); now they are N-dimensional, with N representing the number of observations. But, being distances, the data now convey information about the similarities between the observations. Now, the dimensionality reduction step:

# - multidimensional scaling w. {smacof}

library(smacof)

mdsIris <- smacofSym(dIris, type = "ordinal")

mdsIris$stress

confIris <- as.data.frame(mdsIris$conf)

# - add category to confIris:

confIris$Species <- iris$Species

head(confIris)

In ordinal multidimensional scaling, only the ordering of the initial distances is preserved; the obtained mapping is thus invariant under rotation and translation - like any MDS solution - and any monotonic scale transformation that preserves the rank order of similarity. The stress of the final MDS configuration is .025, not bad. Now we have a 2D representation of the data. Let’s plot it with {ggplot2}:

# - visualize w. {ggplot2}

library(ggplot2)

library(ggrepel)

library(RColorBrewer)

confIris$Color <- recode(confIris$Species, 'setosa' = 'aquamarine3', 'versicolor' = 'coral3', 'virginica' = 'deepskyblue3')

confIris <- confIris[order(confIris$Species), ]

confIris$Label[which(confIris$Species == 'setosa')] <- paste('s', 1:length(which(confIris$Species == 'setosa')), sep = "")

confIris$Label[which(confIris$Species == 'versicolor')] <-paste('ve', 1:length(which(confIris$Species == 'versicolor')), sep = "")

confIris$Label[which(confIris$Species == 'virginica')] <- paste('vi', 1:length(which(confIris$Species == 'virginica')), sep = "")

ggplot(confIris, aes(x = D1, y = D2, label = Label)) + geom_text_repel(aes(x = D1, y = D2, label = Label),

size = 2, segment.size = .1) +

geom_point(size = 1.5, color = confIris$Color) +

geom_point(size = 1, color = "white") +

ggtitle('Iris: semantic map.') +

theme_bw() + theme(plot.title = element_text(size = 10))

With 150 objects to represent, it is already difficult to read the similarity map. It can be readily seen that different types of observations (“s” for “setosa”, “vi” for “virginica”, and “ve” for “versicolor”, the famous iris$Species, each additionally indexed by the observation number in the respective class) tend to cluster naturally. However, we know one very important thing about the iris dataset: it is not linearly separable. In the simplest possible explanation of what non-linear separability means: look for the border between the blue (“virginica”) and red (“versicolor”) points, and try to imagine a straight line through the plane that perfectly separates the two classes (and then generalize if you wish: no plane in 3D, no hyper-plane in a higher dimensional space… can separate the classes perfectly). Ask yourself: what observations are the most responsible for these feature in the iris dataset? For example, is “vi7” at the bottom of the map a problematic case in categorization? What about the following: “vi50”, “vi47”, “ve34”, “ve23”? The observation “vi7” is found far away from the other members of its category, however, it’s still somewhat lonesome in this 2D similarity map. Observations “vi50”, “vi47”, “ve34”, and “ve23”, found in the lower right segment of the map but closer to its center, on the other hand, are found in a very crowded part of the similarity space: could they be a part of explanation for the presence of non-linear separability in iris?

The problem of ergonomics is obviously related to the overcrowding of the points in the map. The conceptual problem with dimensionality reduction to simple, visually comprehensible similarity spaces, is that there is no dimensionality reduction without a loss of information from the initial, higher-dimensional space (except in some really rare, theoretically possible cases). As we will see, as a consequence of this there are situations when seemingly unimportant observations manage to hide a part of their nature in a lower-dimensional representation. But we need to see that first.

### Similarity in iris as a directed graph

Before performing dimensionality reduction with ordinal MDS, we were found in a 150-dimensional similarity - distance, actually – space. In spite of being a powerful technique for non-linear dimensionality reduction, MDS has “eaten” some of the information from the initial hyperspace. Now, let’s see if we can go back into the hyperspace and visualize it somehow to enable us discuss the importance of particular observations in this dataset. Each data point is described as a vector of 150 elements, each element representing the distance between the data point under observation and some other data point. Let’s search through these vectors and find the nearest neighbor – the most similar observation – for each data point:

# - iris as a directed graph:

mdsDistIris <- as.matrix(mdsIris$confdist)

nnIris <- apply(mdsDistIris, 1, function(x) {

minDist <- sort(x, decreasing = T)

minDist <- minDist[length(minDist)-1]

which(x == minDist)[1]

})

# - graphFrame:

graphFrame <- data.frame(from = 1:150,

to = nnIris,

Species = iris$Species)

We have first discovered the nearest neighbors, and the introduced a new variable, Degree, that describes to how many other observations (excluding itself) does a particular data point presents the nearest neighbor. In the following similarity map, the size of the marker represents the degree of each observation:

Observations “vi50” and “ve34” stand close to each other, belonging to different classes; moreover, they are marked as having a higher degree in the nearest neighbor network (we haven’t visualized the network yet) than many other observations; they are candidates to represent the nearest neighbors to each other.

Still, “vi7” at the bottom of the map is certainly not a hub. In the next step, we change the visualization strategy. Again, get ready for a loss of information. While MDS attempts at a representation that will preserve as many as possible information about all pairwise similarities from the initial dataset, thus necessarily compromising with some error present in the lower-dimensional representation, we chose to sacrifice the context and represent information on similarity neighborhoods only. We will represent the similarity data from iris by instantiating a directed graph, with each node representing an observation, and each node having exactly one link, pointing towards the nearest neighbor of the respective observation. We use {igraph} to construct and represent the graph:

# - represent high-dimensional Iris neighbourhoods as directed graph:

graphFrame$from <- confIris$Label[graphFrame$from]

graphFrame$to <- confIris$Label[graphFrame$to]

library(igraph)

irisNet <- graph.data.frame(graphFrame, directed = T)

V(irisNet)$size <- degree(irisNet)*1.25

V(irisNet)$color <- as.character(confIris$Color)

V(irisNet)$label <- confIris$Label

# - plot w. {igraph}

par(mai=c(rep(0,4)))

plot(irisNet,

vertex.shape = "circle",

vertex.frame.color = NA,

edge.width = .75,

edge.color = "grey",

edge.arrow.size = 0.15,

vertex.label.font = 1,

vertex.label.color = "black",

vertex.label.family = "sans",

vertex.label.cex = .65,

vertex.label.dist = .35,

edge.curved = 0.5,

margin = c(rep(0,4)))

As expected, “vi50” and “vi34” have mutual links, standing for each other’s nearest neighbor, and still being members of different categories; but “vi39” and “ve21” too, a fact that was less obvious in the similarity map. Also, we now discover that the nearest neighbor to “vi7” is “ve13” from another class. The clustering in the nearest neighbor graphs occurs naturally, and thus the graph conveys some information about the similarity context for groups of points in spite of the fact that it was not intentionally built to represent such information.

### Maps vs. Graphs

The nearest neighbor directed graph approach is not very common among the attempts to visualize similarity data. However, following years of work in the field of distributive semantics, I have started abandoning similarity maps as a mean of data visualization almost completely in favor of graphs. While I haven’t performed any methodologically rigorous comparisons among the two approaches, I have used both in building interfaces for semantic search engines, and the users turned out to be always more satisfied with directed nearest neighbor graphs in comparisons to similarity maps. Intuitively, I would say the main reason is the absence of too much crowding of data points. Also, the aesthetics of graphs seem more pleasing. Graphs are maybe a bit more difficult to visually search through: the links and the occurrence of clusters tend to somehow guide the eye to search for a particular data point in the imminent neighborhood only. Again, I don’t find the ergonomics of similarity maps much more efficient in that respect. The drawback of directed graphs: while some important clusters of nearest neighbors emerge naturally in them, the graph layout will not necessarily enable the user to recognize the way the observations tend to group in general. I have experimented with directed graphs in which two nearest neighbors are selected for each observations, and that works too, but the graph becomes unreadable quickly for any number significantly larger than ten to fifteen observations.

Comparing the loss of information in similarity maps in contrast to the nearest neighbor graphs, it seems to me that the latter have a clear advantage in the presentation of complex datasets, and I would clearly prefer the graphs in any situation - or at least suggest always using them as an additional method of interpretation alongside the respective maps. In order to illustrate, I have inspected ordinal MDS solutions for iris distances in Euclidean space, ranging the solution dimensionality from two to five, and computed the accuracy of determining correctly the N-th neighbor for each observation for each solution. The following line chart is quite telling.

On the x-axis we find the order of the neighbor (1st, 2nd, etc), while the accuracy of correctly determining the neighbor of the respective order across the observations is represented on the y-axis (% scale). One can now have a quite direct insight into the behavior of information loss in ordinal MDS. The 2D solution seems to be especially problematic, with far less than 50% recognition of the 1st neighbor across the observations. The local information from higher-dimensional spaces represented by directed graphs perfectly complements the compromises made in the lower-dimensional similarity maps.