In this post, I give a high-level overview of self-organizing maps (SOM) and then map the algorithm's execution onto the Iris dataset.

A SOM is an artificial neural network trained through unsupervised learning. It helps summarize (and visualize) high-dimensional input data in fewer dimensions (usually two). In a self-organizing map, the neurons (nodes) are arranged in a rectangular or hexagonal grid. Each node has a weight vector with the same number of dimensions as the input vectors. High-dimensional input data is presented to this grid, and the trained grid preserves the principal features of that data (the topology of the data in the input space).

So here are the high level steps involved in the training:

1. Size of the grid is defined and then weights of each of the nodes of the grid are randomly initialized (they can also be initialized using PCA data)

2. Input vectors from the training data are presented to the network in random order. The input vectors are normalized before learning.

3. Each node’s weight is compared with the input vector and the node with the weight nearest to input vector is identified and called ‘best matching unit’ (BMU)

4. The weights of the BMU and of its neighbouring nodes are now adjusted to move closer to the input vector. Below are the equations which define how the weights are updated and how the neighbourhood of the BMU is identified:

Θ is the neighbourhood function, which makes nodes nearer to the BMU learn more than those farther away. It is usually taken as:

$\Theta(t) = \exp\left(-\frac{S^{2}}{2\sigma^{2}(t)}\right)$, where $S$ is the Euclidean distance on the grid between the node being updated and the winning neuron (the BMU)

And below is how σ is defined:

$\sigma(t) = \sigma_{0} \exp(-t/\tau)$, where $t$ is the current time and $\tau$ is the time constant. So σ decays over time.

There is another parameter, called the learning rate, $L(t)$. It is introduced to decay the learning over time and is defined as:

$L(t) = L_{0} \exp(-t/\tau)$

So below is the final weight updating formula:

$W(t+1) = W(t) + \Theta(t)\, L(t)\, (x - W(t))$, where $x$ is the input vector

The above procedure is repeated multiple times for various inputs.

5. The output can be visualized in various ways; one of the most common is the U-matrix. In a U-matrix, the Euclidean distance between neighbouring neurons is calculated and then represented on a gray scale or color scale.
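Steps 1 to 4 above can be sketched in a few lines of NumPy. This is a minimal illustration and not the MiniSom code: the function name `train_som`, the choice `tau = n_iter / 2` and the Gaussian neighbourhood with the squared grid distance in the exponent are assumptions made for the example.

```python
import numpy as np

def train_som(data, rows=7, cols=7, sigma0=2.0, lr0=0.5, n_iter=1000, seed=0):
    """Train a small SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Step 1: random initial weights, one vector per grid node.
    weights = rng.random((rows, cols, n_features))
    # Grid coordinates of every node, used for the neighbourhood distance.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    tau = n_iter / 2.0  # time constant (assumed value for this sketch)

    for t in range(n_iter):
        # Step 2: present a randomly chosen input vector.
        x = data[rng.integers(len(data))]
        # Step 3: the BMU is the node whose weight vector is nearest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Step 4: decayed sigma and learning rate, Gaussian neighbourhood.
        decay = np.exp(-t / tau)
        sigma = sigma0 * decay
        lr = lr0 * decay
        s2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)  # squared grid distance
        theta = np.exp(-s2 / (2.0 * sigma ** 2))
        weights += (theta * lr)[..., None] * (x - weights)
    return weights

# Usage with random stand-in data (150 rows, 4 features), normalized per row.
rng = np.random.default_rng(1)
data = rng.random((150, 4))
data /= np.linalg.norm(data, axis=1, keepdims=True)
weights = train_som(data)
print(weights.shape)  # (7, 7, 4)
```

Because every update moves a weight vector part of the way towards an input vector, the whole grid is pulled towards the data, with the pull weakening both with grid distance from the BMU and with time.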

**SOM execution on Iris dataset**

Now let us map the concepts discussed above onto an actual execution of self-organizing maps on the Iris data (which has four dimensions). I have used https://github.com/JustGlowing/minisom (a Python implementation of SOM developed by Giuseppe Vettigli) and modified it for my purpose.

1. Here the size of the grid is defined as 7 x 7 and the weights of the nodes are randomly initialized. These weights are then normalized using the Frobenius norm. Below is how this randomly initialized map looks (it may look different next time, as it is randomly initialized). The implementation code was modified to use a red-blue color scale.

2. Now comes training with the Iris dataset. Each training example from the dataset is normalized before it is presented to the SOM.

3. For each row of normalized Iris data, a winner neuron is identified. This is done by calculating the difference between the input data row and every neuron's weight vector and then finding the minimum of the Euclidean distances. The winner is called the ‘Best Matching Unit’ (BMU).

4. As mentioned in the overview above, the weights of the BMU and of the neighbouring neurons are now updated to bring them nearer to the input data. In this implementation the entire grid is the neighbourhood, but Θ decreases as the distance from the BMU increases. Also, to be in line with the overview above, exponentially decaying functions have been provided for ‘sigma’ and ‘learning rate’. Each time the weights are updated during the learning phase, they are normalized using the Frobenius norm.

I have used $\sigma_{0} = 2.0$ and $L_{0} = 0.5$ here. As a result of the training we get a 7 x 7 x 4 matrix (as the weights have the same number of dimensions as the input data).

5. Now how do you visualize this 7 x 7 x 4 matrix in a grid of 7 x 7 nodes? We have to prepare a U-matrix (a matrix of the average distance between each node’s weight vector and its neighbouring nodes’ weight vectors). Here the neighbouring nodes are taken as highlighted below (the black dot is the node for which the average distance is calculated and the blue nodes are the ones used to calculate it):

Below is the final distance map created from the trained SOM:

BMUs can now be highlighted where they reside on the final distance map. Below is the graph where BMUs are overlaid on the distance map. Different markers are placed for each species (circle for setosa, square for versicolor and diamond for virginica).

As is clear from the above graph, setosa forms a cluster that stands apart from the other two species, which have some overlap with each other. This can be extended further to classify new Iris data.
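The distance map and the BMU lookup described in this section can be sketched as follows. This is an illustration under assumptions, not the modified MiniSom code: the helper names `u_matrix` and `bmu_of` are made up for the example, and the neighbourhood is assumed to be the up to eight nodes surrounding each node on the grid.

```python
import numpy as np

def u_matrix(weights):
    """Average distance from each node's weight vector to its grid neighbours."""
    rows, cols, _ = weights.shape
    um = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if (di, dj) != (0, 0) and inside:
                        dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            um[i, j] = np.mean(dists)
    # Scale to [0, 1] so the result can be fed straight to a color map.
    return um / um.max() if um.max() > 0 else um

def bmu_of(weights, x):
    """Grid position (row, col) of the node whose weight vector is nearest to x."""
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```

Plotting `u_matrix(weights)` as an image and placing one marker per data row at `bmu_of(weights, row)` gives the kind of overlay described above; large values in the map mark boundaries between clusters.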

References:

1. https://en.wikipedia.org/wiki/Self-organizing_map

2. https://github.com/JustGlowing/minisom

3. http://www.shy.am/wp-content/uploads/2009/01/kohonen-self-organizing-maps-shyam-guthikonda.pdf

Principal Component Analysis:

Principal Component Analysis (PCA) is a statistical procedure used for dimensionality reduction through orthogonal projection onto new dimensions; equivalently, it converts a set of observations of correlated variables into a set of values of linearly uncorrelated variables. These new dimensions are called principal components, and they capture the maximum variance of the data points in the original space. Below is the function which describes the projection onto the new dimensions:

$Y = W^{T}X$, and it turns out that the columns of $W$ are eigenvectors of the covariance matrix of the dataset in the original space.

The number of resulting eigenvectors is equal to the number of original dimensions. So how are we reducing the number of dimensions? Well, this is where the beauty of PCA lies. It turns out that a large share of the variance in the original space can often be explained by a small number of principal components. So we sort the components by the variance they explain and select only the top few. The variance explained by each component is nothing but the associated eigenvalue, so the following holds:

$Cv = \lambda v$ (where $C$ is the covariance matrix, $v$ is an eigenvector and $\lambda$ is the corresponding eigenvalue)
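As a small worked illustration (the numbers are made up for this example and do not come from any dataset), take a two-dimensional covariance matrix and solve the eigenvalue equation by hand:

```latex
C = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(C - \lambda I) = (2 - \lambda)^{2} - 1 = 0
\;\Rightarrow\; \lambda_{1} = 3,\ \lambda_{2} = 1

v_{1} = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
v_{2} = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad
C v_{1} = 3 v_{1}, \quad C v_{2} = 1 \cdot v_{2}

\frac{\lambda_{1}}{\lambda_{1} + \lambda_{2}} = \frac{3}{4}
```

Here the first component alone explains 75% of the total variance, so keeping only $v_{1}$ reduces two dimensions to one while retaining most of the spread in the data.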

So in summary, below are the high-level steps to do PCA (let’s say ‘m’ is the total number of dimensions in the original space):

- Calculate the m-dimensional mean vector and subtract it from every data point (centering makes the calculations a bit easier)
- Compute the covariance matrix for the ‘m’ dimensions
- Compute the eigenvectors of the covariance matrix and the corresponding eigenvalues
- Based on the eigenvalues (and the acceptable share of variance explained), select the eigenvectors that explain the most variance (let’s say ‘d’ eigenvectors). This eigenvector matrix is m x d dimensional, where every column is an eigenvector
- Now the dataset in the original space is transformed into the new dimensional space using the above-mentioned equation ($Y = W^{T}X$)
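The steps listed above can be sketched with NumPy. This is a minimal illustration, assuming the data matrix stores one observation per row (so the projection is written as `Xc @ W`, which applies $W^{T}$ to each observation); the function name `pca` is made up for the example.

```python
import numpy as np

def pca(X, d):
    """X has shape (n_samples, m). Returns (Y, W, eigenvalues)."""
    # Step 1: subtract the m-dimensional mean vector.
    Xc = X - X.mean(axis=0)
    # Step 2: m x m covariance matrix.
    C = np.cov(Xc, rowvar=False)
    # Step 3: eigen-decomposition (eigh suits symmetric matrices; ascending order).
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: keep the top d eigenvectors as the columns of W (m x d).
    W = eigvecs[:, :d]
    # Step 5: project the centered data onto the new dimensions.
    Y = Xc @ W
    return Y, W, eigvals

# Usage on random correlated data: 200 observations, 5 dimensions reduced to 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Y, W, eigvals = pca(X, d=2)
print(Y.shape)  # (200, 2)
```

The covariance matrix of `Y` comes out diagonal, with the two largest eigenvalues on the diagonal, which is the "linearly uncorrelated variables" property stated earlier.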

Eigenfaces:

The process described above can be applied to images too, and this application is called Eigenfaces. I have used the Yale University face database for my trials (http://vision.ucsd.edu/content/yale-face-database). The images in this database are 320 x 243 pixels.

This means each image is a point in a 77760-dimensional space (320 x 243 = 77760, one dimension per pixel when the image matrix is flattened into a vector).

Below are the images that were used as training set:

Many web resources advise cropping these images further, as PCA is sensitive to scale and lighting. In this case, however, the images are used without cropping. The training set was read into a 77760 x 16 matrix (let’s say X) in which each column represents an image. This is the dataset in the original space.

The average over the columns of this matrix (that is, over the 16 images) was calculated, which produced the average image below:

Each image differs from this average by a vector, obtained by subtracting the average image from each column of the data matrix; let’s call the resulting matrix of differences R. This is Step 1 in the summary of the PCA section above.
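A minimal sketch of building X, the average image and R with NumPy. Random pixel values stand in for the Yale faces here, and the array layout (16 images of 243 rows by 320 columns) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 16 grayscale images of 243 rows x 320 columns (random pixels,
# not the actual face images).
images = rng.random((16, 243, 320))

# Flatten each image into one column: X has shape (77760, 16).
X = images.reshape(16, -1).T
# Average image: the mean over the 16 columns, kept as a (77760, 1) column.
mean_face = X.mean(axis=1, keepdims=True)
# Each column of R is one image minus the average image.
R = X - mean_face

print(X.shape, mean_face.shape, R.shape)  # (77760, 16) (77760, 1) (77760, 16)
```

Reshaping `mean_face` back to 243 x 320 gives an array that can be displayed as the average image.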

The covariance matrix for this training set is then $C = RR^{T}$ (up to a constant scaling factor), which is 77760 x 77760 in size. Doing any computation on this matrix is a herculean task, so a shortcut is used. From linear algebra, if R is an M x N matrix with M > N, then $RR^{T}$ has at most N non-zero eigenvalues (at most N-1 here, because subtracting the mean removes one degree of freedom). So we can instead do the eigenvalue decomposition of the much smaller N x N matrix $C = R^{T}R$ (from here on, C denotes this smaller matrix):

$Cv = \lambda v \;\Rightarrow\; R^{T}Rv = \lambda v$. Now if we multiply this equation on the left by the matrix $R$, we get:

$R(R^{T}Rv) = R(\lambda v) \;\Rightarrow\; RR^{T}(Rv) = \lambda(Rv)$, so for $RR^{T}$ the eigenvector becomes **Rv**, with the same eigenvalue $\lambda$
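The identity can be checked numerically on a small random matrix. This is a sketch with stand-in sizes (M = 500 instead of 77760) and random data; dropping the near-zero eigenvalue left by mean subtraction is a choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 500, 16                           # small stand-ins for 77760 and 16
X = rng.random((M, N))
R = X - X.mean(axis=1, keepdims=True)    # subtract the average column

small = R.T @ R                          # N x N matrix instead of M x M
lam, v = np.linalg.eigh(small)           # eigenvalues in ascending order
lam, v = lam[::-1], v[:, ::-1]           # largest eigenvalue first

U = R @ v                                # columns are the vectors Rv
# Drop the direction whose eigenvalue is numerically zero (lost to centering).
keep = lam > 1e-10 * lam[0]
U, lam = U[:, keep], lam[keep]
U /= np.linalg.norm(U, axis=0)           # normalize each column to unit length

# Check (R R^T) u = lambda u without ever forming the M x M matrix.
residual = R @ (R.T @ U) - U * lam
print(U.shape, np.abs(residual).max() < 1e-8)
```

The product `R @ (R.T @ U)` is evaluated right to left on purpose, so the largest intermediate array is M x N and the full M x M matrix never has to be stored.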

Let’s apply the above process to our case, where C becomes a 16 x 16 matrix. This is Step 2 from the PCA summary section above.

Now we compute the eigenvectors and eigenvalues of this 16 x 16 matrix (C). The resulting eigenvectors are the v used in the equation above. Remember that we actually need the eigenvectors of the covariance matrix, which was 77760 x 77760 in size. So we compute the matrix product of R and the matrix of eigenvectors v, which gives a 77760 x 16 matrix. Each column of this matrix, after normalization, can be displayed as an image; these images are known as Eigenfaces. This is Step 3 of the PCA summary section. Below are the 16 eigenfaces created in this case:

Each of these eigenfaces is a component, and below is a plot of the % of variance explained by each eigenface (calculated as the ratio of its eigenvalue to the total variance):

A subset of these eigenfaces can be selected depending on the minimum acceptable % of variance that must be explained. This is Step 4.
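One way to make this selection concrete is sketched below. The helper name `n_components_for` and the eigenvalues are made up for illustration; they are not the values from the face data.

```python
def n_components_for(eigenvalues, min_explained):
    """Smallest number of top components whose cumulative share of the total
    variance reaches min_explained (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_explained:
            return k
    return len(ordered)

eigenvalues = [4.0, 3.0, 2.0, 1.0]                      # illustrative values only
ratios = [v / sum(eigenvalues) for v in eigenvalues]    # share per component
print(n_components_for(eigenvalues, 0.85))  # 3
```

With these numbers the first two components explain 70% of the variance and the first three explain 90%, so a minimum of 85% requires three components.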

Now these eigenfaces can be used to reconstruct existing faces or to match new faces. We project a new face onto the eigenfaces (after subtracting the average face) to obtain its set of weights. These weights are then compared with the weights of all faces in the database, and the closest neighbour is identified and analysed.
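Projection, reconstruction and nearest-neighbour matching can be sketched as below. The function names are hypothetical, the eigenfaces are assumed to be stored as unit-length, mutually orthogonal columns, and Euclidean distance between weight vectors is assumed as the comparison measure.

```python
import numpy as np

def project(face, mean_face, eigenfaces):
    """Weights of a flattened face on each eigenface (eigenfaces: columns)."""
    return eigenfaces.T @ (face - mean_face)

def reconstruct(weights, mean_face, eigenfaces):
    """Rebuild a face from its weights; fewer eigenfaces give a coarser result."""
    return mean_face + eigenfaces @ weights

def closest_face(weights, known_weights):
    """Index of the known face whose weight vector is nearest (Euclidean)."""
    distances = np.linalg.norm(known_weights - weights, axis=1)
    return int(np.argmin(distances))
```

Passing only the first k columns of the eigenface matrix to `project` and `reconstruct` gives a reconstruction from k eigenfaces, which is how a series with an increasing number of eigenfaces can be produced.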

Below is one of the faces, reconstructed with an increasing number of eigenfaces:

Image recognition through PCA has its own limitations. One of them is that it is sensitive to lighting and scale. It is difficult for PCA to know whether variance is caused by such external factors, as it does not do classification.

One can acknowledge, categorize and reduce uncertainty, but cannot completely eliminate it. This leaves scope for error, and these wise people have always learnt from previous inconsistent predictions made by themselves or their community. They were also aware that the impact variables could surface from any environment and could even belong to a completely different system altogether. Moreover, these variables were very dynamic by nature and moved in and out of the prediction logic or equation. This basic framework could be applied to any kind of prediction, be it predicting the future of an individual based on the movement of the stars or predicting whether an incident would take place based on factual evidence.

Let us look at the possible thought process of a seasoned astrologer predicting the future and connect it back to the above framework. The science of astrology interprets future outcomes based on the behavioral impact that the movement of stars or planets would have on a particular individual. An astrologer’s frame of reference and the bounds of the environment are gathered from two legs of factual evidence. The first leg holds that the human brain is a simple transceiver that generates and decodes electromagnetic and electrochemical impulses, and that the way we act has a strong bearing on the anomalies observed in these signals. The second leg of evidence is derived from the factors that can plausibly manipulate these signals, *“the cosmic forces”* being one of them. This could also be traced back to the etymology of the word “lunatic”: the root of this word is luna, which means *‘of the moon’*. This brings us to define the environment this community looks at: “Individuals, their psyche, their actions (leading to future outcomes), cosmic movements and their impacts on the individual’s actions”. Determining the behavior of an individual based on his current status and past actions brings in elements of time-series autoregression, and the effect of external factors such as cosmic movements brings in cross-sectional and panel data. Now that the prediction environment has been established, astrologers try to draw a regression line mentally, based on the conventional wisdom inherited from their community. This practice has evolved a lot since its inception, and the methods of prediction have also changed gradually.

Similar analogies could be drawn for other nonscientific prediction techniques such as palmistry and tarot card reading, each of which has its own logical reasoning backing the methodology. Until recently, traditional business decisions and outcome estimates were no different: they were based on heuristics, patterns or the guesstimates of managers. The major advantage we have over previous generations in identifying and measuring these impact variables is the advancement of technology. Now we can do the same in a more systematic and scientific way by leveraging statistical modeling techniques, vast data sets, machine learning and the powerful processing capabilities of computers, and take prediction to unprecedented levels of accuracy.

Let us look at an example: determining the probability of a ‘Tail’ outcome on flipping a fair coin. Empirical records conveniently call out the probability as 0.5, i.e., they assign an equal chance to each of the two outcomes. What if we determine all the factors that affect the outcome? Given that the weight of the coin is constant, let us assume that the person flipping the coin wears a sensor band on his index finger, which can measure “position of the thumb”, “coin placement”, “force of the flip”, “angular momentum of the flip” and “gravity at the place of flip”, and that the outcomes are noted over several trials. We can constantly evolve the model by identifying missed variables like “direction and strength of the wind” and eliminating the ones with spurious correlation, to increase the accuracy.

Every system has impact variables that are external as well as internal to its environment; some of these can be controlled and some can’t. Success in prediction lies in understanding the dynamic environment, learning from error and constantly improving the estimation model, with a systemic shift in the frame of reference whenever necessary. Hence, in order to forecast outcomes for a business or an individual accurately, the modern-day *‘Data prophets’* (popularly referred to as ‘Data Scientists’) can’t rely on static prediction models anymore and need to equip themselves with the right kind of technology to know “**when their stars move**”.