Cluster analysis, also known as unsupervised learning, is the process of placing a set of objects into groups. These groups are called clusters, and they are formed so that objects in the same group are more similar to each other than to objects in other groups. The objective of clustering is to determine the intrinsic grouping in a set of unlabeled data. Clustering algorithms can be classified according to their cluster model; the basic models are connectivity, centroid, distribution, density, subspace and group models.
Some well-known clustering algorithms based on these cluster models are hierarchical clustering (connectivity based), k-means clustering (centroid based), the Expectation-Maximization algorithm (distribution based) and DBSCAN (density based).
The k-means algorithm is one of the most popular basic clustering algorithms. The main idea is to define k centroids, one for each cluster, and then to take each point in the data set and assign it to its nearest centroid. After that, the k centroids are re-calculated as the averages of the clusters resulting from the previous step, and the assignment step is repeated in a loop.
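To make the assignment step concrete, here is a small stand-alone sketch in R; the point and centroid values are made up for illustration and are not part of the implementation further down:

# Toy example: assign a single point to the nearest of k = 2 centroids
point <- c(1.0, 2.0)
centroids <- list(c(0.0, 0.0), c(1.5, 2.5))

# Euclidean distance from the point to every centroid
distances <- sapply(centroids, function(centroid) sqrt(sum((centroid - point)^2)))

# Index of the nearest centroid (here the second one)
nearest <- which.min(distances)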
A k-means implementation in the R language follows. Note that this implementation is adapted from an example in the book Programming Collective Intelligence.
The kcluster function returns a list of lists, where each inner list contains the row indices of the instances assigned to the corresponding cluster.
kcluster <- function(data_matrix, k){
  numOfRow <- nrow(data_matrix)
  numOfCol <- ncol(data_matrix)

  # minVal is a vector that contains the min values of the columns
  minVal <- as.vector(apply(data_matrix, 2, min))
  # maxVal is a vector that contains the max values of the columns
  maxVal <- as.vector(apply(data_matrix, 2, max))
  # difference is a vector that contains the difference between
  # the min and max values of the columns
  difference <- as.vector(maxVal - minVal)

  # Create k randomly placed centroids inside the range of the data
  ClusterList <- list()
  for (j in 1:k){
    # Scale each column range by a uniform random number on (0,1)
    # and shift it by the column minimum
    randed_diff <- difference * runif(numOfCol, 0, 1)
    ClusterList[[j]] <- randed_diff + minVal
  }

  for (iteration in 1:100){
    # print(paste('Iteration', iteration))
    bestMatches <- rep(list(list()), k)

    # Find which centroid is the closest for each row
    for (j in 1:numOfRow){
      CurrentRow <- c(data_matrix[j, ])
      bestMatch <- 1
      for (m in 1:k){
        dist_1 <- sqrt(sum((ClusterList[[m]] - CurrentRow) ^ 2))
        dist_2 <- sqrt(sum((ClusterList[[bestMatch]] - CurrentRow) ^ 2))
        if (dist_1 < dist_2) bestMatch <- m
      }
      bestMatches[[bestMatch]] <- append(bestMatches[[bestMatch]], j)
    }

    # Move each centroid to the average of its members
    for (m in 1:k){
      members <- unlist(bestMatches[[m]])
      if (length(members) > 0){
        avgs <- rep(0, numOfCol)
        for (index in members){
          avgs <- avgs + data_matrix[index, ]
        }
        ClusterList[[m]] <- avgs / length(members)
      }
    }
  }
  return(bestMatches)
}
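As a usage sketch (the built-in iris data set and k = 3 are chosen here only for illustration, they are not part of the original example), the function can be called on a numeric matrix like this:

# Cluster the four numeric columns of iris into k = 3 groups
data_matrix <- as.matrix(iris[, 1:4])
clusters <- kcluster(data_matrix, 3)

# Row indices of the instances assigned to the first cluster
unlist(clusters[[1]])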
Reference:
Segaran, T., Programming Collective Intelligence, O'Reilly Media, 2007.