
K-Means Clustering

1. K-Means Clustering

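Pick k mean points at random, then split the data into k groups by assigning each point to the mean nearest to it (Expectation). Next, compute a new mean point inside each group (Maximization) and regroup the data around the new means. These two steps are repeated until the groups stop changing.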
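
The loop described above can be sketched in plain Java, independent of Hadoop and Mahout. This is only a minimal illustration, not Mahout's implementation: the names KMeansSketch, cluster, and squaredDistance are made up for this sketch, the first k points are used as the initial means (the same simple seeding the Mahout example in section 2 uses), and the input is assumed to hold at least k points of equal dimension.

```java
import java.util.Arrays;

/** Minimal k-means sketch; all names here are illustrative, not Mahout's. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the cluster index of each point. Assumes points.length >= k. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        // Seed the means with the first k points.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest mean.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centers[c])
                            < squaredDistance(points[p], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point moved, so the grouping has converged
            }
            // Update step: move each mean to the average of its group.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave the mean of an empty group where it is
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {
            {1.0, 1.0}, {9.0, 9.0}, {1.2, 0.8}, {8.8, 9.1}, {0.9, 1.1}, {9.2, 8.9}
        };
        // Prints [0, 1, 0, 1, 0, 1]
        System.out.println(Arrays.toString(cluster(points, 2, 10)));
    }
}
```

With two well-separated groups, the assignment settles after the first pass and the second pass only confirms that nothing moved.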

It is mostly used as a preprocessing step before other algorithms are applied, and it is put to work in predictive analysis and in customer/market segmentation.

2. k-means source using Hadoop and Mahout
- CsvUtils.java
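import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class CsvUtils {
	// Reads a comma-separated file of numbers into a list of rows.
	public static List<List<Double>> read(String filename) {
		List<List<Double>> data = new ArrayList<List<Double>>();

		// Read the whole file into a string
		String inputString = null;
		try {
			File file = new File(filename);
			byte[] b = new byte[(int) file.length()];
			DataInputStream in = new DataInputStream(new FileInputStream(file));
			try {
				in.readFully(b); // a plain read() may return fewer bytes
			} finally {
				in.close();
			}
			inputString = new String(b);
		} catch (Exception e) {
			// On any read error, return the empty list
			return data;
		}

		inputString = inputString.replaceAll("\r", "");
		String[] rowdata = inputString.split("\n");
		for (String str : rowdata) {
			if (str.trim().isEmpty()) {
				continue; // skip blank lines
			}
			String[] values = str.split(",");
			List<Double> row = new ArrayList<Double>();
			for (String value : values) {
				row.add(Double.valueOf(value.trim()));
			}
			data.add(row);
		}
		return data;
	}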

	// Reads the CSV file and converts each row into a Mahout Vector.
	public static List<Vector> readVectors(String filename) {
		List<List<Double>> data = read(filename);
		return convert(data);
	}

	public static List<Vector> convert(List<List<Double>> data) {
		List<Vector> vectors = new ArrayList<Vector>();
		for (List<Double> rowdata : data) {
			double[] value = new double[rowdata.size()];
			for (int i = 0; i < rowdata.size(); i++) {
				value[i] = rowdata.get(i);
			}
			Vector vector = new DenseVector(value);
			vectors.add(vector);
		}
		return vectors;
	}
}

- KmeansExamples.java
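// Package names assume the Mahout 0.7-era API that this example appears to
// target (Kluster, KMeansDriver); they may differ in other releases.
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.Kluster;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class KmeansExamples 
{
	// Writes the points to a SequenceFile of (LongWritable, VectorWritable) records.
	public static void writePointToFile(List<Vector> points, String fileName,
			FileSystem fs, Configuration conf) throws IOException {
		// Writer
		Path path = new Path(fileName);
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				LongWritable.class, VectorWritable.class);

		// Write to File
		VectorWritable vec = new VectorWritable();
		long recordNum = 0;
		for (Vector point : points) {
			vec.set(point);
			// Increment the record number so each point gets its own key
			writer.append(new LongWritable(recordNum++), vec);
		}
		writer.close();
	}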

	public static void main(String[] args) throws Exception 
	{
		// Create the working directories if they do not exist yet
		File kmeansData = new File("kmeansdata");
		if (!kmeansData.exists()) {
			kmeansData.mkdir();
		}
		kmeansData = new File("kmeansdata/points");
		if (!kmeansData.exists()) {
			kmeansData.mkdir();
		}

		Path input = new Path("kmeansdata/points");
		Path output = new Path("output");
		Path clustersIn = new Path("kmeansdata/clusters");
		// Import the Iris data as vectors
		List<Vector> vectors = 
		 CsvUtils.readVectors("/mahout-work-k2/kmeans/iris.csv");

		// Config
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		writePointToFile(vectors, "kmeansdata/points/irisfile", fs, conf);
		// Path
		Path path = new Path("kmeansdata/clusters/part-00000");
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				Text.class, Kluster.class);

		// Initialize the clusters: the first clusterNum points become the initial centers
		int clusterNum = 3;
		for (int i = 0; i < clusterNum; i++) {
			Vector vec = vectors.get(i);
			Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
			writer.append(new Text(cluster.getIdentifier()), cluster);
		}
		writer.close();

		// Run the K-Means algorithm
		KMeansDriver.run(conf, input, clustersIn, output, 
		 new EuclideanDistanceMeasure(), 0.001, 10, true, 0.0, true);

		// Read the clustered points back in order to display the result
		SequenceFile.Reader reader = new SequenceFile.Reader(fs, 
		 new Path("output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-0"), conf);

		IntWritable key = new IntWritable();
		WeightedVectorWritable value = new WeightedVectorWritable();
		while (reader.next(key, value)) {
			System.out.println(value.toString() + " belongs to cluster "
					+ key.toString());
		}
		reader.close();
	}
}

3. Running k-means with Hadoop and Mahout, and the results
- The data can be downloaded here.
- Run command
> bin/hadoop jar /mahout-work-k2/kmeans/mahout-kmeans-1.0-jar-with-dependencies.jar
 com.mimul.mahout.kmeans.KmeansExamples

- Output
0.6888927213080215: [5.100, 3.500, 1.400, 0.200] belongs to cluster 2
0.6386337642109103: [4.900, 3.000, 1.400, 0.200] belongs to cluster 2
0.6504599369011972: [4.700, 3.200, 1.300, 0.200] belongs to cluster 2
0.6272893372945867: [4.600, 3.100, 1.500, 0.200] belongs to cluster 2
0.6815356834273322: [5.000, 3.600, 1.400, 0.200] belongs to cluster 2
0.5859256992683831: [5.400, 3.900, 1.700, 0.400] belongs to cluster 2
0.6474167708384064: [4.600, 3.400, 1.400, 0.300] belongs to cluster 2
0.7023920808945009: [5.000, 3.400, 1.500, 0.200] belongs to cluster 2
0.5942035845831581: [4.400, 2.900, 1.400, 0.200] belongs to cluster 2
0.6483699554471285: [4.900, 3.100, 1.500, 0.100] belongs to cluster 2
0.6250284189467693: [5.400, 3.700, 1.500, 0.200] belongs to cluster 2
0.6656887476180753: [4.800, 3.400, 1.600, 0.200] belongs to cluster 2
0.6334113616805834: [4.800, 3.000, 1.400, 0.100] belongs to cluster 2
0.5951393341088583: [4.300, 3.000, 1.100, 0.100] belongs to cluster 2
0.5651054952440013: [5.800, 4.000, 1.200, 0.200] belongs to cluster 2
0.5339151864664161: [5.700, 4.400, 1.500, 0.400] belongs to cluster 2
0.6063713504933184: [5.400, 3.900, 1.300, 0.400] belongs to cluster 2
0.686611531481298: [5.100, 3.500, 1.400, 0.300] belongs to cluster 2
0.5634654849800831: [5.700, 3.800, 1.700, 0.300] belongs to cluster 2
0.6421720259791107: [5.100, 3.800, 1.500, 0.300] belongs to cluster 2
0.6175064937649044: [5.400, 3.400, 1.700, 0.200] belongs to cluster 2
0.6497223140018625: [5.100, 3.700, 1.500, 0.400] belongs to cluster 2
0.6304990821287748: [4.600, 3.600, 1.000, 0.200] belongs to cluster 2
0.6286978448626435: [5.100, 3.300, 1.700, 0.500] belongs to cluster 2
0.6142949307166078: [4.800, 3.400, 1.900, 0.200] belongs to cluster 2
0.6282070140111853: [5.000, 3.000, 1.600, 0.200] belongs to cluster 2
0.6671673850494027: [5.000, 3.400, 1.600, 0.400] belongs to cluster 2
0.6707631055801095: [5.200, 3.500, 1.500, 0.200] belongs to cluster 2
0.6754812098221507: [5.200, 3.400, 1.400, 0.200] belongs to cluster 2
0.6404810616711671: [4.700, 3.200, 1.600, 0.200] belongs to cluster 2
0.6377979985002025: [4.800, 3.100, 1.600, 0.200] belongs to cluster 2
0.6286390732904881: [5.400, 3.400, 1.500, 0.400] belongs to cluster 2
0.59928394362062: [5.200, 4.100, 1.500, 0.100] belongs to cluster 2
0.5731073018682336: [5.500, 4.200, 1.400, 0.200] belongs to cluster 2
0.6483699554471285: [4.900, 3.100, 1.500, 0.100] belongs to cluster 2
0.6610228398148751: [5.000, 3.200, 1.200, 0.200] belongs to cluster 2
0.6242749343330032: [5.500, 3.500, 1.300, 0.200] belongs to cluster 2
0.6483699554471285: [4.900, 3.100, 1.500, 0.100] belongs to cluster 2
0.6043480643815491: [4.400, 3.000, 1.300, 0.200] belongs to cluster 2
0.6911315554131179: [5.100, 3.400, 1.500, 0.200] belongs to cluster 2
0.6841113167726371: [5.000, 3.500, 1.300, 0.300] belongs to cluster 2
0.542836063991656: [4.500, 2.300, 1.300, 0.300] belongs to cluster 2
0.6173846400721656: [4.400, 3.200, 1.300, 0.200] belongs to cluster 2
0.6331613241212302: [5.000, 3.500, 1.600, 0.600] belongs to cluster 2
0.5906810747533399: [5.100, 3.800, 1.900, 0.400] belongs to cluster 2
0.632912575313151: [4.800, 3.000, 1.400, 0.300] belongs to cluster 2
0.6366524666170508: [5.100, 3.800, 1.600, 0.200] belongs to cluster 2
0.6395689841354508: [4.600, 3.200, 1.400, 0.200] belongs to cluster 2
0.6383231857591021: [5.300, 3.700, 1.500, 0.200] belongs to cluster 2
0.689860827941975: [5.000, 3.300, 1.400, 0.200] belongs to cluster 2
0.41113544169820876: [7.000, 3.200, 4.700, 1.400] belongs to cluster 0
0.47979783563133754: [6.400, 3.200, 4.500, 1.500] belongs to cluster 1
0.437353457764317: [6.900, 3.100, 4.900, 1.500] belongs to cluster 0
0.5194168891446207: [5.500, 2.300, 4.000, 1.300] belongs to cluster 1
0.4828739100032952: [6.500, 2.800, 4.600, 1.500] belongs to cluster 1
0.5767663897478309: [5.700, 2.800, 4.500, 1.300] belongs to cluster 1
0.4597168930335657: [6.300, 3.300, 4.700, 1.600] belongs to cluster 1
0.4199494359793022: [4.900, 2.400, 3.300, 1.000] belongs to cluster 1
0.47026694319283974: [6.600, 2.900, 4.600, 1.300] belongs to cluster 1
0.49959267338464214: [5.200, 2.700, 3.900, 1.400] belongs to cluster 1
0.4366070780271616: [5.000, 2.000, 3.500, 1.000] belongs to cluster 1
0.5610182766983255: [5.900, 3.000, 4.200, 1.500] belongs to cluster 1
0.5064436705413087: [6.000, 2.200, 4.000, 1.000] belongs to cluster 1
0.5278475319029214: [6.100, 2.900, 4.700, 1.400] belongs to cluster 1
0.4882721093893237: [5.600, 2.900, 3.600, 1.300] belongs to cluster 1
0.45893866873428396: [6.700, 3.100, 4.400, 1.400] belongs to cluster 1
0.5473267679990427: [5.600, 3.000, 4.500, 1.500] belongs to cluster 1
0.5395143457040881: [5.800, 2.700, 4.100, 1.000] belongs to cluster 1
0.5111198329689384: [6.200, 2.200, 4.500, 1.500] belongs to cluster 1
0.517935864480177: [5.600, 2.500, 3.900, 1.100] belongs to cluster 1
0.47471224438136367: [5.900, 3.200, 4.800, 1.800] belongs to cluster 1
0.5425686035208592: [6.100, 2.800, 4.000, 1.300] belongs to cluster 1
0.4748135370896512: [6.300, 2.500, 4.900, 1.500] belongs to cluster 1
0.5297725555150643: [6.100, 2.800, 4.700, 1.200] belongs to cluster 1
0.5149612001876201: [6.400, 2.900, 4.300, 1.300] belongs to cluster 1
0.47687362598543576: [6.600, 3.000, 4.400, 1.400] belongs to cluster 1
0.4280264824983088: [6.800, 2.800, 4.800, 1.400] belongs to cluster 1
0.4532952604228549: [6.700, 3.000, 5.000, 1.700] belongs to cluster 0
0.5670844853075032: [6.000, 2.900, 4.500, 1.500] belongs to cluster 1
0.4717777513862311: [5.700, 2.600, 3.500, 1.000] belongs to cluster 1
0.4995596097525722: [5.500, 2.400, 3.800, 1.100] belongs to cluster 1
0.4847116353154916: [5.500, 2.400, 3.700, 1.000] belongs to cluster 1
0.5352187429607588: [5.800, 2.700, 3.900, 1.200] belongs to cluster 1
0.4648647793415991: [6.000, 2.700, 5.100, 1.600] belongs to cluster 1
0.5278711919357354: [5.400, 3.000, 4.500, 1.500] belongs to cluster 1
0.48869885574219213: [6.000, 3.400, 4.500, 1.600] belongs to cluster 1
0.4336882390795484: [6.700, 3.100, 4.700, 1.500] belongs to cluster 1
0.5160518630417533: [6.300, 2.300, 4.400, 1.300] belongs to cluster 1
0.5403770862919508: [5.600, 3.000, 4.100, 1.300] belongs to cluster 1
0.5309040574319317: [5.500, 2.500, 4.000, 1.300] belongs to cluster 1
0.5499438591880227: [5.500, 2.600, 4.400, 1.200] belongs to cluster 1
0.5329133143361003: [6.100, 3.000, 4.600, 1.400] belongs to cluster 1
0.5472149415966028: [5.800, 2.600, 4.000, 1.200] belongs to cluster 1
0.42441579250074224: [5.000, 2.300, 3.300, 1.000] belongs to cluster 1
0.5653337578586187: [5.600, 2.700, 4.200, 1.300] belongs to cluster 1
0.5493935020203055: [5.700, 3.000, 4.200, 1.200] belongs to cluster 1
0.5672413379786381: [5.700, 2.900, 4.200, 1.300] belongs to cluster 1
0.5468519666470136: [6.200, 2.900, 4.300, 1.300] belongs to cluster 1
0.4029723567598556: [5.100, 2.500, 3.000, 1.100] belongs to cluster 1
0.5629841887957153: [5.700, 2.800, 4.100, 1.300] belongs to cluster 1
0.5357527764819039: [6.300, 3.300, 6.000, 2.500] belongs to cluster 0
0.4555050739488057: [5.800, 2.700, 5.100, 1.900] belongs to cluster 1
0.6051502714265026: [7.100, 3.000, 5.900, 2.100] belongs to cluster 0
0.5061160102350098: [6.300, 2.900, 5.600, 1.800] belongs to cluster 0
0.5786027956774659: [6.500, 3.000, 5.800, 2.200] belongs to cluster 0
0.5344590836993542: [7.600, 3.000, 6.600, 2.100] belongs to cluster 0
0.48313509541216626: [4.900, 2.500, 4.500, 1.700] belongs to cluster 1
0.5525432434003902: [7.300, 2.900, 6.300, 1.800] belongs to cluster 0
0.5304534625385895: [6.700, 2.500, 5.800, 1.800] belongs to cluster 0
0.5524049282812538: [7.200, 3.600, 6.100, 2.500] belongs to cluster 0
0.4784050626473252: [6.500, 3.200, 5.100, 2.000] belongs to cluster 0
0.47513203801732123: [6.400, 2.700, 5.300, 1.900] belongs to cluster 0
0.5975183057260267: [6.800, 3.000, 5.500, 2.100] belongs to cluster 0
0.4648051892084705: [5.700, 2.500, 5.000, 2.000] belongs to cluster 1
0.41711021836669054: [5.800, 2.800, 5.100, 2.400] belongs to cluster 1
0.508620254023756: [6.400, 3.200, 5.300, 2.300] belongs to cluster 0
0.5304088116482933: [6.500, 3.000, 5.500, 1.800] belongs to cluster 0
0.5140313460441296: [7.700, 3.800, 6.700, 2.200] belongs to cluster 0
0.511977416667424: [7.700, 2.600, 6.900, 2.300] belongs to cluster 0
0.4768176093636466: [6.000, 2.200, 5.000, 1.500] belongs to cluster 1
0.607610451253004: [6.900, 3.200, 5.700, 2.300] belongs to cluster 0
0.47460018390718345: [5.600, 2.800, 4.900, 2.000] belongs to cluster 1
0.5218111604420561: [7.700, 2.800, 6.700, 2.000] belongs to cluster 0
0.45254785601542225: [6.300, 2.700, 4.900, 1.800] belongs to cluster 1
0.599644222866059: [6.700, 3.300, 5.700, 2.100] belongs to cluster 0
0.5745312729251248: [7.200, 3.200, 6.000, 1.800] belongs to cluster 0
0.4757381156725546: [6.200, 2.800, 4.800, 1.800] belongs to cluster 1
0.46130963509011796: [6.100, 3.000, 4.900, 1.800] belongs to cluster 1
0.5345818162975757: [6.400, 2.800, 5.600, 2.100] belongs to cluster 0
0.5534169492800548: [7.200, 3.000, 5.800, 1.600] belongs to cluster 0
0.555769554769629: [7.400, 2.800, 6.100, 1.900] belongs to cluster 0
0.5108800903342556: [7.900, 3.800, 6.400, 2.000] belongs to cluster 0
0.5357017421752822: [6.400, 2.800, 5.600, 2.200] belongs to cluster 0
0.4397166078658171: [6.300, 2.800, 5.100, 1.500] belongs to cluster 1
0.4326110236600522: [6.100, 2.600, 5.600, 1.400] belongs to cluster 0
0.544133801058598: [7.700, 3.000, 6.100, 2.300] belongs to cluster 0
0.5216863076049117: [6.300, 3.400, 5.600, 2.400] belongs to cluster 0
0.5167580861417744: [6.400, 3.100, 5.500, 1.800] belongs to cluster 0
0.4841999039103601: [6.000, 3.000, 4.800, 1.800] belongs to cluster 1
0.5793384618243713: [6.900, 3.100, 5.400, 2.100] belongs to cluster 0
0.5810855011359789: [6.700, 3.100, 5.600, 2.400] belongs to cluster 0
0.5162112247403986: [6.900, 3.100, 5.100, 2.300] belongs to cluster 0
0.4555050739488057: [5.800, 2.700, 5.100, 1.900] belongs to cluster 1
0.6039018153944413: [6.800, 3.200, 5.900, 2.300] belongs to cluster 0
0.569550348203404: [6.700, 3.300, 5.700, 2.500] belongs to cluster 0
0.5222772247143679: [6.700, 3.000, 5.200, 2.300] belongs to cluster 0
0.43429906305743937: [6.300, 2.500, 5.000, 1.900] belongs to cluster 1
0.494262186893462: [6.500, 3.000, 5.200, 2.000] belongs to cluster 0
0.4915147353874284: [6.200, 3.400, 5.400, 2.300] belongs to cluster 0
0.44733886678690127: [5.900, 3.000, 5.100, 1.800] belongs to cluster 1
The output shows that the points have been classified into clusters 0, 1, and 2.
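
To see how many points landed in each cluster, the printed lines can be tallied. The following is a small sketch under the assumption that the output was saved as text in exactly the "... belongs to cluster N" form shown above; ClusterTally and countByCluster are hypothetical names and are not part of Mahout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts points per cluster from the printed output lines (illustrative helper). */
final class ClusterTally {
    private static final String MARKER = " belongs to cluster ";

    static Map<Integer, Integer> countByCluster(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (String line : lines) {
            int at = line.lastIndexOf(MARKER);
            if (at < 0) {
                continue; // not a result line
            }
            // The cluster id is whatever follows the marker text.
            int cluster = Integer.parseInt(line.substring(at + MARKER.length()).trim());
            Integer old = counts.get(cluster);
            counts.put(cluster, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four lines copied from the output above
        List<String> sample = Arrays.asList(
            "0.6888927213080215: [5.100, 3.500, 1.400, 0.200] belongs to cluster 2",
            "0.41113544169820876: [7.000, 3.200, 4.700, 1.400] belongs to cluster 0",
            "0.47979783563133754: [6.400, 3.200, 4.500, 1.500] belongs to cluster 1",
            "0.6386337642109103: [4.900, 3.000, 1.400, 0.200] belongs to cluster 2");
        // Prints {0=1, 1=1, 2=2}
        System.out.println(countByCluster(sample));
    }
}
```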

4. k-means in R
This is the scatter plot obtained by extracting only Petal.Length and Petal.Width from the iris.csv file and running the k-means algorithm in two dimensions.

- R sample source
df <- read.csv("C:\\Project\\workspace\\2012\\mahout-kmeans\\src\\test\\resources\\iris.csv",
 header = F, sep = ",", dec = ".", quote = "")
# Use only petal length (V3) and petal width (V4)
m <- cbind(df$V3, df$V4)
cl <- kmeans(m, 3)

df$cluster <- factor(cl$cluster)
centers <- as.data.frame(cl$centers)
library(ggplot2)

ggplot(data=df, aes(x=V3, y=V4, color=cluster )) + 
 geom_point() + 
 geom_point(data=centers, aes(x=V1,y=V2, color='Center')) +
 geom_point(data=centers, aes(x=V1,y=V2, color='Center'), size=52, 
  alpha=.3, show.legend=FALSE)

- Result screen


Avatar: 한겨레

Re: K-Means Clustering

Hello.

 

Is the input here already converted to a sequence file?

Or does the code read the csv file as it is and take care of the conversion to a sequence file by itself?

Avatar: 미물

Re: K-Means Clustering

The source takes the iris.csv file as input, processes it internally, and converts it to a sequence file automatically.

