
M.Tech Thesis RGPV CSE



Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection) [1].
Several clustering methods have been studied, including density-based clustering (DBSCAN), particle swarm optimization (PSO), hierarchical clustering, hierarchical agglomerative clustering (HAC), C-means, and K-means. Some algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. In the following sections, we examine the K-means clustering algorithm combined with a genetic algorithm on the Breast cancer dataset, Thyroid dataset and E. coli dataset in detail. This work evaluates the performance of K-means with VSM and a genetic algorithm on the breast cancer and thyroid datasets.

A. The basic idea of selecting initial cluster centers using a genetic algorithm

In the proposed algorithm, we first use a random function to select K data objects as initial cluster centers to form a chromosome; a total of M chromosomes are selected. We then run K-means on each group of cluster centers in the initial population to compute its fitness, select individuals according to the fitness of each chromosome, apply crossover and mutation to the high-fitness chromosomes while eliminating the low-fitness ones, and finally form the next-generation population. In this way, the average fitness rises with each new generation, each cluster center moves closer to the optimal cluster center, and finally the chromosome with the highest fitness is selected as the set of initial cluster centers.

Algorithm
*********************************************************************
1. Choose a number of clusters k.
2. Initialize cluster centers ........... based on mode
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Stop when there are no new re-assignments.
6. GA-based refinement
   a) Construct the initial population (p1).
   b) Calculate the global minimum (Gmin).
   c) For i = 1 to N do
      i. Perform reproduction.
      ii. Apply the crossover operator between each pair of parents.
      iii. Perform mutation and get the new population (p2).
      iv. Calculate the local minimum (Lmin).
      v. If Gmin < Lmin then
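
As an illustration only, the K-means loop in steps 3 to 5 and the genetic selection of initial cluster centers described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation. The text does not define the fitness function, the crossover scheme, or the selection rule, so the sketch assumes a fitness of 1 / (1 + SSE), one-point crossover, survival of the fitter half of the population, and elitism. All function names, parameter names and default values (`pop_size`, `generations`, `p_mut`, `iters`) are hypothetical, and the sample points are toy data rather than any of the datasets named above. The truncated GA-based refinement in step 6 is not reproduced.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, max_iter=100):
    """Steps 3-5: assign points, recompute means, stop when nothing changes."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # step 5: no new re-assignments
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse

def fitness(points, chrom, iters=5):
    """Assumed fitness: higher when a short K-means run gives a lower SSE."""
    _, _, sse = kmeans(points, [points[i] for i in chrom], max_iter=iters)
    return 1.0 / (1.0 + sse)

def ga_initial_centers(points, k, pop_size=10, generations=20, p_mut=0.1, seed=0):
    """Evolve chromosomes (each holds K point indices); return the fittest as centers."""
    rng = random.Random(seed)
    n = len(points)
    # M = pop_size chromosomes, each made of K randomly chosen data objects
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(points, c), reverse=True)
        parents = ranked[:max(2, pop_size // 2)]  # eliminate low-fitness chromosomes
        children = [list(parents[0])]             # elitism (an assumption)
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = a[:cut] + b[cut:]             # one-point crossover
            for g in range(k):
                if rng.random() < p_mut:          # mutation: swap in a random point
                    child[g] = rng.randrange(n)
            if len(set(child)) == k:              # reject repeated indices
                children.append(child)
        pop = children
    best = max(pop, key=lambda c: fitness(points, c))
    return [points[i] for i in best]

# Toy data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
init = ga_initial_centers(points, k=2)
centers, labels, sse = kmeans(points, init)
```

Because the fitness of a chromosome is obtained by running K-means from the centers it encodes, each generation costs `pop_size` short K-means runs; the small `iters` cap is one way to keep that affordable, at the price of a coarser fitness estimate.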