
Data Mining & Analysis






DATA MINING AND ANALYSIS

The fundamental algorithms in data mining and analysis form the basis for the emerging field of data science, which includes automated methods to analyze patterns and models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics. This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics. The main parts of the book include exploratory data analysis, pattern mining, clustering, and classification. The book lays the basic foundations of these tasks and also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. With its comprehensive coverage, algorithmic perspective, and wealth of examples, this book offers solid guidance in data mining for students, researchers, and practitioners alike.

Key Features:
• Covers both core methods and cutting-edge research
• Algorithmic approach with open-source implementations
• Minimal prerequisites, as all key mathematical concepts are presented, as is the intuition behind the formulas
• Short, self-contained chapters with class-tested examples and exercises that allow for flexibility in designing a course and for easy reference
• Supplementary online resource containing lecture slides, videos, project ideas, and more

Mohammed J. Zaki is a Professor of Computer Science at Rensselaer Polytechnic Institute, Troy, New York. Wagner Meira Jr. is a Professor of Computer Science at Universidade Federal de Minas Gerais, Brazil.

DATA MINING AND ANALYSIS
Fundamental Concepts and Algorithms
MOHAMMED J. ZAKI, Rensselaer Polytechnic Institute, Troy, New York
WAGNER MEIRA JR., Universidade Federal de Minas Gerais, Brazil

32 Avenue of the Americas, New York, NY 10013-2473, USA. Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence. www.cambridge.org. Information on this title: www.cambridge.org/9780521766333

© Mohammed J. Zaki and Wagner Meira Jr. 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2014. Printed in the United States of America. A catalog record for this publication is available from the British Library.

Library of Congress Cataloging in Publication Data: Zaki, Mohammed J., 1971– Data mining and analysis: fundamental concepts and algorithms / Mohammed J. Zaki, Rensselaer Polytechnic Institute, Troy, New York, Wagner Meira Jr., Universidade Federal de Minas Gerais, Brazil. pages cm. Includes bibliographical references and index. ISBN 978-0-521-76633-3 (hardback). 1. Data mining. I. Meira, Wagner, 1967– II. Title. QA76.9.D343Z36 2014 006.3′12–dc23 2013037544

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

Contents

Preface
1 Data Mining and Analysis (Data Matrix; Attributes; Data: Algebraic and Geometric View; Data: Probabilistic View; Data Mining; Further Reading; Exercises)

PART ONE: DATA ANALYSIS FOUNDATIONS
2 Numeric Attributes (Univariate Analysis; Bivariate Analysis; Multivariate Analysis; Data Normalization; Normal Distribution; Further Reading; Exercises)
3 Categorical Attributes (Univariate Analysis; Bivariate Analysis; Multivariate Analysis; Distance and Angle; Discretization; Further Reading; Exercises)
4 Graph Data (Graph Concepts; Topological Attributes; Centrality Analysis; Graph Models; Further Reading; Exercises)
5 Kernel Methods (Kernel Matrix; Vector Kernels; Basic Kernel Operations in Feature Space; Kernels for Complex Objects; Further Reading; Exercises)
6 High-dimensional Data (High-dimensional Objects; High-dimensional Volumes; Hypersphere Inscribed within Hypercube; Volume of Thin Hypersphere Shell; Diagonals in Hyperspace; Density of the Multivariate Normal; Appendix: Derivation of Hypersphere Volume; Further Reading; Exercises)
7 Dimensionality Reduction (Background; Principal Component Analysis; Kernel Principal Component Analysis; Singular Value Decomposition; Further Reading; Exercises)

PART TWO: FREQUENT PATTERN MINING
8 Itemset Mining (Frequent Itemsets and Association Rules; Itemset Mining Algorithms; Generating Association Rules; Further Reading; Exercises)
9 Summarizing Itemsets (Maximal and Closed Frequent Itemsets; Mining Maximal Frequent Itemsets: GenMax Algorithm; Mining Closed Frequent Itemsets: Charm Algorithm; Nonderivable Itemsets; Further Reading; Exercises)
10 Sequence Mining (Frequent Sequences; Mining Frequent Sequences; Substring Mining via Suffix Trees; Further Reading; Exercises)
11 Graph Pattern Mining (Isomorphism and Support; Candidate Generation; The gSpan Algorithm; Further Reading; Exercises)
12 Pattern and Rule Assessment (Rule and Pattern Assessment Measures; Significance Testing and Confidence Intervals; Further Reading; Exercises)

PART THREE: CLUSTERING
13 Representative-based Clustering (K-means Algorithm; Kernel K-means; Expectation-Maximization Clustering; Further Reading; Exercises)
14 Hierarchical Clustering (Preliminaries; Agglomerative Hierarchical Clustering; Further Reading; Exercises and Projects)
15 Density-based Clustering (The DBSCAN Algorithm; Kernel Density Estimation; Density-based Clustering: DENCLUE; Further Reading; Exercises)
16 Spectral and Graph Clustering (Graphs and Matrices; Clustering as Graph Cuts; Markov Clustering; Further Reading; Exercises)
17 Clustering Validation (External Measures; Internal Measures; Relative Measures; Further Reading; Exercises)

PART FOUR: CLASSIFICATION
18 Probabilistic Classification (Bayes Classifier; Naive Bayes Classifier; K Nearest Neighbors Classifier; Further Reading; Exercises)
19 Decision Tree Classifier (Decision Trees; Decision Tree Algorithm; Further Reading; Exercises)
20 Linear Discriminant Analysis (Optimal Linear Discriminant; Kernel Discriminant Analysis; Further Reading; Exercises)
21 Support Vector Machines (Support Vectors and Margins; SVM: Linear and Separable Case; Soft Margin SVM: Linear and Nonseparable Case; Kernel SVM: Nonlinear Case; SVM Training Algorithms; Further Reading; Exercises)
22 Classification Assessment (Classification Performance Measures; Classifier Evaluation; Bias-Variance Decomposition; Further Reading; Exercises)

Index

Preface

This book is an outgrowth of data mining courses at Rensselaer Polytechnic Institute (RPI) and Universidade Federal de Minas Gerais (UFMG); the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. Although there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text that focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines such as machine learning and statistics and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts.
It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that readers use their favorite data analysis and mining software to work through our examples and to implement the algorithms we describe in the text; we recommend the R software or the Python language with its NumPy package. The datasets used and other supplementary material such as project ideas and slides are available online at the book's companion site and its mirrors at RPI and UFMG:

• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook

Having understood the basic principles and algorithms in data mining and data analysis, readers will be well equipped to develop their own methods or use more advanced techniques.

Figure 0.1. Chapter dependencies

Suggested Roadmaps

The chapter dependency graph is shown in Figure 0.1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate-level course, we suggest the following chapters: 1–3, 8, 10, 12–15, 17–19, and 21–22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8–15, 17–19, and 21–22. For a graduate course, one possibility is to quickly go over the material in Part I or to assume it as background reading and to directly cover Chapters 9–22; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order. For a course on data analysis the chapters covered must include 1–7, 13–14, 15 (Section 2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 1–3), 11–12, 13 (Sections 1–2), 16–17, and 20–22.

Acknowledgments

Initial drafts of this book have been used in several data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to

• Muhammad Abulaish, Jamia Millia Islamia, India
• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas
• Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
• Suat Ozdemir, Gazi University, Turkey
• Naren Ramakrishnan, Virginia Polytechnic and State University
• Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
• Saeed Salem, North Dakota State University
• Ankur Teredesai, University of Washington, Tacoma
• Hannu Toivonen, University of Helsinki, Finland
• Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
• Jason T.L. Wang, New Jersey Institute of Technology
• Jianyong Wang, Tsinghua University, China
• Jiong Yang, Case Western Reserve University
• Jieping Ye, Arizona State University

We would like to thank all the students enrolled in our data mining courses at RPI and UFMG, as well as the anonymous reviewers who provided technical comments on various chapters.
We appreciate the collegial and supportive environment within the computer science departments at RPI and UFMG and at the Qatar Computing Research Institute. In addition, we thank NSF, CNPq, CAPES, FAPEMIG, Inweb – the National Institute of Science and Technology for the Web, and Brazil's Science without Borders program for their support. We thank Lauren Cowles, our editor at Cambridge University Press, for her guidance and patience in realizing this book.

Finally, on a more personal front, MJZ dedicates the book to his wife, Amina, for her love, patience and support over all these years, and to his children, Abrar and Afsah, and his parents. WMJ gratefully dedicates the book to his wife Patricia; to his children, Gabriel and Marina; and to his parents, Wagner and Marlene, for their love, encouragement, and inspiration.

CHAPTER 1 Data Mining and Analysis

Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering, and classification, laying out the roadmap for the book.

1.1 DATA MATRIX

Data can often be represented or abstracted as an $n \times d$ data matrix, with $n$ rows and $d$ columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The $n \times d$ data matrix is given as

$$\mathbf{D} = \begin{pmatrix} & X_1 & X_2 & \cdots & X_d \\ \mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

where $\mathbf{x}_i$ denotes the $i$th row, which is a $d$-tuple given as $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, and $X_j$ denotes the $j$th column, which is an $n$-tuple given as $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$.

Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances $n$ is referred to as the size of the data, whereas the number of attributes $d$ is called the dimensionality of the data.

Table 1.1. Extract from the Iris dataset

          Sepal length  Sepal width  Petal length  Petal width  Class
          $X_1$         $X_2$        $X_3$         $X_4$        $X_5$
x1        5.9           3.0          4.2           1.5          Iris-versicolor
x2        6.9           3.1          4.9           1.5          Iris-versicolor
x3        6.6           2.9          4.6           1.3          Iris-versicolor
x4        4.6           3.2          1.4           0.2          Iris-setosa
x5        6.0           2.2          4.0           1.0          Iris-versicolor
x6        4.7           3.2          1.3           0.2          Iris-setosa
x7        6.5           3.0          5.8           2.2          Iris-virginica
x8        5.8           2.7          5.1           1.9          Iris-virginica
...
x149      7.7           3.8          6.7           2.2          Iris-virginica
x150      5.1           3.4          1.5           0.2          Iris-setosa
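To make the data-matrix view concrete, here is a minimal NumPy sketch (the preface recommends Python with NumPy; the variable names are ours, not the book's) that stores the numeric attributes of a few rows of Table 1.1 as an $n \times d$ array, keeping the categorical class labels in a separate array:

```python
import numpy as np

# Numeric attributes (sepal length, sepal width, petal length, petal width)
# for the first few rows of Table 1.1; the class label is kept separately.
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
])
y = np.array(["Iris-versicolor", "Iris-versicolor", "Iris-versicolor",
              "Iris-setosa", "Iris-versicolor"])

n, d = D.shape      # size n and dimensionality d
x1 = D[0]           # row x_1: one entity as a d-dimensional point
X2 = D[:, 1]        # column X_2: one attribute as an n-dimensional vector
print(n, d, x1, X2)
```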
The analysis of a single attribute is referred to as univariate analysis, the simultaneous analysis of two attributes is called bivariate analysis, and the simultaneous analysis of more than two attributes is called multivariate analysis.

Example 1.1. Table 1.1 shows an extract of the Iris dataset; the complete data forms a $150 \times 5$ data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length, and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple
$$\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \text{Iris-versicolor})$$

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA and protein sequences), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix in which rows represent images and columns correspond to image features such as color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.

1.2 ATTRIBUTES

Attributes may be classified into two main types depending on their domain, that is, depending on the types of values they take on.

Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age with $domain(\text{Age}) = \mathbb{N}$, where $\mathbb{N}$ denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with $domain(\text{petal length}) = \mathbb{R}^{+}$ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set $\{0, 1\}$, it is called a binary attribute. Numeric attributes can be classified further into two types:

• Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, the attribute temperature measured in °C or °F is interval-scaled. If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as the previous day.
• Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for the attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.

Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of symbols.
For example, Sex and Education could be categorical attributes with their domains given as
$$domain(\text{Sex}) = \{\text{M}, \text{F}\} \qquad domain(\text{Education}) = \{\text{HighSchool}, \text{BS}, \text{MS}, \text{PhD}\}$$
Categorical attributes may be of two types:

• Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example, Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with $domain(\text{class}) = \{\text{iris-setosa}, \text{iris-versicolor}, \text{iris-virginica}\}$.
• Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another?) and inequality comparisons (is one value less than or greater than another?) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute because its domain values are ordered by increasing educational qualification.

1.3 DATA: ALGEBRAIC AND GEOMETRIC VIEW

If the $d$ attributes or dimensions in the data matrix $\mathbf{D}$ are all numeric, then each row can be considered as a $d$-dimensional point
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$
or equivalently, each row may be considered as a $d$-dimensional column vector (all vectors are assumed to be column vectors by default):
$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d$$
where $T$ is the matrix transpose operator.

The $d$-dimensional Cartesian coordinate space is specified via the $d$ unit vectors, called the standard basis vectors, along each of the axes. The $j$th standard basis vector $\mathbf{e}_j$ is the $d$-dimensional unit vector whose $j$th component is 1 and the rest of the components are 0:
$$\mathbf{e}_j = (0, \ldots, 1_j, \ldots, 0)^T$$
Any other vector in $\mathbb{R}^d$ can be written as a linear combination of the standard basis vectors. For example, each of the points $\mathbf{x}_i$ can be written as the linear combination
$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\mathbf{e}_j$$
where the scalar value $x_{ij}$ is the coordinate value along the $j$th axis or attribute.

Example 1.2. Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple $\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \text{Iris-versicolor})$ on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatterplot of all the $n = 150$ points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows $\mathbf{x}_1$ as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point $(5.9, 3.0, 4.2)$ can be seen as specifying the coefficients in the linear combination of the standard basis vectors in $\mathbb{R}^3$:
$$\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3 = 5.9\begin{pmatrix}1\\0\\0\end{pmatrix} + 3.0\begin{pmatrix}0\\1\\0\end{pmatrix} + 4.2\begin{pmatrix}0\\0\\1\end{pmatrix} = \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}$$

Figure 1.1. Row $\mathbf{x}_1$ as a point and vector in (a) $\mathbb{R}^2$ and (b) $\mathbb{R}^3$.
Figure 1.2. Scatterplot: sepal length versus sepal width. The solid circle shows the mean point.
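As a quick check on Example 1.2, this small sketch (ours, not from the book) writes $\mathbf{x}_1 = (5.9, 3.0, 4.2)$ as a linear combination of the standard basis vectors of $\mathbb{R}^3$:

```python
import numpy as np

e1, e2, e3 = np.eye(3)               # rows of the identity: standard basis of R^3
x1 = 5.9 * e1 + 3.0 * e2 + 4.2 * e3  # linear combination with coordinates as coefficients
print(x1)                            # [5.9 3.  4.2]
```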
Each numeric column or attribute can also be treated as a vector in an $n$-dimensional space $\mathbb{R}^n$:
$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
If all attributes are numeric, then the data matrix $\mathbf{D}$ is in fact an $n \times d$ matrix, also written as $\mathbf{D} \in \mathbb{R}^{n \times d}$, given as
$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} \text{---}\,\mathbf{x}_1^T\,\text{---} \\ \text{---}\,\mathbf{x}_2^T\,\text{---} \\ \vdots \\ \text{---}\,\mathbf{x}_n^T\,\text{---} \end{pmatrix} = \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}$$
As we can see, we can consider the entire dataset as an $n \times d$ matrix, or equivalently as a set of $n$ row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of $d$ column vectors $X_j \in \mathbb{R}^n$.

1.3.1 Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks. Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors given as $\mathbf{a} = (a_1, a_2, \ldots, a_m)^T$ and $\mathbf{b} = (b_1, b_2, \ldots, b_m)^T$.

Dot Product
The dot product between $\mathbf{a}$ and $\mathbf{b}$ is defined as the scalar value
$$\mathbf{a}^T\mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$

Length
The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as
$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$
The unit vector in the direction of $\mathbf{a}$ is given as
$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$
By definition $\mathbf{u}$ has length $\|\mathbf{u}\| = 1$, and it is also called a normalized vector, which can be used in lieu of $\mathbf{a}$ in some analysis tasks.
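A minimal NumPy illustration (ours) of the dot product, Euclidean norm, and unit vector:

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dot = a @ b                     # dot product a^T b = 17
length = np.sqrt(a @ a)         # Euclidean norm ||a|| = sqrt(34)
u = a / length                  # unit (normalized) vector in the direction of a
print(dot, length, np.linalg.norm(u))   # 17.0, about 5.83, 1.0
```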
The Euclidean norm is a special case of a general class of norms, known as the $L_p$-norm, defined as
$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{\frac{1}{p}}$$
for any $p \neq 0$. Thus, the Euclidean norm corresponds to the case when $p = 2$.

Distance
From the Euclidean norm we can define the Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$, as follows
$$\delta(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| = \sqrt{(\mathbf{a} - \mathbf{b})^T(\mathbf{a} - \mathbf{b})} = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2} \qquad (1.1)$$
Thus, the length of a vector is simply its distance from the zero vector $\mathbf{0}$, all of whose elements are 0, that is, $\|\mathbf{a}\| = \|\mathbf{a} - \mathbf{0}\| = \delta(\mathbf{a}, \mathbf{0})$. From the general $L_p$-norm we can define the corresponding $L_p$-distance function, given as follows
$$\delta_p(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_p \qquad (1.2)$$
If $p$ is unspecified, as in Eq. (1.1), it is assumed to be $p = 2$ by default.

Angle
The cosine of the smallest angle between vectors $\mathbf{a}$ and $\mathbf{b}$, also called the cosine similarity, is given as
$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T\!\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \qquad (1.3)$$
Thus, the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as the dot product of the unit vectors $\frac{\mathbf{a}}{\|\mathbf{a}\|}$ and $\frac{\mathbf{b}}{\|\mathbf{b}\|}$.

The Cauchy–Schwarz inequality states that for any vectors $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^m$
$$|\mathbf{a}^T\mathbf{b}| \leq \|\mathbf{a}\| \cdot \|\mathbf{b}\|$$
It follows immediately from the Cauchy–Schwarz inequality that $-1 \leq \cos\theta \leq 1$.

Figure 1.3. Distance and angle. Unit vectors are shown in gray.

Because the smallest angle $\theta \in [0°, 180°]$ and because $\cos\theta \in [-1, 1]$, the cosine similarity value ranges from $+1$, corresponding to an angle of $0°$, to $-1$, corresponding to an angle of $180°$ (or $\pi$ radians).

Orthogonality
Two vectors $\mathbf{a}$ and $\mathbf{b}$ are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$, which in turn implies that $\cos\theta = 0$, that is, the angle between them is $90°$ or $\frac{\pi}{2}$ radians. In this case, we say that they have no similarity.

Example 1.3 (Distance and Angle). Figure 1.3 shows the two vectors
$$\mathbf{a} = \begin{pmatrix}5\\3\end{pmatrix} \quad \text{and} \quad \mathbf{b} = \begin{pmatrix}1\\4\end{pmatrix}$$
Using Eq. (1.1), the Euclidean distance between them is given as
$$\delta(\mathbf{a}, \mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16 + 1} = \sqrt{17} = 4.12$$
The distance can also be computed as the magnitude of the vector
$$\mathbf{a} - \mathbf{b} = \begin{pmatrix}5\\3\end{pmatrix} - \begin{pmatrix}1\\4\end{pmatrix} = \begin{pmatrix}4\\-1\end{pmatrix}$$
because $\|\mathbf{a} - \mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.
The unit vector in the direction of $\mathbf{a}$ is given as
$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2 + 3^2}}\begin{pmatrix}5\\3\end{pmatrix} = \frac{1}{\sqrt{34}}\begin{pmatrix}5\\3\end{pmatrix} = \begin{pmatrix}0.86\\0.51\end{pmatrix}$$
The unit vector in the direction of $\mathbf{b}$ can be computed similarly:
$$\mathbf{u}_b = \begin{pmatrix}0.24\\0.97\end{pmatrix}$$
These unit vectors are also shown in gray in Figure 1.3.
By Eq. (1.3) the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as
$$\cos\theta = \frac{\begin{pmatrix}5\\3\end{pmatrix}^T\begin{pmatrix}1\\4\end{pmatrix}}{\sqrt{5^2 + 3^2}\,\sqrt{1^2 + 4^2}} = \frac{17}{\sqrt{34 \times 17}} = \frac{1}{\sqrt{2}}$$
We can get the angle by computing the inverse of the cosine: $\theta = \cos^{-1}(1/\sqrt{2}) = 45°$.
Let us consider the $L_p$-norm for $\mathbf{a}$ with $p = 3$; we get
$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$
The distance between $\mathbf{a}$ and $\mathbf{b}$ using Eq. (1.2) for the $L_p$-norm with $p = 3$ is given as
$$\|\mathbf{a} - \mathbf{b}\|_3 = \left\|(4, -1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
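The following sketch (ours) reproduces the quantities in Example 1.3; note that NumPy's vector norm with ord=3 uses absolute values, matching the $L_p$-norm definition above:

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                       # Euclidean (L2) distance, sqrt(17)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))           # angle in degrees
norm3_a = np.linalg.norm(a, ord=3)                 # L3-norm of a
dist3 = np.linalg.norm(a - b, ord=3)               # L3-distance between a and b

print(round(dist, 2), round(cos_theta, 4), round(theta, 1))   # 4.12 0.7071 45.0
print(round(norm3_a, 2), round(dist3, 2))                     # 5.34 4.02
```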
1.3.2 Mean and Total Variance

Mean
The mean of the data matrix $\mathbf{D}$ is the vector obtained as the average of all the points:
$$\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$$

Total Variance
The total variance of the data matrix $\mathbf{D}$ is the average squared distance of each point from the mean:
$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n}\delta(\mathbf{x}_i, \boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \qquad (1.4)$$
Simplifying Eq. (1.4) we obtain
$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) = \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\Big) + n\|\boldsymbol{\mu}\|^2\right) = \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\boldsymbol{\mu} + n\|\boldsymbol{\mu}\|^2\right) = \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2\right) - \|\boldsymbol{\mu}\|^2$$
The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).

Centered Data Matrix
Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points:
$$\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T = \begin{pmatrix}\mathbf{x}_1^T\\\mathbf{x}_2^T\\\vdots\\\mathbf{x}_n^T\end{pmatrix} - \begin{pmatrix}\boldsymbol{\mu}^T\\\boldsymbol{\mu}^T\\\vdots\\\boldsymbol{\mu}^T\end{pmatrix} = \begin{pmatrix}\mathbf{x}_1^T - \boldsymbol{\mu}^T\\\mathbf{x}_2^T - \boldsymbol{\mu}^T\\\vdots\\\mathbf{x}_n^T - \boldsymbol{\mu}^T\end{pmatrix} = \begin{pmatrix}\mathbf{z}_1^T\\\mathbf{z}_2^T\\\vdots\\\mathbf{z}_n^T\end{pmatrix} \qquad (1.5)$$
where $\mathbf{z}_i = \mathbf{x}_i - \boldsymbol{\mu}$ represents the centered point corresponding to $\mathbf{x}_i$, and $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector all of whose elements have value 1. The mean of the centered data matrix $\mathbf{Z}$ is $\mathbf{0} \in \mathbb{R}^d$, because we have subtracted the mean $\boldsymbol{\mu}$ from all the points $\mathbf{x}_i$.

1.3.3 Orthogonal Projection

Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point after a change of the basis vectors. Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors. An orthogonal decomposition of the vector $\mathbf{b}$ in the direction of another vector $\mathbf{a}$, illustrated in Figure 1.4, is given as
$$\mathbf{b} = \mathbf{b}_\parallel + \mathbf{b}_\perp = \mathbf{p} + \mathbf{r} \qquad (1.6)$$
where $\mathbf{p} = \mathbf{b}_\parallel$ is parallel to $\mathbf{a}$, and $\mathbf{r} = \mathbf{b}_\perp$ is perpendicular or orthogonal to $\mathbf{a}$. The vector $\mathbf{p}$ is called the orthogonal projection or simply projection of $\mathbf{b}$ on the vector $\mathbf{a}$. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to $\mathbf{b}$ on the line passing through $\mathbf{a}$. Thus, the magnitude of the vector $\mathbf{r} = \mathbf{b} - \mathbf{p}$ gives the perpendicular distance between $\mathbf{b}$ and $\mathbf{a}$, which is often interpreted as the residual or error vector between the points $\mathbf{b}$ and $\mathbf{p}$.

Figure 1.4. Orthogonal projection.

We can derive an expression for $\mathbf{p}$ by noting that $\mathbf{p} = c\mathbf{a}$ for some scalar $c$, as $\mathbf{p}$ is parallel to $\mathbf{a}$. Thus, $\mathbf{r} = \mathbf{b} - \mathbf{p} = \mathbf{b} - c\mathbf{a}$. Because $\mathbf{p}$ and $\mathbf{r}$ are orthogonal, we have
$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\,\mathbf{a}^T\mathbf{b} - c^2\,\mathbf{a}^T\mathbf{a} = 0$$
which implies that $c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$. Therefore, the projection of $\mathbf{b}$ on $\mathbf{a}$ is given as
$$\mathbf{p} = \mathbf{b}_\parallel = c\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \qquad (1.7)$$
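A small sketch (ours) of Eq. (1.7), reusing the vectors $\mathbf{a} = (5, 3)^T$ and $\mathbf{b} = (1, 4)^T$ from Example 1.3 purely for illustration:

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

c = (a @ b) / (a @ a)     # scalar coefficient a^T b / a^T a = 17/34 = 0.5
p = c * a                 # projection of b on a: [2.5, 1.5]
r = b - p                 # residual, orthogonal to a: [-1.5, 2.5]
print(p, r, round(p @ r, 10))   # p.r = 0 confirms orthogonality
```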
Example 1.4. Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as $\text{mean}(\mathbf{D}) = (5.843, 3.054)^T$, which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is $\text{var}(\mathbf{D}) = 0.868$ (centering does not change this value). Figure 1.5 shows the projection of each point onto the line $\ell$, which is the line that maximizes the separation between the class iris-setosa (squares) and the other two classes, namely iris-versicolor (circles) and iris-virginica (triangles). The line $\ell$ is given as the set of all points $(x_1, x_2)^T$ satisfying the constraint $\binom{x_1}{x_2} = c\binom{-2.15}{2.75}$ for all scalars $c \in \mathbb{R}$.

Figure 1.5. Projecting the centered data onto the line $\ell$.
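As a numerical check of the claim that centering does not change the total variance, here is a small sketch (ours), run on the Table 1.1 extract rather than the full 150-point dataset, so the numbers differ from those in Example 1.4:

```python
import numpy as np

# Sepal length and sepal width for the rows shown in Table 1.1
D = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2],
              [6.0, 2.2], [4.7, 3.2], [6.5, 3.0], [5.8, 2.7]])
n = len(D)

mu = D.mean(axis=0)                                         # mean point
var = (np.linalg.norm(D - mu, axis=1) ** 2).sum() / n       # Eq. (1.4)
var_alt = (np.linalg.norm(D, axis=1) ** 2).mean() - np.linalg.norm(mu) ** 2
Z = D - mu                                                  # centered data, Eq. (1.5)
var_Z = (np.linalg.norm(Z, axis=1) ** 2).mean()             # mean of Z is 0, so this is var(Z)

print(np.allclose(var, var_alt), np.allclose(var, var_Z))   # True True
```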
1.3.4 Linear Independence and Dimensionality

Given the data matrix
$$\mathbf{D} = \begin{pmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n\end{pmatrix}^T = \begin{pmatrix}X_1 & X_2 & \cdots & X_d\end{pmatrix}$$
we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original $d$ attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.

Given any set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ in an $m$-dimensional vector space $\mathbb{R}^m$, their linear combination is given as
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$$
where $c_i \in \mathbb{R}$ are scalar values. The set of all possible linear combinations of the $k$ vectors is called the span, denoted as $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k)$, which is itself a vector space, being a subspace of $\mathbb{R}^m$. If $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \mathbb{R}^m$, then we say that $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is a spanning set for $\mathbb{R}^m$.

Row and Column Space
There are several interesting vector spaces associated with the data matrix $\mathbf{D}$, two of which are the column space and row space of $\mathbf{D}$. The column space of $\mathbf{D}$, denoted $\text{col}(\mathbf{D})$, is the set of all linear combinations of the $d$ attributes $X_j \in \mathbb{R}^n$, that is,
$$\text{col}(\mathbf{D}) = \text{span}(X_1, X_2, \ldots, X_d)$$
By definition $\text{col}(\mathbf{D})$ is a subspace of $\mathbb{R}^n$. The row space of $\mathbf{D}$, denoted $\text{row}(\mathbf{D})$, is the set of all linear combinations of the $n$ points $\mathbf{x}_i \in \mathbb{R}^d$, that is,
$$\text{row}(\mathbf{D}) = \text{span}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$$
By definition $\text{row}(\mathbf{D})$ is a subspace of $\mathbb{R}^d$. Note also that the row space of $\mathbf{D}$ is the column space of $\mathbf{D}^T$: $\text{row}(\mathbf{D}) = \text{col}(\mathbf{D}^T)$.

Linear Independence
We say that the vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the $k$ vectors are linearly dependent if there are scalars $c_1, c_2, \ldots, c_k$, at least one of which is not zero, such that
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$$
On the other hand, $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly independent if and only if
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \ \text{ implies } \ c_1 = c_2 = \cdots = c_k = 0$$
Simply put, a set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank
Let $S$ be a subspace of $\mathbb{R}^m$. A basis for $S$ is a set of vectors in $S$, say $\mathbf{v}_1, \ldots, \mathbf{v}_k$, that are linearly independent and span $S$, that is, $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = S$. In fact, a basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for $S$. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for $S$. For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors
$$\mathbf{e}_1 = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} \quad \mathbf{e}_2 = \begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} \quad \cdots \quad \mathbf{e}_m = \begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix}$$
Any two bases for $S$ must have the same number of vectors, and the number of vectors in a basis for $S$ is called the dimension of $S$, denoted as $\dim(S)$. Because $S$ is a subspace of $\mathbb{R}^m$, we must have $\dim(S) \leq m$.
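One practical way to test linear independence (our sketch, not a procedure from the book) is to stack the vectors as the rows of a matrix and compare its matrix rank, a notion defined in the next paragraph, with the number of vectors:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])   # 2 * v1, so {v1, v2, v3} is linearly dependent
v3 = np.array([0.0, 1.0, 1.0])

V = np.vstack([v1, v2, v3])
print(np.linalg.matrix_rank(V) == V.shape[0])              # False: dependent set
print(np.linalg.matrix_rank(np.vstack([v1, v3])) == 2)     # True: v1, v3 independent
```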
It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix $\mathbf{D} \in \mathbb{R}^{n \times d}$, we have $\text{rank}(\mathbf{D}) \leq \min(n, d)$, which follows from the fact that the column space can have dimension at most $d$, and the row space can have dimension at most $n$. Thus, even though the data points are ostensibly in a $d$-dimensional attribute space (the extrinsic dimensionality), if $\text{rank}(\mathbf{D}) < d$, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case $\text{rank}(\mathbf{D})$ gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate $\mathbf{D} \in \mathbb{R}^{n \times d}$ with a derived data matrix $\mathbf{D}' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, that is, $k \ll d$. In this case $k$ may reflect the "true" intrinsic dimensionality of the data.

Example 1.5. The line $\ell$ in Figure 1.5 is given as $\ell = \text{span}\big((-2.15, 2.75)^T\big)$, with $\dim(\ell) = 1$. After normalization, we obtain the orthonormal basis for $\ell$ as the unit vector
$$\frac{1}{\sqrt{12.19}}\begin{pmatrix}-2.15\\2.75\end{pmatrix} = \begin{pmatrix}-0.615\\0.788\end{pmatrix}$$

Table 1.2. Iris dataset: sepal length (in centimeters).

5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1

1.4 DATA: PROBABILISTIC VIEW

The probabilistic view of the data assumes that each numeric attribute $X$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, $X$ is a function $X: \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X$, is the set of all possible outcomes of the experiment, also called the sample space, and $\mathbb{R}$, the range of $X$, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then $X: \mathcal{O} \to \mathcal{O}$ is simply the identity function: $X(v) = v$ for all $v \in \mathcal{O}$. The distinction between the outcomes and the value of the random variable is important, as we may want to treat the observed values differently depending on the context, as seen in Example 1.6.

A random variable $X$ is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas $X$ is called a continuous random variable if it can take on any value in its range.

Example 1.6. Consider the sepal length attribute ($X_1$) for the Iris dataset in Table 1.1. All $n = 150$ values of this attribute are shown in Table 1.2; they lie in the range $[4.3, 7.9]$, with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes $\mathcal{O}$. By default, we can consider the attribute $X_1$ to be a continuous random variable, given as the identity function $X_1(v) = v$, because the outcomes (sepal length values) are all numeric.
On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable $A$ as follows:
$$A(v) = \begin{cases}0 & \text{if } v < 7\\ 1 & \text{if } v \geq 7\end{cases}$$
In this case the domain of $A$ is $[4.3, 7.9]$, and its range is $\{0, 1\}$. Thus, $A$ assumes nonzero probability only at the discrete values 0 and 1.

Probability Mass Function
If $X$ is discrete, the probability mass function of $X$ is defined as
$$f(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}$$
In other words, the function $f$ gives the probability $P(X = x)$ that the random variable $X$ has the exact value $x$. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of $X$, and is zero for all other values. $f$ must also obey the basic rules of probability. That is, $f$ must be non-negative,
$$f(x) \geq 0$$
and the sum of all probabilities should add to 1:
$$\sum_x f(x) = 1$$

Example 1.7 (Bernoulli and Binomial Distribution). In Example 1.6, $A$ was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have sepal length of at least 7 cm. We can thus estimate the probability mass function of $A$ as follows:
$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p \qquad \text{and} \qquad f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$
In this case we say that $A$ has a Bernoulli distribution with parameter $p \in [0, 1]$, which denotes the probability of a success, that is, the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, $1 - p$ is the probability of a failure, that is, of not picking an Iris with long sepal length.

Let us consider another discrete random variable $B$, denoting the number of Irises with long sepal length in $m$ independent Bernoulli trials with probability of success $p$. In this case, $B$ takes on the discrete values $0, 1, \ldots, m$, and its probability mass function is given by the Binomial distribution
$$f(k) = P(B = k) = \binom{m}{k}p^k(1-p)^{m-k}$$
The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking $k$ long sepal length Irises out of the $m$ trials. For each selection of $k$ long sepal length Irises, the total probability of the $k$ successes is $p^k$, and the total probability of the $m - k$ failures is $(1-p)^{m-k}$. For example, because $p = 0.087$ from above, the probability of observing exactly $k = 2$ Irises with long sepal length in $m = 10$ trials is given as
$$f(2) = P(B = 2) = \binom{10}{2}(0.087)^2(0.913)^8 = 0.164$$
Figure 1.6 shows the full probability mass function for different values of $k$ for $m = 10$. Because $p$ is quite small, the probability of $k$ successes in so few trials falls off rapidly as $k$ increases, becoming practically zero for values of $k \geq 6$.

Figure 1.6. Binomial distribution: probability mass function ($m = 10$, $p = 0.087$).
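The following sketch (ours) recomputes the estimates in Example 1.7, using the count of 13 long-sepal Irises reported above rather than re-scanning Table 1.2:

```python
from math import comb

p = 13 / 150                                  # estimate of P(A = 1) from Table 1.2
m = 10
pmf = [comb(m, k) * p**k * (1 - p)**(m - k) for k in range(m + 1)]  # Binomial pmf

print(round(p, 3), round(pmf[2], 3))          # 0.087 0.164
print([round(v, 3) for v in pmf[6:]])         # practically zero for k >= 6
```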
Probability Density Function
If $X$ is continuous, its range is the entire set of real numbers $\mathbb{R}$. The probability of any specific value $x$ is only one out of the infinitely many possible values in the range of $X$, which means that $P(X = x) = 0$ for all $x \in \mathbb{R}$. However, this does not mean that the value $x$ is impossible, because in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals $[a, b] \subset \mathbb{R}$, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable $X$ takes on values in any interval $[a, b] \subset \mathbb{R}$:
$$P\big(X \in [a, b]\big) = \int_a^b f(x)\,dx$$
As before, the density function $f$ must satisfy the basic laws of probability:
$$f(x) \geq 0 \ \text{ for all } x \in \mathbb{R} \qquad \text{and} \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
We can get an intuitive understanding of the density function $f$ by considering the probability density over a small interval of width $2\epsilon > 0$, centered at $x$, namely $[x - \epsilon, x + \epsilon]$:
$$P\big(X \in [x - \epsilon, x + \epsilon]\big) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x), \quad \text{that is,} \quad f(x) \simeq \frac{P\big(X \in [x - \epsilon, x + \epsilon]\big)}{2\epsilon} \qquad (1.8)$$
$f(x)$ thus gives the probability density at $x$, given as the ratio of the probability mass to the width of the interval, that is, the probability mass per unit distance. Thus, it is important to note that $P(X = x) \neq f(x)$.

Even though the probability density function $f(x)$ does not specify the probability $P(X = x)$, it can be used to obtain the relative probability of one value $x_1$ over another $x_2$ because for a given $\epsilon > 0$, by Eq. (1.8), we have
$$\frac{P(X \in [x_1 - \epsilon, x_1 + \epsilon])}{P(X \in [x_2 - \epsilon, x_2 + \epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \qquad (1.9)$$
Thus, if $f(x_1)$ is larger than $f(x_2)$, then values of $X$ close to $x_1$ are more probable than values close to $x_2$, and vice versa.

Example 1.8 (Normal Distribution). Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$
There are two parameters of the normal density distribution, namely, $\mu$, which represents the mean value, and $\sigma^2$, which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic "bell" shape plot of the normal distribution. The parameters, $\mu = 5.84$ and $\sigma^2 = 0.681$, were estimated directly from the data for sepal length in Table 1.2.

Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}}\exp\{0\} = 0.483$, we emphasize that the probability of observing $X = \mu$ is zero, that is, $P(X = \mu) = 0$. Thus, $P(X = x)$ is not given by $f(x)$; rather, $P(X = x)$ is given as the area under the curve for an infinitesimally small interval $[x - \epsilon, x + \epsilon]$ centered at $x$, with $\epsilon > 0$. Figure 1.7 illustrates this with the shaded region centered at $\mu = 5.84$. From Eq. (1.8), we have
$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$
As $\epsilon \to 0$, we get $P(X = \mu) \to 0$. However, based on Eq. (1.9) we can claim that the probability of observing values close to the mean value $\mu = 5.84$ is 2.69 times the probability of observing values close to $x = 7$, as
$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$

Figure 1.7. Normal distribution: probability density function ($\mu = 5.84$, $\sigma^2 = 0.681$).

Cumulative Distribution Function
For any random variable $X$, whether discrete or continuous, we can define the cumulative distribution function (CDF) $F: \mathbb{R} \to [0, 1]$, which gives the probability of observing a value at most some given value $x$:
$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$
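A short sketch (ours) reproducing the numbers in Example 1.8:

```python
import math

mu, var = 5.84, 0.681       # parameters estimated from Table 1.2 in the text

def normal_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

f_mu = normal_pdf(mu, mu, var)    # density at the mean, about 0.483
f_7 = normal_pdf(7.0, mu, var)    # density at x = 7, about 0.18
print(round(f_mu, 3), round(f_7, 3), round(f_mu / f_7, 2))   # 0.483 0.18 2.69
```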