DATA MINING AND ANALYSIS
The fundamental algorithms in data mining and analysis form the basis
for the emerging field of data science, which includes automated methods
to analyze patterns and models for all kinds of data, with applications
ranging from scientific discovery to business intelligence and analytics.
This textbook for senior undergraduate and graduate data mining courses
provides a broad yet in-depth overview of data mining, integrating related
concepts from machine learning and statistics. The main parts of the
book include exploratory data analysis, pattern mining, clustering, and
classification. The book lays the basic foundations of these tasks and
also covers cutting-edge topics such as kernel methods, high-dimensional
data analysis, and complex graphs and networks. With its comprehensive
coverage, algorithmic perspective, and wealth of examples, this book
offers solid guidance in data mining for students, researchers, and
practitioners alike.
Key Features:
• Covers both core methods and cutting-edge research
• Algorithmic approach with open-source implementations
• Minimal prerequisites, as all key mathematical concepts are presented, as is the intuition behind the formulas
• Short, self-contained chapters with class-tested examples and exercises that allow for flexibility in designing a course and for easy reference
• Supplementary online resource containing lecture slides, videos, project ideas, and more
Mohammed J. Zaki is a Professor of Computer Science at Rensselaer
Polytechnic Institute, Troy, New York.
Wagner Meira Jr. is a Professor of Computer Science at Universidade
Federal de Minas Gerais, Brazil.
DATA MINING
AND ANALYSIS
Fundamental Concepts and Algorithms
MOHAMMED J. ZAKI
Rensselaer Polytechnic Institute, Troy, New York
WAGNER MEIRA JR.
Universidade Federal de Minas Gerais, Brazil
32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9780521766333
© Mohammed J. Zaki and Wagner Meira Jr. 2014
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United States of America
A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication Data
Zaki, Mohammed J., 1971–
Data mining and analysis: fundamental concepts and algorithms / Mohammed J. Zaki,
Rensselaer Polytechnic Institute, Troy, New York, Wagner Meira Jr.,
Universidade Federal de Minas Gerais, Brazil.
pages cm
Includes bibliographical references and index.
ISBN 978-0-521-76633-3 (hardback)
1. Data mining. I. Meira, Wagner, 1967– II. Title.
QA76.9.D343Z36 2014
006.3′12–dc23 2013037544
ISBN 978-0-521-76633-3 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party Internet Web sites referred to in this publication
and does not guarantee that any content on such Web sites is, or will remain,
accurate or appropriate.
Contents

Preface

1 Data Mining and Analysis
1.1 Data Matrix
1.2 Attributes
1.3 Data: Algebraic and Geometric View
1.4 Data: Probabilistic View
1.5 Data Mining
1.6 Further Reading
1.7 Exercises

PART ONE: DATA ANALYSIS FOUNDATIONS

2 Numeric Attributes
2.1 Univariate Analysis
2.2 Bivariate Analysis
2.3 Multivariate Analysis
2.4 Data Normalization
2.5 Normal Distribution
2.6 Further Reading
2.7 Exercises

3 Categorical Attributes
3.1 Univariate Analysis
3.2 Bivariate Analysis
3.3 Multivariate Analysis
3.4 Distance and Angle
3.5 Discretization
3.6 Further Reading
3.7 Exercises

4 Graph Data
4.1 Graph Concepts
4.2 Topological Attributes
4.3 Centrality Analysis
4.4 Graph Models
4.5 Further Reading
4.6 Exercises

5 Kernel Methods
5.1 Kernel Matrix
5.2 Vector Kernels
5.3 Basic Kernel Operations in Feature Space
5.4 Kernels for Complex Objects
5.5 Further Reading
5.6 Exercises

6 High-dimensional Data
6.1 High-dimensional Objects
6.2 High-dimensional Volumes
6.3 Hypersphere Inscribed within Hypercube
6.4 Volume of Thin Hypersphere Shell
6.5 Diagonals in Hyperspace
6.6 Density of the Multivariate Normal
6.7 Appendix: Derivation of Hypersphere Volume
6.8 Further Reading
6.9 Exercises

7 Dimensionality Reduction
7.1 Background
7.2 Principal Component Analysis
7.3 Kernel Principal Component Analysis
7.4 Singular Value Decomposition
7.5 Further Reading
7.6 Exercises

PART TWO: FREQUENT PATTERN MINING

8 Itemset Mining
8.1 Frequent Itemsets and Association Rules
8.2 Itemset Mining Algorithms
8.3 Generating Association Rules
8.4 Further Reading
8.5 Exercises

9 Summarizing Itemsets
9.1 Maximal and Closed Frequent Itemsets
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
9.3 Mining Closed Frequent Itemsets: Charm Algorithm
9.4 Nonderivable Itemsets
9.5 Further Reading
9.6 Exercises

10 Sequence Mining
10.1 Frequent Sequences
10.2 Mining Frequent Sequences
10.3 Substring Mining via Suffix Trees
10.4 Further Reading
10.5 Exercises

11 Graph Pattern Mining
11.1 Isomorphism and Support
11.2 Candidate Generation
11.3 The gSpan Algorithm
11.4 Further Reading
11.5 Exercises

12 Pattern and Rule Assessment
12.1 Rule and Pattern Assessment Measures
12.2 Significance Testing and Confidence Intervals
12.3 Further Reading
12.4 Exercises

PART THREE: CLUSTERING

13 Representative-based Clustering
13.1 K-means Algorithm
13.2 Kernel K-means
13.3 Expectation-Maximization Clustering
13.4 Further Reading
13.5 Exercises

14 Hierarchical Clustering
14.1 Preliminaries
14.2 Agglomerative Hierarchical Clustering
14.3 Further Reading
14.4 Exercises and Projects

15 Density-based Clustering
15.1 The DBSCAN Algorithm
15.2 Kernel Density Estimation
15.3 Density-based Clustering: DENCLUE
15.4 Further Reading
15.5 Exercises

16 Spectral and Graph Clustering
16.1 Graphs and Matrices
16.2 Clustering as Graph Cuts
16.3 Markov Clustering
16.4 Further Reading
16.5 Exercises

17 Clustering Validation
17.1 External Measures
17.2 Internal Measures
17.3 Relative Measures
17.4 Further Reading
17.5 Exercises

PART FOUR: CLASSIFICATION

18 Probabilistic Classification
18.1 Bayes Classifier
18.2 Naive Bayes Classifier
18.3 K Nearest Neighbors Classifier
18.4 Further Reading
18.5 Exercises

19 Decision Tree Classifier
19.1 Decision Trees
19.2 Decision Tree Algorithm
19.3 Further Reading
19.4 Exercises

20 Linear Discriminant Analysis
20.1 Optimal Linear Discriminant
20.2 Kernel Discriminant Analysis
20.3 Further Reading
20.4 Exercises

21 Support Vector Machines
21.1 Support Vectors and Margins
21.2 SVM: Linear and Separable Case
21.3 Soft Margin SVM: Linear and Nonseparable Case
21.4 Kernel SVM: Nonlinear Case
21.5 SVM Training Algorithms
21.6 Further Reading
21.7 Exercises

22 Classification Assessment
22.1 Classification Performance Measures
22.2 Classifier Evaluation
22.3 Bias-Variance Decomposition
22.4 Further Reading
22.5 Exercises

Index
Preface
This book is an outgrowth of data mining courses at Rensselaer Polytechnic Institute
(RPI) and Universidade Federal de Minas Gerais (UFMG); the RPI course has been
offered every Fall since 1998, whereas the UFMG course has been offered since
2002. Although there are several good books on data mining and related topics, we
felt that many of them are either too high-level or too advanced. Our goal was to
write an introductory text that focuses on the fundamental algorithms in data mining
and analysis. It lays the mathematical foundations for the core data mining methods,
with key concepts explained when first encountered; the book also tries to build the
intuition behind the formulas to aid understanding.
The main parts of the book include exploratory data analysis, frequent pattern
mining, clustering, and classification. The book lays the basic foundations of these
tasks, and it also covers cutting-edge topics such as kernel methods, high-dimensional
data analysis, and complex graphs and networks. It integrates concepts from related
disciplines such as machine learning and statistics and is also ideal for a course on data
analysis. Most of the prerequisite material is covered in the text, especially on linear
algebra, and probability and statistics.
The book includes many examples to illustrate the main technical concepts. It also
has end-of-chapter exercises, which have been used in class. All of the algorithms in the
book have been implemented by the authors. We suggest that readers use their favorite
data analysis and mining software to work through our examples and to implement the
algorithms we describe in text; we recommend the R software or the Python language
with its NumPy package. The datasets used and other supplementary material such
as project ideas and slides are available online at the book's companion site and its
mirrors at RPI and UFMG:
• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook
Having understood the basic principles and algorithms in data mining and data
analysis, readers will be well equipped to develop their own methods or use more
advanced techniques.
Figure 0.1. Chapter dependencies. [figure: the chapter dependency graph]
Suggested Roadmaps
The chapter dependency graph is shown in Figure 0.1. We suggest some typical
roadmaps for courses and readings based on this book. For an undergraduate-level
course, we suggest the following chapters: 1–3, 8, 10, 12–15, 17–19, and 21–22. For an
undergraduate course without exploratory data analysis, we recommend Chapters 1,
8–15, 17–19, and 21–22. For a graduate course, one possibility is to quickly go over the
material in Part I or to assume it as background reading and to directly cover Chapters
9–22; the other parts of the book, namely frequent pattern mining (Part II), clustering
(Part III), and classification (Part IV), can be covered in any order. For a course on
data analysis, the chapters covered must include 1–7, 13–14, 15 (Section 2), and 20.
Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5,
7 (Sections 1–3), 11–12, 13 (Sections 1–2), 16–17, and 20–22.
Acknowledgments
Initial drafts of this book have been used in several data mining courses. We received
many valuable comments and corrections from both the faculty and students. Our
thanks go to
• Muhammad Abulaish, Jamia Millia Islamia, India
• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas
• Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
• Suat Ozdemir, Gazi University, Turkey
• Naren Ramakrishnan, Virginia Polytechnic and State University
• Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
• Saeed Salem, North Dakota State University
• Ankur Teredesai, University of Washington, Tacoma
• Hannu Toivonen, University of Helsinki, Finland
• Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
• Jason T.L. Wang, New Jersey Institute of Technology
• Jianyong Wang, Tsinghua University, China
• Jiong Yang, Case Western Reserve University
• Jieping Ye, Arizona State University
We would like to thank all the students enrolled in our data mining courses at RPI
and UFMG, as well as the anonymous reviewers who provided technical comments
on various chapters. We appreciate the collegial and supportive environment within
the computer science departments at RPI and UFMG and at the Qatar Computing
Research Institute. In addition, we thank NSF, CNPq, CAPES, FAPEMIG, Inweb –
the National Institute of Science and Technology for the Web, and Brazil’s Science
without Borders program for their support. We thank Lauren Cowles, our editor at
Cambridge University Press, for her guidance and patience in realizing this book.
Finally, on a more personal front, MJZ dedicates the book to his wife, Amina,
for her love, patience and support over all these years, and to his children, Abrar and
Afsah, and his parents. WMJ gratefully dedicates the book to his wife Patricia; to his
children, Gabriel and Marina; and to his parents, Wagner and Marlene, for their love,
encouragement, and inspiration.
CHAPTER 1
Data Mining and Analysis
Data mining is the process of discovering insightful, interesting, and novel patterns, as
well as descriptive, understandable, and predictive models from large-scale data. We
begin this chapter by looking at basic properties of data modeled as a data matrix. We
emphasize the geometric and algebraic views, as well as the probabilistic interpretation
of data. We then discuss the main data mining tasks, which span exploratory data
analysis, frequent pattern mining, clustering, and classification, laying out the roadmap
for the book.
1.1 DATA MATRIX

Data can often be represented or abstracted as an $n \times d$ data matrix, with $n$ rows and $d$ columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The $n \times d$ data matrix is given as

$$
\mathbf{D} = \begin{pmatrix}
 & X_1 & X_2 & \cdots & X_d \\
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
$$

where $\mathbf{x}_i$ denotes the $i$th row, which is a $d$-tuple given as $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, and $X_j$ denotes the $j$th column, which is an $n$-tuple given as $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$.

Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances $n$ is referred to as the size of the data, whereas the number of attributes $d$ is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis and the simultaneous analysis of more than two attributes is called multivariate analysis.

Table 1.1. Extract from the Iris dataset

        Sepal   Sepal   Petal   Petal
        length  width   length  width   Class
        X1      X2      X3      X4      X5
x1      5.9     3.0     4.2     1.5     Iris-versicolor
x2      6.9     3.1     4.9     1.5     Iris-versicolor
x3      6.6     2.9     4.6     1.3     Iris-versicolor
x4      4.6     3.2     1.4     0.2     Iris-setosa
x5      6.0     2.2     4.0     1.0     Iris-versicolor
x6      4.7     3.2     1.3     0.2     Iris-setosa
x7      6.5     3.0     5.8     2.2     Iris-virginica
x8      5.8     2.7     5.1     1.9     Iris-virginica
...     ...     ...     ...     ...     ...
x149    7.7     3.8     6.7     2.2     Iris-virginica
x150    5.1     3.4     1.5     0.2     Iris-setosa
Example 1.1. Table 1.1 shows an extract of the Iris dataset; the complete data forms a $150 \times 5$ data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length, and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple

$$\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \text{Iris-versicolor})$$
Not all datasets are in the form of a data matrix. For instance, more complex
datasets can be in the form of sequences (e.g., DNA and protein sequences), text,
time-series, images, audio, video, and so on, which may need special techniques for
analysis. However, in many cases even if the raw data is not a data matrix it can
usually be transformed into that form via feature extraction. For example, given a
database of images, we can create a data matrix in which rows represent images and
columns correspond to image features such as color, texture, and so on. Sometimes,
certain attributes may have special semantics associated with them requiring special
treatment. For instance, temporal or spatial attributes are often treated differently.
It is also worth noting that traditional data analysis assumes that each entity or
instance is independent. However, given the interconnected nature of the world
we live in, this assumption may not always hold. Instances may be connected to
other instances via various kinds of relationships, giving rise to a data graph, where
a node represents an entity and an edge represents the relationship between two
entities.
1.2 ATTRIBUTES

Attributes may be classified into two main types depending on their domain, that is, depending on the types of values they take on.

Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age with $domain(\text{Age}) = \mathbb{N}$, where $\mathbb{N}$ denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with $domain(\text{petal length}) = \mathbb{R}^+$ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set $\{0, 1\}$, it is called a binary attribute. Numeric attributes can be classified further into two types:
• Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, attribute temperature measured in °C or °F is interval-scaled. If it is 20°C on one day and 10°C on the following day, it is meaningful to talk about a temperature drop of 10°C, but it is not meaningful to say that it is twice as cold as the previous day.
• Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.
Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as

$$domain(\text{Sex}) = \{\text{M}, \text{F}\}$$
$$domain(\text{Education}) = \{\text{HighSchool}, \text{BS}, \text{MS}, \text{PhD}\}$$

Categorical attributes may be of two types:
• Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example, Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with $domain(\text{class}) = \{\text{iris-setosa}, \text{iris-versicolor}, \text{iris-virginica}\}$.
• Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another?) and inequality comparisons (is one value less than or greater than another?) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute because its domain values are ordered by increasing educational qualification.
1.3 DATA: ALGEBRAIC AND GEOMETRIC VIEW

If the $d$ attributes or dimensions in the data matrix $\mathbf{D}$ are all numeric, then each row can be considered as a $d$-dimensional point:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$

or equivalently, each row may be considered as a $d$-dimensional column vector (all vectors are assumed to be column vectors by default):

$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d$$

where $T$ is the matrix transpose operator.

The $d$-dimensional Cartesian coordinate space is specified via the $d$ unit vectors, called the standard basis vectors, along each of the axes. The $j$th standard basis vector $\mathbf{e}_j$ is the $d$-dimensional unit vector whose $j$th component is 1 and the rest of the components are 0:

$$\mathbf{e}_j = (0, \ldots, 1_j, \ldots, 0)^T$$

Any other vector in $\mathbb{R}^d$ can be written as a linear combination of the standard basis vectors. For example, each of the points $\mathbf{x}_i$ can be written as the linear combination

$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\mathbf{e}_j$$

where the scalar value $x_{ij}$ is the coordinate value along the $j$th axis or attribute.
Example 1.2. Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple $\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \text{Iris-versicolor})$ on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatterplot of all the $n = 150$ points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows $\mathbf{x}_1$ as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point $(5.9, 3.0, 4.2)$ can be seen as specifying the coefficients in the linear combination of the standard basis vectors in $\mathbb{R}^3$:

$$\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3 = 5.9\begin{pmatrix}1\\0\\0\end{pmatrix} + 3.0\begin{pmatrix}0\\1\\0\end{pmatrix} + 4.2\begin{pmatrix}0\\0\\1\end{pmatrix} = \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}$$
Figure 1.1. Row $\mathbf{x}_1$ as a point and vector in (a) $\mathbb{R}^2$ and (b) $\mathbb{R}^3$. [figure]

Figure 1.2. Scatterplot: sepal length versus sepal width. The solid circle shows the mean point. [figure]
Each numeric column or attribute can also be treated as a vector in an $n$-dimensional space $\mathbb{R}^n$:

$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
If all attributes are numeric, then the data matrix $\mathbf{D}$ is in fact an $n \times d$ matrix, also written as $\mathbf{D} \in \mathbb{R}^{n \times d}$, given as

$$
\mathbf{D} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
= \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
= \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}
$$

As we can see, we can consider the entire dataset as an $n \times d$ matrix, or equivalently as a set of $n$ row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of $d$ column vectors $X_j \in \mathbb{R}^n$.
1.3.1 Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks.

Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors given as

$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$
Dot Product
The dot product between $\mathbf{a}$ and $\mathbf{b}$ is defined as the scalar value

$$\mathbf{a}^T\mathbf{b} = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1b_1 + a_2b_2 + \cdots + a_mb_m = \sum_{i=1}^{m} a_ib_i$$

Length
The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as

$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$

The unit vector in the direction of $\mathbf{a}$ is given as

$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$
By definition $\mathbf{u}$ has length $\|\mathbf{u}\| = 1$, and it is also called a normalized vector, which can be used in lieu of $\mathbf{a}$ in some analysis tasks.

The Euclidean norm is a special case of a general class of norms, known as the $L_p$-norm, defined as

$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{\frac{1}{p}}$$

for any $p \neq 0$. Thus, the Euclidean norm corresponds to the case when $p = 2$.
Distance
From the Euclidean norm we can define the Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$, as follows

$$\delta(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| = \sqrt{(\mathbf{a} - \mathbf{b})^T(\mathbf{a} - \mathbf{b})} = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2} \qquad (1.1)$$

Thus, the length of a vector is simply its distance from the zero vector $\mathbf{0}$, all of whose elements are 0, that is, $\|\mathbf{a}\| = \|\mathbf{a} - \mathbf{0}\| = \delta(\mathbf{a}, \mathbf{0})$.

From the general $L_p$-norm we can define the corresponding $L_p$-distance function, given as follows

$$\delta_p(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_p \qquad (1.2)$$

If $p$ is unspecified, as in Eq. (1.1), it is assumed to be $p = 2$ by default.
Angle
The cosine of the smallest angle between vectors $\mathbf{a}$ and $\mathbf{b}$, also called the cosine similarity, is given as

$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T \left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \qquad (1.3)$$

Thus, the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as the dot product of the unit vectors $\frac{\mathbf{a}}{\|\mathbf{a}\|}$ and $\frac{\mathbf{b}}{\|\mathbf{b}\|}$.

The Cauchy–Schwartz inequality states that for any vectors $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^m$

$$|\mathbf{a}^T\mathbf{b}| \leq \|\mathbf{a}\| \cdot \|\mathbf{b}\|$$

It follows immediately from the Cauchy–Schwartz inequality that

$$-1 \leq \cos\theta \leq 1$$
Figure 1.3. Distance and angle. Unit vectors are shown in gray. [figure: vectors $\mathbf{a} = (5,3)^T$ and $\mathbf{b} = (1,4)^T$, with $\mathbf{a} - \mathbf{b}$ and angle $\theta$]
Because the smallest angle $\theta \in [0°, 180°]$ and because $\cos\theta \in [-1, 1]$, the cosine similarity value ranges from $+1$, corresponding to an angle of $0°$, to $-1$, corresponding to an angle of $180°$ (or $\pi$ radians).

Orthogonality
Two vectors $\mathbf{a}$ and $\mathbf{b}$ are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$, which in turn implies that $\cos\theta = 0$, that is, the angle between them is $90°$ or $\frac{\pi}{2}$ radians. In this case, we say that they have no similarity.
Example 1.3 (Distance and Angle). Figure 1.3 shows the two vectors

$$\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \qquad \text{and} \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$

Using Eq. (1.1), the Euclidean distance between them is given as

$$\delta(\mathbf{a}, \mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16 + 1} = \sqrt{17} = 4.12$$

The distance can also be computed as the magnitude of the vector

$$\mathbf{a} - \mathbf{b} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}$$

because $\|\mathbf{a} - \mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.

The unit vector in the direction of $\mathbf{a}$ is given as

$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2 + 3^2}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \frac{1}{\sqrt{34}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.86 \\ 0.51 \end{pmatrix}$$

The unit vector in the direction of $\mathbf{b}$ can be computed similarly:

$$\mathbf{u}_b = \begin{pmatrix} 0.24 \\ 0.97 \end{pmatrix}$$

These unit vectors are also shown in gray in Figure 1.3.

By Eq. (1.3) the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as

$$\cos\theta = \frac{\begin{pmatrix} 5 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix}}{\sqrt{5^2 + 3^2}\,\sqrt{1^2 + 4^2}} = \frac{17}{\sqrt{34}\sqrt{17}} = \frac{17}{17\sqrt{2}} = \frac{1}{\sqrt{2}}$$

We can get the angle by computing the inverse of the cosine:

$$\theta = \cos^{-1}\left(1/\sqrt{2}\right) = 45°$$

Let us consider the $L_p$-norm for $\mathbf{a}$ with $p = 3$; we get

$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$

The distance between $\mathbf{a}$ and $\mathbf{b}$ using Eq. (1.2) for the $L_p$-norm with $p = 3$ is given as

$$\|\mathbf{a} - \mathbf{b}\|_3 = \left\|(4, -1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
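These computations are easy to verify numerically. The following sketch of ours, in Python with NumPy as suggested in the Preface, reproduces the distance, unit vector, angle, and $L_3$-distance of Example 1.3.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                  # Euclidean (L2) distance, Eq. (1.1)
u_a = a / np.linalg.norm(a)                   # unit vector in the direction of a
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # Eq. (1.3)
theta = np.degrees(np.arccos(cos_theta))      # angle in degrees
l3_dist = np.linalg.norm(a - b, ord=3)        # L3-distance, Eq. (1.2)

print(round(dist, 2))       # -> 4.12
print(u_a.round(2))         # -> [0.86 0.51]
print(round(theta, 1))      # -> 45.0
print(round(l3_dist, 2))    # -> 4.02
```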
1.3.2 Mean and Total Variance

Mean
The mean of the data matrix $\mathbf{D}$ is the vector obtained as the average of all the points:

$$\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$$

Total Variance
The total variance of the data matrix $\mathbf{D}$ is the average squared distance of each point from the mean:

$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n} \delta(\mathbf{x}_i, \boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \qquad (1.4)$$
Simplifying Eq. (1.4) we obtain

$$
\begin{aligned}
\text{var}(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\boldsymbol{\mu} + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - \|\boldsymbol{\mu}\|^2
\end{aligned}
$$

The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).
Centered Data Matrix
Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points:

$$
\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T
= \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
- \begin{pmatrix} \boldsymbol{\mu}^T \\ \boldsymbol{\mu}^T \\ \vdots \\ \boldsymbol{\mu}^T \end{pmatrix}
= \begin{pmatrix} \mathbf{x}_1^T - \boldsymbol{\mu}^T \\ \mathbf{x}_2^T - \boldsymbol{\mu}^T \\ \vdots \\ \mathbf{x}_n^T - \boldsymbol{\mu}^T \end{pmatrix}
= \begin{pmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \\ \vdots \\ \mathbf{z}_n^T \end{pmatrix}
\qquad (1.5)
$$

where $\mathbf{z}_i = \mathbf{x}_i - \boldsymbol{\mu}$ represents the centered point corresponding to $\mathbf{x}_i$, and $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector all of whose elements have value 1. The mean of the centered data matrix $\mathbf{Z}$ is $\mathbf{0} \in \mathbb{R}^d$, because we have subtracted the mean $\boldsymbol{\mu}$ from all the points $\mathbf{x}_i$.
1.3.3 Orthogonal Projection

Figure 1.4. Orthogonal projection. [figure: vectors $\mathbf{a}$ and $\mathbf{b}$, with projection $\mathbf{p} = \mathbf{b}_{\parallel}$ and residual $\mathbf{r} = \mathbf{b}_{\perp}$]

Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point after a change of the basis vectors. Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors. An orthogonal decomposition of the vector $\mathbf{b}$ in the direction of another vector $\mathbf{a}$, illustrated in Figure 1.4, is given as

$$\mathbf{b} = \mathbf{b}_{\parallel} + \mathbf{b}_{\perp} = \mathbf{p} + \mathbf{r} \qquad (1.6)$$
where $\mathbf{p} = \mathbf{b}_{\parallel}$ is parallel to $\mathbf{a}$, and $\mathbf{r} = \mathbf{b}_{\perp}$ is perpendicular or orthogonal to $\mathbf{a}$. The vector $\mathbf{p}$ is called the orthogonal projection or simply projection of $\mathbf{b}$ on the vector $\mathbf{a}$. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to $\mathbf{b}$ on the line passing through $\mathbf{a}$. Thus, the magnitude of the vector $\mathbf{r} = \mathbf{b} - \mathbf{p}$ gives the perpendicular distance between $\mathbf{b}$ and $\mathbf{a}$, which is often interpreted as the residual or error vector between the points $\mathbf{b}$ and $\mathbf{p}$.

We can derive an expression for $\mathbf{p}$ by noting that $\mathbf{p} = c\mathbf{a}$ for some scalar $c$, as $\mathbf{p}$ is parallel to $\mathbf{a}$. Thus, $\mathbf{r} = \mathbf{b} - \mathbf{p} = \mathbf{b} - c\mathbf{a}$. Because $\mathbf{p}$ and $\mathbf{r}$ are orthogonal, we have

$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\,\mathbf{a}^T\mathbf{b} - c^2\,\mathbf{a}^T\mathbf{a} = 0$$

which implies that

$$c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$$

Therefore, the projection of $\mathbf{b}$ on $\mathbf{a}$ is given as

$$\mathbf{p} = \mathbf{b}_{\parallel} = c\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \qquad (1.7)$$
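The closed form in Eq. (1.7) translates directly into code. The sketch below (ours, in NumPy) decomposes $\mathbf{b}$ into its parallel and perpendicular parts and checks their orthogonality.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

c = (a @ b) / (a @ a)     # scalar coefficient derived from p^T r = 0
p = c * a                 # projection of b on a, Eq. (1.7)
r = b - p                 # residual (perpendicular) component

print(p.round(3))                 # parallel part b_par
print(np.isclose(p @ r, 0.0))     # -> True: p and r are orthogonal
```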
Example 1.4. Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as

$$\text{mean}(\mathbf{D}) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$
which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is $\text{var}(\mathbf{D}) = 0.868$ (centering does not change this value).

Figure 1.5 shows the projection of each point onto the line $\ell$, which is the line that maximizes the separation between the class iris-setosa (squares) from the other two classes, namely iris-versicolor (circles) and iris-virginica (triangles). The line $\ell$ is given as the set of all the points $(x_1, x_2)^T$ satisfying the constraint

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}$$

for all scalars $c \in \mathbb{R}$.

Figure 1.5. Projecting the centered data onto the line $\ell$. [figure]
1.3.4 Linear Independence and Dimensionality

Given the data matrix

$$\mathbf{D} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{pmatrix}^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}$$

we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original $d$ attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.

Given any set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ in an $m$-dimensional vector space $\mathbb{R}^m$, their linear combination is given as

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$$

where $c_i \in \mathbb{R}$ are scalar values. The set of all possible linear combinations of the $k$ vectors is called the span, denoted as $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k)$, which is itself a vector space being a subspace of $\mathbb{R}^m$. If $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \mathbb{R}^m$, then we say that $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is a spanning set for $\mathbb{R}^m$.
Row and Column Space
There are several interesting vector spaces associated with the data matrix $\mathbf{D}$, two of which are the column space and row space of $\mathbf{D}$. The column space of $\mathbf{D}$, denoted $\text{col}(\mathbf{D})$, is the set of all linear combinations of the $d$ attributes $X_j \in \mathbb{R}^n$, that is,

$$\text{col}(\mathbf{D}) = \text{span}(X_1, X_2, \ldots, X_d)$$

By definition $\text{col}(\mathbf{D})$ is a subspace of $\mathbb{R}^n$. The row space of $\mathbf{D}$, denoted $\text{row}(\mathbf{D})$, is the set of all linear combinations of the $n$ points $\mathbf{x}_i \in \mathbb{R}^d$, that is,

$$\text{row}(\mathbf{D}) = \text{span}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$$

By definition $\text{row}(\mathbf{D})$ is a subspace of $\mathbb{R}^d$. Note also that the row space of $\mathbf{D}$ is the column space of $\mathbf{D}^T$:

$$\text{row}(\mathbf{D}) = \text{col}(\mathbf{D}^T)$$
Linear Independence
We say that the vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the $k$ vectors are linearly dependent if there are scalars $c_1, c_2, \ldots, c_k$, at least one of which is not zero, such that

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$$

On the other hand, $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly independent if and only if

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \quad \text{implies} \quad c_1 = c_2 = \cdots = c_k = 0$$

Simply put, a set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.
Dimension and Rank
Let $S$ be a subspace of $\mathbb{R}^m$. A basis for $S$ is a set of vectors in $S$, say $\mathbf{v}_1, \ldots, \mathbf{v}_k$, that are linearly independent and they span $S$, that is, $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = S$. In fact, a basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for $S$. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for $S$. For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors

$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} \qquad \cdots \qquad \mathbf{e}_m = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$

Any two bases for $S$ must have the same number of vectors, and the number of vectors in a basis for $S$ is called the dimension of $S$, denoted as $\dim(S)$. Because $S$ is a subspace of $\mathbb{R}^m$, we must have $\dim(S) \leq m$.

It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix $\mathbf{D} \in \mathbb{R}^{n \times d}$, we have $\text{rank}(\mathbf{D}) \leq \min(n, d)$, which follows from the fact that the column space can have dimension at most $d$, and the row space can have dimension at most $n$. Thus, even though the data points are ostensibly in a $d$-dimensional attribute space (the extrinsic dimensionality), if $\text{rank}(\mathbf{D}) < d$, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case $\text{rank}(\mathbf{D})$ gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate $\mathbf{D} \in \mathbb{R}^{n \times d}$ with a derived data matrix $\mathbf{D}' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, that is, $k \ll d$. In this case $k$ may reflect the "true" intrinsic dimensionality of the data.
Example 1.5. The line $\ell$ in Figure 1.5 is given as $\ell = \text{span}\left((-2.15,\ 2.75)^T\right)$, with $\dim(\ell) = 1$. After normalization, we obtain the orthonormal basis for $\ell$ as the unit vector

$$\frac{1}{\sqrt{12.19}}\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} = \begin{pmatrix} -0.615 \\ 0.788 \end{pmatrix}$$
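The rank of a data matrix, and hence its intrinsic dimensionality, can also be checked numerically. The illustrative sketch below (ours) uses NumPy's matrix_rank on a matrix whose third column is, by construction, a linear combination of the first two.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(150, 2))
# Append a derived attribute: a linear combination of the first two columns,
# so the three columns span only a 2-dimensional subspace.
D = np.column_stack([A, 2.0 * A[:, 0] - 0.5 * A[:, 1]])

print(D.shape)                      # -> (150, 3): extrinsic dimensionality d = 3
print(np.linalg.matrix_rank(D))     # -> 2: the intrinsic dimensionality
```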
Table 1.2. Iris dataset: sepal length (in centimeters).
5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1
1.4 DATA: PROBABILISTIC VIEW

The probabilistic view of the data assumes that each numeric attribute $X$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, $X$ is a function $X: \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X$, is the set of all possible outcomes of the experiment, also called the sample space, and $\mathbb{R}$, the range of $X$, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then $X: \mathcal{O} \to \mathcal{O}$ is simply the identity function: $X(v) = v$ for all $v \in \mathcal{O}$. The distinction between the outcomes and the value of the random variable is important, as we may want to treat the observed values differently depending on the context, as seen in Example 1.6.

A random variable $X$ is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas $X$ is called a continuous random variable if it can take on any value in its range.
Example 1.6. Consider the sepal length attribute ($X_1$) for the Iris dataset in Table 1.1. All $n = 150$ values of this attribute are shown in Table 1.2, which lie in the range $[4.3, 7.9]$, with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes $\mathcal{O}$.

By default, we can consider the attribute $X_1$ to be a continuous random variable, given as the identity function $X_1(v) = v$, because the outcomes (sepal length values) are all numeric.

On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable $A$ as follows:

$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}$$

In this case the domain of $A$ is $[4.3, 7.9]$, and its range is $\{0, 1\}$. Thus, $A$ assumes nonzero probability only at the discrete values 0 and 1.
Probability Mass Function
If $X$ is discrete, the probability mass function of $X$ is defined as

$$f(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}$$

In other words, the function $f$ gives the probability $P(X = x)$ that the random variable $X$ has the exact value $x$. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of $X$, and is zero for all other values. $f$ must also obey the basic rules of probability. That is, $f$ must be non-negative:

$$f(x) \geq 0$$

and the sum of all probabilities should add to 1:

$$\sum_x f(x) = 1$$
Example 1.7 (Bernoulli and Binomial Distribution). In Example 1.6, $A$ was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have sepal length of at least 7 cm. We can thus estimate the probability mass function of $A$ as follows:

$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$

and

$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$

In this case we say that $A$ has a Bernoulli distribution with parameter $p \in [0, 1]$, which denotes the probability of a success, that is, the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, $1 - p$ is the probability of a failure, that is, of not picking an Iris with long sepal length.

Let us consider another discrete random variable $B$, denoting the number of Irises with long sepal length in $m$ independent Bernoulli trials with probability of success $p$. In this case, $B$ takes on the discrete values $[0, m]$, and its probability mass function is given by the Binomial distribution

$$f(k) = P(B = k) = \binom{m}{k} p^k (1-p)^{m-k}$$

The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking $k$ long sepal length Irises out of the $m$ trials. For each selection of $k$ long sepal length Irises, the total probability of the $k$ successes is $p^k$, and the total probability of $m - k$ failures is $(1-p)^{m-k}$. For example, because $p = 0.087$ from above, the probability of observing exactly $k = 2$ Irises with long sepal length in $m = 10$ trials is given as

$$f(2) = P(B = 2) = \binom{10}{2}(0.087)^2(0.913)^8 = 0.164$$

Figure 1.6 shows the full probability mass function for different values of $k$ for $m = 10$. Because $p$ is quite small, the probability of $k$ successes in so few trials falls off rapidly as $k$ increases, becoming practically zero for values of $k \geq 6$.
Figure 1.6. Binomial distribution: probability mass function ($m = 10$, $p = 0.087$). [figure]
Probability Density Function
If $X$ is continuous, its range is the entire set of real numbers $\mathbb{R}$. The probability of any specific value $x$ is only one out of the infinitely many possible values in the range of $X$, which means that $P(X = x) = 0$ for all $x \in \mathbb{R}$. However, this does not mean that the value $x$ is impossible, because in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals $[a, b] \subset \mathbb{R}$, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable $X$ takes on values in any interval $[a, b] \subset \mathbb{R}$:

$$P(X \in [a, b]) = \int_a^b f(x)\,dx$$

As before, the density function $f$ must satisfy the basic laws of probability:

$$f(x) \geq 0 \quad \text{for all } x \in \mathbb{R}$$

and

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

We can get an intuitive understanding of the density function $f$ by considering the probability density over a small interval of width $2\epsilon > 0$, centered at $x$, namely $[x-\epsilon, x+\epsilon]$:

$$P(X \in [x-\epsilon, x+\epsilon]) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)$$

$$f(x) \simeq \frac{P(X \in [x-\epsilon, x+\epsilon])}{2\epsilon} \qquad (1.8)$$

$f(x)$ thus gives the probability density at $x$, given as the ratio of the probability mass to the width of the interval, that is, the probability mass per unit distance. Thus, it is important to note that $P(X = x) \neq f(x)$.
Even though the probability density function $f(x)$ does not specify the probability $P(X = x)$, it can be used to obtain the relative probability of one value $x_1$ over another $x_2$, because for a given $\epsilon > 0$, by Eq. (1.8), we have

$$\frac{P(X \in [x_1-\epsilon, x_1+\epsilon])}{P(X \in [x_2-\epsilon, x_2+\epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \qquad (1.9)$$

Thus, if $f(x_1)$ is larger than $f(x_2)$, then values of $X$ close to $x_1$ are more probable than values close to $x_2$, and vice versa.
Example 1.8 (Normal Distribution). Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$

There are two parameters of the normal density distribution, namely, $\mu$, which represents the mean value, and $\sigma^2$, which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic "bell" shape plot of the normal distribution. The parameters, $\mu = 5.84$ and $\sigma^2 = 0.681$, were estimated directly from the data for sepal length in Table 1.2.

Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}}\exp\{0\} = 0.483$, we emphasize that the probability of observing $X = \mu$ is zero, that is, $P(X = \mu) = 0$. Thus, $P(X = x)$ is not given by $f(x)$; rather, $P(X = x)$ is given as the area under the curve for an infinitesimally small interval $[x-\epsilon, x+\epsilon]$ centered at $x$, with $\epsilon > 0$. Figure 1.7 illustrates this with the shaded region centered at $\mu = 5.84$. From Eq. (1.8), we have

$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$

As $\epsilon \to 0$, we get $P(X = \mu) \to 0$. However, based on Eq. (1.9) we can claim that the probability of observing values close to the mean value $\mu = 5.84$ is 2.69 times the probability of observing values close to $x = 7$, as

$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
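The following small NumPy sketch (ours) evaluates this normal density and reproduces the ratio $f(5.84)/f(7) \approx 2.69$.

```python
import numpy as np

def normal_pdf(x: float, mu: float, var: float) -> float:
    """Normal (Gaussian) probability density function."""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu, var = 5.84, 0.681          # parameters estimated from Table 1.2
print(round(normal_pdf(mu, mu, var), 3))       # -> 0.483, the peak density
ratio = normal_pdf(mu, mu, var) / normal_pdf(7.0, mu, var)
print(round(ratio, 2))                         # -> 2.69
```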
Figure 1.7. Normal distribution: probability density function ($\mu = 5.84$, $\sigma^2 = 0.681$). [figure]
Cumulative Distribution Function
For any random variable $X$, whether discrete or continuous, we can define the cumulative distribution function (CDF) $F: \mathbb{R} \to [0, 1]$, which gives the probability of observing a value at most some given value $x$:

$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$

CHAPTER 2
Numeric Attributes

2.1 UNIVARIATE ANALYSIS

Given a data sample $\{x_1, x_2, \ldots, x_n\}$ for an attribute $X$, the empirical cumulative distribution function is given as

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \leq x) \qquad (2.1)$$

where $I(x_i \leq x)$ is a binary indicator variable that indicates whether the given condition is satisfied or not. Intuitively, to obtain the empirical CDF we compute, for each value $x \in \mathbb{R}$, how many points in the sample are less than or equal to $x$. The empirical CDF puts a probability mass of $\frac{1}{n}$ at each point $x_i$. Note that we use the notation $\hat{F}$ to denote the fact that the empirical CDF is an estimate for the unknown population CDF $F$.
Inverse Cumulative Distribution Function
Define the inverse cumulative distribution function or quantile function for a random variable $X$ as follows:

$$F^{-1}(q) = \min\{x \mid F(x) \geq q\} \quad \text{for } q \in [0, 1] \qquad (2.2)$$

That is, the inverse CDF gives the least value of $X$ for which $q$ fraction of the values are lower, and $1 - q$ fraction of the values are higher. The empirical inverse cumulative distribution function $\hat{F}^{-1}$ can be obtained from Eq. (2.1).
Empirical Probability Mass Function
The empirical probability mass function (PMF) of $X$ is given as

$$\hat{f}(x) = P(X = x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i = x) \qquad (2.3)$$

where

$$I(x_i = x) = \begin{cases} 1 & \text{if } x_i = x \\ 0 & \text{if } x_i \neq x \end{cases}$$

The empirical PMF also puts a probability mass of $\frac{1}{n}$ at each point $x_i$.
2.1.1 Measures of Central Tendency

These measures give an indication about the concentration of the probability mass, the "middle" values, and so on.

Mean
The mean, also called the expected value, of a random variable $X$ is the arithmetic average of the values of $X$. It provides a one-number summary of the location or central tendency for the distribution of $X$.

The mean or expected value of a discrete random variable $X$ is defined as

$$\mu = E[X] = \sum_x x f(x) \qquad (2.4)$$

where $f(x)$ is the probability mass function of $X$.
The expected value of a continuous random variable $X$ is defined as

$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$

where $f(x)$ is the probability density function of $X$.
Sample Mean
The sample mean is a statistic, that is, a function $\hat{\mu}: \{x_1, x_2, \ldots, x_n\} \to \mathbb{R}$, defined as the average value of the $x_i$'s:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.5)$$

It serves as an estimator for the unknown mean value $\mu$ of $X$. It can be derived by plugging in the empirical PMF $\hat{f}(x)$ in Eq. (2.4):

$$\hat{\mu} = \sum_x x \hat{f}(x) = \sum_x x \left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Sample Mean Is Unbiased
An estimator $\hat{\theta}$ is called an unbiased estimator for parameter $\theta$ if $E[\hat{\theta}] = \theta$ for every possible value of $\theta$. The sample mean $\hat{\mu}$ is an unbiased estimator for the population mean $\mu$, as

$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu \qquad (2.6)$$

where we use the fact that the random variables $x_i$ are IID according to $X$, which implies that they have the same mean $\mu$ as $X$, that is, $E[x_i] = \mu$ for all $x_i$. We also used the fact that the expectation function $E$ is a linear operator, that is, for any two random variables $X$ and $Y$, and real numbers $a$ and $b$, we have $E[aX + bY] = aE[X] + bE[Y]$.
Robustness
We say that a statistic is robust if it is not affected by extreme values (such as outliers) in the data. The sample mean is unfortunately not robust because a single large value (an outlier) can skew the average. A more robust measure is the trimmed mean, obtained after discarding a small fraction of extreme values on one or both ends. Furthermore, the mean can be somewhat misleading in that it is typically not a value that occurs in the sample, and it may not even be a value that the random variable can actually assume (for a discrete random variable). For example, the number of cars per capita is an integer-valued random variable, but according to the US Bureau of Transportation Statistics, the average number of passenger cars in the United States was 0.45 in 2008 (137.1 million cars, with a population size of 304.4 million). Obviously, one cannot own 0.45 cars; it can be interpreted as saying that on average there are 45 cars per 100 people.
Median
The median of a random variable is defined as the value $m$ such that

$$P(X \leq m) \geq \frac{1}{2} \quad \text{and} \quad P(X \geq m) \geq \frac{1}{2}$$

In other words, the median $m$ is the "middle-most" value; half of the values of $X$ are less and half of the values of $X$ are more than $m$. In terms of the (inverse) cumulative distribution function, the median is therefore the value $m$ for which

$$F(m) = 0.5 \quad \text{or} \quad m = F^{-1}(0.5)$$

The sample median can be obtained from the empirical CDF [Eq. (2.1)] or the empirical inverse CDF [Eq. (2.2)] by computing

$$\hat{F}(m) = 0.5 \quad \text{or} \quad m = \hat{F}^{-1}(0.5)$$
A simpler approach to compute the sample median is to first sort all the values $x_i$ ($i \in [1, n]$) in increasing order. If $n$ is odd, the median is the value at position $\frac{n+1}{2}$. If $n$ is even, the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$ are both medians.

Unlike the mean, the median is robust, as it is not affected very much by extreme values. Also, it is a value that occurs in the sample and a value the random variable can actually assume.
Mode
The mode of a random variable $X$ is the value at which the probability mass function or the probability density function attains its maximum value, depending on whether $X$ is discrete or continuous, respectively.

The sample mode is a value for which the empirical probability mass function [Eq. (2.3)] attains its maximum, given as

$$\text{mode}(X) = \arg\max_x \hat{f}(x)$$

The mode may not be a very useful measure of central tendency for a sample because by chance an unrepresentative element may be the most frequent element. Furthermore, if all values in the sample are distinct, each of them will be the mode.
Example 2.1 (Sample Mean, Median, and Mode). Consider the attribute sepal length ($X_1$) in the Iris dataset, whose values are shown in Table 1.2. The sample mean is given as follows:

$$\hat{\mu} = \frac{1}{150}(5.9 + 6.9 + \cdots + 7.7 + 5.1) = \frac{876.5}{150} = 5.843$$

Figure 2.1 shows all 150 values of sepal length, and the sample mean. Figure 2.2a shows the empirical CDF and Figure 2.2b shows the empirical inverse CDF for sepal length.

Because $n = 150$ is even, the sample median is the value at positions $\frac{n}{2} = 75$ and $\frac{n}{2} + 1 = 76$ in sorted order. For sepal length both these values are 5.8; thus the sample median is 5.8. From the inverse CDF in Figure 2.2b, we can see that

$$\hat{F}(5.8) = 0.5 \quad \text{or} \quad 5.8 = \hat{F}^{-1}(0.5)$$

The sample mode for sepal length is 5, which can be observed from the frequency of 5 in Figure 2.1. The empirical probability mass at $x = 5$ is

$$\hat{f}(5) = \frac{10}{150} = 0.067$$
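A NumPy check of these statistics on the sepal length values of Table 1.2 would look like the sketch below, assuming the 150 values have been loaded into the array x (abbreviated here).

```python
import numpy as np

# x should hold the 150 sepal length values from Table 1.2.
x = np.array([5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8])  # ... all 150 values

mean = x.mean()                           # sample mean, Eq. (2.5)
median = np.median(x)                     # sample median
vals, counts = np.unique(x, return_counts=True)
mode = vals[np.argmax(counts)]            # sample mode: most frequent value

print(mean, median, mode)    # with the full data: 5.843, 5.8, 5.0
```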
Figure 2.1. Sample mean for sepal length. Multiple occurrences of the same value are shown stacked. [figure: dot plot with $\hat{\mu} = 5.843$ marked]
Figure 2.2. Empirical CDF and inverse CDF: sepal length. [figure: (a) empirical CDF; (b) empirical inverse CDF]
2.1.2 Measures of Dispersion

The measures of dispersion give an indication about the spread or variation in the values of a random variable.

Range
The value range or simply range of a random variable $X$ is the difference between the maximum and minimum values of $X$, given as

$$r = \max\{X\} - \min\{X\}$$

The (value) range of $X$ is a population parameter, not to be confused with the range of the function $X$, which is the set of all the values $X$ can assume. Which range is being used should be clear from the context.

The sample range is a statistic, given as

$$\hat{r} = \max_{i=1}^{n}\{x_i\} - \min_{i=1}^{n}\{x_i\}$$

By definition, range is sensitive to extreme values, and thus is not robust.
Interquartile Range
Quartiles are special values of the quantile function [Eq. (2.2)] that divide the data into four equal parts. That is, quartiles correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. The first quartile is the value $q_1 = F^{-1}(0.25)$, to the left of which 25% of the points lie; the second quartile is the same as the median value $q_2 = F^{-1}(0.5)$, to the left of which 50% of the points lie; the third quartile $q_3 = F^{-1}(0.75)$ is the value to the left of which 75% of the points lie; and the fourth quartile is the maximum value of $X$, to the left of which 100% of the points lie.

A more robust measure of the dispersion of $X$ is the interquartile range (IQR), defined as

$$\text{IQR} = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25) \qquad (2.7)$$

IQR can also be thought of as a trimmed range, where we discard 25% of the low and high values of $X$. Or put differently, it is the range for the middle 50% of the values of $X$. IQR is robust by definition.

The sample IQR can be obtained by plugging in the empirical inverse CDF in Eq. (2.7):

$$\widehat{\text{IQR}} = \hat{q}_3 - \hat{q}_1 = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$$
Variance and Standard Deviation
The variance of a random variable $X$ provides a measure of how much the values of $X$ deviate from the mean or expected value of $X$. More formally, variance is the expected value of the squared deviation from the mean, defined as

$$\sigma^2 = \text{var}(X) = E[(X-\mu)^2] = \begin{cases} \displaystyle\sum_x (x-\mu)^2 f(x) & \text{if } X \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases} \qquad (2.8)$$

The standard deviation, $\sigma$, is defined as the positive square root of the variance, $\sigma^2$.

We can also write the variance as the difference between the expectation of $X^2$ and the square of the expectation of $X$:

$$
\begin{aligned}
\sigma^2 = \text{var}(X) &= E[(X-\mu)^2] = E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
\qquad (2.9)
$$

It is worth noting that variance is in fact the second moment about the mean, corresponding to $r = 2$, which is a special case of the $r$th moment about the mean for a random variable $X$, defined as $E[(X-\mu)^r]$.
Sample Variance
The sample variance is defined as

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 \qquad (2.10)$$

It is the average squared deviation of the data values $x_i$ from the sample mean $\hat{\mu}$, and can be derived by plugging in the empirical probability function $\hat{f}$ from Eq. (2.3) into Eq. (2.8), as

$$\hat{\sigma}^2 = \sum_x (x - \hat{\mu})^2 \hat{f}(x) = \sum_x (x - \hat{\mu})^2 \left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$

The sample standard deviation is given as the positive square root of the sample variance:

$$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2}$$

The standard score, also called the z-score, of a sample value $x_i$ is the number of standard deviations the value is away from the mean:

$$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$

Put differently, the $z$-score of $x_i$ measures the deviation of $x_i$ from the mean value $\hat{\mu}$, in units of $\hat{\sigma}$.
Geometric Interpretation of Sample Variance
We can treat the data sample for attribute $X$ as a vector in $n$-dimensional space, where $n$ is the sample size. That is, we write $X = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$. Further, let

$$\mathbf{Z} = X - \mathbf{1}\cdot\hat{\mu} = \begin{pmatrix} x_1 - \hat{\mu} \\ x_2 - \hat{\mu} \\ \vdots \\ x_n - \hat{\mu} \end{pmatrix}$$

denote the mean-subtracted attribute vector, where $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector all of whose elements have value 1. We can rewrite Eq. (2.10) in terms of the magnitude of $\mathbf{Z}$, that is, the dot product of $\mathbf{Z}$ with itself:

$$\hat{\sigma}^2 = \frac{1}{n}\|\mathbf{Z}\|^2 = \frac{1}{n}\mathbf{Z}^T\mathbf{Z} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 \qquad (2.11)$$

The sample variance can thus be interpreted as the squared magnitude of the centered attribute vector, or the dot product of the centered attribute vector with itself, normalized by the sample size.
Example 2.2. Consider the data sample for sepal length shown in Figure 2.1. We can see that the sample range is given as

$$\max_i\{x_i\} - \min_i\{x_i\} = 7.9 - 4.3 = 3.6$$

From the inverse CDF for sepal length in Figure 2.2b, we can find the sample IQR as follows:

$$\hat{q}_1 = \hat{F}^{-1}(0.25) = 5.1 \qquad \hat{q}_3 = \hat{F}^{-1}(0.75) = 6.4$$

$$\widehat{\text{IQR}} = \hat{q}_3 - \hat{q}_1 = 6.4 - 5.1 = 1.3$$

The sample variance can be computed from the centered data vector via Eq. (2.11):

$$\hat{\sigma}^2 = \frac{1}{n}(X - \mathbf{1}\cdot\hat{\mu})^T(X - \mathbf{1}\cdot\hat{\mu}) = 102.168/150 = 0.681$$

The sample standard deviation is then

$$\hat{\sigma} = \sqrt{0.681} = 0.825$$
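The dispersion measures of Example 2.2 can be checked with NumPy; the sketch below (ours) assumes x holds the 150 sepal length values of Table 1.2. Note that np.quantile interpolates between sample values by default, which can differ slightly from the inverse-CDF convention of Eq. (2.2).

```python
import numpy as np

# x: the 150 sepal length values from Table 1.2 (abbreviated here).
x = np.array([5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8])

sample_range = x.max() - x.min()           # with the full data: 3.6
q1, q3 = np.quantile(x, [0.25, 0.75])      # sample quartiles
iqr = q3 - q1                              # with the full data: 1.3

z = x - x.mean()                           # centered attribute vector Z
var = (z @ z) / len(x)                     # Eq. (2.11); full data: 0.681
std = np.sqrt(var)                         # full data: 0.825
print(sample_range, iqr, var, std)
```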
Variance of the Sample Mean
Because the sample mean $\hat{\mu}$ is itself a statistic, we can compute its mean value and variance. The expected value of the sample mean is simply $\mu$, as we saw in Eq. (2.6). To derive an expression for the variance of the sample mean, we utilize the fact that the random variables $x_i$ are all independent, and thus

$$\text{var}\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \text{var}(x_i)$$

Further, because all the $x_i$'s are identically distributed as $X$, they have the same variance as $X$, that is,

$$\text{var}(x_i) = \sigma^2 \quad \text{for all } i$$

Combining the above two facts, we get

$$\text{var}\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \text{var}(x_i) = \sum_{i=1}^{n} \sigma^2 = n\sigma^2 \qquad (2.12)$$

Further, note that

$$E\left[\sum_{i=1}^{n} x_i\right] = n\mu \qquad (2.13)$$
Using Eqs.(2.9), (2.12), and (2.13), the variance of the sample mean
ˆ
µ
can be
computed as
var(
ˆ
µ)
=
E
[
(
ˆ
µ
−
µ)
2
]
=
E
[
ˆ
µ
2
]
−
µ
2
=
E
1
n
n
i
=
1
x
i
2
−
1
n
2
E
n
i
=
1
x
i
2
=
1
n
2
E
n
i
=
1
x
i
2
−
E
n
i
=
1
x
i
2
=
1
n
2
var
n
i
=
1
x
i
=
σ
2
n
(2.14)
In other words, the sample mean
ˆ
µ
varies or deviates from the mean
µ
in proportion
to the population variance
σ
2
. However, the deviation can be made smaller by
considering larger sample size
n
.
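The $\sigma^2/n$ behavior of Eq. (2.14) is easy to check by simulation; the sketch below uses NumPy, with arbitrary choices for the distribution, variance, and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 50, 100_000   # hypothetical population variance and sample size

# draw many samples of size n and compute the sample mean of each
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
mu_hats = samples.mean(axis=1)

print(mu_hats.var())    # close to sigma2 / n = 0.08  (Eq. 2.14)
```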
Sample Variance Is Biased, but Is Asymptotically Unbiased
The sample variance in Eq. (2.10) is a biased estimator for the true population variance, $\sigma^2$, that is, $E[\hat\sigma^2] \neq \sigma^2$. To show this we make use of the identity

$$\sum_{i=1}^n (x_i - \mu)^2 = n(\hat\mu - \mu)^2 + \sum_{i=1}^n (x_i - \hat\mu)^2 \qquad (2.15)$$

Computing the expectation of $\hat\sigma^2$ by using Eq. (2.15) in the first step, we get

$$E[\hat\sigma^2] = E\left[\frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2\right] = E\left[\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\right] - E[(\hat\mu - \mu)^2] \qquad (2.16)$$

Recall that the random variables $x_i$ are IID according to $X$, which means that they have the same mean $\mu$ and variance $\sigma^2$ as $X$. This means that

$$E[(x_i - \mu)^2] = \sigma^2$$

Further, from Eq. (2.14) the sample mean $\hat\mu$ has variance $E[(\hat\mu - \mu)^2] = \frac{\sigma^2}{n}$. Plugging these into Eq. (2.16) we get

$$E[\hat\sigma^2] = \frac{1}{n}\cdot n\sigma^2 - \frac{\sigma^2}{n} = \left(\frac{n-1}{n}\right)\sigma^2$$

The sample variance $\hat\sigma^2$ is a biased estimator of $\sigma^2$, as its expected value differs from the population variance by a factor of $\frac{n-1}{n}$. However, it is asymptotically unbiased, that is, the bias vanishes as $n\to\infty$ because

$$\lim_{n\to\infty} \frac{n-1}{n} = \lim_{n\to\infty} 1 - \frac{1}{n} = 1$$

Put differently, as the sample size increases, we have

$$E[\hat\sigma^2] \to \sigma^2 \text{ as } n \to \infty$$
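The bias factor $(n-1)/n$ can likewise be verified empirically; a minimal sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, trials = 4.0, 10, 200_000    # hypothetical population variance, small n

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
var_hats = samples.var(axis=1)          # biased estimator of Eq. (2.10): divides by n

print(var_hats.mean())                  # close to (n-1)/n * sigma2 = 3.6
print(var_hats.mean() * n / (n - 1))    # bias-corrected, close to sigma2 = 4.0
```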
2.2 BIVARIATE ANALYSIS

In bivariate analysis, we consider two attributes at the same time. We are specifically interested in understanding the association or dependence between them, if any. We thus restrict our attention to the two numeric attributes of interest, say $X_1$ and $X_2$, with the data $D$ represented as an $n\times 2$ matrix:

$$D = \begin{pmatrix} X_1 & X_2 \\ \hline x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}$$

Geometrically, we can think of $D$ in two ways. It can be viewed as $n$ points or vectors in 2-dimensional space over the attributes $X_1$ and $X_2$, that is, $x_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^2$. Alternatively, it can be viewed as two points or vectors in an $n$-dimensional space comprising the points, that is, each column is a vector in $\mathbb{R}^n$, as follows:

$$X_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T \qquad X_2 = (x_{12}, x_{22}, \ldots, x_{n2})^T$$

In the probabilistic view, the column vector $X = (X_1, X_2)^T$ is considered a bivariate vector random variable, and the points $x_i$ ($1\le i\le n$) are treated as a random sample drawn from $X$, that is, the $x_i$'s are considered independent and identically distributed as $X$.
Empirical Joint Probability Mass Function
The empirical joint probability mass function for $X$ is given as

$$\hat f(\mathbf{x}) = P(X = \mathbf{x}) = \frac{1}{n}\sum_{i=1}^n I(x_i = \mathbf{x}) \qquad (2.17)$$

$$\hat f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = \frac{1}{n}\sum_{i=1}^n I(x_{i1} = x_1,\; x_{i2} = x_2)$$

where $\mathbf{x} = (x_1, x_2)^T$ and $I$ is an indicator variable that takes on the value 1 only when its argument is true:

$$I(x_i = \mathbf{x}) = \begin{cases} 1 & \text{if } x_{i1} = x_1 \text{ and } x_{i2} = x_2 \\ 0 & \text{otherwise} \end{cases}$$

As in the univariate case, the probability function puts a probability mass of $\frac{1}{n}$ at each point in the data sample.
2.2.1 Measures of Location and Dispersion

Mean
The bivariate mean is defined as the expected value of the vector random variable $X$, defined as follows:

$$\mu = E[X] = E\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} E[X_1] \\ E[X_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \qquad (2.18)$$

In other words, the bivariate mean vector is simply the vector of expected values along each attribute.

The sample mean vector can be obtained from $\hat f_{X_1}$ and $\hat f_{X_2}$, the empirical probability mass functions of $X_1$ and $X_2$, respectively, using Eq. (2.5). It can also be computed from the joint empirical PMF in Eq. (2.17):

$$\hat\mu = \sum_{\mathbf{x}} \mathbf{x}\,\hat f(\mathbf{x}) = \sum_{\mathbf{x}} \mathbf{x}\left(\frac{1}{n}\sum_{i=1}^n I(x_i = \mathbf{x})\right) = \frac{1}{n}\sum_{i=1}^n x_i \qquad (2.19)$$

Variance
We can compute the variance along each attribute, namely $\sigma_1^2$ for $X_1$ and $\sigma_2^2$ for $X_2$, using Eq. (2.8). The total variance [Eq. (1.4)] is given as

$$\mathrm{var}(D) = \sigma_1^2 + \sigma_2^2$$

The sample variances $\hat\sigma_1^2$ and $\hat\sigma_2^2$ can be estimated using Eq. (2.10), and the sample total variance is simply $\hat\sigma_1^2 + \hat\sigma_2^2$.
2.2.2 Measures of Association

Covariance
The covariance between two attributes $X_1$ and $X_2$ provides a measure of the association or linear dependence between them, and is defined as

$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] \qquad (2.20)$$

By linearity of expectation, we have

$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E[X_1X_2 - X_1\mu_2 - X_2\mu_1 + \mu_1\mu_2] = E[X_1X_2] - \mu_2E[X_1] - \mu_1E[X_2] + \mu_1\mu_2 = E[X_1X_2] - \mu_1\mu_2 = E[X_1X_2] - E[X_1]E[X_2] \qquad (2.21)$$

Eq. (2.21) can be seen as a generalization of the univariate variance [Eq. (2.9)] to the bivariate case.

If $X_1$ and $X_2$ are independent random variables, then we conclude that their covariance is zero. This is because if $X_1$ and $X_2$ are independent, then we have

$$E[X_1X_2] = E[X_1]\cdot E[X_2]$$

which in turn implies that

$$\sigma_{12} = 0$$

However, the converse is not true. That is, if $\sigma_{12} = 0$, one cannot claim that $X_1$ and $X_2$ are independent. All we can say is that there is no linear dependence between them, but we cannot rule out that there might be a higher order relationship or dependence between the two attributes.

The sample covariance between $X_1$ and $X_2$ is given as

$$\hat\sigma_{12} = \frac{1}{n}\sum_{i=1}^n (x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2) \qquad (2.22)$$
It can be derived by substituting the empirical joint probability mass function $\hat f(x_1, x_2)$ from Eq. (2.17) into Eq. (2.20), as follows:

$$\hat\sigma_{12} = E[(X_1 - \hat\mu_1)(X_2 - \hat\mu_2)] = \sum_{\mathbf{x}=(x_1,x_2)^T} (x_1 - \hat\mu_1)(x_2 - \hat\mu_2)\,\hat f(x_1, x_2) = \frac{1}{n}\sum_{\mathbf{x}=(x_1,x_2)^T}\sum_{i=1}^n (x_1 - \hat\mu_1)\cdot(x_2 - \hat\mu_2)\cdot I(x_{i1} = x_1,\; x_{i2} = x_2) = \frac{1}{n}\sum_{i=1}^n (x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2)$$

Notice that sample covariance is a generalization of the sample variance [Eq. (2.10)] because

$$\hat\sigma_{11} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu_1)(x_i - \hat\mu_1) = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu_1)^2 = \hat\sigma_1^2$$

and similarly, $\hat\sigma_{22} = \hat\sigma_2^2$.
Correlation
The correlation between variables $X_1$ and $X_2$ is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as

$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}} \qquad (2.23)$$

The sample correlation for attributes $X_1$ and $X_2$ is given as

$$\hat\rho_{12} = \frac{\hat\sigma_{12}}{\hat\sigma_1\hat\sigma_2} = \frac{\sum_{i=1}^n (x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2)}{\sqrt{\sum_{i=1}^n (x_{i1} - \hat\mu_1)^2}\,\sqrt{\sum_{i=1}^n (x_{i2} - \hat\mu_2)^2}} \qquad (2.24)$$
Geometric Interpretation of Sample Covariance and Correlation
Let $Z_1$ and $Z_2$ denote the centered attribute vectors in $\mathbb{R}^n$, given as follows:

$$Z_1 = X_1 - \mathbf{1}\cdot\hat\mu_1 = \begin{pmatrix} x_{11} - \hat\mu_1 \\ x_{21} - \hat\mu_1 \\ \vdots \\ x_{n1} - \hat\mu_1 \end{pmatrix} \qquad Z_2 = X_2 - \mathbf{1}\cdot\hat\mu_2 = \begin{pmatrix} x_{12} - \hat\mu_2 \\ x_{22} - \hat\mu_2 \\ \vdots \\ x_{n2} - \hat\mu_2 \end{pmatrix}$$

The sample covariance [Eq. (2.22)] can then be written as

$$\hat\sigma_{12} = \frac{Z_1^T Z_2}{n}$$

In other words, the covariance between the two attributes is simply the dot product between the two centered attribute vectors, normalized by the sample size. The above can be seen as a generalization of the univariate sample variance given in Eq. (2.11).

[Figure 2.3. Geometric interpretation of covariance and correlation. The two centered attribute vectors $Z_1$ and $Z_2$, with angle $\theta$ between them, are shown in the (conceptual) $n$-dimensional space $\mathbb{R}^n$ spanned by the $n$ points.]
The sample correlation [Eq. (2.24)] can be written as

$$\hat\rho_{12} = \frac{Z_1^T Z_2}{\sqrt{Z_1^T Z_1}\,\sqrt{Z_2^T Z_2}} = \frac{Z_1^T Z_2}{\|Z_1\|\,\|Z_2\|} = \left(\frac{Z_1}{\|Z_1\|}\right)^T\left(\frac{Z_2}{\|Z_2\|}\right) = \cos\theta \qquad (2.25)$$

Thus, the correlation coefficient is simply the cosine of the angle [Eq. (1.3)] between the two centered attribute vectors, as illustrated in Figure 2.3.
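Both the covariance and the cosine view of correlation reduce to dot products of the centered vectors; a minimal sketch with hypothetical paired data:

```python
import numpy as np

X1 = np.array([5.1, 4.9, 7.0, 6.4, 5.9])   # hypothetical attribute vectors
X2 = np.array([3.5, 3.0, 3.2, 3.1, 3.0])

Z1, Z2 = X1 - X1.mean(), X2 - X2.mean()    # centered attribute vectors

cov12 = (Z1 @ Z2) / len(X1)                # sample covariance (Eq. 2.22)
rho12 = (Z1 @ Z2) / (np.linalg.norm(Z1) * np.linalg.norm(Z2))   # Eq. (2.25)

theta = np.degrees(np.arccos(rho12))       # angle between the centered vectors
```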
Covariance Matrix
The variance–covariance information for the two attributes $X_1$ and $X_2$ can be summarized in the square $2\times 2$ covariance matrix, given as

$$\Sigma = E[(X-\mu)(X-\mu)^T] = E\left[\begin{pmatrix} X_1-\mu_1 \\ X_2-\mu_2 \end{pmatrix}\begin{pmatrix} X_1-\mu_1 & X_2-\mu_2 \end{pmatrix}\right] = \begin{pmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} \qquad (2.26)$$

Because $\sigma_{12} = \sigma_{21}$, $\Sigma$ is a symmetric matrix. The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.

The total variance of the two attributes is given as the sum of the diagonal elements of $\Sigma$, which is also called the trace of $\Sigma$, given as

$$\mathrm{var}(D) = \mathrm{tr}(\Sigma) = \sigma_1^2 + \sigma_2^2$$

We immediately have $\mathrm{tr}(\Sigma) \ge 0$.

The generalized variance of the two attributes also considers the covariance, in addition to the attribute variances, and is given as the determinant of the covariance matrix $\Sigma$, denoted as $|\Sigma|$ or $\det(\Sigma)$. The generalized variance is non-negative, because

$$|\Sigma| = \det(\Sigma) = \sigma_1^2\sigma_2^2 - \sigma_{12}^2 = \sigma_1^2\sigma_2^2 - \rho_{12}^2\sigma_1^2\sigma_2^2 = (1 - \rho_{12}^2)\,\sigma_1^2\sigma_2^2$$

where we used Eq. (2.23), that is, $\sigma_{12} = \rho_{12}\sigma_1\sigma_2$. Note that $|\rho_{12}| \le 1$ implies that $\rho_{12}^2 \le 1$, which in turn implies that $\det(\Sigma) \ge 0$, that is, the determinant is non-negative.

The sample covariance matrix is given as

$$\widehat\Sigma = \begin{pmatrix} \hat\sigma_1^2 & \hat\sigma_{12} \\ \hat\sigma_{12} & \hat\sigma_2^2 \end{pmatrix}$$

The sample covariance matrix $\widehat\Sigma$ shares the same properties as $\Sigma$, that is, it is symmetric and $|\widehat\Sigma| \ge 0$, and it can be used to easily obtain the sample total and generalized variance.
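In code, the sample covariance matrix, total variance, and generalized variance are a few lines; a minimal sketch with hypothetical values:

```python
import numpy as np

D = np.array([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2],
              [6.4, 3.1], [5.9, 3.0]])      # hypothetical n x 2 data matrix

Z = D - D.mean(axis=0)                      # centered data matrix
Sigma_hat = Z.T @ Z / len(D)                # sample covariance matrix

total_var = np.trace(Sigma_hat)             # sample total variance
gen_var = np.linalg.det(Sigma_hat)          # sample generalized variance (>= 0)
```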
[Figure 2.4. Correlation between sepal length ($X_1$) and sepal width ($X_2$), with the best linear fit line.]
Example 2.3 (Sample Mean and Covariance). Consider the sepal length and sepal width attributes for the Iris dataset, plotted in Figure 2.4. There are $n = 150$ points in the $d = 2$ dimensional attribute space. The sample mean vector is given as

$$\hat\mu = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$

The sample covariance matrix is given as

$$\widehat\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

The variance for sepal length is $\hat\sigma_1^2 = 0.681$, and that for sepal width is $\hat\sigma_2^2 = 0.187$. The covariance between the two attributes is $\hat\sigma_{12} = -0.039$, and the correlation between them is

$$\hat\rho_{12} = \frac{-0.039}{\sqrt{0.681\cdot 0.187}} = -0.109$$

Thus, there is a very weak negative correlation between these two attributes, as evidenced by the best linear fit line in Figure 2.4. Alternatively, we can consider the attributes sepal length and sepal width as two points in $\mathbb{R}^n$. The correlation is then the cosine of the angle between them; we have

$$\hat\rho_{12} = \cos\theta = -0.109 \qquad \theta = \cos^{-1}(-0.109) = 96.26^\circ$$

The angle is close to $90^\circ$, that is, the two attribute vectors are almost orthogonal, indicating weak correlation. Further, the angle being greater than $90^\circ$ indicates negative correlation.

The sample total variance is given as

$$\mathrm{tr}(\widehat\Sigma) = 0.681 + 0.187 = 0.868$$

and the sample generalized variance is given as

$$|\widehat\Sigma| = \det(\widehat\Sigma) = 0.681\cdot 0.187 - (-0.039)^2 = 0.126$$
2.3 MULTIVARIATE ANALYSIS

In multivariate analysis, we consider all the $d$ numeric attributes $X_1, X_2, \ldots, X_d$. The full data is an $n\times d$ matrix, given as

$$D = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \\ \hline x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

In the row view, the data can be considered as a set of $n$ points or vectors in the $d$-dimensional attribute space

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \in \mathbb{R}^d$$

In the column view, the data can be considered as a set of $d$ points or vectors in the $n$-dimensional space spanned by the data points

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T \in \mathbb{R}^n$$

In the probabilistic view, the $d$ attributes are modeled as a vector random variable, $X = (X_1, X_2, \ldots, X_d)^T$, and the points $x_i$ are considered to be a random sample drawn from $X$, that is, they are independent and identically distributed as $X$.
Mean
Generalizing Eq. (2.18), the multivariate mean vector is obtained by taking the mean of each attribute, given as

$$\mu = E[X] = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_d] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{pmatrix}$$

Generalizing Eq. (2.19), the sample mean is given as

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i$$
Covariance Matrix
Generalizing Eq. (2.26) to $d$ dimensions, the multivariate covariance information is captured by the $d\times d$ (square) symmetric covariance matrix that gives the covariance for each pair of attributes:

$$\Sigma = E[(X-\mu)(X-\mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$

The diagonal element $\sigma_i^2$ specifies the attribute variance for $X_i$, whereas the off-diagonal elements $\sigma_{ij} = \sigma_{ji}$ represent the covariance between attribute pairs $X_i$ and $X_j$.
Covariance Matrix Is Positive Semidefinite
It is worth noting that $\Sigma$ is a positive semidefinite matrix, that is,

$$a^T\Sigma a \ge 0 \text{ for any } d\text{-dimensional vector } a$$

To see this, observe that

$$a^T\Sigma a = a^T E[(X-\mu)(X-\mu)^T]\,a = E[a^T(X-\mu)(X-\mu)^T a] = E[Y^2] \ge 0$$

where $Y$ is the random variable $Y = a^T(X-\mu) = \sum_{i=1}^d a_i(X_i - \mu_i)$, and we use the fact that the expectation of a squared random variable is non-negative.

Because $\Sigma$ is also symmetric, this implies that all the eigenvalues of $\Sigma$ are real and non-negative. In other words, the $d$ eigenvalues of $\Sigma$ can be arranged from the largest to the smallest as follows: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. A consequence is that the determinant of $\Sigma$ is non-negative:

$$\det(\Sigma) = \prod_{i=1}^d \lambda_i \ge 0 \qquad (2.27)$$
Total and Generalized Variance
The total variance is given as the trace of the covariance matrix:

$$\mathrm{var}(D) = \mathrm{tr}(\Sigma) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_d^2 \qquad (2.28)$$

Being a sum of squares, the total variance must be non-negative.

The generalized variance is defined as the determinant of the covariance matrix, $\det(\Sigma)$, also denoted as $|\Sigma|$. It gives a single value for the overall multivariate scatter. From Eq. (2.27) we have $\det(\Sigma) \ge 0$.
Sample Covariance Matrix
The sample covariance matrix is given as

$$\widehat\Sigma = E[(X-\hat\mu)(X-\hat\mu)^T] = \begin{pmatrix} \hat\sigma_1^2 & \hat\sigma_{12} & \cdots & \hat\sigma_{1d} \\ \hat\sigma_{21} & \hat\sigma_2^2 & \cdots & \hat\sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \hat\sigma_{d1} & \hat\sigma_{d2} & \cdots & \hat\sigma_d^2 \end{pmatrix} \qquad (2.29)$$

Instead of computing the sample covariance matrix element-by-element, we can obtain it via matrix operations. Let $Z$ represent the centered data matrix, given as the matrix of centered attribute vectors $Z_i = X_i - \mathbf{1}\cdot\hat\mu_i$, where $\mathbf{1}\in\mathbb{R}^n$:

$$Z = D - \mathbf{1}\cdot\hat\mu^T = \begin{pmatrix} | & | & & | \\ Z_1 & Z_2 & \cdots & Z_d \\ | & | & & | \end{pmatrix}$$

Alternatively, the centered data matrix can also be written in terms of the centered points $z_i = x_i - \hat\mu$:

$$Z = D - \mathbf{1}\cdot\hat\mu^T = \begin{pmatrix} x_1^T - \hat\mu^T \\ x_2^T - \hat\mu^T \\ \vdots \\ x_n^T - \hat\mu^T \end{pmatrix} = \begin{pmatrix} \text{--- } z_1^T \text{ ---} \\ \text{--- } z_2^T \text{ ---} \\ \vdots \\ \text{--- } z_n^T \text{ ---} \end{pmatrix}$$

In matrix notation, the sample covariance matrix can be written as

$$\widehat\Sigma = \frac{1}{n}Z^T Z = \frac{1}{n}\begin{pmatrix} Z_1^T Z_1 & Z_1^T Z_2 & \cdots & Z_1^T Z_d \\ Z_2^T Z_1 & Z_2^T Z_2 & \cdots & Z_2^T Z_d \\ \vdots & \vdots & \ddots & \vdots \\ Z_d^T Z_1 & Z_d^T Z_2 & \cdots & Z_d^T Z_d \end{pmatrix} \qquad (2.30)$$

The sample covariance matrix is thus given as the pairwise inner or dot products of the centered attribute vectors, normalized by the sample size.

In terms of the centered points $z_i$, the sample covariance matrix can also be written as a sum of rank-one matrices obtained as the outer product of each centered point:

$$\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n z_i\cdot z_i^T \qquad (2.31)$$
Example 2.4 (Sample Mean and Covariance Matrix). Let us consider all four numeric attributes for the Iris dataset, namely sepal length, sepal width, petal length, and petal width. The multivariate sample mean vector is given as

$$\hat\mu = \begin{pmatrix} 5.843 & 3.054 & 3.759 & 1.199 \end{pmatrix}^T$$

and the sample covariance matrix is given as

$$\widehat\Sigma = \begin{pmatrix} 0.681 & -0.039 & 1.265 & 0.513 \\ -0.039 & 0.187 & -0.320 & -0.117 \\ 1.265 & -0.320 & 3.092 & 1.288 \\ 0.513 & -0.117 & 1.288 & 0.579 \end{pmatrix}$$

The sample total variance is

$$\mathrm{var}(D) = \mathrm{tr}(\widehat\Sigma) = 0.681 + 0.187 + 3.092 + 0.579 = 4.539$$

and the generalized variance is

$$\det(\widehat\Sigma) = 1.853\times 10^{-3}$$
Example 2.5 (Inner and Outer Product). To illustrate the inner and outer product–based computation of the sample covariance matrix, consider the 2-dimensional dataset

$$D = \begin{pmatrix} A_1 & A_2 \\ \hline 1 & 0.8 \\ 5 & 2.4 \\ 9 & 5.5 \end{pmatrix}$$

The mean vector is as follows:

$$\hat\mu = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix} = \begin{pmatrix} 15/3 \\ 8.7/3 \end{pmatrix} = \begin{pmatrix} 5 \\ 2.9 \end{pmatrix}$$

and the centered data matrix is then given as

$$Z = D - \mathbf{1}\cdot\hat\mu^T = \begin{pmatrix} 1 & 0.8 \\ 5 & 2.4 \\ 9 & 5.5 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}\begin{pmatrix} 5 & 2.9 \end{pmatrix} = \begin{pmatrix} -4 & -2.1 \\ 0 & -0.5 \\ 4 & 2.6 \end{pmatrix}$$

The inner-product approach [Eq. (2.30)] to compute the sample covariance matrix gives

$$\widehat\Sigma = \frac{1}{n}Z^T Z = \frac{1}{3}\begin{pmatrix} -4 & 0 & 4 \\ -2.1 & -0.5 & 2.6 \end{pmatrix}\begin{pmatrix} -4 & -2.1 \\ 0 & -0.5 \\ 4 & 2.6 \end{pmatrix} = \frac{1}{3}\begin{pmatrix} 32 & 18.8 \\ 18.8 & 11.42 \end{pmatrix} = \begin{pmatrix} 10.67 & 6.27 \\ 6.27 & 3.81 \end{pmatrix}$$

Alternatively, the outer-product approach [Eq. (2.31)] gives

$$\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n z_i\cdot z_i^T = \frac{1}{3}\left[\begin{pmatrix} -4 \\ -2.1 \end{pmatrix}\begin{pmatrix} -4 & -2.1 \end{pmatrix} + \begin{pmatrix} 0 \\ -0.5 \end{pmatrix}\begin{pmatrix} 0 & -0.5 \end{pmatrix} + \begin{pmatrix} 4 \\ 2.6 \end{pmatrix}\begin{pmatrix} 4 & 2.6 \end{pmatrix}\right]$$
$$= \frac{1}{3}\left[\begin{pmatrix} 16.0 & 8.4 \\ 8.4 & 4.41 \end{pmatrix} + \begin{pmatrix} 0.0 & 0.0 \\ 0.0 & 0.25 \end{pmatrix} + \begin{pmatrix} 16.0 & 10.4 \\ 10.4 & 6.76 \end{pmatrix}\right] = \frac{1}{3}\begin{pmatrix} 32.0 & 18.8 \\ 18.8 & 11.42 \end{pmatrix} = \begin{pmatrix} 10.67 & 6.27 \\ 6.27 & 3.81 \end{pmatrix}$$

where the centered points $z_i$ are the rows of $Z$. We can see that both the inner and outer product approaches yield the same sample covariance matrix.
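The two computations in Example 2.5 can be checked numerically; a minimal sketch with the same 3 × 2 dataset:

```python
import numpy as np

D = np.array([[1, 0.8], [5, 2.4], [9, 5.5]])
n = len(D)
Z = D - D.mean(axis=0)                        # centered data matrix

inner = Z.T @ Z / n                           # Eq. (2.30): inner-product form
outer = sum(np.outer(z, z) for z in Z) / n    # Eq. (2.31): sum of rank-one outer products

assert np.allclose(inner, outer)              # both give [[10.67, 6.27], [6.27, 3.81]]
```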
2.4 DATA NORMALIZATION

When analyzing two or more attributes it is often necessary to normalize the values of the attributes, especially in those cases where the values are vastly different in scale.

Range Normalization
Let $X$ be an attribute and let $x_1, x_2, \ldots, x_n$ be a random sample drawn from $X$. In range normalization each value is scaled by the sample range $\hat r$ of $X$:

$$x_i' = \frac{x_i - \min_i\{x_i\}}{\hat r} = \frac{x_i - \min_i\{x_i\}}{\max_i\{x_i\} - \min_i\{x_i\}}$$

After transformation the new attribute takes on values in the range $[0, 1]$.
Standard Score Normalization
In standard score normalization, also called z-normalization, each value is replaced by its z-score:

$$x_i' = \frac{x_i - \hat\mu}{\hat\sigma}$$

where $\hat\mu$ is the sample mean and $\hat\sigma^2$ is the sample variance of $X$. After transformation, the new attribute has mean $\hat\mu' = 0$, and standard deviation $\hat\sigma' = 1$.
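Both normalizations are simple element-wise transformations; a minimal sketch, applied here to the Age column of Table 2.1:

```python
import numpy as np

def range_normalize(x):
    """Scale values to [0, 1] by the sample range."""
    return (x - x.min()) / (x.max() - x.min())

def z_normalize(x):
    """Replace each value by its z-score (mean 0, std 1)."""
    return (x - x.mean()) / x.std()

age = np.array([12, 14, 18, 23, 27, 28, 34, 37, 39, 40])  # Age column of Table 2.1
print(range_normalize(age))   # starts 0.0, 0.071, 0.214, ...
print(z_normalize(age))       # starts -1.56, -1.35, ...
```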
Example 2.6. Consider the example dataset shown in Table 2.1. The attributes Age and Income have very different scales, with the latter having much larger values. Consider the distance between $x_1$ and $x_2$:

$$\|x_1 - x_2\| = \|(2, 200)^T\| = \sqrt{2^2 + 200^2} = \sqrt{40004} = 200.01$$

As we can observe, the contribution of Age is overshadowed by the value of Income.

The sample range for Age is $\hat r = 40 - 12 = 28$, with the minimum value 12. After range normalization, the new attribute is given as

$$Age' = (0, 0.071, 0.214, 0.393, 0.536, 0.571, 0.786, 0.893, 0.964, 1)^T$$

For example, for the point $x_2 = (x_{21}, x_{22}) = (14, 500)$, the value $x_{21} = 14$ is transformed into

$$x_{21}' = \frac{14 - 12}{28} = \frac{2}{28} = 0.071$$
Table 2.1. Dataset for normalization

 x_i    Age (X1)   Income (X2)
 x1     12          300
 x2     14          500
 x3     18         1000
 x4     23         2000
 x5     27         3500
 x6     28         4000
 x7     34         4300
 x8     37         6000
 x9     39         2500
 x10    40         2700
Likewise, the sample range for Income is $6000 - 300 = 5700$, with a minimum value of 300; Income is therefore transformed into

$$Income' = (0, 0.035, 0.123, 0.298, 0.561, 0.649, 0.702, 1, 0.386, 0.421)^T$$

so that $x_{22}' = 0.035$. The distance between $x_1$ and $x_2$ after range normalization is given as

$$\|x_1' - x_2'\| = \|(0, 0)^T - (0.071, 0.035)^T\| = \|(-0.071, -0.035)^T\| = 0.079$$

We can observe that Income no longer skews the distance.
For z-normalization, we first compute the mean and standard deviation of both attributes:

       Age     Income
 µ̂     27.2    2680
 σ̂     9.77    1726.15

Age is transformed into

$$Age' = (-1.56, -1.35, -0.94, -0.43, -0.02, 0.08, 0.70, 1.0, 1.21, 1.31)^T$$

For instance, the value $x_{21} = 14$, for the point $x_2 = (x_{21}, x_{22}) = (14, 500)$, is transformed as

$$x_{21}' = \frac{14 - 27.2}{9.77} = -1.35$$

Likewise, Income is transformed into

$$Income' = (-1.38, -1.26, -0.97, -0.39, 0.48, 0.77, 0.94, 1.92, -0.10, 0.01)^T$$

so that $x_{22}' = -1.26$. The distance between $x_1$ and $x_2$ after z-normalization is given as

$$\|x_1' - x_2'\| = \|(-1.56, -1.38)^T - (-1.35, -1.26)^T\| = \|(-0.21, -0.12)^T\| = 0.242$$
2.5 NORMAL DISTRIBUTION

The normal distribution is one of the most important probability density functions, especially because many physically observed variables follow an approximately normal distribution. Furthermore, by the central limit theorem, the sampling distribution of the mean of any arbitrary probability distribution approaches a normal distribution as the sample size increases. The normal distribution also plays an important role as the parametric distribution of choice in clustering, density estimation, and classification.
2.5.1 Univariate Normal Distribution

A random variable $X$ has a normal distribution, with the parameters mean $\mu$ and variance $\sigma^2$, if the probability density function of $X$ is given as follows:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$

The term $(x-\mu)^2$ measures the distance of a value $x$ from the mean $\mu$ of the distribution, and thus the probability density decreases exponentially as a function of the distance from the mean. The maximum value of the density occurs at the mean value $x = \mu$, given as $f(\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}$, which is inversely proportional to the standard deviation $\sigma$ of the distribution.
Example 2.7. Figure 2.5 plots the standard normal distribution, which has the parameters $\mu = 0$ and $\sigma^2 = 1$. The normal distribution has a characteristic bell shape, and it is symmetric about the mean. The figure also shows the effect of different values of standard deviation on the shape of the distribution. A smaller value (e.g., $\sigma = 0.5$) results in a more "peaked" distribution that decays faster, whereas a larger value (e.g., $\sigma = 2$) results in a flatter distribution that decays more slowly. Because the normal distribution is symmetric, the mean $\mu$ is also the median, as well as the mode, of the distribution.
Probability Mass
Given an interval $[a, b]$ the probability mass of the normal distribution within that interval is given as

$$P(a \le x \le b) = \int_a^b f(x \mid \mu, \sigma^2)\,dx$$

In particular, we are often interested in the probability mass concentrated within $k$ standard deviations from the mean, that is, for the interval $[\mu - k\sigma,\; \mu + k\sigma]$, which can be computed as

$$P(\mu - k\sigma \le x \le \mu + k\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{\mu-k\sigma}^{\mu+k\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}dx$$
[Figure 2.5. Normal distribution: $\mu = 0$, and different variances ($\sigma = 0.5, 1, 2$).]
Via a change of variable $z = \frac{x-\mu}{\sigma}$, we get an equivalent formulation in terms of the standard normal distribution:

$$P(-k \le z \le k) = \frac{1}{\sqrt{2\pi}}\int_{-k}^{k} e^{-\frac{1}{2}z^2}\,dz = \frac{2}{\sqrt{2\pi}}\int_{0}^{k} e^{-\frac{1}{2}z^2}\,dz$$

The last step follows from the fact that $e^{-\frac{1}{2}z^2}$ is symmetric, and thus the integral over the range $[-k, k]$ is equivalent to 2 times the integral over the range $[0, k]$. Finally, via another change of variable $t = \frac{z}{\sqrt{2}}$, we get

$$P(-k \le z \le k) = \frac{2}{\sqrt{\pi}}\int_{0}^{k/\sqrt{2}} e^{-t^2}\,dt = \mathrm{erf}\left(k/\sqrt{2}\right) \qquad (2.32)$$

where erf is the Gauss error function, defined as

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$$

Using Eq. (2.32) we can compute the probability mass within $k$ standard deviations of the mean. In particular, for $k = 1$, we have

$$P(\mu - \sigma \le x \le \mu + \sigma) = \mathrm{erf}(1/\sqrt{2}) = 0.6827$$

which means that 68.27% of all points lie within 1 standard deviation from the mean. For $k = 2$, we have $\mathrm{erf}(2/\sqrt{2}) = 0.9545$, and for $k = 3$ we have $\mathrm{erf}(3/\sqrt{2}) = 0.9973$. Thus, almost the entire probability mass (i.e., 99.73%) of a normal distribution is within $\pm 3\sigma$ from the mean $\mu$.
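These masses can be reproduced with the error function from Python's standard library; a minimal sketch:

```python
import math

def mass_within_k_sigma(k):
    """Probability mass of a normal within k standard deviations (Eq. 2.32)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(mass_within_k_sigma(k), 4))   # 0.6827, 0.9545, 0.9973
```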
2.5.2 Multivariate Normal Distribution

Given the $d$-dimensional vector random variable $X = (X_1, X_2, \ldots, X_d)^T$, we say that $X$ has a multivariate normal distribution, with the parameters mean $\mu$ and covariance matrix $\Sigma$, if its joint multivariate probability density function is given as follows:

$$f(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(\sqrt{2\pi})^d\sqrt{|\Sigma|}}\exp\left\{-\frac{(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)}{2}\right\} \qquad (2.33)$$

where $|\Sigma|$ is the determinant of the covariance matrix. As in the univariate case, the term

$$(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu) \qquad (2.34)$$

measures the distance, called the Mahalanobis distance, of the point $\mathbf{x}$ from the mean $\mu$ of the distribution, taking into account all of the variance–covariance information between the attributes. The Mahalanobis distance is a generalization of Euclidean distance, because if we set $\Sigma = I$, where $I$ is the $d\times d$ identity matrix (with diagonal elements as 1's and off-diagonal elements as 0's), we get

$$(\mathbf{x}-\mu)^T I^{-1}(\mathbf{x}-\mu) = \|\mathbf{x}-\mu\|^2$$

The Euclidean distance thus ignores the covariance information between the attributes, whereas the Mahalanobis distance explicitly takes it into consideration.

The standard multivariate normal distribution has parameters $\mu = \mathbf{0}$ and $\Sigma = I$.
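Computing the Mahalanobis distance requires only the inverse of the covariance matrix; a minimal sketch using the Iris estimates that appear in Example 2.8 below:

```python
import numpy as np

mu = np.array([5.843, 3.054])                # sample mean (Example 2.8)
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])          # sample covariance matrix

x = np.array([6.9, 3.1])                     # the point x2 from Example 2.8
diff = x - mu

mahalanobis_sq = diff @ np.linalg.inv(Sigma) @ diff   # Eq. (2.34), approx. 1.701
euclidean_sq = diff @ diff                            # approx. 1.119
```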
Figure 2.6a plots the probability density of the standard bivariate ($d = 2$) normal distribution, with parameters

$$\mu = \mathbf{0} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad \text{and} \qquad \Sigma = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

This corresponds to the case where the two attributes are independent, and both follow the standard normal distribution. The symmetric nature of the standard normal distribution can be clearly seen in the contour plot shown in Figure 2.6b. Each level curve represents the set of points $\mathbf{x}$ with a fixed density value $f(\mathbf{x})$.
[Figure 2.6. (a) Standard bivariate normal density and (b) its contour plot. Parameters: $\mu = (0, 0)^T$, $\Sigma = I$.]

Geometry of the Multivariate Normal
Let us consider the geometry of the multivariate normal distribution for an arbitrary mean $\mu$ and covariance matrix $\Sigma$. Compared to the standard normal distribution, we can expect the density contours to be shifted, scaled, and rotated. The shift or translation comes from the fact that the mean $\mu$ is not necessarily the origin $\mathbf{0}$. The scaling or skewing is a result of the attribute variances, and the rotation is a result of the covariances.
The shape or geometry of the normal distribution becomes clear by considering the eigen-decomposition of the covariance matrix. Recall that $\Sigma$ is a $d\times d$ symmetric positive semidefinite matrix. The eigenvector equation for $\Sigma$ is given as

$$\Sigma u_i = \lambda_i u_i$$

Here $\lambda_i$ is an eigenvalue of $\Sigma$ and the vector $u_i \in \mathbb{R}^d$ is the eigenvector corresponding to $\lambda_i$. Because $\Sigma$ is symmetric and positive semidefinite it has $d$ real and non-negative eigenvalues, which can be arranged in order from the largest to the smallest as follows: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. The diagonal matrix $\Lambda$ is used to record these eigenvalues:

$$\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{pmatrix}$$
Further, the eigenvectors are unit vectors (normal) and are mutually orthogonal, that is, they are orthonormal:

$$u_i^T u_i = 1 \text{ for all } i \qquad u_i^T u_j = 0 \text{ for all } i \ne j$$

The eigenvectors can be put together into an orthogonal matrix $U$, defined as a matrix with normal and mutually orthogonal columns:

$$U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_d \\ | & | & & | \end{pmatrix}$$

The eigen-decomposition of $\Sigma$ can then be expressed compactly as follows:

$$\Sigma = U\Lambda U^T$$

This equation can be interpreted geometrically as a change in basis vectors. From the original $d$ dimensions corresponding to the $d$ attributes $X_j$, we derive $d$ new dimensions $u_i$. $\Sigma$ is the covariance matrix in the original space, whereas $\Lambda$ is the covariance matrix in the new coordinate space. Because $\Lambda$ is a diagonal matrix, we can immediately conclude that after the transformation, each new dimension $u_i$ has variance $\lambda_i$, and further that all covariances are zero. In other words, in the new space, the normal distribution is axis aligned (has no rotation component), but is skewed in each axis proportional to the eigenvalue $\lambda_i$, which represents the variance along that dimension (further details are given in Section 7.2.4).
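The eigen-decomposition is directly available for symmetric matrices; a minimal sketch using the covariance matrix from Example 2.8 below:

```python
import numpy as np

Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])      # covariance matrix from Example 2.8

lam, U = np.linalg.eigh(Sigma)           # eigenvalues/eigenvectors of a symmetric matrix
lam, U = lam[::-1], U[:, ::-1]           # reorder so lambda_1 >= lambda_2

# Sigma = U Lambda U^T, up to floating-point error
assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)
print(lam)                               # approx. [0.684, 0.184]
```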
Total and Generalized Variance
The determinant of the covariance matrix is given as $\det(\Sigma) = \prod_{i=1}^d \lambda_i$. Thus, the generalized variance of $\Sigma$ is the product of its eigenvalues.

Given the fact that the trace of a square matrix is invariant to similarity transformations, such as a change of basis, we conclude that the total variance $\mathrm{var}(D)$ for a dataset $D$ is invariant, that is,

$$\mathrm{var}(D) = \mathrm{tr}(\Sigma) = \sum_{i=1}^d \sigma_i^2 = \sum_{i=1}^d \lambda_i = \mathrm{tr}(\Lambda)$$

In other words, $\sigma_1^2 + \cdots + \sigma_d^2 = \lambda_1 + \cdots + \lambda_d$.
Example 2.8 (Bivariate Normal Density). Treating attributes sepal length ($X_1$) and sepal width ($X_2$) in the Iris dataset (see Table 1.1) as continuous random variables, we can define a continuous bivariate random variable $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$. Assuming that $X$ follows a bivariate normal distribution, we can estimate its parameters from the sample. The sample mean is given as

$$\hat\mu = (5.843, 3.054)^T$$
[Figure 2.7. Iris: sepal length and sepal width, bivariate normal density and contours, with the new axes $u_1$ and $u_2$.]
and the sample covariance matrix is given as

$$\widehat\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

The plot of the bivariate normal density for the two attributes is shown in Figure 2.7. The figure also shows the contour lines and the data points.

Consider the point $x_2 = (6.9, 3.1)^T$. We have

$$x_2 - \hat\mu = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} - \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix} = \begin{pmatrix} 1.057 \\ 0.046 \end{pmatrix}$$

The Mahalanobis distance between $x_2$ and $\hat\mu$ is

$$(x_2 - \hat\mu)^T\widehat\Sigma^{-1}(x_2 - \hat\mu) = \begin{pmatrix} 1.057 & 0.046 \end{pmatrix}\begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}^{-1}\begin{pmatrix} 1.057 \\ 0.046 \end{pmatrix} = \begin{pmatrix} 1.057 & 0.046 \end{pmatrix}\begin{pmatrix} 1.486 & 0.31 \\ 0.31 & 5.42 \end{pmatrix}\begin{pmatrix} 1.057 \\ 0.046 \end{pmatrix} = 1.701$$

whereas the squared Euclidean distance between them is

$$\|x_2 - \hat\mu\|^2 = \begin{pmatrix} 1.057 & 0.046 \end{pmatrix}\begin{pmatrix} 1.057 \\ 0.046 \end{pmatrix} = 1.119$$

The eigenvalues and the corresponding eigenvectors of $\widehat\Sigma$ are as follows:

$$\lambda_1 = 0.684 \qquad u_1 = (-0.997, 0.078)^T$$
$$\lambda_2 = 0.184 \qquad u_2 = (-0.078, -0.997)^T$$

These two eigenvectors define the new axes in which the covariance matrix is given as

$$\Lambda = \begin{pmatrix} 0.684 & 0 \\ 0 & 0.184 \end{pmatrix}$$

The angle between the original axis $e_1 = (1, 0)^T$ and $u_1$ specifies the rotation angle for the multivariate normal:

$$\cos\theta = e_1^T u_1 = -0.997 \qquad \theta = \cos^{-1}(-0.997) = 175.5^\circ$$

Figure 2.7 illustrates the new coordinate axes and the new variances. We can see that in the original axes, the contours are only slightly rotated, by angle $175.5^\circ$ (or $-4.5^\circ$).
2.6 FURTHER READING

There are several good textbooks that cover the topics discussed in this chapter in more depth; see Evans and Rosenthal (2011), Wasserman (2004), and Rencher and Christensen (2012).

Evans, M. and Rosenthal, J. (2011). Probability and Statistics: The Science of Uncertainty, 2nd ed. New York: W. H. Freeman.
Rencher, A. C. and Christensen, W. F. (2012). Methods of Multivariate Analysis, 3rd ed. Hoboken, NJ: John Wiley & Sons.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer Science+Business Media.
2.7 EXERCISES

Q1. True or False:
(a) Mean is robust against outliers.
(b) Median is robust against outliers.
(c) Standard deviation is robust against outliers.

Q2. Let $X$ and $Y$ be two random variables, denoting age and weight, respectively. Consider a random sample of size $n = 20$ from these two variables

$X = (69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)$
$Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140, 150, 165, 185, 210, 220)$

(a) Find the mean, median, and mode for $X$.
(b) What is the variance for $Y$?
(c) Plot the normal distribution for $X$.
(d) What is the probability of observing an age of 80 or higher?
(e) Find the 2-dimensional mean $\hat\mu$ and the covariance matrix $\widehat\Sigma$ for these two variables.
(f) What is the correlation between age and weight?
(g) Draw a scatterplot to show the relationship between age and weight.

Q3. Show that the identity in Eq. (2.15) holds, that is,

$$\sum_{i=1}^n (x_i - \mu)^2 = n(\hat\mu - \mu)^2 + \sum_{i=1}^n (x_i - \hat\mu)^2$$
Q4. Prove that if $x_i$ are independent random variables, then

$$\mathrm{var}\left(\sum_{i=1}^n x_i\right) = \sum_{i=1}^n \mathrm{var}(x_i)$$

This fact was used in Eq. (2.12).

Q5. Define a measure of deviation called mean absolute deviation for a random variable $X$ as follows:

$$\frac{1}{n}\sum_{i=1}^n |x_i - \mu|$$

Is this measure robust? Why or why not?

Q6. Prove that the expected value of a vector random variable $X = (X_1, X_2)^T$ is simply the vector of the expected values of the individual random variables $X_1$ and $X_2$, as given in Eq. (2.18).

Q7. Show that the correlation [Eq. (2.23)] between any two random variables $X_1$ and $X_2$ lies in the range $[-1, 1]$.

Q8. Given the dataset in Table 2.2, compute the covariance matrix and the generalized variance.

Table 2.2. Dataset for Q8

       X1   X2   X3
 x1    17   17   12
 x2    11    9   13
 x3    11    8   19

Q9. Show that the outer-product expression in Eq. (2.31) for the sample covariance matrix is equivalent to Eq. (2.29).

Q10. Assume that we are given two univariate normal distributions, $N_A$ and $N_B$, and let their mean and standard deviation be as follows: $\mu_A = 4$, $\sigma_A = 1$ and $\mu_B = 8$, $\sigma_B = 2$.
(a) For each of the following values $x_i \in \{5, 6, 7\}$ find out which is the more likely normal distribution to have produced it.
(b) Derive an expression for the point for which the probability of having been produced by both the normals is the same.

Q11. Consider Table 2.3. Assume that both the attributes $X$ and $Y$ are numeric, and the table represents the entire population. If we know that the correlation between $X$ and $Y$ is zero, what can you infer about the values of $Y$?

Table 2.3. Dataset for Q11

 X   Y
 1   a
 0   b
 1   c
 0   a
 0   c

Q12. Under what conditions will the covariance matrix $\Sigma$ be identical to the correlation matrix, whose $(i,j)$ entry gives the correlation between attributes $X_i$ and $X_j$? What can you conclude about the two variables?
CHAPTER 3
Categorical Attributes

In this chapter we present methods to analyze categorical attributes. Because categorical attributes have only symbolic values, many of the arithmetic operations cannot be performed directly on the symbolic values. However, we can compute the frequencies of these values and use them to analyze the attributes.

3.1 UNIVARIATE ANALYSIS

We assume that the data consists of values for a single categorical attribute, $X$. Let the domain of $X$ consist of $m$ symbolic values $dom(X) = \{a_1, a_2, \ldots, a_m\}$. The data $D$ is thus an $n\times 1$ symbolic data matrix given as

$$D = \begin{pmatrix} X \\ \hline x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

where each point $x_i \in dom(X)$.
3.1.1 Bernoulli Variable

Let us first consider the case when the categorical attribute $X$ has domain $\{a_1, a_2\}$, with $m = 2$. We can model $X$ as a Bernoulli random variable, which takes on two distinct values, 1 and 0, according to the mapping

$$X(v) = \begin{cases} 1 & \text{if } v = a_1 \\ 0 & \text{if } v = a_2 \end{cases}$$

The probability mass function (PMF) of $X$ is given as

$$P(X = x) = f(x) = \begin{cases} p_1 & \text{if } x = 1 \\ p_0 & \text{if } x = 0 \end{cases}$$

where $p_1$ and $p_0$ are the parameters of the distribution, which must satisfy the condition

$$p_1 + p_0 = 1$$

Because there is only one free parameter, it is customary to denote $p_1 = p$, from which it follows that $p_0 = 1-p$. The PMF of Bernoulli random variable $X$ can then be written compactly as

$$P(X = x) = f(x) = p^x(1-p)^{1-x}$$

We can see that $P(X = 1) = p^1(1-p)^0 = p$ and $P(X = 0) = p^0(1-p)^1 = 1-p$, as desired.
Mean and Variance
The expected value of $X$ is given as

$$\mu = E[X] = 1\cdot p + 0\cdot(1-p) = p$$

and the variance of $X$ is given as

$$\sigma^2 = \mathrm{var}(X) = E[X^2] - (E[X])^2 = (1^2\cdot p + 0^2\cdot(1-p)) - p^2 = p - p^2 = p(1-p) \qquad (3.1)$$
Sample Mean and Variance
To estimate the parameters of the Bernoulli variable $X$, we assume that each symbolic point has been mapped to its binary value. Thus, the set $\{x_1, x_2, \ldots, x_n\}$ is assumed to be a random sample drawn from $X$ (i.e., each $x_i$ is IID with $X$).

The sample mean is given as

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i = \frac{n_1}{n} = \hat p \qquad (3.2)$$

where $n_1$ is the number of points with $x_i = 1$ in the random sample (equal to the number of occurrences of symbol $a_1$).

Let $n_0 = n - n_1$ denote the number of points with $x_i = 0$ in the random sample. The sample variance is given as

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2 = \frac{n_1}{n}(1-\hat p)^2 + \frac{n-n_1}{n}(-\hat p)^2 = \hat p(1-\hat p)^2 + (1-\hat p)\hat p^2 = \hat p(1-\hat p)(1-\hat p+\hat p) = \hat p(1-\hat p)$$

The sample variance could also have been obtained directly from Eq. (3.1), by substituting $\hat p$ for $p$.
Example 3.1. Consider the sepal length attribute ($X_1$) for the Iris dataset in Table 1.1. Let us define an Iris flower as Long if its sepal length is in the range $[7, \infty)$, and Short if its sepal length is in the range $(-\infty, 7)$. Then $X_1$ can be treated as a categorical attribute with domain $\{Long, Short\}$. From the observed sample of size $n = 150$, we find 13 long Irises. The sample mean of $X_1$ is

$$\hat\mu = \hat p = 13/150 = 0.087$$

and its variance is

$$\hat\sigma^2 = \hat p(1-\hat p) = 0.087(1-0.087) = 0.087\cdot 0.913 = 0.079$$
Binomial Distribution: Number of Occurrences
Given the Bernoulli variable $X$, let $\{x_1, x_2, \ldots, x_n\}$ denote a random sample of size $n$ drawn from $X$. Let $N$ be the random variable denoting the number of occurrences of the symbol $a_1$ (value $X = 1$) in the sample. $N$ has a binomial distribution, given as

$$f(N = n_1 \mid n, p) = \binom{n}{n_1}p^{n_1}(1-p)^{n-n_1} \qquad (3.3)$$

In fact, $N$ is the sum of the $n$ independent Bernoulli random variables $x_i$ IID with $X$, that is, $N = \sum_{i=1}^n x_i$. By linearity of expectation, the mean or expected number of occurrences of symbol $a_1$ is given as

$$\mu_N = E[N] = E\left[\sum_{i=1}^n x_i\right] = \sum_{i=1}^n E[x_i] = \sum_{i=1}^n p = np$$

Because the $x_i$ are all independent, the variance of $N$ is given as

$$\sigma_N^2 = \mathrm{var}(N) = \sum_{i=1}^n \mathrm{var}(x_i) = \sum_{i=1}^n p(1-p) = np(1-p)$$
Example 3.2. Continuing with Example 3.1, we can use the estimated parameter $\hat p = 0.087$ to compute the expected number of occurrences $N$ of Long sepal length Irises via the binomial distribution:

$$E[N] = n\hat p = 150\cdot 0.087 = 13$$

In this case, because $p$ is estimated from the sample via $\hat p$, it is not surprising that the expected number of occurrences of long Irises coincides with the actual occurrences. However, what is more interesting is that we can compute the variance in the number of occurrences:

$$\mathrm{var}(N) = n\hat p(1-\hat p) = 150\cdot 0.079 = 11.9$$

As the sample size increases, the binomial distribution given in Eq. (3.3) tends to a normal distribution with $\mu = 13$ and $\sigma = \sqrt{11.9} = 3.45$ for our example. Thus, with confidence greater than 95% we can claim that the number of occurrences of $a_1$ will lie in the range $\mu \pm 2\sigma = [6.1, 19.9]$, which follows from the fact that for a normal distribution 95.45% of the probability mass lies within two standard deviations from the mean (see Section 2.5.1).
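The binomial mean and variance can be confirmed by simulating the count $N$; a minimal sketch with the estimates from this example:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_hat = 150, 0.087                           # estimates from Example 3.2

counts = rng.binomial(n, p_hat, size=100_000)   # simulate the count N many times
print(counts.mean(), counts.var())              # close to n*p = 13.05 and n*p*(1-p) = 11.9
```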
3.1.2 Multivariate Bernoulli Variable

We now consider the general case when $X$ is a categorical attribute with domain $\{a_1, a_2, \ldots, a_m\}$. We can model $X$ as an $m$-dimensional Bernoulli random variable $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where each $A_i$ is a Bernoulli variable with parameter $p_i$ denoting the probability of observing symbol $a_i$. However, because $X$ can assume only one of the symbolic values at any one time, if $X = a_i$, then $A_i = 1$, and $A_j = 0$ for all $j \ne i$. The range of the random variable $\mathbf{X}$ is thus the set $\{0, 1\}^m$, with the further restriction that if $X = a_i$, then $\mathbf{X} = e_i$, where $e_i$ is the $i$th standard basis vector $e_i \in \mathbb{R}^m$ given as

$$e_i = (\underbrace{0, \ldots, 0}_{i-1},\; 1,\; \underbrace{0, \ldots, 0}_{m-i})^T$$

In $e_i$, only the $i$th element is 1 ($e_{ii} = 1$), whereas all other elements are zero ($e_{ij} = 0,\; \forall j \ne i$).
This is precisely the definition of a multivariate Bernoulli variable, which is a generalization of a Bernoulli variable from two outcomes to $m$ outcomes. We thus model the categorical attribute $X$ as a multivariate Bernoulli variable $\mathbf{X}$ defined as

$$\mathbf{X}(v) = e_i \text{ if } v = a_i$$

The range of $\mathbf{X}$ consists of $m$ distinct vector values $\{e_1, e_2, \ldots, e_m\}$, with the PMF of $\mathbf{X}$ given as

$$P(\mathbf{X} = e_i) = f(e_i) = p_i$$

where $p_i$ is the probability of observing value $a_i$. These parameters must satisfy the condition

$$\sum_{i=1}^m p_i = 1$$

The PMF can be written compactly as follows:

$$P(\mathbf{X} = e_i) = f(e_i) = \prod_{j=1}^m p_j^{e_{ij}} \qquad (3.4)$$

Because $e_{ii} = 1$, and $e_{ij} = 0$ for $j \ne i$, we can see that, as expected, we have

$$f(e_i) = \prod_{j=1}^m p_j^{e_{ij}} = p_1^{e_{i1}}\times\cdots\times p_i^{e_{ii}}\times\cdots\times p_m^{e_{im}} = p_1^0\times\cdots\times p_i^1\times\cdots\times p_m^0 = p_i$$
Table 3.1. Discretized sepal length attribute

 Bins          Domain             Counts
 [4.3, 5.2]    Very Short (a1)    n1 = 45
 (5.2, 6.1]    Short (a2)         n2 = 50
 (6.1, 7.0]    Long (a3)          n3 = 43
 (7.0, 7.9]    Very Long (a4)     n4 = 12
Example 3.3. Let us consider the sepal length attribute ($X_1$) for the Iris dataset shown in Table 1.2. We divide the sepal length into four equal-width intervals, and give each interval a name as shown in Table 3.1. We consider $X_1$ as a categorical attribute with domain

$$\{a_1 = \text{VeryShort},\; a_2 = \text{Short},\; a_3 = \text{Long},\; a_4 = \text{VeryLong}\}$$

We model the categorical attribute $X_1$ as a multivariate Bernoulli variable $\mathbf{X}$, defined as

$$\mathbf{X}(v) = \begin{cases} e_1 = (1,0,0,0) & \text{if } v = a_1 \\ e_2 = (0,1,0,0) & \text{if } v = a_2 \\ e_3 = (0,0,1,0) & \text{if } v = a_3 \\ e_4 = (0,0,0,1) & \text{if } v = a_4 \end{cases}$$

For example, the symbolic point $x_1 = \text{Short} = a_2$ is represented as the vector $(0,1,0,0)^T = e_2$.
Mean
The mean or expected value of $\mathbf{X}$ can be obtained as

$$\mu = E[\mathbf{X}] = \sum_{i=1}^m e_i f(e_i) = \sum_{i=1}^m e_i p_i = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}p_1 + \cdots + \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}p_m = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} = \mathbf{p} \qquad (3.5)$$
Sample Mean
Assume that each symbolic point $x_i \in D$ is mapped to the variable $\mathbf{x}_i = \mathbf{X}(x_i)$. The mapped dataset $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ is then assumed to be a random sample IID with $\mathbf{X}$. We can compute the sample mean by placing a probability mass of $\frac{1}{n}$ at each point

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i = \sum_{i=1}^m \frac{n_i}{n}e_i = \begin{pmatrix} n_1/n \\ n_2/n \\ \vdots \\ n_m/n \end{pmatrix} = \begin{pmatrix} \hat p_1 \\ \hat p_2 \\ \vdots \\ \hat p_m \end{pmatrix} = \hat{\mathbf{p}} \qquad (3.6)$$

where $n_i$ is the number of occurrences of the vector value $e_i$ in the sample, which is equivalent to the number of occurrences of the symbol $a_i$. Furthermore, we have $\sum_{i=1}^m n_i = n$, which follows from the fact that $\mathbf{X}$ can take on only $m$ distinct values $e_i$, and the counts for each value must add up to the sample size $n$.

[Figure 3.1. Probability mass function: sepal length. Bars at $e_1$ (Very Short, 0.3), $e_2$ (Short, 0.333), $e_3$ (Long, 0.287), and $e_4$ (Very Long, 0.08).]
Example 3.4 (Sample Mean). Consider the observed counts $n_i$ for each of the values $a_i$ ($e_i$) of the discretized sepal length attribute, shown in Table 3.1. Because the total sample size is $n = 150$, from these we can obtain the estimates $\hat p_i$ as follows:

$$\hat p_1 = 45/150 = 0.3 \qquad \hat p_2 = 50/150 = 0.333 \qquad \hat p_3 = 43/150 = 0.287 \qquad \hat p_4 = 12/150 = 0.08$$

The PMF for $\mathbf{X}$ is plotted in Figure 3.1, and the sample mean for $\mathbf{X}$ is given as

$$\hat\mu = \hat{\mathbf{p}} = (0.3, 0.333, 0.287, 0.08)^T$$
Covariance Matrix
Recall that an $m$-dimensional multivariate Bernoulli variable is simply a vector of $m$ Bernoulli variables. For instance, $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, where $A_i$ is the Bernoulli variable corresponding to symbol $a_i$. The variance–covariance information between the constituent Bernoulli variables yields a covariance matrix for $\mathbf{X}$.
Let us first consider the variance along each Bernoulli variable $A_i$. By Eq. (3.1), we immediately have

$$\sigma_i^2 = \mathrm{var}(A_i) = p_i(1-p_i)$$

Next consider the covariance between $A_i$ and $A_j$. Utilizing the identity in Eq. (2.21), we have

$$\sigma_{ij} = E[A_iA_j] - E[A_i]\cdot E[A_j] = 0 - p_ip_j = -p_ip_j$$

which follows from the fact that $E[A_iA_j] = 0$, as $A_i$ and $A_j$ cannot both be 1 at the same time, and thus their product $A_iA_j = 0$. This same fact leads to the negative relationship between $A_i$ and $A_j$. What is interesting is that the degree of negative association is proportional to the product of the mean values for $A_i$ and $A_j$.

From the preceding expressions for variance and covariance, the $m\times m$ covariance matrix for $\mathbf{X}$ is given as

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix} = \begin{pmatrix} p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_m \\ -p_1p_2 & p_2(1-p_2) & \cdots & -p_2p_m \\ \vdots & \vdots & \ddots & \vdots \\ -p_1p_m & -p_2p_m & \cdots & p_m(1-p_m) \end{pmatrix}$$

Notice how each row in $\Sigma$ sums to zero. For example, for row $i$, we have

$$-p_ip_1 - p_ip_2 - \cdots + p_i(1-p_i) - \cdots - p_ip_m = p_i - p_i\sum_{j=1}^m p_j = p_i - p_i = 0 \qquad (3.7)$$

Because $\Sigma$ is symmetric, it follows that each column also sums to zero.

Define $P$ as the $m\times m$ diagonal matrix:

$$P = \mathrm{diag}(\mathbf{p}) = \mathrm{diag}(p_1, p_2, \ldots, p_m) = \begin{pmatrix} p_1 & 0 & \cdots & 0 \\ 0 & p_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_m \end{pmatrix}$$

We can compactly write the covariance matrix of $\mathbf{X}$ as

$$\Sigma = P - \mathbf{p}\cdot\mathbf{p}^T \qquad (3.8)$$
Sample Covariance Matrix
The sample covariance matrix can be obtained from Eq. (3.8) in a straightforward manner:

$$\widehat\Sigma = \widehat P - \hat{\mathbf{p}}\cdot\hat{\mathbf{p}}^T \qquad (3.9)$$

where $\widehat P = \mathrm{diag}(\hat{\mathbf{p}})$, and $\hat{\mathbf{p}} = \hat\mu = (\hat p_1, \hat p_2, \ldots, \hat p_m)^T$ denotes the empirical probability mass function for $\mathbf{X}$.
Example 3.5. Returning to the discretized sepal length attribute in Example 3.4, we have $\hat\mu = \hat{\mathbf{p}} = (0.3, 0.333, 0.287, 0.08)^T$. The sample covariance matrix is given as

$$\widehat\Sigma = \widehat P - \hat{\mathbf{p}}\cdot\hat{\mathbf{p}}^T = \begin{pmatrix} 0.3 & 0 & 0 & 0 \\ 0 & 0.333 & 0 & 0 \\ 0 & 0 & 0.287 & 0 \\ 0 & 0 & 0 & 0.08 \end{pmatrix} - \begin{pmatrix} 0.09 & 0.1 & 0.086 & 0.024 \\ 0.1 & 0.111 & 0.096 & 0.027 \\ 0.086 & 0.096 & 0.082 & 0.023 \\ 0.024 & 0.027 & 0.023 & 0.006 \end{pmatrix} = \begin{pmatrix} 0.21 & -0.1 & -0.086 & -0.024 \\ -0.1 & 0.222 & -0.096 & -0.027 \\ -0.086 & -0.096 & 0.204 & -0.023 \\ -0.024 & -0.027 & -0.023 & 0.074 \end{pmatrix}$$

One can verify that each row (and column) in $\widehat\Sigma$ sums to zero.
It is worth emphasizing that whereas the modeling of categorical attribute $X$ as a multivariate Bernoulli variable, $\mathbf{X} = (A_1, A_2, \ldots, A_m)^T$, makes the structure of the mean and covariance matrix explicit, the same results would be obtained if we simply treat the mapped values $\mathbf{X}(x_i)$ as a new $n\times m$ binary data matrix, and apply the standard definitions of the mean and covariance matrix from multivariate numeric attribute analysis (see Section 2.3). In essence, the mapping from symbols $a_i$ to binary vectors $e_i$ is the key idea in categorical attribute analysis.
Example 3.6. Consider the sample $D$ of size $n = 5$ for the sepal length attribute $X_1$ in the Iris dataset, shown in Table 3.2a. As in Example 3.1, we assume that $X_1$ has only two categorical values $\{Long, Short\}$. We model $X_1$ as the multivariate Bernoulli variable $\mathbf{X}_1$ defined as

$$\mathbf{X}_1(v) = \begin{cases} e_1 = (1, 0)^T & \text{if } v = \text{Long } (a_1) \\ e_2 = (0, 1)^T & \text{if } v = \text{Short } (a_2) \end{cases}$$

The sample mean [Eq. (3.6)] is

$$\hat\mu = \hat{\mathbf{p}} = (2/5, 3/5)^T = (0.4, 0.6)^T$$

and the sample covariance matrix [Eq. (3.9)] is

$$\widehat\Sigma = \widehat P - \hat{\mathbf{p}}\hat{\mathbf{p}}^T = \begin{pmatrix} 0.4 & 0 \\ 0 & 0.6 \end{pmatrix} - \begin{pmatrix} 0.16 & 0.24 \\ 0.24 & 0.36 \end{pmatrix} = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}$$
Multinomial Distribution: Number of Occurrences
Generalizing the binomial case, let $\mathbf{N} = (N_1, N_2, \ldots, N_m)^T$ denote the vector random variable recording the number of occurrences $N_i$ of each symbol $a_i$ in a random sample of size $n$ drawn from $\mathbf{X}$; then $\mathbf{N}$ has a multinomial distribution, given as

$$f\big(\mathbf{N} = (n_1, n_2, \ldots, n_m) \mid \mathbf{p}\big) = \binom{n}{n_1 n_2 \ldots n_m}\prod_{i=1}^m p_i^{n_i}$$

We can see that this is a direct generalization of the binomial distribution in Eq. (3.3). The term

$$\binom{n}{n_1 n_2 \ldots n_m} = \frac{n!}{n_1!\,n_2!\cdots n_m!}$$

denotes the number of ways of choosing $n_i$ occurrences of each symbol $a_i$ from a sample of size $n$, with $\sum_{i=1}^m n_i = n$.

The mean and covariance matrix of $\mathbf{N}$ are given as $n$ times the mean and covariance matrix of $\mathbf{X}$. That is, the mean of $\mathbf{N}$ is given as

$$\mu_N = E[\mathbf{N}] = nE[\mathbf{X}] = n\cdot\mu = n\cdot\mathbf{p} = \begin{pmatrix} np_1 \\ \vdots \\ np_m \end{pmatrix}$$

and its covariance matrix is given as

$$\Sigma_N = n\cdot(P - \mathbf{p}\mathbf{p}^T) = \begin{pmatrix} np_1(1-p_1) & -np_1p_2 & \cdots & -np_1p_m \\ -np_1p_2 & np_2(1-p_2) & \cdots & -np_2p_m \\ \vdots & \vdots & \ddots & \vdots \\ -np_1p_m & -np_2p_m & \cdots & np_m(1-p_m) \end{pmatrix}$$

Likewise the sample mean and covariance matrix for $\mathbf{N}$ are given as

$$\hat\mu_N = n\hat{\mathbf{p}} \qquad \widehat\Sigma_N = n\left(\widehat P - \hat{\mathbf{p}}\hat{\mathbf{p}}^T\right)$$
3.2 BIVARIATE ANALYSIS

Assume that the data comprises two categorical attributes, $X_1$ and $X_2$, with

$$dom(X_1) = \{a_{11}, a_{12}, \ldots, a_{1m_1}\} \qquad dom(X_2) = \{a_{21}, a_{22}, \ldots, a_{2m_2}\}$$

We are given $n$ categorical points of the form $x_i = (x_{i1}, x_{i2})^T$ with $x_{i1} \in dom(X_1)$ and $x_{i2} \in dom(X_2)$. The dataset is thus an $n\times 2$ symbolic data matrix:

$$D = \begin{pmatrix} X_1 & X_2 \\ \hline x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}$$

We can model $X_1$ and $X_2$ as multivariate Bernoulli variables $\mathbf{X}_1$ and $\mathbf{X}_2$ with dimensions $m_1$ and $m_2$, respectively. The probability mass functions for $\mathbf{X}_1$ and $\mathbf{X}_2$ are given according to Eq. (3.4):

$$P(\mathbf{X}_1 = e^1_i) = f_1(e^1_i) = p^1_i = \prod_{k=1}^{m_1}(p^1_k)^{e^1_{ik}}$$
$$P(\mathbf{X}_2 = e^2_j) = f_2(e^2_j) = p^2_j = \prod_{k=1}^{m_2}(p^2_k)^{e^2_{jk}}$$

where $e^1_i$ is the $i$th standard basis vector in $\mathbb{R}^{m_1}$ (for attribute $X_1$) whose $k$th component is $e^1_{ik}$, and $e^2_j$ is the $j$th standard basis vector in $\mathbb{R}^{m_2}$ (for attribute $X_2$) whose $k$th component is $e^2_{jk}$. Further, the parameter $p^1_i$ denotes the probability of observing symbol $a_{1i}$, and $p^2_j$ denotes the probability of observing symbol $a_{2j}$. Together they must satisfy the conditions $\sum_{i=1}^{m_1} p^1_i = 1$ and $\sum_{j=1}^{m_2} p^2_j = 1$.
The joint distribution of $\mathbf{X}_1$ and $\mathbf{X}_2$ is modeled as the $d' = m_1 + m_2$ dimensional vector variable $\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}$, specified by the mapping

$$\mathbf{X}\big((v_1, v_2)^T\big) = \begin{pmatrix} \mathbf{X}_1(v_1) \\ \mathbf{X}_2(v_2) \end{pmatrix} = \begin{pmatrix} e^1_i \\ e^2_j \end{pmatrix}$$

provided that $v_1 = a_{1i}$ and $v_2 = a_{2j}$. The range of $\mathbf{X}$ thus consists of $m_1\times m_2$ distinct pairs of vector values $(e^1_i, e^2_j)^T$, with $1\le i\le m_1$ and $1\le j\le m_2$. The joint PMF of $\mathbf{X}$ is given as

$$P\big(\mathbf{X} = (e^1_i, e^2_j)^T\big) = f(e^1_i, e^2_j) = p_{ij} = \prod_{r=1}^{m_1}\prod_{s=1}^{m_2} p_{rs}^{\,e^1_{ir}\cdot e^2_{js}}$$

where $p_{ij}$ is the probability of observing the symbol pair $(a_{1i}, a_{2j})$. These probability parameters must satisfy the condition $\sum_{i=1}^{m_1}\sum_{j=1}^{m_2} p_{ij} = 1$. The joint PMF for $\mathbf{X}$ can be expressed as the $m_1\times m_2$ matrix

$$P_{12} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m_2} \\ p_{21} & p_{22} & \cdots & p_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m_11} & p_{m_12} & \cdots & p_{m_1m_2} \end{pmatrix} \qquad (3.10)$$
Example 3.7. Consider the discretized sepal length attribute ($X_1$) in Table 3.1. We also discretize the sepal width attribute ($X_2$) into three values as shown in Table 3.3. We thus have

$$dom(X_1) = \{a_{11} = \text{VeryShort},\; a_{12} = \text{Short},\; a_{13} = \text{Long},\; a_{14} = \text{VeryLong}\}$$
$$dom(X_2) = \{a_{21} = \text{Short},\; a_{22} = \text{Medium},\; a_{23} = \text{Long}\}$$

The symbolic point $\mathbf{x} = (\text{Short}, \text{Long}) = (a_{12}, a_{23})$ is mapped to the vector

$$\mathbf{X}(\mathbf{x}) = \begin{pmatrix} e^1_2 \\ e^2_3 \end{pmatrix} = (0, 1, 0, 0 \mid 0, 0, 1)^T \in \mathbb{R}^7$$
Table 3.3. Discretized sepal width attribute

 Bins          Domain         Counts
 [2.0, 2.8]    Short (a1)     47
 (2.8, 3.6]    Medium (a2)    88
 (3.6, 4.4]    Long (a3)      15

where we use $\mid$ to demarcate the two subvectors $e^1_2 = (0,1,0,0)^T \in \mathbb{R}^4$ and $e^2_3 = (0,0,1)^T \in \mathbb{R}^3$, corresponding to symbolic attributes sepal length and sepal width, respectively. Note that $e^1_2$ is the second standard basis vector in $\mathbb{R}^4$ for $\mathbf{X}_1$, and $e^2_3$ is the third standard basis vector in $\mathbb{R}^3$ for $\mathbf{X}_2$.
Mean
The bivariate mean can easily be generalized from Eq. (3.5), as follows:

$$\mu = E[\mathbf{X}] = E\begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} = \begin{pmatrix} E[\mathbf{X}_1] \\ E[\mathbf{X}_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} \mathbf{p}^1 \\ \mathbf{p}^2 \end{pmatrix}$$

where $\mu_1 = \mathbf{p}^1 = (p^1_1, \ldots, p^1_{m_1})^T$ and $\mu_2 = \mathbf{p}^2 = (p^2_1, \ldots, p^2_{m_2})^T$ are the mean vectors for $\mathbf{X}_1$ and $\mathbf{X}_2$. The vectors $\mathbf{p}^1$ and $\mathbf{p}^2$ also represent the probability mass functions for $\mathbf{X}_1$ and $\mathbf{X}_2$, respectively.
Sample Mean
The sample mean can also be generalized from Eq. (3.6), by placing a probability mass of $\frac{1}{n}$ at each point:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i = \frac{1}{n}\begin{pmatrix} \sum_{i=1}^{m_1} n^1_i e^1_i \\ \sum_{j=1}^{m_2} n^2_j e^2_j \end{pmatrix} = \frac{1}{n}\begin{pmatrix} n^1_1 \\ \vdots \\ n^1_{m_1} \\ n^2_1 \\ \vdots \\ n^2_{m_2} \end{pmatrix} = \begin{pmatrix} \hat p^1_1 \\ \vdots \\ \hat p^1_{m_1} \\ \hat p^2_1 \\ \vdots \\ \hat p^2_{m_2} \end{pmatrix} = \begin{pmatrix} \hat{\mathbf{p}}^1 \\ \hat{\mathbf{p}}^2 \end{pmatrix} = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix}$$

where $n^i_j$ is the observed frequency of symbol $a_{ij}$ in the sample of size $n$, and $\hat\mu_i = \hat{\mathbf{p}}^i = (\hat p^i_1, \hat p^i_2, \ldots, \hat p^i_{m_i})^T$ is the sample mean vector for $\mathbf{X}_i$, which is also the empirical PMF for attribute $X_i$.
Covariance Matrix
The covariance matrix for $\mathbf{X}$ is the $d'\times d' = (m_1+m_2)\times(m_1+m_2)$ matrix given as

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_{22} \end{pmatrix} \qquad (3.11)$$

where $\Sigma_{11}$ is the $m_1\times m_1$ covariance matrix for $\mathbf{X}_1$, and $\Sigma_{22}$ is the $m_2\times m_2$ covariance matrix for $\mathbf{X}_2$, which can be computed using Eq. (3.8). That is,

$$\Sigma_{11} = P_1 - \mathbf{p}^1(\mathbf{p}^1)^T \qquad \Sigma_{22} = P_2 - \mathbf{p}^2(\mathbf{p}^2)^T$$

where $P_1 = \mathrm{diag}(\mathbf{p}^1)$ and $P_2 = \mathrm{diag}(\mathbf{p}^2)$. Further, $\Sigma_{12}$ is the $m_1\times m_2$ covariance matrix between variables $\mathbf{X}_1$ and $\mathbf{X}_2$, given as

$$\Sigma_{12} = E[(\mathbf{X}_1 - \mu_1)(\mathbf{X}_2 - \mu_2)^T] = E[\mathbf{X}_1\mathbf{X}_2^T] - E[\mathbf{X}_1]E[\mathbf{X}_2]^T = P_{12} - \mu_1\mu_2^T = P_{12} - \mathbf{p}^1(\mathbf{p}^2)^T$$
$$= \begin{pmatrix} p_{11} - p^1_1p^2_1 & p_{12} - p^1_1p^2_2 & \cdots & p_{1m_2} - p^1_1p^2_{m_2} \\ p_{21} - p^1_2p^2_1 & p_{22} - p^1_2p^2_2 & \cdots & p_{2m_2} - p^1_2p^2_{m_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m_11} - p^1_{m_1}p^2_1 & p_{m_12} - p^1_{m_1}p^2_2 & \cdots & p_{m_1m_2} - p^1_{m_1}p^2_{m_2} \end{pmatrix}$$

where $P_{12}$ represents the joint PMF for $\mathbf{X}$ given in Eq. (3.10).
Incidentally, each row and each column of $\Sigma_{12}$ sums to zero. For example, consider row $i$ and column $j$:

$$\sum_{k=1}^{m_2}(p_{ik} - p^1_ip^2_k) = \left(\sum_{k=1}^{m_2} p_{ik}\right) - p^1_i = p^1_i - p^1_i = 0$$
$$\sum_{k=1}^{m_1}(p_{kj} - p^1_kp^2_j) = \left(\sum_{k=1}^{m_1} p_{kj}\right) - p^2_j = p^2_j - p^2_j = 0$$

which follows from the fact that summing the joint mass function over all values of $\mathbf{X}_2$ yields the marginal distribution of $\mathbf{X}_1$, and summing it over all values of $\mathbf{X}_1$ yields the marginal distribution for $\mathbf{X}_2$. Note that $p^2_j$ is the probability of observing symbol $a_{2j}$; it should not be confused with the square of $p_j$. Combined with the fact that $\Sigma_{11}$ and $\Sigma_{22}$ also have row and column sums equal to zero via Eq. (3.7), the full covariance matrix $\Sigma$ has rows and columns that sum up to zero.
Sample Covariance Matrix
The sample covariance matrix is given as

$$\widehat\Sigma = \begin{pmatrix} \widehat\Sigma_{11} & \widehat\Sigma_{12} \\ \widehat\Sigma_{12}^T & \widehat\Sigma_{22} \end{pmatrix} \qquad (3.12)$$

where

$$\widehat\Sigma_{11} = \widehat P_1 - \hat{\mathbf{p}}^1(\hat{\mathbf{p}}^1)^T \qquad \widehat\Sigma_{22} = \widehat P_2 - \hat{\mathbf{p}}^2(\hat{\mathbf{p}}^2)^T \qquad \widehat\Sigma_{12} = \widehat P_{12} - \hat{\mathbf{p}}^1(\hat{\mathbf{p}}^2)^T$$

Here $\widehat P_1 = \mathrm{diag}(\hat{\mathbf{p}}^1)$ and $\widehat P_2 = \mathrm{diag}(\hat{\mathbf{p}}^2)$, and $\hat{\mathbf{p}}^1$ and $\hat{\mathbf{p}}^2$ specify the empirical probability mass functions for $\mathbf{X}_1$ and $\mathbf{X}_2$, respectively. Further, $\widehat P_{12}$ specifies the empirical joint PMF for $\mathbf{X}_1$ and $\mathbf{X}_2$, given as

$$\widehat P_{12}(i,j) = \hat f(e^1_i, e^2_j) = \frac{1}{n}\sum_{k=1}^n I_{ij}(\mathbf{x}_k) = \frac{n_{ij}}{n} = \hat p_{ij} \qquad (3.13)$$
where $I_{ij}$ is the indicator variable

$$I_{ij}(\mathbf{x}_k) = \begin{cases} 1 & \text{if } \mathbf{x}_{k1} = e^1_i \text{ and } \mathbf{x}_{k2} = e^2_j \\ 0 & \text{otherwise} \end{cases}$$

Taking the sum of $I_{ij}(\mathbf{x}_k)$ over all the $n$ points in the sample yields the number of occurrences, $n_{ij}$, of the symbol pair $(a_{1i}, a_{2j})$ in the sample. One issue with the cross-attribute covariance matrix $\widehat\Sigma_{12}$ is the need to estimate a quadratic number of parameters. That is, we need to obtain reliable counts $n_{ij}$ to estimate the parameters $p_{ij}$, for a total of $O(m_1\times m_2)$ parameters that have to be estimated, which can be a problem if the categorical attributes have many symbols. On the other hand, estimating $\widehat\Sigma_{11}$ and $\widehat\Sigma_{22}$ requires that we estimate $m_1$ and $m_2$ parameters, corresponding to $p^1_i$ and $p^2_j$, respectively. In total, computing $\widehat\Sigma$ requires the estimation of $m_1m_2 + m_1 + m_2$ parameters.
Example 3.8. We continue with the bivariate categorical attributes $X_1$ and $X_2$ in Example 3.7. From Example 3.4, and from the occurrence counts for each of the values of sepal width in Table 3.3, we have

$$\hat\mu_1 = \hat{\mathbf{p}}^1 = \begin{pmatrix} 0.3 \\ 0.333 \\ 0.287 \\ 0.08 \end{pmatrix} \qquad \hat\mu_2 = \hat{\mathbf{p}}^2 = \frac{1}{150}\begin{pmatrix} 47 \\ 88 \\ 15 \end{pmatrix} = \begin{pmatrix} 0.313 \\ 0.587 \\ 0.1 \end{pmatrix}$$

Thus, the mean for $\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}$ is given as

$$\hat\mu = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix} = \begin{pmatrix} \hat{\mathbf{p}}^1 \\ \hat{\mathbf{p}}^2 \end{pmatrix} = (0.3, 0.333, 0.287, 0.08 \mid 0.313, 0.587, 0.1)^T$$

From Example 3.5 we have

$$\widehat\Sigma_{11} = \begin{pmatrix} 0.21 & -0.1 & -0.086 & -0.024 \\ -0.1 & 0.222 & -0.096 & -0.027 \\ -0.086 & -0.096 & 0.204 & -0.023 \\ -0.024 & -0.027 & -0.023 & 0.074 \end{pmatrix}$$

In a similar manner we can obtain

$$\widehat\Sigma_{22} = \begin{pmatrix} 0.215 & -0.184 & -0.031 \\ -0.184 & 0.242 & -0.059 \\ -0.031 & -0.059 & 0.09 \end{pmatrix}$$

Next, we use the observed counts in Table 3.4 to obtain the empirical joint PMF for $\mathbf{X}_1$ and $\mathbf{X}_2$ using Eq. (3.13), as plotted in Figure 3.2. From these probabilities we get

$$E[\mathbf{X}_1\mathbf{X}_2^T] = \widehat P_{12} = \frac{1}{150}\begin{pmatrix} 7 & 33 & 5 \\ 24 & 18 & 8 \\ 13 & 30 & 0 \\ 3 & 7 & 2 \end{pmatrix} = \begin{pmatrix} 0.047 & 0.22 & 0.033 \\ 0.16 & 0.12 & 0.053 \\ 0.087 & 0.2 & 0 \\ 0.02 & 0.047 & 0.013 \end{pmatrix}$$
Table 3.4. Observed counts ($n_{ij}$): sepal length and sepal width

                                     X2
                        Short (e²₁)  Medium (e²₂)  Long (e²₃)
 X1  Very Short (e¹₁)        7            33            5
     Short (e¹₂)            24            18            8
     Long (e¹₃)             13            30            0
     Very Long (e¹₄)         3             7            2

[Figure 3.2. Empirical joint probability mass function: sepal length and sepal width.]
Further, we have

$$E[\mathbf{X}_1]E[\mathbf{X}_2]^T = \hat\mu_1\hat\mu_2^T = \hat{\mathbf{p}}^1(\hat{\mathbf{p}}^2)^T = \begin{pmatrix} 0.3 \\ 0.333 \\ 0.287 \\ 0.08 \end{pmatrix}\begin{pmatrix} 0.313 & 0.587 & 0.1 \end{pmatrix} = \begin{pmatrix} 0.094 & 0.176 & 0.03 \\ 0.104 & 0.196 & 0.033 \\ 0.09 & 0.168 & 0.029 \\ 0.025 & 0.047 & 0.008 \end{pmatrix}$$
We can now compute the across-attribute sample covariance matrix $\widehat\Sigma_{12}$ for $\mathbf{X}_1$ and $\mathbf{X}_2$ using Eq. (3.12), as follows:

$$\widehat\Sigma_{12} = \widehat P_{12} - \hat{\mathbf{p}}^1(\hat{\mathbf{p}}^2)^T = \begin{pmatrix} -0.047 & 0.044 & 0.003 \\ 0.056 & -0.076 & 0.02 \\ -0.003 & 0.032 & -0.029 \\ -0.005 & 0 & 0.005 \end{pmatrix}$$

One can observe that each row and column in $\widehat\Sigma_{12}$ sums to zero. Putting it all together, from $\widehat\Sigma_{11}$, $\widehat\Sigma_{22}$, and $\widehat\Sigma_{12}$ we obtain the sample covariance matrix as follows:

$$\widehat\Sigma = \begin{pmatrix} \widehat\Sigma_{11} & \widehat\Sigma_{12} \\ \widehat\Sigma_{12}^T & \widehat\Sigma_{22} \end{pmatrix} = \begin{pmatrix} 0.21 & -0.1 & -0.086 & -0.024 & -0.047 & 0.044 & 0.003 \\ -0.1 & 0.222 & -0.096 & -0.027 & 0.056 & -0.076 & 0.02 \\ -0.086 & -0.096 & 0.204 & -0.023 & -0.003 & 0.032 & -0.029 \\ -0.024 & -0.027 & -0.023 & 0.074 & -0.005 & 0 & 0.005 \\ -0.047 & 0.056 & -0.003 & -0.005 & 0.215 & -0.184 & -0.031 \\ 0.044 & -0.076 & 0.032 & 0 & -0.184 & 0.242 & -0.059 \\ 0.003 & 0.02 & -0.029 & 0.005 & -0.031 & -0.059 & 0.09 \end{pmatrix}$$

In $\widehat\Sigma$, each row and column also sums to zero.
3.2.1 Attribute Dependence: Contingency Analysis

Testing for the independence of the two categorical random variables $X_1$ and $X_2$ can be done via contingency table analysis. The main idea is to set up a hypothesis testing framework, where the null hypothesis $H_0$ is that $X_1$ and $X_2$ are independent, and the alternative hypothesis $H_1$ is that they are dependent. We then compute the value of the chi-square statistic $\chi^2$ under the null hypothesis. Depending on the $p$-value, we either accept or reject the null hypothesis; in the latter case the attributes are considered to be dependent.
Contingency Table
A contingency table for $X_1$ and $X_2$ is the $m_1 \times m_2$ matrix of observed counts $n_{ij}$ for all pairs of values $(\mathbf e_{1i}, \mathbf e_{2j})$ in the given sample of size $n$, defined as

$$\mathbf N_{12} = n \cdot \widehat{\mathbf P}_{12} = \begin{pmatrix} n_{11} & n_{12} & \cdots & n_{1m_2} \\ n_{21} & n_{22} & \cdots & n_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ n_{m_1 1} & n_{m_1 2} & \cdots & n_{m_1 m_2} \end{pmatrix}$$
Table 3.5. Contingency table: sepal length vs. sepal width

Sepal length (X1)            Sepal width (X2)
                     Short (a21)   Medium (a22)   Long (a23)   Row Counts
Very Short (a11)          7             33             5       n^1_1 = 45
Short (a12)              24             18             8       n^1_2 = 50
Long (a13)               13             30             0       n^1_3 = 43
Very Long (a14)           3              7             2       n^1_4 = 12
Column Counts        n^2_1 = 47    n^2_2 = 88    n^2_3 = 15    n = 150
where $\widehat{\mathbf P}_{12}$ is the empirical joint PMF for $\mathbf X_1$ and $\mathbf X_2$, computed via Eq. (3.13). The contingency table is then augmented with row and column marginal counts, as follows:

$$\mathbf N_1 = n \cdot \hat{\mathbf p}_1 = \begin{pmatrix} n^1_1 \\ \vdots \\ n^1_{m_1}\end{pmatrix} \qquad \mathbf N_2 = n \cdot \hat{\mathbf p}_2 = \begin{pmatrix} n^2_1 \\ \vdots \\ n^2_{m_2}\end{pmatrix}$$
Note that the marginal row and column entries and the sample size satisfy the following constraints:

$$n^1_i = \sum_{j=1}^{m_2} n_{ij} \qquad n^2_j = \sum_{i=1}^{m_1} n_{ij} \qquad n = \sum_{i=1}^{m_1} n^1_i = \sum_{j=1}^{m_2} n^2_j = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} n_{ij}$$
It is worth noting that both $\mathbf N_1$ and $\mathbf N_2$ have a multinomial distribution with parameters $\mathbf p_1 = (p^1_1, \ldots, p^1_{m_1})$ and $\mathbf p_2 = (p^2_1, \ldots, p^2_{m_2})$, respectively. Further, $\mathbf N_{12}$ also has a multinomial distribution with parameters $\mathbf P_{12} = \{p_{ij}\}$, for $1 \le i \le m_1$ and $1 \le j \le m_2$.
Example 3.9 (Contingency Table). Table 3.4 shows the observed counts for the discretized sepal length ($X_1$) and sepal width ($X_2$) attributes. Augmenting the table with the row and column marginal counts and the sample size yields the final contingency table shown in Table 3.5.
χ² Statistic and Hypothesis Testing
Under the null hypothesis $X_1$ and $X_2$ are assumed to be independent, which means that their joint probability mass function is given as

$$\hat p_{ij} = \hat p^1_i \cdot \hat p^2_j$$

Under this independence assumption, the expected frequency for each pair of values is given as

$$e_{ij} = n \cdot \hat p_{ij} = n \cdot \hat p^1_i \cdot \hat p^2_j = n \cdot \frac{n^1_i}{n} \cdot \frac{n^2_j}{n} = \frac{n^1_i\, n^2_j}{n} \tag{3.14}$$
However, from the sample we already have the observed frequency of each pair of values, $n_{ij}$. We would like to determine whether there is a significant difference in the observed and expected frequencies for each pair of values. If there is no significant difference, then the independence assumption is valid and we accept the null hypothesis that the attributes are independent. On the other hand, if there is a significant difference, then the null hypothesis should be rejected and we conclude that the attributes are dependent.

The $\chi^2$ statistic quantifies the difference between observed and expected counts for each pair of values; it is defined as follows:

$$\chi^2 = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \tag{3.15}$$
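To make Eqs. (3.14) and (3.15) concrete, here is a minimal sketch in Python with NumPy; this is hypothetical code written for this transcript, not the book's own implementation:

import numpy as np

def chi_squared(N):
    """Chi-squared statistic for an m1 x m2 table of observed counts N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()                      # sample size
    row = N.sum(axis=1)              # marginal row counts n^1_i
    col = N.sum(axis=0)              # marginal column counts n^2_j
    E = np.outer(row, col) / n       # expected counts e_ij = n^1_i n^2_j / n
    return ((N - E) ** 2 / E).sum(), E

# Observed counts from Table 3.4 (sepal length vs. sepal width)
N = [[7, 33, 5], [24, 18, 8], [13, 30, 0], [3, 7, 2]]
chi2, E = chi_squared(N)
print(round(chi2, 1))   # 21.8, matching Example 3.10 below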
At this point, we need to determine the probability of obtaining the computed $\chi^2$ value. In general, this can be rather difficult if we do not know the sampling distribution of a given statistic. Fortunately, for the $\chi^2$ statistic it is known that its sampling distribution follows the chi-squared density function with $q$ degrees of freedom:

$$f(x \mid q) = \frac{1}{2^{q/2}\,\Gamma(q/2)}\, x^{\frac{q}{2}-1} e^{-\frac{x}{2}} \tag{3.16}$$

where the gamma function $\Gamma$ is defined as

$$\Gamma(k > 0) = \int_0^\infty x^{k-1} e^{-x}\, dx \tag{3.17}$$
The degrees of freedom, $q$, represent the number of independent parameters. In the contingency table there are $m_1 \times m_2$ observed counts $n_{ij}$. However, note that each row $i$ and each column $j$ must sum to $n^1_i$ and $n^2_j$, respectively. Further, the sum of the row and column marginals must also add to $n$; thus we have to remove $(m_1 + m_2)$ parameters from the number of independent parameters. However, doing this removes one of the parameters, say $n_{m_1 m_2}$, twice, so we have to add back one to the count. The total degrees of freedom is therefore

$$q = |dom(X_1)| \times |dom(X_2)| - \left(|dom(X_1)| + |dom(X_2)|\right) + 1 = m_1 m_2 - m_1 - m_2 + 1 = (m_1 - 1)(m_2 - 1)$$
p-value
The p-value of a statistic $\theta$ is defined as the probability of obtaining a value at least as extreme as the observed value, say $z$, under the null hypothesis:

$$p\text{-value}(z) = P(\theta \ge z) = 1 - F(z)$$

where $F$ is the cumulative probability distribution for the statistic.

The $p$-value gives a measure of how surprising the observed value of the statistic is. If the observed value lies in a low-probability region, then the value is more surprising. In general, the lower the $p$-value, the more surprising the observed value, and the more the grounds for rejecting the null hypothesis. The null hypothesis is rejected if the $p$-value is below some significance level, $\alpha$. For example, if $\alpha = 0.01$, then we reject the null hypothesis if $p\text{-value}(z) \le \alpha$. The significance level $\alpha$ corresponds to the probability of rejecting the null hypothesis when it is true. For a given significance level $\alpha$, the value of the test statistic, say $z$, with a $p$-value of $p\text{-value}(z) = \alpha$, is called a critical value. An alternative test for rejection of the null hypothesis is to check whether $\chi^2 > z$, as in that case the $p$-value of the observed $\chi^2$ value is bounded by $\alpha$, that is, $p\text{-value}(\chi^2) \le p\text{-value}(z) = \alpha$. The value $1 - \alpha$ is also called the confidence level.

Table 3.6. Expected counts

                          X2
X1                   Short (a21)   Medium (a22)   Long (a23)
Very Short (a11)        14.1          26.4           4.5
Short (a12)            15.67         29.33           5.0
Long (a13)             13.47         25.23           4.3
Very Long (a14)         3.76          7.04           1.2
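In practice both the p-value and the critical value can be read off the chi-squared CDF. A minimal sketch, assuming SciPy is available (scipy.stats.chi2 is SciPy's chi-squared distribution object, not something defined in this chapter):

from scipy.stats import chi2 as chi2_dist

q = 6                                   # degrees of freedom
stat = 21.8                             # observed chi-squared statistic
p_value = 1 - chi2_dist.cdf(stat, q)    # P(theta >= stat), about 0.0013
z = chi2_dist.ppf(1 - 0.01, q)          # critical value at alpha = 0.01, about 16.81
print(p_value, z)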
Example 3.10. Consider the contingency table for sepal length and sepal width in Table 3.5. We compute the expected counts using Eq. (3.14); these counts are shown in Table 3.6. For example, we have

$$e_{11} = \frac{n^1_1 \cdot n^2_1}{n} = \frac{45 \cdot 47}{150} = \frac{2115}{150} = 14.1$$

Next we use Eq. (3.15) to compute the value of the $\chi^2$ statistic, which is given as $\chi^2 = 21.8$.

Further, the number of degrees of freedom is given as

$$q = (m_1 - 1)\cdot(m_2 - 1) = 3 \cdot 2 = 6$$

The plot of the chi-squared density function with 6 degrees of freedom is shown in Figure 3.3. From the cumulative chi-squared distribution, we obtain

$$p\text{-value}(21.8) = 1 - F(21.8 \mid 6) = 1 - 0.9987 = 0.0013$$

At a significance level of $\alpha = 0.01$, we would certainly be justified in rejecting the null hypothesis because the large value of the $\chi^2$ statistic is indeed surprising. Further, at the 0.01 significance level, the critical value of the statistic is

$$z = F^{-1}(1 - 0.01 \mid 6) = F^{-1}(0.99 \mid 6) = 16.81$$

This critical value is also shown in Figure 3.3, and we can clearly see that the observed value of 21.8 is in the rejection region, as $21.8 > z = 16.81$. In effect, we reject the null hypothesis that sepal length and sepal width are independent, and accept the alternative hypothesis that they are dependent.
Figure 3.3. Chi-squared distribution (q = 6). The critical value z = 16.8 marks the start of the H0 rejection region at α = 0.01; the observed value 21.8 lies inside it.
3.3 MULTIVARIATE ANALYSIS

Assume that the dataset comprises $d$ categorical attributes $X_j$ ($1 \le j \le d$) with $dom(X_j) = \{a_{j1}, a_{j2}, \ldots, a_{jm_j}\}$. We are given $n$ categorical points of the form $\mathbf x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ with $x_{ij} \in dom(X_j)$. The dataset is thus an $n \times d$ symbolic matrix

$$\mathbf D = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \\ x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd}\end{pmatrix}$$
Each attribute $X_i$ is modeled as an $m_i$-dimensional multivariate Bernoulli variable $\mathbf X_i$, and their joint distribution is modeled as a $d' = \sum_{j=1}^d m_j$ dimensional vector random variable

$$\mathbf X = \begin{pmatrix}\mathbf X_1 \\ \vdots \\ \mathbf X_d\end{pmatrix}$$
Each categorical data point $\mathbf v = (v_1, v_2, \ldots, v_d)^T$ is therefore represented as a $d'$-dimensional binary vector

$$\mathbf X(\mathbf v) = \begin{pmatrix}\mathbf X_1(v_1) \\ \vdots \\ \mathbf X_d(v_d)\end{pmatrix} = \begin{pmatrix}\mathbf e_{1k_1} \\ \vdots \\ \mathbf e_{dk_d}\end{pmatrix}$$

provided $v_i = a_{ik_i}$, the $k_i$th symbol of $X_i$. Here $\mathbf e_{ik_i}$ is the $k_i$th standard basis vector in $\mathbb R^{m_i}$.
Mean
Generalizing from the bivariate case, the mean and sample mean for $\mathbf X$ are given as

$$\boldsymbol\mu = E[\mathbf X] = \begin{pmatrix}\boldsymbol\mu_1\\ \vdots\\ \boldsymbol\mu_d\end{pmatrix} = \begin{pmatrix}\mathbf p_1\\ \vdots\\ \mathbf p_d\end{pmatrix} \qquad \hat{\boldsymbol\mu} = \begin{pmatrix}\hat{\boldsymbol\mu}_1\\ \vdots\\ \hat{\boldsymbol\mu}_d\end{pmatrix} = \begin{pmatrix}\hat{\mathbf p}_1\\ \vdots\\ \hat{\mathbf p}_d\end{pmatrix}$$

where $\mathbf p_i = (p^i_1, \ldots, p^i_{m_i})^T$ is the PMF for $\mathbf X_i$, and $\hat{\mathbf p}_i = (\hat p^i_1, \ldots, \hat p^i_{m_i})^T$ is the empirical PMF for $\mathbf X_i$.
Covariance Matrix
The covariance matrix for $\mathbf X$, and its estimate from the sample, are given as the $d' \times d'$ matrices:

$$\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1d}\\ \Sigma_{12}^T & \Sigma_{22} & \cdots & \Sigma_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ \Sigma_{1d}^T & \Sigma_{2d}^T & \cdots & \Sigma_{dd}\end{pmatrix} \qquad \widehat\Sigma = \begin{pmatrix}\widehat\Sigma_{11} & \widehat\Sigma_{12} & \cdots & \widehat\Sigma_{1d}\\ \widehat\Sigma_{12}^T & \widehat\Sigma_{22} & \cdots & \widehat\Sigma_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ \widehat\Sigma_{1d}^T & \widehat\Sigma_{2d}^T & \cdots & \widehat\Sigma_{dd}\end{pmatrix}$$

where $d' = \sum_{i=1}^d m_i$, and $\Sigma_{ij}$ (and $\widehat\Sigma_{ij}$) is the $m_i \times m_j$ covariance matrix (and its estimate) for attributes $\mathbf X_i$ and $\mathbf X_j$:

$$\Sigma_{ij} = \mathbf P_{ij} - \mathbf p_i \mathbf p_j^T \qquad \widehat\Sigma_{ij} = \widehat{\mathbf P}_{ij} - \hat{\mathbf p}_i \hat{\mathbf p}_j^T \tag{3.18}$$

Here $\mathbf P_{ij}$ is the joint PMF and $\widehat{\mathbf P}_{ij}$ is the empirical joint PMF for $\mathbf X_i$ and $\mathbf X_j$, which can be computed using Eq. (3.13).
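As a concrete illustration, here is a minimal sketch (hypothetical Python/NumPy written for this transcript, not taken from the book) that estimates $\widehat{\mathbf P}_{ij}$ and $\widehat\Sigma_{ij}$ of Eq. (3.18) from two columns of symbolic values:

import numpy as np

def cross_covariance(xi, xj):
    """Estimate P_ij and Sigma_ij = P_ij - p_i p_j^T for two categorical columns."""
    cats_i, code_i = np.unique(np.asarray(xi), return_inverse=True)
    cats_j, code_j = np.unique(np.asarray(xj), return_inverse=True)
    n = len(code_i)
    P = np.zeros((len(cats_i), len(cats_j)))
    np.add.at(P, (code_i, code_j), 1.0 / n)   # empirical joint PMF
    p_i, p_j = P.sum(axis=1), P.sum(axis=0)   # empirical marginal PMFs
    return P, P - np.outer(p_i, p_j)

# Toy usage with symbolic values
P, Sigma = cross_covariance(["Short", "Long", "Short", "Long"],
                            ["Medium", "Medium", "Short", "Long"])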
Example 3.11 (Multivariate Analysis). Let us consider the 3-dimensional subset of the Iris dataset, with the discretized attributes sepal length ($X_1$) and sepal width ($X_2$), and the categorical attribute class ($X_3$). The domains for $X_1$ and $X_2$ are given in Table 3.1 and Table 3.3, respectively, and $dom(X_3) = \{$iris-versicolor, iris-setosa, iris-virginica$\}$. Each value of $X_3$ occurs 50 times.

The categorical point $\mathbf x = ($Short, Medium, iris-versicolor$)$ is modeled as the vector

$$\mathbf X(\mathbf x) = \begin{pmatrix}\mathbf e_{12}\\ \mathbf e_{22}\\ \mathbf e_{31}\end{pmatrix} = (0,1,0,0 \mid 0,1,0 \mid 1,0,0)^T \in \mathbb R^{10}$$
From Example 3.8 and the fact that each value in $dom(X_3)$ occurs 50 times in a sample of $n = 150$, the sample mean is given as

$$\hat{\boldsymbol\mu} = \begin{pmatrix}\hat{\boldsymbol\mu}_1\\ \hat{\boldsymbol\mu}_2\\ \hat{\boldsymbol\mu}_3\end{pmatrix} = \begin{pmatrix}\hat{\mathbf p}_1\\ \hat{\mathbf p}_2\\ \hat{\mathbf p}_3\end{pmatrix} = (0.3,\ 0.333,\ 0.287,\ 0.08 \mid 0.313,\ 0.587,\ 0.1 \mid 0.33,\ 0.33,\ 0.33)^T$$
Using $\hat{\mathbf p}_3 = (0.33, 0.33, 0.33)^T$ we can compute the sample covariance matrix for $\mathbf X_3$ using Eq. (3.9):

$$\widehat\Sigma_{33} = \begin{pmatrix}0.222 & -0.111 & -0.111\\ -0.111 & 0.222 & -0.111\\ -0.111 & -0.111 & 0.222\end{pmatrix}$$
Using Eq. (3.18) we obtain

$$\widehat\Sigma_{13} = \begin{pmatrix}-0.067 & 0.16 & -0.093\\ 0.082 & -0.038 & -0.044\\ 0.011 & -0.096 & 0.084\\ -0.027 & -0.027 & 0.053\end{pmatrix} \qquad \widehat\Sigma_{23} = \begin{pmatrix}0.076 & -0.098 & 0.022\\ -0.042 & 0.044 & -0.002\\ -0.033 & 0.053 & -0.02\end{pmatrix}$$
Combined with $\widehat\Sigma_{11}$, $\widehat\Sigma_{22}$ and $\widehat\Sigma_{12}$ from Example 3.8, the final sample covariance matrix is the $10 \times 10$ symmetric matrix given as

$$\widehat\Sigma = \begin{pmatrix}\widehat\Sigma_{11} & \widehat\Sigma_{12} & \widehat\Sigma_{13}\\ \widehat\Sigma_{12}^T & \widehat\Sigma_{22} & \widehat\Sigma_{23}\\ \widehat\Sigma_{13}^T & \widehat\Sigma_{23}^T & \widehat\Sigma_{33}\end{pmatrix}$$
3.3.1 Multiway Contingency Analysis

For multiway dependence analysis, we have to first determine the empirical joint probability mass function for $\mathbf X$:

$$\hat f(\mathbf e_{1i_1}, \mathbf e_{2i_2}, \ldots, \mathbf e_{di_d}) = \frac{1}{n}\sum_{k=1}^n I_{i_1 i_2 \ldots i_d}(\mathbf x_k) = \frac{n_{i_1 i_2 \ldots i_d}}{n} = \hat p_{i_1 i_2 \ldots i_d}$$
where $I_{i_1 i_2 \ldots i_d}$ is the indicator variable

$$I_{i_1 i_2 \ldots i_d}(\mathbf x_k) = \begin{cases}1 & \text{if } \mathbf x_{k1} = \mathbf e_{1i_1},\ \mathbf x_{k2} = \mathbf e_{2i_2},\ \ldots,\ \mathbf x_{kd} = \mathbf e_{di_d}\\ 0 & \text{otherwise}\end{cases}$$
The sum of $I_{i_1 i_2 \ldots i_d}$ over all the $n$ points in the sample yields the number of occurrences, $n_{i_1 i_2 \ldots i_d}$, of the symbolic vector $(a_{1i_1}, a_{2i_2}, \ldots, a_{di_d})$. Dividing the occurrences by the sample size results in the probability of observing those symbols. Using the notation $\mathbf i = (i_1, i_2, \ldots, i_d)$ to denote the index tuple, we can write the joint empirical PMF as the $d$-dimensional matrix $\widehat{\mathbf P}$ of size $m_1 \times m_2 \times \cdots \times m_d = \prod_{i=1}^d m_i$, given as

$$\widehat{\mathbf P}(\mathbf i) = \{\hat p_{\mathbf i}\} \quad \text{for all index tuples } \mathbf i, \text{ with } 1 \le i_1 \le m_1, \ldots, 1 \le i_d \le m_d$$
where $\hat p_{\mathbf i} = \hat p_{i_1 i_2 \ldots i_d}$. The $d$-dimensional contingency table is then given as

$$\mathbf N = n \times \widehat{\mathbf P} = \{n_{\mathbf i}\} \quad \text{for all index tuples } \mathbf i, \text{ with } 1 \le i_1 \le m_1, \ldots, 1 \le i_d \le m_d$$

where $n_{\mathbf i} = n_{i_1 i_2 \ldots i_d}$. The contingency table is augmented with the marginal count vectors $\mathbf N_i$ for all $d$ attributes $\mathbf X_i$:

$$\mathbf N_i = n\,\hat{\mathbf p}_i = \begin{pmatrix}n^i_1\\ \vdots\\ n^i_{m_i}\end{pmatrix}$$

where $\hat{\mathbf p}_i$ is the empirical PMF for $\mathbf X_i$.
χ²-Test
We can test for a $d$-way dependence between the $d$ categorical attributes using the null hypothesis $H_0$ that they are $d$-way independent. The alternative hypothesis $H_1$ is that they are not $d$-way independent, that is, they are dependent in some way. Note that $d$-dimensional contingency analysis indicates whether all $d$ attributes taken together are independent or not. In general we may have to conduct $k$-way contingency analysis to test if any subset of $k \le d$ attributes are independent or not.

Under the null hypothesis, the expected number of occurrences of the symbol tuple $(a_{1i_1}, a_{2i_2}, \ldots, a_{di_d})$ is given as

$$e_{\mathbf i} = n \cdot \hat p_{\mathbf i} = n \cdot \prod_{j=1}^d \hat p^j_{i_j} = \frac{n^1_{i_1} n^2_{i_2} \cdots n^d_{i_d}}{n^{d-1}} \tag{3.19}$$
The chi-squared statistic measures the difference between the observed counts $n_{\mathbf i}$ and the expected counts $e_{\mathbf i}$:

$$\chi^2 = \sum_{\mathbf i} \frac{(n_{\mathbf i} - e_{\mathbf i})^2}{e_{\mathbf i}} = \sum_{i_1=1}^{m_1}\sum_{i_2=1}^{m_2}\cdots\sum_{i_d=1}^{m_d} \frac{(n_{i_1,i_2,\ldots,i_d} - e_{i_1,i_2,\ldots,i_d})^2}{e_{i_1,i_2,\ldots,i_d}} \tag{3.20}$$
The $\chi^2$ statistic follows a chi-squared density function with $q$ degrees of freedom. For the $d$-way contingency table we can compute $q$ by noting that there are ostensibly $\prod_{i=1}^d |dom(X_i)|$ independent parameters (the counts). However, we have to remove $\sum_{i=1}^d |dom(X_i)|$ degrees of freedom because the marginal count vector along each dimension $\mathbf X_i$ must equal $\mathbf N_i$. However, doing so removes one of the parameters $d$ times, so we need to add back $d - 1$ to the free parameters count. The total number of degrees of freedom is given as

$$q = \prod_{i=1}^d |dom(X_i)| - \sum_{i=1}^d |dom(X_i)| + (d - 1) = \prod_{i=1}^d m_i - \sum_{i=1}^d m_i + d - 1 \tag{3.21}$$

To reject the null hypothesis, we have to check whether the $p$-value of the observed $\chi^2$ value is smaller than the desired significance level $\alpha$ (say $\alpha = 0.01$) using the chi-squared density with $q$ degrees of freedom [Eq. (3.16)].
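A compact sketch of the d-way test (hypothetical NumPy code, assuming the observed counts are already arranged in a d-dimensional array):

import numpy as np

def multiway_chi2(N):
    """d-way chi-squared statistic and degrees of freedom from a
    d-dimensional array of observed counts N (Eqs. 3.19-3.21)."""
    N = np.asarray(N, dtype=float)
    d, n = N.ndim, N.sum()
    # Expected counts: outer product of the d marginal count vectors over n^(d-1)
    E = np.ones_like(N)
    for axis in range(d):
        marginal = N.sum(axis=tuple(a for a in range(d) if a != axis))
        shape = [1] * d
        shape[axis] = -1
        E = E * marginal.reshape(shape)
    E /= n ** (d - 1)
    chi2 = ((N - E) ** 2 / E).sum()
    q = int(np.prod(N.shape)) - sum(N.shape) + d - 1
    return chi2, q

For instance, with shapes (4, 3, 3) this yields q = 28, as in the example that follows.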
Using Eq. (3.21), the number of degrees of freedom is given as

$$q = 4 \cdot 3 \cdot 3 - (4 + 3 + 3) + 2 = 36 - 10 + 2 = 28$$

In Figure 3.4 the counts in bold are the dependent parameters. All other counts are independent. In fact, any eight distinct cells could have been chosen as the dependent parameters.

For a significance level of $\alpha = 0.01$, the critical value of the chi-squared distribution is $z = 48.28$. The observed value of $\chi^2 = 231.06$ is much greater than $z$, and it is thus extremely unlikely to happen under the null hypothesis. We conclude that the three attributes are not 3-way independent, but rather there is some dependence between them. However, this example also highlights one of the pitfalls of multiway contingency analysis. We can observe in Figure 3.4 that many of the observed counts are zero. This is due to the fact that the sample size is small, and we cannot reliably estimate all the multiway counts. Consequently, the dependence test may not be reliable as well.
3.4 DISTANCE AND ANGLE

With the modeling of categorical attributes as multivariate Bernoulli variables, it is possible to compute the distance or the angle between any two points $\mathbf x_i$ and $\mathbf x_j$:

$$\mathbf x_i = \begin{pmatrix}\mathbf e_{1i_1}\\ \vdots\\ \mathbf e_{di_d}\end{pmatrix} \qquad \mathbf x_j = \begin{pmatrix}\mathbf e_{1j_1}\\ \vdots\\ \mathbf e_{dj_d}\end{pmatrix}$$

The different measures of distance and similarity rely on the number of matching and mismatching values (or symbols) across the $d$ attributes $\mathbf X_k$. For instance, we can compute the number of matching values $s$ via the dot product:

$$s = \mathbf x_i^T \mathbf x_j = \sum_{k=1}^d (\mathbf e_{ki_k})^T \mathbf e_{kj_k}$$

On the other hand, the number of mismatches is simply $d - s$. Also useful is the norm of each point:

$$\|\mathbf x_i\|^2 = \mathbf x_i^T \mathbf x_i = d$$
Euclidean Distance
The Euclidean distance between $\mathbf x_i$ and $\mathbf x_j$ is given as

$$\delta(\mathbf x_i, \mathbf x_j) = \|\mathbf x_i - \mathbf x_j\| = \sqrt{\mathbf x_i^T \mathbf x_i - 2\,\mathbf x_i^T \mathbf x_j + \mathbf x_j^T \mathbf x_j} = \sqrt{2(d - s)}$$

Thus, the maximum Euclidean distance between any two points is $\sqrt{2d}$, which happens when there are no common symbols between them, that is, when $s = 0$.
Hamming Distance
The Hamming distance between $\mathbf x_i$ and $\mathbf x_j$ is defined as the number of mismatched values:

$$\delta_H(\mathbf x_i, \mathbf x_j) = d - s = \frac{1}{2}\,\delta(\mathbf x_i, \mathbf x_j)^2$$

Hamming distance is thus equivalent to half the squared Euclidean distance.
Cosine Similarity
The cosine of the angle between $\mathbf x_i$ and $\mathbf x_j$ is given as

$$\cos\theta = \frac{\mathbf x_i^T \mathbf x_j}{\|\mathbf x_i\| \cdot \|\mathbf x_j\|} = \frac{s}{d}$$
Jaccard Coefficient
The Jaccard Coefficient is a commonly used similarity measure between two categorical points. It is defined as the ratio of the number of matching values to the number of distinct values that appear in both $\mathbf x_i$ and $\mathbf x_j$, across the $d$ attributes:

$$J(\mathbf x_i, \mathbf x_j) = \frac{s}{2(d - s) + s} = \frac{s}{2d - s}$$

where we utilize the observation that when the two points do not match for dimension $k$, they contribute 2 to the distinct symbol count; otherwise, if they match, the number of distinct symbols increases by 1. Over the $d - s$ mismatches and $s$ matches, the number of distinct symbols is $2(d - s) + s$.
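All four measures reduce to simple functions of $d$ and $s$, so they can be computed directly from the symbolic tuples without forming the binary vectors; a small sketch (hypothetical Python written for this transcript, not from the book):

import math

def categorical_measures(x, y):
    """Distance and similarity between two categorical points,
    given as tuples of symbols over the same d attributes."""
    d = len(x)
    s = sum(a == b for a, b in zip(x, y))   # number of matching symbols
    return {
        "euclidean": math.sqrt(2 * (d - s)),
        "hamming":   d - s,
        "cosine":    s / d,
        "jaccard":   s / (2 * d - s),
    }

# Points from Example 3.13: euclidean 2.0, hamming 2, cosine 0.333, jaccard 0.2
print(categorical_measures(("Short", "Medium", "iris-versicolor"),
                           ("VeryShort", "Medium", "iris-setosa")))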
Example 3.13. Consider the 3-dimensional categorical data from Example 3.11. The symbolic point (Short, Medium, iris-versicolor) is modeled as the vector

$$\mathbf x_1 = \begin{pmatrix}\mathbf e_{12}\\ \mathbf e_{22}\\ \mathbf e_{31}\end{pmatrix} = (0,1,0,0 \mid 0,1,0 \mid 1,0,0)^T \in \mathbb R^{10}$$

and the symbolic point (VeryShort, Medium, iris-setosa) is modeled as

$$\mathbf x_2 = \begin{pmatrix}\mathbf e_{11}\\ \mathbf e_{22}\\ \mathbf e_{32}\end{pmatrix} = (1,0,0,0 \mid 0,1,0 \mid 0,1,0)^T \in \mathbb R^{10}$$

The number of matching symbols is given as

$$s = \mathbf x_1^T \mathbf x_2 = (\mathbf e_{12})^T\mathbf e_{11} + (\mathbf e_{22})^T\mathbf e_{22} + (\mathbf e_{31})^T\mathbf e_{32} = 0 + 1 + 0 = 1$$

The Euclidean and Hamming distances are given as

$$\delta(\mathbf x_1, \mathbf x_2) = \sqrt{2(d - s)} = \sqrt{2 \cdot 2} = \sqrt 4 = 2 \qquad \delta_H(\mathbf x_1, \mathbf x_2) = d - s = 3 - 1 = 2$$

The cosine and Jaccard similarity are given as

$$\cos\theta = \frac{s}{d} = \frac{1}{3} = 0.333 \qquad J(\mathbf x_1, \mathbf x_2) = \frac{s}{2d - s} = \frac{1}{5} = 0.2$$
3.5 DISCRETIZATION

Discretization, also called binning, converts numeric attributes into categorical ones. It is usually applied for data mining methods that cannot handle numeric attributes. It can also help in reducing the number of values for an attribute, especially if there is noise in the numeric measurements; discretization allows one to ignore small and irrelevant differences in the values.

Formally, given a numeric attribute $X$, and a random sample $\{x_i\}_{i=1}^n$ of size $n$ drawn from $X$, the discretization task is to divide the value range of $X$ into $k$ consecutive intervals, also called bins, by finding $k - 1$ boundary values $v_1, v_2, \ldots, v_{k-1}$ that yield the $k$ intervals:

$$[x_{\min}, v_1],\ (v_1, v_2],\ \ldots,\ (v_{k-1}, x_{\max}]$$

where the extremes of the range of $X$ are given as

$$x_{\min} = \min_i\{x_i\} \qquad x_{\max} = \max_i\{x_i\}$$

The resulting $k$ intervals or bins, which span the entire range of $X$, are usually mapped to symbolic values that comprise the domain for the new categorical attribute $X$.
Equal-Width Intervals
The simplest binning approach is to partition the range of $X$ into $k$ equal-width intervals. The interval width is simply the range of $X$ divided by $k$:

$$w = \frac{x_{\max} - x_{\min}}{k}$$

Thus, the $i$th interval boundary is given as

$$v_i = x_{\min} + i\,w, \quad \text{for } i = 1, \ldots, k - 1$$
Equal-Frequency Intervals
In equal-frequency binning we divide the range of $X$ into intervals that contain (approximately) equal numbers of points; equal frequency may not be possible due to repeated values. The intervals can be computed from the empirical quantile or inverse cumulative distribution function $\hat F^{-1}(q)$ for $X$ [Eq. (2.2)]. Recall that $\hat F^{-1}(q) = \min\{x \mid P(X \le x) \ge q\}$, for $q \in [0, 1]$. In particular, we require that each interval contain $1/k$ of the probability mass; therefore, the interval boundaries are given as follows:

$$v_i = \hat F^{-1}(i/k) \quad \text{for } i = 1, \ldots, k - 1$$
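Both schemes take only a few lines of code. The following sketch (hypothetical Python with NumPy, not the book's implementation) returns the k − 1 interior boundaries for each approach:

import numpy as np

def equal_width_boundaries(x, k):
    """Boundaries v_1..v_{k-1} for k equal-width bins."""
    lo, hi = np.min(x), np.max(x)
    w = (hi - lo) / k
    return [lo + i * w for i in range(1, k)]

def equal_frequency_boundaries(x, k):
    """Boundaries from the empirical inverse CDF at quantiles i/k."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    # F^{-1}(q) = smallest x with empirical P(X <= x) >= q
    return [xs[int(np.ceil(q * n)) - 1] for q in (i / k for i in range(1, k))]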
Example 3.14. Consider the sepal length attribute in the Iris dataset. Its minimum and maximum values are

$$x_{\min} = 4.3 \qquad x_{\max} = 7.9$$

We discretize it into $k = 4$ bins using equal-width binning. The width of an interval is given as

$$w = \frac{7.9 - 4.3}{4} = \frac{3.6}{4} = 0.9$$

and therefore the interval boundaries are

$$v_1 = 4.3 + 0.9 = 5.2 \qquad v_2 = 4.3 + 2 \cdot 0.9 = 6.1 \qquad v_3 = 4.3 + 3 \cdot 0.9 = 7.0$$

The four resulting bins for sepal length are shown in Table 3.1, which also shows the number of points $n_i$ in each bin, which are not balanced among the bins.
For equal-frequency discretization, consider the empirical inverse cumulative distribution function (CDF) for sepal length shown in Figure 3.5. With $k = 4$ bins, the bin boundaries are the quartile values (which are shown as dashed lines):

$$v_1 = \hat F^{-1}(0.25) = 5.1 \qquad v_2 = \hat F^{-1}(0.50) = 5.8 \qquad v_3 = \hat F^{-1}(0.75) = 6.4$$

The resulting intervals are shown in Table 3.8. We can see that although the interval widths vary, they contain a more balanced number of points. We do not get identical counts for all the bins because many values are repeated; for instance, there are nine points with value 5.1 and there are seven points with value 5.8.

Figure 3.5. Empirical inverse CDF: sepal length.

Table 3.8. Equal-frequency discretization: sepal length

Bin           Width   Count
[4.3, 5.1]     0.8    n1 = 41
(5.1, 5.8]     0.7    n2 = 39
(5.8, 6.4]     0.6    n3 = 35
(6.4, 7.9]     1.5    n4 = 35
3.6 FURTHER READING

For a comprehensive introduction to categorical data analysis see Agresti (2012). Some aspects also appear in Wasserman (2004). For an entropy-based supervised discretization method that takes the class attribute into account see Fayyad and Irani (1993).

Agresti, A. (2012). Categorical Data Analysis, 3rd ed. Hoboken, NJ: John Wiley & Sons.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval Discretization of Continuous-valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. Morgan-Kaufmann, pp. 1022–1027.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer Science+Business Media.
3.7 EXERCISES

Q1. Show that for categorical points the cosine similarity between any two vectors lies in the range $\cos\theta \in [0, 1]$, and consequently $\theta \in [0°, 90°]$.

Q2. Prove that $E[(\mathbf X_1 - \boldsymbol\mu_1)(\mathbf X_2 - \boldsymbol\mu_2)^T] = E[\mathbf X_1 \mathbf X_2^T] - E[\mathbf X_1]\,E[\mathbf X_2]^T$.

Table 3.9. Contingency table for Q3

            Z = f            Z = g
        Y = d   Y = e    Y = d   Y = e
X = a     5      10       10       5
X = b    15       5        5      20
X = c    20      10       25      10

Table 3.10. χ² critical values for different p-values for different degrees of freedom (q): For example, for q = 5 degrees of freedom, the critical value of χ² = 11.070 has p-value = 0.05.

q   0.995   0.99    0.975   0.95    0.90    0.10    0.05    0.025   0.01    0.005
1   —       —       0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
2   0.010   0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210   10.597
3   0.072   0.115   0.216   0.352   0.584   6.251   7.815   9.348   11.345  12.838
4   0.207   0.297   0.484   0.711   1.064   7.779   9.488   11.143  13.277  14.860
5   0.412   0.554   0.831   1.145   1.610   9.236   11.070  12.833  15.086  16.750
6   0.676   0.872   1.237   1.635   2.204   10.645  12.592  14.449  16.812  18.548

Q3. Consider the 3-way contingency table for attributes X, Y, Z shown in Table 3.9. Compute the χ² metric for the correlation between Y and Z. Are they dependent or independent at the 95% confidence level? See Table 3.10 for χ² values.

Q4. Consider the "mixed" data given in Table 3.11. Here $X_1$ is a numeric attribute and $X_2$ is a categorical one. Assume that the domain of $X_2$ is given as $dom(X_2) = \{a, b\}$. Answer the following questions.
(a) What is the mean vector for this dataset?
(b) What is the covariance matrix?

Q5. In Table 3.11, assume that $X_1$ is discretized into three bins, as follows:

$$c_1 = (-2, -0.5] \qquad c_2 = (-0.5, 0.5] \qquad c_3 = (0.5, 2]$$

Answer the following questions:
(a) Construct the contingency table between the discretized $X_1$ and $X_2$ attributes. Include the marginal counts.
(b) Compute the χ² statistic between them.
(c) Determine whether they are dependent or not at the 5% significance level. Use the χ² critical values from Table 3.10.

Table 3.11. Dataset for Q4 and Q5

 X1      X2
 0.3     a
-0.3     b
 0.44    a
-0.60    a
 0.40    a
 1.20    b
-0.12    a
-1.60    b
 1.60    b
-1.32    a
CHAPTER 4
Graph Data
The traditional paradigm in data analysis typically assumes that each data instance is
independent of another. However, often data instances may be connected or linked
to other instances via various types of relationships. The instances themselves may
be described by various attributes. What emerges is a network or graph of instances
(or nodes), connected by links (or edges). Both the nodes and edges in the graph
may have several attributes that may be numerical or categorical, or even more
complex (e.g., time series data). Increasingly, today’s massive data is in the form
of such graphs or networks. Examples include the World Wide Web (with its Web
pages and hyperlinks), social networks (wikis, blogs, tweets, and other social media
data), semantic networks (ontologies), biological networks (protein interactions, gene
regulation networks, metabolic pathways), citation networks for scientific literature,
and so on. In this chapter we look at the analysis of the link structure in graphs that
arise from these kinds of networks. We will study basic topological properties as well
as models that give rise to such graphs.
4.1 GRAPH CONCEPTS

Graphs
Formally, a graph $G = (V, E)$ is a mathematical structure consisting of a finite nonempty set $V$ of vertices or nodes, and a set $E \subseteq V \times V$ of edges consisting of unordered pairs of vertices. An edge from a node to itself, $(v_i, v_i)$, is called a loop. An undirected graph without loops is called a simple graph. Unless mentioned explicitly, we will consider a graph to be simple. An edge $e = (v_i, v_j)$ between $v_i$ and $v_j$ is said to be incident with nodes $v_i$ and $v_j$; in this case we also say that $v_i$ and $v_j$ are adjacent to one another, and that they are neighbors. The number of nodes in the graph $G$, given as $|V| = n$, is called the order of the graph, and the number of edges in the graph, given as $|E| = m$, is called the size of $G$.
A directed graph or digraph has an edge set $E$ consisting of ordered pairs of vertices. A directed edge $(v_i, v_j)$ is also called an arc, and is said to be from $v_i$ to $v_j$. We also say that $v_i$ is the tail and $v_j$ the head of the arc.
A weighted graph consists of a graph together with a weight $w_{ij}$ for each edge $(v_i, v_j) \in E$. Every graph can be considered to be a weighted graph in which the edges have weight one.
Subgraphs
A graph $H = (V_H, E_H)$ is called a subgraph of $G = (V, E)$ if $V_H \subseteq V$ and $E_H \subseteq E$. We also say that $G$ is a supergraph of $H$. Given a subset of the vertices $V' \subseteq V$, the induced subgraph $G' = (V', E')$ consists exactly of all the edges present in $G$ between vertices in $V'$. More formally, for all $v_i, v_j \in V'$, $(v_i, v_j) \in E' \iff (v_i, v_j) \in E$. In other words, two nodes are adjacent in $G'$ if and only if they are adjacent in $G$. A (sub)graph is called complete (or a clique) if there exists an edge between all pairs of nodes.
Degree
The degree of a node $v_i \in V$ is the number of edges incident with it, and is denoted as $d(v_i)$ or just $d_i$. The degree sequence of a graph is the list of the degrees of the nodes sorted in non-increasing order.

Let $N_k$ denote the number of vertices with degree $k$. The degree frequency distribution of a graph is given as

$$(N_0, N_1, \ldots, N_t)$$

where $t$ is the maximum degree for a node in $G$. Let $X$ be a random variable denoting the degree of a node. The degree distribution of a graph gives the probability mass function $f$ for $X$, given as

$$\big(f(0), f(1), \ldots, f(t)\big)$$

where $f(k) = P(X = k) = \frac{N_k}{n}$ is the probability of a node with degree $k$, given as the number of nodes $N_k$ with degree $k$, divided by the total number of nodes $n$. In graph analysis, we typically make the assumption that the input graph represents a population, and therefore we write $f$ instead of $\hat f$ for the probability distributions.
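These three quantities follow directly from the adjacency structure; a short sketch (hypothetical Python, with the graph given as a list of neighbor lists):

from collections import Counter

def degree_distribution(adj):
    """Degree sequence, frequency distribution N_k, and PMF f(k) = N_k / n."""
    degrees = [len(nbrs) for nbrs in adj]
    seq = sorted(degrees, reverse=True)    # degree sequence
    Nk = Counter(degrees)                  # degree frequency distribution
    n = len(adj)
    f = {k: Nk[k] / n for k in range(max(degrees) + 1)}
    return seq, Nk, f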
For directed graphs, the indegree of node $v_i$, denoted as $id(v_i)$, is the number of edges with $v_i$ as head, that is, the number of incoming edges at $v_i$. The outdegree of $v_i$, denoted $od(v_i)$, is the number of edges with $v_i$ as the tail, that is, the number of outgoing edges from $v_i$.
Path and Distance
A walk in a graph $G$ between nodes $x$ and $y$ is an ordered sequence of vertices, starting at $x$ and ending at $y$,

$$x = v_0, v_1, \ldots, v_{t-1}, v_t = y$$

such that there is an edge between every pair of consecutive vertices, that is, $(v_{i-1}, v_i) \in E$ for all $i = 1, 2, \ldots, t$. The length of the walk, $t$, is measured in terms of hops, the number of edges along the walk. In a walk, there is no restriction on the number of times a given vertex may appear in the sequence; thus both the vertices and edges may be repeated. A walk starting and ending at the same vertex (i.e., with $y = x$) is called closed. A trail is a walk with distinct edges, and a path is a walk with distinct vertices (with the exception of the start and end vertices). A closed path with length $t \ge 3$ is called a cycle, that is, a cycle begins and ends at the same vertex and has distinct nodes.

A path of minimum length between nodes $x$ and $y$ is called a shortest path, and the length of the shortest path is called the distance between $x$ and $y$, denoted as $d(x, y)$. If no path exists between the two nodes, the distance is assumed to be $d(x, y) = \infty$.

Figure 4.1. (a) A graph (undirected). (b) A directed graph.
Connectedness
Two nodes $v_i$ and $v_j$ are said to be connected if there exists a path between them. A graph is connected if there is a path between all pairs of vertices. A connected component, or just component, of a graph is a maximal connected subgraph. If a graph has only one component it is connected; otherwise it is disconnected, as by definition there cannot be a path between two different components.

For a directed graph, we say that it is strongly connected if there is a (directed) path between all ordered pairs of vertices. We say that it is weakly connected if there exists a path between node pairs only by considering edges as undirected.
Example 4.1. Figure 4.1a shows a graph with $|V| = 8$ vertices and $|E| = 11$ edges. Because $(v_1, v_5) \in E$, we say that $v_1$ and $v_5$ are adjacent. The degree of $v_1$ is $d(v_1) = d_1 = 4$. The degree sequence of the graph is

$$(4, 4, 4, 3, 2, 2, 2, 1)$$

and therefore its degree frequency distribution is given as

$$(N_0, N_1, N_2, N_3, N_4) = (0, 1, 3, 1, 3)$$

We have $N_0 = 0$ because there are no isolated vertices, and $N_4 = 3$ because there are three nodes, $v_1$, $v_4$ and $v_5$, that have degree $k = 4$; the other numbers are obtained in a similar fashion. The degree distribution is given as

$$\big(f(0), f(1), f(2), f(3), f(4)\big) = (0, 0.125, 0.375, 0.125, 0.375)$$

The vertex sequence $(v_3, v_1, v_2, v_5, v_1, v_2, v_6)$ is a walk of length 6 between $v_3$ and $v_6$. We can see that vertices $v_1$ and $v_2$ have been visited more than once. In contrast, the vertex sequence $(v_3, v_4, v_7, v_8, v_5, v_2, v_6)$ is a path of length 6 between $v_3$ and $v_6$. However, this is not the shortest path between them, which happens to be $(v_3, v_1, v_2, v_6)$ with length 3. Thus, the distance between them is given as $d(v_3, v_6) = 3$.

Figure 4.1b shows a directed graph with 8 vertices and 12 edges. We can see that edge $(v_5, v_8)$ is distinct from edge $(v_8, v_5)$. The indegree of $v_7$ is $id(v_7) = 2$, whereas its outdegree is $od(v_7) = 0$. Thus, there is no (directed) path from $v_7$ to any other vertex.
Adjacency Matrix
A graph $G = (V, E)$, with $|V| = n$ vertices, can be conveniently represented in the form of an $n \times n$, symmetric binary adjacency matrix, $\mathbf A$, defined as

$$\mathbf A(i,j) = \begin{cases}1 & \text{if } v_i \text{ is adjacent to } v_j\\ 0 & \text{otherwise}\end{cases}$$

If the graph is directed, then the adjacency matrix $\mathbf A$ is not symmetric, as $(v_i, v_j) \in E$ obviously does not imply that $(v_j, v_i) \in E$.

If the graph is weighted, then we obtain an $n \times n$ weighted adjacency matrix, $\mathbf A$, defined as

$$\mathbf A(i,j) = \begin{cases}w_{ij} & \text{if } v_i \text{ is adjacent to } v_j\\ 0 & \text{otherwise}\end{cases}$$

where $w_{ij}$ is the weight on edge $(v_i, v_j) \in E$. A weighted adjacency matrix can always be converted into a binary one, if desired, by using some threshold $\tau$ on the edge weights:

$$\mathbf A(i,j) = \begin{cases}1 & \text{if } w_{ij} \ge \tau\\ 0 & \text{otherwise}\end{cases} \tag{4.1}$$
Graphs from Data Matrix
Many datasets that are not in the form of a graph can nevertheless be converted into one. Let $\mathbf D = \{\mathbf x_i\}_{i=1}^n$ (with $\mathbf x_i \in \mathbb R^d$) be a dataset consisting of $n$ points in a $d$-dimensional space. We can define a weighted graph $G = (V, E)$, where there exists a node for each point in $\mathbf D$, and there exists an edge between each pair of points, with weight

$$w_{ij} = sim(\mathbf x_i, \mathbf x_j)$$

where $sim(\mathbf x_i, \mathbf x_j)$ denotes the similarity between points $\mathbf x_i$ and $\mathbf x_j$. For instance, similarity can be defined as being inversely related to the Euclidean distance between the points via the transformation

$$w_{ij} = sim(\mathbf x_i, \mathbf x_j) = \exp\left\{-\frac{\|\mathbf x_i - \mathbf x_j\|^2}{2\sigma^2}\right\} \tag{4.2}$$

where $\sigma$ is the spread parameter (equivalent to the standard deviation in the normal density function). This transformation restricts the similarity function $sim()$ to lie in the range $[0, 1]$. One can then choose an appropriate threshold $\tau$ and convert the weighted adjacency matrix into a binary one via Eq. (4.1).
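A small sketch of this construction (hypothetical NumPy code, applying the Gaussian similarity of Eq. (4.2) and then the threshold of Eq. (4.1)):

import numpy as np

def similarity_graph(X, sigma, tau):
    """Binary adjacency matrix from a data matrix X (n x d)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    W = np.exp(-sq / (2 * sigma ** 2))                       # weighted adjacency, Eq. (4.2)
    A = (W >= tau).astype(int)                               # threshold, Eq. (4.1)
    np.fill_diagonal(A, 0)                                   # no loops: keep the graph simple
    return A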
Figure 4.2. Iris similarity graph.

Example 4.2. Figure 4.2 shows the similarity graph for the Iris dataset (see Table 1.1). The pairwise similarity between distinct pairs of points was computed using Eq. (4.2), with $\sigma = 1/\sqrt 2$ (we do not allow loops, to keep the graph simple). The mean similarity between points was 0.197, with a standard deviation of 0.290. A binary adjacency matrix was obtained via Eq. (4.1) using a threshold of $\tau = 0.777$, which results in an edge between points having similarity higher than two standard deviations from the mean. The resulting Iris graph has 150 nodes and 753 edges.

The nodes in the Iris graph in Figure 4.2 have also been categorized according to their class. The circles correspond to class iris-versicolor, the triangles to iris-virginica, and the squares to iris-setosa. The graph has two big components, one of which is exclusively composed of nodes labeled as iris-setosa.
4.2 TOPOLOGICAL ATTRIBUTES

In this section we study some of the purely topological, that is, edge-based or structural, attributes of graphs. These attributes are local if they apply to only a single node (or an edge), and global if they refer to the entire graph.

Degree
We have already defined the degree of a node $v_i$ as the number of its neighbors. A more general definition that holds even when the graph is weighted is as follows:

$$d_i = \sum_j \mathbf A(i,j)$$

The degree is clearly a local attribute of each node. One of the simplest global attributes is the average degree:

$$\mu_d = \frac{\sum_i d_i}{n}$$

The preceding definitions can easily be generalized for (weighted) directed graphs. For example, we can obtain the indegree and outdegree by taking the summation over the incoming and outgoing edges, as follows:

$$id(v_i) = \sum_j \mathbf A(j,i) \qquad od(v_i) = \sum_j \mathbf A(i,j)$$

The average indegree and average outdegree can be obtained likewise.
Average Path Length
The average path length, also called the characteristic path length, of a connected graph is given as

$$\mu_L = \frac{\sum_i \sum_{j>i} d(v_i, v_j)}{\binom{n}{2}} = \frac{2}{n(n-1)} \sum_i \sum_{j>i} d(v_i, v_j)$$

where $n$ is the number of nodes in the graph, and $d(v_i, v_j)$ is the distance between $v_i$ and $v_j$. For a directed graph, the average is over all ordered pairs of vertices:

$$\mu_L = \frac{1}{n(n-1)} \sum_i \sum_j d(v_i, v_j)$$

For a disconnected graph the average is taken over only the connected pairs of vertices.
Eccentricity
The eccentricity of a node $v_i$ is the maximum distance from $v_i$ to any other node in the graph:

$$e(v_i) = \max_j \big\{d(v_i, v_j)\big\}$$

If the graph is disconnected the eccentricity is computed only over pairs of vertices with finite distance, that is, only for vertices connected by a path.
Radius and Diameter
The radius of a connected graph, denoted $r(G)$, is the minimum eccentricity of any node in the graph:

$$r(G) = \min_i \big\{e(v_i)\big\} = \min_i \Big\{\max_j \big\{d(v_i, v_j)\big\}\Big\}$$

The diameter, denoted $d(G)$, is the maximum eccentricity of any vertex in the graph:

$$d(G) = \max_i \big\{e(v_i)\big\} = \max_{i,j} \big\{d(v_i, v_j)\big\}$$

For a disconnected graph, the diameter is the maximum eccentricity over all the connected components of the graph.

The diameter of a graph $G$ is sensitive to outliers. A more robust notion is effective diameter, defined as the minimum number of hops for which a large fraction, typically 90%, of all connected pairs of nodes can reach each other. More formally, let $H(k)$ denote the number of pairs of nodes that can reach each other in $k$ hops or less. The effective diameter is defined as the smallest value of $k$ such that $H(k) \ge 0.9 \times H(d(G))$.
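All of these quantities derive from pairwise hop distances, which for an unweighted graph can be obtained by breadth-first search from every node; a minimal sketch (hypothetical Python, adjacency given as lists of neighbors, graph assumed connected for the radius/diameter helper):

from collections import deque

def bfs_distances(adj, s):
    """Hop distances from source s via BFS; unreachable nodes remain at infinity."""
    dist = [float("inf")] * len(adj)
    dist[s] = 0
    q = deque([s])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if dist[w] == float("inf"):
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def radius_diameter(adj):
    """Radius and diameter from the per-node eccentricities."""
    ecc = [max(bfs_distances(adj, s)) for s in range(len(adj))]
    return min(ecc), max(ecc)

The same distance lists also yield the average path length and the effective diameter by aggregating over all pairs.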
Example 4.3. For the graph in Figure 4.1a, the eccentricity of node $v_4$ is $e(v_4) = 3$ because the node farthest from it is $v_6$ and $d(v_4, v_6) = 3$. The radius of the graph is $r(G) = 2$; both $v_1$ and $v_5$ have the least eccentricity value of 2. The diameter of the graph is $d(G) = 4$, as the largest distance over all the pairs is $d(v_6, v_7) = 4$.

The diameter of the Iris graph is $d(G) = 11$, which corresponds to the bold path connecting the gray nodes in Figure 4.2. The degree distribution for the Iris graph is shown in Figure 4.3. The numbers at the top of each bar indicate the frequency. For example, there are exactly 13 nodes with degree 7, which corresponds to the probability $f(7) = \frac{13}{150} = 0.0867$.

The path length histogram for the Iris graph is shown in Figure 4.4. For instance, 1044 node pairs have a distance of 2 hops between them. With $n = 150$ nodes, there are $\binom{n}{2} = 11{,}175$ pairs. Out of these 6502 pairs are unconnected, and there are a total of 4673 reachable pairs. Out of these a $\frac{4175}{4673} = 0.89$ fraction are reachable in 6 hops, and a $\frac{4415}{4673} = 0.94$ fraction are reachable in 7 hops. Thus, we can determine that the effective diameter is 7. The average path length is 3.58.

Figure 4.3. Iris graph: degree distribution.

Figure 4.4. Iris graph: path length histogram (frequencies 753, 1044, 831, 668, 529, 330, 240, 146, 90, 30, 12 for path lengths 1 through 11).
Clustering Coefficient
The clustering coefficient of a node $v_i$ is a measure of the density of edges in the neighborhood of $v_i$. Let $G_i = (V_i, E_i)$ be the subgraph induced by the neighbors of vertex $v_i$. Note that $v_i \notin V_i$, as we assume that $G$ is simple. Let $|V_i| = n_i$ be the number of neighbors of $v_i$, and $|E_i| = m_i$ be the number of edges among the neighbors of $v_i$. The clustering coefficient of $v_i$ is defined as

$$C(v_i) = \frac{\text{no. of edges in } G_i}{\text{maximum number of edges in } G_i} = \frac{m_i}{\binom{n_i}{2}} = \frac{2 \cdot m_i}{n_i(n_i - 1)}$$

The clustering coefficient gives an indication about the "cliquishness" of a node's neighborhood, because the denominator corresponds to the case when $G_i$ is a complete subgraph.

The clustering coefficient of a graph $G$ is simply the average clustering coefficient over all the nodes, given as

$$C(G) = \frac{1}{n}\sum_i C(v_i)$$

Because $C(v_i)$ is well defined only for nodes with degree $d(v_i) \ge 2$, we can define $C(v_i) = 0$ for nodes with degree less than 2. Alternatively, we can take the summation only over nodes with $d(v_i) \ge 2$.
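A direct sketch of these definitions (hypothetical Python, with the graph as a dict mapping each node to its set of neighbors):

def clustering_coefficient(adj, i):
    """C(v_i): fraction of neighbor pairs of v_i that are themselves adjacent."""
    nbrs = adj[i]
    n_i = len(nbrs)
    if n_i < 2:
        return 0.0      # convention: C(v_i) = 0 for degree less than 2
    m_i = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * m_i / (n_i * (n_i - 1))

def graph_clustering(adj):
    """Average clustering coefficient C(G) over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)

# Toy usage on a 4-node graph given as a dict of neighbor sets
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(graph_clustering(adj))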
The clustering coefficient $C(v_i)$ of a node is closely related to the notion of transitive relationships in a graph or network. That is, if there exists an edge between $v_i$ and $v_j$, and another between $v_i$ and $v_k$, then how likely are $v_j$ and $v_k$ to be linked or connected to each other. Define the subgraph composed of the edges $(v_i, v_j)$ and $(v_i, v_k)$ to be a connected triple centered at $v_i$. A connected triple centered at $v_i$ that includes $(v_j, v_k)$ is called a triangle (a complete subgraph of size 3). The clustering coefficient of node $v_i$ can be expressed as

$$C(v_i) = \frac{\text{no. of triangles including } v_i}{\text{no. of connected triples centered at } v_i}$$

Note that the number of connected triples centered at $v_i$ is simply $\binom{d_i}{2} = \frac{n_i(n_i-1)}{2}$, where $d_i = n_i$ is the number of neighbors of $v_i$.

Generalizing the aforementioned notion to the entire graph yields the transitivity of the graph, defined as

$$T(G) = \frac{3 \times \text{no. of triangles in } G}{\text{no. of connected triples in } G}$$

The factor 3 in the numerator is due to the fact that each triangle contributes to three connected triples centered at each of its three vertices. Informally, transitivity measures the degree to which a friend of your friend is also your friend, say, in a social network.
Efficiency
The efficiency for a pair of nodes $v_i$ and $v_j$ is defined as $\frac{1}{d(v_i, v_j)}$. If $v_i$ and $v_j$ are not connected, then $d(v_i, v_j) = \infty$ and the efficiency is $1/\infty = 0$. As such, the smaller the distance between the nodes, the more "efficient" the communication between them. The efficiency of a graph $G$ is the average efficiency over all pairs of nodes, whether connected or not, given as

$$\frac{2}{n(n-1)}\sum_i \sum_{j>i} \frac{1}{d(v_i, v_j)}$$

The maximum efficiency value is 1, which holds for a complete graph.

The local efficiency for a node $v_i$ is defined as the efficiency of the subgraph $G_i$ induced by the neighbors of $v_i$. Because $v_i \notin G_i$, the local efficiency is an indication of the local fault tolerance, that is, how efficient is the communication between neighbors of $v_i$ when $v_i$ is removed or deleted from the graph.
Example 4.4. For the graph in Figure 4.1a, consider node $v_4$. Its neighborhood graph is shown in Figure 4.5. The clustering coefficient of node $v_4$ is given as

$$C(v_4) = \frac{2}{\binom{4}{2}} = \frac{2}{6} = 0.33$$

The clustering coefficient for the entire graph (over all nodes) is given as

$$C(G) = \frac{1}{8}\left(\frac{1}{2} + \frac{1}{3} + 1 + \frac{1}{3} + \frac{1}{3} + 0 + 0 + 0\right) = \frac{2.5}{8} = 0.3125$$

Figure 4.5. Subgraph $G_4$ induced by node $v_4$ (vertices $v_1$, $v_3$, $v_5$, $v_7$).

The local efficiency of $v_4$ is given as

$$\frac{2}{4 \cdot 3}\left(\frac{1}{d(v_1,v_3)} + \frac{1}{d(v_1,v_5)} + \frac{1}{d(v_1,v_7)} + \frac{1}{d(v_3,v_5)} + \frac{1}{d(v_3,v_7)} + \frac{1}{d(v_5,v_7)}\right) = \frac{1}{6}(1 + 1 + 0 + 0.5 + 0 + 0) = \frac{2.5}{6} = 0.417$$
4.3 CENTRALITY ANALYSIS

The notion of centrality is used to rank the vertices of a graph in terms of how "central" or important they are. A centrality can be formally defined as a function $c: V \to \mathbb R$ that induces a total order on $V$. We say that $v_i$ is at least as central as $v_j$ if $c(v_i) \ge c(v_j)$.
4.3.1 Basic Centralities

Degree Centrality
The simplest notion of centrality is the degree $d_i$ of a vertex $v_i$; the higher the degree, the more important or central the vertex. For directed graphs, one may further consider the indegree centrality and outdegree centrality of a vertex.
Eccentricity Centrality
According to this notion, the less eccentric a node is, the more central it is. Eccentricity centrality is thus defined as follows:

$$c(v_i) = \frac{1}{e(v_i)} = \frac{1}{\max_j \{d(v_i, v_j)\}}$$

A node $v_i$ that has the least eccentricity, that is, for which the eccentricity equals the graph radius, $e(v_i) = r(G)$, is called a center node, whereas a node that has the highest eccentricity, that is, for which the eccentricity equals the graph diameter, $e(v_i) = d(G)$, is called a periphery node.

Eccentricity centrality is related to the problem of facility location, that is, choosing the optimum location for a resource or facility. The central node minimizes the maximum distance to any node in the network, and thus the most central node would be an ideal location for, say, a hospital, because it is desirable to minimize the maximum distance someone has to travel to get to the hospital quickly.
Closeness Centrality
Whereas eccentricity centrality uses the maximum of the distances from a given node, closeness centrality uses the sum of all the distances to rank how central a node is:

$$c(v_i) = \frac{1}{\sum_j d(v_i, v_j)}$$

A node $v_i$ with the smallest total distance, $\sum_j d(v_i, v_j)$, is called the median node.

Closeness centrality optimizes a different objective function for the facility location problem. It tries to minimize the total distance over all the other nodes, and thus a median node, which has the highest closeness centrality, is the optimal one to, say, locate a facility such as a new coffee shop or a mall, as in this case it is not as important to minimize the distance for the farthest node.
Betweenness Centrality
For a given vertex $v_i$ the betweenness centrality measures how many shortest paths between all pairs of vertices include $v_i$. This gives an indication as to the central "monitoring" role played by $v_i$ for various pairs of nodes. Let $\eta_{jk}$ denote the number of shortest paths between vertices $v_j$ and $v_k$, and let $\eta_{jk}(v_i)$ denote the number of such paths that include or contain $v_i$. Then the fraction of paths through $v_i$ is denoted as

$$\gamma_{jk}(v_i) = \frac{\eta_{jk}(v_i)}{\eta_{jk}}$$

If the two vertices $v_j$ and $v_k$ are not connected, we assume $\gamma_{jk} = 0$.

The betweenness centrality for a node $v_i$ is defined as

$$c(v_i) = \sum_{j \ne i} \sum_{\substack{k \ne i,\ k > j}} \gamma_{jk}(v_i) = \sum_{j \ne i} \sum_{\substack{k \ne i,\ k > j}} \frac{\eta_{jk}(v_i)}{\eta_{jk}} \tag{4.3}$$
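Eq. (4.3) can be evaluated efficiently with Brandes' accumulation scheme, sketched below for unweighted, undirected graphs; this is the standard published algorithm rendered as hypothetical Python, not code from the book:

from collections import deque

def betweenness(adj):
    """Betweenness centrality for all nodes (adjacency as lists of neighbors)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        order, preds = [], [[] for _ in range(n)]
        sigma = [0] * n              # number of shortest paths from s
        dist = [-1] * n
        sigma[s], dist[s] = 1, 0
        q = deque([s])
        while q:                     # BFS that also counts shortest paths
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n            # back-propagate pair dependencies
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return [b / 2 for b in bc]       # each unordered pair was counted twice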
Example 4.5. Consider Figure 4.1a. The values for the different node centrality measures are given in Table 4.1. According to degree centrality, nodes $v_1$, $v_4$, and $v_5$ are the most central. The eccentricity centrality is the highest for the center nodes in the graph, which are $v_1$ and $v_5$. It is the least for the periphery nodes, of which there are two, $v_6$ and $v_7$.

Nodes $v_1$ and $v_5$ have the highest closeness centrality value. In terms of betweenness, vertex $v_5$ is the most central, with a value of 6.5. We can compute this value by considering only those pairs of nodes $v_j$ and $v_k$ that have at least one shortest path passing through $v_5$.
Figure 4.6. Example graph (a), adjacency matrix (b), and its transpose (c).

$$\mathbf A = \begin{pmatrix}0&0&0&1&0\\0&0&1&0&1\\1&0&0&0&0\\0&1&1&0&1\\0&1&0&0&0\end{pmatrix} \qquad \mathbf A^T = \begin{pmatrix}0&0&1&0&0\\0&0&0&1&1\\0&1&0&1&0\\1&0&0&0&0\\0&1&0&1&0\end{pmatrix}$$

For example, in Figure 4.6, the prestige of $v_5$ depends on the prestige of $v_2$ and $v_4$. Across all the nodes, we can recursively express the prestige scores as

$$\mathbf p' = \mathbf A^T \mathbf p \tag{4.4}$$

where $\mathbf p$ is an $n$-dimensional column vector corresponding to the prestige scores for each vertex.
Starting from an initial prestige vector we can use Eq. (4.4) to obtain an updated prestige vector in an iterative manner. In other words, if $\mathbf p_{k-1}$ is the prestige vector across all the nodes at iteration $k - 1$, then the updated prestige vector at iteration $k$ is given as

$$\mathbf p_k = \mathbf A^T \mathbf p_{k-1} = \mathbf A^T(\mathbf A^T \mathbf p_{k-2}) = (\mathbf A^T)^2 \mathbf p_{k-2} = (\mathbf A^T)^2(\mathbf A^T \mathbf p_{k-3}) = (\mathbf A^T)^3 \mathbf p_{k-3} = \cdots = (\mathbf A^T)^k \mathbf p_0$$

where $\mathbf p_0$ is the initial prestige vector. It is well known that the vector $\mathbf p_k$ converges to the dominant eigenvector of $\mathbf A^T$ with increasing $k$.
The dominant eigenvector of $\mathbf A^T$ and the corresponding eigenvalue can be computed using the power iteration approach whose pseudo-code is shown in Algorithm 4.1. The method starts with the vector $\mathbf p_0$, which can be initialized to the vector $(1, 1, \ldots, 1)^T \in \mathbb R^n$. In each iteration, we multiply on the left by $\mathbf A^T$, and scale the intermediate $\mathbf p_k$ vector by dividing it by the maximum entry $\mathbf p_k[i]$ in $\mathbf p_k$ to prevent numeric overflow. The ratio of the maximum entry in iteration $k$ to that in $k - 1$, given as $\lambda = \frac{\mathbf p_k[i]}{\mathbf p_{k-1}[i]}$, yields an estimate for the eigenvalue. The iterations continue until the difference between successive eigenvector estimates falls below some threshold $\epsilon > 0$.
ALGORITHM 4.1. Power Iteration Method: Dominant Eigenvector

POWERITERATION (A, ε):
 1: k ← 0                         // iteration
 2: p_0 ← 1 ∈ R^n                 // initial vector
 3: repeat
 4:     k ← k + 1
 5:     p_k ← A^T p_{k−1}          // eigenvector estimate
 6:     i ← arg max_j { p_k[j] }   // maximum value index
 7:     λ ← p_k[i] / p_{k−1}[i]    // eigenvalue estimate
 8:     p_k ← (1 / p_k[i]) p_k     // scale vector
 9: until ‖p_k − p_{k−1}‖ ≤ ε
10: p ← (1 / ‖p_k‖) p_k            // normalize eigenvector
11: return p, λ
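A direct translation of Algorithm 4.1 into runnable form (a hypothetical Python/NumPy sketch written for this transcript):

import numpy as np

def power_iteration(A, eps=1e-6):
    """Dominant eigenvector and eigenvalue of A^T, mirroring Algorithm 4.1."""
    p = np.ones(A.shape[0])              # p_0 = (1, ..., 1)^T
    while True:
        p_new = A.T @ p                  # eigenvector estimate
        i = np.argmax(p_new)             # index of maximum entry
        lam = p_new[i] / p[i]            # eigenvalue estimate
        p_new = p_new / p_new[i]         # scale to prevent overflow
        done = np.linalg.norm(p_new - p) <= eps
        p = p_new
        if done:
            break
    return p / np.linalg.norm(p), lam    # unit-norm eigenvector

On the matrix A of Figure 4.6 this converges to λ ≈ 1.466, in line with Example 4.6 below.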
Table 4.2. Power method via scaling

p0        p1               p2                p3
1     1    → 0.5       1    → 0.67       1    → 0.75
1     2    → 1         1.5  → 1          1.33 → 1
1     2    → 1         1.5  → 1          1.33 → 1
1     1    → 0.5       0.5  → 0.33       0.67 → 0.5
1     2    → 1         1.5  → 1          1.33 → 1
λ         2                1.5               1.33

      p4               p5                p6                p7
1    → 0.67       1    → 0.67       1    → 0.69       1    → 0.68
1.5  → 1          1.5  → 1          1.44 → 1          1.46 → 1
1.5  → 1          1.5  → 1          1.44 → 1          1.46 → 1
0.75 → 0.5        0.67 → 0.44       0.67 → 0.46       0.69 → 0.47
1.5  → 1          1.5  → 1          1.44 → 1          1.46 → 1
λ     1.5              1.5               1.444             1.462
Example 4.6. Consider the example shown in Figure 4.6. Starting with an initial prestige vector $\mathbf p_0 = (1, 1, 1, 1, 1)^T$, in Table 4.2 we show several iterations of the power method for computing the dominant eigenvector of $\mathbf A^T$. In each iteration we obtain $\mathbf p_k = \mathbf A^T \mathbf p_{k-1}$. For example,

$$\mathbf p_1 = \mathbf A^T \mathbf p_0 = \begin{pmatrix}0&0&1&0&0\\0&0&0&1&1\\0&1&0&1&0\\1&0&0&0&0\\0&1&0&1&0\end{pmatrix}\begin{pmatrix}1\\1\\1\\1\\1\end{pmatrix} = \begin{pmatrix}1\\2\\2\\1\\2\end{pmatrix}$$

Before the next iteration, we scale $\mathbf p_1$ by dividing each entry by the maximum value in the vector, which is 2 in this case, to obtain

$$\mathbf p_1 = \frac{1}{2}\begin{pmatrix}1\\2\\2\\1\\2\end{pmatrix} = \begin{pmatrix}0.5\\1\\1\\0.5\\1\end{pmatrix}$$

As $k$ becomes large, we get

$$\mathbf p_k = \mathbf A^T \mathbf p_{k-1} \simeq \lambda\,\mathbf p_{k-1}$$

which implies that the ratio of the maximum element of $\mathbf p_k$ to that of $\mathbf p_{k-1}$ should approach $\lambda$. The table shows this ratio for successive iterations. We can see in Figure 4.7 that within 10 iterations the ratio converges to $\lambda = 1.466$. The scaled dominant eigenvector converges to

$$\mathbf p_k = \begin{pmatrix}1\\1.466\\1.466\\0.682\\1.466\end{pmatrix}$$

After normalizing it to be a unit vector, the dominant eigenvector is given as

$$\mathbf p = \begin{pmatrix}0.356\\0.521\\0.521\\0.243\\0.521\end{pmatrix}$$

Thus, in terms of prestige, $v_2$, $v_3$, and $v_5$ have the highest values, as all of them have indegree 2 and are pointed to by nodes with the same incoming values of prestige. On the other hand, although $v_1$ and $v_4$ have the same indegree, $v_1$ is ranked higher, because $v_3$ contributes its prestige to $v_1$, but $v_4$ gets its prestige only from $v_1$.

Figure 4.7. Convergence of the ratio to dominant eigenvalue (λ = 1.466).
PageRank
PageRank is a method for computing the prestige or centrality of nodes in the context of Web search. The Web graph consists of pages (the nodes) connected by hyperlinks (the edges). The method uses the so-called random surfing assumption that a person surfing the Web randomly chooses one of the outgoing links from the current page, or with some very small probability randomly jumps to any of the other pages in the Web graph. The PageRank of a Web page is defined to be the probability of a random web surfer landing at that page. Like prestige, the PageRank of a node $v$ recursively depends on the PageRank of other nodes that point to it.

Normalized Prestige
We assume for the moment that each node $u$ has outdegree at least 1. We discuss later how to handle the case when a node has no outgoing edges. Let $od(u) = \sum_v \mathbf A(u,v)$ denote the outdegree of node $u$. Because a random surfer can choose among any of its outgoing links, if there is a link from $u$ to $v$, then the probability of visiting $v$ from $u$ is $\frac{1}{od(u)}$.

Starting from an initial probability or PageRank $p_0(u)$ for each node, such that

$$\sum_u p_0(u) = 1$$

we can compute an updated PageRank vector for $v$ as follows:

$$p(v) = \sum_u \frac{\mathbf A(u,v)}{od(u)} \cdot p(u) = \sum_u \mathbf N(u,v) \cdot p(u) = \sum_u \mathbf N^T(v,u) \cdot p(u) \tag{4.5}$$
where $\mathbf N$ is the normalized adjacency matrix of the graph, given as

$$\mathbf N(u,v) = \begin{cases}\frac{1}{od(u)} & \text{if } (u,v) \in E\\ 0 & \text{if } (u,v) \notin E\end{cases}$$

Across all nodes, we can express the PageRank vector as follows:

$$\mathbf p' = \mathbf N^T \mathbf p \tag{4.6}$$

So far, the PageRank vector is essentially a normalized prestige vector.
Random Jumps
In the random surfing approach, there is a small probability of jumping from one node to any of the other nodes in the graph, even if they do not have a link between them. In essence, one can think of the Web graph as a (virtual) fully connected directed graph, with an adjacency matrix given as

$$\mathbf A_r = \mathbf 1_{n \times n} = \begin{pmatrix}1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1\\ \vdots & \vdots & \ddots & \vdots\\ 1 & 1 & \cdots & 1\end{pmatrix}$$

Here $\mathbf 1_{n \times n}$ is the $n \times n$ matrix of all ones. For the random surfer matrix, the outdegree of each node is $od(u) = n$, and the probability of jumping from $u$ to any node $v$ is simply $\frac{1}{od(u)} = \frac{1}{n}$. Thus, if one allows only random jumps from one node to another, the PageRank can be computed analogously to Eq. (4.5):

$$p(v) = \sum_u \frac{\mathbf A_r(u,v)}{od(u)} \cdot p(u) = \sum_u \mathbf N_r(u,v) \cdot p(u) = \sum_u \mathbf N_r^T(v,u) \cdot p(u)$$

where $\mathbf N_r$ is the normalized adjacency matrix of the fully connected Web graph, given as

$$\mathbf N_r = \begin{pmatrix}\frac{1}{n} & \cdots & \frac{1}{n}\\ \vdots & \ddots & \vdots\\ \frac{1}{n} & \cdots & \frac{1}{n}\end{pmatrix} = \frac{1}{n}\mathbf A_r = \frac{1}{n}\mathbf 1_{n \times n}$$

Across all the nodes the random jump PageRank vector can be represented as

$$\mathbf p' = \mathbf N_r^T \mathbf p$$
PageRank
The full PageRank is computed by assuming that with some small probability, $\alpha$, a random Web surfer jumps from the current node $u$ to any other random node $v$, and with probability $1 - \alpha$ the user follows an existing link from $u$ to $v$. In other words, we combine the normalized prestige vector, and the random jump vector, to obtain the final PageRank vector, as follows:

$$\mathbf p' = (1 - \alpha)\mathbf N^T \mathbf p + \alpha\,\mathbf N_r^T \mathbf p = \left((1-\alpha)\mathbf N^T + \alpha\,\mathbf N_r^T\right)\mathbf p = \mathbf M^T \mathbf p \tag{4.7}$$

where $\mathbf M = (1 - \alpha)\mathbf N + \alpha\,\mathbf N_r$ is the combined normalized adjacency matrix. The PageRank vector can be computed in an iterative manner, starting with an initial PageRank assignment $\mathbf p_0$, and updating it in each iteration using Eq. (4.7). One minor problem arises if a node $u$ does not have any outgoing edges, that is, when $od(u) = 0$. Such a node acts like a sink for the normalized prestige score. Because there is no outgoing edge from $u$, the only choice $u$ has is to simply jump to another random node. Thus, we need to make sure that if $od(u) = 0$ then for the row corresponding to $u$ in $\mathbf M$, denoted as $\mathbf M_u$, we set $\alpha = 1$, that is,

$$\mathbf M_u = \begin{cases}\mathbf M_u & \text{if } od(u) > 0\\ \frac{1}{n}\mathbf 1_n^T & \text{if } od(u) = 0\end{cases}$$

where $\mathbf 1_n$ is the $n$-dimensional vector of all ones. We can use the power iteration method in Algorithm 4.1 to compute the dominant eigenvector of $\mathbf M^T$.
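Putting the pieces together, a compact sketch (hypothetical NumPy code, not the book's implementation) that builds M with sink-node handling and then iterates Eq. (4.7):

import numpy as np

def pagerank(A, alpha=0.1, iters=100):
    """PageRank via the combined matrix M = (1 - alpha) N + alpha N_r."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    od = A.sum(axis=1)                           # outdegrees
    M = np.full((n, n), alpha / n)               # random-jump part alpha * N_r
    for u in range(n):
        if od[u] > 0:
            M[u] += (1 - alpha) * A[u] / od[u]   # link-following part
        else:
            M[u] = 1.0 / n                       # sink node: jump anywhere (alpha = 1)
    p = np.ones(n) / n
    for _ in range(iters):
        p = M.T @ p                              # Eq. (4.7)
    return p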
Example 4.7. Consider the graph in Figure 4.6. The normalized adjacency matrix is given as

$$\mathbf N = \begin{pmatrix}0&0&0&1&0\\ 0&0&0.5&0&0.5\\ 1&0&0&0&0\\ 0&0.33&0.33&0&0.33\\ 0&1&0&0&0\end{pmatrix}$$

Because there are $n = 5$ nodes in the graph, the normalized random jump adjacency matrix is given as

$$\mathbf N_r = \begin{pmatrix}0.2&0.2&0.2&0.2&0.2\\ 0.2&0.2&0.2&0.2&0.2\\ 0.2&0.2&0.2&0.2&0.2\\ 0.2&0.2&0.2&0.2&0.2\\ 0.2&0.2&0.2&0.2&0.2\end{pmatrix}$$

Assuming that $\alpha = 0.1$, the combined normalized adjacency matrix is given as

$$\mathbf M = 0.9\,\mathbf N + 0.1\,\mathbf N_r = \begin{pmatrix}0.02&0.02&0.02&0.92&0.02\\ 0.02&0.02&0.47&0.02&0.47\\ 0.92&0.02&0.02&0.02&0.02\\ 0.02&0.32&0.32&0.02&0.32\\ 0.02&0.92&0.02&0.02&0.02\end{pmatrix}$$

Computing the dominant eigenvector and eigenvalue of $\mathbf M^T$ we obtain $\lambda = 1$ and

$$\mathbf p = \begin{pmatrix}0.419\\0.546\\0.417\\0.422\\0.417\end{pmatrix}$$

Node $v_2$ has the highest PageRank value.
Hub and Authority Scores
Note that the PageRank of a node is independent of any query that a user may pose,
as it is a global value for a Web page. However, for a specific user query, a page
with a high global PageRank may not be that relevant. One would like to have a
query-specific notion of the PageRank or prestige of a page. The Hyperlink Induced
Topic Search (HITS) method is designed to do this. In fact, it computes two values to
judgetheimportance ofa page.The
authorityscore
of apageis analogous toPageRank
or prestige, and it depends on how many “good” pages point to it. On the other hand,
the
hub score
of a page is based on how many “good” pages it points to. In other
words, a page with high authority has many hub pages pointing to it, and a page with
high hub score points to many pages that have high authority.
4.3 Centrality Analysis
111
Given a user query the HITS method first uses standard search engines to retrieve
the set of relevant pages. It then expands this set to include any pages that point to
some page in the set, or any pages that are pointed to by some page in the set. Any
pages originating from the same host are eliminated. HITS is applied only on this
expanded query specific graph
G
.
We denote by
a(u)
the authority score and by
h(u)
the hub score of node
u
. The
authority score depends on the hub score and vice versa in the following manner:
a(v)
=
u
A
T
(v,u)
·
h(u)
h(v)
=
u
A
(v,u)
·
a(u)
In matrix notation, we obtain
a
′
=
A
T
h
h
′
=
Aa
In fact, we can rewrite the above recursively as follows:
a
k
=
A
T
h
k
−
1
=
A
T
(
Aa
k
−
1
)
=
(
A
T
A
)
a
k
−
1
h
k
=
Aa
k
−
1
=
A
(
A
T
h
k
−
1
)
=
(
AA
T
)
h
k
−
1
In other words, as k → ∞, the authority score converges to the dominant eigenvector of A^T A, whereas the hub score converges to the dominant eigenvector of AA^T. The power iteration method can be used to compute the eigenvector in both cases. Starting with an initial authority vector a = 1_n, the vector of all ones, we can compute the vector h = Aa. To prevent numeric overflows, we scale the vector by dividing by the maximum element. Next, we can compute a = A^T h, and scale it too, which completes one iteration. This process is repeated until both a and h converge.
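A minimal sketch of this iteration in Python/NumPy follows; the adjacency matrix is the one used in Example 4.8 below, and the iteration cap and tolerance are illustrative assumptions.

```python
import numpy as np

# Adjacency matrix of the graph used in Example 4.8.
A = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
])
a = np.ones(A.shape[0])    # initial authority vector a = 1_n
for _ in range(100):
    h = A @ a              # hub update
    h /= h.max()           # scale by the maximum element
    a_next = A.T @ h       # authority update
    a_next /= a_next.max()
    if np.allclose(a, a_next, atol=1e-9):
        break
    a = a_next
print(a / np.linalg.norm(a))   # ~ (0, 0.46, 0.63, 0, 0.63)
print(h / np.linalg.norm(h))   # ~ (0, 0.58, 0, 0.79, 0.21)
```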
Example 4.8. For the graph in Figure 4.6, we can iteratively compute the authority and hub score vectors, by starting with a = (1, 1, 1, 1, 1)^T. In the first iteration, we have

$$\mathbf{h} = \mathbf{A}\mathbf{a} = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 1 \\ 3 \\ 1 \end{pmatrix}$$

After scaling by dividing by the maximum value 3, we get

$$\mathbf{h}' = \begin{pmatrix} 0.33 \\ 0.67 \\ 0.33 \\ 1 \\ 0.33 \end{pmatrix}$$

Next we update a as follows:

$$\mathbf{a} = \mathbf{A}^T \mathbf{h}' = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0
\end{pmatrix} \begin{pmatrix} 0.33 \\ 0.67 \\ 0.33 \\ 1 \\ 0.33 \end{pmatrix} = \begin{pmatrix} 0.33 \\ 1.33 \\ 1.67 \\ 0.33 \\ 1.67 \end{pmatrix}$$

After scaling by dividing by the maximum value 1.67, we get

$$\mathbf{a}' = \begin{pmatrix} 0.2 \\ 0.8 \\ 1 \\ 0.2 \\ 1 \end{pmatrix}$$

This sets the stage for the next iteration. The process continues until a and h converge to the dominant eigenvectors of A^T A and AA^T, respectively, given as

$$\mathbf{a} = \begin{pmatrix} 0 \\ 0.46 \\ 0.63 \\ 0 \\ 0.63 \end{pmatrix} \qquad \mathbf{h} = \begin{pmatrix} 0 \\ 0.58 \\ 0 \\ 0.79 \\ 0.21 \end{pmatrix}$$

From these scores, we conclude that v4 has the highest hub score because it points to three nodes – v2, v3, and v5 – with good authority. On the other hand, both v3 and v5 have high authority scores, as the two nodes v4 and v2 with the highest hub scores point to them.
4.4 GRAPH MODELS

Surprisingly, many real-world networks exhibit certain common characteristics, even though the underlying data can come from vastly different domains, such as social networks, biological networks, telecommunication networks, and so on. A natural question is to understand the underlying processes that might give rise to such real-world networks. We consider several network measures that will allow us to compare and contrast different graph models. Real-world networks are usually large and sparse. By large we mean that the order or the number of nodes n is very large, and by sparse we mean that the graph size or number of edges m = O(n). The models we study below make a similar assumption that the graphs are large and sparse.
Small-world Property
It has been observed that many real-world graphs exhibit the so-called small-world property that there is a short path between any pair of nodes. We say that a graph G exhibits small-world behavior if the average path length µ_L scales logarithmically with the number of nodes in the graph, that is, if

$$\mu_L \propto \log n$$

where n is the number of nodes in the graph. A graph is said to have ultra-small-world property if the average path length is much smaller than log n, that is, if µ_L ≪ log n.
Scale-free Property
In many real-world graphs it has been observed that the empirical degree distribution f(k) exhibits a scale-free behavior captured by a power-law relationship with k, that is, the probability that a node has degree k satisfies the condition

$$f(k) \propto k^{-\gamma} \tag{4.8}$$

Intuitively, a power law indicates that the vast majority of nodes have very small degrees, whereas there are a few "hub" nodes that have high degrees, that is, they connect to or interact with lots of nodes. A power-law relationship leads to a scale-free or scale invariant behavior because scaling the argument by some constant c does not change the proportionality. To see this, let us rewrite Eq. (4.8) as an equality by introducing a proportionality constant α that does not depend on k, that is,

$$f(k) = \alpha k^{-\gamma} \tag{4.9}$$

Then we have

$$f(ck) = \alpha (ck)^{-\gamma} = (\alpha c^{-\gamma})\, k^{-\gamma} \propto k^{-\gamma}$$

Also, taking the logarithm on both sides of Eq. (4.9) gives

$$\log f(k) = \log(\alpha k^{-\gamma}) \quad \text{or} \quad \log f(k) = -\gamma \log k + \log \alpha$$

which is the equation of a straight line in the log-log plot of k versus f(k), with −γ giving the slope of the line. Thus, the usual approach to check whether a graph has scale-free behavior is to perform a least-square fit of the points (log k, log f(k)) to a line, as illustrated in Figure 4.8a.
In practice, one of the problems with estimating the degree distribution for a graph is the high level of noise for the higher degrees, where frequency counts are the lowest. One approach to address the problem is to use the cumulative degree distribution F(k), which tends to smooth out the noise. In particular, we use F^c(k) = 1 − F(k), which gives the probability that a randomly chosen node has degree greater than k. If f(k) ∝ k^{−γ}, and assuming that γ > 1, we have

$$F^c(k) = 1 - F(k) = 1 - \sum_{x=0}^{k} f(x) = \sum_{x=k}^{\infty} f(x) = \sum_{x=k}^{\infty} x^{-\gamma}
\simeq \int_{k}^{\infty} x^{-\gamma}\,dx = \left.\frac{x^{-\gamma+1}}{-\gamma+1}\right|_{k}^{\infty} = \frac{1}{(\gamma-1)} \cdot k^{-(\gamma-1)} \propto k^{-(\gamma-1)}$$
[Figure 4.8 shows two log-log plots: (a) the degree distribution, log2 k versus log2 f(k), with fitted slope −γ = −2.15; (b) the cumulative degree distribution, log2 k versus log2 F^c(k), with fitted slope −(γ − 1) = −1.85.]

Figure 4.8. Degree distribution and its cumulative distribution.
In other words, the log-log plot of F^c(k) versus k will also be a power law with slope −(γ − 1) as opposed to −γ. Owing to the smoothing effect, plotting log k versus log F^c(k) and observing the slope gives a better estimate of the power law, as illustrated in Figure 4.8b.
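A minimal sketch of this least-squares estimate in Python/NumPy is shown below; `degrees` is a hypothetical array of node degrees, and the cutoffs a real analysis would need (e.g., ignoring the extremal degrees, as in Example 4.9) are omitted.

```python
import numpy as np

def fit_gamma(degrees: np.ndarray) -> float:
    """Estimate the power-law exponent gamma from a degree sample."""
    k, counts = np.unique(degrees[degrees > 0], return_counts=True)
    f = counts / counts.sum()                      # empirical f(k)
    slope, _ = np.polyfit(np.log2(k), np.log2(f), 1)
    return -slope                                  # slope of the fit is -gamma
```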
Clustering Effect
Real-world graphs often also exhibit a clustering effect, that is, two nodes are more likely to be connected if they share a common neighbor. The clustering effect is captured by a high clustering coefficient for the graph G. Let C(k) denote the average clustering coefficient for all nodes with degree k; then the clustering effect also manifests itself as a power-law relationship between C(k) and k:

$$C(k) \propto k^{-\gamma}$$

In other words, a log-log plot of k versus C(k) exhibits a straight line behavior with negative slope −γ. Intuitively, the power-law behavior indicates hierarchical clustering of the nodes. That is, nodes that are sparsely connected (i.e., have smaller degrees) are part of highly clustered areas (i.e., have higher average clustering coefficients). Further, only a few hub nodes (with high degrees) connect these clustered areas (the hub nodes have smaller clustering coefficients).
Example 4.9. Figure 4.8a plots the degree distribution for a graph of human protein interactions, where each node is a protein and each edge indicates if the two incident proteins interact experimentally. The graph has n = 9521 nodes and m = 37,060 edges. A linear relationship between log k and log f(k) is clearly visible, although very small and very large degree values do not fit the linear trend. The best fit line after ignoring the extremal degrees yields a value of γ = 2.15. The plot of log k versus log F^c(k) makes the linear fit quite prominent. The slope obtained here is −(γ − 1) = −1.85, that is, γ = 2.85. We can conclude that the graph exhibits scale-free behavior (except at the degree extremes), with γ somewhere between 2 and 3, as is typical of many real-world graphs.

The diameter of the graph is d(G) = 14, which is very close to log2 n = log2(9521) = 13.22. The network is thus small-world.

Figure 4.9 plots the average clustering coefficient as a function of degree. The log-log plot has a very weak linear trend, as observed from the line of best fit that gives a slope of −γ = −0.55. We can conclude that the graph exhibits weak hierarchical clustering behavior.
[Figure 4.9 shows a log-log plot of degree (log2 k) versus average clustering coefficient (log2 C(k)), with fitted slope −γ = −0.55.]

Figure 4.9. Average clustering coefficient distribution.
4.4.1 Erdős–Rényi Random Graph Model

The Erdős–Rényi (ER) model generates a random graph such that any of the possible graphs with a fixed number of nodes and edges has equal probability of being chosen. The ER model has two parameters: the number of nodes n and the number of edges m. Let M denote the maximum number of edges possible among the n nodes, that is,

$$M = \binom{n}{2} = \frac{n(n-1)}{2}$$

The ER model specifies a collection of graphs G(n,m) with n nodes and m edges, such that each graph G ∈ G has equal probability of being selected:

$$P(G) = \frac{1}{\binom{M}{m}} = \binom{M}{m}^{-1}$$

where $\binom{M}{m}$ is the number of possible graphs with m edges (with n nodes) corresponding to the ways of choosing the m edges out of a total of M possible edges.
Let V = {v1, v2, ..., vn} denote the set of n nodes. The ER method chooses a random graph G = (V, E) ∈ G via a generative process. At each step, it randomly selects two distinct vertices vi, vj ∈ V, and adds an edge (vi, vj) to E, provided the edge is not already in the graph G. The process is repeated until exactly m edges have been added to the graph.
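A minimal sketch of this generative process in Python is given below; the function name and edge representation are illustrative, not from the text.

```python
import random

def er_graph(n: int, m: int) -> set:
    """Sample an ER random graph G(n, m) via the generative process above."""
    assert m <= n * (n - 1) // 2, "m cannot exceed M = n(n-1)/2"
    edges = set()
    while len(edges) < m:
        # pick two distinct vertices uniformly at random (no loops)
        vi, vj = random.sample(range(n), 2)
        edges.add((min(vi, vj), max(vi, vj)))  # the set skips duplicate edges
    return edges
```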
Let X be a random variable denoting the degree of a node for G ∈ G. Let p denote the probability of an edge in G, which can be computed as

$$p = \frac{m}{M} = \frac{m}{\binom{n}{2}} = \frac{2m}{n(n-1)}$$
Average Degree
For any given node in G its degree can be at most n − 1 (because we do not allow loops). Because p is the probability of an edge for any node, the random variable X, corresponding to the degree of a node, follows a binomial distribution with probability of success p, given as

$$f(k) = P(X = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$$

The average degree µ_d is then given as the expected value of X:

$$\mu_d = E[X] = (n-1)p$$

We can also compute the variance of the degrees among the nodes by computing the variance of X:

$$\sigma_d^2 = \mathrm{var}(X) = (n-1)p(1-p)$$
Degree Distribution
To obtain the degree distribution for large and sparse random graphs, we need to derive an expression for f(k) = P(X = k) as n → ∞. Assuming that m = O(n), we can write

$$p = \frac{m}{n(n-1)/2} = \frac{O(n)}{n(n-1)/2} = \frac{1}{O(n)} \to 0$$

In other words, we are interested in the asymptotic behavior of the graphs as n → ∞ and p → 0.
Under these two trends, notice that the expected value and variance of X can be rewritten as

$$E[X] = (n-1)p \simeq np \ \text{ as } n \to \infty$$

$$\mathrm{var}(X) = (n-1)p(1-p) \simeq np \ \text{ as } n \to \infty \text{ and } p \to 0$$

In other words, for large and sparse random graphs the expectation and variance of X are the same:

$$E[X] = \mathrm{var}(X) = np$$

and the binomial distribution can be approximated by a Poisson distribution with parameter λ, given as

$$f(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where λ = np represents both the expected value and variance of the distribution.
Using Stirling's approximation of the factorial k! ≃ k^k e^{−k} √(2πk) we obtain

$$f(k) = \frac{\lambda^k e^{-\lambda}}{k!} \simeq \frac{\lambda^k e^{-\lambda}}{k^k e^{-k}\sqrt{2\pi k}} = \frac{e^{-\lambda}}{\sqrt{2\pi}} \cdot \frac{(\lambda e)^k}{\sqrt{k}\; k^k}$$

In other words, we have

$$f(k) \propto \alpha^k\, k^{-\frac{1}{2}}\, k^{-k}$$

for α = λe = npe. We conclude that large and sparse random graphs follow a Poisson degree distribution, which does not exhibit a power-law relationship. Thus, in one crucial respect, the ER random graph model is not adequate to describe real-world scale-free graphs.
Clustering Coefficient
Let us consider a node vi in G with degree k. The clustering coefficient of vi is given as

$$C(v_i) = \frac{2 m_i}{k(k-1)}$$

where k = n_i denotes the number of nodes (the neighbors of vi) and m_i denotes the number of edges in the subgraph induced by the neighbors of vi. However, because p is the probability of an edge, the expected number of edges m_i among the neighbors of vi is simply

$$m_i = \frac{p\,k(k-1)}{2}$$

Thus, we obtain

$$C(v_i) = \frac{2 m_i}{k(k-1)} = p$$

In other words, the expected clustering coefficient across all nodes of all degrees is uniform, and thus the overall clustering coefficient is also uniform:

$$C(G) = \frac{1}{n}\sum_i C(v_i) = p$$
Furthermore, for sparse graphs we have p → 0, which in turn implies that C(G) = C(vi) → 0. Thus, large random graphs have no clustering effect whatsoever, which is contrary to many real-world networks.
Diameter
We saw earlier that the expected degree of a node is µ_d = λ, which means that within one hop from a given node, we can reach λ other nodes. Because each of the neighbors of the initial node also has average degree λ, we can approximate the number of nodes that are two hops away as λ². In general, at a coarse level of approximation (i.e., ignoring shared neighbors), we can estimate the number of nodes at a distance of k hops away from a starting node vi as λ^k. However, because there are a total of n distinct vertices in the graph, we have

$$\sum_{k=1}^{t} \lambda^k = n$$

where t denotes the maximum number of hops from vi. We have

$$\sum_{k=1}^{t} \lambda^k = \frac{\lambda^{t+1} - 1}{\lambda - 1} \simeq \lambda^t$$

Plugging into the expression above, we have

$$\lambda^t \simeq n \quad \text{or} \quad t \log\lambda \simeq \log n \quad \text{which implies} \quad t \simeq \frac{\log n}{\log \lambda} \propto \log n$$

Because the path length from a node to the farthest node is bounded by t, it follows that the diameter of the graph is also bounded by that value, that is,

$$d(G) \propto \log n$$

assuming that the expected degree λ is fixed. We can thus conclude that random graphs satisfy at least one property of real-world graphs, namely that they exhibit small-world behavior.
4.4.2 Watts–Strogatz Small-world Graph Model

The random graph model fails to exhibit a high clustering coefficient, but it is small-world. The Watts–Strogatz (WS) model tries to explicitly model high local clustering by starting with a regular network in which each node is connected to its k neighbors on the right and left, assuming that the initial n vertices are arranged in a large circular backbone. Such a network will have a high clustering coefficient, but will not be small-world. Surprisingly, adding a small amount of randomness in the regular network by randomly rewiring some of the edges or by adding a small fraction of random edges leads to the emergence of the small-world phenomena.

The WS model starts with n nodes arranged in a circular layout, with each node connected to its immediate left and right neighbors.
[Figure 4.10 shows n = 8 nodes v0, ..., v7 in a circular layout, each connected to its k = 2 nearest neighbors on either side.]

Figure 4.10. Watts–Strogatz regular graph: n = 8, k = 2.
The edges in the initial layout are called backbone edges. Each node has edges to an additional k − 1 neighbors to the left and right. Thus, the WS model starts with a regular graph of degree 2k, where each node is connected to its k neighbors on the right and k neighbors on the left, as illustrated in Figure 4.10.
Clustering Coefficient and Diameter of Regular Graph
Consider the subgraph G_v induced by the 2k neighbors of a node v. The clustering coefficient of v is given as

$$C(v) = \frac{m_v}{M_v} \tag{4.10}$$

where m_v is the actual number of edges, and M_v is the maximum possible number of edges, among the neighbors of v.
To compute m_v, consider some node r_i that is at a distance of i hops (with 1 ≤ i ≤ k) from v to the right, considering only the backbone edges. The node r_i has edges to k − i of its immediate right neighbors (restricted to the right neighbors of v), and to k − 1 of its left neighbors (all k left neighbors, excluding v). Owing to the symmetry about v, a node l_i that is at a distance of i backbone hops from v to the left has the same number of edges. Thus, the degree of any node in G_v that is i backbone hops away from v is given as

$$d_i = (k - i) + (k - 1) = 2k - i - 1$$
Because each edge contributes to the degree of its two incident nodes, summing the degrees of all neighbors of v, we obtain

$$2 m_v = 2 \sum_{i=1}^{k} (2k - i - 1)$$

$$m_v = 2k^2 - \frac{k(k+1)}{2} - k = \frac{3}{2}\, k(k-1) \tag{4.11}$$
On the other hand, the number of possible edges among the 2k neighbors of v is given as

$$M_v = \binom{2k}{2} = \frac{2k(2k-1)}{2} = k(2k-1)$$

Plugging the expressions for m_v and M_v into Eq. (4.10), the clustering coefficient of a node v is given as

$$C(v) = \frac{m_v}{M_v} = \frac{\frac{3}{2}k(k-1)}{k(2k-1)} = \frac{3k-3}{4k-2}$$

As k increases, the clustering coefficient approaches 3/4 because C(G) = C(v) → 3/4 as k → ∞.
The WS regular graph thus has a high clustering coefficient. However, it does not satisfy the small-world property. To see this, note that along the backbone, the farthest node from v has a distance of at most n/2 hops. Further, because each node is connected to k neighbors on either side, one can reach the farthest node in at most n/2k hops. More precisely, the diameter of a regular WS graph is given as

$$d(G) = \begin{cases} \left\lceil \frac{n}{2k} \right\rceil & \text{if } n \text{ is even} \\ \left\lceil \frac{n-1}{2k} \right\rceil & \text{if } n \text{ is odd} \end{cases}$$

The regular graph has a diameter that scales linearly in the number of nodes, and thus it is not small-world.
Random Perturbation of Regular Graph
Edge Rewiring
Starting with the regular graph of degree 2k, the WS model perturbs the regular structure by adding some randomness to the network. One approach is to randomly rewire edges with probability r. That is, for each edge (u,v) in the graph, with probability r, replace v with another randomly chosen node avoiding loops and duplicate edges. Because the WS regular graph has m = kn total edges, after rewiring, rm of the edges are random, and (1 − r)m are regular.
Edge Shortcuts
An alternative approach is that instead of rewiring edges, we add a few shortcut edges between random pairs of nodes, as shown in Figure 4.11. The total number of random shortcut edges added to the network is given as mr = knr, so that r can be considered as the probability, per edge, of adding a shortcut edge. The total number of edges in the graph is then simply m + mr = (1 + r)m = (1 + r)kn. Because r ∈ [0, 1], the number of edges then lies in the range [kn, 2kn].
In either approach, if the probability r of rewiring or adding shortcut edges is r = 0, then we are left with the original regular graph, with high clustering coefficient, but with no small-world property. On the other hand, if the rewiring or shortcut probability is r = 1, the regular structure is disrupted, and the graph approaches a random graph, with little to no clustering effect, but with small-world property.
[Figure 4.11 shows a WS graph on a circular backbone with a few long-range shortcut edges drawn dotted.]

Figure 4.11. Watts–Strogatz graph (n = 20, k = 3): shortcut edges are shown dotted.
Surprisingly, introducing only a small amount of randomness leads to a significant change in the regular network. As one can see in Figure 4.11, the presence of a few long-range shortcuts reduces the diameter of the network significantly. That is, even for a low value of r, the WS model retains most of the regular local clustering structure, but at the same time becomes small-world.
Properties of Watts–Strogatz Graphs
Degree Distribution
Let us consider the shortcut approach, which is easier to analyze. In this approach, each vertex has degree at least 2k. In addition there are the shortcut edges, which follow a binomial distribution. Each node can have n′ = n − 2k − 1 additional shortcut edges, so we take n′ as the number of independent trials to add edges. Because a node has degree 2k, with shortcut edge probability of r, we expect roughly 2kr shortcuts from that node, but the node can connect to at most n − 2k − 1 other nodes. Thus, we can take the probability of success as

$$p = \frac{2kr}{n - 2k - 1} = \frac{2kr}{n'} \tag{4.12}$$
Let X denote the random variable denoting the number of shortcuts for each node. Then the probability of a node with j shortcut edges is given as

$$f(j) = P(X = j) = \binom{n'}{j} p^j (1-p)^{n'-j}$$

with E[X] = n′p = 2kr. The expected degree of each node in the network is therefore

$$2k + E[X] = 2k + 2kr = 2k(1 + r)$$

It is clear that the degree distribution of the WS graph does not adhere to a power law. Thus, such networks are not scale-free.
Clustering Coefficient
After the shortcut edges have been added, each node v has expected degree 2k(1 + r), that is, it is on average connected to 2kr new neighbors, in addition to the 2k original ones. The number of possible edges among v's neighbors is given as

$$M_v = \frac{2k(1+r)\,\bigl(2k(1+r) - 1\bigr)}{2} = k(1+r)\bigl(2k(1+r) - 1\bigr)$$
initial edges, as given in Eq.(4.11). In addition, some
of the shortcut edges may link pairs of nodes among
v
’s neighbors. Let
Y
be the
randomvariablethatdenotesthenumberofshortcutedgespresentamongthe2
k(
1
+
r)
neighbors of
v
; then
Y
follows a binomial distribution with probability of success
p
, as
given in Eq.(4.12). Thus, the expected number of shortcut edges is given as
E
[
Y
]
=
p
M
v
Let
m
v
be the random variable corresponding to the actual number of edges present
among
v
’s neighbors, whetherregularorshortcutedges.The expectednumberofedges
among the neighbors of
v
is then given as
E
[
m
v
]
=
E
3
k(k
−
1
)
2
+
Y
=
3
k(k
−
1
)
2
+
p
M
v
Because the binomial distribution is essentially concentrated around the mean, we can now approximate the clustering coefficient by using the expected number of edges, as follows:

$$C(v) \simeq \frac{E[m_v]}{M_v} = \frac{\frac{3k(k-1)}{2} + p M_v}{M_v} = \frac{3k(k-1)}{2 M_v} + p = \frac{3(k-1)}{(1+r)\bigl(4kr + 2(2k-1)\bigr)} + \frac{2kr}{n - 2k - 1}$$

using the value of p given in Eq. (4.12). For large graphs we have n → ∞, so we can drop the second term above, to obtain

$$C(v) \simeq \frac{3(k-1)}{(1+r)\bigl(4kr + 2(2k-1)\bigr)} = \frac{3k-3}{4k-2 + 2r(2kr + 4k - 1)} \tag{4.13}$$

As r → 0, the above expression becomes equivalent to Eq. (4.10). Thus, for small values of r the clustering coefficient remains high.
Diameter
Deriving an analytical expression for the diameter of the WS model with random edge shortcuts is not easy. Instead we resort to an empirical study of the behavior of WS graphs when a small number of random shortcuts are added. In Example 4.10 we find that small values of shortcut edge probability r are enough to reduce the diameter from O(n) to O(log n). The WS model thus leads to graphs that are small-world and that also exhibit the clustering effect. However, the WS graphs do not display a scale-free degree distribution.
[Figure 4.12 plots, for the simulation of Example 4.10, the diameter d(G) (circles, left y-axis, starting at 167) and the clustering coefficient C(G) (triangles, right y-axis, starting at 0.6) as functions of the shortcut edge probability r ∈ [0, 0.2].]

Figure 4.12. Watts–Strogatz model: diameter (circles) and clustering coefficient (triangles).
Example 4.10. Figure 4.12 shows a simulation of the WS model, for a graph with n = 1000 vertices and k = 3. The x-axis shows different values of the probability r of adding random shortcut edges. The diameter values are shown as circles using the left y-axis, whereas the clustering values are shown as triangles using the right y-axis. These values are the averages over 10 runs of the WS model. The solid line gives the clustering coefficient from the analytical formula in Eq. (4.13), which is in perfect agreement with the simulation values.

The initial regular graph has diameter

$$d(G) = \left\lceil \frac{n}{2k} \right\rceil = \left\lceil \frac{1000}{6} \right\rceil = 167$$

and its clustering coefficient is given as

$$C(G) = \frac{3(k-1)}{2(2k-1)} = \frac{6}{10} = 0.6$$

We can observe that the diameter quickly reduces, even with very small edge addition probability. For r = 0.005, the diameter is 61. For r = 0.1, the diameter shrinks to 11, which is on the same scale as O(log2 n) because log2 1000 ≃ 10. On the other hand, we can observe that the clustering coefficient remains high. For r = 0.1, the clustering coefficient is 0.48. Thus, the simulation study confirms that the addition of even a small number of random shortcut edges reduces the diameter of the WS regular graph from O(n) (large-world) to O(log n) (small-world). At the same time the graph retains its local clustering property.
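The following is a minimal sketch of the shortcut variant of the WS construction in Python (function name and edge representation are illustrative); pairing it with any shortest-path routine would reproduce the qualitative diameter drop observed above.

```python
import random

def ws_graph(n: int, k: int, r: float) -> set:
    """WS model with shortcuts: circular backbone of degree 2k
    plus roughly r*k*n random shortcut edges."""
    edges = set()
    for i in range(n):                      # regular ring lattice
        for d in range(1, k + 1):
            j = (i + d) % n                 # k neighbors on each side
            edges.add((min(i, j), max(i, j)))
    m = len(edges)                          # m = kn regular edges
    target = m + round(r * m)               # add ~mr = knr shortcut edges
    while len(edges) < target:
        i, j = random.sample(range(n), 2)   # no loops
        edges.add((min(i, j), max(i, j)))   # no duplicate edges
    return edges
```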
4.4.3 Barabási–Albert Scale-free Model

The Barabási–Albert (BA) model tries to capture the scale-free degree distributions of real-world graphs via a generative process that adds new nodes and edges at each time step. Further, the edge growth is based on the concept of preferential attachment; that is, edges from the new vertex are more likely to link to nodes with higher degrees. For this reason the model is also known as the rich get richer approach. The BA model mimics a dynamically growing graph by adding new vertices and edges at each time-step t = 1, 2, .... Let G_t denote the graph at time t, and let n_t denote the number of nodes, and m_t the number of edges in G_t.

Initialization
The BA model starts at time-step t = 0, with an initial graph G_0 with n_0 nodes and m_0 edges. Each node in G_0 should have degree at least 1; otherwise it will never be chosen for preferential attachment. We will assume that each node has initial degree 2, being connected to its left and right neighbors in a circular layout. Thus m_0 = n_0.
Growth and Preferential Attachment
The BA model derives a new graph G_{t+1} from G_t by adding exactly one new node u and adding q ≤ n_0 new edges from u to q distinct nodes v_j ∈ G_t, where node v_j is chosen with probability π_t(v_j) proportional to its degree in G_t, given as

$$\pi_t(v_j) = \frac{d_j}{\sum_{v_i \in G_t} d_i} \tag{4.14}$$
Because only one new vertex is added at each step, the number of nodes in G_t is given as

$$n_t = n_0 + t$$

Further, because exactly q new edges are added at each time-step, the number of edges in G_t is given as

$$m_t = m_0 + qt$$

Because the sum of the degrees is two times the number of edges in the graph, we have

$$\sum_{v_i \in G_t} d(v_i) = 2 m_t = 2(m_0 + qt)$$

We can thus rewrite Eq. (4.14) as

$$\pi_t(v_j) = \frac{d_j}{2(m_0 + qt)} \tag{4.15}$$

As the network grows, owing to preferential attachment, one intuitively expects high degree hubs to emerge.
Example 4.11. Figure 4.13 shows a graph generated according to the BA model, with parameters n_0 = 3, q = 2, and t = 12. Initially, at time t = 0, the graph has n_0 = 3 vertices, namely {v0, v1, v2} (shown in gray), connected by m_0 = 3 edges (shown in bold). At each time step t = 1, ..., 12, vertex v_{t+2} is added to the growing network and is connected to q = 2 vertices chosen with a probability proportional to their degree.

[Figure 4.13 shows the resulting graph on vertices v0, ..., v14.]

Figure 4.13. Barabási–Albert graph (n_0 = 3, q = 2, t = 12).

For example, at t = 1, vertex v3 is added, with edges to v1 and v2, chosen according to the distribution

$$\pi_0(v_i) = 1/3 \ \text{ for } i = 0, 1, 2$$

At t = 2, v4 is added. Using Eq. (4.15), nodes v2 and v3 are preferentially chosen according to the probability distribution

$$\pi_1(v_0) = \pi_1(v_3) = \frac{2}{10} = 0.2 \qquad \pi_1(v_1) = \pi_1(v_2) = \frac{3}{10} = 0.3$$

The final graph after t = 12 time-steps shows the emergence of some hub nodes, such as v1 (with degree 9) and v3 (with degree 6).
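A minimal sketch of the BA growth process in Python is given below (names are illustrative). Sampling a node with probability d_j / 2m_t is done here by choosing uniformly among edge endpoints, since each node appears in that list once per incident edge.

```python
import random

def ba_graph(n0: int, q: int, t: int) -> list:
    """Grow a BA graph: start from a ring of n0 nodes (degree 2 each),
    then add one node with q preferential-attachment edges per time-step."""
    edges = [(i, (i + 1) % n0) for i in range(n0)]   # m0 = n0 ring edges
    for step in range(t):
        u = n0 + step                                # the new vertex
        endpoints = [v for e in edges for v in e]    # node v appears d_v times
        targets = set()
        while len(targets) < q:                      # q distinct target nodes
            targets.add(random.choice(endpoints))    # P(v) = d_v / (2 m_t)
        edges += [(u, v) for v in targets]
    return edges
```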
Degree Distribution
We now study two different approaches to estimate the degree distribution for the BA model, namely the discrete approach, and the continuous approach.

Discrete Approach
The discrete approach is also called the master-equation method. Let X_t be a random variable denoting the degree of a node in G_t, and let f_t(k) denote the probability mass function for X_t. That is, f_t(k) is the degree distribution for the graph G_t at time-step t. Simply put, f_t(k) is the fraction of nodes with degree k at time t. Let n_t denote the number of nodes and m_t the number of edges in G_t. Further, let n_t(k) denote the number of nodes with degree k in G_t. Then we have

$$f_t(k) = \frac{n_t(k)}{n_t}$$
Because we are interested in large real-world graphs, as t → ∞, the number of nodes and edges in G_t can be approximated as

$$n_t = n_0 + t \simeq t \qquad m_t = m_0 + qt \simeq qt \tag{4.16}$$
Based on Eq. (4.14), at time-step t + 1, the probability π_t(k) that some node with degree k in G_t is chosen for preferential attachment can be written as

$$\pi_t(k) = \frac{k \cdot n_t(k)}{\sum_i i \cdot n_t(i)}$$

Dividing the numerator and denominator by n_t, we have

$$\pi_t(k) = \frac{k \cdot \frac{n_t(k)}{n_t}}{\sum_i i \cdot \frac{n_t(i)}{n_t}} = \frac{k \cdot f_t(k)}{\sum_i i \cdot f_t(i)} \tag{4.17}$$
Note that the denominator is simply the expected value of X_t, that is, the mean degree in G_t, because

$$E[X_t] = \mu_d(G_t) = \sum_i i \cdot f_t(i) \tag{4.18}$$

Note also that in any graph the average degree is given as

$$\mu_d(G_t) = \frac{\sum_i d_i}{n_t} = \frac{2 m_t}{n_t} \simeq \frac{2qt}{t} = 2q \tag{4.19}$$

where we used Eq. (4.16), that is, m_t ≃ qt. Equating Eqs. (4.18) and (4.19), we can rewrite the preferential attachment probability [Eq. (4.17)] for a node of degree k as

$$\pi_t(k) = \frac{k \cdot f_t(k)}{2q} \tag{4.20}$$
We now consider the change in the number of nodes with degree k, when a new vertex u joins the growing network at time-step t + 1. The net change in the number of nodes with degree k is given as the number of nodes with degree k at time t + 1 minus the number of nodes with degree k at time t, given as

$$(n_t + 1) \cdot f_{t+1}(k) - n_t \cdot f_t(k)$$

Using the approximation that n_t ≃ t from Eq. (4.16), the net change in degree k nodes is

$$(n_t + 1) \cdot f_{t+1}(k) - n_t \cdot f_t(k) = (t + 1) \cdot f_{t+1}(k) - t \cdot f_t(k) \tag{4.21}$$
The number of nodes with degree k increases whenever u connects to a vertex v_i of degree k − 1 in G_t, as in this case v_i will have degree k in G_{t+1}. Over the q edges added at time t + 1, the number of nodes with degree k − 1 in G_t that are chosen to connect to u is given as

$$q\,\pi_t(k-1) = \frac{q \cdot (k-1) \cdot f_t(k-1)}{2q} = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) \tag{4.22}$$

where we use Eq. (4.20) for π_t(k − 1). Note that Eq. (4.22) holds only when k > q. This is because v_i must have degree at least q, as each node that is added at time t ≥ 1 has initial degree q. Therefore, if d_i = k − 1, then k − 1 ≥ q implies that k > q (we can also ensure that the initial n_0 nodes have degree q by starting with a clique of size n_0 = q + 1).
At the same time, the number of nodes with degree k decreases whenever u connects to a vertex v_i with degree k in G_t, as in this case v_i will have degree k + 1 in G_{t+1}. Using Eq. (4.20), over the q edges added at time t + 1, the number of nodes with degree k in G_t that are chosen to connect to u is given as

$$q \cdot \pi_t(k) = \frac{q \cdot k \cdot f_t(k)}{2q} = \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.23}$$
Based on the preceding discussion, when k > q, the net change in the number of nodes with degree k in G_t is given as the difference between Eqs. (4.22) and (4.23):

$$q \cdot \pi_t(k-1) - q \cdot \pi_t(k) = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.24}$$
Equating Eqs. (4.21) and (4.24) we obtain the master equation for k > q:

$$(t+1) \cdot f_{t+1}(k) - t \cdot f_t(k) = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.25}$$
On the other hand, when k = q, assuming that there are no nodes in the graph with degree less than q, then only the newly added node contributes to an increase in the number of nodes with degree k = q by one. However, if u connects to an existing node v_i with degree k, then there will be a decrease in the number of degree k nodes because in this case v_i will have degree k + 1 in G_{t+1}. The net change in the number of nodes with degree k is therefore given as

$$1 - q \cdot \pi_t(k) = 1 - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.26}$$
Equating Eqs. (4.21) and (4.26) we obtain the master equation for the boundary condition k = q:

$$(t+1) \cdot f_{t+1}(k) - t \cdot f_t(k) = 1 - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.27}$$
Our goal is now to obtain the stationary or time-invariant solutions for the master equations. In other words, we study the solution when

$$f_{t+1}(k) = f_t(k) = f(k) \tag{4.28}$$

The stationary solution gives the degree distribution that is independent of time.
Let us first derive the stationary solution for k = q. Substituting Eq. (4.28) into Eq. (4.27) and setting k = q, we obtain

$$(t+1) \cdot f(q) - t \cdot f(q) = 1 - \frac{1}{2} \cdot q \cdot f(q)$$

$$2 f(q) = 2 - q \cdot f(q), \ \text{ which implies that } \ f(q) = \frac{2}{q+2} \tag{4.29}$$
The stationary solution for k > q gives us a recursion for f(k) in terms of f(k − 1):

$$(t+1) \cdot f(k) - t \cdot f(k) = \frac{1}{2} \cdot (k-1) \cdot f(k-1) - \frac{1}{2} \cdot k \cdot f(k)$$

$$2 f(k) = (k-1) \cdot f(k-1) - k \cdot f(k), \ \text{ which implies that } \ f(k) = \frac{k-1}{k+2} \cdot f(k-1) \tag{4.30}$$
Expanding (4.30) until the boundary condition k = q yields

$$f(k) = \frac{(k-1)}{(k+2)} \cdot f(k-1) = \frac{(k-1)(k-2)}{(k+2)(k+1)} \cdot f(k-2) = \cdots
= \frac{(k-1)(k-2)(k-3)(k-4)\cdots(q+3)(q+2)(q+1)\,q}{(k+2)(k+1)(k)(k-1)\cdots(q+6)(q+5)(q+4)(q+3)} \cdot f(q)
= \frac{(q+2)(q+1)\,q}{(k+2)(k+1)\,k} \cdot f(q)$$
Plugging in the stationary solution for f(q) from Eq. (4.29) gives the general solution

$$f(k) = \frac{(q+2)(q+1)\,q}{(k+2)(k+1)\,k} \cdot \frac{2}{(q+2)} = \frac{2\,q(q+1)}{k(k+1)(k+2)}$$

For constant q and large k, it is easy to see that the degree distribution scales as

$$f(k) \propto k^{-3} \tag{4.31}$$

In other words, the BA model yields a power-law degree distribution with γ = 3, especially for large degrees.
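As a quick numeric sanity check of this closed form, the short Python snippet below evaluates f(k) for q = 3 and confirms that f(k)·k³ approaches the constant 2q(q+1) = 24 as k grows.

```python
q = 3
for k in (10, 100, 1000):
    f = 2 * q * (q + 1) / (k * (k + 1) * (k + 2))
    print(k, f, round(f * k**3, 3))   # f * k^3 -> 2q(q+1) = 24
```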
Continuous Approach
The continuous approach is also called the mean-field method. In the BA model, the vertices that are added early on tend to have a higher degree, because they have more chances to acquire connections from the vertices that are added to the network at a later time. The time dependence of the degree of a vertex can be approximated as a continuous random variable. Let k_i = d_t(i) denote the degree of vertex v_i at time t. At time t, the probability that the newly added node u links to v_i is given as π_t(i). Further, the change in v_i's degree per time-step is given as q · π_t(i). Using the approximation that n_t ≃ t and m_t ≃ qt from Eq. (4.16), the rate of change of k_i with time can be written as

$$\frac{dk_i}{dt} = q \cdot \pi_t(i) = q \cdot \frac{k_i}{2qt} = \frac{k_i}{2t}$$
Rearranging the terms in the preceding equation and integrating on both sides, we have

$$\int \frac{1}{k_i}\,dk_i = \int \frac{1}{2t}\,dt$$

$$\ln k_i = \frac{1}{2}\ln t + C$$

$$e^{\ln k_i} = e^{\ln t^{1/2}} \cdot e^C, \ \text{ which implies } \ k_i = \alpha \cdot t^{1/2} \tag{4.32}$$

where C is the constant of integration, and thus α = e^C is also a constant.
Let t_i denote the time when node i was added to the network. Because the initial degree for any node is q, we obtain the boundary condition that k_i = q at time t = t_i. Plugging these into Eq. (4.32), we get

$$k_i = \alpha \cdot t_i^{1/2} = q, \ \text{ which implies that } \ \alpha = \frac{q}{\sqrt{t_i}} \tag{4.33}$$
Substituting Eq. (4.33) into Eq. (4.32) leads to the particular solution

$$k_i = \alpha \cdot \sqrt{t} = q \cdot \sqrt{t/t_i} \tag{4.34}$$

Intuitively, this solution confirms the rich-gets-richer phenomenon. It suggests that if a node v_i is added early to the network (i.e., t_i is small), then as time progresses (i.e., t gets larger), the degree of v_i keeps on increasing (as a square root of the time t).
Let us now consider the probability that the degree of v_i at time t is less than some value k, that is, P(k_i < k). From Eq. (4.34), k_i < k implies that q·√(t/t_i) < k, which in turn implies that t_i > q²t/k². Thus, we can write

$$P(k_i < k) = P\left(t_i > \frac{q^2 t}{k^2}\right) = 1 - P\left(t_i \le \frac{q^2 t}{k^2}\right)$$

In other words, the probability that node v_i has degree less than k is the same as the probability that the time t_i at which v_i enters the graph is greater than q²t/k², which in turn can be expressed as 1 minus the probability that t_i is less than or equal to q²t/k².
Note that vertices are added to the graph at a uniform rate of one vertex per time-step, that is, 1/n_t ≃ 1/t. Thus, the probability that t_i is less than or equal to q²t/k² is given as

$$P(k_i < k) = 1 - P\left(t_i \le \frac{q^2 t}{k^2}\right) = 1 - \frac{q^2 t}{k^2} \cdot \frac{1}{t} = 1 - \frac{q^2}{k^2}$$
Because v_i is any generic node in the graph, P(k_i < k) can be considered to be the cumulative degree distribution F_t(k) at time t. We can obtain the degree distribution f_t(k) by taking the derivative of F_t(k) with respect to k to obtain

$$f_t(k) = \frac{d}{dk} F_t(k) = \frac{d}{dk}\left(1 - \frac{q^2}{k^2}\right) = \frac{2q^2}{k^3} \propto k^{-3}$$

In other words, the continuous approach also yields a power-law degree distribution with γ = 3, in agreement with the discrete approach. The diameter of BA graphs has been shown to scale as O(log n_t / log log n_t), indicating ultra-small-world behavior, when q > 1. Further, the expected clustering coefficient of the BA graphs scales as

$$E[C(G_t)] = O\left(\frac{(\log n_t)^2}{n_t}\right)$$

which is only slightly better than the clustering coefficient for random graphs, which scale as O(n_t^{-1}). In Example 4.12, we empirically study the clustering coefficient and diameter for random instances of the BA model with a given set of parameters.
Example 4.12. Figure 4.14 plots the empirical degree distribution obtained as the average of 10 different BA graphs generated with the parameters n_0 = 3, q = 3, and for t = 997 time-steps, so that the final graph has n = 1000 vertices. The slope of the line in the log-log scale confirms the existence of a power law, with the slope given as −γ = −2.64.

The average clustering coefficient over the 10 graphs was C(G) = 0.019, which is not very high, indicating that the BA model does not capture the clustering effect. On the other hand, the average diameter was d(G) = 6, indicating ultra-small-world behavior.
[Figure 4.14 shows the log-log plot of degree (log2 k) versus probability (log2 f(k)) for these graphs, with fitted slope −γ = −2.64.]

Figure 4.14. Barabási–Albert model (n_0 = 3, t = 997, q = 3): degree distribution.
4.5 FURTHER READING

The theory of random graphs was founded in Erdős and Rényi (1959); for a detailed treatment of the topic see Bollobás (2001). Alternative graph models for real-world networks were proposed in Watts and Strogatz (1998) and Barabási and Albert (1999). One of the first comprehensive books on graph data analysis was Wasserman and Faust (1994). More recent books on network science include Lewis (2009) and Newman (2010). For PageRank see Brin and Page (1998), and for the hubs and authorities approach see Kleinberg (1999). For an up-to-date treatment of the patterns, laws, and models (including the RMat generator) for real-world networks, see Chakrabarti and Faloutsos (2012).
Barabási, A.-L. and Albert, R. (1999). "Emergence of scaling in random networks." Science, 286(5439): 509–512.
Bollobás, B. (2001). Random Graphs, 2nd ed. Vol. 73. New York: Cambridge University Press.
Brin, S. and Page, L. (1998). "The anatomy of a large-scale hypertextual Web search engine." Computer Networks and ISDN Systems, 30(1): 107–117.
Chakrabarti, D. and Faloutsos, C. (2012). "Graph Mining: Laws, Tools, and Case Studies." Synthesis Lectures on Data Mining and Knowledge Discovery, 7(1): 1–207. San Rafael, CA: Morgan & Claypool Publishers.
Erdős, P. and Rényi, A. (1959). "On random graphs." Publicationes Mathematicae Debrecen, 6, 290–297.
Kleinberg, J. M. (1999). "Authoritative sources in a hyperlinked environment." Journal of the ACM, 46(5): 604–632.
Lewis, T. G. (2009). Network Science: Theory and Applications. Hoboken, NJ: John Wiley & Sons.
Newman, M. (2010). Networks: An Introduction. Oxford: Oxford University Press.
Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences. New York: Cambridge University Press.
Watts, D. J. and Strogatz, S. H. (1998). "Collective dynamics of 'small-world' networks." Nature, 393(6684): 440–442.
4.6 EXERCISES

Q1. Given the graph in Figure 4.15, find the fixed-point of the prestige vector.

[Figure 4.15: a directed graph on nodes a, b, c.]

Figure 4.15. Graph for Q1.

Q2. Given the graph in Figure 4.16, find the fixed-point of the authority and hub vectors.

[Figure 4.16: a directed graph on nodes a, b, c.]

Figure 4.16. Graph for Q2.
Q3. Consider the double star graph given in Figure 4.17 with n nodes, where only nodes 1 and 2 are connected to all other vertices, and there are no other links. Answer the following questions (treating n as a variable).
(a) What is the degree distribution for this graph?
(b) What is the mean degree?
(c) What is the clustering coefficient for vertex 1 and vertex 3?
(d) What is the clustering coefficient C(G) for the entire graph? What happens to the clustering coefficient as n → ∞?
(e) What is the transitivity T(G) for the graph? What happens to T(G) as n → ∞?
(f) What is the average path length for the graph?
(g) What is the betweenness value for node 1?
(h) What is the degree variance for the graph?

[Figure 4.17: the double star, with nodes 3, 4, 5, ..., n each linked to nodes 1 and 2.]

Figure 4.17. Graph for Q3.
Q4. Consider the graph in Figure 4.18. Compute the hub and authority score vectors. Which nodes are the hubs and which are the authorities?

[Figure 4.18: a directed graph on nodes 1, 2, 3, 4, 5.]

Figure 4.18. Graph for Q4.
Q5. Prove that in the BA model at time-step t + 1, the probability π_t(k) that some node with degree k in G_t is chosen for preferential attachment is given as

$$\pi_t(k) = \frac{k \cdot n_t(k)}{\sum_i i \cdot n_t(i)}$$
CHAPTER 5
Kernel Methods

Before we can mine data, it is important to first find a suitable data representation that facilitates data analysis. For example, for complex data such as text, sequences, images, and so on, we must typically extract or construct a set of attributes or features, so that we can represent the data instances as multivariate vectors. That is, given a data instance x (e.g., a sequence), we need to find a mapping φ, so that φ(x) is the vector representation of x. Even when the input data is a numeric data matrix, if we wish to discover nonlinear relationships among the attributes, then a nonlinear mapping φ may be used, so that φ(x) represents a vector in the corresponding high-dimensional space comprising nonlinear attributes. We use the term input space to refer to the data space for the input data x and feature space to refer to the space of mapped vectors φ(x). Thus, given a set of data objects or instances x_i, and given a mapping function φ, we can transform them into feature vectors φ(x_i), which then allows us to analyze complex data instances via numeric analysis methods.
Example 5.1 (Sequence-based Features). Consider a dataset of DNA sequences over the alphabet Σ = {A, C, G, T}. One simple feature space is to represent each sequence in terms of the probability distribution over symbols in Σ. That is, given a sequence x with length |x| = m, the mapping into feature space is given as

$$\phi(\mathbf{x}) = \{P(A), P(C), P(G), P(T)\}$$

where P(s) = n_s/m is the probability of observing symbol s ∈ Σ, and n_s is the number of times s appears in sequence x. Here the input space is the set of sequences Σ*, and the feature space is R⁴. For example, if x = ACAGCAGTA, with m = |x| = 9, since A occurs four times, C and G occur twice, and T occurs once, we have

$$\phi(\mathbf{x}) = (4/9,\ 2/9,\ 2/9,\ 1/9) = (0.44,\ 0.22,\ 0.22,\ 0.11)$$

Likewise, for another sequence y = AGCAAGCGAG, we have

$$\phi(\mathbf{y}) = (4/10,\ 2/10,\ 4/10,\ 0) = (0.4,\ 0.2,\ 0.4,\ 0)$$

The mapping φ now allows one to compute statistics over the data sample to make inferences about the population. For example, we may compute the mean symbol composition. We can also define the distance between any two sequences, for example,

$$\delta(\mathbf{x}, \mathbf{y}) = \|\phi(\mathbf{x}) - \phi(\mathbf{y})\|
= \sqrt{(0.44 - 0.4)^2 + (0.22 - 0.2)^2 + (0.22 - 0.4)^2 + (0.11 - 0)^2} = 0.22$$

We can compute larger feature spaces by considering, for example, the probability distribution over all substrings or words of size up to k over the alphabet Σ, and so on.
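A minimal sketch of this symbol-composition feature map in Python/NumPy is shown below (the function name is illustrative); it reproduces the φ(x) and δ(x, y) values above.

```python
import numpy as np

def phi(seq: str) -> np.ndarray:
    """Map a DNA sequence to its symbol-probability vector over A, C, G, T."""
    return np.array([seq.count(s) / len(seq) for s in "ACGT"])

x, y = "ACAGCAGTA", "AGCAAGCGAG"
print(phi(x))                              # [0.444 0.222 0.222 0.111]
print(np.linalg.norm(phi(x) - phi(y)))     # delta(x, y) ~ 0.22
```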
Example 5.2 (Nonlinear Features). As an example of a nonlinear mapping consider the mapping φ that takes as input a vector x = (x1, x2)^T ∈ R² and maps it to a "quadratic" feature space via the nonlinear mapping

$$\phi(\mathbf{x}) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)^T \in \mathbb{R}^3$$

For example, the point x = (5.9, 3)^T is mapped to the vector

$$\phi(\mathbf{x}) = (5.9^2,\ 3^2,\ \sqrt{2} \cdot 5.9 \cdot 3)^T = (34.81,\ 9,\ 25.03)^T$$

The main benefit of this transformation is that we may apply well-known linear analysis methods in the feature space. However, because the features are nonlinear combinations of the original attributes, this allows us to mine nonlinear patterns and relationships.
Whereas mapping into feature space allows one to analyze the data via algebraic and probabilistic modeling, the resulting feature space is usually very high-dimensional; it may even be infinite dimensional. Thus, transforming all the input points into feature space can be very expensive, or even impossible. Because the dimensionality is high, we also run into the curse of dimensionality highlighted later in Chapter 6.

Kernel methods avoid explicitly transforming each point x in the input space into the mapped point φ(x) in the feature space. Instead, the input objects are represented via their n × n pairwise similarity values. The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space, yet it can be computed without directly constructing φ(x). Let I denote the input space, which can comprise any arbitrary set of objects, and let D = {x_i}_{i=1}^n ⊂ I be a dataset comprising n objects in the input space. We can represent the pairwise similarity values between points in D via the n × n kernel matrix, defined as

$$\mathbf{K} = \begin{pmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & K(\mathbf{x}_1, \mathbf{x}_2) & \cdots & K(\mathbf{x}_1, \mathbf{x}_n) \\
K(\mathbf{x}_2, \mathbf{x}_1) & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & K(\mathbf{x}_2, \mathbf{x}_n) \\
\vdots & \vdots & \ddots & \vdots \\
K(\mathbf{x}_n, \mathbf{x}_1) & K(\mathbf{x}_n, \mathbf{x}_2) & \cdots & K(\mathbf{x}_n, \mathbf{x}_n)
\end{pmatrix}$$

where K: I × I → R is a kernel function on any two points in input space. However, we require that K corresponds to a dot product in some feature space. That is, for any x_i, x_j ∈ I, the kernel function should satisfy the condition

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \tag{5.1}$$
where φ: I → F is a mapping from the input space I to the feature space F. Intuitively, this means that we should be able to compute the value of the dot product using the original input representation x, without having recourse to the mapping φ(x). Obviously, not just any arbitrary function can be used as a kernel; a valid kernel function must satisfy certain conditions so that Eq. (5.1) remains valid, as discussed in Section 5.1.

It is important to remark that the transpose operator for the dot product applies only when F is a vector space. When F is an abstract vector space with an inner product, the kernel is written as K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. However, for convenience we use the transpose operator throughout this chapter; when F is an inner product space it should be understood that

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \equiv \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$
Example 5.3 (Linear and Quadratic Kernels). Consider the identity mapping, φ(x) → x. This naturally leads to the linear kernel, which is simply the dot product between two input vectors, and thus satisfies Eq. (5.1):

$$\phi(\mathbf{x})^T \phi(\mathbf{y}) = \mathbf{x}^T \mathbf{y} = K(\mathbf{x}, \mathbf{y})$$

For example, consider the first five points from the two-dimensional Iris dataset shown in Figure 5.1a:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \quad
\mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} \quad
\mathbf{x}_3 = \begin{pmatrix} 6.6 \\ 2.9 \end{pmatrix} \quad
\mathbf{x}_4 = \begin{pmatrix} 4.6 \\ 3.2 \end{pmatrix} \quad
\mathbf{x}_5 = \begin{pmatrix} 6 \\ 2.2 \end{pmatrix}$$

The kernel matrix for the linear kernel is shown in Figure 5.1b. For example,

$$K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^T \mathbf{x}_2 = 5.9 \times 6.9 + 3 \times 3.1 = 40.71 + 9.3 = 50.01$$
[Figure 5.1a plots the five points in the X1–X2 plane.]

K      x1      x2      x3      x4      x5
x1   43.81   50.01   47.64   36.74   42.00
x2   50.01   57.22   54.53   41.66   48.22
x3   47.64   54.53   51.97   39.64   45.98
x4   36.74   41.66   39.64   31.40   34.64
x5   42.00   48.22   45.98   34.64   40.84

Figure 5.1. (a) Example points. (b) Linear kernel matrix.
Kernel Methods
137
Consider the quadratic mapping φ: R² → R³ from Example 5.2, that maps x = (x1, x2)^T as follows:

$$\phi(\mathbf{x}) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)^T$$

The dot product between the mapping for two input points x, y ∈ R² is given as

$$\phi(\mathbf{x})^T \phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2$$

We can rearrange the preceding to obtain the (homogeneous) quadratic kernel function as follows:

$$\phi(\mathbf{x})^T \phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = (x_1 y_1 + x_2 y_2)^2 = (\mathbf{x}^T \mathbf{y})^2 = K(\mathbf{x}, \mathbf{y})$$

We can thus see that the dot product in feature space can be computed by evaluating the kernel in input space, without explicitly mapping the points into feature space. For example, we have

$$\phi(\mathbf{x}_1) = (5.9^2,\ 3^2,\ \sqrt{2} \cdot 5.9 \cdot 3)^T = (34.81,\ 9,\ 25.03)^T$$

$$\phi(\mathbf{x}_2) = (6.9^2,\ 3.1^2,\ \sqrt{2} \cdot 6.9 \cdot 3.1)^T = (47.61,\ 9.61,\ 30.25)^T$$

$$\phi(\mathbf{x}_1)^T \phi(\mathbf{x}_2) = 34.81 \times 47.61 + 9 \times 9.61 + 25.03 \times 30.25 = 2501$$

We can verify that the homogeneous quadratic kernel gives the same value

$$K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1^T \mathbf{x}_2)^2 = (50.01)^2 = 2501$$
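A minimal sketch in Python/NumPy, reproducing the linear and homogeneous quadratic kernel matrices for these five points and checking Eq. (5.1) for the quadratic map:

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K_lin = X @ X.T          # linear kernel: K(x, y) = x^T y
K_quad = K_lin ** 2      # quadratic kernel: K(x, y) = (x^T y)^2, elementwise

# Explicit quadratic feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)^T
Phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])
assert np.allclose(Phi @ Phi.T, K_quad)   # dot products in feature space agree
print(K_lin[0, 1], K_quad[0, 1])          # 50.01 and ~2501
```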
We shall see that many data mining methods can be kernelized, that is, instead of mapping the input points into feature space, the data can be represented via the n × n kernel matrix K, and all relevant analysis can be performed over K. This is usually done via the so-called kernel trick, that is, show that the analysis task requires only dot products φ(x_i)^T φ(x_j) in feature space, which can be replaced by the corresponding kernel K(x_i, x_j) = φ(x_i)^T φ(x_j) that can be computed efficiently in input space. Once the kernel matrix has been computed, we no longer even need the input points x_i, as all operations involving only dot products in the feature space can be performed over the n × n kernel matrix K. An immediate consequence is that when the input data is the typical n × d numeric matrix D and we employ the linear kernel, the results obtained by analyzing K are equivalent to those obtained by analyzing D (as long as only dot products are involved in the analysis). Of course, kernel methods allow much more flexibility, as we can just as easily perform non-linear analysis by employing nonlinear kernels, or we may analyze (non-numeric) complex objects without explicitly constructing the mapping φ(x).
Example 5.4. Consider the five points from Example 5.3 along with the linear kernel matrix shown in Figure 5.1. The mean of the five points in feature space is simply the mean in input space, as φ is the identity function for the linear kernel:

$$\boldsymbol{\mu}_\phi = \frac{1}{5}\sum_{i=1}^{5} \phi(\mathbf{x}_i) = \frac{1}{5}\sum_{i=1}^{5} \mathbf{x}_i = (6.00,\ 2.88)^T$$

Now consider the squared magnitude of the mean in feature space:

$$\|\boldsymbol{\mu}_\phi\|^2 = \boldsymbol{\mu}_\phi^T \boldsymbol{\mu}_\phi = (6.0^2 + 2.88^2) = 44.29$$

Because this involves only a dot product in feature space, the squared magnitude can be computed directly from K. As we shall see later [see Eq. (5.12)] the squared norm of the mean vector in feature space is equivalent to the average value of the kernel matrix K. For the kernel matrix in Figure 5.1b we have

$$\frac{1}{5^2}\sum_{i=1}^{5}\sum_{j=1}^{5} K(\mathbf{x}_i, \mathbf{x}_j) = \frac{1107.36}{25} = 44.29$$

which matches the ‖µ_φ‖² value computed earlier. This example illustrates that operations involving dot products in feature space can be cast as operations over the kernel matrix K.
Kernel methods offer a radically different view of the data. Instead of thinking of the data as vectors in input or feature space, we consider only the kernel values between pairs of points. The kernel matrix can also be considered as a weighted adjacency matrix for the complete graph over the n input points, and consequently there is a strong connection between kernels and graph analysis, in particular algebraic graph theory.
5.1 KERNEL MATRIX

Let I denote the input space, which can be any arbitrary set of data objects, and let D = {x1, x2, ..., xn} ⊂ I denote a subset of n objects in the input space. Let φ: I → F be a mapping from the input space into the feature space F, which is endowed with a dot product and norm. Let K: I × I → R be a function that maps pairs of input objects to their dot product value in feature space, that is, K(x_i, x_j) = φ(x_i)^T φ(x_j), and let K be the n × n kernel matrix corresponding to the subset D.
The function K is called a positive semidefinite kernel if and only if it is symmetric:

$$K(\mathbf{x}_i, \mathbf{x}_j) = K(\mathbf{x}_j, \mathbf{x}_i)$$

and the corresponding kernel matrix K for any subset D ⊂ I is positive semidefinite, that is,

$$\mathbf{a}^T \mathbf{K} \mathbf{a} \ge 0, \ \text{ for all vectors } \mathbf{a} \in \mathbb{R}^n$$

which implies that

$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j K(\mathbf{x}_i, \mathbf{x}_j) \ge 0, \ \text{ for all } a_i \in \mathbb{R},\ i \in [1, n] \tag{5.2}$$
We first verify that if K(x_i, x_j) represents the dot product φ(x_i)^T φ(x_j) in some feature space, then K is a positive semidefinite kernel. Consider any dataset D, and let K = {K(x_i, x_j)} be the corresponding kernel matrix. First, K is symmetric since the dot product is symmetric, which also implies that K is symmetric. Second, K is positive semidefinite because

$$\mathbf{a}^T \mathbf{K} \mathbf{a} = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j K(\mathbf{x}_i, \mathbf{x}_j)
= \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \,\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
= \left(\sum_{i=1}^{n} a_i \phi(\mathbf{x}_i)\right)^T \left(\sum_{j=1}^{n} a_j \phi(\mathbf{x}_j)\right)
= \left\| \sum_{i=1}^{n} a_i \phi(\mathbf{x}_i) \right\|^2 \ge 0$$

Thus, K is a positive semidefinite kernel.

We now show that if we are given a positive semidefinite kernel K: I × I → R, then it corresponds to a dot product in some feature space F.
5.1.1 Reproducing Kernel Map

For the reproducing kernel map φ, we map each point x ∈ I into a function in a functional space {f: I → R} comprising functions that map points in I into R. Algebraically this space of functions is an abstract vector space where each point happens to be a function. In particular, any x ∈ I in the input space is mapped to the following function:

$$\phi(\mathbf{x}) = K(\mathbf{x}, \cdot)$$

where the · stands for any argument in I. That is, each object x in the input space gets mapped to a feature point φ(x), which is in fact a function K(x, ·) that represents its similarity to all other points in the input space I.
Let F be the set of all functions or points that can be obtained as a linear combination of any subset of feature points, defined as

$$\mathcal{F} = \mathrm{span}\bigl\{K(\mathbf{x}, \cdot) \mid \mathbf{x} \in \mathcal{I}\bigr\}
= \left\{ \mathbf{f} = f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\mathbf{x}_i, \cdot) \ \middle|\ m \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathcal{I} \right\}$$

We use the dual notation f and f(·) interchangeably to emphasize the fact that each point f in the feature space is in fact a function f(·). Note that by definition the feature point φ(x) = K(x, ·) belongs to F.

Let f, g ∈ F be any two points in feature space:

$$\mathbf{f} = f(\cdot) = \sum_{i=1}^{m_a} \alpha_i K(\mathbf{x}_i, \cdot) \qquad \mathbf{g} = g(\cdot) = \sum_{j=1}^{m_b} \beta_j K(\mathbf{x}_j, \cdot)$$
Define the dot product between two points as

$$\mathbf{f}^T \mathbf{g} = f(\cdot)^T g(\cdot) = \sum_{i=1}^{m_a}\sum_{j=1}^{m_b} \alpha_i \beta_j K(\mathbf{x}_i, \mathbf{x}_j) \tag{5.3}$$

We emphasize that the notation f^T g is only a convenience; it denotes the inner product ⟨f, g⟩ because F is an abstract vector space, with an inner product as defined above.

We can verify that the dot product is bilinear, that is, linear in both arguments, because

$$\mathbf{f}^T \mathbf{g} = \sum_{i=1}^{m_a}\sum_{j=1}^{m_b} \alpha_i \beta_j K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{i=1}^{m_a} \alpha_i \,g(\mathbf{x}_i) = \sum_{j=1}^{m_b} \beta_j \,f(\mathbf{x}_j)$$
The fact that K is positive semidefinite implies that

$$\|\mathbf{f}\|^2 = \mathbf{f}^T \mathbf{f} = \sum_{i=1}^{m_a}\sum_{j=1}^{m_a} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \ge 0$$

Thus, the space F is a pre-Hilbert space, defined as a normed inner product space, because it is endowed with a symmetric bilinear dot product and a norm. By adding the limit points of all Cauchy sequences that are convergent, F can be turned into a Hilbert space, defined as a normed inner product space that is complete. However, showing this is beyond the scope of this chapter.
The space F has the so-called reproducing property, that is, we can evaluate a function f(·) = f at a point x ∈ I by taking the dot product of f with φ(x), that is,

$$\mathbf{f}^T \phi(\mathbf{x}) = f(\cdot)^T K(\mathbf{x}, \cdot) = \sum_{i=1}^{m_a} \alpha_i K(\mathbf{x}_i, \mathbf{x}) = f(\mathbf{x})$$

For this reason, the space F is also called a reproducing kernel Hilbert space.
All we have to do now is to show that K(x_i, x_j) corresponds to a dot product in the feature space F. This is indeed the case, because using Eq. (5.3) for any two feature points φ(x_i), φ(x_j) ∈ F their dot product is given as

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \cdot)^T K(\mathbf{x}_j, \cdot) = K(\mathbf{x}_i, \mathbf{x}_j)$$

The reproducing kernel map shows that any positive semidefinite kernel corresponds to a dot product in some feature space. This means we can apply well known algebraic and geometric methods to understand and analyze the data in these spaces.
Empirical Kernel Map
The reproducing kernel map φ maps the input space into a potentially infinite dimensional feature space. However, given a dataset D = {x_i}_{i=1}^n, we can obtain a finite dimensional mapping by evaluating the kernel only on points in D. That is, define the map φ as follows:

$$\phi(\mathbf{x}) = \bigl(K(\mathbf{x}_1, \mathbf{x}),\ K(\mathbf{x}_2, \mathbf{x}),\ \ldots,\ K(\mathbf{x}_n, \mathbf{x})\bigr)^T \in \mathbb{R}^n$$

which maps each point x ∈ I to the n-dimensional vector comprising the kernel values of x with each of the objects x_i ∈ D. We can define the dot product in feature space as

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \sum_{k=1}^{n} K(\mathbf{x}_k, \mathbf{x}_i)\, K(\mathbf{x}_k, \mathbf{x}_j) = \mathbf{K}_i^T \mathbf{K}_j \tag{5.4}$$

where K_i denotes the i-th column of K, which is also the same as the i-th row of K (considered as a column vector), as K is symmetric. However, for φ to be a valid map, we require that φ(x_i)^T φ(x_j) = K(x_i, x_j), which is clearly not satisfied by Eq. (5.4). One solution is to replace K_i^T K_j in Eq. (5.4) with K_i^T A K_j for some positive semidefinite matrix A such that

$$\mathbf{K}_i^T \mathbf{A} \mathbf{K}_j = K(\mathbf{x}_i, \mathbf{x}_j)$$

If we can find such an A, it would imply that over all pairs of mapped points we have

$$\bigl\{\mathbf{K}_i^T \mathbf{A} \mathbf{K}_j\bigr\}_{i,j=1}^{n} = \bigl\{K(\mathbf{x}_i, \mathbf{x}_j)\bigr\}_{i,j=1}^{n}$$

which can be written compactly as

$$\mathbf{K} \mathbf{A} \mathbf{K} = \mathbf{K}$$

This immediately suggests that we take A = K^{−1}, the (pseudo) inverse of the kernel matrix K. The modified map φ, called the empirical kernel map, is then defined as

$$\phi(\mathbf{x}) = \mathbf{K}^{-1/2} \cdot \bigl(K(\mathbf{x}_1, \mathbf{x}),\ K(\mathbf{x}_2, \mathbf{x}),\ \ldots,\ K(\mathbf{x}_n, \mathbf{x})\bigr)^T \in \mathbb{R}^n$$

so that the dot product yields

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \bigl(\mathbf{K}^{-1/2} \mathbf{K}_i\bigr)^T \bigl(\mathbf{K}^{-1/2} \mathbf{K}_j\bigr) = \mathbf{K}_i^T \bigl(\mathbf{K}^{-1/2} \mathbf{K}^{-1/2}\bigr) \mathbf{K}_j = \mathbf{K}_i^T \mathbf{K}^{-1} \mathbf{K}_j$$

Over all pairs of mapped points, we have

$$\bigl\{\mathbf{K}_i^T \mathbf{K}^{-1} \mathbf{K}_j\bigr\}_{i,j=1}^{n} = \mathbf{K} \mathbf{K}^{-1} \mathbf{K} = \mathbf{K}$$

as desired. However, it is important to note that this empirical feature representation is valid only for the n points in D. If points are added to or removed from D, the kernel map will have to be updated for all points.
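A minimal sketch of the empirical kernel map in Python/NumPy follows (the function name is illustrative); it uses the pseudo-inverse square root via the eigendecomposition of K, so it also handles rank-deficient kernel matrices.

```python
import numpy as np

def empirical_map(K: np.ndarray) -> np.ndarray:
    """Return the matrix whose i-th row is phi(x_i) = K^{-1/2} K_i."""
    lam, U = np.linalg.eigh(K)                 # K = U diag(lam) U^T
    lam = np.clip(lam, 0.0, None)              # guard tiny negative round-off
    inv_sqrt = np.where(lam > 1e-10, 1.0 / np.sqrt(lam), 0.0)
    K_inv_half = U @ np.diag(inv_sqrt) @ U.T   # pseudo-inverse square root
    return (K_inv_half @ K).T                  # row i is (K^{-1/2} K_i)^T

# Usage: Phi = empirical_map(K); then np.allclose(Phi @ Phi.T, K) holds.
```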
5.1.2 Mercer Kernel Map

In general different feature spaces can be constructed for the same kernel K. We now describe how to construct the Mercer map.

Data-specific Kernel Map
The Mercer kernel map is best understood starting from the kernel matrix for the dataset D in input space. Because K is a symmetric positive semidefinite matrix, it has real and non-negative eigenvalues, and it can be decomposed as follows:

$$\mathbf{K} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$$

where U is the orthonormal matrix of eigenvectors u_i = (u_{i1}, u_{i2}, ..., u_{in})^T ∈ R^n (for i = 1, ..., n), and Λ is the diagonal matrix of eigenvalues, with both arranged in non-increasing order of the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λn ≥ 0:

$$\mathbf{U} = \begin{pmatrix} | & | & & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \\ | & | & & | \end{pmatrix} \qquad
\boldsymbol{\Lambda} = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
The kernel matrix
K
can therefore be rewritten as the spectral sum
K
=
λ
1
u
1
u
T
1
+
λ
2
u
2
u
T
2
+···+
λ
n
u
n
u
T
n
In particular the kernel function between $\mathbf{x}_i$ and $\mathbf{x}_j$ is given as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \lambda_1 u_{1i} u_{1j} + \lambda_2 u_{2i} u_{2j} + \cdots + \lambda_n u_{ni} u_{nj} = \sum_{k=1}^{n} \lambda_k\, u_{ki}\, u_{kj} \tag{5.5}$$

where $u_{ki}$ denotes the $i$th component of eigenvector $\mathbf{u}_k$. It follows that if we define the Mercer map $\phi$ as follows:

$$\phi(\mathbf{x}_i) = \big(\sqrt{\lambda_1}\, u_{1i},\; \sqrt{\lambda_2}\, u_{2i},\; \ldots,\; \sqrt{\lambda_n}\, u_{ni}\big)^T \tag{5.6}$$

then $K(\mathbf{x}_i, \mathbf{x}_j)$ is a dot product in feature space between the mapped points $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$, because

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \big(\sqrt{\lambda_1}\, u_{1i}, \ldots, \sqrt{\lambda_n}\, u_{ni}\big)\, \big(\sqrt{\lambda_1}\, u_{1j}, \ldots, \sqrt{\lambda_n}\, u_{nj}\big)^T = \lambda_1 u_{1i} u_{1j} + \cdots + \lambda_n u_{ni} u_{nj} = K(\mathbf{x}_i, \mathbf{x}_j)$$

Noting that $\mathbf{U}_i = (u_{1i}, u_{2i}, \ldots, u_{ni})^T$ is the $i$th row of $\mathbf{U}$, we can rewrite the Mercer map $\phi$ as

$$\phi(\mathbf{x}_i) = \sqrt{\mathbf{\Lambda}}\, \mathbf{U}_i \tag{5.7}$$

Thus, the kernel value is simply the dot product between scaled rows of $\mathbf{U}$:

$$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \big(\sqrt{\mathbf{\Lambda}}\, \mathbf{U}_i\big)^T \big(\sqrt{\mathbf{\Lambda}}\, \mathbf{U}_j\big) = \mathbf{U}_i^T\, \mathbf{\Lambda}\, \mathbf{U}_j$$

The Mercer map, defined equivalently in Eqs. (5.6) and (5.7), is obviously restricted to the input dataset $\mathbf{D}$, just like the empirical kernel map, and is therefore called the data-specific Mercer kernel map. It defines a data-specific feature space of dimensionality at most $n$, comprising the eigenvectors of $\mathbf{K}$.
Example 5.5. Let the input dataset comprise the five points shown in Figure 5.1a, and let the corresponding kernel matrix be as shown in Figure 5.1b. Computing the eigen-decomposition of $\mathbf{K}$, we obtain $\lambda_1 = 223.95$, $\lambda_2 = 1.29$, and $\lambda_3 = \lambda_4 = \lambda_5 = 0$. The effective dimensionality of the feature space is 2, comprising the eigenvectors $\mathbf{u}_1$ and $\mathbf{u}_2$. Thus, the matrix $\mathbf{U}$ (with rows $\mathbf{U}_1, \ldots, \mathbf{U}_5$ and columns $\mathbf{u}_1, \mathbf{u}_2$) is given as follows:

$$\mathbf{U} = \begin{pmatrix} -0.442 & 0.163 \\ -0.505 & -0.134 \\ -0.482 & -0.181 \\ -0.369 & 0.813 \\ -0.425 & -0.512 \end{pmatrix}$$

and we have

$$\mathbf{\Lambda} = \begin{pmatrix} 223.95 & 0 \\ 0 & 1.29 \end{pmatrix} \qquad \sqrt{\mathbf{\Lambda}} = \begin{pmatrix} \sqrt{223.95} & 0 \\ 0 & \sqrt{1.29} \end{pmatrix} = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix}$$

The kernel map is specified via Eq. (5.7). For example, for $\mathbf{x}_1 = (5.9, 3)^T$ and $\mathbf{x}_2 = (6.9, 3.1)^T$ we have

$$\phi(\mathbf{x}_1) = \sqrt{\mathbf{\Lambda}}\, \mathbf{U}_1 = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix} \begin{pmatrix} -0.442 \\ 0.163 \end{pmatrix} = \begin{pmatrix} -6.616 \\ 0.185 \end{pmatrix}$$

$$\phi(\mathbf{x}_2) = \sqrt{\mathbf{\Lambda}}\, \mathbf{U}_2 = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix} \begin{pmatrix} -0.505 \\ -0.134 \end{pmatrix} = \begin{pmatrix} -7.563 \\ -0.153 \end{pmatrix}$$

Their dot product is given as

$$\phi(\mathbf{x}_1)^T \phi(\mathbf{x}_2) = 6.616 \times 7.563 - 0.185 \times 0.153 = 50.038 - 0.028 = 50.01$$

which matches the kernel value $K(\mathbf{x}_1, \mathbf{x}_2)$ in Figure 5.1b.
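As a cross-check of Example 5.5, the sketch below (again Python/NumPy, our own illustrative choice) computes the data-specific Mercer map of Eq. (5.7) from the eigendecomposition of the linear kernel matrix:

    import numpy as np

    X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
    K = X @ X.T                          # linear kernel matrix of Figure 5.1b

    lam, U = np.linalg.eigh(K)           # eigh returns ascending order
    lam, U = lam[::-1], U[:, ::-1]       # reorder to non-increasing eigenvalues
    lam = np.clip(lam, 0.0, None)        # clip tiny negative round-off values

    Phi = U * np.sqrt(lam)               # row i is phi(x_i)^T, per Eq. (5.7)
    print(np.round(lam, 2))              # [223.95  1.29  0.  0.  0.]
    print(np.round(Phi[0] @ Phi[1], 2))  # 50.01 = K(x1, x2)

Note that Phi @ Phi.T equals $\mathbf{U}\mathbf{\Lambda}\mathbf{U}^T = \mathbf{K}$, so all pairwise kernel values are recovered (eigenvector signs may differ from the text, which does not affect the dot products), though the map is valid only for the $n$ points used to build $\mathbf{K}$.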
Mercer Kernel Map

For compact continuous spaces, analogous to the discrete case in Eq. (5.5), the kernel value between any two points can be written as the infinite spectral decomposition

$$K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{\infty} \lambda_k\, \mathbf{u}_k(\mathbf{x}_i)\, \mathbf{u}_k(\mathbf{x}_j)$$

where $\{\lambda_1, \lambda_2, \ldots\}$ is the infinite set of eigenvalues, and $\mathbf{u}_1(\cdot), \mathbf{u}_2(\cdot), \ldots$ is the corresponding set of orthogonal and normalized eigenfunctions, that is, each function $\mathbf{u}_i(\cdot)$ is a solution to the integral equation

$$\int K(\mathbf{x}, \mathbf{y})\, \mathbf{u}_i(\mathbf{y})\; d\mathbf{y} = \lambda_i\, \mathbf{u}_i(\mathbf{x})$$

and $K$ is a continuous positive semidefinite kernel, that is, for all functions $a(\cdot)$ with a finite square integral (i.e., $\int a(\mathbf{x})^2\, d\mathbf{x} < \infty$), $K$ satisfies the condition

$$\iint K(\mathbf{x}_1, \mathbf{x}_2)\, a(\mathbf{x}_1)\, a(\mathbf{x}_2)\; d\mathbf{x}_1\, d\mathbf{x}_2 \ge 0$$

We can see that this positive semidefinite kernel for compact continuous spaces is analogous to the discrete kernel in Eq. (5.2). Further, similarly to the data-specific Mercer map [Eq. (5.6)], the general Mercer kernel map is given as

$$\phi(\mathbf{x}_i) = \big(\sqrt{\lambda_1}\, \mathbf{u}_1(\mathbf{x}_i),\; \sqrt{\lambda_2}\, \mathbf{u}_2(\mathbf{x}_i),\; \ldots\big)^T$$

with the kernel value being equivalent to the dot product between two mapped points:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$
5.2 VECTOR KERNELS

Kernels that map an (input) vector space into another (feature) vector space are called vector kernels. For multivariate input data, the input vector space will be the $d$-dimensional real space $\mathbb{R}^d$. Let $\mathbf{D}$ comprise $n$ input points $\mathbf{x}_i \in \mathbb{R}^d$, for $i = 1, 2, \ldots, n$. We now consider two of the most commonly used (nonlinear) kernel functions over vector data, namely the polynomial and Gaussian kernels, as described next.
Polynomial Kernel

Polynomial kernels are of two types: homogeneous or inhomogeneous. Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$. The homogeneous polynomial kernel is defined as

$$K_q(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^T \phi(\mathbf{y}) = (\mathbf{x}^T \mathbf{y})^q \tag{5.8}$$

where $q$ is the degree of the polynomial. This kernel corresponds to a feature space spanned by all products of exactly $q$ attributes.

The most typical cases are the linear (with $q = 1$) and quadratic (with $q = 2$) kernels, given as

$$K_1(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} \qquad K_2(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y})^2$$

The inhomogeneous polynomial kernel is defined as

$$K_q(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^T \phi(\mathbf{y}) = (c + \mathbf{x}^T \mathbf{y})^q \tag{5.9}$$

where $q$ is the degree of the polynomial, and $c \ge 0$ is some constant. When $c = 0$ we obtain the homogeneous kernel. When $c > 0$, this kernel corresponds to the feature space spanned by all products of at most $q$ attributes. This can be seen from the binomial expansion

$$K_q(\mathbf{x}, \mathbf{y}) = (c + \mathbf{x}^T \mathbf{y})^q = \sum_{k=0}^{q} \binom{q}{k}\, c^{q-k}\, \big(\mathbf{x}^T \mathbf{y}\big)^k$$

For example, for the typical value of $c = 1$, the inhomogeneous kernel is a weighted sum of the homogeneous polynomial kernels for all powers up to $q$, that is,

$$(1 + \mathbf{x}^T \mathbf{y})^q = 1 + q\, \mathbf{x}^T \mathbf{y} + \binom{q}{2} \big(\mathbf{x}^T \mathbf{y}\big)^2 + \cdots + q\, \big(\mathbf{x}^T \mathbf{y}\big)^{q-1} + \big(\mathbf{x}^T \mathbf{y}\big)^q$$
Example 5.6. Consider the points $\mathbf{x}_1$ and $\mathbf{x}_2$ in Figure 5.1:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad \mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$

The homogeneous quadratic kernel is given as

$$K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1^T \mathbf{x}_2)^2 = 50.01^2 = 2501$$

The inhomogeneous quadratic kernel is given as

$$K(\mathbf{x}_1, \mathbf{x}_2) = (1 + \mathbf{x}_1^T \mathbf{x}_2)^2 = (1 + 50.01)^2 = 51.01^2 = 2602.02$$
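Both variants are one-liners; the sketch below (Python/NumPy, our own illustration) reproduces the numbers just computed:

    import numpy as np

    def poly_kernel(x, y, q=2, c=0.0):
        """(c + x^T y)^q: c = 0 gives Eq. (5.8), c > 0 gives Eq. (5.9)."""
        return (c + x @ y) ** q

    x1, x2 = np.array([5.9, 3.0]), np.array([6.9, 3.1])
    print(poly_kernel(x1, x2, q=2, c=0.0))   # 2501.0001: homogeneous quadratic
    print(poly_kernel(x1, x2, q=2, c=1.0))   # 2602.0201: inhomogeneous quadratic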
For the polynomial kernel it is possible to construct a mapping $\phi$ from the input to the feature space. Let $n_0, n_1, \ldots, n_d$ denote non-negative integers, such that $\sum_{i=0}^{d} n_i = q$. Further, let $\mathbf{n} = (n_0, n_1, \ldots, n_d)$, and let $|\mathbf{n}| = \sum_{i=0}^{d} n_i = q$. Also, let $\binom{q}{\mathbf{n}}$ denote the multinomial coefficient

$$\binom{q}{\mathbf{n}} = \binom{q}{n_0, n_1, \ldots, n_d} = \frac{q!}{n_0!\, n_1! \cdots n_d!}$$

The multinomial expansion of the inhomogeneous kernel is then given as

$$K_q(\mathbf{x}, \mathbf{y}) = (c + \mathbf{x}^T \mathbf{y})^q = \Big(c + \sum_{k=1}^{d} x_k y_k\Big)^q = (c + x_1 y_1 + \cdots + x_d y_d)^q$$
$$= \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}}\, c^{n_0}\, (x_1 y_1)^{n_1} (x_2 y_2)^{n_2} \cdots (x_d y_d)^{n_d}$$
$$= \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}}\, c^{n_0}\, \big(x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d}\big) \big(y_1^{n_1} y_2^{n_2} \cdots y_d^{n_d}\big)$$
$$= \sum_{|\mathbf{n}|=q} \bigg( \sqrt{a_{\mathbf{n}}}\, \prod_{k=1}^{d} x_k^{n_k} \bigg) \bigg( \sqrt{a_{\mathbf{n}}}\, \prod_{k=1}^{d} y_k^{n_k} \bigg) = \phi(\mathbf{x})^T \phi(\mathbf{y})$$

where $a_{\mathbf{n}} = \binom{q}{\mathbf{n}}\, c^{n_0}$, and the summation is over all $\mathbf{n} = (n_0, n_1, \ldots, n_d)$ such that $|\mathbf{n}| = n_0 + n_1 + \cdots + n_d = q$. Using the notation $\mathbf{x}^{\mathbf{n}} = \prod_{k=1}^{d} x_k^{n_k}$, the mapping $\phi: \mathbb{R}^d \to \mathbb{R}^m$ is given as the vector

$$\phi(\mathbf{x}) = \big( \ldots,\; \sqrt{a_{\mathbf{n}}}\, \mathbf{x}^{\mathbf{n}},\; \ldots \big)^T = \bigg( \ldots,\; \sqrt{\binom{q}{\mathbf{n}} c^{n_0}}\, \prod_{k=1}^{d} x_k^{n_k},\; \ldots \bigg)^T$$

where the variable $\mathbf{n} = (n_0, \ldots, n_d)$ ranges over all the possible assignments such that $|\mathbf{n}| = q$. It can be shown that the dimensionality of the feature space is given as

$$m = \binom{d + q}{q}$$
Example 5.7 (Quadratic Polynomial Kernel). Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ and let $c = 1$. The inhomogeneous quadratic polynomial kernel is given as

$$K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T \mathbf{y})^2 = (1 + x_1 y_1 + x_2 y_2)^2$$

The set of all assignments $\mathbf{n} = (n_0, n_1, n_2)$, such that $|\mathbf{n}| = q = 2$, and the corresponding terms in the multinomial expansion are shown below.

    Assignment n = (n0, n1, n2)   Coefficient a_n = (q choose n) c^{n0}   Variables x^n y^n
    (1, 1, 0)                     2                                       x1 y1
    (1, 0, 1)                     2                                       x2 y2
    (0, 1, 1)                     2                                       x1 y1 x2 y2
    (2, 0, 0)                     1                                       1
    (0, 2, 0)                     1                                       (x1 y1)^2
    (0, 0, 2)                     1                                       (x2 y2)^2

Thus, the kernel can be written as

$$K(\mathbf{x}, \mathbf{y}) = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 y_1 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
$$= \big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big)\, \big(1, \sqrt{2} y_1, \sqrt{2} y_2, \sqrt{2} y_1 y_2, y_1^2, y_2^2\big)^T = \phi(\mathbf{x})^T \phi(\mathbf{y})$$

When the input space is $\mathbb{R}^2$, the dimensionality of the feature space is given as

$$m = \binom{d+q}{q} = \binom{2+2}{2} = \binom{4}{2} = 6$$

In this case the inhomogeneous quadratic kernel with $c = 1$ corresponds to the mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^6$, given as

$$\phi(\mathbf{x}) = \big(1,\; \sqrt{2} x_1,\; \sqrt{2} x_2,\; \sqrt{2} x_1 x_2,\; x_1^2,\; x_2^2\big)^T$$

For example, for $\mathbf{x}_1 = (5.9, 3)^T$ and $\mathbf{x}_2 = (6.9, 3.1)^T$, we have

$$\phi(\mathbf{x}_1) = \big(1, \sqrt{2} \cdot 5.9, \sqrt{2} \cdot 3, \sqrt{2} \cdot 5.9 \cdot 3, 5.9^2, 3^2\big)^T = (1, 8.34, 4.24, 25.03, 34.81, 9)^T$$

$$\phi(\mathbf{x}_2) = \big(1, \sqrt{2} \cdot 6.9, \sqrt{2} \cdot 3.1, \sqrt{2} \cdot 6.9 \cdot 3.1, 6.9^2, 3.1^2\big)^T = (1, 9.76, 4.38, 30.25, 47.61, 9.61)^T$$

Thus, the inhomogeneous kernel value is

$$\phi(\mathbf{x}_1)^T \phi(\mathbf{x}_2) = 1 + 81.40 + 18.57 + 757.16 + 1657.30 + 86.49 = 2601.92$$

On the other hand, when the input space is $\mathbb{R}^2$, the homogeneous quadratic kernel corresponds to the mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^3$, defined as

$$\phi(\mathbf{x}) = \big(\sqrt{2} x_1 x_2,\; x_1^2,\; x_2^2\big)^T$$

because only the degree-2 terms are considered. For example, for $\mathbf{x}_1$ and $\mathbf{x}_2$, we have

$$\phi(\mathbf{x}_1) = \big(\sqrt{2} \cdot 5.9 \cdot 3, 5.9^2, 3^2\big)^T = (25.03, 34.81, 9)^T$$

$$\phi(\mathbf{x}_2) = \big(\sqrt{2} \cdot 6.9 \cdot 3.1, 6.9^2, 3.1^2\big)^T = (30.25, 47.61, 9.61)^T$$

and thus

$$K(\mathbf{x}_1, \mathbf{x}_2) = \phi(\mathbf{x}_1)^T \phi(\mathbf{x}_2) = 757.16 + 1657.3 + 86.49 = 2500.95$$

These values essentially match those shown in Example 5.6 up to four significant digits.
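The explicit 6-dimensional map and the kernel trick can be compared directly; the following sketch (our own illustration; the helper name phi_quadratic is hypothetical) confirms that they agree:

    import numpy as np

    def phi_quadratic(x):
        """Explicit feature map for the inhomogeneous quadratic kernel
        (c = 1) on R^2, as in Example 5.7."""
        x1, x2 = x
        return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                         np.sqrt(2)*x1*x2, x1**2, x2**2])

    x1, x2 = np.array([5.9, 3.0]), np.array([6.9, 3.1])
    lhs = phi_quadratic(x1) @ phi_quadratic(x2)   # explicit 6-d dot product
    rhs = (1 + x1 @ x2) ** 2                      # kernel trick, no mapping
    print(round(lhs, 2), round(rhs, 2))           # both 2602.02

The kernel-trick side costs $O(d)$ regardless of $q$, whereas the explicit map grows as $\binom{d+q}{q}$, which is the practical point of the kernel formulation.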
Gaussian Kernel

The Gaussian kernel, also called the Gaussian radial basis function (RBF) kernel, is defined as

$$K(\mathbf{x}, \mathbf{y}) = \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right\} \tag{5.10}$$

where $\sigma > 0$ is the spread parameter that plays the same role as the standard deviation in a normal density function. Note that $K(\mathbf{x}, \mathbf{x}) = 1$, and further that the kernel value is inversely related to the distance between the two points $\mathbf{x}$ and $\mathbf{y}$.
Example 5.8. Consider again the points $\mathbf{x}_1$ and $\mathbf{x}_2$ in Figure 5.1:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad \mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$

The squared distance between them is given as

$$\|\mathbf{x}_1 - \mathbf{x}_2\|^2 = \|(-1, -0.1)^T\|^2 = 1^2 + 0.1^2 = 1.01$$

With $\sigma = 1$, the Gaussian kernel is

$$K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left\{ -\frac{1.01}{2} \right\} = \exp\{-0.51\} = 0.6$$
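A direct implementation of Eq. (5.10) (Python/NumPy, our own illustration):

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        """Gaussian (RBF) kernel of Eq. (5.10)."""
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    x1, x2 = np.array([5.9, 3.0]), np.array([6.9, 3.1])
    print(round(gaussian_kernel(x1, x2), 2))   # 0.6, as in Example 5.8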
It is interesting to note that a feature space for the Gaussian kernel has infinite dimensionality. To see this, note that the exponential function can be written as the infinite expansion

$$\exp\{a\} = \sum_{n=0}^{\infty} \frac{a^n}{n!} = 1 + a + \frac{1}{2!} a^2 + \frac{1}{3!} a^3 + \cdots$$

Further, using $\gamma = \frac{1}{2\sigma^2}$, and noting that $\|\mathbf{x} - \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\mathbf{x}^T\mathbf{y}$, we can rewrite the Gaussian kernel as follows:

$$K(\mathbf{x}, \mathbf{y}) = \exp\big\{-\gamma \|\mathbf{x} - \mathbf{y}\|^2\big\} = \exp\big\{-\gamma \|\mathbf{x}\|^2\big\} \cdot \exp\big\{-\gamma \|\mathbf{y}\|^2\big\} \cdot \exp\big\{2\gamma\, \mathbf{x}^T\mathbf{y}\big\}$$

In particular, the last term is given as the infinite expansion

$$\exp\big\{2\gamma\, \mathbf{x}^T\mathbf{y}\big\} = \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} \big(\mathbf{x}^T\mathbf{y}\big)^q = 1 + (2\gamma)\, \mathbf{x}^T\mathbf{y} + \frac{(2\gamma)^2}{2!} \big(\mathbf{x}^T\mathbf{y}\big)^2 + \cdots$$

Using the multinomial expansion of $(\mathbf{x}^T\mathbf{y})^q$, we can write the Gaussian kernel as

$$K(\mathbf{x}, \mathbf{y}) = \exp\big\{-\gamma \|\mathbf{x}\|^2\big\}\, \exp\big\{-\gamma \|\mathbf{y}\|^2\big\} \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}} \prod_{k=1}^{d} (x_k y_k)^{n_k}$$
$$= \sum_{q=0}^{\infty} \sum_{|\mathbf{n}|=q} \bigg( \sqrt{a_{q,\mathbf{n}}}\, \exp\big\{-\gamma \|\mathbf{x}\|^2\big\} \prod_{k=1}^{d} x_k^{n_k} \bigg) \bigg( \sqrt{a_{q,\mathbf{n}}}\, \exp\big\{-\gamma \|\mathbf{y}\|^2\big\} \prod_{k=1}^{d} y_k^{n_k} \bigg) = \phi(\mathbf{x})^T \phi(\mathbf{y})$$

where $a_{q,\mathbf{n}} = \frac{(2\gamma)^q}{q!} \binom{q}{\mathbf{n}}$, and $\mathbf{n} = (n_1, n_2, \ldots, n_d)$, with $|\mathbf{n}| = n_1 + n_2 + \cdots + n_d = q$. The mapping into feature space corresponds to the function $\phi: \mathbb{R}^d \to \mathbb{R}^\infty$

$$\phi(\mathbf{x}) = \bigg( \ldots,\; \sqrt{\frac{(2\gamma)^q}{q!} \binom{q}{\mathbf{n}}}\, \exp\big\{-\gamma \|\mathbf{x}\|^2\big\} \prod_{k=1}^{d} x_k^{n_k},\; \ldots \bigg)^T$$

with the dimensions ranging over all degrees $q = 0, \ldots, \infty$, and with the variable $\mathbf{n} = (n_1, \ldots, n_d)$ ranging over all possible assignments such that $|\mathbf{n}| = q$ for each value of $q$. Because $\phi$ maps the input space into an infinite-dimensional feature space, we obviously cannot explicitly transform $\mathbf{x}$ into $\phi(\mathbf{x})$, yet computing the Gaussian kernel $K(\mathbf{x}, \mathbf{y})$ is straightforward.
5.3 BASIC KERNEL OPERATIONS IN FEATURE SPACE

Let us look at some of the basic data analysis tasks that can be performed solely via kernels, without instantiating $\phi(\mathbf{x})$.

Norm of a Point

We can compute the norm of a point $\phi(\mathbf{x})$ in feature space as follows:

$$\|\phi(\mathbf{x})\|^2 = \phi(\mathbf{x})^T \phi(\mathbf{x}) = K(\mathbf{x}, \mathbf{x})$$

which implies that $\|\phi(\mathbf{x})\| = \sqrt{K(\mathbf{x}, \mathbf{x})}$.
Distance between Points

The distance between two points $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ can be computed as

$$\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2 = \|\phi(\mathbf{x}_i)\|^2 + \|\phi(\mathbf{x}_j)\|^2 - 2\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_i) + K(\mathbf{x}_j, \mathbf{x}_j) - 2K(\mathbf{x}_i, \mathbf{x}_j) \tag{5.11}$$

which implies that

$$\delta\big(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\big) = \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\| = \sqrt{K(\mathbf{x}_i, \mathbf{x}_i) + K(\mathbf{x}_j, \mathbf{x}_j) - 2K(\mathbf{x}_i, \mathbf{x}_j)}$$

Rearranging Eq. (5.11), we can see that the kernel value can be considered as a measure of the similarity between two points, as

$$\frac{1}{2}\Big( \|\phi(\mathbf{x}_i)\|^2 + \|\phi(\mathbf{x}_j)\|^2 - \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2 \Big) = K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Thus, the larger the distance $\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|$ between the two points in feature space, the smaller the kernel value, that is, the smaller the similarity.
Example 5.9. Consider the two points $\mathbf{x}_1$ and $\mathbf{x}_2$ in Figure 5.1:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad \mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$

Assuming the homogeneous quadratic kernel, the norm of $\phi(\mathbf{x}_1)$ can be computed as

$$\|\phi(\mathbf{x}_1)\|^2 = K(\mathbf{x}_1, \mathbf{x}_1) = (\mathbf{x}_1^T \mathbf{x}_1)^2 = 43.81^2 = 1919.32$$

which implies that the norm of the transformed point is $\|\phi(\mathbf{x}_1)\| = \sqrt{43.81^2} = 43.81$.

The distance between $\phi(\mathbf{x}_1)$ and $\phi(\mathbf{x}_2)$ in feature space is given as

$$\delta\big(\phi(\mathbf{x}_1), \phi(\mathbf{x}_2)\big) = \sqrt{K(\mathbf{x}_1, \mathbf{x}_1) + K(\mathbf{x}_2, \mathbf{x}_2) - 2K(\mathbf{x}_1, \mathbf{x}_2)} = \sqrt{1919.32 + 3274.13 - 2 \cdot 2501} = \sqrt{191.45} = 13.84$$
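Both operations reduce to a handful of kernel evaluations; a minimal sketch (our own illustration) for the homogeneous quadratic kernel:

    import numpy as np

    def k_quad(x, y):
        return (x @ y) ** 2                   # homogeneous quadratic kernel

    def feature_norm(K, x):
        return np.sqrt(K(x, x))               # ||phi(x)||

    def feature_distance(K, x, y):
        """Distance in feature space via Eq. (5.11), using only kernel calls."""
        return np.sqrt(K(x, x) + K(y, y) - 2 * K(x, y))

    x1, x2 = np.array([5.9, 3.0]), np.array([6.9, 3.1])
    print(round(feature_norm(k_quad, x1), 2))          # 43.81
    print(round(feature_distance(k_quad, x1, x2), 2))  # 13.84, per Example 5.9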
Mean in Feature Space

The mean of the points in feature space is given as

$$\boldsymbol{\mu}_\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i)$$

Because we do not, in general, have access to $\phi(\mathbf{x}_i)$, we cannot explicitly compute the mean point in feature space. Nevertheless, we can compute the squared norm of the mean as follows:

$$\|\boldsymbol{\mu}_\phi\|^2 = \boldsymbol{\mu}_\phi^T \boldsymbol{\mu}_\phi = \bigg( \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i) \bigg)^T \bigg( \frac{1}{n} \sum_{j=1}^{n} \phi(\mathbf{x}_j) \bigg) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) \tag{5.12}$$

The above derivation implies that the squared norm of the mean in feature space is simply the average of the values in the kernel matrix $\mathbf{K}$.
Example 5.10. Consider the five points from Example 5.3, also shown in Figure 5.1. Example 5.4 showed the norm of the mean for the linear kernel. Let us consider the Gaussian kernel with $\sigma = 1$. The Gaussian kernel matrix is given as

$$\mathbf{K} = \begin{pmatrix}
1.00 & 0.60 & 0.78 & 0.42 & 0.72 \\
0.60 & 1.00 & 0.94 & 0.07 & 0.44 \\
0.78 & 0.94 & 1.00 & 0.13 & 0.65 \\
0.42 & 0.07 & 0.13 & 1.00 & 0.23 \\
0.72 & 0.44 & 0.65 & 0.23 & 1.00
\end{pmatrix}$$

The squared norm of the mean in feature space is therefore

$$\|\boldsymbol{\mu}_\phi\|^2 = \frac{1}{25} \sum_{i=1}^{5} \sum_{j=1}^{5} K(\mathbf{x}_i, \mathbf{x}_j) = \frac{14.98}{25} = 0.599$$

which implies that $\|\boldsymbol{\mu}_\phi\| = \sqrt{0.599} = 0.774$.
Total Variance in Feature Space

Let us first derive a formula for the squared distance of a point $\phi(\mathbf{x}_i)$ to the mean $\boldsymbol{\mu}_\phi$ in feature space:

$$\|\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\|^2 = \|\phi(\mathbf{x}_i)\|^2 - 2\phi(\mathbf{x}_i)^T \boldsymbol{\mu}_\phi + \|\boldsymbol{\mu}_\phi\|^2 = K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b)$$

The total variance [Eq. (1.4)] in feature space is obtained by taking the average squared deviation of points from the mean in feature space:

$$\sigma_\phi^2 = \frac{1}{n} \sum_{i=1}^{n} \|\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\|^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \bigg( K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b) \bigg)$$
$$= \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{n}{n^3} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b)$$
$$= \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) \tag{5.13}$$

In other words, the total variance in feature space is given as the difference between the average of the diagonal entries and the average of the entire kernel matrix $\mathbf{K}$. Also notice that by Eq. (5.12) the second term is simply $\|\boldsymbol{\mu}_\phi\|^2$.
Example 5.11. Continuing Example 5.10, the total variance in feature space for the five points, for the Gaussian kernel, is given as

$$\sigma_\phi^2 = \bigg( \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) \bigg) - \|\boldsymbol{\mu}_\phi\|^2 = \frac{1}{5} \times 5 - 0.599 = 0.401$$

The distance between $\phi(\mathbf{x}_1)$ and the mean $\boldsymbol{\mu}_\phi$ in feature space is given as

$$\|\phi(\mathbf{x}_1) - \boldsymbol{\mu}_\phi\|^2 = K(\mathbf{x}_1, \mathbf{x}_1) - \frac{2}{5} \sum_{j=1}^{5} K(\mathbf{x}_1, \mathbf{x}_j) + \|\boldsymbol{\mu}_\phi\|^2$$
$$= 1 - \frac{2}{5}\big(1 + 0.6 + 0.78 + 0.42 + 0.72\big) + 0.599 = 1 - 1.410 + 0.599 = 0.189$$
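Both quantities come straight from the kernel matrix. The sketch below (our own illustration) uses the Gaussian kernel matrix of Example 5.10; because the entries shown there are rounded to two decimals, the results differ from the text in the third decimal:

    import numpy as np

    def mean_norm_sq(K):
        """||mu_phi||^2 via Eq. (5.12): the average of all entries of K."""
        return K.mean()

    def total_variance(K):
        """sigma_phi^2 via Eq. (5.13): average diagonal minus overall average."""
        return np.trace(K) / K.shape[0] - K.mean()

    K = np.array([[1.00, 0.60, 0.78, 0.42, 0.72],
                  [0.60, 1.00, 0.94, 0.07, 0.44],
                  [0.78, 0.94, 1.00, 0.13, 0.65],
                  [0.42, 0.07, 0.13, 1.00, 0.23],
                  [0.72, 0.44, 0.65, 0.23, 1.00]])
    print(round(mean_norm_sq(K), 3))    # 0.598 (0.599 in the text)
    print(round(total_variance(K), 3))  # 0.402 (0.401 in the text)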
Centering in Feature Space

We can center each point in feature space by subtracting the mean from it, as follows:

$$\hat{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi$$

Because we do not have an explicit representation of $\phi(\mathbf{x}_i)$ or $\boldsymbol{\mu}_\phi$, we cannot explicitly center the points. However, we can still compute the centered kernel matrix, that is, the kernel matrix over centered points.

The centered kernel matrix is given as

$$\hat{\mathbf{K}} = \Big\{ \hat{K}(\mathbf{x}_i, \mathbf{x}_j) \Big\}_{i,j=1}^{n}$$

where each cell corresponds to the kernel between centered points, that is

$$\hat{K}(\mathbf{x}_i, \mathbf{x}_j) = \hat{\phi}(\mathbf{x}_i)^T \hat{\phi}(\mathbf{x}_j) = \big(\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\big)^T \big(\phi(\mathbf{x}_j) - \boldsymbol{\mu}_\phi\big)$$
$$= \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) - \phi(\mathbf{x}_i)^T \boldsymbol{\mu}_\phi - \phi(\mathbf{x}_j)^T \boldsymbol{\mu}_\phi + \boldsymbol{\mu}_\phi^T \boldsymbol{\mu}_\phi$$
$$= K(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{n} \sum_{k=1}^{n} \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_k) - \frac{1}{n} \sum_{k=1}^{n} \phi(\mathbf{x}_j)^T \phi(\mathbf{x}_k) + \|\boldsymbol{\mu}_\phi\|^2$$
$$= K(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{n} \sum_{k=1}^{n} K(\mathbf{x}_i, \mathbf{x}_k) - \frac{1}{n} \sum_{k=1}^{n} K(\mathbf{x}_j, \mathbf{x}_k) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b)$$

In other words, we can compute the centered kernel matrix using only the kernel function. Over all the pairs of points, the centered kernel matrix can be written compactly as follows:

$$\hat{\mathbf{K}} = \mathbf{K} - \frac{1}{n} \mathbf{1}_{n \times n} \mathbf{K} - \frac{1}{n} \mathbf{K} \mathbf{1}_{n \times n} + \frac{1}{n^2} \mathbf{1}_{n \times n} \mathbf{K} \mathbf{1}_{n \times n} = \Big( \mathbf{I} - \frac{1}{n} \mathbf{1}_{n \times n} \Big) \mathbf{K} \Big( \mathbf{I} - \frac{1}{n} \mathbf{1}_{n \times n} \Big) \tag{5.14}$$

where $\mathbf{1}_{n \times n}$ is the $n \times n$ singular matrix, all of whose entries equal 1.
Example 5.12. Consider the first five points from the 2-dimensional Iris dataset shown in Figure 5.1a:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \quad \mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} \quad \mathbf{x}_3 = \begin{pmatrix} 6.6 \\ 2.9 \end{pmatrix} \quad \mathbf{x}_4 = \begin{pmatrix} 4.6 \\ 3.2 \end{pmatrix} \quad \mathbf{x}_5 = \begin{pmatrix} 6 \\ 2.2 \end{pmatrix}$$

Consider the linear kernel matrix shown in Figure 5.1b. We can center it by first computing

$$\mathbf{I} - \frac{1}{5} \mathbf{1}_{5 \times 5} = \begin{pmatrix}
0.8 & -0.2 & -0.2 & -0.2 & -0.2 \\
-0.2 & 0.8 & -0.2 & -0.2 & -0.2 \\
-0.2 & -0.2 & 0.8 & -0.2 & -0.2 \\
-0.2 & -0.2 & -0.2 & 0.8 & -0.2 \\
-0.2 & -0.2 & -0.2 & -0.2 & 0.8
\end{pmatrix}$$

The centered kernel matrix [Eq. (5.14)] is given as

$$\hat{\mathbf{K}} = \Big( \mathbf{I} - \tfrac{1}{5} \mathbf{1}_{5 \times 5} \Big) \cdot \begin{pmatrix}
43.81 & 50.01 & 47.64 & 36.74 & 42.00 \\
50.01 & 57.22 & 54.53 & 41.66 & 48.22 \\
47.64 & 54.53 & 51.97 & 39.64 & 45.98 \\
36.74 & 41.66 & 39.64 & 31.40 & 34.64 \\
42.00 & 48.22 & 45.98 & 34.64 & 40.84
\end{pmatrix} \cdot \Big( \mathbf{I} - \tfrac{1}{5} \mathbf{1}_{5 \times 5} \Big)$$
$$= \begin{pmatrix}
0.02 & -0.06 & -0.06 & 0.18 & -0.08 \\
-0.06 & 0.86 & 0.54 & -1.19 & -0.15 \\
-0.06 & 0.54 & 0.36 & -0.83 & -0.01 \\
0.18 & -1.19 & -0.83 & 2.06 & -0.22 \\
-0.08 & -0.15 & -0.01 & -0.22 & 0.46
\end{pmatrix}$$

To verify that $\hat{\mathbf{K}}$ is the same as the kernel matrix for the centered points, let us first center the points by subtracting the mean $\boldsymbol{\mu} = (6.0, 2.88)^T$. The centered points in feature space are given as

$$\mathbf{z}_1 = \begin{pmatrix} -0.1 \\ 0.12 \end{pmatrix} \quad \mathbf{z}_2 = \begin{pmatrix} 0.9 \\ 0.22 \end{pmatrix} \quad \mathbf{z}_3 = \begin{pmatrix} 0.6 \\ 0.02 \end{pmatrix} \quad \mathbf{z}_4 = \begin{pmatrix} -1.4 \\ 0.32 \end{pmatrix} \quad \mathbf{z}_5 = \begin{pmatrix} 0.0 \\ -0.68 \end{pmatrix}$$

For example, the kernel between $\phi(\mathbf{z}_1)$ and $\phi(\mathbf{z}_2)$ is

$$\phi(\mathbf{z}_1)^T \phi(\mathbf{z}_2) = \mathbf{z}_1^T \mathbf{z}_2 = -0.09 + 0.03 = -0.06$$

which matches $\hat{K}(\mathbf{x}_1, \mathbf{x}_2)$, as expected. The other entries can be verified in a similar manner. Thus, the kernel matrix obtained by centering the data and then computing the kernel is the same as that obtained via Eq. (5.14).
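Eq. (5.14) is a one-line matrix expression; the sketch below (our own illustration) verifies it against explicit centering for the linear kernel:

    import numpy as np

    def center_kernel(K):
        """Centered kernel matrix via Eq. (5.14)."""
        n = K.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n
        return C @ K @ C

    X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
    K_hat = center_kernel(X @ X.T)
    Z = X - X.mean(axis=0)              # explicitly centered points
    print(np.allclose(K_hat, Z @ Z.T))  # True, as verified in Example 5.12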
Normalizing in Feature Space

A common form of normalization is to ensure that points in feature space have unit length by replacing $\phi(\mathbf{x}_i)$ with the corresponding unit vector $\phi_n(\mathbf{x}_i) = \frac{\phi(\mathbf{x}_i)}{\|\phi(\mathbf{x}_i)\|}$. The dot product in feature space then corresponds to the cosine of the angle between the two mapped points, because

$$\phi_n(\mathbf{x}_i)^T \phi_n(\mathbf{x}_j) = \frac{\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}{\|\phi(\mathbf{x}_i)\| \cdot \|\phi(\mathbf{x}_j)\|} = \cos \theta$$

If the mapped points are both centered and normalized, then a dot product corresponds to the correlation between the two points in feature space.

The normalized kernel matrix, $\mathbf{K}_n$, can be computed using only the kernel function $K$, as

$$K_n(\mathbf{x}_i, \mathbf{x}_j) = \frac{\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}{\|\phi(\mathbf{x}_i)\| \cdot \|\phi(\mathbf{x}_j)\|} = \frac{K(\mathbf{x}_i, \mathbf{x}_j)}{\sqrt{K(\mathbf{x}_i, \mathbf{x}_i) \cdot K(\mathbf{x}_j, \mathbf{x}_j)}}$$

$\mathbf{K}_n$ has all diagonal elements equal to 1.

Let $\mathbf{W}$ denote the diagonal matrix comprising the diagonal elements of $\mathbf{K}$:

$$\mathbf{W} = \mathrm{diag}(\mathbf{K}) = \begin{pmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & 0 & \cdots & 0 \\
0 & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & K(\mathbf{x}_n, \mathbf{x}_n)
\end{pmatrix}$$

The normalized kernel matrix can then be expressed compactly as

$$\mathbf{K}_n = \mathbf{W}^{-1/2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1/2}$$

where $\mathbf{W}^{-1/2}$ is the diagonal matrix defined by $\mathbf{W}^{-1/2}(\mathbf{x}_i, \mathbf{x}_i) = \frac{1}{\sqrt{K(\mathbf{x}_i, \mathbf{x}_i)}}$, with all other elements being zero.
Example 5.13. Consider the five points and the linear kernel matrix shown in Figure 5.1. We have

$$\mathbf{W} = \begin{pmatrix}
43.81 & 0 & 0 & 0 & 0 \\
0 & 57.22 & 0 & 0 & 0 \\
0 & 0 & 51.97 & 0 & 0 \\
0 & 0 & 0 & 31.40 & 0 \\
0 & 0 & 0 & 0 & 40.84
\end{pmatrix}$$

The normalized kernel is given as

$$\mathbf{K}_n = \mathbf{W}^{-1/2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1/2} = \begin{pmatrix}
1.0000 & 0.9988 & 0.9984 & 0.9906 & 0.9929 \\
0.9988 & 1.0000 & 0.9999 & 0.9828 & 0.9975 \\
0.9984 & 0.9999 & 1.0000 & 0.9812 & 0.9980 \\
0.9906 & 0.9828 & 0.9812 & 1.0000 & 0.9673 \\
0.9929 & 0.9975 & 0.9980 & 0.9673 & 1.0000
\end{pmatrix}$$

The same kernel is obtained if we first normalize the feature vectors to have unit length and then take the dot products. For example, with the linear kernel, the normalized point $\phi_n(\mathbf{x}_1)$ is given as

$$\phi_n(\mathbf{x}_1) = \frac{\phi(\mathbf{x}_1)}{\|\phi(\mathbf{x}_1)\|} = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} = \frac{1}{\sqrt{43.81}} \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.8914 \\ 0.4532 \end{pmatrix}$$

Likewise, we have $\phi_n(\mathbf{x}_2) = \frac{1}{\sqrt{57.22}} \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} = \begin{pmatrix} 0.9122 \\ 0.4098 \end{pmatrix}$. Their dot product is

$$\phi_n(\mathbf{x}_1)^T \phi_n(\mathbf{x}_2) = 0.8914 \cdot 0.9122 + 0.4532 \cdot 0.4098 = 0.9988$$

which matches $K_n(\mathbf{x}_1, \mathbf{x}_2)$.

If we start with the centered kernel matrix $\hat{\mathbf{K}}$ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix $\hat{\mathbf{K}}_n$:

$$\hat{\mathbf{K}}_n = \begin{pmatrix}
1.00 & -0.44 & -0.61 & 0.80 & -0.77 \\
-0.44 & 1.00 & 0.98 & -0.89 & -0.24 \\
-0.61 & 0.98 & 1.00 & -0.97 & -0.03 \\
0.80 & -0.89 & -0.97 & 1.00 & -0.22 \\
-0.77 & -0.24 & -0.03 & -0.22 & 1.00
\end{pmatrix}$$

As noted earlier, the kernel value $\hat{K}_n(\mathbf{x}_i, \mathbf{x}_j)$ denotes the correlation between $\mathbf{x}_i$ and $\mathbf{x}_j$ in feature space, that is, it is the cosine of the angle between the centered points $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$.
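Normalization likewise needs only the diagonal of $\mathbf{K}$; a minimal sketch (our own illustration):

    import numpy as np

    def normalize_kernel(K):
        """K_n = W^{-1/2} K W^{-1/2}, where W = diag(K)."""
        w = 1.0 / np.sqrt(np.diag(K))
        return K * np.outer(w, w)

    X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
    K_n = normalize_kernel(X @ X.T)
    print(round(K_n[0, 1], 4))          # 0.9988, as in Example 5.13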
5.4 KERNELS FOR COMPLEX OBJECTS

We conclude this chapter with some examples of kernels defined for complex data such as strings and graphs. The use of kernels for dimensionality reduction is described in Section 7.3, for clustering in Section 13.2 and Chapter 16, for discriminant analysis in Section 20.2, and for classification in Sections 21.4 and 21.5.
5.4.1 Spectrum Kernel for Strings

Consider text or sequence data defined over an alphabet $\Sigma$. The $l$-spectrum feature map is the mapping $\phi: \Sigma^* \to \mathbb{R}^{|\Sigma|^l}$ from the set of substrings over $\Sigma$ to the $|\Sigma|^l$-dimensional space representing the number of occurrences of all possible substrings of length $l$, defined as

$$\phi(\mathbf{x}) = \big( \cdots, \#(\alpha), \cdots \big)^T_{\alpha \in \Sigma^l}$$

where $\#(\alpha)$ is the number of occurrences of the $l$-length string $\alpha$ in $\mathbf{x}$.

The (full) spectrum map is an extension of the $l$-spectrum map, obtained by considering all lengths from $l = 0$ to $l = \infty$, leading to an infinite-dimensional feature map $\phi: \Sigma^* \to \mathbb{R}^\infty$:

$$\phi(\mathbf{x}) = \big( \cdots, \#(\alpha), \cdots \big)^T_{\alpha \in \Sigma^*}$$

where $\#(\alpha)$ is the number of occurrences of the string $\alpha$ in $\mathbf{x}$.

The ($l$-)spectrum kernel between two strings $\mathbf{x}_i, \mathbf{x}_j$ is simply the dot product between their ($l$-)spectrum maps:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

A naive computation of the $l$-spectrum kernel takes $O(|\Sigma|^l)$ time. However, for a given string $\mathbf{x}$ of length $n$, the vast majority of the $l$-length strings have an occurrence count of zero, which can be ignored. The $l$-spectrum map can be effectively computed in $O(n)$ time for a string of length $n$ (assuming $n \gg l$) because there can be at most $n - l + 1$ substrings of length $l$, and the $l$-spectrum kernel can thus be computed in $O(n + m)$ time for any two strings of length $n$ and $m$, respectively.

The feature map for the (full) spectrum kernel is infinite dimensional, but once again, for a given string $\mathbf{x}$ of length $n$, the vast majority of the strings will have an occurrence count of zero. A straightforward implementation of the spectrum map for a string $\mathbf{x}$ of length $n$ can be computed in $O(n^2)$ time because $\mathbf{x}$ can have at most $\sum_{l=1}^{n} (n - l + 1) = n(n+1)/2$ distinct nonempty substrings. The spectrum kernel can then be computed in $O(n^2 + m^2)$ time for any two strings of length $n$ and $m$, respectively. However, a much more efficient computation is enabled via suffix trees (see Chapter 10), with a total time of $O(n + m)$.
Example 5.14. Consider sequences over the DNA alphabet $\Sigma = \{A, C, G, T\}$. Let $\mathbf{x}_1 = ACAGCAGTA$, and let $\mathbf{x}_2 = AGCAAGCGAG$. For $l = 3$, the feature space has dimensionality $|\Sigma|^l = 4^3 = 64$. Nevertheless, we do not have to map the input points into the full feature space; we can compute the reduced 3-spectrum mapping by counting the number of occurrences for only the length-3 substrings that occur in each input sequence, as follows:

$$\phi(\mathbf{x}_1) = (ACA{:}\,1,\; AGC{:}\,1,\; AGT{:}\,1,\; CAG{:}\,2,\; GCA{:}\,1,\; GTA{:}\,1)$$
$$\phi(\mathbf{x}_2) = (AAG{:}\,1,\; AGC{:}\,2,\; CAA{:}\,1,\; CGA{:}\,1,\; GAG{:}\,1,\; GCA{:}\,1,\; GCG{:}\,1)$$

where the notation $\alpha{:}\,\#(\alpha)$ denotes that substring $\alpha$ has $\#(\alpha)$ occurrences in $\mathbf{x}_i$. We can then compute the dot product by considering only the common substrings, as follows:

$$K(\mathbf{x}_1, \mathbf{x}_2) = 1 \times 2 + 1 \times 1 = 2 + 1 = 3$$

The first term in the dot product is due to the substring AGC, and the second is due to GCA, which are the only common length-3 substrings between $\mathbf{x}_1$ and $\mathbf{x}_2$.

The full spectrum can be computed by considering the occurrences of all common substrings over all possible lengths. For $\mathbf{x}_1$ and $\mathbf{x}_2$, the common substrings and their occurrence counts are given as

    alpha             A   C   G   AG   CA   AGC   GCA   AGCA
    #(alpha) in x1    4   2   2   2    2    1     1     1
    #(alpha) in x2    4   2   4   3    1    2     1     1

Thus, the full spectrum kernel value is given as

$$K(\mathbf{x}_1, \mathbf{x}_2) = 16 + 4 + 8 + 6 + 2 + 2 + 1 + 1 = 40$$
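Substring counting makes the spectrum kernels easy to sketch (Python, our own illustration; a production implementation would use suffix trees as noted above):

    from collections import Counter

    def l_spectrum_kernel(x, y, l):
        """l-spectrum kernel: dot product of length-l substring counts."""
        cx = Counter(x[i:i + l] for i in range(len(x) - l + 1))
        cy = Counter(y[i:i + l] for i in range(len(y) - l + 1))
        return sum(cx[a] * cy[a] for a in cx.keys() & cy.keys())

    def full_spectrum_kernel(x, y):
        """Full spectrum kernel: sum of l-spectrum kernels over all lengths."""
        return sum(l_spectrum_kernel(x, y, l)
                   for l in range(1, min(len(x), len(y)) + 1))

    x1, x2 = "ACAGCAGTA", "AGCAAGCGAG"
    print(l_spectrum_kernel(x1, x2, 3))   # 3, as in Example 5.14
    print(full_spectrum_kernel(x1, x2))   # 40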
5.4.2 Diffusion Kernels on Graph Nodes

Let $\mathbf{S}$ be some symmetric similarity matrix between nodes of a graph $G = (V, E)$. For instance, $\mathbf{S}$ can be the (weighted) adjacency matrix $\mathbf{A}$ [Eq. (4.1)] or the Laplacian matrix $\mathbf{L} = \boldsymbol{\Delta} - \mathbf{A}$ (or its negation), where $\boldsymbol{\Delta}$ is the degree matrix for an undirected graph $G$, defined as $\boldsymbol{\Delta}(i, i) = d_i$ and $\boldsymbol{\Delta}(i, j) = 0$ for all $i \ne j$, and $d_i$ is the degree of node $i$.

Consider the similarity between any two nodes obtained by summing the product of the similarities over paths of length 2:

$$S^{(2)}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{a=1}^{n} S(\mathbf{x}_i, \mathbf{x}_a)\, S(\mathbf{x}_a, \mathbf{x}_j) = \mathbf{S}_i^T \mathbf{S}_j$$

where

$$\mathbf{S}_i = \big( S(\mathbf{x}_i, \mathbf{x}_1),\; S(\mathbf{x}_i, \mathbf{x}_2),\; \ldots,\; S(\mathbf{x}_i, \mathbf{x}_n) \big)^T$$

denotes the (column) vector representing the $i$th row of $\mathbf{S}$ (and because $\mathbf{S}$ is symmetric, it also denotes the $i$th column of $\mathbf{S}$). Over all pairs of nodes, the similarity matrix over paths of length 2, denoted $\mathbf{S}^{(2)}$, is thus given as the square of the base similarity matrix $\mathbf{S}$:

$$\mathbf{S}^{(2)} = \mathbf{S} \times \mathbf{S} = \mathbf{S}^2$$

In general, if we sum up the product of the base similarities over all $l$-length paths between two nodes, we obtain the $l$-length similarity matrix $\mathbf{S}^{(l)}$, which is simply the $l$th power of $\mathbf{S}$, that is,

$$\mathbf{S}^{(l)} = \mathbf{S}^l$$
Power Kernels

Even path lengths lead to positive semidefinite kernels, but odd path lengths are not guaranteed to do so, unless the base matrix $\mathbf{S}$ is itself a positive semidefinite matrix. In particular, $\mathbf{K} = \mathbf{S}^2$ is a valid kernel. To see this, assume that the $i$th row of $\mathbf{S}$ denotes the feature map for $\mathbf{x}_i$, that is, $\phi(\mathbf{x}_i) = \mathbf{S}_i$. The kernel value between any two points is then a dot product in feature space:

$$K(\mathbf{x}_i, \mathbf{x}_j) = S^{(2)}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{S}_i^T \mathbf{S}_j = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

For a general path length $l$, let $\mathbf{K} = \mathbf{S}^l$. Consider the eigen-decomposition of $\mathbf{S}$:

$$\mathbf{S} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T = \sum_{i=1}^{n} \mathbf{u}_i\, \lambda_i\, \mathbf{u}_i^T$$

where $\mathbf{U}$ is the orthogonal matrix of eigenvectors and $\mathbf{\Lambda}$ is the diagonal matrix of eigenvalues of $\mathbf{S}$:

$$\mathbf{U} = \begin{pmatrix} | & | & & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \\ | & | & & | \end{pmatrix} \qquad \mathbf{\Lambda} = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$

The eigen-decomposition of $\mathbf{K}$ can be obtained as follows:

$$\mathbf{K} = \mathbf{S}^l = \big( \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T \big)^l = \mathbf{U} \mathbf{\Lambda}^l \mathbf{U}^T$$

where we used the fact that eigenvectors of $\mathbf{S}$ and $\mathbf{S}^l$ are identical, and further that the eigenvalues of $\mathbf{S}^l$ are given as $(\lambda_i)^l$ (for all $i = 1, \ldots, n$), where $\lambda_i$ is an eigenvalue of $\mathbf{S}$. For $\mathbf{K} = \mathbf{S}^l$ to be a positive semidefinite matrix, all its eigenvalues must be non-negative, which is guaranteed for all even path lengths. Because $(\lambda_i)^l$ will be negative if $l$ is odd and $\lambda_i$ is negative, odd path lengths lead to a positive semidefinite kernel only if $\mathbf{S}$ is positive semidefinite.
Exponential Diffusion Kernel

Instead of fixing the path length a priori, we can obtain a new kernel between nodes of a graph by considering paths of all possible lengths, but by damping the contribution of longer paths, which leads to the exponential diffusion kernel, defined as

$$\mathbf{K} = \sum_{l=0}^{\infty} \frac{1}{l!} \beta^l \mathbf{S}^l = \mathbf{I} + \beta \mathbf{S} + \frac{1}{2!} \beta^2 \mathbf{S}^2 + \frac{1}{3!} \beta^3 \mathbf{S}^3 + \cdots = \exp\{\beta \mathbf{S}\} \tag{5.15}$$

where $\beta$ is a damping factor, and $\exp\{\beta \mathbf{S}\}$ is the matrix exponential. The series on the right hand side above converges for all $\beta \ge 0$.

Substituting $\mathbf{S} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$ in Eq. (5.15), and utilizing the fact that $\mathbf{U} \mathbf{U}^T = \sum_{i=1}^{n} \mathbf{u}_i \mathbf{u}_i^T = \mathbf{I}$, we have

$$\mathbf{K} = \mathbf{I} + \beta \mathbf{S} + \frac{1}{2!} \beta^2 \mathbf{S}^2 + \cdots$$
$$= \sum_{i=1}^{n} \mathbf{u}_i \mathbf{u}_i^T + \sum_{i=1}^{n} \mathbf{u}_i\, \beta \lambda_i\, \mathbf{u}_i^T + \sum_{i=1}^{n} \mathbf{u}_i\, \frac{1}{2!} \beta^2 \lambda_i^2\, \mathbf{u}_i^T + \cdots$$
$$= \sum_{i=1}^{n} \mathbf{u}_i \Big( 1 + \beta \lambda_i + \frac{1}{2!} \beta^2 \lambda_i^2 + \cdots \Big) \mathbf{u}_i^T = \sum_{i=1}^{n} \mathbf{u}_i\, \exp\{\beta \lambda_i\}\, \mathbf{u}_i^T$$
$$= \mathbf{U} \begin{pmatrix} \exp\{\beta \lambda_1\} & 0 & \cdots & 0 \\ 0 & \exp\{\beta \lambda_2\} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \exp\{\beta \lambda_n\} \end{pmatrix} \mathbf{U}^T \tag{5.16}$$

Thus, the eigenvectors of $\mathbf{K}$ are the same as those for $\mathbf{S}$, whereas its eigenvalues are given as $\exp\{\beta \lambda_i\}$, where $\lambda_i$ is an eigenvalue of $\mathbf{S}$. Further, $\mathbf{K}$ is symmetric because $\mathbf{S}$ is symmetric, and its eigenvalues are real and non-negative because the exponential of a real number is non-negative. $\mathbf{K}$ is thus a positive semidefinite kernel matrix. The complexity of computing the diffusion kernel is $O(n^3)$, corresponding to the complexity of computing the eigen-decomposition.
Von Neumann Diffusion Kernel

A related kernel based on powers of $\mathbf{S}$ is the von Neumann diffusion kernel, defined as

$$\mathbf{K} = \sum_{l=0}^{\infty} \beta^l \mathbf{S}^l \tag{5.17}$$

where $\beta \ge 0$. Expanding Eq. (5.17), we have

$$\mathbf{K} = \mathbf{I} + \beta \mathbf{S} + \beta^2 \mathbf{S}^2 + \beta^3 \mathbf{S}^3 + \cdots = \mathbf{I} + \beta \mathbf{S} \big( \mathbf{I} + \beta \mathbf{S} + \beta^2 \mathbf{S}^2 + \cdots \big) = \mathbf{I} + \beta \mathbf{S} \mathbf{K}$$

Rearranging the terms in the preceding equation, we obtain a closed form expression for the von Neumann kernel:

$$\mathbf{K} - \beta \mathbf{S} \mathbf{K} = \mathbf{I}$$
$$(\mathbf{I} - \beta \mathbf{S})\, \mathbf{K} = \mathbf{I}$$
$$\mathbf{K} = (\mathbf{I} - \beta \mathbf{S})^{-1} \tag{5.18}$$

Plugging in the eigen-decomposition $\mathbf{S} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$, and rewriting $\mathbf{I} = \mathbf{U} \mathbf{U}^T$, we have

$$\mathbf{K} = \Big( \mathbf{U} \mathbf{U}^T - \mathbf{U} (\beta \mathbf{\Lambda}) \mathbf{U}^T \Big)^{-1} = \Big( \mathbf{U} (\mathbf{I} - \beta \mathbf{\Lambda}) \mathbf{U}^T \Big)^{-1} = \mathbf{U} (\mathbf{I} - \beta \mathbf{\Lambda})^{-1} \mathbf{U}^T$$

where $(\mathbf{I} - \beta \mathbf{\Lambda})^{-1}$ is the diagonal matrix whose $i$th diagonal entry is $(1 - \beta \lambda_i)^{-1}$. The eigenvectors of $\mathbf{K}$ and $\mathbf{S}$ are identical, but the eigenvalues of $\mathbf{K}$ are given as $1/(1 - \beta \lambda_i)$. For $\mathbf{K}$ to be a positive semidefinite kernel, all its eigenvalues should be non-negative, which in turn implies that

$$(1 - \beta \lambda_i)^{-1} \ge 0 \qquad\Longrightarrow\qquad 1 - \beta \lambda_i \ge 0 \qquad\Longrightarrow\qquad \beta \le 1/\lambda_i$$

Further, the inverse matrix $(\mathbf{I} - \beta \mathbf{\Lambda})^{-1}$ exists only if

$$\det(\mathbf{I} - \beta \mathbf{\Lambda}) = \prod_{i=1}^{n} (1 - \beta \lambda_i) \ne 0$$

which implies that $\beta \ne 1/\lambda_i$ for all $i$. Thus, for $\mathbf{K}$ to be a valid kernel, we require that $\beta < 1/\lambda_i$ for all $i = 1, \ldots, n$. The von Neumann kernel is therefore guaranteed to be positive semidefinite if $|\beta| < 1/\rho(\mathbf{S})$, where $\rho(\mathbf{S}) = \max_i \{|\lambda_i|\}$ is called the spectral radius of $\mathbf{S}$, defined as the largest eigenvalue of $\mathbf{S}$ in absolute value.
Example 5.15. Consider the graph in Figure 5.2. Its adjacency and degree matrices are given as

$$\mathbf{A} = \begin{pmatrix}
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0
\end{pmatrix} \qquad \boldsymbol{\Delta} = \begin{pmatrix}
2 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 2
\end{pmatrix}$$

[Figure 5.2. Graph diffusion kernel: an undirected graph over the five vertices $v_1, \ldots, v_5$.]

The negated Laplacian matrix for the graph is therefore

$$\mathbf{S} = -\mathbf{L} = \mathbf{A} - \boldsymbol{\Delta} = \begin{pmatrix}
-2 & 0 & 1 & 1 & 0 \\
0 & -2 & 1 & 0 & 1 \\
1 & 1 & -3 & 1 & 0 \\
1 & 0 & 1 & -3 & 1 \\
0 & 1 & 0 & 1 & -2
\end{pmatrix}$$

The eigenvalues of $\mathbf{S}$ are as follows:

$$\lambda_1 = 0 \quad \lambda_2 = -1.38 \quad \lambda_3 = -2.38 \quad \lambda_4 = -3.62 \quad \lambda_5 = -4.62$$

and the eigenvectors of $\mathbf{S}$ (as columns $\mathbf{u}_1, \ldots, \mathbf{u}_5$) are

$$\mathbf{U} = \begin{pmatrix}
0.45 & -0.63 & 0.00 & 0.63 & 0.00 \\
0.45 & 0.51 & -0.60 & 0.20 & -0.37 \\
0.45 & -0.20 & -0.37 & -0.51 & 0.60 \\
0.45 & -0.20 & 0.37 & -0.51 & -0.60 \\
0.45 & 0.51 & 0.60 & 0.20 & 0.37
\end{pmatrix}$$

Assuming $\beta = 0.2$, the exponential diffusion kernel matrix is given as

$$\mathbf{K} = \exp\{0.2\, \mathbf{S}\} = \mathbf{U} \begin{pmatrix} \exp\{0.2 \lambda_1\} & 0 & \cdots & 0 \\ 0 & \exp\{0.2 \lambda_2\} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \exp\{0.2 \lambda_n\} \end{pmatrix} \mathbf{U}^T = \begin{pmatrix}
0.70 & 0.01 & 0.14 & 0.14 & 0.01 \\
0.01 & 0.70 & 0.13 & 0.03 & 0.14 \\
0.14 & 0.13 & 0.59 & 0.13 & 0.03 \\
0.14 & 0.03 & 0.13 & 0.59 & 0.13 \\
0.01 & 0.14 & 0.03 & 0.13 & 0.70
\end{pmatrix}$$

For the von Neumann diffusion kernel, we have

$$(\mathbf{I} - 0.2\, \mathbf{\Lambda})^{-1} = \begin{pmatrix}
1 & 0.00 & 0.00 & 0.00 & 0.00 \\
0 & 0.78 & 0.00 & 0.00 & 0.00 \\
0 & 0.00 & 0.68 & 0.00 & 0.00 \\
0 & 0.00 & 0.00 & 0.58 & 0.00 \\
0 & 0.00 & 0.00 & 0.00 & 0.52
\end{pmatrix}$$

For instance, because $\lambda_2 = -1.38$, we have $1 - \beta \lambda_2 = 1 + 0.2 \times 1.38 = 1.28$, and therefore the second diagonal entry is $(1 - \beta \lambda_2)^{-1} = 1/1.28 = 0.78$. The von Neumann kernel is given as

$$\mathbf{K} = \mathbf{U} (\mathbf{I} - 0.2\, \mathbf{\Lambda})^{-1} \mathbf{U}^T = \begin{pmatrix}
0.75 & 0.02 & 0.11 & 0.11 & 0.02 \\
0.02 & 0.74 & 0.10 & 0.03 & 0.11 \\
0.11 & 0.10 & 0.66 & 0.10 & 0.03 \\
0.11 & 0.03 & 0.10 & 0.66 & 0.10 \\
0.02 & 0.11 & 0.03 & 0.10 & 0.74
\end{pmatrix}$$
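Both diffusion kernels take only a few lines given a matrix exponential or a matrix inverse. The sketch below (Python, our own illustration; it assumes SciPy is available for expm) reproduces the two kernel matrices of Example 5.15:

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[0, 0, 1, 1, 0], [0, 0, 1, 0, 1], [1, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]])
    S = A - np.diag(A.sum(axis=1))        # negated Laplacian S = A - Delta

    beta = 0.2
    K_exp = expm(beta * S)                          # Eq. (5.15)
    K_vn = np.linalg.inv(np.eye(5) - beta * S)      # Eq. (5.18)
    print(np.round(K_exp, 2))   # matches the exponential diffusion kernel above
    print(np.round(K_vn, 2))    # matches the von Neumann kernel above

Here $\beta = 0.2 < 1/\rho(\mathbf{S}) = 1/4.62$, so the von Neumann kernel is indeed positive semidefinite.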
5.5 FURTHER READING

Kernel methods have been extensively studied in machine learning and data mining. For an in-depth introduction and more advanced topics see Schölkopf and Smola (2002) and Shawe-Taylor and Cristianini (2004). For applications of kernel methods in bioinformatics see Schölkopf, Tsuda, and Vert (2004).

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004). Kernel Methods in Computational Biology. Cambridge, MA: MIT Press.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. New York: Cambridge University Press.
5.6 EXERCISES

Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial kernel of degree $q$ is

$$m = \binom{d + q}{q}$$

Q2. Consider the data shown in Table 5.1. Assume the following kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|^2$. Compute the kernel matrix $\mathbf{K}$.

Table 5.1. Dataset for Q2

    i     x_i
    x1    (4, 2.9)
    x2    (2.5, 1)
    x3    (3.5, 4)
    x4    (2, 2.1)
Q3. Show that the eigenvectors of $\mathbf{S}$ and $\mathbf{S}^l$ are identical, and further that the eigenvalues of $\mathbf{S}^l$ are given as $(\lambda_i)^l$ (for all $i = 1, \ldots, n$), where $\lambda_i$ is an eigenvalue of $\mathbf{S}$, and $\mathbf{S}$ is some $n \times n$ symmetric similarity matrix.

Q4. The von Neumann diffusion kernel is a valid positive semidefinite kernel if $|\beta| < \frac{1}{\rho(\mathbf{S})}$, where $\rho(\mathbf{S})$ is the spectral radius of $\mathbf{S}$. Can you derive better bounds for cases when $\beta > 0$ and when $\beta < 0$?

Q5. Given the three points $\mathbf{x}_1 = (2.5, 1)^T$, $\mathbf{x}_2 = (3.5, 4)^T$, and $\mathbf{x}_3 = (2, 2.1)^T$:
(a) Compute the kernel matrix for the Gaussian kernel assuming that $\sigma^2 = 5$.
(b) Compute the distance of the point $\phi(\mathbf{x}_1)$ from the mean in feature space.
(c) Compute the dominant eigenvector and eigenvalue for the kernel matrix from (a).
CHAPTER 6
High-dimensional Data

In data mining typically the data is very high dimensional, as the number of attributes can easily be in the hundreds or thousands. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because hyperspace does not behave like the more familiar geometry in two or three dimensions.

6.1 HIGH-DIMENSIONAL OBJECTS

Consider the $n \times d$ data matrix

$$\mathbf{D} = \begin{pmatrix}
 & X_1 & X_2 & \cdots & X_d \\
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}$$

where each point $\mathbf{x}_i \in \mathbb{R}^d$ and each attribute $X_j \in \mathbb{R}^n$.
Hypercube

Let the minimum and maximum values for each attribute $X_j$ be given as

$$\min(X_j) = \min_i \big\{ x_{ij} \big\} \qquad \max(X_j) = \max_i \big\{ x_{ij} \big\}$$

The data hyperspace can be considered as a $d$-dimensional hyper-rectangle, defined as

$$R_d = \prod_{j=1}^{d} \big[ \min(X_j), \max(X_j) \big] = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\Big|\; x_j \in [\min(X_j), \max(X_j)],\ \text{for}\ j = 1, \ldots, d \Big\}$$

Assume the data is centered to have mean $\boldsymbol{\mu} = \mathbf{0}$. Let $m$ denote the largest absolute value in $\mathbf{D}$, given as

$$m = \max_{j=1}^{d}\, \max_{i=1}^{n} \big\{ |x_{ij}| \big\}$$

The data hyperspace can be represented as a hypercube, centered at $\mathbf{0}$, with all sides of length $l = 2m$, given as

$$H_d(l) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\Big|\; \forall i,\; x_i \in [-l/2,\, l/2] \Big\}$$

The hypercube in one dimension, $H_1(l)$, represents an interval; in two dimensions, $H_2(l)$, a square; in three dimensions, $H_3(l)$, a cube; and so on. The unit hypercube has all sides of length $l = 1$, and is denoted $H_d(1)$.
Hypersphere

Assume that the data has been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. Let $r$ denote the largest magnitude among all points:

$$r = \max_i \big\{ \|\mathbf{x}_i\| \big\}$$

The data hyperspace can also be represented as a $d$-dimensional hyperball centered at $\mathbf{0}$ with radius $r$, defined as

$$B_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| \le r \big\} \qquad \text{or} \qquad B_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\Big|\; \sum_{j=1}^{d} x_j^2 \le r^2 \Big\}$$

The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance $r$ from the center of the hyperball, defined as

$$S_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| = r \big\} \qquad \text{or} \qquad S_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\Big|\; \sum_{j=1}^{d} (x_j)^2 = r^2 \Big\}$$

Because the hyperball consists of all the surface and interior points, it is also called a closed hypersphere.

Example 6.1. Consider the 2-dimensional, centered, Iris dataset, plotted in Figure 6.1. The largest absolute value along any dimension is $m = 2.06$, and the point with the largest magnitude is $(2.06, 0.75)$, with $r = 2.19$. In two dimensions, the hypercube representing the data space is a square with sides of length $l = 2m = 4.12$. The hypersphere marking the extent of the space is a circle (shown dashed) with radius $r = 2.19$.
[Figure 6.1. Iris data hyperspace: hypercube (solid; with l = 4.12) and hypersphere (dashed; with r = 2.19). Axes: X1: sepal length, X2: sepal width.]
6.2 HIGH-DIMENSIONAL VOLUMES

Hypercube

The volume of a hypercube with edge length $l$ is given as

$$\mathrm{vol}(H_d(l)) = l^d$$

Hypersphere

The volume of a hyperball and its corresponding hypersphere is identical because the volume measures the total content of the object, including all internal space. Consider the well known equations for the volume of a hypersphere in lower dimensions:

$$\mathrm{vol}(S_1(r)) = 2r \tag{6.1}$$
$$\mathrm{vol}(S_2(r)) = \pi r^2 \tag{6.2}$$
$$\mathrm{vol}(S_3(r)) = \frac{4}{3} \pi r^3 \tag{6.3}$$

As per the derivation in Appendix 6.7, the general equation for the volume of a $d$-dimensional hypersphere is given as

$$\mathrm{vol}(S_d(r)) = K_d\, r^d = \left( \frac{\pi^{d/2}}{\Gamma\big(\frac{d}{2} + 1\big)} \right) r^d \tag{6.4}$$
where

$$K_d = \frac{\pi^{d/2}}{\Gamma\big(\frac{d}{2} + 1\big)} \tag{6.5}$$

is a scalar that depends on the dimensionality $d$, and $\Gamma$ is the gamma function [Eq. (3.17)], defined as (for $\alpha > 0$)

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx \tag{6.6}$$

By direct integration of Eq. (6.6), we have

$$\Gamma(1) = 1 \qquad \text{and} \qquad \Gamma\big(\tfrac{1}{2}\big) = \sqrt{\pi} \tag{6.7}$$

The gamma function also has the following property for any $\alpha > 1$:

$$\Gamma(\alpha) = (\alpha - 1)\, \Gamma(\alpha - 1) \tag{6.8}$$

For any integer $n \ge 1$, we immediately have

$$\Gamma(n) = (n - 1)! \tag{6.9}$$

Turning our attention back to Eq. (6.4), when $d$ is even, then $\frac{d}{2} + 1$ is an integer, and by Eq. (6.9) we have

$$\Gamma\big(\tfrac{d}{2} + 1\big) = \big(\tfrac{d}{2}\big)!$$

and when $d$ is odd, then by Eqs. (6.8) and (6.7), we have

$$\Gamma\big(\tfrac{d}{2} + 1\big) = \Big(\tfrac{d}{2}\Big) \Big(\tfrac{d-2}{2}\Big) \Big(\tfrac{d-4}{2}\Big) \cdots \Big(\tfrac{d-(d-1)}{2}\Big)\, \Gamma\big(\tfrac{1}{2}\big) = \frac{d!!}{2^{(d+1)/2}} \sqrt{\pi}$$

where $d!!$ denotes the double factorial (or multifactorial), given as

$$d!! = \begin{cases} 1 & \text{if } d = 0 \text{ or } d = 1 \\ d \cdot (d-2)!! & \text{if } d \ge 2 \end{cases}$$

Putting it all together we have

$$\Gamma\big(\tfrac{d}{2} + 1\big) = \begin{cases} \big(\tfrac{d}{2}\big)! & \text{if } d \text{ is even} \\[4pt] \sqrt{\pi}\, \dfrac{d!!}{2^{(d+1)/2}} & \text{if } d \text{ is odd} \end{cases} \tag{6.10}$$

Plugging in values of $\Gamma(d/2 + 1)$ in Eq. (6.4) gives us the equations for the volume of the hypersphere in different dimensions.
Example 6.2. By Eq. (6.10), we have for $d = 1$, $d = 2$ and $d = 3$:

$$\Gamma(1/2 + 1) = \tfrac{1}{2} \sqrt{\pi} \qquad \Gamma(2/2 + 1) = 1! = 1 \qquad \Gamma(3/2 + 1) = \tfrac{3}{4} \sqrt{\pi}$$

Thus, we can verify that the volume of a hypersphere in one, two, and three dimensions is given as

$$\mathrm{vol}(S_1(r)) = \frac{\pi^{1/2}}{\frac{1}{2}\sqrt{\pi}}\, r = 2r \qquad \mathrm{vol}(S_2(r)) = \frac{\pi}{1}\, r^2 = \pi r^2 \qquad \mathrm{vol}(S_3(r)) = \frac{\pi^{3/2}}{\frac{3}{4}\sqrt{\pi}}\, r^3 = \frac{4}{3}\pi r^3$$

which match the expressions in Eqs. (6.1), (6.2), and (6.3), respectively.
Surface Area

The surface area of the hypersphere can be obtained by differentiating its volume with respect to $r$, given as

$$\mathrm{area}(S_d(r)) = \frac{d}{dr}\, \mathrm{vol}(S_d(r)) = \left( \frac{\pi^{d/2}}{\Gamma\big(\frac{d}{2}+1\big)} \right) d\, r^{d-1} = \left( \frac{2\pi^{d/2}}{\Gamma\big(\frac{d}{2}\big)} \right) r^{d-1}$$

We can quickly verify that for two dimensions the surface area of a circle is given as $2\pi r$, and for three dimensions the surface area of a sphere is given as $4\pi r^2$.
Asymptotic Volume

An interesting observation about the hypersphere volume is that as dimensionality increases, the volume first increases up to a point, and then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with $r = 1$,

$$\lim_{d \to \infty} \mathrm{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\big(\frac{d}{2}+1\big)} \to 0$$

Example 6.3. Figure 6.2 plots the volume of the unit hypersphere in Eq. (6.4) with increasing dimensionality. We see that initially the volume increases, and achieves the highest volume for $d = 5$ with $\mathrm{vol}(S_5(1)) = 5.263$. Thereafter, the volume drops rapidly and essentially becomes zero by $d = 30$.
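Eq. (6.4) is easy to evaluate numerically; the sketch below (Python, our own illustration) traces the rise-and-collapse of the unit hypersphere volume:

    import math

    def sphere_volume(d, r=1.0):
        """Volume of the d-dimensional hypersphere, Eq. (6.4)."""
        return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

    for d in (1, 2, 3, 5, 10, 30):
        print(d, sphere_volume(d))
    # 2.0, 3.14..., 4.19..., 5.26..., 2.55..., ~2.2e-05: the peak is at d = 5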
[Figure 6.2. Volume of a unit hypersphere as a function of the dimensionality d.]
6.3 HYPERSPHERE INSCRIBED WITHIN HYPERCUBE

We next look at the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). Consider a hypersphere of radius $r$ inscribed in a hypercube with sides of length $2r$. When we take the ratio of the volume of the hypersphere of radius $r$ to the hypercube with side length $l = 2r$, we observe the following trends.

In two dimensions, we have

$$\frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$

Thus, an inscribed circle occupies $\frac{\pi}{4}$ of the volume of its enclosing square, as illustrated in Figure 6.3a.

In three dimensions, the ratio is given as

$$\frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3}\pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$

An inscribed sphere takes up only $\frac{\pi}{6}$ of the volume of its enclosing cube, as shown in Figure 6.3b, which is quite a sharp decrease over the 2-dimensional case.

For the general case, as the dimensionality $d$ increases asymptotically, we get

$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r))}{\mathrm{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d\, \Gamma\big(\frac{d}{2}+1\big)} \to 0$$

This means that as the dimensionality increases, most of the volume of the hypercube is in the "corners," whereas the center is essentially empty. The mental picture that emerges is that high-dimensional space looks like a rolled-up porcupine, as illustrated in Figure 6.4.

[Figure 6.3. Hypersphere inscribed inside a hypercube: in (a) two and (b) three dimensions.]

[Figure 6.4. Conceptual view of high-dimensional space: (a) two, (b) three, (c) four, and (d) higher dimensions. In d dimensions there are 2^d "corners" and 2^{d-1} diagonals. The radius of the inscribed circle accurately reflects the difference between the volume of the hypercube and the inscribed hypersphere in d dimensions.]
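The vanishing ratio can be checked directly (Python, our own illustration):

    import math

    def ball_to_cube_ratio(d):
        """vol(S_d(r)) / vol(H_d(2r)) = pi^{d/2} / (2^d Gamma(d/2 + 1))."""
        return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

    for d in (2, 3, 10, 100):
        print(d, ball_to_cube_ratio(d))
    # 0.785, 0.524, ~0.0025, ~1.9e-70: the center of the hypercube empties out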
6.4 VOLUME OF THIN HYPERSPHERE SHELL

Let us now consider the volume of a thin hypersphere shell of width $\epsilon$ bounded by an outer hypersphere of radius $r$, and an inner hypersphere of radius $r - \epsilon$. The volume of the thin shell is given as the difference between the volumes of the two bounding hyperspheres, as illustrated in Figure 6.5.

Let $S_d(r, \epsilon)$ denote the thin hypershell of width $\epsilon$. Its volume is given as

$$\mathrm{vol}(S_d(r, \epsilon)) = \mathrm{vol}(S_d(r)) - \mathrm{vol}(S_d(r - \epsilon)) = K_d\, r^d - K_d\, (r - \epsilon)^d$$

[Figure 6.5. Volume of a thin shell (for ε > 0).]
Let us consider the ratio of the volume of the thin shell to the volume of the outer sphere:

$$\frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \frac{K_d\, r^d - K_d\, (r - \epsilon)^d}{K_d\, r^d} = 1 - \Big( 1 - \frac{\epsilon}{r} \Big)^d$$

Example 6.4. For example, for a circle in two dimensions, with $r = 1$ and $\epsilon = 0.01$, the volume of the thin shell is $1 - (0.99)^2 = 0.0199 \simeq 2\%$. As expected, in two dimensions, the thin shell encloses only a small fraction of the volume of the original hypersphere. For three dimensions this fraction becomes $1 - (0.99)^3 = 0.0297 \simeq 3\%$, which is still a relatively small fraction.
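The shell fraction $1 - (1 - \epsilon/r)^d$ makes the concentration effect vivid as $d$ grows (Python, our own illustration):

    def shell_fraction(d, r=1.0, eps=0.01):
        """Fraction of hypersphere volume within eps of the surface."""
        return 1 - (1 - eps / r) ** d

    for d in (2, 3, 100, 1000):
        print(d, round(shell_fraction(d), 4))
    # 0.0199, 0.0297, 0.634, ~1.0: almost all volume hugs the surface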
Asymptotic Volume

As $d$ increases, in the limit we obtain

$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \Big( 1 - \frac{\epsilon}{r} \Big)^d \to 1$$

That is, almost all of the volume of the hypersphere is contained in the thin shell as $d \to \infty$. This means that in high-dimensional spaces, unlike in lower dimensions, most of the volume is concentrated around the surface (within $\epsilon$) of the hypersphere, and the center is essentially void. In other words, if the data is distributed uniformly in the $d$-dimensional space, then all of the points essentially lie on the boundary of the space (which is a $(d-1)$-dimensional object). Combined with the fact that most of the hypercube volume is in the corners, we can observe that in high dimensions, data tends to get scattered on the boundary and corners of the space.
6.5 DIAGONALS IN HYPERSPACE

Another counterintuitive behavior of high-dimensional spaces deals with the diagonals. Let us assume that we have a $d$-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and bounded in each dimension in the range $[-1, 1]$. Then each "corner" of the hyperspace is a $d$-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$. Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the $d$-dimensional canonical unit vector in dimension $i$, and let $\mathbf{1}$ denote the $d$-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.

Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in $d$ dimensions:

$$\cos \theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\|\, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1} \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1}\sqrt{d}} = \frac{1}{\sqrt{d}}$$
Example 6.5. Figure 6.6 illustrates the angle between the diagonal vector $\mathbf{1}$ and $\mathbf{e}_1$, for $d = 2$ and $d = 3$. In two dimensions, we have $\cos \theta_2 = \frac{1}{\sqrt{2}}$, whereas in three dimensions, we have $\cos \theta_3 = \frac{1}{\sqrt{3}}$.

Asymptotic Angle

As $d$ increases, the angle between the $d$-dimensional diagonal vector $\mathbf{1}$ and the first axis vector $\mathbf{e}_1$ is given as

$$\lim_{d \to \infty} \cos \theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0$$

which implies that

$$\lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ$$

[Figure 6.6. Angle between diagonal vector 1 and e1: in (a) two and (b) three dimensions.]
This analysis holds for the angle between the diagonal vector $\mathbf{1}_d$ and any of the $d$ principal axis vectors $\mathbf{e}_i$ (i.e., for all $i \in [1, d]$). In fact, the same result holds for any diagonal vector and any principal axis vector (in both directions). This implies that in high dimensions all of the diagonal vectors are perpendicular (or orthogonal) to all the coordinate axes! Because there are $2^d$ corners in a $d$-dimensional hyperspace, there are $2^d$ diagonal vectors from the origin to each of the corners. Because the diagonal vectors in opposite directions define a new axis, we obtain $2^{d-1}$ new axes, each of which is essentially orthogonal to all of the $d$ principal coordinate axes! Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes." A consequence of this strange property of high-dimensional space is that if there is a point or a group of points, say a cluster of interest, near a diagonal, these points will get projected into the origin and will not be visible in lower dimensional projections.
6.6 DENSITY OF THE MULTIVARIATE NORMAL

Let us consider how, for the standard multivariate normal distribution, the density of points around the mean changes in $d$ dimensions. In particular, consider the probability of a point being within a fraction $\alpha > 0$ of the peak density at the mean.

For a multivariate normal distribution [Eq. (2.33)], with $\boldsymbol{\mu} = \mathbf{0}_d$ (the $d$-dimensional zero vector), and $\boldsymbol{\Sigma} = \mathbf{I}_d$ (the $d \times d$ identity matrix), we have

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \tag{6.11}$$

At the mean $\boldsymbol{\mu} = \mathbf{0}_d$, the peak density is $f(\mathbf{0}_d) = \frac{1}{(\sqrt{2\pi})^d}$. Thus, the set of points $\mathbf{x}$ with density at least $\alpha$ fraction of the density at the mean, with $0 < \alpha < 1$, is given as

$$\frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha$$

which implies that

$$\exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \ge \alpha \qquad \text{or} \qquad \mathbf{x}^T \mathbf{x} \le -2\ln(\alpha)$$

and thus

$$\sum_{i=1}^{d} (x_i)^2 \le -2\ln(\alpha) \tag{6.12}$$

It is known that if the random variables $X_1, X_2, \ldots, X_k$ are independent and identically distributed, and if each variable has a standard normal distribution, then their squared sum $X_1^2 + X_2^2 + \cdots + X_k^2$ follows a $\chi^2$ distribution with $k$ degrees of freedom, denoted as $\chi_k^2$. Because the projection of the standard multivariate normal onto any attribute $X_j$ is a standard univariate normal, we conclude that $\mathbf{x}^T \mathbf{x} = \sum_{i=1}^{d} (x_i)^2$ has a $\chi^2$ distribution with $d$ degrees of freedom. The probability that a point $\mathbf{x}$ is within $\alpha$ times the density at the mean can be computed from the $\chi_d^2$ density function using Eq. (6.12),
as follows:

$$P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = P\big( \mathbf{x}^T \mathbf{x} \le -2\ln(\alpha) \big) = \int_0^{-2\ln(\alpha)} f_{\chi_d^2}(\mathbf{x}^T\mathbf{x})\; d(\mathbf{x}^T\mathbf{x}) = F_{\chi_d^2}\big({-2\ln(\alpha)}\big) \tag{6.13}$$

where $f_{\chi_q^2}(x)$ is the chi-squared probability density function [Eq. (3.16)] with $q$ degrees of freedom:

$$f_{\chi_q^2}(x) = \frac{1}{2^{q/2}\, \Gamma(q/2)}\, x^{\frac{q}{2} - 1}\, e^{-\frac{x}{2}}$$

and $F_{\chi_q^2}(x)$ is its cumulative distribution function.

As dimensionality increases, this probability decreases sharply, and eventually tends to zero, that is,

$$\lim_{d \to \infty} P\big( \mathbf{x}^T \mathbf{x} \le -2\ln(\alpha) \big) \to 0 \tag{6.14}$$

Thus, in higher dimensions the probability density around the mean decreases very rapidly as one moves away from the mean. In essence the entire probability mass migrates to the tail regions.
Example 6.6. Consider the probability of a point being within 50% of the density at the mean, that is, $\alpha = 0.5$. From Eq. (6.13) we have

$$P\big( \mathbf{x}^T \mathbf{x} \le -2\ln(0.5) \big) = F_{\chi_d^2}(1.386)$$

We can compute the probability of a point being within 50% of the peak density by evaluating the cumulative $\chi^2$ distribution for different degrees of freedom (the number of dimensions). For $d = 1$, we find that the probability is $F_{\chi_1^2}(1.386) = 76.1\%$. For $d = 2$ the probability decreases to $F_{\chi_2^2}(1.386) = 50\%$, and for $d = 3$ it reduces to 29.12%. Looking at Figure 6.7, we can see that only about 24% of the density is in the tail regions for one dimension, but for two dimensions more than 50% of the density is in the tail regions.

Figure 6.8 plots the $\chi_d^2$ distribution and shows the probability $P\big( \mathbf{x}^T \mathbf{x} \le 1.386 \big)$ for two and three dimensions. This probability decreases rapidly with dimensionality; by $d = 10$, it decreases to 0.075%, that is, 99.925% of the points lie in the extreme or tail regions.
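These probabilities follow from the chi-squared CDF. The sketch below (Python, our own illustration, assuming SciPy for the chi2 distribution) reproduces the numbers above:

    import numpy as np
    from scipy.stats import chi2

    threshold = -2 * np.log(0.5)          # = 1.386 for alpha = 0.5
    for d in (1, 2, 3, 10):
        print(d, 100 * chi2.cdf(threshold, df=d))
    # ~76.1%, 50.0%, 29.1%, 0.075%: matching Example 6.6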
Distance of Points from the Mean

Let us consider the average distance of a point $\mathbf{x}$ from the center of the standard multivariate normal. Let $r^2$ denote the square of the distance of a point $\mathbf{x}$ to the center $\boldsymbol{\mu} = \mathbf{0}$, given as

$$r^2 = \|\mathbf{x} - \mathbf{0}\|^2 = \mathbf{x}^T \mathbf{x} = \sum_{i=1}^{d} x_i^2$$
[Figure 6.7. Density contour for the α fraction of the density at the mean: in (a) one and (b) two dimensions, with α = 0.5.]

[Figure 6.8. Probability P(x^T x ≤ −2 ln(α)), with α = 0.5: the χ² density for (a) d = 2 (F = 0.5) and (b) d = 3 (F = 0.29).]
$\mathbf{x}^T \mathbf{x}$ follows a $\chi^2$ distribution with $d$ degrees of freedom, which has mean $d$ and variance $2d$. It follows that the mean and variance of the random variable $r^2$ are

$$\mu_{r^2} = d \qquad \sigma_{r^2}^2 = 2d$$

By the central limit theorem, as $d \to \infty$, $r^2$ is approximately normal with mean $d$ and variance $2d$, which implies that $r^2$ is concentrated about its mean value of $d$. As a consequence, the distance $r$ of a point $\mathbf{x}$ to the center of the standard multivariate normal is likewise approximately concentrated around its mean $\sqrt{d}$.

Next, to estimate the spread of the distance $r$ around its mean value, we need to derive the standard deviation of $r$ from that of $r^2$. Assuming that $\sigma_r$ is much smaller compared to $r$, then using the fact that $\frac{d \log r}{dr} = \frac{1}{r}$, after rearranging the terms, we have

$$\frac{dr}{r} = d\log r = \frac{1}{2}\, d\log r^2$$

Using the fact that $\frac{d \log r^2}{d r^2} = \frac{1}{r^2}$, and rearranging the terms, we obtain

$$\frac{dr}{r} = \frac{1}{2} \frac{dr^2}{r^2}$$

which implies that $dr = \frac{1}{2r}\, dr^2$. Setting the change in $r^2$ equal to the standard deviation of $r^2$, we have $dr^2 = \sigma_{r^2} = \sqrt{2d}$, and setting the mean radius $r = \sqrt{d}$, we have

$$\sigma_r = dr = \frac{1}{2\sqrt{d}} \sqrt{2d} = \frac{1}{\sqrt{2}}$$

We conclude that for large $d$, the radius $r$ (or the distance of a point $\mathbf{x}$ from the origin $\mathbf{0}$) follows a normal distribution with mean $\sqrt{d}$ and standard deviation $1/\sqrt{2}$. Nevertheless, the density at the mean distance $\sqrt{d}$ is exponentially smaller than that at the peak density because

$$\frac{f(\mathbf{x})}{f(\mathbf{0})} = \exp\big\{-\mathbf{x}^T\mathbf{x}/2\big\} = \exp\{-d/2\}$$

Combined with the fact that the probability mass migrates away from the mean in high dimensions, we have another interesting observation, namely that, whereas the density of the standard multivariate normal is maximized at the center $\mathbf{0}$, most of the probability mass (the points) is concentrated in a small band around the mean distance of $\sqrt{d}$ from the center.
6.7 APPENDIX: DERIVATION OF HYPERSPHERE VOLUME

The volume of the hypersphere can be derived via integration using spherical polar coordinates. We consider the derivation in two and three dimensions, and then for a general $d$.

[Figure 6.9. Polar coordinates in two dimensions: the point (x1, x2) at radius r and angle θ1.]

Volume in Two Dimensions

As illustrated in Figure 6.9, in $d = 2$ dimensions, the point $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$ can be expressed in polar coordinates as follows:

$$x_1 = r \cos\theta_1 = r c_1 \qquad x_2 = r \sin\theta_1 = r s_1$$

where $r = \|\mathbf{x}\|$, and we use the notation $\cos\theta_1 = c_1$ and $\sin\theta_1 = s_1$ for convenience. The Jacobian matrix for this transformation is given as

$$\mathbf{J}(\theta_1) = \begin{pmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} \\[2pt] \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} \end{pmatrix} = \begin{pmatrix} c_1 & -r s_1 \\ s_1 & r c_1 \end{pmatrix}$$

The determinant of the Jacobian matrix is called the Jacobian. For $\mathbf{J}(\theta_1)$, the Jacobian is given as

$$\det(\mathbf{J}(\theta_1)) = r c_1^2 + r s_1^2 = r (c_1^2 + s_1^2) = r \tag{6.15}$$
Using the Jacobian in Eq. (6.15), the volume of the hypersphere in two dimensions can be obtained by integration over $r$ and $\theta_1$ (with $r > 0$, and $0 \le \theta_1 \le 2\pi$):

$$\mathrm{vol}(S_2(r)) = \int_r \int_{\theta_1} \det(\mathbf{J}(\theta_1))\; dr\, d\theta_1 = \int_0^r \int_0^{2\pi} r\; dr\, d\theta_1 = \int_0^r r\, dr \int_0^{2\pi} d\theta_1 = \frac{r^2}{2} \bigg|_0^r \cdot\; \theta_1 \bigg|_0^{2\pi} = \pi r^2$$

[Figure 6.10. Polar coordinates in three dimensions: the point (x1, x2, x3) at radius r and angles θ1, θ2.]
Volume in Three Dimensions

As illustrated in Figure 6.10, in $d = 3$ dimensions, the point $\mathbf{x} = (x_1, x_2, x_3) \in \mathbb{R}^3$ can be expressed in polar coordinates as follows:

$$x_1 = r \cos\theta_1 \cos\theta_2 = r c_1 c_2$$
$$x_2 = r \cos\theta_1 \sin\theta_2 = r c_1 s_2$$
$$x_3 = r \sin\theta_1 = r s_1$$

where $r = \|\mathbf{x}\|$, and we used the fact that the dotted vector that lies in the $X_1$–$X_2$ plane in Figure 6.10 has magnitude $r \cos\theta_1$.

The Jacobian matrix is given as

$$\mathbf{J}(\theta_1, \theta_2) = \begin{pmatrix}
\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} & \frac{\partial x_1}{\partial \theta_2} \\[2pt]
\frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} & \frac{\partial x_2}{\partial \theta_2} \\[2pt]
\frac{\partial x_3}{\partial r} & \frac{\partial x_3}{\partial \theta_1} & \frac{\partial x_3}{\partial \theta_2}
\end{pmatrix} = \begin{pmatrix}
c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\
c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\
s_1 & r c_1 & 0
\end{pmatrix}$$

The Jacobian is then given as

$$\det(\mathbf{J}(\theta_1, \theta_2)) = s_1 (-r s_1)(c_1)\, \det(\mathbf{J}(\theta_2)) - r c_1\, c_1 c_1\, \det(\mathbf{J}(\theta_2)) = -r^2 c_1 (s_1^2 + c_1^2) = -r^2 c_1 \tag{6.16}$$

In computing this determinant we made use of the fact that if a column of a matrix $\mathbf{A}$ is multiplied by a scalar $s$, then the resulting determinant is $s\, \det(\mathbf{A})$. We also relied on the fact that the $(3,1)$-minor of $\mathbf{J}(\theta_1, \theta_2)$, obtained by deleting row 3 and column 1, is actually $\mathbf{J}(\theta_2)$ with the first column multiplied by $-r s_1$ and the second column multiplied by $c_1$. Likewise, the $(3,2)$-minor of $\mathbf{J}(\theta_1, \theta_2)$ is $\mathbf{J}(\theta_2)$ with both the columns multiplied by $c_1$.
The volume of the hypersphere for $d = 3$ is obtained via a triple integral with $r > 0$, $-\pi/2 \le \theta_1 \le \pi/2$, and $0 \le \theta_2 \le 2\pi$:

$$\mathrm{vol}(S_3(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \big|\det(\mathbf{J}(\theta_1, \theta_2))\big|\; dr\, d\theta_1\, d\theta_2 = \int_0^r \int_{-\pi/2}^{\pi/2} \int_0^{2\pi} r^2 \cos\theta_1\; dr\, d\theta_1\, d\theta_2$$
$$= \int_0^r r^2\, dr \int_{-\pi/2}^{\pi/2} \cos\theta_1\, d\theta_1 \int_0^{2\pi} d\theta_2 = \frac{r^3}{3} \bigg|_0^r \cdot\; \sin\theta_1 \bigg|_{-\pi/2}^{\pi/2} \cdot\; \theta_2 \bigg|_0^{2\pi} = \frac{r^3}{3} \cdot 2 \cdot 2\pi = \frac{4}{3} \pi r^3 \tag{6.17}$$
Volume in d Dimensions

Before deriving a general expression for the hypersphere volume in $d$ dimensions, let us consider the Jacobian in four dimensions. Generalizing the polar coordinates from three dimensions in Figure 6.10 to four dimensions, we obtain

$$x_1 = r \cos\theta_1 \cos\theta_2 \cos\theta_3 = r c_1 c_2 c_3$$
$$x_2 = r \cos\theta_1 \cos\theta_2 \sin\theta_3 = r c_1 c_2 s_3$$
$$x_3 = r \cos\theta_1 \sin\theta_2 = r c_1 s_2$$
$$x_4 = r \sin\theta_1 = r s_1$$

The Jacobian matrix is given as

$$\mathbf{J}(\theta_1, \theta_2, \theta_3) = \begin{pmatrix}
\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} & \frac{\partial x_1}{\partial \theta_2} & \frac{\partial x_1}{\partial \theta_3} \\[2pt]
\frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} & \frac{\partial x_2}{\partial \theta_2} & \frac{\partial x_2}{\partial \theta_3} \\[2pt]
\frac{\partial x_3}{\partial r} & \frac{\partial x_3}{\partial \theta_1} & \frac{\partial x_3}{\partial \theta_2} & \frac{\partial x_3}{\partial \theta_3} \\[2pt]
\frac{\partial x_4}{\partial r} & \frac{\partial x_4}{\partial \theta_1} & \frac{\partial x_4}{\partial \theta_2} & \frac{\partial x_4}{\partial \theta_3}
\end{pmatrix} = \begin{pmatrix}
c_1 c_2 c_3 & -r s_1 c_2 c_3 & -r c_1 s_2 c_3 & -r c_1 c_2 s_3 \\
c_1 c_2 s_3 & -r s_1 c_2 s_3 & -r c_1 s_2 s_3 & r c_1 c_2 c_3 \\
c_1 s_2 & -r s_1 s_2 & r c_1 c_2 & 0 \\
s_1 & r c_1 & 0 & 0
\end{pmatrix}$$

Utilizing the Jacobian in three dimensions [Eq. (6.16)], the Jacobian in four dimensions is given as

$$\det(\mathbf{J}(\theta_1, \theta_2, \theta_3)) = s_1 (-r s_1)(c_1)(c_1)\, \det(\mathbf{J}(\theta_2, \theta_3)) - r c_1 (c_1)(c_1)(c_1)\, \det(\mathbf{J}(\theta_2, \theta_3))$$
$$= r^3 s_1^2 c_1^2 c_2 + r^3 c_1^4 c_2 = r^3 c_1^2 c_2 (s_1^2 + c_1^2) = r^3 c_1^2 c_2$$

Jacobian in d Dimensions. By induction, we can obtain the $d$-dimensional Jacobian as follows:

$$\det(\mathbf{J}(\theta_1, \theta_2, \ldots, \theta_{d-1})) = (-1)^d\, r^{d-1}\, c_1^{d-2}\, c_2^{d-3} \cdots c_{d-2}$$
The volume of the hypersphere is given by the $d$-dimensional integral with $r > 0$, $-\pi/2 \le \theta_i \le \pi/2$ for all $i = 1, \ldots, d - 2$, and $0 \le \theta_{d-1} \le 2\pi$:

$$\mathrm{vol}(S_d(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \big|\det(\mathbf{J}(\theta_1, \theta_2, \ldots, \theta_{d-1}))\big|\; dr\, d\theta_1\, d\theta_2 \cdots d\theta_{d-1}$$
$$= \int_0^r r^{d-1}\, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2}\, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2}\, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1} \tag{6.18}$$
Consider one of the intermediate integrals:
$$\int_{-\pi/2}^{\pi/2} (\cos\theta)^k \, d\theta = 2\int_0^{\pi/2} \cos^k\theta \, d\theta \quad (6.19)$$
Let us substitute $u = \cos^2\theta$; then we have $\theta = \cos^{-1}(u^{1/2})$, and the Jacobian is
$$J = \frac{\partial\theta}{\partial u} = -\frac{1}{2}u^{-1/2}(1-u)^{-1/2} \quad (6.20)$$
Substituting Eq. (6.20) in Eq. (6.19), we get the new integral:
$$2\int_0^{\pi/2} \cos^k\theta \, d\theta = \int_0^1 u^{(k-1)/2}(1-u)^{-1/2} \, du = B\left(\frac{k+1}{2}, \frac{1}{2}\right) = \frac{\Gamma\left(\frac{k+1}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{k}{2}+1\right)} \quad (6.21)$$
where $B(\alpha,\beta)$ is the beta function, given as
$$B(\alpha,\beta) = \int_0^1 u^{\alpha-1}(1-u)^{\beta-1} \, du$$
and it can be expressed in terms of the gamma function [Eq. (6.6)] via the identity
$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
Using the fact that $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$, plugging Eq. (6.21) into Eq. (6.18), we get
$$\mathrm{vol}(S_d(r)) = \frac{r^d}{d}\left(\frac{\Gamma\left(\frac{d-1}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}\right)\left(\frac{\Gamma\left(\frac{d-2}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d-1}{2}\right)}\right)\cdots\left(\frac{\Gamma(1)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{3}{2}\right)}\right)2\pi$$
$$= \frac{\pi\,\Gamma\left(\frac{1}{2}\right)^{d-2}}{\frac{d}{2}\,\Gamma\left(\frac{d}{2}\right)}\,r^d = \left(\frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)}\right)r^d$$
which matches the expression in Eq. (6.4).
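Numerically, the closed form is a one-liner. The following sketch uses Python's standard-library math.gamma (the helper name hypersphere_volume is ours) to evaluate Eq. (6.4) and compare it against the $d = 2$ and $d = 3$ results derived above:

import math

def hypersphere_volume(d, r=1.0):
    """Volume of the d-dimensional ball of radius r: pi^(d/2)/Gamma(d/2+1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

print(hypersphere_volume(2, 2.0), math.pi * 2.0 ** 2)           # pi r^2
print(hypersphere_volume(3, 2.0), 4 / 3 * math.pi * 2.0 ** 3)   # (4/3) pi r^3
# The volume of the unit ball tends to 0 as d grows (cf. Q2 below):
print([round(hypersphere_volume(d), 4) for d in range(1, 15)])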
6.8 FURTHER READING
For an introduction to the geometry of $d$-dimensional spaces see Kendall (1961) and also Scott (1992, Section 1.5). The derivation of the mean distance for the multivariate normal is from MacKay (2003, p. 130).

Kendall, M. G. (1961). A Course in the Geometry of n Dimensions. New York: Hafner.
MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. New York: Cambridge University Press.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons.
6.9 EXERCISES
Q1. Given the gamma function in Eq. (6.6), show the following:
(a) $\Gamma(1) = 1$
(b) $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$
(c) $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$
Q2. Show that the asymptotic volume of the hypersphere $S_d(r)$ for any value of radius $r$ eventually tends to zero as $d$ increases.
Q3. The ball with center $\mathbf{c} \in \mathbb{R}^d$ and radius $r$ is defined as
$$B_d(\mathbf{c}, r) = \left\{\mathbf{x} \in \mathbb{R}^d \mid \delta(\mathbf{x},\mathbf{c}) \leq r\right\}$$
where $\delta(\mathbf{x},\mathbf{c})$ is the distance between $\mathbf{x}$ and $\mathbf{c}$, which can be specified using the $L_p$-norm:
$$L_p(\mathbf{x},\mathbf{c}) = \left(\sum_{i=1}^d |x_i - c_i|^p\right)^{\frac{1}{p}}$$
where $p \neq 0$ is any real number. The distance can also be specified using the $L_\infty$-norm:
$$L_\infty(\mathbf{x},\mathbf{c}) = \max_i \left\{|x_i - c_i|\right\}$$
Answer the following questions:
(a) For $d = 2$, sketch the shape of the hyperball inscribed inside the unit square, using the $L_p$-distance with $p = 0.5$ and with center $\mathbf{c} = (0.5, 0.5)^T$.
(b) With $d = 2$ and $\mathbf{c} = (0.5, 0.5)^T$, using the $L_\infty$-norm, sketch the shape of the ball of radius $r = 0.25$ inside a unit square.
(c) Compute the formula for the maximum distance between any two points in the unit hypercube in $d$ dimensions, when using the $L_p$-norm. What is the maximum distance for $p = 0.5$ when $d = 2$? What is the maximum distance for the $L_\infty$-norm?
Figure 6.11. For Q4.
Q4. Consider the corner hypercubes of length $\epsilon \leq 1$ inside a unit hypercube. The 2-dimensional case is shown in Figure 6.11. Answer the following questions:
(a) Let $\epsilon = 0.1$. What is the fraction of the total volume occupied by the corner cubes in two dimensions?
(b) Derive an expression for the volume occupied by all of the corner hypercubes of length $\epsilon < 1$ as a function of the dimension $d$. What happens to the fraction of the volume in the corners as $d \to \infty$?
(c) What is the fraction of volume occupied by the thin hypercube shell of width $\epsilon < 1$ as a fraction of the total volume of the outer (unit) hypercube, as $d \to \infty$? For example, in two dimensions the thin shell is the space between the outer square (solid) and inner square (dashed).
Q5. Prove Eq. (6.14), that is, $\lim_{d\to\infty} P\left(\mathbf{x}^T\mathbf{x} \leq -2\ln(\alpha)\right) \to 0$, for any $\alpha \in (0,1)$ and $\mathbf{x} \in \mathbb{R}^d$.
Q6. Consider the conceptual view of high-dimensional space shown in Figure 6.4. Derive an expression for the radius of the inscribed circle, so that the area in the spokes accurately reflects the difference between the volume of the hypercube and the inscribed hypersphere in $d$ dimensions. For instance, if the length of a half-diagonal is fixed at 1, then the radius of the inscribed circle is $\frac{1}{\sqrt{2}}$ in Figure 6.4a.
Q7. Consider the unit hypersphere (with radius $r = 1$). Inside the hypersphere inscribe a hypercube (i.e., the largest hypercube you can fit inside the hypersphere). An example in two dimensions is shown in Figure 6.12. Answer the following questions:

Figure 6.12. For Q7.

(a) Derive an expression for the volume of the inscribed hypercube for any given dimensionality $d$. Derive the expression for one, two, and three dimensions, and then generalize to higher dimensions.
(b) What happens to the ratio of the volume of the inscribed hypercube to the volume of the enclosing hypersphere as $d \to \infty$? Again, give the ratio in one, two, and three dimensions, and then generalize.
Q8. Assume that a unit hypercube is given as $[0,1]^d$, that is, the range is $[0,1]$ in each dimension. The main diagonal in the hypercube is defined as the vector from $(\underbrace{0,\ldots,0}_{d-1},0)$ to $(\underbrace{1,\ldots,1}_{d-1},1)$. For example, when $d = 2$, the main diagonal goes from $(0,0)$ to $(1,1)$. On the other hand, the main anti-diagonal is defined as the vector from $(\underbrace{1,\ldots,1}_{d-1},0)$ to $(\underbrace{0,\ldots,0}_{d-1},1)$. For example, for $d = 2$, the anti-diagonal is from $(1,0)$ to $(0,1)$.
(a) Sketch the diagonal and anti-diagonal in $d = 3$ dimensions, and compute the angle between them.
(b) What happens to the angle between the main diagonal and anti-diagonal as $d \to \infty$? First compute a general expression for the $d$ dimensions, and then take the limit as $d \to \infty$.
Q9. Draw a sketch of a hypersphere in four dimensions.
CHAPTER 7
Dimensionality Reduction
We saw in Chapter 6 that high-dimensional data has some peculiar characteristics,
some of which are counterintuitive. For example, in high dimensions the center of
the space is devoid of points, with most of the points being scattered along the
surface of the space or in the corners. There is also an apparent proliferation of
orthogonal axes. As a consequence high-dimensional data can cause problems for
data mining and analysis, although in some cases high-dimensionality can help, for
example, for nonlinear classification. Nevertheless, it is important to check whether
the dimensionality can be reduced while preserving the essential properties of the full
data matrix. This can aid data visualization as well as data mining. In this chapter we
study methods that allow us to obtain optimal lower-dimensional projections of the
data.
7.1 BACKGROUND
Let the data $\mathbf{D}$ consist of $n$ points over $d$ attributes, that is, it is an $n \times d$ matrix, given as
$$\mathbf{D} = \begin{pmatrix} & X_1 & X_2 & \cdots & X_d \\ \mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$
Each point $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ is a vector in the ambient $d$-dimensional vector space spanned by the $d$ standard basis vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_d$, where $\mathbf{e}_i$ corresponds to the $i$th attribute $X_i$. Recall that the standard basis is an orthonormal basis for the data space, that is, the basis vectors are pairwise orthogonal, $\mathbf{e}_i^T\mathbf{e}_j = 0$, and have unit length $\|\mathbf{e}_i\| = 1$.
As such, given any other set of $d$ orthonormal vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d$, with $\mathbf{u}_i^T\mathbf{u}_j = 0$ and $\|\mathbf{u}_i\| = 1$ (or $\mathbf{u}_i^T\mathbf{u}_i = 1$), we can re-express each point $\mathbf{x}$ as the linear combination
$$\mathbf{x} = a_1\mathbf{u}_1 + a_2\mathbf{u}_2 + \cdots + a_d\mathbf{u}_d \quad (7.1)$$
where the vector $\mathbf{a} = (a_1, a_2, \ldots, a_d)^T$ represents the coordinates of $\mathbf{x}$ in the new basis. The above linear combination can also be expressed as a matrix multiplication:
$$\mathbf{x} = \mathbf{U}\mathbf{a} \quad (7.2)$$
where $\mathbf{U}$ is the $d \times d$ matrix whose $i$th column comprises the $i$th basis vector $\mathbf{u}_i$:
$$\mathbf{U} = \begin{pmatrix} | & | & & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_d \\ | & | & & | \end{pmatrix}$$
The matrix $\mathbf{U}$ is an orthogonal matrix, whose columns, the basis vectors, are orthonormal, that is, they are pairwise orthogonal and have unit length:
$$\mathbf{u}_i^T\mathbf{u}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
Because $\mathbf{U}$ is orthogonal, this means that its inverse equals its transpose:
$$\mathbf{U}^{-1} = \mathbf{U}^T$$
which implies that $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, where $\mathbf{I}$ is the $d \times d$ identity matrix.
Multiplying Eq. (7.2) on both sides by $\mathbf{U}^T$ yields the expression for computing the coordinates of $\mathbf{x}$ in the new basis:
$$\mathbf{U}^T\mathbf{x} = \mathbf{U}^T\mathbf{U}\mathbf{a} \qquad \mathbf{a} = \mathbf{U}^T\mathbf{x} \quad (7.3)$$
Example 7.1. Figure 7.1a shows the centered Iris dataset, with $n = 150$ points, in the $d = 3$ dimensional space comprising the sepal length ($X_1$), sepal width ($X_2$), and petal length ($X_3$) attributes. The space is spanned by the standard basis vectors
$$\mathbf{e}_1 = \begin{pmatrix}1\\0\\0\end{pmatrix} \qquad \mathbf{e}_2 = \begin{pmatrix}0\\1\\0\end{pmatrix} \qquad \mathbf{e}_3 = \begin{pmatrix}0\\0\\1\end{pmatrix}$$
Figure 7.1b shows the same points in the space comprising the new basis vectors
$$\mathbf{u}_1 = \begin{pmatrix}-0.390\\0.089\\-0.916\end{pmatrix} \qquad \mathbf{u}_2 = \begin{pmatrix}-0.639\\-0.742\\0.200\end{pmatrix} \qquad \mathbf{u}_3 = \begin{pmatrix}-0.663\\0.664\\0.346\end{pmatrix}$$
For example, the new coordinates of the centered point $\mathbf{x} = (-0.343, -0.754, 0.241)^T$ can be computed as
$$\mathbf{a} = \mathbf{U}^T\mathbf{x} = \begin{pmatrix}-0.390 & 0.089 & -0.916\\-0.639 & -0.742 & 0.200\\-0.663 & 0.664 & 0.346\end{pmatrix}\begin{pmatrix}-0.343\\-0.754\\0.241\end{pmatrix} = \begin{pmatrix}-0.154\\0.828\\-0.190\end{pmatrix}$$
One can verify that $\mathbf{x}$ can be written as the linear combination
$$\mathbf{x} = -0.154\mathbf{u}_1 + 0.828\mathbf{u}_2 - 0.190\mathbf{u}_3$$
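This change of basis is a one-line computation in code. A sketch in plain numpy (variable names ours; since the printed basis values are rounded to three decimals, the comparisons use a loose tolerance):

import numpy as np

# New orthonormal basis vectors as the columns of U (values from Example 7.1)
U = np.array([[-0.390, -0.639, -0.663],
              [ 0.089, -0.742,  0.664],
              [-0.916,  0.200,  0.346]])
x = np.array([-0.343, -0.754, 0.241])

a = U.T @ x                                  # coordinates in the new basis, Eq. (7.3)
print(np.round(a, 3))                        # approx [-0.154  0.828 -0.190]
print(np.allclose(U @ a, x, atol=1e-2))      # x is recovered as U a, Eq. (7.2)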
Figure 7.1. Iris data: optimal basis in three dimensions. (a) Original basis; (b) optimal basis.
Because there are potentially infinite choices for the set of orthonormal basis vectors, one natural question is whether there exists an optimal basis, for a suitable notion of optimality. Further, it is often the case that the input dimensionality $d$ is very large, which can cause various problems owing to the curse of dimensionality (see Chapter 6). It is natural to ask whether we can find a reduced dimensionality subspace that still preserves the essential characteristics of the data. That is, we are interested in finding the optimal $r$-dimensional representation of $\mathbf{D}$, with $r \ll d$. In other words, given a point $\mathbf{x}$, and assuming that the basis vectors have been sorted in decreasing order of importance, we can truncate its linear expansion [Eq. (7.1)] to just $r$ terms, to obtain
$$\mathbf{x}' = a_1\mathbf{u}_1 + a_2\mathbf{u}_2 + \cdots + a_r\mathbf{u}_r = \sum_{i=1}^r a_i\mathbf{u}_i \quad (7.4)$$
Here $\mathbf{x}'$ is the projection of $\mathbf{x}$ onto the first $r$ basis vectors, which can be written in matrix notation as follows:
$$\mathbf{x}' = \begin{pmatrix} | & | & & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_r \\ | & | & & | \end{pmatrix}\begin{pmatrix}a_1\\a_2\\\vdots\\a_r\end{pmatrix} = \mathbf{U}_r\mathbf{a}_r \quad (7.5)$$
where $\mathbf{U}_r$ is the matrix comprising the first $r$ basis vectors, and $\mathbf{a}_r$ is the vector comprising the first $r$ coordinates. Further, because $\mathbf{a} = \mathbf{U}^T\mathbf{x}$ from Eq. (7.3), restricting it to the first $r$ terms, we get
$$\mathbf{a}_r = \mathbf{U}_r^T\mathbf{x} \quad (7.6)$$
Plugging this into Eq. (7.5), the projection of $\mathbf{x}$ onto the first $r$ basis vectors can be compactly written as
$$\mathbf{x}' = \mathbf{U}_r\mathbf{U}_r^T\mathbf{x} = \mathbf{P}_r\mathbf{x} \quad (7.7)$$
where $\mathbf{P}_r = \mathbf{U}_r\mathbf{U}_r^T$ is the orthogonal projection matrix for the subspace spanned by the first $r$ basis vectors. That is, $\mathbf{P}_r$ is symmetric and $\mathbf{P}_r^2 = \mathbf{P}_r$. This is easy to verify because $\mathbf{P}_r^T = (\mathbf{U}_r\mathbf{U}_r^T)^T = \mathbf{U}_r\mathbf{U}_r^T = \mathbf{P}_r$, and $\mathbf{P}_r^2 = (\mathbf{U}_r\mathbf{U}_r^T)(\mathbf{U}_r\mathbf{U}_r^T) = \mathbf{U}_r\mathbf{U}_r^T = \mathbf{P}_r$, where we use the observation that $\mathbf{U}_r^T\mathbf{U}_r = \mathbf{I}_{r\times r}$, the $r \times r$ identity matrix. The projection matrix $\mathbf{P}_r$ can also be written as the decomposition
$$\mathbf{P}_r = \mathbf{U}_r\mathbf{U}_r^T = \sum_{i=1}^r \mathbf{u}_i\mathbf{u}_i^T \quad (7.8)$$
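A small sketch makes the symmetry and idempotence of $\mathbf{P}_r$ concrete, reusing the (rounded) basis from Example 7.1; the names and loose tolerances are ours:

import numpy as np

U = np.array([[-0.390, -0.639, -0.663],
              [ 0.089, -0.742,  0.664],
              [-0.916,  0.200,  0.346]])
r = 2
Ur = U[:, :r]                     # first r basis vectors
Pr = Ur @ Ur.T                    # orthogonal projection matrix, Eq. (7.8)

print(np.allclose(Pr, Pr.T))                 # symmetric
print(np.allclose(Pr @ Pr, Pr, atol=1e-2))   # idempotent (tolerance for rounding)

x = np.array([-0.343, -0.754, 0.241])
x_proj = Pr @ x                   # projection onto the subspace, Eq. (7.7)
eps = x - x_proj                  # error vector
print(np.isclose(x_proj @ eps, 0.0, atol=1e-3))  # x' and eps are orthogonal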
From Eqs. (7.1) and (7.4), the projection of $\mathbf{x}$ onto the remaining dimensions comprises the error vector
$$\boldsymbol{\epsilon} = \sum_{i=r+1}^d a_i\mathbf{u}_i = \mathbf{x} - \mathbf{x}'$$
It is worth noting that $\mathbf{x}'$ and $\boldsymbol{\epsilon}$ are orthogonal vectors:
$$\mathbf{x}'^T\boldsymbol{\epsilon} = \sum_{i=1}^r \sum_{j=r+1}^d a_i a_j \mathbf{u}_i^T\mathbf{u}_j = 0$$
This is a consequence of the basis being orthonormal. In fact, we can make an even stronger statement. The subspace spanned by the first $r$ basis vectors
$$S_r = \mathrm{span}(\mathbf{u}_1, \ldots, \mathbf{u}_r)$$
and the subspace spanned by the remaining basis vectors
$$S_{d-r} = \mathrm{span}(\mathbf{u}_{r+1}, \ldots, \mathbf{u}_d)$$
are orthogonal subspaces, that is, all pairs of vectors $\mathbf{x} \in S_r$ and $\mathbf{y} \in S_{d-r}$ must be orthogonal. The subspace $S_{d-r}$ is also called the orthogonal complement of $S_r$.
Example 7.2. Continuing Example 7.1, approximating the centered point $\mathbf{x} = (-0.343, -0.754, 0.241)^T$ by using only the first basis vector $\mathbf{u}_1 = (-0.390, 0.089, -0.916)^T$, we have
$$\mathbf{x}' = a_1\mathbf{u}_1 = -0.154\mathbf{u}_1 = \begin{pmatrix}0.060\\-0.014\\0.141\end{pmatrix}$$
The projection of $\mathbf{x}$ on $\mathbf{u}_1$ could have been obtained directly from the projection matrix
$$\mathbf{P}_1 = \mathbf{u}_1\mathbf{u}_1^T = \begin{pmatrix}-0.390\\0.089\\-0.916\end{pmatrix}\begin{pmatrix}-0.390 & 0.089 & -0.916\end{pmatrix} = \begin{pmatrix}0.152 & -0.035 & 0.357\\-0.035 & 0.008 & -0.082\\0.357 & -0.082 & 0.839\end{pmatrix}$$
That is,
$$\mathbf{x}' = \mathbf{P}_1\mathbf{x} = \begin{pmatrix}0.060\\-0.014\\0.141\end{pmatrix}$$
The error vector is given as
$$\boldsymbol{\epsilon} = a_2\mathbf{u}_2 + a_3\mathbf{u}_3 = \mathbf{x} - \mathbf{x}' = \begin{pmatrix}-0.40\\-0.74\\0.10\end{pmatrix}$$
One can verify that $\mathbf{x}'$ and $\boldsymbol{\epsilon}$ are orthogonal, i.e.,
$$\mathbf{x}'^T\boldsymbol{\epsilon} = \begin{pmatrix}0.060 & -0.014 & 0.141\end{pmatrix}\begin{pmatrix}-0.40\\-0.74\\0.10\end{pmatrix} = 0$$
The goal of dimensionality reduction is to seek an $r$-dimensional basis that gives the best possible approximation $\mathbf{x}_i'$ over all the points $\mathbf{x}_i \in \mathbf{D}$. Alternatively, we may seek to minimize the error $\boldsymbol{\epsilon}_i = \mathbf{x}_i - \mathbf{x}_i'$ over all the points.
7.2 PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a technique that seeks an $r$-dimensional basis that best captures the variance in the data. The direction with the largest projected variance is called the first principal component. The orthogonal direction that captures the second largest projected variance is called the second principal component, and so on. As we shall see, the direction that maximizes the variance is also the one that minimizes the mean squared error.
7.2.1 Best Line Approximation
We will start with $r = 1$, that is, the one-dimensional subspace or line $\mathbf{u}$ that best approximates $\mathbf{D}$ in terms of the variance of the projected points. This will lead to the general PCA technique for the best $1 \leq r \leq d$ dimensional basis for $\mathbf{D}$.
Without loss of generality, we assume that $\mathbf{u}$ has magnitude $\|\mathbf{u}\|^2 = \mathbf{u}^T\mathbf{u} = 1$; otherwise it is possible to keep on increasing the projected variance by simply increasing the magnitude of $\mathbf{u}$. We also assume that the data has been centered so that it has mean $\boldsymbol{\mu} = \mathbf{0}$.
The projection of $\mathbf{x}_i$ on the vector $\mathbf{u}$ is given as
$$\mathbf{x}_i' = \left(\frac{\mathbf{u}^T\mathbf{x}_i}{\mathbf{u}^T\mathbf{u}}\right)\mathbf{u} = (\mathbf{u}^T\mathbf{x}_i)\mathbf{u} = a_i\mathbf{u}$$
where the scalar $a_i = \mathbf{u}^T\mathbf{x}_i$ gives the coordinate of $\mathbf{x}_i'$ along $\mathbf{u}$. Note that because the mean point is $\boldsymbol{\mu} = \mathbf{0}$, its coordinate along $\mathbf{u}$ is $\mu_u = 0$.
We have to choose the direction $\mathbf{u}$ such that the variance of the projected points is maximized. The projected variance along $\mathbf{u}$ is given as
$$\sigma_u^2 = \frac{1}{n}\sum_{i=1}^n (a_i - \mu_u)^2 = \frac{1}{n}\sum_{i=1}^n (\mathbf{u}^T\mathbf{x}_i)^2 = \frac{1}{n}\sum_{i=1}^n \mathbf{u}^T\mathbf{x}_i\mathbf{x}_i^T\mathbf{u} = \mathbf{u}^T\left(\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T\right)\mathbf{u} = \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} \quad (7.9)$$
where $\boldsymbol{\Sigma}$ is the covariance matrix for the centered data $\mathbf{D}$.
To maximize the projected variance, we have to solve a constrained optimization problem, namely to maximize $\sigma_u^2$ subject to the constraint that $\mathbf{u}^T\mathbf{u} = 1$. This can be solved by introducing a Lagrangian multiplier $\alpha$ for the constraint, to obtain the unconstrained maximization problem
$$\max_{\mathbf{u}} J(\mathbf{u}) = \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} - \alpha(\mathbf{u}^T\mathbf{u} - 1) \quad (7.10)$$
Setting the derivative of $J(\mathbf{u})$ with respect to $\mathbf{u}$ to the zero vector, we obtain
$$\frac{\partial}{\partial\mathbf{u}}\left(\mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} - \alpha(\mathbf{u}^T\mathbf{u} - 1)\right) = \mathbf{0}$$
$$2\boldsymbol{\Sigma}\mathbf{u} - 2\alpha\mathbf{u} = \mathbf{0}$$
$$\boldsymbol{\Sigma}\mathbf{u} = \alpha\mathbf{u} \quad (7.11)$$
This implies that $\alpha$ is an eigenvalue of the covariance matrix $\boldsymbol{\Sigma}$, with the associated eigenvector $\mathbf{u}$. Further, taking the dot product with $\mathbf{u}$ on both sides of Eq. (7.11) yields
$$\mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} = \mathbf{u}^T\alpha\mathbf{u}$$
From Eq. (7.9), we then have
$$\sigma_u^2 = \alpha\,\mathbf{u}^T\mathbf{u} \qquad \text{or} \qquad \sigma_u^2 = \alpha \quad (7.12)$$
To maximize the projected variance $\sigma_u^2$, we should thus choose the largest eigenvalue of $\boldsymbol{\Sigma}$. In other words, the dominant eigenvector $\mathbf{u}_1$ specifies the direction of most variance, also called the first principal component, that is, $\mathbf{u} = \mathbf{u}_1$. Further, the largest eigenvalue $\lambda_1$ specifies the projected variance, that is, $\sigma_u^2 = \alpha = \lambda_1$.
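In code, the first principal component therefore falls out of an eigendecomposition of the covariance matrix. A minimal sketch assuming a centered data matrix of shape (n, d) in numpy (the synthetic data and names are ours):

import numpy as np

def first_principal_component(D):
    """Return (lambda1, u1) for a centered n x d data matrix D."""
    n = D.shape[0]
    Sigma = (D.T @ D) / n                  # covariance matrix of centered data
    evals, evecs = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
    return evals[-1], evecs[:, -1]         # eigh sorts eigenvalues ascending

rng = np.random.default_rng(0)
D = rng.normal(size=(150, 3)) @ np.diag([3.0, 1.0, 0.2])
D = D - D.mean(axis=0)                     # center the data
lam1, u1 = first_principal_component(D)
print(lam1, u1)                            # largest projected variance and its direction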
Minimum Squared Error Approach
We now show that the direction that maximizes the projected variance is also the one that minimizes the average squared error. As before, assume that the dataset $\mathbf{D}$ has been centered by subtracting the mean from each point. For a point $\mathbf{x}_i \in \mathbf{D}$, let $\mathbf{x}_i'$ denote its projection along the direction $\mathbf{u}$, and let $\boldsymbol{\epsilon}_i = \mathbf{x}_i - \mathbf{x}_i'$ denote the error vector. The mean squared error (MSE) optimization condition is defined as
$$\mathrm{MSE}(\mathbf{u}) = \frac{1}{n}\sum_{i=1}^n \|\boldsymbol{\epsilon}_i\|^2 \quad (7.13)$$
$$= \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{x}_i'\|^2 = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \mathbf{x}_i')^T(\mathbf{x}_i - \mathbf{x}_i') = \frac{1}{n}\sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - 2\mathbf{x}_i^T\mathbf{x}_i' + (\mathbf{x}_i')^T\mathbf{x}_i'\right) \quad (7.14)$$
Noting that $\mathbf{x}_i' = (\mathbf{u}^T\mathbf{x}_i)\mathbf{u}$, we have
$$= \frac{1}{n}\sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - 2\mathbf{x}_i^T(\mathbf{u}^T\mathbf{x}_i)\mathbf{u} + \left((\mathbf{u}^T\mathbf{x}_i)\mathbf{u}\right)^T(\mathbf{u}^T\mathbf{x}_i)\mathbf{u}\right)$$
$$= \frac{1}{n}\sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - 2(\mathbf{u}^T\mathbf{x}_i)(\mathbf{x}_i^T\mathbf{u}) + (\mathbf{u}^T\mathbf{x}_i)(\mathbf{x}_i^T\mathbf{u})\,\mathbf{u}^T\mathbf{u}\right)$$
$$= \frac{1}{n}\sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - (\mathbf{u}^T\mathbf{x}_i)(\mathbf{x}_i^T\mathbf{u})\right)$$
$$= \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i\|^2 - \frac{1}{n}\sum_{i=1}^n \mathbf{u}^T(\mathbf{x}_i\mathbf{x}_i^T)\mathbf{u}$$
$$= \frac{\sum_{i=1}^n \|\mathbf{x}_i\|^2}{n} - \mathbf{u}^T\left(\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T\right)\mathbf{u} = \frac{\sum_{i=1}^n \|\mathbf{x}_i\|^2}{n} - \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} \quad (7.15)$$
Note that by Eq. (1.4) the total variance of the centered data (i.e., with $\boldsymbol{\mu} = \mathbf{0}$) is given as
$$\mathrm{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{0}\|^2 = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i\|^2$$
Further, by Eq. (2.28), we have
$$\mathrm{var}(\mathbf{D}) = \mathrm{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^d \sigma_i^2$$
Thus, we may rewrite Eq. (7.15) as
$$\mathrm{MSE}(\mathbf{u}) = \mathrm{var}(\mathbf{D}) - \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u} = \sum_{i=1}^d \sigma_i^2 - \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u}$$
Because the first term, $\mathrm{var}(\mathbf{D})$, is a constant for a given dataset $\mathbf{D}$, the vector $\mathbf{u}$ that minimizes $\mathrm{MSE}(\mathbf{u})$ is thus the same one that maximizes the second term, the projected variance $\mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u}$. Because we know that $\mathbf{u}_1$, the dominant eigenvector of $\boldsymbol{\Sigma}$, maximizes the projected variance, we have
$$\mathrm{MSE}(\mathbf{u}_1) = \mathrm{var}(\mathbf{D}) - \mathbf{u}_1^T\boldsymbol{\Sigma}\mathbf{u}_1 = \mathrm{var}(\mathbf{D}) - \mathbf{u}_1^T\lambda_1\mathbf{u}_1 = \mathrm{var}(\mathbf{D}) - \lambda_1 \quad (7.16)$$
Thus, the principal component $\mathbf{u}_1$, which is the direction that maximizes the projected variance, is also the direction that minimizes the mean squared error.
Example 7.3. Figure 7.2 shows the first principal component, that is, the best one-dimensional approximation, for the three dimensional Iris dataset shown in Figure 7.1a. The covariance matrix for this dataset is given as
$$\boldsymbol{\Sigma} = \begin{pmatrix}0.681 & -0.039 & 1.265\\-0.039 & 0.187 & -0.320\\1.265 & -0.320 & 3.092\end{pmatrix}$$
The variance values $\sigma_i^2$ for each of the original dimensions are given along the main diagonal of $\boldsymbol{\Sigma}$. For example, $\sigma_1^2 = 0.681$, $\sigma_2^2 = 0.187$, and $\sigma_3^2 = 3.092$. The largest eigenvalue of $\boldsymbol{\Sigma}$ is $\lambda_1 = 3.662$, and the corresponding dominant eigenvector is $\mathbf{u}_1 = (-0.390, 0.089, -0.916)^T$. The unit vector $\mathbf{u}_1$ thus maximizes the projected variance, which is given as $J(\mathbf{u}_1) = \alpha = \lambda_1 = 3.662$. Figure 7.2 plots the principal component $\mathbf{u}_1$. It also shows the error vectors $\boldsymbol{\epsilon}_i$, as thin gray line segments.
The total variance of the data is given as
$$\mathrm{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i\|^2 = \frac{1}{150}\cdot 594.04 = 3.96$$

Figure 7.2. Best one-dimensional or line approximation.

We can also directly obtain the total variance as the trace of the covariance matrix:
$$\mathrm{var}(\mathbf{D}) = \mathrm{tr}(\boldsymbol{\Sigma}) = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 = 0.681 + 0.187 + 3.092 = 3.96$$
Thus, using Eq. (7.16), the minimum value of the mean squared error is given as
$$\mathrm{MSE}(\mathbf{u}_1) = \mathrm{var}(\mathbf{D}) - \lambda_1 = 3.96 - 3.662 = 0.298$$
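Eq. (7.16) can be confirmed numerically on any centered dataset. A sketch on synthetic data (names ours):

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(150, 3)) @ np.diag([3.0, 1.0, 0.2])
D = D - D.mean(axis=0)
n = D.shape[0]
Sigma = (D.T @ D) / n
evals, evecs = np.linalg.eigh(Sigma)
lam1, u1 = evals[-1], evecs[:, -1]

X_proj = (D @ u1)[:, None] * u1            # project every point onto u1
mse = np.mean(np.sum((D - X_proj) ** 2, axis=1))
total_var = np.trace(Sigma)                # var(D) = tr(Sigma)
print(np.isclose(mse, total_var - lam1))   # Eq. (7.16): MSE(u1) = var(D) - lambda1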
7.2.2 Best 2-dimensional Approximation
We are now interested in the best two-dimensional approximation to $\mathbf{D}$. As before, assume that $\mathbf{D}$ has already been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. We already computed the direction with the most variance, namely $\mathbf{u}_1$, which is the eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\boldsymbol{\Sigma}$. We now want to find another direction $\mathbf{v}$, which also maximizes the projected variance, but is orthogonal to $\mathbf{u}_1$. According to Eq. (7.9) the projected variance along $\mathbf{v}$ is given as
$$\sigma_v^2 = \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}$$
We further require that $\mathbf{v}$ be a unit vector orthogonal to $\mathbf{u}_1$, that is,
$$\mathbf{v}^T\mathbf{u}_1 = 0 \qquad \mathbf{v}^T\mathbf{v} = 1$$
The optimization condition then becomes
$$\max_{\mathbf{v}} J(\mathbf{v}) = \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v} - \alpha(\mathbf{v}^T\mathbf{v} - 1) - \beta(\mathbf{v}^T\mathbf{u}_1 - 0) \quad (7.17)$$
Taking the derivative of $J(\mathbf{v})$ with respect to $\mathbf{v}$, and setting it to the zero vector, gives
$$2\boldsymbol{\Sigma}\mathbf{v} - 2\alpha\mathbf{v} - \beta\mathbf{u}_1 = \mathbf{0} \quad (7.18)$$
If we multiply on the left by $\mathbf{u}_1^T$ we get
$$2\mathbf{u}_1^T\boldsymbol{\Sigma}\mathbf{v} - 2\alpha\,\mathbf{u}_1^T\mathbf{v} - \beta\,\mathbf{u}_1^T\mathbf{u}_1 = 0$$
$$2\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{u}_1 - \beta = 0,$$
which implies that
$$\beta = 2\mathbf{v}^T\lambda_1\mathbf{u}_1 = 2\lambda_1\mathbf{v}^T\mathbf{u}_1 = 0$$
In the derivation above we used the fact that $\mathbf{u}_1^T\boldsymbol{\Sigma}\mathbf{v} = \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{u}_1$, and that $\mathbf{v}$ is orthogonal to $\mathbf{u}_1$. Plugging $\beta = 0$ into Eq. (7.18) gives us
$$2\boldsymbol{\Sigma}\mathbf{v} - 2\alpha\mathbf{v} = \mathbf{0} \qquad \boldsymbol{\Sigma}\mathbf{v} = \alpha\mathbf{v}$$
This means that $\mathbf{v}$ is another eigenvector of $\boldsymbol{\Sigma}$. Also, as in Eq. (7.12), we have $\sigma_v^2 = \alpha$. To maximize the variance along $\mathbf{v}$, we should choose $\alpha = \lambda_2$, the second largest eigenvalue of $\boldsymbol{\Sigma}$, with the second principal component being given by the corresponding eigenvector, that is, $\mathbf{v} = \mathbf{u}_2$.
Total Projected Variance
Let $\mathbf{U}_2$ be the matrix whose columns correspond to the two principal components, given as
$$\mathbf{U}_2 = \begin{pmatrix} | & | \\ \mathbf{u}_1 & \mathbf{u}_2 \\ | & | \end{pmatrix}$$
Given the point $\mathbf{x}_i \in \mathbf{D}$, its coordinates in the two-dimensional subspace spanned by $\mathbf{u}_1$ and $\mathbf{u}_2$ can be computed via Eq. (7.6), as follows:
$$\mathbf{a}_i = \mathbf{U}_2^T\mathbf{x}_i$$
Assume that each point $\mathbf{x}_i \in \mathbb{R}^d$ in $\mathbf{D}$ has been projected to obtain its coordinates $\mathbf{a}_i \in \mathbb{R}^2$, yielding the new dataset $\mathbf{A}$. Further, because $\mathbf{D}$ is assumed to be centered, with $\boldsymbol{\mu} = \mathbf{0}$, the coordinates of the projected mean are also zero because $\mathbf{U}_2^T\boldsymbol{\mu} = \mathbf{U}_2^T\mathbf{0} = \mathbf{0}$.
The total variance for $\mathbf{A}$ is given as
$$\mathrm{var}(\mathbf{A}) = \frac{1}{n}\sum_{i=1}^n \|\mathbf{a}_i - \mathbf{0}\|^2 = \frac{1}{n}\sum_{i=1}^n \left(\mathbf{U}_2^T\mathbf{x}_i\right)^T\left(\mathbf{U}_2^T\mathbf{x}_i\right) = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i^T\mathbf{U}_2\mathbf{U}_2^T\mathbf{x}_i = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i^T\mathbf{P}_2\mathbf{x}_i \quad (7.19)$$
where $\mathbf{P}_2$ is the orthogonal projection matrix [Eq. (7.8)] given as
$$\mathbf{P}_2 = \mathbf{U}_2\mathbf{U}_2^T = \mathbf{u}_1\mathbf{u}_1^T + \mathbf{u}_2\mathbf{u}_2^T$$
Substituting this into Eq. (7.19), the projected total variance is given as
$$\mathrm{var}(\mathbf{A}) = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i^T\mathbf{P}_2\mathbf{x}_i \quad (7.20)$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i^T\left(\mathbf{u}_1\mathbf{u}_1^T + \mathbf{u}_2\mathbf{u}_2^T\right)\mathbf{x}_i = \frac{1}{n}\sum_{i=1}^n (\mathbf{u}_1^T\mathbf{x}_i)(\mathbf{x}_i^T\mathbf{u}_1) + \frac{1}{n}\sum_{i=1}^n (\mathbf{u}_2^T\mathbf{x}_i)(\mathbf{x}_i^T\mathbf{u}_2) = \mathbf{u}_1^T\boldsymbol{\Sigma}\mathbf{u}_1 + \mathbf{u}_2^T\boldsymbol{\Sigma}\mathbf{u}_2 \quad (7.21)$$
Because $\mathbf{u}_1$ and $\mathbf{u}_2$ are eigenvectors of $\boldsymbol{\Sigma}$, we have $\boldsymbol{\Sigma}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$ and $\boldsymbol{\Sigma}\mathbf{u}_2 = \lambda_2\mathbf{u}_2$, so that
$$\mathrm{var}(\mathbf{A}) = \mathbf{u}_1^T\boldsymbol{\Sigma}\mathbf{u}_1 + \mathbf{u}_2^T\boldsymbol{\Sigma}\mathbf{u}_2 = \mathbf{u}_1^T\lambda_1\mathbf{u}_1 + \mathbf{u}_2^T\lambda_2\mathbf{u}_2 = \lambda_1 + \lambda_2 \quad (7.22)$$
Thus, the sum of the eigenvalues is the total variance of the projected points, and the first two principal components maximize this variance.
Mean Squared Error
We now show that the first two principal components also minimize the mean squared error objective. The mean squared error objective is given as
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{x}_i'\|^2$$
$$= \frac{1}{n}\sum_{i=1}^n \left(\|\mathbf{x}_i\|^2 - 2\mathbf{x}_i^T\mathbf{x}_i' + (\mathbf{x}_i')^T\mathbf{x}_i'\right), \quad \text{using Eq. (7.14)}$$
$$= \mathrm{var}(\mathbf{D}) + \frac{1}{n}\sum_{i=1}^n \left(-2\mathbf{x}_i^T\mathbf{P}_2\mathbf{x}_i + (\mathbf{P}_2\mathbf{x}_i)^T\mathbf{P}_2\mathbf{x}_i\right), \quad \text{using Eq. (7.7) that } \mathbf{x}_i' = \mathbf{P}_2\mathbf{x}_i$$
$$= \mathrm{var}(\mathbf{D}) - \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i^T\mathbf{P}_2\mathbf{x}_i = \mathrm{var}(\mathbf{D}) - \mathrm{var}(\mathbf{A}), \quad \text{using Eq. (7.20)} \quad (7.23)$$
Thus, the MSE objective is minimized precisely when the total projected variance $\mathrm{var}(\mathbf{A})$ is maximized. From Eq. (7.22), we have
$$\mathrm{MSE} = \mathrm{var}(\mathbf{D}) - \lambda_1 - \lambda_2$$
Example 7.4. For the Iris dataset from Example 7.1, the two largest eigenvalues are $\lambda_1 = 3.662$ and $\lambda_2 = 0.239$, with the corresponding eigenvectors:
$$\mathbf{u}_1 = \begin{pmatrix}-0.390\\0.089\\-0.916\end{pmatrix} \qquad \mathbf{u}_2 = \begin{pmatrix}-0.639\\-0.742\\0.200\end{pmatrix}$$
The projection matrix is given as
$$\mathbf{P}_2 = \mathbf{U}_2\mathbf{U}_2^T = \mathbf{u}_1\mathbf{u}_1^T + \mathbf{u}_2\mathbf{u}_2^T$$
$$= \begin{pmatrix}0.152 & -0.035 & 0.357\\-0.035 & 0.008 & -0.082\\0.357 & -0.082 & 0.839\end{pmatrix} + \begin{pmatrix}0.408 & 0.474 & -0.128\\0.474 & 0.551 & -0.148\\-0.128 & -0.148 & 0.04\end{pmatrix} = \begin{pmatrix}0.560 & 0.439 & 0.229\\0.439 & 0.558 & -0.230\\0.229 & -0.230 & 0.879\end{pmatrix}$$
Thus, each point $\mathbf{x}_i$ can be approximated by its projection onto the first two principal components, $\mathbf{x}_i' = \mathbf{P}_2\mathbf{x}_i$. Figure 7.3a plots this optimal 2-dimensional subspace spanned by $\mathbf{u}_1$ and $\mathbf{u}_2$. The error vector $\boldsymbol{\epsilon}_i$ for each point is shown as a thin line segment. The gray points are behind the 2-dimensional subspace, whereas the white points are in front of it. The total variance captured by the subspace is given as
$$\lambda_1 + \lambda_2 = 3.662 + 0.239 = 3.901$$
The mean squared error is given as
$$\mathrm{MSE} = \mathrm{var}(\mathbf{D}) - \lambda_1 - \lambda_2 = 3.96 - 3.662 - 0.239 = 0.059$$
Figure 7.3b plots a nonoptimal 2-dimensional subspace. As one can see, the optimal subspace maximizes the variance and minimizes the squared error, whereas the nonoptimal subspace captures less variance and has a high mean squared error value, which can be pictorially seen from the lengths of the error vectors (line segments). In fact, this is the worst possible 2-dimensional subspace; its MSE is 3.662.
Figure 7.3. Best two-dimensional approximation: (a) optimal basis; (b) nonoptimal basis.
7.2.3 Best r-dimensional Approximation
We are now interested in the best $r$-dimensional approximation to $\mathbf{D}$, where $2 < r \leq d$. Assume that we have already computed the first $j - 1$ principal components or eigenvectors, $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{j-1}$, corresponding to the $j - 1$ largest eigenvalues of $\boldsymbol{\Sigma}$, for $1 \leq j \leq r$. To compute the $j$th new basis vector $\mathbf{v}$, we have to ensure that it is normalized to unit length, that is, $\mathbf{v}^T\mathbf{v} = 1$, and is orthogonal to all previous components $\mathbf{u}_i$, i.e., $\mathbf{u}_i^T\mathbf{v} = 0$, for $1 \leq i < j$. As before, the projected variance along $\mathbf{v}$ is given as
$$\sigma_v^2 = \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}$$
Combined with the constraints on $\mathbf{v}$, this leads to the following maximization problem with Lagrange multipliers:
$$\max_{\mathbf{v}} J(\mathbf{v}) = \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v} - \alpha(\mathbf{v}^T\mathbf{v} - 1) - \sum_{i=1}^{j-1}\beta_i(\mathbf{u}_i^T\mathbf{v} - 0)$$
Taking the derivative of $J(\mathbf{v})$ with respect to $\mathbf{v}$ and setting it to the zero vector gives
$$2\boldsymbol{\Sigma}\mathbf{v} - 2\alpha\mathbf{v} - \sum_{i=1}^{j-1}\beta_i\mathbf{u}_i = \mathbf{0} \quad (7.24)$$
If we multiply on the left by $\mathbf{u}_k^T$, for $1 \leq k < j$, we get
$$2\mathbf{u}_k^T\boldsymbol{\Sigma}\mathbf{v} - 2\alpha\,\mathbf{u}_k^T\mathbf{v} - \beta_k\,\mathbf{u}_k^T\mathbf{u}_k - \sum_{\substack{i=1\\ i\neq k}}^{j-1}\beta_i\,\mathbf{u}_k^T\mathbf{u}_i = 0$$
$$2\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{u}_k - \beta_k = 0$$
$$\beta_k = 2\mathbf{v}^T\lambda_k\mathbf{u}_k = 2\lambda_k\mathbf{v}^T\mathbf{u}_k = 0$$
where we used the fact that $\boldsymbol{\Sigma}\mathbf{u}_k = \lambda_k\mathbf{u}_k$, as $\mathbf{u}_k$ is the eigenvector corresponding to the $k$th largest eigenvalue $\lambda_k$ of $\boldsymbol{\Sigma}$. Thus, we find that $\beta_i = 0$ for all $i < j$.

Also, because $\eta_j = n\lambda_j$, the variance along the $j$th principal component is given as $\lambda_j = \frac{\eta_j}{n}$. Algorithm 7.2 gives the pseudo-code for the kernel PCA method.
ALGORITHM 7.2. Kernel Principal Component Analysis

KERNELPCA($\mathbf{D}$, $K$, $\alpha$):
1. $\mathbf{K} = \{K(\mathbf{x}_i, \mathbf{x}_j)\}_{i,j=1,\ldots,n}$ // compute $n \times n$ kernel matrix
2. $\mathbf{K} = (\mathbf{I} - \frac{1}{n}\mathbf{1}_{n\times n})\,\mathbf{K}\,(\mathbf{I} - \frac{1}{n}\mathbf{1}_{n\times n})$ // center the kernel matrix
3. $(\eta_1, \eta_2, \ldots, \eta_d) = \text{eigenvalues}(\mathbf{K})$ // compute eigenvalues
4. $(\mathbf{c}_1\ \mathbf{c}_2\ \cdots\ \mathbf{c}_n) = \text{eigenvectors}(\mathbf{K})$ // compute eigenvectors
5. $\lambda_i = \frac{\eta_i}{n}$ for all $i = 1, \ldots, n$ // compute variance for each component
6. $\mathbf{c}_i = \sqrt{\frac{1}{\eta_i}}\cdot\mathbf{c}_i$ for all $i = 1, \ldots, n$ // ensure that $\mathbf{u}_i^T\mathbf{u}_i = 1$
7. $f(r) = \frac{\sum_{i=1}^r \lambda_i}{\sum_{i=1}^d \lambda_i}$, for all $r = 1, 2, \ldots, d$ // fraction of total variance
8. Choose smallest $r$ so that $f(r) \geq \alpha$ // choose dimensionality
9. $\mathbf{C}_r = (\mathbf{c}_1\ \mathbf{c}_2\ \cdots\ \mathbf{c}_r)$ // reduced basis
10. $\mathbf{A} = \{\mathbf{a}_i \mid \mathbf{a}_i = \mathbf{C}_r^T\mathbf{K}_i,\ \text{for } i = 1, \ldots, n\}$ // reduced dimensionality data
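The steps of Algorithm 7.2 map almost line for line onto numpy. A sketch (the helper name kernel_pca, the synthetic data, and the small eigenvalue cutoff are our illustrative choices):

import numpy as np

def kernel_pca(X, kernel, alpha=0.95):
    """Sketch of Algorithm 7.2: returns reduced-dimensionality coordinates A."""
    n = X.shape[0]
    K = np.array([[kernel(x, y) for y in X] for x in X])  # n x n kernel matrix
    C = np.eye(n) - np.ones((n, n)) / n
    K = C @ K @ C                           # center the kernel matrix
    eta, vecs = np.linalg.eigh(K)           # ascending eigenvalues
    eta, vecs = eta[::-1], vecs[:, ::-1]    # sort descending
    pos = eta > 1e-9                        # keep only positive eigenvalues
    eta, vecs = eta[pos], vecs[:, pos]
    lam = eta / n                           # variance along each component
    vecs = vecs / np.sqrt(eta)              # normalize so that u_i^T u_i = 1
    f = np.cumsum(lam) / lam.sum()          # fraction of total variance
    r = int(np.searchsorted(f, alpha) + 1)  # smallest r with f(r) >= alpha
    return K @ vecs[:, :r]                  # a_i = C_r^T K_i for every point

X = np.random.default_rng(1).normal(size=(150, 2))
A = kernel_pca(X, kernel=lambda x, y: (x @ y) ** 2)  # homogeneous quadratic kernel
print(A.shape)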
Example 7.8. Consider the nonlinear Iris data from Example 7.7 with $n = 150$ points. Let us use the homogeneous quadratic polynomial kernel in Eq. (5.8):
$$K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^T\mathbf{x}_j\right)^2$$
The kernel matrix $\mathbf{K}$ has three nonzero eigenvalues:
$$\eta_1 = 31.0 \qquad \eta_2 = 8.94 \qquad \eta_3 = 2.76$$
$$\lambda_1 = \frac{\eta_1}{150} = 0.2067 \qquad \lambda_2 = \frac{\eta_2}{150} = 0.0596 \qquad \lambda_3 = \frac{\eta_3}{150} = 0.0184$$
The corresponding eigenvectors $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{c}_3$ are not shown because they lie in $\mathbb{R}^{150}$.
Figure 7.8 shows the contour lines of constant projection onto the first three kernel principal components. These lines are obtained by solving the equations
$$\mathbf{u}_i^T\mathbf{x} = \sum_{j=1}^n c_{ij}K(\mathbf{x}_j, \mathbf{x}) = s$$
for different projection values $s$, for each of the eigenvectors $\mathbf{c}_i = (c_{i1}, c_{i2}, \ldots, c_{in})^T$ of the kernel matrix. For instance, for the first principal component this corresponds to the solutions $\mathbf{x} = (x_1, x_2)^T$, shown as contour lines, of the following equation:
$$1.0426x_1^2 + 0.995x_2^2 + 0.914x_1x_2 = s$$
for each chosen value of $s$. The principal components are also not shown in the figure, as it is typically not possible or feasible to map the points into feature space, and thus one cannot derive an explicit expression for $\mathbf{u}_i$. However, because the projection onto the principal components can be carried out via kernel operations via Eq. (7.36), Figure 7.9 shows the projection of the points onto the first two kernel principal components, which capture
$$\frac{\lambda_1 + \lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} = \frac{0.2663}{0.2847} = 93.5\%$$
of the total variance.
Incidentally, the use of a linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$ yields exactly the same principal components as shown in Figure 7.7.
Figure 7.8. Kernel PCA: homogeneous quadratic kernel. (a) $\lambda_1 = 0.2067$; (b) $\lambda_2 = 0.0596$; (c) $\lambda_3 = 0.0184$.
7.4 SINGULAR VALUE DECOMPOSITION
Principal components analysis is a special case of a more general matrix decomposition method called Singular Value Decomposition (SVD). We saw in Eq. (7.28) that PCA yields the following decomposition of the covariance matrix:
$$\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T \quad (7.37)$$
Figure 7.9. Projected point coordinates: homogeneous quadratic kernel.
where the covariance matrix has been factorized into the orthogonal matrix $\mathbf{U}$ containing its eigenvectors, and a diagonal matrix $\boldsymbol{\Lambda}$ containing its eigenvalues (sorted in decreasing order). SVD generalizes the above factorization for any matrix. In particular, for an $n \times d$ data matrix $\mathbf{D}$ with $n$ points and $d$ columns, SVD factorizes $\mathbf{D}$ as follows:
$$\mathbf{D} = \mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T \quad (7.38)$$
where $\mathbf{L}$ is an orthogonal $n \times n$ matrix, $\mathbf{R}$ is an orthogonal $d \times d$ matrix, and $\boldsymbol{\Delta}$ is an $n \times d$ "diagonal" matrix. The columns of $\mathbf{L}$ are called the left singular vectors, and the columns of $\mathbf{R}$ (or rows of $\mathbf{R}^T$) are called the right singular vectors. The matrix $\boldsymbol{\Delta}$ is defined as
$$\boldsymbol{\Delta}(i,j) = \begin{cases} \delta_i & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
where $i = 1, \ldots, n$ and $j = 1, \ldots, d$. The entries $\boldsymbol{\Delta}(i,i) = \delta_i$ along the main diagonal of $\boldsymbol{\Delta}$ are called the singular values of $\mathbf{D}$, and they are all non-negative. If the rank of $\mathbf{D}$ is $r \leq \min(n,d)$, then there will be only $r$ nonzero singular values, which we assume are ordered as follows:
$$\delta_1 \geq \delta_2 \geq \cdots \geq \delta_r > 0$$
One can discard those left and right singular vectors that correspond to zero singular values, to obtain the reduced SVD as
$$\mathbf{D} = \mathbf{L}_r\boldsymbol{\Delta}_r\mathbf{R}_r^T \quad (7.39)$$
where $\mathbf{L}_r$ is the $n \times r$ matrix of the left singular vectors, $\mathbf{R}_r$ is the $d \times r$ matrix of the right singular vectors, and $\boldsymbol{\Delta}_r$ is the $r \times r$ diagonal matrix containing the positive singular values. The reduced SVD leads directly to the spectral decomposition of $\mathbf{D}$, given as
$$\mathbf{D} = \mathbf{L}_r\boldsymbol{\Delta}_r\mathbf{R}_r^T = \begin{pmatrix} | & | & & | \\ \mathbf{l}_1 & \mathbf{l}_2 & \cdots & \mathbf{l}_r \\ | & | & & | \end{pmatrix}\begin{pmatrix}\delta_1 & 0 & \cdots & 0\\0 & \delta_2 & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & \delta_r\end{pmatrix}\begin{pmatrix}\mathbf{r}_1^T\\\mathbf{r}_2^T\\\vdots\\\mathbf{r}_r^T\end{pmatrix}$$
$$= \delta_1\mathbf{l}_1\mathbf{r}_1^T + \delta_2\mathbf{l}_2\mathbf{r}_2^T + \cdots + \delta_r\mathbf{l}_r\mathbf{r}_r^T = \sum_{i=1}^r \delta_i\mathbf{l}_i\mathbf{r}_i^T$$
The spectral decomposition represents $\mathbf{D}$ as a sum of rank-one matrices of the form $\delta_i\mathbf{l}_i\mathbf{r}_i^T$. By selecting the $q$ largest singular values $\delta_1, \delta_2, \ldots, \delta_q$ and the corresponding left and right singular vectors, we obtain the best rank-$q$ approximation to the original matrix $\mathbf{D}$. That is, if $\mathbf{D}_q$ is the matrix defined as
$$\mathbf{D}_q = \sum_{i=1}^q \delta_i\mathbf{l}_i\mathbf{r}_i^T$$
then it can be shown that $\mathbf{D}_q$ is the rank-$q$ matrix that minimizes the expression
$$\|\mathbf{D} - \mathbf{D}_q\|_F$$
where $\|\mathbf{A}\|_F$ is called the Frobenius norm of the $n \times d$ matrix $\mathbf{A}$, defined as
$$\|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^d \mathbf{A}(i,j)^2}$$
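The best rank-q approximation is a few lines on top of numpy's SVD routine. A sketch (the helper name best_rank_q and the random data are ours):

import numpy as np

def best_rank_q(D, q):
    """Best rank-q approximation of D in Frobenius norm, via truncated SVD."""
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)
    return L[:, :q] @ np.diag(delta[:q]) @ Rt[:q, :]

D = np.random.default_rng(2).normal(size=(150, 3))
D2 = best_rank_q(D, q=2)
print(np.linalg.matrix_rank(D2))        # 2
print(np.linalg.norm(D - D2, 'fro'))    # the minimal rank-2 reconstruction error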
7.4.1 Geometry of SVD
In general, any $n \times d$ matrix $\mathbf{D}$ represents a linear transformation, $\mathbf{D}: \mathbb{R}^d \to \mathbb{R}^n$, from the space of $d$-dimensional vectors to the space of $n$-dimensional vectors, because for any $\mathbf{x} \in \mathbb{R}^d$ there exists $\mathbf{y} \in \mathbb{R}^n$ such that
$$\mathbf{D}\mathbf{x} = \mathbf{y}$$
y
∈
R
n
such that
Dx
=
y
over all possible
x
∈
R
d
is called the
column space
of
D
, and the set of all vectors
x
∈
R
d
, such that
D
T
y
=
x
over all
y
∈
R
n
,
is called the
row space
of
D
, which is equivalent to the column space of
D
T
. In other
words, the column space of
D
is the set of all vectors that can be obtained as linear
combinations of columns of
D
, and the row space of
D
is the set of all vectors that can
7.4 Singular Value Decomposition
211
be obtained as linear combinations of the rows of
D
(or columns of
D
T
). Also note that
the set of all vectors
x
∈
R
d
, such that
Dx
=
0
is called the
null space
of
D
, and finally,
the set of all vectors
y
∈
R
n
, such that
D
T
y
=
0
is called the
left null space
of
D
.
One of the main properties of SVD is that it gives a basis for each of the four fundamental spaces associated with the matrix $\mathbf{D}$. If $\mathbf{D}$ has rank $r$, it means that it has only $r$ independent columns, and also only $r$ independent rows. Thus, the $r$ left singular vectors $\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_r$ corresponding to the $r$ nonzero singular values of $\mathbf{D}$ in Eq. (7.38) represent a basis for the column space of $\mathbf{D}$. The remaining $n - r$ left singular vectors $\mathbf{l}_{r+1}, \ldots, \mathbf{l}_n$ represent a basis for the left null space of $\mathbf{D}$. For the row space, the $r$ right singular vectors $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_r$ corresponding to the $r$ nonzero singular values represent a basis for the row space of $\mathbf{D}$, and the remaining $d - r$ right singular vectors $\mathbf{r}_j$ ($j = r+1, \ldots, d$) represent a basis for the null space of $\mathbf{D}$.
Consider the reduced SVD expression in Eq. (7.39). Right multiplying both sides of the equation by $\mathbf{R}_r$ and noting that $\mathbf{R}_r^T\mathbf{R}_r = \mathbf{I}_r$, where $\mathbf{I}_r$ is the $r \times r$ identity matrix, we have
$$\mathbf{D}\mathbf{R}_r = \mathbf{L}_r\boldsymbol{\Delta}_r\mathbf{R}_r^T\mathbf{R}_r$$
$$\mathbf{D}\mathbf{R}_r = \mathbf{L}_r\boldsymbol{\Delta}_r$$
$$\mathbf{D}\begin{pmatrix} | & | & & | \\ \mathbf{r}_1 & \mathbf{r}_2 & \cdots & \mathbf{r}_r \\ | & | & & | \end{pmatrix} = \begin{pmatrix} | & | & & | \\ \delta_1\mathbf{l}_1 & \delta_2\mathbf{l}_2 & \cdots & \delta_r\mathbf{l}_r \\ | & | & & | \end{pmatrix}$$
From the above, we conclude that
$$\mathbf{D}\mathbf{r}_i = \delta_i\mathbf{l}_i \quad \text{for all } i = 1, \ldots, r$$
In other words, SVD is a special factorization of the matrix $\mathbf{D}$, such that any basis vector $\mathbf{r}_i$ for the row space is mapped to the corresponding basis vector $\mathbf{l}_i$ in the column space, scaled by the singular value $\delta_i$. As such, we can think of the SVD as a mapping from an orthonormal basis $(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_r)$ in $\mathbb{R}^d$ (the row space) to an orthonormal basis $(\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_r)$ in $\mathbb{R}^n$ (the column space), with the corresponding axes scaled according to the singular values $\delta_1, \delta_2, \ldots, \delta_r$.
7.4.2 Connection between SVD and PCA
Assume that the matrix $\mathbf{D}$ has been centered, and assume that it has been factorized via SVD [Eq. (7.38)] as $\mathbf{D} = \mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T$. Consider the scatter matrix for $\mathbf{D}$, given as $\mathbf{D}^T\mathbf{D}$. We have
$$\mathbf{D}^T\mathbf{D} = \left(\mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T\right)^T\left(\mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T\right) = \mathbf{R}\boldsymbol{\Delta}^T\mathbf{L}^T\mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T = \mathbf{R}(\boldsymbol{\Delta}^T\boldsymbol{\Delta})\mathbf{R}^T = \mathbf{R}\boldsymbol{\Delta}_d^2\mathbf{R}^T \quad (7.40)$$
where $\boldsymbol{\Delta}_d^2$ is the $d \times d$ diagonal matrix defined as $\boldsymbol{\Delta}_d^2(i,i) = \delta_i^2$, for $i = 1, \ldots, d$. Only $r \leq \min(d,n)$ of these eigenvalues are positive, whereas the rest are all zeros.
Because the covariance matrix of centered $\mathbf{D}$ is given as $\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{D}^T\mathbf{D}$, and because it can be decomposed as $\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ via PCA [Eq. (7.37)], we have
$$\mathbf{D}^T\mathbf{D} = n\boldsymbol{\Sigma} = n\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T = \mathbf{U}(n\boldsymbol{\Lambda})\mathbf{U}^T \quad (7.41)$$
Equating Eq. (7.40) and Eq. (7.41), we conclude that the right singular vectors $\mathbf{R}$ are the same as the eigenvectors of $\boldsymbol{\Sigma}$. Further, the corresponding singular values of $\mathbf{D}$ are related to the eigenvalues of $\boldsymbol{\Sigma}$ by the expression
$$n\lambda_i = \delta_i^2 \qquad \text{or} \qquad \lambda_i = \frac{\delta_i^2}{n}, \quad \text{for } i = 1, \ldots, d \quad (7.42)$$
Let us now consider the matrix $\mathbf{D}\mathbf{D}^T$. We have
$$\mathbf{D}\mathbf{D}^T = \left(\mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T\right)\left(\mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T\right)^T = \mathbf{L}\boldsymbol{\Delta}\mathbf{R}^T\mathbf{R}\boldsymbol{\Delta}^T\mathbf{L}^T = \mathbf{L}(\boldsymbol{\Delta}\boldsymbol{\Delta}^T)\mathbf{L}^T = \mathbf{L}\boldsymbol{\Delta}_n^2\mathbf{L}^T$$
where $\boldsymbol{\Delta}_n^2$ is the $n \times n$ diagonal matrix given as $\boldsymbol{\Delta}_n^2(i,i) = \delta_i^2$, for $i = 1, \ldots, n$. Only $r$ of these singular values are positive, whereas the rest are all zeros. Thus, the left singular vectors in $\mathbf{L}$ are the eigenvectors of the $n \times n$ matrix $\mathbf{D}\mathbf{D}^T$, and the corresponding eigenvalues are given as $\delta_i^2$.
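The relationship $\lambda_i = \delta_i^2/n$ is easy to verify numerically on any centered matrix. A sketch (data and names ours):

import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(150, 3))
D = D - D.mean(axis=0)                          # center, as the derivation assumes
n = D.shape[0]

Sigma = (D.T @ D) / n
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # eigenvalues, descending
delta = np.linalg.svd(D, compute_uv=False)      # singular values, descending

print(np.allclose(lam, delta ** 2 / n))         # Eq. (7.42): lambda_i = delta_i^2 / n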
Example 7.9. Let us consider the $n \times d$ centered Iris data matrix $\mathbf{D}$ from Example 7.1, with $n = 150$ and $d = 3$. In Example 7.5 we computed the eigenvectors and eigenvalues of the covariance matrix $\boldsymbol{\Sigma}$ as follows:
$$\lambda_1 = 3.662 \qquad \lambda_2 = 0.239 \qquad \lambda_3 = 0.059$$
$$\mathbf{u}_1 = \begin{pmatrix}-0.390\\0.089\\-0.916\end{pmatrix} \qquad \mathbf{u}_2 = \begin{pmatrix}-0.639\\-0.742\\0.200\end{pmatrix} \qquad \mathbf{u}_3 = \begin{pmatrix}-0.663\\0.664\\0.346\end{pmatrix}$$
Computing the SVD of $\mathbf{D}$ yields the following nonzero singular values and the corresponding right singular vectors:
$$\delta_1 = 23.437 \qquad \delta_2 = 5.992 \qquad \delta_3 = 2.974$$
$$\mathbf{r}_1 = \begin{pmatrix}-0.390\\0.089\\-0.916\end{pmatrix} \qquad \mathbf{r}_2 = \begin{pmatrix}0.639\\0.742\\-0.200\end{pmatrix} \qquad \mathbf{r}_3 = \begin{pmatrix}-0.663\\0.664\\0.346\end{pmatrix}$$
We do not show the left singular vectors $\mathbf{l}_1, \mathbf{l}_2, \mathbf{l}_3$ because they lie in $\mathbb{R}^{150}$. Using Eq. (7.42) one can verify that $\lambda_i = \frac{\delta_i^2}{n}$. For example,
$$\lambda_1 = \frac{\delta_1^2}{n} = \frac{23.437^2}{150} = \frac{549.29}{150} = 3.662$$
Notice also that the right singular vectors are equivalent to the principal components or eigenvectors of $\boldsymbol{\Sigma}$, up to isomorphism. That is, they may potentially be reversed in direction. For the Iris dataset, we have $\mathbf{r}_1 = \mathbf{u}_1$, $\mathbf{r}_2 = -\mathbf{u}_2$, and $\mathbf{r}_3 = \mathbf{u}_3$. Here the second right singular vector is reversed in sign when compared to the second principal component.
7.5 FURTHER READING
Principal component analysis was pioneered in Pearson (1901). For a comprehensive description of PCA see Jolliffe (2002). Kernel PCA was first introduced in Schölkopf, Smola, and Müller (1998). For further exploration of non-linear dimensionality reduction methods see Lee and Verleysen (2007). The requisite linear algebra background can be found in Strang (2006).

Jolliffe, I. (2002). Principal Component Analysis, 2nd ed. Springer Series in Statistics. New York: Springer Science+Business Media.
Lee, J. A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. New York: Springer Science+Business Media.
Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11): 559–572.
Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). "Nonlinear component analysis as a kernel eigenvalue problem." Neural Computation, 10(5): 1299–1319.
Strang, G. (2006). Linear Algebra and Its Applications, 4th ed. Independence, KY: Thomson Brooks/Cole, Cengage Learning.
7.6 EXERCISES
Q1. Consider the following data matrix $\mathbf{D}$:

X1    X2
 8   −20
 0    −1
10   −19
10   −20
 2     0

(a) Compute the mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ for $\mathbf{D}$.
(b) Compute the eigenvalues of $\boldsymbol{\Sigma}$.
(c) What is the "intrinsic" dimensionality of this dataset (discounting some small amount of variance)?
(d) Compute the first principal component.
(e) If the $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ from above characterize the normal distribution from which the points were generated, sketch the orientation/extent of the 2-dimensional normal density function.
Q2. Given the covariance matrix $\boldsymbol{\Sigma} = \begin{pmatrix}5 & 4\\4 & 5\end{pmatrix}$, answer the following questions:
(a) Compute the eigenvalues of $\boldsymbol{\Sigma}$ by solving the equation $\det(\boldsymbol{\Sigma} - \lambda\mathbf{I}) = 0$.
(b) Find the corresponding eigenvectors by solving the equation $\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i$.
Q3. Compute the singular values and the left and right singular vectors of the following matrix:
$$\mathbf{A} = \begin{pmatrix}1 & 1 & 0\\0 & 0 & 1\end{pmatrix}$$
Q4. Consider the data in Table 7.1. Define the kernel function as follows: $K(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|^2$. Answer the following questions:
(a) Compute the kernel matrix $\mathbf{K}$.
(b) Find the first kernel principal component.

Table 7.1. Dataset for Q4
i     x_i
x1    (4, 2.9)
x4    (2.5, 1)
x7    (3.5, 4)
x9    (2, 2.1)
Q5. Given the two points $\mathbf{x}_1 = (1, 2)^T$ and $\mathbf{x}_2 = (2, 1)^T$, use the kernel function
$$K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^T\mathbf{x}_j\right)^2$$
to find the kernel principal component, by solving the equation $\mathbf{K}\mathbf{c} = \eta_1\mathbf{c}$.
PART TWO
FREQUENT PATTERN MINING
CHAPTER 8
Itemset Mining
In many applications one is interested in how often two or more objects of interest co-occur. For example, consider a popular website, which logs all incoming traffic to its site in the form of weblogs. Weblogs typically record the source and destination pages requested by some user, as well as the time, the return code indicating whether the request was successful, and so on. Given such weblogs, one might be interested in finding if there are sets of web pages that many users tend to browse whenever they visit the website. Such "frequent" sets of web pages give clues to user browsing behavior and can be used for improving the browsing experience.
The quest to mine frequent patterns appears in many other domains. The
prototypical application is
market basket analysis
, that is, to mine the sets of items that
are frequently bought together at a supermarket by analyzing the customer shopping
carts (the so-called “market baskets”). Once we mine the frequent sets, they allow us
to extract
association rules
among the item sets, where we make some statement about
how likely are two sets of items to co-occur or to conditionally occur. For example,
in the weblog scenario frequent sets allow us to extract rules like, “Users who visit
the sets of pages
main
,
laptops
and
rebates
also visit the pages
shopping-cart
and
checkout
”, indicating, perhaps, that the special rebate offer is resulting in more
laptop sales. In the case of market baskets, we can find rules such as “Customers
who buy milk and cereal also tend to buy bananas,” which may prompt a grocery
store to co-locate bananas in the cereal aisle. We begin this chapter with algorithms
to mine frequent itemsets, and then show how they can be used to extract association
rules.
8.1 FREQUENT ITEMSETS AND ASSOCIATION RULES
Itemsets and Tidsets
Let $\mathcal{I} = \{x_1, x_2, \ldots, x_m\}$ be a set of elements called items. A set $X \subseteq \mathcal{I}$ is called an itemset. The set of items $\mathcal{I}$ may denote, for example, the collection of all products sold at a supermarket, the set of all web pages at a website, and so on. An itemset of cardinality (or size) $k$ is called a $k$-itemset. Further, we denote by $\mathcal{I}^{(k)}$ the set of all $k$-itemsets, that is, subsets of $\mathcal{I}$ with size $k$. Let $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$ be another set of elements called transaction identifiers or tids. A set $T \subseteq \mathcal{T}$ is called a tidset. We assume that itemsets and tidsets are kept sorted in lexicographic order.
A transaction is a tuple of the form $\langle t, X\rangle$, where $t \in \mathcal{T}$ is a unique transaction identifier, and $X$ is an itemset. The set of transactions $\mathcal{T}$ may denote the set of all customers at a supermarket, the set of all the visitors to a website, and so on. For convenience, we refer to a transaction $\langle t, X\rangle$ by its identifier $t$.
Database Representation
A binary database $\mathbf{D}$ is a binary relation on the set of tids and items, that is, $\mathbf{D} \subseteq \mathcal{T} \times \mathcal{I}$. We say that tid $t \in \mathcal{T}$ contains item $x \in \mathcal{I}$ iff $(t, x) \in \mathbf{D}$. In other words, $(t, x) \in \mathbf{D}$ iff $x \in X$ in the tuple $\langle t, X\rangle$. We say that tid $t$ contains itemset $X = \{x_1, x_2, \ldots, x_k\}$ iff $(t, x_i) \in \mathbf{D}$ for all $i = 1, 2, \ldots, k$.
Example 8.1. Figure 8.1a shows an example binary database. Here $\mathcal{I} = \{A, B, C, D, E\}$ and $\mathcal{T} = \{1, 2, 3, 4, 5, 6\}$. In the binary database, the cell in row $t$ and column $x$ is 1 iff $(t, x) \in \mathbf{D}$, and 0 otherwise. We can see that transaction 1 contains item $B$, and it also contains the itemset $BE$, and so on.
For a set $X$, we denote by $2^X$ the powerset of $X$, that is, the set of all subsets of $X$. Let $\mathbf{i}: 2^{\mathcal{T}} \to 2^{\mathcal{I}}$ be a function, defined as follows:
$$\mathbf{i}(T) = \{x \mid \forall t \in T,\ t \text{ contains } x\} \quad (8.1)$$
where $T \subseteq \mathcal{T}$, and $\mathbf{i}(T)$ is the set of items that are common to all the transactions in the tidset $T$. In particular, $\mathbf{i}(t)$ is the set of items contained in tid $t \in \mathcal{T}$. Note that in this chapter we drop the set notation for convenience (e.g., we write $\mathbf{i}(t)$ instead of $\mathbf{i}(\{t\})$).
It is sometimes convenient to consider the binary database $\mathbf{D}$ as a transaction database consisting of tuples of the form $\langle t, \mathbf{i}(t)\rangle$, with $t \in \mathcal{T}$. The transaction or itemset database can be considered as a horizontal representation of the binary database, where we omit items that are not contained in a given tid.
Let $\mathbf{t}: 2^{\mathcal{I}} \to 2^{\mathcal{T}}$ be a function, defined as follows:
$$\mathbf{t}(X) = \{t \mid t \in \mathcal{T} \text{ and } t \text{ contains } X\} \quad (8.2)$$
where $X \subseteq \mathcal{I}$, and $\mathbf{t}(X)$ is the set of tids that contain all the items in the itemset $X$. In particular, $\mathbf{t}(x)$ is the set of tids that contain the single item $x \in \mathcal{I}$. It is also sometimes convenient to think of the binary database $\mathbf{D}$ as a tidset database containing a collection of tuples of the form $\langle x, \mathbf{t}(x)\rangle$, with $x \in \mathcal{I}$. The tidset database is a vertical representation of the binary database, where we omit tids that do not contain a given item.
Example 8.2. Figure 8.1b shows the corresponding transaction database for the binary database in Figure 8.1a. For instance, the first transaction is $\langle 1, \{A, B, D, E\}\rangle$, where we omit item $C$ since $(1, C) \notin \mathbf{D}$. Henceforth, for convenience, we drop the set notation for itemsets and tidsets if there is no confusion. Thus, we write $\langle 1, \{A, B, D, E\}\rangle$ as $\langle 1, ABDE\rangle$.
Table 8.1. Frequent itemsets with minsup = 3

sup   itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
Association Rules
An association rule is an expression $X \xrightarrow{s,c} Y$, where $X$ and $Y$ are itemsets and they are disjoint, that is, $X, Y \subseteq \mathcal{I}$ and $X \cap Y = \emptyset$. Let the itemset $X \cup Y$ be denoted as $XY$. The support of the rule is the number of transactions in which both $X$ and $Y$ co-occur as subsets:
$$s = \mathrm{sup}(X \longrightarrow Y) = |\mathbf{t}(XY)| = \mathrm{sup}(XY)$$
The relative support of the rule is defined as the fraction of transactions where $X$ and $Y$ co-occur, and it provides an estimate of the joint probability of $X$ and $Y$:
$$\mathrm{rsup}(X \longrightarrow Y) = \frac{\mathrm{sup}(XY)}{|\mathbf{D}|} = P(X \wedge Y)$$
The confidence of a rule is the conditional probability that a transaction contains $Y$ given that it contains $X$:
$$c = \mathrm{conf}(X \longrightarrow Y) = P(Y \mid X) = \frac{P(X \wedge Y)}{P(X)} = \frac{\mathrm{sup}(XY)}{\mathrm{sup}(X)}$$
A rule is frequent if the itemset $XY$ is frequent, that is, $\mathrm{sup}(XY) \geq minsup$, and a rule is strong if $\mathrm{conf} \geq minconf$, where $minconf$ is a user-specified minimum confidence threshold.
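These definitions translate directly into code. The sketch below reconstructs the example database of Figure 8.1 from the tidsets given later in Example 8.9 and computes rule support and confidence (helper names ours):

# Transactions of the running example (tids 1..6), as Python sets
db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
      4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def support(itemset, db):
    """sup(X): number of transactions that contain all items of X."""
    return sum(1 for items in db.values() if itemset <= items)

def confidence(X, Y, db):
    """conf(X -> Y) = sup(XY) / sup(X)."""
    return support(X | Y, db) / support(X, db)

print(support(set("BCE"), db))               # 3
print(confidence(set("BC"), set("E"), db))   # 0.75, as in Example 8.4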
Example 8.4. Consider the association rule $BC \longrightarrow E$. Using the itemset support values shown in Table 8.1, the support and confidence of the rule are as follows:
$$s = \mathrm{sup}(BC \longrightarrow E) = \mathrm{sup}(BCE) = 3$$
$$c = \mathrm{conf}(BC \longrightarrow E) = \frac{\mathrm{sup}(BCE)}{\mathrm{sup}(BC)} = 3/4 = 0.75$$
Itemset and Rule Mining
From the definition of rule support and confidence, we can observe that to generate frequent and high confidence association rules, we need to first enumerate all the frequent itemsets along with their support values. Formally, given a binary database $\mathbf{D}$ and a user-defined minimum support threshold $minsup$, the task of frequent itemset mining is to enumerate all itemsets that are frequent, i.e., those that have support at least $minsup$. Next, given the set of frequent itemsets $\mathcal{F}$ and a minimum confidence value $minconf$, the association rule mining task is to find all frequent and strong rules.
8.2 ITEMSET MINING ALGORITHMS
We begin by describing a naive or brute-force algorithm that enumerates all the possible itemsets $X \subseteq \mathcal{I}$, and for each such subset determines its support in the input dataset $\mathbf{D}$. The method comprises two main steps: (1) candidate generation and (2) support computation.
Candidate Generation
This step generates all the subsets of $\mathcal{I}$, which are called candidates, as each itemset is potentially a candidate frequent pattern. The candidate itemset search space is clearly exponential because there are $2^{|\mathcal{I}|}$ potentially frequent itemsets. It is also instructive to note the structure of the itemset search space; the set of all itemsets forms a lattice structure where any two itemsets $X$ and $Y$ are connected by a link iff $X$ is an immediate subset of $Y$, that is, $X \subseteq Y$ and $|X| = |Y| - 1$. In terms of a practical search strategy, the itemsets in the lattice can be enumerated using either a breadth-first (BFS) or depth-first (DFS) search on the prefix tree, where two itemsets $X, Y$ are connected by a link iff $X$ is an immediate subset and prefix of $Y$. This allows one to enumerate itemsets starting with an empty set, and adding one more item at a time.
Support Computation
This step computes the support of each candidate pattern $X$ and determines if it is frequent. For each transaction $\langle t, \mathbf{i}(t)\rangle$ in the database, we determine if $X$ is a subset of $\mathbf{i}(t)$. If so, we increment the support of $X$.
The pseudo-code for the brute-force method is shown in Algorithm 8.1. It enumerates each itemset $X \subseteq \mathcal{I}$, and then computes its support by checking if $X \subseteq \mathbf{i}(t)$ for each $t \in \mathcal{T}$.
ALGORITHM 8.1. Algorithm BRUTEFORCE

BRUTEFORCE($\mathbf{D}$, $\mathcal{I}$, $minsup$):
1. $\mathcal{F} \leftarrow \emptyset$ // set of frequent itemsets
2. foreach $X \subseteq \mathcal{I}$ do
3.   $\mathrm{sup}(X) \leftarrow$ COMPUTESUPPORT($X$, $\mathbf{D}$)
4.   if $\mathrm{sup}(X) \geq minsup$ then
5.     $\mathcal{F} \leftarrow \mathcal{F} \cup \{(X, \mathrm{sup}(X))\}$
6. return $\mathcal{F}$

COMPUTESUPPORT($X$, $\mathbf{D}$):
7. $\mathrm{sup}(X) \leftarrow 0$
8. foreach $\langle t, \mathbf{i}(t)\rangle \in \mathbf{D}$ do
9.   if $X \subseteq \mathbf{i}(t)$ then
10.    $\mathrm{sup}(X) \leftarrow \mathrm{sup}(X) + 1$
11. return $\mathrm{sup}(X)$
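A direct Python rendering of Algorithm 8.1 over the same example database (a sketch; names ours):

from itertools import combinations

def brute_force(db, items, minsup):
    """Algorithm 8.1: enumerate every X subseteq I and count its support."""
    F = {}
    for k in range(1, len(items) + 1):
        for X in combinations(sorted(items), k):
            sup = sum(1 for t_items in db.values() if set(X) <= t_items)
            if sup >= minsup:
                F[frozenset(X)] = sup
    return F

db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
      4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}
F = brute_force(db, items="ABCDE", minsup=3)
print(len(F))                      # 19 frequent itemsets, as in Table 8.1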
Figure 8.2. Itemset lattice and prefix-based search tree (in bold).
Example 8.5. Figure 8.2 shows the itemset lattice for the set of items $\mathcal{I} = \{A, B, C, D, E\}$. There are $2^{|\mathcal{I}|} = 2^5 = 32$ possible itemsets including the empty set. The corresponding prefix search tree is also shown (in bold). The brute-force method explores the entire itemset search space, regardless of the $minsup$ threshold employed. If $minsup = 3$, then the brute-force method would output the set of frequent itemsets shown in Table 8.1.
Computational Complexity
Support computation takes time $O(|\mathcal{I}| \cdot |\mathbf{D}|)$ in the worst case, and because there are $O(2^{|\mathcal{I}|})$ possible candidates, the computational complexity of the brute-force method is $O(|\mathcal{I}| \cdot |\mathbf{D}| \cdot 2^{|\mathcal{I}|})$. Because the database $\mathbf{D}$ can be very large, it is also important to measure the input/output (I/O) complexity. Because we make one complete database scan to compute the support of each candidate, the I/O complexity of BRUTEFORCE is $O(2^{|\mathcal{I}|})$ database scans. Thus, the brute-force approach is computationally infeasible for even small itemset spaces, whereas in practice $\mathcal{I}$ can be very large (e.g., a supermarket carries thousands of items). The approach is impractical from an I/O perspective as well.
We shall see next how to systematically improve on the brute force approach, by
improving both the candidate generation and support counting steps.
8.2.1 Level-wise Approach: Apriori Algorithm
The brute-force approach enumerates all possible itemsets in its quest to determine the frequent ones. This results in a lot of wasteful computation because many of the candidates may not be frequent. Let $X, Y \subseteq \mathcal{I}$ be any two itemsets. Note that if $X \subseteq Y$, then $\mathrm{sup}(X) \geq \mathrm{sup}(Y)$, which leads to the following two observations: (1) if $X$ is frequent, then any subset $Y \subseteq X$ is also frequent, and (2) if $X$ is not frequent, then any superset $Y \supseteq X$ cannot be frequent. The Apriori algorithm utilizes these two properties to significantly improve the brute-force approach. It employs a level-wise or breadth-first exploration of the itemset search space, and prunes all supersets of any infrequent candidate, as no superset of an infrequent itemset can be frequent. It also avoids generating any candidate that has an infrequent subset. In addition to improving the candidate generation step via itemset pruning, the Apriori method also significantly improves the I/O complexity. Instead of counting the support for a single itemset, it explores the prefix tree in a breadth-first manner, and computes the support of all the valid candidates of size $k$ that comprise level $k$ in the prefix tree.
Example 8.6. Consider the example dataset in Figure 8.1; let $minsup = 3$. Figure 8.3 shows the itemset search space for the Apriori method, organized as a prefix tree where two itemsets are connected if one is a prefix and immediate subset of the other. Each node shows an itemset along with its support; thus $AC(2)$ indicates that $\mathrm{sup}(AC) = 2$. Apriori enumerates the candidate patterns in a level-wise manner, as shown in the figure, which also demonstrates the power of pruning the search space via the two Apriori properties. For example, once we determine that $AC$ is infrequent, we can prune any itemset that has $AC$ as a prefix, that is, the entire subtree under $AC$ can be pruned. Likewise for $CD$. Also, the extension $BCD$ from $BC$ can be pruned, since it has an infrequent subset, namely $CD$.
Algorithm 8.2 shows the pseudo-code for the Apriori method. Let $\mathcal{C}^{(k)}$ denote the prefix tree comprising all the candidate $k$-itemsets. The method begins by inserting the single items into an initially empty prefix tree to populate $\mathcal{C}^{(1)}$. The while loop (lines 5–11) first computes the support for the current set of candidates at level $k$ via the COMPUTESUPPORT procedure that generates $k$-subsets of each transaction in the database $\mathbf{D}$, and for each such subset it increments the support of the corresponding candidate in $\mathcal{C}^{(k)}$ if it exists. This way, the database is scanned only once per level, and the supports for all candidate $k$-itemsets are incremented during that scan. Next, we remove any infrequent candidate (line 9). The leaves of the prefix tree that survive comprise the set of frequent $k$-itemsets $\mathcal{F}^{(k)}$, which are used to generate the candidate $(k+1)$-itemsets for the next level (line 10). The EXTENDPREFIXTREE procedure employs prefix-based extension for candidate generation. Given two frequent $k$-itemsets $X_a$ and $X_b$ with a common $k-1$ length prefix, that is, given two sibling leaf nodes with a common parent, we generate the $(k+1)$-length candidate $X_{ab} = X_a \cup X_b$. This candidate is retained only if it has no infrequent subset. Finally, if a $k$-itemset $X_a$ has no extension, it is pruned from the prefix tree, and we recursively
prune any of its ancestors with no $k$-itemset extension, so that in $\mathcal{C}^{(k)}$ all leaves are at level $k$. If new candidates were added, the whole process is repeated for the next level. This process continues until no new candidates are added.

Figure 8.3. Apriori: prefix search tree and effect of pruning. Shaded nodes indicate infrequent itemsets, whereas dashed nodes and lines indicate all of the pruned nodes and branches. Solid lines indicate frequent itemsets.
Example 8.7. Figure 8.4 illustrates the Apriori algorithm on the example dataset from Figure 8.1 using $minsup = 3$. All the candidates $\mathcal{C}^{(1)}$ are frequent (see Figure 8.4a). During extension all the pairwise combinations will be considered, since they all share the empty prefix $\emptyset$ as their parent. These comprise the new prefix tree $\mathcal{C}^{(2)}$ in Figure 8.4b; because $E$ has no prefix-based extensions, it is removed from the tree. After support computation $AC(2)$ and $CD(2)$ are eliminated (shown in gray) since they are infrequent. The next level prefix tree is shown in Figure 8.4c. The candidate $BCD$ is pruned due to the presence of the infrequent subset $CD$. All of the candidates at level 3 are frequent. Finally, $\mathcal{C}^{(4)}$ (shown in Figure 8.4d) has only one candidate, $X_{ab} = ABDE$, which is generated from $X_a = ABD$ and $X_b = ABE$ because this is the only pair of siblings. The mining process stops after this step, since no more extensions are possible.
The worst-case computational complexity of the Apriori algorithm is still $O(|\mathcal{I}| \cdot |\mathbf{D}| \cdot 2^{|\mathcal{I}|})$, as all itemsets may be frequent. In practice, due to the pruning of the search space, the cost is much lower. However, in terms of I/O cost Apriori requires $O(|\mathcal{I}|)$ database scans, as opposed to the $O(2^{|\mathcal{I}|})$ scans in the brute-force method. In practice, it requires only $l$ database scans, where $l$ is the length of the longest frequent itemset.
ALGORITHM 8.2. Algorithm APRIORI

APRIORI($\mathbf{D}$, $\mathcal{I}$, $minsup$):
1. $\mathcal{F} \leftarrow \emptyset$
2. $\mathcal{C}^{(1)} \leftarrow \{\emptyset\}$ // initial prefix tree with single items
3. foreach $i \in \mathcal{I}$ do add $i$ as child of $\emptyset$ in $\mathcal{C}^{(1)}$ with $\mathrm{sup}(i) \leftarrow 0$
4. $k \leftarrow 1$ // $k$ denotes the level
5. while $\mathcal{C}^{(k)} \neq \emptyset$ do
6.   COMPUTESUPPORT($\mathcal{C}^{(k)}$, $\mathbf{D}$)
7.   foreach leaf $X \in \mathcal{C}^{(k)}$ do
8.     if $\mathrm{sup}(X) \geq minsup$ then $\mathcal{F} \leftarrow \mathcal{F} \cup \{(X, \mathrm{sup}(X))\}$
9.     else remove $X$ from $\mathcal{C}^{(k)}$
10.  $\mathcal{C}^{(k+1)} \leftarrow$ EXTENDPREFIXTREE($\mathcal{C}^{(k)}$)
11.  $k \leftarrow k + 1$
12. return $\mathcal{F}$

COMPUTESUPPORT($\mathcal{C}^{(k)}$, $\mathbf{D}$):
13. foreach $\langle t, \mathbf{i}(t)\rangle \in \mathbf{D}$ do
14.   foreach $k$-subset $X \subseteq \mathbf{i}(t)$ do
15.     if $X \in \mathcal{C}^{(k)}$ then $\mathrm{sup}(X) \leftarrow \mathrm{sup}(X) + 1$

EXTENDPREFIXTREE($\mathcal{C}^{(k)}$):
16. foreach leaf $X_a \in \mathcal{C}^{(k)}$ do
17.   foreach leaf $X_b \in$ SIBLING($X_a$), such that $b > a$ do
18.     $X_{ab} \leftarrow X_a \cup X_b$
19.     if $X_j \in \mathcal{C}^{(k)}$, for all $X_j \subset X_{ab}$, such that $|X_j| = |X_{ab}| - 1$ then // prune candidate if there are any infrequent subsets
20.       add $X_{ab}$ as child of $X_a$ with $\mathrm{sup}(X_{ab}) \leftarrow 0$
21.   if no extensions from $X_a$ then
22.     remove $X_a$, and all ancestors of $X_a$ with no extensions, from $\mathcal{C}^{(k)}$
23. return $\mathcal{C}^{(k)}$
8.2.2 Tidset Intersection Approach: Eclat Algorithm
The support counting step can be improved significantly if we can index the database in such a way that it allows fast frequency computations. Notice that in the level-wise approach, to count the support, we have to generate subsets of each transaction and check whether they exist in the prefix tree. This can be expensive because we may end up generating many subsets that do not exist in the prefix tree.
Figure 8.4. Itemset mining: Apriori algorithm. The prefix search trees $\mathcal{C}^{(k)}$ at each level are shown. Leaves (unshaded) comprise the set of frequent $k$-itemsets $\mathcal{F}^{(k)}$.
The Eclat algorithm leverages the tidsets directly for support computation. The basic idea is that the support of a candidate itemset can be computed by intersecting the tidsets of suitably chosen subsets. In general, given t(X) and t(Y) for any two frequent itemsets X and Y, we have

    t(XY) = t(X) ∩ t(Y)

The support of candidate XY is simply the cardinality of t(XY), that is, sup(XY) = |t(XY)|. Eclat intersects the tidsets only if the frequent itemsets share a common prefix, and it traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class.
Example 8.8. For example, if we know that the tidsets for items A and C are t(A) = 1345 and t(C) = 2456, respectively, then we can determine the support of AC by intersecting the two tidsets, to obtain t(AC) = t(A) ∩ t(C) = 1345 ∩ 2456 = 45.
ALGORITHM 8.3. Algorithm ECLAT

// Initial Call: F ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, |t(i)| ≥ minsup}
ECLAT (P, minsup, F):
 1  foreach ⟨X_a, t(X_a)⟩ ∈ P do
 2      F ← F ∪ {(X_a, sup(X_a))}
 3      P_a ← ∅
 4      foreach ⟨X_b, t(X_b)⟩ ∈ P, with X_b > X_a do
 5          X_ab = X_a ∪ X_b
 6          t(X_ab) = t(X_a) ∩ t(X_b)
 7          if sup(X_ab) ≥ minsup then
 8              P_a ← P_a ∪ {⟨X_ab, t(X_ab)⟩}
 9      if P_a ≠ ∅ then ECLAT (P_a, minsup, F)
In this case, we have sup(AC) = |45| = 2. An example of a prefix equivalence class is the set P_A = {AB, AC, AD, AE}, as all the elements of P_A share A as the prefix.
The pseudo-code for Eclat is given in Algorithm 8.3. It employs a vertical representation of the binary database D. Thus, the input is the set of tuples ⟨i, t(i)⟩ for all frequent items i ∈ I, which comprise an equivalence class P (they all share the empty prefix); it is assumed that P contains only frequent itemsets. In general, given a prefix equivalence class P, for each frequent itemset X_a ∈ P, we try to intersect its tidset with the tidsets of all other itemsets X_b ∈ P. The candidate pattern is X_ab = X_a ∪ X_b, and we check the cardinality of the intersection t(X_a) ∩ t(X_b) to determine whether it is frequent. If so, X_ab is added to the new equivalence class P_a that contains all itemsets that share X_a as a prefix. A recursive call to Eclat then finds all extensions of the X_a branch in the search tree. This process continues until no extensions are possible over all branches.
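The recursion translates almost line for line into Python. The sketch below assumes itemsets are strings whose last character is the distinguishing item, and that the initial class contains only frequent items; the names and dataset are illustrative.

    def eclat(P, minsup, F):
        # P: list of (itemset, tidset) pairs sharing a common prefix
        for i, (Xa, tXa) in enumerate(P):
            F[Xa] = len(tXa)                        # every element of P is frequent
            Pa = []
            for Xb, tXb in P[i + 1:]:               # itemsets after X_a in P
                tXab = tXa & tXb                    # t(X_ab) = t(X_a) ∩ t(X_b)
                if len(tXab) >= minsup:
                    Pa.append((Xa + Xb[-1], tXab))  # extend X_a by X_b's last item
            if Pa:
                eclat(Pa, minsup, F)                # recurse on the X_a class

    # vertical view of the example database in Figure 8.1 (tids 1..6)
    vdb = {"A": {1, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {2, 4, 5, 6},
           "D": {1, 3, 5, 6}, "E": {1, 2, 3, 4, 5}}
    F = {}
    eclat(sorted(vdb.items()), 3, F)
    print(F)   # e.g. F["ABD"] == 3 and F["ABDE"] == 3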
Example 8.9. Figure 8.5 illustrates the Eclat algorithm. Here minsup = 3, and the initial prefix equivalence class is

    P_∅ = {⟨A, 1345⟩, ⟨B, 123456⟩, ⟨C, 2456⟩, ⟨D, 1356⟩, ⟨E, 12345⟩}

Eclat intersects t(A) with each of t(B), t(C), t(D), and t(E) to obtain the tidsets for AB, AC, AD, and AE, respectively. Of these AC is infrequent and is pruned (marked gray). The frequent itemsets and their tidsets comprise the new prefix equivalence class

    P_A = {⟨AB, 1345⟩, ⟨AD, 135⟩, ⟨AE, 1345⟩}

which is recursively processed. On return, Eclat intersects t(B) with t(C), t(D), and t(E) to obtain the equivalence class

    P_B = {⟨BC, 2456⟩, ⟨BD, 1356⟩, ⟨BE, 12345⟩}
Figure 8.5. Eclat algorithm: tidlist intersections (gray boxes indicate infrequent itemsets). (Tidsets in the search tree: A 1345, AB 1345, ABD 135, ABDE 135, ABE 1345, AC 45, AD 135, ADE 135, AE 1345, B 123456, BC 2456, BCD 56, BCE 245, BD 1356, BDE 135, BE 12345, C 2456, CD 56, CE 245, D 1356, DE 135, E 12345.)
Other branches are processed in a similar manner; the entire search space that Eclat
explores is shown in Figure 8.5. The gray nodes indicate infrequent itemsets, whereas
the rest constitute the set of frequent itemsets.
The computational complexity of Eclat is O(|D| · 2^|I|) in the worst case, since there can be 2^|I| frequent itemsets, and an intersection of two tidsets takes at most O(|D|) time. The I/O complexity of Eclat is harder to characterize, as it depends on the size of the intermediate tidsets. With t as the average tidset size, the initial database size is O(t · |I|), and the total size of all the intermediate tidsets is O(t · 2^|I|). Thus, Eclat requires (t · 2^|I|)/(t · |I|) = O(2^|I|/|I|) database scans in the worst case.
Diffsets: Difference of Tidsets

The Eclat algorithm can be significantly improved if we can shrink the size of the intermediate tidsets. This can be achieved by keeping track of the differences in the tidsets as opposed to the full tidsets. Formally, let X_k = {x_1, x_2, ..., x_{k−1}, x_k} be a k-itemset. Define the diffset of X_k as the set of tids that contain the prefix X_{k−1} = {x_1, ..., x_{k−1}} but do not contain the item x_k, given as

    d(X_k) = t(X_{k−1}) \ t(X_k)
Consider two k-itemsets X_a = {x_1, ..., x_{k−1}, x_a} and X_b = {x_1, ..., x_{k−1}, x_b} that share the common (k−1)-itemset X = {x_1, x_2, ..., x_{k−1}} as a prefix. The diffset of X_ab = X_a ∪ X_b = {x_1, ..., x_{k−1}, x_a, x_b} is given as

    d(X_ab) = t(X_a) \ t(X_ab) = t(X_a) \ t(X_b)    (8.3)
However, note that

    t(X_a) \ t(X_b) = t(X_a) ∩ t̄(X_b)

where t̄(X_b) = T \ t(X_b) denotes the complement of a tidset over the universe of tids T. Taking the union of the above with the empty set t(X) ∩ t̄(X), we can obtain an expression for d(X_ab) in terms of d(X_a) and d(X_b) as follows:

    d(X_ab) = t(X_a) \ t(X_b)
            = t(X_a) ∩ t̄(X_b)
            = (t(X_a) ∩ t̄(X_b)) ∪ (t(X) ∩ t̄(X))
            = (t(X_a) ∪ t(X)) ∩ (t(X_a) ∪ t̄(X)) ∩ (t̄(X_b) ∪ t(X)) ∩ (t̄(X_b) ∪ t̄(X))
            = (t(X) ∩ t̄(X_b)) ∩ (t(X) ∩ t̄(X_a))^c ∩ T
            = d(X_b) \ d(X_a)

Here the fourth step distributes the union over the intersections, and the fifth uses t(X_a) ⊆ t(X) and t(X_b) ⊆ t(X): this gives t(X_a) ∪ t(X) = t(X), t̄(X_b) ∪ t(X) = T, and t̄(X_b) ∪ t̄(X) = t̄(X_b), whereas t(X_a) ∪ t̄(X) is the complement of t(X) ∩ t̄(X_a) = d(X_a).
Thus, the diffset of X_ab can be obtained from the diffsets of its subsets X_a and X_b, which means that we can replace all intersection operations in Eclat with diffset operations. Using diffsets the support of a candidate itemset can be obtained by subtracting the diffset size from the support of the prefix itemset:

    sup(X_ab) = sup(X_a) − |d(X_ab)|

which follows directly from Eq. (8.3).
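As a quick check of these identities in Python, using tidsets from the running example (the variable names are illustrative):

    # t(A) = 1345, t(AB) = 1345, t(AC) = 45 in the example database
    tA, tAB, tAC = {1, 3, 4, 5}, {1, 3, 4, 5}, {4, 5}
    dAB, dAC = tA - tAB, tA - tAC        # diffsets relative to the prefix A
    dABC = dAC - dAB                     # d(X_ab) = d(X_b) \ d(X_a)
    print(dABC, len(tAB) - len(dABC))    # {1, 3} 2, so sup(ABC) = 4 − 2 = 2

The direct intersection t(AB) ∩ t(AC) = 45 confirms sup(ABC) = 2.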
The variant of Eclat that uses the diffset optimization is called dEclat, whose pseudo-code is shown in Algorithm 8.4. The input comprises all the frequent single items i ∈ I along with their diffsets, which are computed as

    d(i) = t(∅) \ t(i) = T \ t(i)

Given an equivalence class P, for each pair of distinct itemsets X_a and X_b we generate the candidate pattern X_ab = X_a ∪ X_b and check whether it is frequent via the use of diffsets (lines 6–7). Recursive calls are made to find further extensions. It is important to note that the switch from tidsets to diffsets can be made during any recursive call to the method.
ALGORITHM 8.4. Algorithm DECLAT

// Initial Call: F ← ∅, P ← {⟨i, d(i), sup(i)⟩ | i ∈ I, d(i) = T \ t(i), sup(i) ≥ minsup}
DECLAT (P, minsup, F):
 1  foreach ⟨X_a, d(X_a), sup(X_a)⟩ ∈ P do
 2      F ← F ∪ {(X_a, sup(X_a))}
 3      P_a ← ∅
 4      foreach ⟨X_b, d(X_b), sup(X_b)⟩ ∈ P, with X_b > X_a do
 5          X_ab = X_a ∪ X_b
 6          d(X_ab) = d(X_b) \ d(X_a)
 7          sup(X_ab) = sup(X_a) − |d(X_ab)|
 8          if sup(X_ab) ≥ minsup then
 9              P_a ← P_a ∪ {⟨X_ab, d(X_ab), sup(X_ab)⟩}
10      if P_a ≠ ∅ then DECLAT (P_a, minsup, F)
and their support values are

    sup(AB) = sup(A) − |d(AB)| = 4 − 0 = 4
    sup(AC) = sup(A) − |d(AC)| = 4 − 2 = 2

Whereas AB is frequent, we can prune AC because it is not frequent. The frequent itemsets and their diffsets and support values comprise the new prefix equivalence class:

    P_A = {⟨AB, ∅, 4⟩, ⟨AD, 4, 3⟩, ⟨AE, ∅, 4⟩}

which is recursively processed. Other branches are processed in a similar manner. The entire search space for dEclat is shown in Figure 8.6. The support of an itemset is shown within brackets. For example, A has support 4 and diffset d(A) = 26.
8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
The FPGrowth method indexes the database for fast support computation via the use of an augmented prefix tree called the frequent pattern tree (FP-tree). Each node in the tree is labeled with a single item, and each child node represents a different item. Each node also stores the support information for the itemset comprising the items on the path from the root to that node. The FP-tree is constructed as follows. Initially the tree contains as root the null item ∅. Next, for each tuple ⟨t, X⟩ ∈ D, where X = i(t), we insert the itemset X into the FP-tree, incrementing the count of all nodes along the path that represents X. If X shares a prefix with some previously inserted transaction, then X will follow the same path until the common prefix. For the remaining items in X, new nodes are created under the common prefix, with counts initialized to 1. The FP-tree is complete when all transactions have been inserted.

The FP-tree can be considered as a prefix compressed representation of D. Because we want the tree to be as compact as possible, we want the most frequent items to be at the top of the tree. FPGrowth therefore reorders the items in decreasing order of support: from the initial database, it first computes the support of all single items i ∈ I. Next, it discards the infrequent items, and sorts the frequent items by decreasing support. Finally, each tuple ⟨t, X⟩ ∈ D is inserted into the FP-tree after reordering X by decreasing item support.
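The two-pass construction is easy to express in Python; the node layout below (a dictionary of children per node) is an illustrative choice, not the book's implementation.

    from collections import Counter

    class Node:
        def __init__(self, item, parent=None):
            self.item, self.parent, self.cnt, self.children = item, parent, 0, {}

    def build_fptree(db, minsup):
        # pass 1: item supports; frequent items sorted by decreasing support
        sup = Counter(x for t in db for x in t)
        order = sorted((x for x in sup if sup[x] >= minsup),
                       key=lambda x: (-sup[x], x))
        rank = {x: k for k, x in enumerate(order)}
        root = Node(None)
        # pass 2: insert each transaction, reordered by decreasing item support
        for t in db:
            node = root
            for x in sorted((x for x in t if x in rank), key=rank.get):
                node = node.children.setdefault(x, Node(x, node))
                node.cnt += 1                  # increment counts along the path
        return root, order

    db = [set("ABDE"), set("BCE"), set("ABDE"),
          set("ABCE"), set("ABCDE"), set("BCD")]
    root, order = build_fptree(db, minsup=3)
    print(order)   # ['B', 'E', 'A', 'C', 'D'], matching Example 8.11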
Example 8.11. Consider the example database in Figure 8.1. We add each transaction one by one into the FP-tree, and keep track of the count at each node. For our example database the sorted item order is {B(6), E(5), A(4), C(4), D(4)}. Next, each transaction is reordered in this same order; for example, ⟨1, ABDE⟩ becomes ⟨1, BEAD⟩. Figure 8.7 illustrates step-by-step FP-tree construction as each sorted transaction is added to it. The final FP-tree for the database is shown in Figure 8.7f.
Once the FP-tree has been constructed, it serves as an index in lieu of the original database. All frequent itemsets can be mined from the tree directly via the FPGROWTH method, whose pseudo-code is shown in Algorithm 8.5. The method accepts as input a FP-tree R constructed from the input database D, and the current itemset prefix P, which is initially empty.
Figure 8.7. Frequent pattern tree: bold edges indicate current transaction. Panels (a)–(f) show the tree after inserting the sorted transactions ⟨1, BEAD⟩, ⟨2, BEC⟩, ⟨3, BEAD⟩, ⟨4, BEAC⟩, ⟨5, BEACD⟩, and ⟨6, BCD⟩.
Given a FP-tree R, projected FP-trees are built for each frequent item i in R in increasing order of support. To project R on item i, we find all the occurrences of i in the tree, and for each occurrence, we determine the corresponding path from the root to i (line 13). The count of item i on a given path is recorded in cnt(i) (line 14), and the path is inserted into the new projected tree R_X, where X is the itemset obtained by extending the prefix P with the item i. While inserting the path, the count of each node in R_X along the given path is incremented by the path count cnt(i). We omit the item i from the path, as it is now part of the prefix. The resulting FP-tree is a projection of the itemset X that comprises the current prefix extended with item i (line 9). We then call FPGROWTH recursively with projected FP-tree R_X and the new prefix itemset X as the parameters (line 16). The base case for the recursion happens when the input FP-tree R is a single path. FP-trees that are paths are handled by enumerating all itemsets that are subsets of the path, with the support of each such itemset being given by the least frequent item in it (lines 2–6).
ALGORITHM 8.5. Algorithm FPGROWTH

// Initial Call: R ← FP-tree(D), P ← ∅, F ← ∅
FPGROWTH (R, P, F, minsup):
 1  remove infrequent items from R
 2  if ISPATH(R) then // insert subsets of R into F
 3      foreach Y ⊆ R do
 4          X ← P ∪ Y
 5          sup(X) ← min_{x∈Y} {cnt(x)}
 6          F ← F ∪ {(X, sup(X))}
 7  else // process projected FP-trees for each frequent item i
 8      foreach i ∈ R in increasing order of sup(i) do
 9          X ← P ∪ {i}
10          sup(X) ← sup(i) // sum of cnt(i) for all nodes labeled i
11          F ← F ∪ {(X, sup(X))}
12          R_X ← ∅ // projected FP-tree for X
13          foreach path ∈ PATHFROMROOT(i) do
14              cnt(i) ← count of i in path
15              insert path, excluding i, into FP-tree R_X with count cnt(i)
16          if R_X ≠ ∅ then FPGROWTH (R_X, X, F, minsup)
Example 8.12. We illustrate the FPGrowth method on the FP-tree R built in Example 8.11, as shown in Figure 8.7f. Let minsup = 3. The initial prefix is P = ∅, and the set of frequent items i in R are B(6), E(5), A(4), C(4), and D(4). FPGrowth creates a projected FP-tree for each item, but in increasing order of support.

The projected FP-tree for item D is shown in Figure 8.8c. Given the initial FP-tree R shown in Figure 8.7f, there are three paths from the root to a node labeled D, namely

    BCD,   cnt(D) = 1
    BEACD, cnt(D) = 1
    BEAD,  cnt(D) = 2

These three paths, excluding the last item i = D, are inserted into the new FP-tree R_D with the counts incremented by the corresponding cnt(D) values, that is, we insert into R_D the paths BC with count of 1, BEAC with count of 1, and finally BEA with count of 2, as shown in Figures 8.8a–c. The projected FP-tree for D is shown in Figure 8.8c, which is processed recursively.
When we process R_D, we have the prefix itemset P = D, and after removing the infrequent item C (which has support 2), we find that the resulting FP-tree is a single path, B(4)–E(3)–A(3). Thus, we enumerate all subsets of this path and prefix them
Figure 8.8. Projected frequent pattern tree for D: (a) add BC, cnt = 1; (b) add BEAC, cnt = 1; (c) add BEA, cnt = 2.
with D, to obtain the frequent itemsets DB(4), DE(3), DA(3), DBE(3), DBA(3), DEA(3), and DBEA(3). At this point the call from D returns.
In a similar manner, we process the remaining items at the top level. The projected trees for C, A, and E are all single-path trees, allowing us to generate the frequent itemsets {CB(4), CE(3), CBE(3)}, {AE(4), AB(4), AEB(4)}, and {EB(5)}, respectively. This process is illustrated in Figure 8.9.
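The sketch below renders the same recursion in Python, with one simplification: each (projected) FP-tree is kept as a list of (path, count) pairs, i.e., its multiset of root-to-leaf prefixes, rather than as a linked tree. This is equivalent for correctness, though it forgoes the FP-tree's compression; all names are illustrative.

    from collections import Counter

    def fpgrowth(paths, prefix, minsup, F):
        # paths: list of (item-tuple, count); tuples sorted by decreasing support
        sup = Counter()
        for path, cnt in paths:
            for x in path:
                sup[x] += cnt
        # project on each frequent item i, in increasing order of support
        for i in sorted((x for x in sup if sup[x] >= minsup), key=sup.get):
            X = prefix + (i,)
            F["".join(sorted(X))] = sup[i]
            # paths from the root to i, excluding i, weighted by i's path count
            proj = [(path[:path.index(i)], cnt)
                    for path, cnt in paths if i in path]
            if any(p for p, _ in proj):
                fpgrowth(proj, X, minsup, F)

    db = ["BEAD", "BEC", "BEAD", "BEAC", "BEACD", "BCD"]  # support-sorted
    F = {}
    fpgrowth([(tuple(t), 1) for t in db], (), 3, F)
    print(F)   # includes 'BD': 4, 'DE': 3, 'ABDE': 3, ...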
8.3 GENERATING ASSOCIATION RULES
Given a collection of frequent itemsets F, to generate association rules we iterate over all itemsets Z ∈ F, and calculate the confidence of various rules that can be derived from the itemset. Formally, given a frequent itemset Z ∈ F, we look at all proper subsets X ⊂ Z to compute rules of the form

    X −→ Y, where Y = Z \ X

with s and c denoting the support and confidence of the rule. The rule must be frequent because

    s = sup(XY) = sup(Z) ≥ minsup

Thus, we have only to check whether the rule confidence satisfies the minconf threshold. We compute the confidence as follows:

    c = sup(X ∪ Y)/sup(X) = sup(Z)/sup(X)

If c ≥ minconf, then the rule is a strong rule. On the other hand, if conf(X −→ Y) < minconf, then conf(W −→ Z \ W) < minconf for all subsets W ⊂ X, as sup(W) ≥ sup(X). We can thus avoid checking subsets of X.
Algorithm 8.6 shows the pseudo-code for the association rule mining algorithm. For each frequent itemset Z ∈ F, with size at least 2, we initialize the set of antecedents A with all the nonempty proper subsets of Z (line 2). For each X ∈ A we check whether the confidence of the rule X −→ Z \ X is at least minconf (line 7). If so, we output the rule. Otherwise, we remove all subsets W ⊂ X from the set of possible antecedents (line 10).

Figure 8.9. FPGrowth algorithm: frequent pattern tree projection (the projected trees R_D, R_C, R_A, and R_E).
Example 8.13. Consider the frequent itemset ABDE(3) from Table 8.1, whose support is shown within the brackets. Assume that minconf = 0.9. To generate strong association rules we initialize the set of antecedents to

    A = {ABD(3), ABE(4), ADE(3), BDE(3), AB(4), AD(3), AE(4), BD(4), BE(5), DE(3), A(4), B(6), D(4), E(5)}
ALGORITHM 8.6. Algorithm ASSOCIATIONRULES

ASSOCIATIONRULES (F, minconf):
 1  foreach Z ∈ F, such that |Z| ≥ 2 do
 2      A ← {X | X ⊂ Z, X ≠ ∅}
 3      while A ≠ ∅ do
 4          X ← maximal element in A
 5          A ← A \ {X} // remove X from A
 6          c ← sup(Z)/sup(X)
 7          if c ≥ minconf then
 8              print X −→ Y, sup(Z), c
 9          else
10              A ← A \ {W | W ⊂ X} // remove all subsets of X from A
The first subset is X = ABD, and the confidence of ABD −→ E is 3/3 = 1.0, so we output it. The next subset is X = ABE, but the corresponding rule ABE −→ D is not strong since conf(ABE −→ D) = 3/4 = 0.75. We can thus remove from A all subsets of ABE; the updated set of antecedents is therefore

    A = {ADE(3), BDE(3), AD(3), BD(4), DE(3), D(4)}

Next, we select X = ADE, which yields a strong rule, and so do X = BDE and X = AD. However, when we process X = BD, we find that conf(BD −→ AE) = 3/4 = 0.75, and thus we can prune all subsets of BD from A, to yield

    A = {DE(3)}

The last rule to be tried is DE −→ AB, which is also strong. The final set of strong rules that are output is as follows:

    ABD −→ E,  conf = 1.0
    ADE −→ B,  conf = 1.0
    BDE −→ A,  conf = 1.0
    AD −→ BE,  conf = 1.0
    DE −→ AB,  conf = 1.0
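Algorithm 8.6 can be sketched in Python as follows, assuming F is a dict mapping frozenset itemsets to supports (all frequent itemsets and their supports must be present); the names and inline data are illustrative.

    from itertools import combinations

    def association_rules(F, minconf):
        rules = []
        for Z, supZ in F.items():
            if len(Z) < 2:
                continue
            # all nonempty proper subsets, largest first (maximal elements first)
            A = [frozenset(X) for k in range(len(Z) - 1, 0, -1)
                 for X in combinations(Z, k)]
            while A:
                X = A.pop(0)
                c = supZ / F[X]
                if c >= minconf:
                    rules.append((X, Z - X, supZ, c))
                else:
                    # conf(W −→ Z \ W) < minconf for all W ⊂ X, so prune them
                    A = [W for W in A if not W < X]
        return rules

    F = {frozenset(k): v for k, v in {
        "A": 4, "B": 6, "D": 4, "E": 5, "AB": 4, "AD": 3, "AE": 4, "BD": 4,
        "BE": 5, "DE": 3, "ABD": 3, "ABE": 4, "ADE": 3, "BDE": 3, "ABDE": 3}.items()}
    for X, Y, s, c in association_rules(F, 0.9):
        print("".join(sorted(X)), "->", "".join(sorted(Y)), s, c)

Run on the full F, this also emits strong rules from the smaller itemsets; restricting the output to Z = ABDE reproduces exactly the five strong rules of Example 8.13.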
8.4 FURTHER READING
The association rule mining problem was introduced in Agrawal, Imieliński, and Swami (1993). The Apriori method was proposed in Agrawal and Srikant (1994), and a similar approach was outlined independently in Mannila, Toivonen, and Verkamo (1994). The tidlist intersection based Eclat method is described in Zaki et al. (1997), and the dEclat approach that uses diffsets appears in Zaki and Gouda (2003). Finally, the FPGrowth algorithm is described in Han, Pei, and Yin (2000). For an experimental comparison between several of the frequent itemset mining algorithms see Goethals and Zaki (2004). There is a very close connection between itemset mining and association rules, and formal concept analysis (Ganter, Wille, and Franzke, 1997). For example, association rules can be considered to be partial implications (Luxenburger, 1991) with frequency constraints.
Agrawal, R., Imieliński, T., and Swami, A. (May 1993). "Mining association rules between sets of items in large databases." In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.
Agrawal, R. and Srikant, R. (Sept. 1994). "Fast algorithms for mining association rules." In Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499.
Ganter, B., Wille, R., and Franzke, C. (1997). Formal Concept Analysis: Mathematical Foundations. New York: Springer-Verlag.
Goethals, B. and Zaki, M. J. (2004). "Advances in frequent itemset mining implementations: report on FIMI'03." ACM SIGKDD Explorations, 6(1): 109–117.
Han, J., Pei, J., and Yin, Y. (May 2000). "Mining frequent patterns without candidate generation." In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.
Luxenburger, M. (1991). "Implications partielles dans un contexte." Mathématiques et Sciences Humaines, 113: 35–55.
Mannila, H., Toivonen, H., and Verkamo, I. A. (1994). "Efficient algorithms for discovering association rules." In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases. AAAI Press.
Zaki, M. J. and Gouda, K. (2003). "Fast vertical mining using diffsets." In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 326–335.
Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997). "New algorithms for fast discovery of association rules." In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 283–286.
8.5 EXERCISES
Q1. Given the database in Table 8.2.
(a) Using minsup = 3/8, show how the Apriori algorithm enumerates all frequent patterns from this dataset.
(b) With minsup = 2/8, show how FPGrowth enumerates the frequent itemsets.

Q2. Consider the vertical database shown in Table 8.3. Assuming that minsup = 3, enumerate all the frequent itemsets using the Eclat method.
Q6. Consider Figure 8.10. It shows a simple taxonomy on some food items. Each leaf is a simple item and an internal node represents a higher-level category or item. Each item (single or high-level) has a unique integer label noted under it. Consider the database composed of the simple items shown in Table 8.5. Answer the following questions:
Figure 8.10. Item taxonomy for Q6. (Item labels: vegetables (1), wheat (2), white (3), rye (4), rice (5), fruit (6), yogurt (7), whole (8), 2% (9), skim (10), cheese (11), bread (12), milk (13), grain (14), dairy (15); grain covers bread — wheat, white, rye — and rice, while dairy covers yogurt, milk — whole, 2%, skim — and cheese.)
Table 8.5. Dataset for Q6

    tid   itemset
    1     2 3 6 7
    2     1 3 4 8 11
    3     3 9 11
    4     1 5 6 7
    5     1 3 8 10 11
    6     3 5 7 9 11
    7     4 6 8 10 11
    8     1 3 5 8 11
(a) What is the size of the itemset search space if one restricts oneself to only itemsets composed of simple items?
(b) Let X = {x_1, x_2, ..., x_k} be a frequent itemset. Let us replace some x_i ∈ X with its parent in the taxonomy (provided it exists) to obtain X′. Then the support of the new itemset X′ is:
    i. more than the support of X
    ii. less than the support of X
    iii. not equal to the support of X
    iv. more than or equal to the support of X
    v. less than or equal to the support of X
(c) Use minsup = 7/8. Find all frequent itemsets composed only of high-level items in the taxonomy. Keep in mind that if a simple item appears in a transaction, then its high-level ancestors are all assumed to occur in the transaction as well.
Q7. Let D be a database with n transactions. Consider a sampling approach for mining frequent itemsets, where we extract a random sample S ⊂ D, with say m transactions, and we mine all the frequent itemsets in the sample, denoted as F_S. Next, we make one complete scan of D, and for each X ∈ F_S, we find its actual support in the whole database. Some of the itemsets in the sample may not be truly frequent in the database; these are the false positives. Also, some of the true frequent itemsets in the original database may never be present in the sample at all; these are the false negatives.

Prove that if X is a false negative, then this case can be detected by counting the support in D for every itemset belonging to the negative border of F_S, denoted Bd⁻(F_S), which is defined as the set of minimal infrequent itemsets in sample S. Formally,

    Bd⁻(F_S) = inf{Y | sup(Y) < minsup and ∀Z ⊂ Y, sup(Z) ≥ minsup}

where inf returns the minimal elements of the set.
Q8. Assume that we want to mine frequent patterns from relational tables. For example, consider Table 8.6, with three attributes A, B, and C, and six records. Each attribute has a domain from which it draws its values; for example, the domain of A is dom(A) = {a1, a2, a3}. Note that no record can have more than one value of a given attribute.
Table 8.6. Data for Q8

    tid   A    B    C
    1     a1   b1   c1
    2     a2   b3   c2
    3     a2   b3   c3
    4     a2   b1   c1
    5     a2   b3   c3
    6     a3   b3   c3
We define a relational pattern P over some k attributes X_1, X_2, ..., X_k to be a subset of the Cartesian product of the domains of the attributes, i.e., P ⊆ dom(X_1) × dom(X_2) × ··· × dom(X_k). That is, P = P_1 × P_2 × ··· × P_k, where each P_i ⊆ dom(X_i). For example, {a1, a2} × {c1} is a possible pattern over attributes A and C, whereas {a1} × {b1} × {c1} is another pattern over attributes A, B, and C.

The support of relational pattern P = P_1 × P_2 × ··· × P_k in dataset D is defined as the number of records in the dataset that belong to it; it is given as

    sup(P) = |{r = (r_1, r_2, ..., r_n) ∈ D : r_i ∈ P_i for all P_i in P}|

For example, sup({a1, a2} × {c1}) = 2, as both records 1 and 4 contribute to its support. Note, however, that the pattern {a1} × {c1} has a support of 1, since only record 1 belongs to it. Thus, relational patterns do not satisfy the Apriori property that we used for frequent itemsets, that is, subsets of a frequent relational pattern can be infrequent.
We call a relational pattern P = P_1 × P_2 × ··· × P_k over attributes X_1, ..., X_k valid iff for all u ∈ P_i and all v ∈ P_j, the pair of values (X_i = u, X_j = v) occurs together in some record. For example, {a1, a2} × {c1} is a valid pattern since both (A = a1, C = c1) and (A = a2, C = c1) occur in some records (namely, records 1 and 4, respectively), whereas {a1, a2} × {c2} is not a valid pattern, since there is no record that has the values (A = a1, C = c2). Thus, for a pattern to be valid every pair of values in P from distinct attributes must belong to some record.

Given that minsup = 2, find all frequent, valid, relational patterns in the dataset in Table 8.6.
Q9. Given the following multiset dataset:

    tid   multiset
    1     ABCA
    2     ABABA
    3     CABBA

Using minsup = 2, answer the following:
(a) Find all frequent multisets. Recall that a multiset is still a set (i.e., order is not important), but it allows multiple occurrences of an item.
(b) Find all minimal infrequent multisets, that is, those infrequent multisets that have no infrequent sub-multisets.
CHAPTER 9
Summarizing Itemsets
The search space for frequent itemsets is usually very large and it grows exponentially
with the number of items. In particular, a low minimum support value may result
in an intractable number of frequent itemsets. An alternative approach, studied in
this chapter, is to determine condensed representations of the frequent itemsets that
summarize their essential characteristics. The use of condensed representations can
not only reduce the computational and storage demands, but it can also make it easier
to analyze the mined patterns. In this chapter we discuss three of these representations:
closed, maximal, and nonderivable itemsets.
9.1 MAXIMAL AND CLOSED FREQUENT ITEMSETS
Given a binary database D ⊆ T × I, over the tids T and items I, let F denote the set of all frequent itemsets, that is,

    F = {X | X ⊆ I and sup(X) ≥ minsup}
Maximal Frequent Itemsets

A frequent itemset X ∈ F is called maximal if it has no frequent supersets. Let M be the set of all maximal frequent itemsets, given as

    M = {X | X ∈ F and ∄Y ⊃ X, such that Y ∈ F}

The set M is a condensed representation of the set of all frequent itemsets F, because we can determine whether any itemset X is frequent or not using M. If there exists a maximal itemset Z such that X ⊆ Z, then X must be frequent; otherwise X cannot be frequent. On the other hand, we cannot determine sup(X) using M alone, although we can lower-bound it, that is, sup(X) ≥ sup(Z) if X ⊆ Z ∈ M.
Example 9.1. Consider the dataset given in Figure 9.1a. Using any of the algorithms discussed in Chapter 8 and minsup = 3, we obtain the frequent itemsets shown in Figure 9.1b. Notice that there are 19 frequent itemsets out of the 2^5 − 1 = 31 possible nonempty itemsets. Out of these, there are only two maximal itemsets,
    Tid   Itemset
    1     ABDE
    2     BCE
    3     ABDE
    4     ABCE
    5     ABCDE
    6     BCD
    (a) Transaction database

    sup   Itemsets
    6     B
    5     E, BE
    4     A, C, D, AB, AE, BC, BD, ABE
    3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
    (b) Frequent itemsets (minsup = 3)

Figure 9.1. An example database.
ABDE and BCE. Any other frequent itemset must be a subset of one of the maximal itemsets. For example, we can determine that ABE is frequent, since ABE ⊂ ABDE, and we can establish that sup(ABE) ≥ sup(ABDE) = 3.
Closed Frequent Itemsets

Recall that the function t : 2^I → 2^T [Eq. (8.2)] maps itemsets to tidsets, and the function i : 2^T → 2^I [Eq. (8.1)] maps tidsets to itemsets. That is, given T ⊆ T and X ⊆ I, we have

    t(X) = {t ∈ T | t contains X}
    i(T) = {x ∈ I | ∀t ∈ T, t contains x}

Define by c : 2^I → 2^I the closure operator, given as

    c(X) = i ◦ t(X) = i(t(X))
The closure operator c maps itemsets to itemsets, and it satisfies the following three properties:
• Extensive: X ⊆ c(X)
• Monotonic: If X_i ⊆ X_j, then c(X_i) ⊆ c(X_j)
• Idempotent: c(c(X)) = c(X)
An itemset X is called closed if c(X) = X, that is, if X is a fixed point of the closure operator c. On the other hand, if X ≠ c(X), then X is not closed, but the set c(X) is called its closure. From the properties of the closure operator, both X and c(X) have the same tidset. It follows that a frequent set X ∈ F is closed if it has no frequent superset with the same frequency, because by definition it is the largest itemset common to all the tids in the tidset t(X). The set of all closed frequent itemsets is thus defined as

    C = {X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y)}    (9.1)

Put differently, X is closed if all supersets of X have strictly less support, that is, sup(X) > sup(Y), for all Y ⊃ X.
The set of all closed frequent itemsets C is a condensed representation, as we can determine whether an itemset X is frequent, as well as the exact support of X, using C alone. The itemset X is frequent if there exists a closed frequent itemset Z ∈ C such that X ⊆ Z. Further, the support of X is given as

    sup(X) = max{sup(Z) | Z ∈ C, X ⊆ Z}

The following relationship holds between the sets of all, closed, and maximal frequent itemsets:

    M ⊆ C ⊆ F
Minimal Generators

A frequent itemset X is a minimal generator if it has no subsets with the same support:

    G = {X | X ∈ F and ∄Y ⊂ X, such that sup(X) = sup(Y)}

In other words, all subsets of X have strictly higher support, that is, sup(X) < sup(Y), for all Y ⊂ X. The concept of minimal generators is closely related to the notion of closed itemsets. Given an equivalence class of itemsets that have the same tidset, a closed itemset is the unique maximum element of the class, whereas the minimal generators are the minimal elements of the class.
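The functions t and i and the closure operator c are direct to code; the Python sketch below uses the example database of Figure 9.1a, with tids as dictionary keys (the names mirror t, i, and c).

    def t(X, db):   # tidset of itemset X
        return {tid for tid, items in db.items() if X <= items}

    def i(T, db):   # largest itemset common to all tids in T
        return set.intersection(*(db[tid] for tid in T)) if T else set()

    def c(X, db):   # closure operator c = i ∘ t
        return i(t(X, db), db)

    db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
          4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}
    print(sorted(t({"A", "D"}, db)))           # [1, 3, 5]
    print(sorted(c({"A", "D"}, db)))           # ['A', 'B', 'D', 'E'], i.e. c(AD) = ABDE
    print(c(set("ABDE"), db) == set("ABDE"))   # True: ABDE is closed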
Example 9.2. Consider the example dataset in Figure 9.1a. The frequent closed (as well as maximal) itemsets using minsup = 3 are shown in Figure 9.2. We can see, for instance, that the itemsets AD, DE, ABD, ADE, BDE, and ABDE occur in the same three transactions, namely 135, and thus constitute an equivalence class. The largest itemset among these, namely ABDE, is the closed itemset. Using the closure operator yields the same result; we have c(AD) = i(t(AD)) = i(135) = ABDE, which indicates that the closure of AD is ABDE. To verify that ABDE is closed note that c(ABDE) = i(t(ABDE)) = i(135) = ABDE. The minimal elements of the equivalence class, namely AD and DE, are the minimal generators. No subset of these itemsets shares the same tidset.

The set of all closed frequent itemsets, and the corresponding set of minimal generators, is as follows:

    Tidset    C       G
    1345      ABE     A
    123456    B       B
    1356      BD      D
    12345     BE      E
    2456      BC      C
    135       ABDE    AD, DE
    245       BCE     CE
Figure 9.2. Frequent, closed, minimal generator, and maximal frequent itemsets. Itemsets that are boxed and shaded are closed, whereas those within boxes (but unshaded) are the minimal generators; maximal itemsets are shown boxed with double lines.
Out of the closed itemsets, the maximal ones are ABDE and BCE. Consider itemset AB. Using C we can determine that

    sup(AB) = max{sup(ABE), sup(ABDE)} = max{4, 3} = 4
9.2 MINING MAXIMAL FREQUENT ITEMSETS: GENMAX ALGORITHM
Mining maximal itemsets requires additional steps beyond simply determining the frequent itemsets. Assuming that the set of maximal frequent itemsets is initially empty, that is, M = ∅, each time we generate a new frequent itemset X, we have to perform the following maximality checks:
• Subset Check: ∄Y ∈ M, such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M, as a potentially maximal itemset.
• Superset Check: ∄Y ∈ M, such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.

These two maximality checks take O(|M|) time, which can get expensive, especially as M grows; thus for efficiency reasons it is crucial to minimize the number of times these checks are performed. As such, any of the frequent itemset mining algorithms from Chapter 8 can be extended to mine maximal frequent itemsets by adding the maximality checking steps. Here we consider the GenMax method, which is based on the tidset intersection approach of Eclat (see Section 8.2.2). We shall see that it never inserts a nonmaximal itemset into M. It thus eliminates the superset checks and requires only subset checks to determine maximality.
Algorithm 9.1 shows the pseudo-code for GenMax. The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩, and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs, called IT-pairs, of the form ⟨X, t(X)⟩, the recursive GenMax method works as follows. In lines 1–3, we check if the entire current branch can be pruned by checking if the union of all the itemsets, Y = ∪X_i, is already subsumed by (or contained in) some maximal pattern Z ∈ M. If so, no maximal itemset can be generated from the current branch, and it is pruned. On the other hand, if the branch is not pruned, we intersect each IT-pair ⟨X_i, t(X_i)⟩ with all the other IT-pairs ⟨X_j, t(X_j)⟩, with j > i, to generate new candidates X_ij, which are added to the IT-pair set P_i (lines 6–9). If P_i is not empty, a recursive call to GENMAX is made to find other potentially frequent extensions of X_i. On the other hand, if P_i is empty, it means that X_i cannot be extended, and it is potentially maximal. In this case, we add X_i to the set M, provided that X_i is not contained in any previously added maximal set Z ∈ M (line 12). Note also that, because of this check for maximality before inserting any itemset into M, we never have to remove any itemsets from it. In other words, all itemsets in M are guaranteed to be maximal. On termination of GenMax, the set M contains the final set of all maximal frequent itemsets. The GenMax approach also includes a number of other optimizations to reduce the maximality checks and to improve the support computations. Further, GenMax utilizes diffsets (differences of tidsets) for fast support computation, which were described in Section 8.2.2. We omit these optimizations here for clarity.
ALGORITHM 9.1. Algorithm GENMAX

// Initial Call: M ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
GENMAX (P, minsup, M):
 1  Y ← ∪ X_i
 2  if ∃Z ∈ M, such that Y ⊆ Z then
 3      return // prune entire branch
 4  foreach ⟨X_i, t(X_i)⟩ ∈ P do
 5      P_i ← ∅
 6      foreach ⟨X_j, t(X_j)⟩ ∈ P, with j > i do
 7          X_ij ← X_i ∪ X_j
 8          t(X_ij) = t(X_i) ∩ t(X_j)
 9          if sup(X_ij) ≥ minsup then P_i ← P_i ∪ {⟨X_ij, t(X_ij)⟩}
10      if P_i ≠ ∅ then GENMAX (P_i, minsup, M)
11      else if ∄Z ∈ M, X_i ⊆ Z then
12          M = M ∪ {X_i} // add X_i to maximal set
Example 9.3. Figure 9.3 shows the execution of GenMax on the example database from Figure 9.1a using minsup = 3. Initially the set of maximal itemsets is empty. The root of the tree represents the initial call with all IT-pairs consisting of frequent single items and their tidsets. We first intersect t(A) with the tidsets of the other items. The set of frequent extensions from A is

    P_A = {⟨AB, 1345⟩, ⟨AD, 135⟩, ⟨AE, 1345⟩}

Choosing X_i = AB leads to the next set of extensions, namely

    P_AB = {⟨ABD, 135⟩, ⟨ABE, 1345⟩}

Finally, we reach the left-most leaf corresponding to P_ABD = {⟨ABDE, 135⟩}. At this point, we add ABDE to the set of maximal frequent itemsets because it has no other extensions, so that M = {ABDE}.

The search then backtracks one level, and we try to process ABE, which is also a candidate to be maximal. However, it is contained in ABDE, so it is pruned. Likewise, when we try to process P_AD = {⟨ADE, 135⟩} it will get pruned because it is also subsumed by ABDE, and similarly for AE. At this stage, all maximal itemsets starting with A have been found, and we next proceed with the B branch. The left-most B branch, namely BCE, cannot be extended further. Because BCE is not
Figure 9.3. Mining maximal frequent itemsets. Maximal itemsets are shown as shaded ovals, whereas pruned branches are shown with the strike-through. Infrequent itemsets are not shown.
a subset of any maximal itemset in M, we insert it as a maximal itemset, so that M = {ABDE, BCE}. Subsequently, all remaining branches are subsumed by one of these two maximal itemsets, and are thus pruned.
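A compact Python rendering of Algorithm 9.1 follows, with tidsets as Python sets and itemsets as frozensets; the subset checks scan M linearly, as in the pseudo-code, and all names are illustrative.

    def genmax(P, minsup, M):
        # P: list of (itemset, tidset) pairs; M: maximal itemsets found so far
        Y = frozenset().union(*(X for X, _ in P))
        if any(Y <= Z for Z in M):              # lines 1-3: prune entire branch
            return
        for idx, (Xi, tXi) in enumerate(P):
            Pi = []
            for Xj, tXj in P[idx + 1:]:
                tXij = tXi & tXj
                if len(tXij) >= minsup:
                    Pi.append((Xi | Xj, tXij))
            if Pi:
                genmax(Pi, minsup, M)
            elif not any(Xi <= Z for Z in M):   # subset check before inserting
                M.append(Xi)

    vdb = {"A": {1, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {2, 4, 5, 6},
           "D": {1, 3, 5, 6}, "E": {1, 2, 3, 4, 5}}
    M = []
    genmax([(frozenset(x), t) for x, t in sorted(vdb.items())], 3, M)
    print(["".join(sorted(X)) for X in M])      # ['ABDE', 'BCE']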
9.3 MINING CLOSED FREQUENT ITEMSETS: CHARM ALGORITHM
Mining closed frequent itemsets requires that we perform closure checks, that is, whether X = c(X). Direct closure checking can be very expensive, as we would have to verify that X is the largest itemset common to all the tids in t(X), that is, X = ∩_{t∈t(X)} i(t). Instead, we will describe a vertical tidset intersection based method called CHARM that performs more efficient closure checking. Given a collection of IT-pairs {⟨X_i, t(X_i)⟩}, the following three properties hold:

Property (1) If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j), which implies that we can replace every occurrence of X_i with X_i ∪ X_j and prune the branch under X_j because its closure is identical to the closure of X_i ∪ X_j.

Property (2) If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j) but c(X_i) = c(X_i ∪ X_j), which means that we can replace every occurrence of X_i with X_i ∪ X_j, but we cannot prune X_j because it generates a different closure. Note that if t(X_i) ⊃ t(X_j) then we simply interchange the roles of X_i and X_j.

Property (3) If t(X_i) ≠ t(X_j), then c(X_i) ≠ c(X_j) ≠ c(X_i ∪ X_j). In this case we cannot remove either X_i or X_j, as each of them generates a different closure.
Algorithm 9.2 presents the pseudo-code for Charm, which is also based on the Eclat algorithm described in Section 8.2.2. It takes as input the set of all frequent single items along with their tidsets. Also, initially the set of all closed itemsets, C, is empty. Given any IT-pair set P = {⟨X_i, t(X_i)⟩}, the method first sorts them in increasing order of support. For each itemset X_i we try to extend it with all other items X_j in the sorted order, and we apply the above three properties to prune branches where possible. First we make sure that X_ij = X_i ∪ X_j is frequent, by checking the cardinality of t(X_ij). If yes, then we check properties 1 and 2 (lines 8 and 12). Note that whenever we replace X_i with X_ij = X_i ∪ X_j, we make sure to do so in the current set P, as well as the new set P_i. Only when property 3 holds do we add the new extension X_ij to the set P_i (line 14). If the set P_i is not empty, then we make a recursive call to Charm. Finally, if X_i is not a subset of any closed set Z with the same support, we can safely add it to the set of closed itemsets, C (line 18). For fast support computation, Charm uses the diffset optimization described in Section 8.2.2; we omit it here for clarity.
Example 9.4. We illustrate the Charm algorithm for mining frequent closed itemsets from the example database in Figure 9.1a, using minsup = 3. Figure 9.4 shows the sequence of steps. The initial set of IT-pairs, after support based sorting, is shown at the root of the search tree. The sorted order is A, C, D, E, and B. We first process extensions from A, as shown in Figure 9.4a. Because AC is not frequent,
ALGORITHM 9.2. Algorithm CHARM

// Initial Call: C ← ∅, P ← {⟨i, t(i)⟩ : i ∈ I, sup(i) ≥ minsup}
CHARM (P, minsup, C):
 1  sort P in increasing order of support (i.e., by increasing |t(X_i)|)
 2  foreach ⟨X_i, t(X_i)⟩ ∈ P do
 3      P_i ← ∅
 4      foreach ⟨X_j, t(X_j)⟩ ∈ P, with j > i do
 5          X_ij = X_i ∪ X_j
 6          t(X_ij) = t(X_i) ∩ t(X_j)
 7          if sup(X_ij) ≥ minsup then
 8              if t(X_i) = t(X_j) then // Property 1
 9                  replace X_i with X_ij in P and P_i
10                  remove ⟨X_j, t(X_j)⟩ from P
11              else
12                  if t(X_i) ⊂ t(X_j) then // Property 2
13                      replace X_i with X_ij in P and P_i
14                  else // Property 3
15                      P_i ← P_i ∪ {⟨X_ij, t(X_ij)⟩}
16      if P_i ≠ ∅ then CHARM (P_i, minsup, C)
17      if ∄Z ∈ C, such that X_i ⊆ Z and t(X_i) = t(Z) then
18          C = C ∪ X_i // add X_i to closed set
it is pruned. AD is frequent and because t(A) ≠ t(D), we add ⟨AD, 135⟩ to the set P_A (property 3). When we combine A with E, property 2 applies, and we simply replace all occurrences of A in both P and P_A with AE, which is illustrated with the strike-through. Likewise, because t(A) ⊂ t(B), all current occurrences of A (actually AE) in both P and P_A are replaced by AEB. The set P_A thus contains only one itemset, {⟨ADEB, 135⟩}. When CHARM is invoked with P_A as the IT-pair, it jumps straight to line 18, and adds ADEB to the set of closed itemsets C. When the call returns, we check whether AEB can be added as a closed itemset. AEB is a subset of ADEB, but it does not have the same support; thus AEB is also added to C. At this point all closed itemsets containing A have been found.

The Charm algorithm proceeds with the remaining branches as shown in Figure 9.4b. For instance, C is processed next. CD is infrequent and thus pruned. CE is frequent and it is added to P_C as a new extension (via property 3). Because t(C) ⊂ t(B), all occurrences of C are replaced by CB, and P_C = {⟨CEB, 245⟩}. CEB and CB are both found to be closed. The computation proceeds in this manner until all closed frequent itemsets are enumerated. Note that when we get to DEB and perform the closure check, we find that it is a subset of ADEB and also has the same support; thus DEB is not closed.
Figure 9.4. Mining closed frequent itemsets: (a) process A; (b) Charm. Closed itemsets are shown as shaded ovals. Strike-through represents itemsets X_i replaced by X_i ∪ X_j during execution of the algorithm. Infrequent itemsets are not shown.
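The three properties drive the following Python sketch of Algorithm 9.2. As a simplification, replacements of X_i are applied to the collected extensions P_i after the inner loop, which has the same effect as replacing occurrences in place (tidsets are unchanged by properties 1 and 2, since t(X_i) ⊆ t(X_j) implies t(X_i ∪ X_j) = t(X_i)); names are illustrative.

    def charm(P, minsup, C):
        # P: list of [itemset, tidset] pairs; C: closed (itemset, tidset) list
        P = sorted(P, key=lambda p: len(p[1]))   # increasing support
        idx = 0
        while idx < len(P):
            Xi, tXi = P[idx]
            Pi, j = [], idx + 1
            while j < len(P):
                Xj, tXj = P[j]
                tXij = tXi & tXj
                if len(tXij) >= minsup:
                    if tXi == tXj:               # Property 1
                        Xi = Xi | Xj
                        del P[j]; continue       # prune the X_j branch
                    elif tXi < tXj:              # Property 2
                        Xi = Xi | Xj             # replace X_i (tidset unchanged)
                    else:                        # Property 3
                        Pi.append([Xi | Xj, tXij])
                j += 1
            Pi = [[Xi | X, tX] for X, tX in Pi]  # propagate replacements of X_i
            if Pi:
                charm(Pi, minsup, C)
            if not any(Xi <= Z and tXi == tZ for Z, tZ in C):
                C.append((Xi, tXi))              # X_i is closed
            idx += 1

    vdb = {"A": {1, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {2, 4, 5, 6},
           "D": {1, 3, 5, 6}, "E": {1, 2, 3, 4, 5}}
    C = []
    charm([[frozenset(x), t] for x, t in sorted(vdb.items())], 3, C)
    for X, tX in C:
        print("".join(sorted(X)), len(tX))   # the 7 closed itemsets of Example 9.2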
9.4 NONDERIVABLE ITEMSETS
An itemset is called
nonderivable
if its support cannot be deduced from the supports
of its subsets. The set of all frequent nonderivable itemsets is a summary or condensed
representation of the set of all frequent itemsets. Further, it is lossless with respect to
support, that is, the exact support of all other frequent itemsets can be deduced from it.
Generalized Itemsets

Let T be a set of tids, let I be a set of items, and let X be a k-itemset, that is, X = {x_1, x_2, ..., x_k}. Consider the tidsets t(x_i) for each item x_i ∈ X. These k tidsets induce a partitioning of the set of all tids into 2^k regions, some of which may be empty, where each partition contains the tids for some subset of items Y ⊆ X, but for none of the remaining items Z = X \ Y. Each such region is therefore the tidset of a generalized itemset comprising items in X or their negations. As such, a generalized itemset can be represented as YZ̄, where Y consists of regular items and Z consists of negated items. We define the support of a generalized itemset YZ̄ as the number of transactions that contain all items in Y but no item in Z:

    sup(YZ̄) = |{t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅}|

Figure 9.5. Tidset partitioning induced by t(A), t(C), and t(D). (Region tids: t(ACD) = 5, t(ACD̄) = 4, t(AC̄D) = 13, t(AC̄D̄) = ∅, t(ĀCD) = 6, t(ĀCD̄) = 2, t(ĀC̄D) = ∅, t(ĀC̄D̄) = ∅.)
Example 9.5. Consider the example dataset in Figure 9.1a. Let X = ACD. We have t(A) = 1345, t(C) = 2456, and t(D) = 1356. These three tidsets induce a partitioning on the space of all tids, as illustrated in the Venn diagram shown in Figure 9.5. For example, the region labeled t(ACD̄) = 4 represents those tids that contain A and C but not D. Thus, the support of the generalized itemset ACD̄ is 1. The tids that belong to all the eight regions are shown. Some regions are empty, which means that the support of the corresponding generalized itemset is 0.
Inclusion–Exclusion Principle

Let YZ̄ be a generalized itemset, and let X = Y ∪ Z = YZ. The inclusion–exclusion principle allows one to directly compute the support of YZ̄ as a combination of the supports for all itemsets W, such that Y ⊆ W ⊆ X:

    sup(YZ̄) = Σ_{Y ⊆ W ⊆ X} (−1)^|W\Y| · sup(W)    (9.2)
Example 9.6. Let us compute the support of the generalized itemset ĀCD̄ = CĀD̄, where Y = C, Z = AD, and X = YZ = ACD. In the Venn diagram shown in Figure 9.5, we start with all the tids in t(C), and remove the tids contained in t(AC) and t(CD). However, we realize that in terms of support this removes sup(ACD) twice, so we need to add it back. In other words, the support of CĀD̄ is given as

    sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD) = 4 − 2 − 2 + 1 = 1

But this is precisely what the inclusion–exclusion formula gives:

    sup(CĀD̄) = (−1)^0 sup(C)      [W = C, |W \ Y| = 0]
              + (−1)^1 sup(AC)     [W = AC, |W \ Y| = 1]
              + (−1)^1 sup(CD)     [W = CD, |W \ Y| = 1]
              + (−1)^2 sup(ACD)    [W = ACD, |W \ Y| = 2]
              = sup(C) − sup(AC) − sup(CD) + sup(ACD)

We can see that the support of CĀD̄ is a combination of the support values over all itemsets W such that C ⊆ W ⊆ ACD.
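Equation (9.2) can be evaluated mechanically; the Python sketch below computes the support of a generalized itemset YZ̄ on the example database (the names are illustrative):

    from itertools import combinations

    def sup(W, db):   # support of a regular itemset W
        return sum(1 for items in db.values() if W <= items)

    def gen_sup(Y, Z, db):
        # support of the generalized itemset Y Z-bar via Eq. (9.2)
        X = Y | Z
        total = 0
        for k in range(len(Y), len(X) + 1):
            for ext in combinations(X - Y, k - len(Y)):
                W = Y | set(ext)
                total += (-1) ** len(W - Y) * sup(W, db)
        return total

    db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
          4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}
    # sup(C with A and D negated): transactions containing C but neither A nor D
    print(gen_sup({"C"}, {"A", "D"}, db))   # 1 (only tid 2)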
Support Bounds for an Itemset

Notice that the inclusion–exclusion formula in Eq. (9.2) for the support of YZ̄ has terms for all subsets between Y and X = YZ. Put differently, for a given k-itemset X, there are 2^k generalized itemsets of the form YZ̄, with Y ⊆ X and Z = X \ Y, and each such generalized itemset has a term for sup(X) in the inclusion–exclusion equation; this happens when W = X. Because the support of any (generalized) itemset must be non-negative, we can derive a bound on the support of X from each of the 2^k generalized itemsets by setting sup(YZ̄) ≥ 0. However, note that whenever |X \ Y| is even, the coefficient of sup(X) is +1, but when |X \ Y| is odd, the coefficient of sup(X) is −1 in Eq. (9.2). Thus, from the 2^k possible subsets Y ⊆ X, we derive 2^{k−1} lower bounds and 2^{k−1} upper bounds for sup(X), obtained after setting sup(YZ̄) ≥ 0, and rearranging the terms in the inclusion–exclusion formula, so that sup(X) is on the left hand side and the remaining terms are on the right hand side:
Upper Bounds (|X \ Y| is odd):

    sup(X) ≤ Σ_{Y ⊆ W ⊂ X} (−1)^(|X\W|+1) sup(W)    (9.3)

Lower Bounds (|X \ Y| is even):

    sup(X) ≥ Σ_{Y ⊆ W ⊂ X} (−1)^(|X\W|+1) sup(W)    (9.4)

Note that the only difference in the two equations is the inequality, which depends on the starting subset Y.
Example 9.7. Consider Figure 9.5, which shows the partitioning induced by the tidsets of A, C, and D. We wish to determine the support bounds for X = ACD using each of the generalized itemsets YZ̄ where Y ⊆ X. For example, if Y = C, then the inclusion–exclusion principle [Eq. (9.2)] gives us

    sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Setting sup(CĀD̄) ≥ 0, and rearranging the terms, we obtain

    sup(ACD) ≥ −sup(C) + sup(AC) + sup(CD)

which is precisely the expression from the lower-bound formula in Eq. (9.4) because |X \ Y| = |ACD − C| = |AD| = 2 is even.

As another example, let Y = ∅. Setting sup(ĀC̄D̄) ≥ 0, we have

    sup(ĀC̄D̄) = sup(∅) − sup(A) − sup(C) − sup(D) + sup(AC) + sup(AD) + sup(CD) − sup(ACD) ≥ 0
    ⟹ sup(ACD) ≤ sup(∅) − sup(A) − sup(C) − sup(D) + sup(AC) + sup(AD) + sup(CD)

Notice that this rule gives an upper bound on the support of ACD, which also follows from Eq. (9.3) because |X \ Y| = 3 is odd.
In fact, from each of the regions in Figure 9.5, we get one bound, and out of the eight possible regions, exactly four give upper bounds and the other four give lower bounds for the support of ACD:

    sup(ACD) ≥ 0                                   when Y = ACD
    sup(ACD) ≤ sup(AC)                             when Y = AC
    sup(ACD) ≤ sup(AD)                             when Y = AD
    sup(ACD) ≤ sup(CD)                             when Y = CD
    sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)          when Y = A
    sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)          when Y = C
    sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)          when Y = D
    sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅)   when Y = ∅

This derivation of the bounds is schematically summarized in Figure 9.6. For instance, at level 2 the inequality is ≥, which implies that if Y is any itemset at this level, we will obtain a lower bound. The signs at different levels indicate the coefficient of the corresponding itemset in the upper or lower bound computations via Eq. (9.3) and Eq. (9.4). Finally, the subset lattice shows which intermediate terms W have to be considered in the summation. For instance, if Y = A, then the intermediate terms are W ∈ {AC, AD, A}, with the corresponding signs {+1, +1, −1}, so that we obtain the lower bound rule:

    sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)
Figure 9.6. Support bounds from subsets. (The subset lattice under X = ACD: level 1 holds AC, AD, CD with sign +1 and inequality ≤; level 2 holds A, C, D with sign −1 and inequality ≥; level 3 holds ∅ with sign +1 and inequality ≤.)
Nonderivable Itemsets

Given an itemset X, and Y ⊆ X, let IE(Y) denote the summation

    IE(Y) = Σ_{Y ⊆ W ⊂ X} (−1)^(|X\W|+1) · sup(W)

Then the sets of all upper and lower bounds for sup(X) are given as

    UB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is odd}
    LB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is even}

An itemset X is called nonderivable if max{LB(X)} ≠ min{UB(X)}, which implies that the support of X cannot be derived from the support values of its subsets; we know only the range of possible values, that is,

    sup(X) ∈ [max{LB(X)}, min{UB(X)}]

On the other hand, X is derivable if sup(X) = max{LB(X)} = min{UB(X)} because in this case sup(X) can be derived exactly using the supports of its subsets. Thus, the set of all frequent nonderivable itemsets is given as

    N = {X ∈ F | max{LB(X)} ≠ min{UB(X)}}

where F is the set of all frequent itemsets.
Example 9.8. Consider the set of upper bound and lower bound formulas for sup(ACD) outlined in Example 9.7. Using the tidset information in Figure 9.5, the
support lower bounds are

    sup(ACD) ≥ 0
    sup(ACD) ≥ sup(AC) + sup(AD) − sup(A) = 2 + 3 − 4 = 1
    sup(ACD) ≥ sup(AC) + sup(CD) − sup(C) = 2 + 2 − 4 = 0
    sup(ACD) ≥ sup(AD) + sup(CD) − sup(D) = 3 + 2 − 4 = 1

and the upper bounds are

    sup(ACD) ≤ sup(AC) = 2
    sup(ACD) ≤ sup(AD) = 3
    sup(ACD) ≤ sup(CD) = 2
    sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅)
             = 2 + 3 + 2 − 4 − 4 − 4 + 6 = 1

Thus, we have

    LB(ACD) = {0, 1}      max{LB(ACD)} = 1
    UB(ACD) = {1, 2, 3}   min{UB(ACD)} = 1

Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable.

Note that it is not essential to derive all the upper and lower bounds before one can conclude whether an itemset is derivable. For example, let X = ABDE. Considering its immediate subsets, we can obtain the following upper bound values:

    sup(ABDE) ≤ sup(ABD) = 3
    sup(ABDE) ≤ sup(ABE) = 4
    sup(ABDE) ≤ sup(ADE) = 3
    sup(ABDE) ≤ sup(BDE) = 3

From these upper bounds, we know for sure that sup(ABDE) ≤ 3. Now, let us consider the lower bound derived from Y = AB:

    sup(ABDE) ≥ sup(ABD) + sup(ABE) − sup(AB) = 3 + 4 − 4 = 3

At this point we know that sup(ABDE) ≥ 3, so without processing any further bounds, we can conclude that sup(ABDE) ∈ [3, 3], which means that ABDE is derivable.

For the example database in Figure 9.1a, the set of all frequent nonderivable itemsets, along with their support bounds, is

    N = {A[0,6], B[0,6], C[0,6], D[0,6], E[0,6], AD[2,4], AE[3,4], CE[3,4], DE[3,4]}

Notice that single items are always nonderivable by definition.
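The bound computation is easy to automate; the Python sketch below enumerates IE(Y) for every Y ⊆ X per Eqs. (9.3) and (9.4) and reports the derived support range (names and dataset are illustrative):

    from itertools import combinations

    def sup(W, db):
        return sum(1 for items in db.values() if W <= items)

    def support_range(X, db):
        # evaluate IE(Y) = Σ_{Y ⊆ W ⊂ X} (−1)^(|X\W|+1) sup(W) for every Y ⊆ X
        X = frozenset(X)
        LB, UB = [], []
        for k in range(len(X) + 1):
            for Ys in combinations(sorted(X), k):
                Y = frozenset(Ys)
                val = 0
                for m in range(len(Y), len(X)):
                    for ext in combinations(sorted(X - Y), m - len(Y)):
                        W = Y | frozenset(ext)
                        val += (-1) ** (len(X - W) + 1) * sup(W, db)
                # |X \ Y| odd gives an upper bound, even gives a lower bound
                (UB if (len(X) - len(Y)) % 2 else LB).append(val)
        return max(LB), min(UB)

    db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
          4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}
    print(support_range("ACD", db))   # (1, 1): ACD is derivable
    print(support_range("AD", db))    # (2, 4): AD is nonderivable

The second call reproduces the entry AD[2,4] in the set N above.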
9.5 FURTHER READING
The concept of closed itemsets is based on the elegant lattice theoretic framework of formal concept analysis (Ganter, Wille, and Franzke, 1997). The Charm algorithm for mining frequent closed itemsets appears in Zaki and Hsiao (2005), and the GenMax method for mining maximal frequent itemsets is described in Gouda and Zaki (2005). For an Apriori style algorithm for maximal patterns, called MaxMiner, that uses very effective support lower bound based itemset pruning see Bayardo (1998). The notion of minimal generators was proposed in Bastide et al. (2000); they refer to them as key patterns. The nonderivable itemset mining task was introduced in Calders and Goethals (2007).
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., and Lakhal, L. (2000). "Mining frequent patterns with counting inference." ACM SIGKDD Explorations, 2(2): 66–75.
Bayardo, R. J., Jr. (1998). "Efficiently mining long patterns from databases." In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, pp. 85–93.
Calders, T. and Goethals, B. (2007). "Non-derivable itemset mining." Data Mining and Knowledge Discovery, 14(1): 171–206.
Ganter, B., Wille, R., and Franzke, C. (1997). Formal Concept Analysis: Mathematical Foundations. New York: Springer-Verlag.
Gouda, K. and Zaki, M. J. (2005). "GenMax: An efficient algorithm for mining maximal frequent itemsets." Data Mining and Knowledge Discovery, 11(3): 223–242.
Zaki, M. J. and Hsiao, C.-J. (2005). "Efficient algorithms for mining closed itemsets and their lattice structure." IEEE Transactions on Knowledge and Data Engineering, 17(4): 462–478.
9.6 EXERCISES
Q1. True or False:
(a) Maximal frequent itemsets are sufficient to determine all frequent itemsets with their supports.
(b) An itemset and its closure share the same set of transactions.
(c) The set of all maximal frequent sets is a subset of the set of all closed frequent itemsets.
(d) The set of all maximal frequent sets is the set of longest possible frequent itemsets.

Q2. Given the database in Table 9.1
(a) Show the application of the closure operator on AE, that is, compute c(AE). Is AE closed?
(b) Find all frequent, closed, and maximal itemsets using minsup = 2/6.

Q3. Given the database in Table 9.2, find all minimal generators using minsup = 1.
Table 9.1. Dataset for Q2

    Tid   Itemset
    t1    ACD
    t2    BCE
    t3    ABCE
    t4    BDE
    t5    ABCE
    t6    ABCD

Table 9.2. Dataset for Q3

    Tid   Itemset
    1     ACD
    2     BCD
    3     AC
    4     ABD
    5     ABCD
    6     BCD

Figure 9.7. Closed itemset lattice for Q4. (Closed itemsets with supports: B(8), ABD(6), BC(5), ABCD(3).)
Q4. Consider the frequent closed itemset lattice shown in Figure 9.7. Assume that the item space is I = {A, B, C, D, E}. Answer the following questions:
(a) What is the frequency of CD?
(b) Find all frequent itemsets and their frequency, for itemsets in the subset interval [B, ABD].
(c) Is ADE frequent? If yes, show its support. If not, why?
Q5. Let C be the set of all closed frequent itemsets and M the set of all maximal frequent itemsets for some database. Prove that M ⊆ C.

Q6. Prove that the closure operator c = i ◦ t satisfies the following properties (X and Y are some itemsets):
(a) Extensive: X ⊆ c(X)
(b) Monotonic: If X ⊆ Y then c(X) ⊆ c(Y)
(c) Idempotent: c(X) = c(c(X))
Table 9.3. Dataset for Q7

    Tid   Itemset
    1     ACD
    2     BCD
    3     ACD
    4     ABD
    5     ABCD
    6     BC
Q7. Let δ be an integer. An itemset X is called a δ-free itemset iff for all subsets Y ⊂ X, we have sup(Y) − sup(X) > δ. For any itemset X, we define the δ-closure of X as follows:

    δ-closure(X) = {Y | X ⊂ Y, sup(X) − sup(Y) ≤ δ, and Y is maximal}

Consider the database shown in Table 9.3. Answer the following questions:
(a) Given δ = 1, compute all the δ-free itemsets.
(b) For each of the δ-free itemsets, compute its δ-closure for δ = 1.
Q8. Given the lattice of frequent itemsets (along with their supports) shown in Figure 9.8, answer the following questions:
(a) List all the closed itemsets.
(b) Is BCD derivable? What about ABCD? What are the bounds on their supports?

Figure 9.8. Frequent itemset lattice for Q8, with supports: ∅(6); A(6), B(5), C(4), D(3); AB(5), AC(4), AD(3), BC(3), BD(2), CD(2); ABC(3), ABD(2), ACD(2), BCD(1); ABCD(1).

Q9. Prove that if an itemset X is derivable, then so is any superset Y ⊃ X. Using this observation describe an algorithm to mine all nonderivable itemsets.
CHAPTER 10
Sequence Mining
Many real-world applications such as bioinformatics, Web mining, and text mining have to deal with sequential and temporal data. Sequence mining helps discover patterns across time or positions in a given dataset. In this chapter we consider methods to mine frequent sequences, which allow gaps between elements, as well as methods to mine frequent substrings, which do not allow gaps between consecutive elements.
10.1 FREQUENT SEQUENCES

Let Σ denote an alphabet, defined as a finite set of characters or symbols, and let |Σ| denote its cardinality. A sequence or a string is defined as an ordered list of symbols, and is written as s = s1 s2 ... sk, where si ∈ Σ is a symbol at position i, also denoted as s[i]. Here |s| = k denotes the length of the sequence. A sequence with length k is also called a k-sequence. We use the notation s[i : j] = si si+1 ... sj−1 sj to denote the substring or sequence of consecutive symbols in positions i through j, where j > i. Define the prefix of a sequence s as any substring of the form s[1 : i] = s1 s2 ... si, with 0 ≤ i ≤ n, where n = |s|. Also, define the suffix of s as any substring of the form s[i : n] = si si+1 ... sn, with 1 ≤ i ≤ n + 1. Note that s[1 : 0] is the empty prefix, and s[n + 1 : n] is the empty suffix. Let Σ⋆ be the set of all possible sequences that can be constructed using the symbols in Σ, including the empty sequence ∅ (which has length zero).
Let s = s1 s2 ... sn and r = r1 r2 ... rm be two sequences over Σ. We say that r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping φ : [1, m] → [1, n], such that r[i] = s[φ(i)] and for any two positions i, j in r, i < j =⇒ φ(i) < φ(j). In other words, each position in r is mapped to a different position in s, and the order of symbols is preserved, even though there may be intervening gaps between consecutive elements of r in the mapping. If r ⊆ s, we also say that s contains r. The sequence r is called a consecutive subsequence or substring of s provided r1 r2 ... rm = sj sj+1 ... sj+m−1, i.e., r[1 : m] = s[j : j + m − 1], with 1 ≤ j ≤ n − m + 1. For substrings we do not allow any gaps between the elements of r in the mapping.
Example 10.1. Let Σ = {A, C, G, T}, and let s = ACTGAACG. Then r1 = CGAAG is a subsequence of s, and r2 = CTGA is a substring of s. The sequence r3 = ACT is a prefix of s, and so is r4 = ACTGA, whereas r5 = GAACG is one of the suffixes of s.
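These definitions translate directly into code. The following is a minimal Python sketch (the function names is_subsequence and is_substring are ours, not from the text); the strings come from Example 10.1:

def is_subsequence(r: str, s: str) -> bool:
    # r ⊆ s: map each symbol of r to a strictly later position of s
    pos = 0
    for symbol in r:
        pos = s.find(symbol, pos)   # next occurrence at or after pos
        if pos == -1:
            return False
        pos += 1                    # later symbols must map strictly after
    return True

def is_substring(r: str, s: str) -> bool:
    # consecutive subsequence: r must occur in s with no gaps
    return r in s

s = "ACTGAACG"                       # the sequence from Example 10.1
assert is_subsequence("CGAAG", s)    # r1 is a subsequence of s
assert is_substring("CTGA", s)       # r2 is a substring of s
assert not is_substring("CGAAG", s)  # r1 is not a substring of s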
Given a database D = {s1, s2, ..., sN} of N sequences, and given some sequence r, the support of r in the database D is defined as the total number of sequences in D that contain r:

    sup(r) = |{ si ∈ D | r ⊆ si }|

The relative support of r is the fraction of sequences that contain r:

    rsup(r) = sup(r) / N

Given a user-specified minsup threshold, we say that a sequence r is frequent in database D if sup(r) ≥ minsup. A frequent sequence is maximal if it is not a subsequence of any other frequent sequence, and a frequent sequence is closed if it is not a subsequence of any other frequent sequence with the same support.
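Support and relative support can then be computed with a single scan over the database. This small sketch reuses is_subsequence from above; the database is that of Table 10.1:

def support(r: str, D: list[str]) -> int:
    # sup(r) = number of sequences in D that contain r
    return sum(1 for s in D if is_subsequence(r, s))

D = ["CAGAAGT", "TGACAG", "GAAGT"]   # Table 10.1
print(support("GT", D))              # 2: GT occurs in s1 and s3 only
print(support("GT", D) / len(D))     # rsup(GT) = 2/3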
10.2 MINING FREQUENT SEQUENCES

For sequence mining the order of the symbols matters, and thus we have to consider all possible permutations of the symbols as the possible frequent candidates. Contrast this with itemset mining, where we had only to consider combinations of the items. The sequence search space can be organized in a prefix search tree. The root of the tree, at level 0, contains the empty sequence, with each symbol x ∈ Σ as one of its children. As such, a node labeled with the sequence s = s1 s2 ... sk at level k has children of the form s′ = s1 s2 ... sk sk+1 at level k + 1. In other words, s is a prefix of each child s′, which is also called an extension of s.
Example 10.2. Let Σ = {A, C, G, T} and let the sequence database D consist of the three sequences shown in Table 10.1. The sequence search space organized as a prefix search tree is illustrated in Figure 10.1. The support of each sequence is shown within brackets. For example, the node labeled A has three extensions AA, AG, and AT, out of which AT is infrequent if minsup = 3.
The subsequence search space is conceptually infinite because it comprises all sequences in Σ⋆, that is, all sequences of length zero or more that can be created using symbols in Σ. In practice, the database D consists of bounded length sequences. Let l denote the length of the longest sequence in the database; then, in the worst case, we will have to consider all candidate sequences of length up to l, which gives the following bound on the size of the search space:

    |Σ|^1 + |Σ|^2 + ··· + |Σ|^l = O(|Σ|^l)        (10.1)

since at level k there are |Σ|^k possible subsequences of length k.

Table 10.1. Example sequence database

Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
ALGORITHM 10.1. Algorithm GSP

GSP (D, Σ, minsup):
 1   F ← ∅
 2   C(1) ← {∅}   // initial prefix tree with single symbols
 3   foreach s ∈ Σ do add s as child of ∅ in C(1) with sup(s) ← 0
 4   k ← 1   // k denotes the level
 5   while C(k) ≠ ∅ do
 6       COMPUTESUPPORT (C(k), D)
 7       foreach leaf s ∈ C(k) do
 8           if sup(s) ≥ minsup then F ← F ∪ {(s, sup(s))}
 9           else remove s from C(k)
10       C(k+1) ← EXTENDPREFIXTREE (C(k))
11       k ← k + 1
12   return F

COMPUTESUPPORT (C(k), D):
13   foreach si ∈ D do
14       foreach r ∈ C(k) do
15           if r ⊆ si then sup(r) ← sup(r) + 1

EXTENDPREFIXTREE (C(k)):
16   foreach leaf ra ∈ C(k) do
17       foreach leaf rb ∈ CHILDREN(PARENT(ra)) do
18           rab ← ra + rb[k]   // extend ra with last item of rb
             // prune if there are any infrequent subsequences
19           if rc ∈ C(k) for all rc ⊂ rab such that |rc| = |rab| − 1 then
20               add rab as child of ra with sup(rab) ← 0
21       if no extensions from ra then
22           remove ra, and all ancestors of ra with no extensions, from C(k)
23   return C(k)
10.2.1 Level-wise Mining: GSP

We can devise an effective sequence mining algorithm that searches the sequence prefix tree using a level-wise or breadth-first search. Given the set of frequent sequences at level k, we generate all possible sequence extensions or candidates at level k + 1. We next compute the support of each candidate and prune those that are not frequent. The search stops when no more frequent extensions are possible.
Figure 10.1. Sequence search space as a prefix tree, with supports in brackets:
∅ (3)
  A (3): children AA (3), AG (3), AT (2)
    AA (3): children AAA (1), AAG (3)
      AAG (3): child AAGG
    AG (3): children AGA (1), AGG (1)
  C (2)
  G (3): children GA (3), GG (3), GT (2)
    GA (3): children GAA (3), GAG (3)
      GAA (3): children GAAA, GAAG (3)
      GAG (3): children GAGA, GAGG
    GG (3): children GGA (0), GGG (0)
  T (3): children TA (1), TG (1), TT (0)
Shaded ovals represent candidates that are infrequent; those without support in brackets can be pruned based on an infrequent subsequence. Unshaded ovals represent frequent sequences.
The pseudo-code for the level-wise, generalized sequential pattern (GSP) mining method is shown in Algorithm 10.1. It uses the antimonotonic property of support to prune candidate patterns, that is, no supersequence of an infrequent sequence can be frequent, and all subsequences of a frequent sequence must be frequent. The prefix search tree at level k is denoted C(k). Initially C(1) comprises all the symbols in Σ. Given the current set of candidate k-sequences C(k), the method first computes their support (line 6). For each database sequence si ∈ D, we check whether a candidate sequence r ∈ C(k) is a subsequence of si. If so, we increment the support of r. Once the frequent sequences at level k have been found, we generate the candidates for level k + 1 (line 10). For the extension, each leaf ra is extended with the last symbol of any other leaf rb that shares the same prefix (i.e., has the same parent), to obtain the new candidate (k + 1)-sequence rab = ra + rb[k] (line 18). If the new candidate rab contains any infrequent k-sequence, we prune it.
Example 10.3. For example, let us mine the database shown in Table 10.1 using minsup = 3. That is, we want to find only those subsequences that occur in all three database sequences. Figure 10.1 shows that we begin by extending the empty sequence ∅ at level 0, to obtain the candidates A, C, G, and T at level 1. Out of these C can be pruned because it is not frequent. Next we generate all possible candidates at level 2. Notice that using A as the prefix we generate all possible extensions AA, AG, and AT. A similar process is repeated for the other two symbols G and T. Some candidate extensions can be pruned without counting. For example, the extension GAAA obtained from GAA can be pruned because it has an infrequent subsequence AAA. The figure shows all the frequent sequences (unshaded), out of which GAAG (3) and T (3) are the maximal ones.
The computational complexity of GSP is O(|Σ|^l) as per Eq. (10.1), where l is the length of the longest frequent sequence. The I/O complexity is O(l · |D|) because we compute the support of an entire level in one scan of the database.
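The following compact Python sketch follows the spirit of Algorithm 10.1, but keeps each level as a flat set of candidate strings instead of an explicit prefix tree; is_subsequence is the checker sketched in Section 10.1, and all other names are ours.

def gsp(D: list[str], alphabet: list[str], minsup: int) -> dict[str, int]:
    F = {}                        # all frequent sequences with supports
    level = list(alphabet)        # candidate 1-sequences
    while level:
        # one database scan counts the support of every candidate
        counts = {r: sum(1 for s in D if is_subsequence(r, s)) for r in level}
        frequent = {r: c for r, c in counts.items() if c >= minsup}
        F.update(frequent)
        # join frequent k-sequences that share the same (k-1)-prefix
        candidates = set()
        for ra in frequent:
            for rb in frequent:   # includes the self-join ra == rb
                if ra[:-1] == rb[:-1]:
                    rab = ra + rb[-1]
                    # prune rab if any k-subsequence is infrequent
                    subs = (rab[:i] + rab[i + 1:] for i in range(len(rab)))
                    if all(rc in frequent for rc in subs):
                        candidates.add(rab)
        level = sorted(candidates)
    return F

D = ["CAGAAGT", "TGACAG", "GAAGT"]        # Table 10.1
print(gsp(D, ["A", "C", "G", "T"], 3))    # includes GAAG: 3 and T: 3

On Table 10.1 with minsup = 3 this reproduces exactly the unshaded (frequent) sequences of Figure 10.1.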
10.2.2 Vertical Sequence Mining: Spade

The Spade algorithm uses a vertical database representation: for each sequence X under consideration it keeps a poslist L(X), which records for each database sequence si that contains X a tuple ⟨i, pos(X)⟩, where pos(X) is the set of positions in si at which the last symbol of X occurs.
Even though there are two occurrences of GT in s1, the last symbol T occurs at position 7 in both occurrences, thus the poslist for GT has the tuple ⟨1, 7⟩. The full poslist for GT is L(GT) = {⟨1, 7⟩, ⟨3, 5⟩}. The support of GT is sup(GT) = |L(GT)| = 2.

Support computation in Spade is done via sequential join operations. Given the poslists for any two k-sequences ra and rb that share the same (k − 1) length prefix, the idea is to perform sequential joins on the poslists to compute the support for the new (k + 1) length candidate sequence rab = ra + rb[k]. Given a tuple ⟨i, pos(rb[k])⟩ ∈ L(rb), we first check if there exists a tuple ⟨i, pos(ra[k])⟩ ∈ L(ra), that is, both sequences must occur in the same database sequence si. Next, for each position p ∈ pos(rb[k]), we check whether there exists a position q ∈ pos(ra[k]) such that q < p. If yes, this means that the symbol rb[k] occurs after the last position of ra and thus we retain p as a valid occurrence of rab. The poslist L(rab) comprises all such valid occurrences. Notice how we keep track of positions only for the last symbol in the candidate sequence. This is because we extend sequences from a common prefix, so there is no need to keep track of all the occurrences of the symbols in the prefix. We denote the sequential join as L(rab) = L(ra) ∩ L(rb).
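A sketch of this sequential join in Python follows. A poslist is represented here as a dict from sequence id to the sorted positions of the last symbol; the poslists LA and LG are read off Table 10.1, and match the tuples quoted in Example 10.5. The names are ours.

def sequential_join(La: dict[int, list[int]],
                    Lb: dict[int, list[int]]) -> dict[int, list[int]]:
    Lab = {}
    for i, positions_b in Lb.items():
        if i not in La:
            continue            # both ra and rb must occur in sequence si
        # since La[i] is sorted, it suffices to compare against its minimum
        first_a = La[i][0]
        valid = [p for p in positions_b if p > first_a]
        if valid:
            Lab[i] = valid
    return Lab

# Poslists for A and G in Table 10.1 (positions are 1-based):
LA = {1: [2, 4, 5], 2: [3, 5], 3: [2, 3]}
LG = {1: [3, 6], 2: [2, 6], 3: [1, 4]}
LAG = sequential_join(LA, LG)
print(LAG)          # {1: [3, 6], 2: [6], 3: [4]}, so sup(AG) = len(LAG) = 3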
The main advantage of the vertical approach is that it enables different search
strategies over the sequence search space, including breadth or depth-first search.
Algorithm 10.2 shows the pseudo-code for Spade. Given a set of sequences P that share the same prefix, along with their poslists, the method creates a new prefix equivalence class Pa for each sequence ra ∈ P by performing sequential joins with every sequence rb ∈ P, including self-joins. After removing the infrequent extensions, the new equivalence class Pa is then processed recursively.
ALGORITHM 10.2. Algorithm SPADE

// Initial Call: F ← ∅, k ← 0, P ← { ⟨s, L(s)⟩ | s ∈ Σ, sup(s) ≥ minsup }

SPADE (P, minsup, F, k):
 1   foreach ra ∈ P do
 2       F ← F ∪ {(ra, sup(ra))}
 3       Pa ← ∅
 4       foreach rb ∈ P do
 5           rab = ra + rb[k]
 6           L(rab) = L(ra) ∩ L(rb)
 7           if sup(rab) ≥ minsup then
 8               Pa ← Pa ∪ {⟨rab, L(rab)⟩}
 9       if Pa ≠ ∅ then SPADE (Pa, minsup, F, k + 1)
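Using that join, the recursion of Algorithm 10.2 can be sketched in a few lines of Python; this reuses sequential_join and the poslists LA and LG from the previous sketch, and LT (our addition) is the poslist for T read off Table 10.1.

def spade(P, minsup, F, k):
    # P: list of (sequence, poslist) pairs sharing a common k-length prefix
    for ra, La in P:
        F.append((ra, len(La)))            # sup(ra) = number of ids in La
        Pa = []                            # new equivalence class for ra
        for rb, Lb in P:                   # includes the self-join rb == ra
            rab = ra + rb[k]               # extend ra with last symbol of rb
            Lab = sequential_join(La, Lb)
            if len(Lab) >= minsup:
                Pa.append((rab, Lab))
        if Pa:
            spade(Pa, minsup, F, k + 1)

LT = {1: [7], 2: [1], 3: [5]}              # poslist for T in Table 10.1
F = []
spade([("A", LA), ("G", LG), ("T", LT)], 3, F, 0)
print(F)    # the frequent sequences of Figure 10.1, each with support 3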
Example 10.5. Consider the poslists for A and G shown in Figure 10.2. To obtain L(AG), we perform a sequential join over the poslists L(A) and L(G). For the tuples ⟨1, {2, 4, 5}⟩ ∈ L(A) and ⟨1, {3, 6}⟩ ∈ L(G), both positions 3 and 6 for G occur after some occurrence of A, for example, at position 2. Thus, we add the tuple ⟨1, {3, 6}⟩ to L(AG). The complete poslist for AG is L(AG) = {⟨1, {3, 6}⟩, ⟨2, 6⟩, ⟨3, 4⟩}.

Figure 10.2 illustrates the complete working of the Spade algorithm, along with all the candidates and their poslists.
10.2.3 Projection-Based Sequence Mining: PrefixSpan

Let D denote a database, and let s ∈ Σ be any symbol. The projected database with respect to s, denoted Ds, is obtained by finding the first occurrence of s in si, say at position p. Next, we retain in Ds only the suffix of si starting at position p + 1. Further, any infrequent symbols are removed from the suffix. This is done for each sequence si ∈ D.
Example 10.6. Consider the three database sequences in Table 10.1. Given that the symbol G first occurs at position 3 in s1 = CAGAAGT, the projection of s1 with respect to G is the suffix AAGT. The projected database for G, denoted DG, is therefore given as: {s1: AAGT, s2: AAG, s3: AAGT}.
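The projection step can be sketched as follows; the function name and the explicit frequent parameter (the set of symbols considered frequent in the database being projected) are our own conventions.

def project(Dr: dict[str, str], symbol: str,
            frequent: set[str]) -> dict[str, str]:
    Ds = {}
    for sid, seq in Dr.items():
        p = seq.find(symbol)          # first occurrence of symbol in seq
        if p == -1:
            continue
        suffix = seq[p + 1:]          # keep only the suffix after position p
        suffix = "".join(c for c in suffix if c in frequent)
        if suffix:
            Ds[sid] = suffix
    return Ds

D = {"s1": "CAGAAGT", "s2": "TGACAG", "s3": "GAAGT"}   # Table 10.1
print(project(D, "G", {"A", "G", "T"}))
# {'s1': 'AAGT', 's2': 'AAG', 's3': 'AAGT'}, as in Example 10.6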
The main idea in PrefixSpan is to compute the support for only the individual symbols in the projected database Ds, and then to perform recursive projections on the frequent symbols in a depth-first manner. The PrefixSpan method is outlined in Algorithm 10.3. Here r is a frequent subsequence, and Dr is the projected dataset for r. Initially r is empty and Dr is the entire input dataset D. Given a database of (projected) sequences Dr, PrefixSpan first finds all the frequent symbols in the projected dataset. For each such symbol s, we extend r by appending s to obtain the new frequent subsequence rs. Next, we create the projected dataset Ds by projecting Dr on symbol s. A recursive call to PrefixSpan is then made with rs and Ds.
ALGORITHM 10.3. Algorithm PREFIXSPAN

// Initial Call: Dr ← D, r ← ∅, F ← ∅

PREFIXSPAN (Dr, r, minsup, F):
 1   foreach s ∈ Σ such that sup(s, Dr) ≥ minsup do
 2       rs = r + s   // extend r by symbol s
 3       F ← F ∪ {(rs, sup(s, Dr))}
 4       Ds ← ∅   // create projected data for symbol s
 5       foreach si ∈ Dr do
 6           s′i ← projection of si w.r.t. symbol s
 7           Remove any infrequent symbols from s′i
 8           Add s′i to Ds if s′i ≠ ∅
 9       if Ds ≠ ∅ then PREFIXSPAN (Ds, rs, minsup, F)
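Building on the project function sketched after Example 10.6, here is a recursive Python sketch in the spirit of Algorithm 10.3; with minsup = 3 on Table 10.1 it visits exactly the projected databases of Example 10.7.

def prefixspan(Dr: dict[str, str], r: str, minsup: int, F: list) -> None:
    counts = {}                            # per-symbol support in Dr
    for seq in Dr.values():
        for symbol in set(seq):
            counts[symbol] = counts.get(symbol, 0) + 1
    frequent = {x for x, c in counts.items() if c >= minsup}
    for s in sorted(frequent):
        rs = r + s                         # extend r by symbol s
        F.append((rs, counts[s]))
        Ds = project(Dr, s, frequent)      # project Dr on symbol s
        if Ds:
            prefixspan(Ds, rs, minsup, F)

F = []
prefixspan({"s1": "CAGAAGT", "s2": "TGACAG", "s3": "GAAGT"}, "", 3, F)
print(F)   # A, AA, AAG, AG, G, GA, GAA, GAAG, GAG, T, each with support 3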
Example 10.7. Figure 10.3 shows the projection-based PrefixSpan mining approach for the example dataset in Table 10.1 using minsup = 3. Initially we start with the whole database D, which can also be denoted as D∅. We compute the support of each symbol, and find that C is not frequent (shown crossed out). Among the frequent symbols, we first create a new projected dataset DA. For s1, we find that the first A occurs at position 2, so we retain only the suffix GAAGT. In s2, the first A occurs at position 3, so the suffix is CAG. After removing C (because it is infrequent), we are left with only AG as the projection of s2 on A. In a similar manner we obtain the projection for s3 as AGT. The left child of the root shows the final projected dataset DA. Now the mining proceeds recursively. Given DA, we count the symbol supports in DA, finding that only A and G are frequent, which will lead to the projection DAA and then DAG, and so on. The complete projection-based approach is illustrated in Figure 10.3.
Figure 10.3. Projection-based sequence mining: PrefixSpan. Each projected database is shown with its per-symbol supports:
D∅ = {s1: CAGAAGT, s2: TGACAG, s3: GAAGT}; A(3), C(2), G(3), T(3)
DA = {s1: GAAGT, s2: AG, s3: AGT}; A(3), G(3), T(2)
DAA = {s1: AG, s2: G, s3: G}; A(1), G(3)
DAAG = ∅
DAG = {s1: AAG}; A(1), G(1)
DG = {s1: AAGT, s2: AAG, s3: AAGT}; A(3), G(3), T(2)
DGA = {s1: AG, s2: AG, s3: AG}; A(3), G(3)
DGAA = {s1: G, s2: G, s3: G}; G(3)
DGAAG = ∅; DGAG = ∅; DGG = ∅
DT = {s2: GAAG}; A(1), G(1)
10.3 SUBSTRING MINING VIA SUFFIX TREES
We now look at efficient methods for mining frequent substrings. Let s be a sequence having length n; then there are at most O(n^2) possible distinct substrings contained in s. To see this, consider substrings of length w, of which there are n − w + 1 possible ones in s. Adding over all substring lengths we get

    ∑_{w=1}^{n} (n − w + 1) = n + (n − 1) + ··· + 2 + 1 = O(n^2)

This is a much smaller search space compared to subsequences, and consequently we can design more efficient algorithms for solving the frequent substring mining task. In fact, we can mine all the frequent substrings in worst case O(Nn^2) time for a dataset D = {s1, s2, ..., sN} with N sequences.
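A quick brute-force check of this count (our own illustration, not from the text):

def count_substrings(s: str) -> tuple[int, int]:
    # total occurrences: sum over w of (n - w + 1) = n(n + 1)/2
    n = len(s)
    total = sum(n - w + 1 for w in range(1, n + 1))
    # distinct substrings can be fewer, since some substrings repeat
    distinct = len({s[i:j] for i in range(n) for j in range(i + 1, n + 1)})
    return total, distinct

print(count_substrings("CAGAAGT"))   # (28, 24): A, G, and AG repeat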
10.3.1 Suffix Tree

Let Σ denote the alphabet, and let $ ∉ Σ be a terminal character used to mark the end of a string. Given a sequence s, we append the terminal character so that s = s1 s2 ... sn sn+1, where sn+1 = $, and the jth suffix of s is given as s[j : n + 1] = sj sj+1 ... sn+1. The suffix tree of the sequences in the database D, denoted T, stores all the suffixes for each si ∈ D in a tree structure, where suffixes that share a common prefix lie on the same path from the root of the tree. The substring obtained by concatenating all the symbols from the root node to a node v is called the node label of v, and is denoted as L(v). The substring that appears on an edge (va, vb) is called an edge label, and is denoted as L(va, vb). A suffix tree has two kinds of nodes: internal and leaf nodes. An internal node in the suffix tree (except for the root) has at least two children, where each edge label to a child begins with a different symbol. Because the terminal character is unique, there are as many leaves in the suffix tree as there are unique suffixes over all the sequences. Each leaf node corresponds to a suffix shared by one or more sequences in D.

It is straightforward to obtain a quadratic time and space suffix tree construction algorithm. Initially, the suffix tree T is empty. Next, for each sequence si ∈ D, with |si| = ni, we generate all its suffixes si[j : ni + 1], with 1 ≤ j ≤ ni, and insert each of them into the tree by following the path from the root until we either reach a leaf or there is a mismatch in one of the symbols along an edge. If we reach a leaf, we insert the pair (i, j) into the leaf, noting that this is the jth suffix of sequence si. If there is a mismatch in one of the symbols, say at position p ≥ j, we add an internal vertex just before the mismatch, and create a new leaf node containing (i, j) with edge label si[p : ni + 1].
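A direct Python sketch of this quadratic construction follows. Edge labels are stored explicitly (no edge compression yet), children are keyed by the first symbol of their edge label, and leaves collect the (i, j) pairs; all names here are ours.

class Node:
    def __init__(self):
        self.children = {}    # first symbol -> (edge label, child node)
        self.suffixes = []    # (i, j) pairs of suffixes ending here

def insert_suffix(root: "Node", suffix: str, i: int, j: int) -> None:
    v = root
    while True:
        c = suffix[0]
        if c not in v.children:            # no edge starts with c: new leaf
            leaf = Node()
            leaf.suffixes.append((i, j))
            v.children[c] = (suffix, leaf)
            return
        label, child = v.children[c]
        k = 0                              # length of the common prefix
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k == len(label):                # edge fully matched: descend
            suffix = suffix[k:]
            v = child
            if not suffix:                 # identical suffix seen before
                v.suffixes.append((i, j))  # (possible across sequences)
                return
        else:                              # mismatch inside the edge:
            mid = Node()                   # split it with an internal node
            mid.children[label[k]] = (label[k:], child)
            leaf = Node()
            leaf.suffixes.append((i, j))
            # the terminal $ guarantees that suffix[k] exists here
            mid.children[suffix[k]] = (suffix[k:], leaf)
            v.children[c] = (label[:k], mid)
            return

def build_suffix_tree(D: list[str]) -> "Node":
    root = Node()
    for i, s in enumerate(D, start=1):
        s = s + "$"                        # append terminal character
        for j in range(1, len(s)):         # insert suffixes j = 1, ..., n
            insert_suffix(root, s[j - 1:], i, j)
    return root

tree = build_suffix_tree(["CAGAAGT", "TGACAG", "GAAGT"])   # Table 10.1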
Example 10.8. Consider the database in Table 10.1 with three sequences. In particular, let us focus on s1 = CAGAAGT. Figure 10.4 shows what the suffix tree T looks like after inserting the jth suffix of s1 into T. The first suffix is the entire sequence s1 appended with the terminal symbol; thus the suffix tree contains a single leaf containing (1, 1) under the root (Figure 10.4a). The second suffix is AGAAGT$, and Figure 10.4b shows the resulting suffix tree, which now has two leaves.
Figure 10.4. Suffix tree construction: (a)–(g) show the successive changes to the tree, after we add the jth suffix of s1 = CAGAAGT$ for j = 1, ..., 7.
The third suffix GAAGT$ begins with G, which has not yet been observed, so it creates a new leaf in T under the root. The fourth suffix AAGT$ shares the prefix A with the second suffix, so it follows the path beginning with A from the root. However, because there is a mismatch at position 2, we create an internal node right before it and insert the leaf (1, 4), as shown in Figure 10.4d. The suffix tree obtained after inserting all of the suffixes of s1 is shown in Figure 10.4g, and the complete suffix tree for all three sequences is shown in Figure 10.5.
Figure 10.5. Suffix tree for all three sequences in Table 10.1. Internal nodes store support information. Leaves also record the support (not shown).
In terms of the time and space complexity, the algorithm sketched above requires O(Nn^2) time and space, where N is the number of sequences in D, and n is the longest sequence length. The time complexity follows from the fact that the method always inserts a new suffix starting from the root of the suffix tree. This means that in the worst case it compares O(n) symbols per suffix insertion, giving the worst case bound of O(n^2) over all n suffixes. The space complexity comes from the fact that each suffix is explicitly represented in the tree, taking n + (n − 1) + ··· + 1 = O(n^2) space. Over all the N sequences in the database, we obtain O(Nn^2) as the worst case time and space bounds.
Frequent Substrings
Once the suffix tree is built, we can compute all the frequent substrings by checking how many different sequences appear in a leaf node or under an internal node. The node labels for the nodes with support at least minsup yield the set of frequent substrings; all the prefixes of such node labels are also frequent. The suffix tree can also support ad hoc queries for finding all the occurrences in the database for any query substring q. For each symbol in q, we follow the path from the root until all symbols in q have been seen, or until there is a mismatch at any position. If q is found, then the set of leaves under that path is the list of occurrences of the query q. On the other hand, if there is a mismatch, that means the query does not occur in the database. In terms of the query time complexity, because we have to match each character in q, we immediately get O(|q|) as the time bound (assuming that |Σ| is a constant), which is independent of the size of the database. Listing all the matches takes additional time, for a total time complexity of O(|q| + k), if there are k matches.
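Ad hoc queries over the tree built in the earlier sketch can then be answered by walking edge labels and collecting the leaves below the match point; again, the function name is ours.

def find_occurrences(root: "Node", q: str) -> list[tuple[int, int]]:
    v, label_rest, remaining = root, "", q
    while remaining:
        if label_rest:                         # still inside an edge label
            m = min(len(label_rest), len(remaining))
            if label_rest[:m] != remaining[:m]:
                return []                      # mismatch: q does not occur
            label_rest, remaining = label_rest[m:], remaining[m:]
        else:
            entry = v.children.get(remaining[0])
            if entry is None:
                return []
            label_rest, v = entry              # v moves below this edge
    out, stack = [], [v]                       # q found: collect every
    while stack:                               # leaf at or under v
        node = stack.pop()
        out.extend(node.suffixes)
        stack.extend(child for _, child in node.children.values())
    return out

print(find_occurrences(tree, "GAA"))   # [(1, 3), (3, 1)] as in Example 10.9
print(find_occurrences(tree, "CAA"))   # []: mismatch on the CAG branch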
Example 10.9. Consider the suffix tree shown in Figure 10.5, which stores all the suffixes for the sequence database in Table 10.1. To facilitate frequent substring enumeration, we store the support for each internal as well as leaf node, that is, we store the number of distinct sequence ids that occur at or under each node. For example, the leftmost child of the root node on the path labeled A has support 3 because there are three distinct sequences under that subtree. If minsup = 3, then the frequent substrings are A, AG, G, GA, and T. Out of these, the maximal ones are AG, GA, and T. If minsup = 2, then the maximal frequent substrings are GAAGT and CAG.

For ad hoc querying consider q = GAA. Searching for symbols in q starting from the root leads to the leaf node containing the occurrences (1, 3) and (3, 1), which means that GAA appears at position 3 in s1 and at position 1 in s3. On the other hand, if q = CAA, then the search terminates with a mismatch at position 3 after following the branch labeled CAG from the root. This means that q does not occur in the database.
10.3.2 Ukkonen's Linear Time Algorithm

We now present a linear time and space algorithm for constructing suffix trees. We first consider how to build the suffix tree for a single sequence s = s1 s2 ... sn sn+1, with sn+1 = $. The suffix tree for the entire dataset of N sequences can be obtained by inserting each sequence one by one.
Achieving Linear Space
Let us see how to reduce the space requirements of a suffix tree. If an algorithm stores all the symbols on each edge label, then the space complexity is O(n^2), and we cannot achieve linear time construction either. The trick is to not explicitly store all the edge labels, but rather to use an edge-compression technique, where we store only the starting and ending positions of the edge label in the input string s. That is, if an edge label is given as s[i : j], then we represent it as the interval [i, j].
Example 10.10. Consider the suffix tree for s1 = CAGAAGT$ shown in Figure 10.4g. The edge label CAGAAGT$ for the suffix (1, 1) can be represented via the interval [1, 8] because the edge label denotes the substring s1[1 : 8]. Likewise, the edge label AAGT$ leading to suffix (1, 2) can be compressed as [4, 8] because AAGT$ = s1[4 : 8]. The complete suffix tree for s1 with compressed edge labels is shown in Figure 10.6.
In terms of space complexity, note that when we add a new suffix to the tree T, it can create at most one new internal node. As there are n suffixes, there are n leaves in T and at most n internal nodes. With at most 2n nodes, the tree has at most 2n − 1 edges, and thus the total space required to store an interval for each edge is 2(2n − 1) = 4n − 2 = O(n).
Figure 10.6. Suffix tree for s1 = CAGAAGT$ using edge-compression.
Achieving Linear Time
Ukkonen's method is an online algorithm, that is, given a string s = s1 s2 ... sn $ it constructs the full suffix tree in phases. Phase i builds the tree up to the i-th symbol in s, that is, it updates the suffix tree from the previous phase by adding the next symbol si. Let Ti denote the suffix tree up to the ith prefix s[1 : i], with 1 ≤ i ≤ n. Ukkonen's algorithm constructs Ti from Ti−1, by making sure that all suffixes including the current character si are in the new intermediate tree Ti. In other words, in the ith phase, it inserts all the suffixes s[j : i] from j = 1 to j = i into the tree Ti. Each such insertion is called the jth extension of the ith phase. Once we process the terminal character at position n + 1 we obtain the final suffix tree T for s.

Algorithm 10.4 shows the code for a naive implementation of Ukkonen's approach. This method has cubic time complexity because to obtain Ti from Ti−1 takes O(i^2) time, with the last phase requiring O(n^2) time. With n phases, the total time is O(n^3). Our goal is to show that this time can be reduced to just O(n) via the optimizations described in the following paragraphs.
Implicit Suffixes
This optimization states that, in phase i, if the jth extension s[j : i] is found in the tree, then any subsequent extensions will also be found, and consequently there is no need to process further extensions in phase i. Thus, the suffix tree Ti at the end of phase i has implicit suffixes corresponding to extensions j + 1 through i. It is important to note that all suffixes will become explicit the first time we encounter a new substring that does not already exist in the tree. This will surely happen in phase n + 1 when we process the terminal character $, as it cannot occur anywhere else in s (after all, $ ∉ Σ).
ALGORITHM 10.4. Algorithm NAIVEUKKONEN

NAIVEUKKONEN (s):
 1   n ← |s|
 2   s[n + 1] ← $   // append terminal character
 3   T ← ∅   // add empty string as root
 4   foreach i = 1, ..., n + 1 do   // phase i - construct Ti
 5       foreach j = 1, ..., i do   // extension j for phase i
             // Insert s[j : i] into the suffix tree
 6           Find end of the path with label s[j : i − 1] in T
 7           Insert si at end of path
 8   return T
Implicit Extensions
Let the current phase be i, and let l ≤ i − 1 be the last explicit suffix in the previous tree Ti−1. All explicit suffixes in Ti−1 have edge labels of the form [x, i − 1] leading to the corresponding leaf nodes, where the starting position x is node specific, but the ending position must be i − 1 because si−1 was added to the end of these paths in phase i − 1. In the current phase i, we would have to extend these paths by adding si at the end. However, instead of explicitly incrementing all the ending positions, we can replace the ending position by a pointer e which keeps track of the current phase being processed. If we replace [x, i − 1] with [x, e], then in phase i, if we set e = i, then immediately all the l existing suffixes get implicitly extended to [x, i]. Thus, in one operation of incrementing e we have, in effect, taken care of extensions 1 through l for phase i.
Example 10.11. Let s1 = CAGAAGT$. Assume that we have already performed the first six phases, which result in the tree T6 shown in Figure 10.7a. The last explicit suffix in T6 is l = 4. In phase i = 7 we have to execute the following extensions:

    CAGAAGT    extension 1
    AGAAGT     extension 2
    GAAGT      extension 3
    AAGT       extension 4
    AGT        extension 5
    GT         extension 6
    T          extension 7

At the start of the seventh phase, we set e = 7, which yields implicit extensions for all suffixes explicitly in the tree, as shown in Figure 10.7b. Notice how symbol s7 = T is now implicitly on each of the leaf edges, for example, the label [5, e] = AG in T6 now becomes [5, e] = AGT in T7. Thus, the first four extensions listed above are taken care of by simply incrementing e. To complete phase 7 we have to process the remaining extensions.
Figure 10.7. Implicit extensions in phase i = 7. The last explicit suffix in T6 is l = 4 (shown double-circled). Edge labels are shown for convenience; only the intervals are stored.
Skip/Count Trick
For the jth extension of phase i, we have to search for the substring s[j : i − 1] so that we can add si at the end. However, note that this string must exist in Ti−1 because we have already processed symbol si−1 in the previous phase. Thus, instead of searching for each character in s[j : i − 1] starting from the root, we first count the number of symbols on the edge beginning with character sj; let this length be m. If m is longer than the length of the substring (i.e., if m > i − j), then the substring must end on this edge, so we simply jump to position i − j and insert si. On the other hand, if m ≤ i − j, then we can skip directly to the child node, say vc, and search for the remaining string s[j + m : i − 1] from vc using the same skip/count technique. With this optimization, the cost of an extension becomes proportional to the number of nodes on the path, as opposed to the number of characters in s[j : i − 1].
Suffix Links
We saw that with the skip/count optimization we can search for the substring s[j : i − 1] by following nodes from parent to child. However, we still have to start from the root node each time. We can avoid searching from the root via the use of suffix links. For each internal node va we maintain a link to the internal node vb, where L(vb) is the immediate suffix of L(va). In extension j − 1, let vp denote the internal node under which we find s[j − 1 : i], and let m be the length of the node label of vp. To insert the jth extension s[j : i], we follow the suffix link from vp to another node, say vs, and search for the remaining substring s[j + m − 1 : i − 1] from vs. The use of suffix links allows us to jump internally within the tree for different extensions, as opposed to searching from the root each time. As a final observation, if extension j creates a new internal node, then its suffix link will point to the new internal node that will be created during extension j + 1.
ALGORITHM 10.5. Algorithm UKKONEN

UKKONEN (s):
 1   n ← |s|
 2   s[n + 1] ← $   // append terminal character
 3   T ← ∅   // add empty string as root
 4   l ← 0   // last explicit suffix
 5   foreach i = 1, ..., n + 1 do   // phase i - construct Ti
 6       e ← i   // implicit extensions
 7       foreach j = l + 1, ..., i do   // extension j for phase i
             // Insert s[j : i] into the suffix tree
 8           Find end of s[j : i − 1] in T via skip/count and suffix links
 9           if si ∈ T then   // implicit suffixes
10               break
11           else
12               Insert si at end of path
13               Set last explicit suffix l if needed
14   return T
The pseudo-code for the optimized Ukkonen's algorithm is shown in Algorithm 10.5. It is important to note that it achieves linear time and space only with all of the optimizations in conjunction, namely implicit extensions (line 6), implicit suffixes (line 9), and skip/count and suffix links for inserting extensions in T (line 8).
Example 10.12. Let us look at the execution of Ukkonen's algorithm on the sequence s1 = CAGAAGT$, as shown in Figure 10.8. In phase 1, we process character s1 = C and insert the suffix (1, 1) into the tree with edge label [1, e] (see Figure 10.8a). In phases 2 and 3, new suffixes (1, 2) and (1, 3) are added (see Figures 10.8b–10.8c). For phase 4, when we want to process s4 = A, we note that all suffixes up to l = 3 are already explicit. Setting e = 4 implicitly extends all of them, so we have only to make sure that the last extension (j = 4) consisting of the single character A is in the tree. Searching from the root, we find A in the tree implicitly, and we thus proceed to the next phase. In the next phase, we set e = 5, and the suffix (1, 4) becomes explicit when we try to add the extension AA, which is not in the tree. For e = 6, we find the extension AG already in the tree and we skip ahead to the next phase. At this point the last explicit suffix is still (1, 4). For e = 7, T is a previously unseen symbol, and so all suffixes will become explicit, as shown in Figure 10.8g.

It is instructive to see the extensions in the last phase (i = 7). As described in Example 10.11, the first four extensions will be done implicitly. Figure 10.9a shows the suffix tree after these four extensions.
Figure 10.8. Ukkonen's linear time algorithm for suffix tree construction. Steps (a)–(g) show the successive changes to the tree after the ith phase. The suffix links are shown with dashed lines. The double-circled leaf denotes the last explicit suffix in the tree. The last step is not shown because when e = 8, the terminal character $ will not alter the tree. All the edge labels are shown for ease of understanding, although the actual suffix tree keeps only the intervals for each edge.
For extension 5, we begin at the last explicit leaf, follow its parent's suffix link, and begin searching for the remaining characters from that point. In our example, the suffix link points to the root, so we search for s[5 : 7] = AGT from the root. We skip to node vA, and look for the remaining string GT, which has a mismatch inside the edge [3, e]. We thus create a new internal node after G, and insert the explicit suffix (1, 5), as shown in Figure 10.9b. The next extension s[6 : 7] = GT begins at the newly created leaf node (1, 5). Following the closest suffix link leads back to the root, and a search for GT gets a mismatch on the edge out of the root to leaf (1, 3). We then create a new internal node vG at that point, add a suffix link from the previous internal node vAG to vG, and add a new explicit leaf (1, 6), as shown in Figure 10.9c. The last extension, namely j = 7, corresponding to s[7 : 7] = T, results in making all the suffixes explicit because the symbol T has been seen for the first time. The resulting tree is shown in Figure 10.8g.
Figure 10.9. Extensions in phase i = 7. Initially the last explicit suffix is l = 4 and is shown double-circled. All the edge labels are shown for convenience; the actual suffix tree keeps only the intervals for each edge. Panels: (a) extensions 1–4; (b) extension 5: AGT; (c) extension 6: GT.
seen for the first time. The resulting tree is shown in Figure 10.8g.
Once
s
1
has been processed, we can then insert the remaining sequences in the
database
D
into the existing suffix tree. The final suffix tree for all three sequences
is shown in Figure 10.5, with additional suffix links (not shown) from all the internal
nodes.
Ukkonen's algorithm has time complexity of O(n) for a sequence of length n because it does only a constant amount of work (amortized) to make each suffix explicit. Note that, for each phase, a certain number of extensions are done implicitly just by incrementing e. Out of the i extensions from j = 1 to j = i, let us say that l are done implicitly. For the remaining extensions, we stop the first time some suffix is implicitly in the tree; let that extension be k. Thus, phase i needs to add explicit suffixes only for suffixes l + 1 through k − 1. For creating each explicit suffix, we perform a constant number of operations, which include following the closest suffix link, skip/counting to look for the first mismatch, and inserting if needed a new suffix leaf node. Because each leaf becomes explicit only once, and the number of skip/count steps are bounded by O(n) over the whole tree, we get a worst-case O(n) time algorithm. The total time over the entire database of N sequences is thus O(Nn), if n is the longest sequence length.
10.4 FURTHER READING

The level-wise GSP method for mining sequential patterns was proposed in Srikant and Agrawal (March 1996). Spade is described in Zaki (2001), and the PrefixSpan algorithm in Pei et al. (2004). Ukkonen's linear time suffix tree construction method appears in Ukkonen (1995). For an excellent introduction to suffix trees and their numerous applications see Gusfield (1997); the suffix tree description in this chapter has been heavily influenced by it.

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press.

Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. (2004). "Mining sequential patterns by pattern-growth: The PrefixSpan approach." IEEE Transactions on Knowledge and Data Engineering, 16(11): 1424–1440.

Srikant, R. and Agrawal, R. (March 1996). "Mining sequential patterns: Generalizations and performance improvements." In Proceedings of the 5th International Conference on Extending Database Technology. New York: Springer-Verlag.

Ukkonen, E. (1995). "On-line construction of suffix trees." Algorithmica, 14(3): 249–260.

Zaki, M. J. (2001). "SPADE: An efficient algorithm for mining frequent sequences." Machine Learning, 42(1–2): 31–60.
10.5 EXERCISES

Q1. Consider the database shown in Table 10.2. Answer the following questions:
(a) Let minsup = 4. Find all frequent sequences.
(b) Given that the alphabet is Σ = {A, C, G, T}, how many possible sequences of length k can there be?

Table 10.2. Sequence database for Q1

Id  Sequence
s1  AATACAAGAAC
s2  GTATGGTGAT
s3  AACATGGCCAA
s4  AAGCGTGGTCAA
Q2. Given the DNA sequence database in Table 10.3, answer the following questions using minsup = 4:
(a) Find the maximal frequent sequences.
(b) Find all the closed frequent sequences.
(c) Find the maximal frequent substrings.
(d) Show how Spade would work on this dataset.
(e) Show the steps of the PrefixSpan algorithm.

Table 10.3. Sequence database for Q2

Id  Sequence
s1  ACGTCACG
s2  TCGA
s3  GACTGCA
s4  CAGTC
s5  AGCT
s6  TGCAGCTC
s7  AGTCAG
Q3. Given s = AABBACBBAA, and Σ = {A, B, C}. Define support as the number of occurrences of a subsequence in s. Using minsup = 2, answer the following questions:
(a) Show how the vertical Spade method can be extended to mine all frequent substrings (consecutive subsequences) in s.
(b) Construct the suffix tree for s using Ukkonen's method. Show all intermediate steps, including all suffix links.
(c) Using the suffix tree from the previous step, find all the occurrences of the query q = ABBA allowing for at most two mismatches.
(d) Show the suffix tree when we add another character A just before the $. That is, you must undo the effect of adding the $, add the new symbol A, and then add $ back again.
(e) Describe an algorithm to extract all the maximal frequent substrings from a suffix tree. Show all maximal frequent substrings in s.
Q4. Consider a bitvector-based approach for mining frequent subsequences. For instance, in Table 10.2, for s1, the symbol C occurs at positions 5 and 11. Thus, the bitvector for C in s1 is given as 00001000001. Because C does not appear in s2, its bitvector can be omitted for s2. The complete set of bitvectors for symbol C is

    (s1, 00001000001)
    (s3, 00100001100)
    (s4, 000100000100)

Given the set of bitvectors for each symbol, show how we can mine all frequent subsequences by using bit operations on the bitvectors. Show the frequent subsequences and their bitvectors using minsup = 4.
Q5. Consider the database shown in Table 10.4. Each sequence comprises itemset events that happen at the same time. For example, sequence s1 can be considered to be a sequence of itemsets (AB)10 (B)20 (AB)30 (AC)40, where symbols within brackets are considered to co-occur at the same time, which is given in the subscripts. Describe an algorithm that can mine all the frequent subsequences over itemset events. The itemsets can be of any length as long as they are frequent. Find all frequent itemset sequences with minsup = 3.

Table 10.4. Sequences for Q5

Id  Time  Items
s1  10    A, B
    20    B
    30    A, B
    40    A, C
s2  20    A, C
    30    A, B, C
    50    B
s3  10    A
    30    B
    40    A
    50    C
    60    B
s4  30    A, B
    40    A
    50    B
    60    C
Q6. The suffix tree shown in Figure 10.5 contains all suffixes for the three sequences s1, s2, s3 in Table 10.1. Note that a pair (i, j) in a leaf denotes the jth suffix of sequence si.
(a) Add a new sequence s4 = GAAGCAGAA to the existing suffix tree, using the Ukkonen algorithm. Show the last character position (e), along with the suffixes (l) as they become explicit in the tree for s4. Show the final suffix tree after all suffixes of s4 have become explicit.
(b) Find all closed frequent substrings with minsup = 2 using the final suffix tree.
Q7. Given the following three sequences:

    s1: GAAGT
    s2: CAGAT
    s3: ACGT

Find all the frequent subsequences with minsup = 2, but allowing at most a gap of 1 position between successive sequence elements.
CHAPTER 11
Graph Pattern Mining
Graph data is becoming increasingly more ubiquitous in today's networked world. Examples include social networks as well as cell phone networks and blogs. The Internet is another example of graph data, as is the hyperlinked structure of the World Wide Web (WWW). Bioinformatics, especially systems biology, deals with understanding interaction networks between various types of biomolecules, such as protein–protein interactions, metabolic networks, gene networks, and so on. Another prominent source of graph data is the Semantic Web, and linked open data, with graphs represented using the Resource Description Framework (RDF) data model.

The goal of graph mining is to extract interesting subgraphs from a single large graph (e.g., a social network), or from a database of many graphs. In different applications we may be interested in different kinds of subgraph patterns, such as subtrees, complete graphs or cliques, bipartite cliques, dense subgraphs, and so on. These may represent, for example, communities in a social network, hub and authority pages on the WWW, clusters of proteins involved in similar biochemical functions, and so on. In this chapter we outline methods to mine all the frequent subgraphs that appear in a database of graphs.
11.1 ISOMORPHISM AND SUPPORT

A graph is a pair G = (V, E) where V is a set of vertices, and E ⊆ V × V is a set of edges. We assume that edges are unordered, so that the graph is undirected. If (u, v) is an edge, we say that u and v are adjacent and that v is a neighbor of u, and vice versa. The set of all neighbors of u in G is given as N(u) = {v ∈ V | (u, v) ∈ E}. A labeled graph has labels associated with its vertices as well as edges. We use L(u) to denote the label of the vertex u, and L(u, v) to denote the label of the edge (u, v), with the set of vertex labels denoted as ΣV and the set of edge labels as ΣE. Given an edge (u, v) ∈ G, the tuple ⟨u, v, L(u), L(v), L(u, v)⟩ that augments the edge with the node and edge labels is called an extended edge.
Example 11.1. Figure 11.1a shows an example of an unlabeled graph, whereas Figure 11.1b shows the same graph, with labels on the vertices, taken from the vertex label set ΣV = {a, b, c, d}. In this example, edges are all assumed to be unlabeled, and therefore edge labels are not shown. Considering Figure 11.1b, the label of vertex v4 is L(v4) = a, and its neighbors are N(v4) = {v1, v2, v3, v5, v7, v8}. The edge (v4, v1) leads to the extended edge ⟨v4, v1, a, a⟩, where we omit the edge label L(v4, v1) because it is empty.

Figure 11.1. An unlabeled (a) and labeled (b) graph with eight vertices v1, ..., v8; in (b) the vertex labels are L(v1) = a, L(v2) = c, L(v3) = b, L(v4) = a, L(v5) = d, L(v6) = c, L(v7) = b, L(v8) = c.
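These definitions can be captured with a small data structure; the class and method names below are our own, and the usage mirrors Example 11.1.

class LabeledGraph:
    def __init__(self):
        self.vlabel = {}   # vertex -> label L(u)
        self.elabel = {}   # frozenset({u, v}) -> label L(u, v)
        self.adj = {}      # vertex -> set of neighbors N(u)

    def add_vertex(self, u, label):
        self.vlabel[u] = label
        self.adj.setdefault(u, set())

    def add_edge(self, u, v, label=None):  # undirected, optionally labeled
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.elabel[frozenset((u, v))] = label

    def extended_edge(self, u, v):
        # the tuple ⟨u, v, L(u), L(v), L(u, v)⟩ for an edge (u, v)
        return (u, v, self.vlabel[u], self.vlabel[v],
                self.elabel[frozenset((u, v))])

G = LabeledGraph()
G.add_vertex("v4", "a"); G.add_vertex("v1", "a")
G.add_edge("v4", "v1")
print(G.extended_edge("v4", "v1"))   # ('v4', 'v1', 'a', 'a', None)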
Subgraphs
A graph G′ = (V′, E′) is said to be a subgraph of G if V′ ⊆ V and E′ ⊆ E. Note that this definition allows for disconnected subgraphs. However, typically data mining applications call for connected subgraphs, defined as a subgraph G′ such that V′ ⊆ V, E′ ⊆ E, and for any two nodes u, v ∈ V′, there exists a path from u to v in G′.
Example 11.2. The graph defined by the bold edges in Figure 11.2a is a subgraph of the larger graph; it has vertex set V′ = {v1, v2, v4, v5, v6, v8}. However, it is a disconnected subgraph. Figure 11.2b shows an example of a connected subgraph on the same vertex set V′.
Graph and Subgraph Isomorphism
A graph G′ = (V′, E′) is said to be isomorphic to another graph G = (V, E) if there exists a bijective function φ : V′ → V, i.e., both injective (into) and surjective (onto), such that
1. (u, v) ∈ E′ ⇐⇒ (φ(u), φ(v)) ∈ E
2. ∀u ∈ V′, L(u) = L(φ(u))
3. ∀(u, v) ∈ E′, L(u, v) = L(φ(u), φ(v))
In other words, the isomorphism φ preserves the edge adjacencies as well as the vertex and edge labels. Put differently, the extended tuple ⟨u, v, L(u), L(v), L(u, v)⟩ ∈ G′ if and only if ⟨φ(u), φ(v), L(φ(u)), L(φ(v)), L(φ(u), φ(v))⟩ ∈ G.
Figure 11.2. A subgraph (a) and connected subgraph (b), shown via bold edges on the labeled graph of Figure 11.1b.
Figure 11.3. Graph and subgraph isomorphism. Four labeled graphs are shown: G1 on vertices u1(a), u2(a), u3(b), u4(b); G2 on vertices v1(a), v2(b), v3(a), v4(b); G3 on vertices w1(a), w2(a), w3(b); and G4 on vertices x1(b), x2(a), x3(b).
If the function φ is only injective but not surjective, we say that the mapping φ is a subgraph isomorphism from G′ to G. In this case, we say that G′ is isomorphic to a subgraph of G, that is, G′ is subgraph isomorphic to G, denoted G′ ⊆ G; we also say that G contains G′.
Example 11.3. In Figure 11.3, G1 = (V1, E1) and G2 = (V2, E2) are isomorphic graphs. There are several possible isomorphisms between G1 and G2. An example of an isomorphism φ : V2 → V1 is

    φ(v1) = u1    φ(v2) = u3    φ(v3) = u2    φ(v4) = u4

The inverse mapping φ−1 specifies the isomorphism from G1 to G2. For example, φ−1(u1) = v1, φ−1(u2) = v3, and so on. The set of all possible isomorphisms from G2 to G1 are as follows:

         v1  v2  v3  v4
    φ1   u1  u3  u2  u4
    φ2   u1  u4  u2  u3
    φ3   u2  u3  u1  u4
    φ4   u2  u4  u1  u3

The graph G3 is subgraph isomorphic to both G1 and G2. The set of all possible subgraph isomorphisms from G3 to G1 are as follows:

         w1  w2  w3
    φ1   u1  u2  u3
    φ2   u1  u2  u4
    φ3   u2  u1  u3
    φ4   u2  u1  u4

The graph G4 is not subgraph isomorphic to either G1 or G2, and it is also not isomorphic to G3 because the extended edge ⟨x1, x3, b, b⟩ has no possible mappings in G1, G2, or G3.
Subgraph Support
Given a database of graphs, D = {G1, G2, ..., Gn}, and given some graph G, the support of G in D is defined as follows:

    sup(G) = |{ Gi ∈ D | G ⊆ Gi }|

The support is simply the number of graphs in the database that contain G. Given a minsup threshold, the goal of graph mining is to mine all frequent connected subgraphs with sup(G) ≥ minsup.
To mine all the frequent subgraphs, one has to search over the space of all possible graph patterns, which is exponential in size. If we consider subgraphs with m vertices, then there are C(m, 2) = m(m − 1)/2 = O(m^2) possible edges. The number of possible subgraphs with m nodes is then O(2^C(m,2)) because we may decide either to include or exclude each of the edges. Many of these subgraphs will not be connected, but O(2^C(m,2)) is a convenient upper bound. When we add labels to the vertices and edges, the number of labeled graphs will be even more. Assume that |ΣV| = |ΣE| = s; then there are s^m possible ways to label the vertices and there are s^C(m,2) ways to label the edges. Thus, the number of possible labeled subgraphs with m vertices is 2^C(m,2) · s^m · s^C(m,2) = O((2s)^(m^2)). This is the worst case bound, as many of these subgraphs will be isomorphic to each other, with the number of distinct subgraphs being much less. Nevertheless, the search space is still enormous because we typically have to search for all subgraphs ranging from a single vertex to some maximum number of vertices given by the largest frequent subgraph.
There are two main challenges in frequent subgraph mining. The first is to systematically generate candidate subgraphs. We use edge-growth as the basic mechanism for extending the candidates. The mining process proceeds in a breadth-first (level-wise) or a depth-first manner, starting with an empty subgraph (i.e., with no edge), and adding a new edge each time. Such an edge may either connect two existing vertices in the graph or it may introduce a new vertex as one end of a new edge. The key is to perform nonredundant subgraph enumeration, such that we do not generate the same graph candidate more than once. This means that we have to perform graph isomorphism checking to make sure that duplicate graphs are removed. The second challenge is to count the support of a graph in the database. This involves subgraph isomorphism checking, as we have to find the set of graphs that contain a given candidate.
11.2 CANDIDATE GENERATION

An effective strategy to enumerate subgraph patterns is the so-called rightmost path extension. Given a graph G, we perform a depth-first search (DFS) over its vertices, and create a DFS spanning tree, that is, one that covers or spans all the vertices. Edges that are included in the DFS tree are called forward edges, and all other edges are called backward edges. Backward edges create cycles in the graph. Once we have a DFS tree, define the rightmost path as the path from the root to the rightmost leaf, that is, to the leaf with the highest index in the DFS order.
Example 11.4. Consider the graph shown in Figure 11.4a. One of the possible DFS spanning trees is shown in Figure 11.4b (illustrated via bold edges), obtained by starting at v1 and then choosing the vertex with the smallest index at each step. Figure 11.5 shows the same graph (ignoring the dashed edges), rearranged to emphasize the DFS tree structure. For instance, the edges (v1, v2) and (v2, v3) are examples of forward edges, whereas (v3, v1), (v4, v1), and (v6, v1) are all backward edges. The bold edges (v1, v5), (v5, v7) and (v7, v8) comprise the rightmost path.
For generating new candidates from a given graph G, we extend it by adding a new edge to vertices only on the rightmost path. We can either extend G by adding backward edges from the rightmost vertex to some other vertex on the rightmost path (disallowing self-loops or multi-edges), or we can extend G by adding forward edges from any of the vertices on the rightmost path. A backward extension does not add a new vertex, whereas a forward extension adds a new vertex.

For systematic candidate generation we impose a total order on the extensions, as follows: First, we try all backward extensions from the rightmost vertex, and then we try forward extensions from vertices on the rightmost path. Among the backward edge extensions, if ur is the rightmost vertex, the extension (ur, vi) is tried before (ur, vj) if i < j. In other words, backward extensions closer to the root are considered before those farther away from the root along the rightmost path. Among the forward edge extensions, if vx is the new vertex to be added, the extension (vi, vx) is tried before (vj, vx) if i > j. In other words, the vertices farther from the root (those at greater depth) are extended before those closer to the root. Also note that the new vertex will be numbered x = r + 1, as it will become the new rightmost vertex after the extension.
Figure 11.4. A graph (a) and a possible depth-first spanning tree (b), on vertices v1(a), v2(a), v3(b), v4(c), v5(c), v6(d), v7(a), v8(b).
Figure 11.5. Rightmost path extensions. The bold path is the rightmost path in the DFS tree. The rightmost vertex is v8, shown double-circled. Solid black lines (thin and bold) indicate the forward edges, which are part of the DFS tree. The backward edges, which by definition are not part of the DFS tree, are shown in gray. The set of possible extensions on the rightmost path are shown with dashed lines, numbered #1 through #6 in precedence order.
Example 11.5. Consider the order of extensions shown in Figure 11.5. Node v8 is the rightmost vertex; thus we try backward extensions only from v8. The first extension, denoted #1 in Figure 11.5, is the backward edge (v8, v1) connecting v8 to the root, and the next extension is (v8, v5), denoted #2, which is also backward. No other backward extensions are possible without introducing multiple edges between the same pair of vertices. The forward extensions are tried in reverse order, starting from the rightmost vertex v8 (extension denoted as #3) and ending at the root (extension denoted as #6). Thus, the forward extension (v8, vx), denoted #3, comes before the forward extension (v7, vx), denoted #4, and so on.
11.2.1 Canonical Code

When generating candidates using rightmost path extensions, it is possible that duplicate, that is, isomorphic, graphs are generated via different extensions. Among the isomorphic candidates, we need to keep only one for further extension, whereas the others can be pruned to avoid redundant computation. The main idea is that if we can somehow sort or rank the isomorphic graphs, we can pick the canonical representative, say the one with the least rank, and extend only that graph.
Figure 11.6. Canonical DFS code. G1 is canonical, whereas G2 and G3 are noncanonical. Vertex label set ΣV = {a, b}, and edge label set ΣE = {q, r}. The vertices are numbered in DFS order. The DFS codes are:

DFScode(G1): t11 = ⟨v1, v2, a, a, q⟩, t12 = ⟨v2, v3, a, a, r⟩, t13 = ⟨v3, v1, a, a, r⟩, t14 = ⟨v2, v4, a, b, r⟩
DFScode(G2): t21 = ⟨v1, v2, a, a, q⟩, t22 = ⟨v2, v3, a, b, r⟩, t23 = ⟨v2, v4, a, a, r⟩, t24 = ⟨v4, v1, a, a, r⟩
DFScode(G3): t31 = ⟨v1, v2, a, a, q⟩, t32 = ⟨v2, v3, a, a, r⟩, t33 = ⟨v3, v1, a, a, r⟩, t34 = ⟨v1, v4, a, b, r⟩
Let G be a graph and let TG be a DFS spanning tree for G. The DFS tree TG defines an ordering of both the nodes and edges in G. The DFS node ordering is obtained by numbering the nodes consecutively in the order they are visited in the DFS walk. We assume henceforth that for a pattern graph G the nodes are numbered according to their position in the DFS ordering, so that i < j implies that vi comes before vj in the DFS walk. The DFS edge ordering is obtained by following the edges between consecutive nodes in DFS order, with the condition that all the backward edges incident with vertex vi are listed before any of the forward edges incident with it. The DFScode for a graph G, for a given DFS tree TG, denoted DFScode(G), is defined as the sequence of extended edge tuples of the form ⟨vi, vj, L(vi), L(vj), L(vi, vj)⟩ listed in the DFS edge order.
Example 11.6. Figure 11.6 shows the DFS codes for three graphs, which are all isomorphic to each other. The graphs have node and edge labels drawn from the label sets $\{a, b\}$ and $\{q, r\}$. The edge labels are shown centered on the edges. The bold edges comprise the DFS tree for each graph. For $G_1$, the DFS node ordering is $v_1, v_2, v_3, v_4$, whereas the DFS edge ordering is $(v_1, v_2)$, $(v_2, v_3)$, $(v_3, v_1)$, and $(v_2, v_4)$. Based on the DFS edge ordering, the first tuple in the DFS code for $G_1$ is therefore $\langle v_1, v_2, a, a, q\rangle$. The next tuple is $\langle v_2, v_3, a, a, r\rangle$ and so on. The DFS code for each graph is shown in the corresponding box below the graph.
Canonical DFS Code
A subgraph is canonical if it has the smallest DFS code among all possible isomorphic graphs, with the ordering between codes defined as follows. Let $t_1$ and $t_2$ be any two DFS code tuples:
$$t_1 = \langle v_i, v_j, L(v_i), L(v_j), L(v_i, v_j)\rangle$$
$$t_2 = \langle v_x, v_y, L(v_x), L(v_y), L(v_x, v_y)\rangle$$
We say that $t_1$ is smaller than $t_2$, written $t_1 < t_2$, if and only if (i) $e_{ij} <_e e_{xy}$, or (ii) $e_{ij} = e_{xy}$ and $\langle L(v_i), L(v_j), L(v_i, v_j)\rangle <_l \langle L(v_x), L(v_y), L(v_x, v_y)\rangle$, where $e_{ij} = (v_i, v_j)$ and $e_{xy} = (v_x, v_y)$ are the corresponding edges, and $<_l$ denotes the standard lexicographic ordering on the node and edge labels. The edge ordering $<_e$ is defined via the following conditions:
Condition (1) If $e_{ij}$ and $e_{xy}$ are both forward edges, then (a) $j < y$, or (b) $j = y$ and $i > x$. That is, (a) a forward extension to a node earlier in the DFS node order is smaller, or (b) if both the forward edges point to a node with the same DFS node order, then the forward extension from a node deeper in the tree is smaller.
Condition (2) If $e_{ij}$ and $e_{xy}$ are both backward edges, then (a) $i < x$, or (b) $i = x$ and $j < y$. That is, (a) a backward edge from a node earlier in the DFS node order is smaller, or (b) if both the backward edges originate from a node with the same DFS node order, then the backward edge to a node earlier in DFS node order (i.e., closer to the root along the rightmost path) is smaller.
Condition (3) If $e_{ij}$ is a forward and $e_{xy}$ is a backward edge, then $j \leq x$. That is, a forward edge to a node earlier in the DFS node order is smaller than a backward edge from that node or any node that comes after it in DFS node order.
Condition (4) If $e_{ij}$ is a backward and $e_{xy}$ is a forward edge, then $i < y$. That is, a backward edge from a node earlier in DFS node order is smaller than a forward edge to any later node.
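The tuple ordering lends itself to a direct implementation. The following is a minimal Python sketch of the comparison, assuming tuples are stored as (i, j, l_i, l_j, l_ij) with integer DFS node numbers; the function names are illustrative, not from any particular library.

```python
# Sketch of the DFS code tuple ordering: conditions (1)-(4) for the edge
# order <_e, followed by the lexicographic comparison on the labels.
# An edge (i, j) is forward iff i < j.

def edge_smaller(e1, e2):
    """Return True if edge e1 = (i, j) precedes e2 = (x, y) under <_e."""
    (i, j), (x, y) = e1, e2
    fwd1, fwd2 = i < j, x < y
    if fwd1 and fwd2:                      # condition (1)
        return j < y or (j == y and i > x)
    if not fwd1 and not fwd2:              # condition (2)
        return i < x or (i == x and j < y)
    if fwd1 and not fwd2:                  # condition (3)
        return j <= x
    return i < y                           # condition (4)

def tuple_smaller(t1, t2):
    """Return True if DFS code tuple t1 < t2."""
    e1, e2 = (t1[0], t1[1]), (t2[0], t2[1])
    if e1 == e2:
        return t1[2:] < t2[2:]             # lexicographic on the labels
    return edge_smaller(e1, e2)
```

For instance, with the codes of Figure 11.6, tuple_smaller((2,3,'a','a','r'), (2,3,'a','b','r')) is True via the label comparison, matching $t_{12} < t_{22}$ in Example 11.7.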
Given any two DFS codes, we can compare them tuple by tuple to check which is smaller. In particular, the canonical DFS code for a graph $G$ is defined as follows:
$$C = \min_{G'} \big\{ \text{DFScode}(G') \mid G' \text{ is isomorphic to } G \big\}$$
Given a candidate subgraph $G$, we can first determine whether its DFS code is canonical or not. Only canonical graphs need to be retained for extension, whereas noncanonical candidates can be removed from further consideration.
Example 11.7. Consider the DFS codes for the three graphs shown in Figure 11.6. Comparing $G_1$ and $G_2$, we find that $t_{11} = t_{21}$, but $t_{12} < t_{22}$ because $\langle a, a, r\rangle <_l \langle a, b, r\rangle$. Comparing the codes for $G_1$ and $G_3$, we find that the first three tuples are equal for both the graphs, but $t_{14} < t_{34}$ because $(v_i, v_j) = (v_2, v_4) <_e (v_1, v_4) = (v_x, v_y)$ due to condition (1) above. That is, both are forward edges, and we have $v_j = v_4 = v_y$ with $v_i = v_2 > v_1 = v_x$. In fact, it can be shown that the code for $G_1$ is the canonical DFS code for all graphs isomorphic to $G_1$. Thus, $G_1$ is the canonical candidate.
11.3 THE GSPAN ALGORITHM

We describe the gSpan algorithm to mine all frequent subgraphs from a database of graphs. Given a database $D = \{G_1, G_2, \ldots, G_n\}$ comprising $n$ graphs, and given a minimum support threshold $minsup$, the goal is to enumerate all (connected) subgraphs $G$ that are frequent, that is, $sup(G) \geq minsup$. In gSpan, each graph is represented by its canonical DFS code, so that the task of enumerating frequent subgraphs is equivalent to the task of generating all canonical DFS codes for frequent subgraphs. Algorithm 11.1 shows the pseudo-code for gSpan.
gSpan enumerates patterns in a depth-first manner, starting with the empty code. Given a canonical and frequent code $C$, gSpan first determines the set of possible edge extensions along the rightmost path (line 1). The function RIGHTMOSTPATH-EXTENSIONS returns the set of edge extensions along with their support values, $E$. Each extended edge $t$ in $E$ leads to a new candidate DFS code $C' = C \cup \{t\}$, with support $sup(C') = sup(t)$ (lines 3-4). For each new candidate code, gSpan checks whether it is frequent and canonical, and if so gSpan recursively extends $C'$ (lines 5-6). The algorithm stops when there are no more frequent and canonical extensions possible.
ALGORITHM 11.1. Algorithm GSPAN

// Initial Call: C ← ∅
GSPAN (C, D, minsup):
1   E ← RIGHTMOSTPATH-EXTENSIONS(C, D)  // extensions and supports
2   foreach (t, sup(t)) ∈ E do
3       C′ ← C ∪ t  // extend the code with extended edge tuple t
4       sup(C′) ← sup(t)  // record the support of new extension
        // recursively call gSpan if code is frequent and canonical
5       if sup(C′) ≥ minsup and ISCANONICAL(C′) then
6           GSPAN(C′, D, minsup)
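The recursion is compact enough to sketch directly in code. The following Python skeleton mirrors Algorithm 11.1; the helpers rightmost_path_extensions and is_canonical are assumed to implement the routines described in the following subsections, and codes are kept as lists of extended edge tuples.

```python
# Sketch of the gSpan recursion. rightmost_path_extensions(C, D) is assumed
# to yield (tuple, support) pairs in sorted tuple order; is_canonical(C) is
# assumed to test canonicality as in Section 11.2.1.

def gspan(C, D, minsup, results):
    """Depth-first enumeration of frequent canonical DFS codes."""
    for t, sup_t in rightmost_path_extensions(C, D):
        C_new = C + [t]                      # extend code with edge tuple t
        if sup_t >= minsup and is_canonical(C_new):
            results.append((C_new, sup_t))   # record the frequent pattern
            gspan(C_new, D, minsup, results) # recursively extend C'

# Initial call, mirroring C <- empty set:
# results = []; gspan([], D, minsup, results)
```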
Figure 11.7. Example graph database. Graph $G_1$ consists of nodes 10, 20, 30, and 40, labeled $a$, $b$, $a$, and $b$, respectively; graph $G_2$ consists of nodes 50, 60, 70, and 80, labeled $b$, $a$, $b$, and $a$.
Example 11.8. Consider the example graph database comprising $G_1$ and $G_2$ shown in Figure 11.7. Let $minsup = 2$, that is, assume that we are interested in mining subgraphs that appear in both the graphs in the database. For each graph the node labels and node numbers are both shown; for example, the node $a^{10}$ in $G_1$ means that node 10 has label $a$.

Figure 11.8 shows the candidate patterns enumerated by gSpan. For each candidate the nodes are numbered in the DFS tree order. The solid boxes show frequent subgraphs, whereas the dotted boxes show the infrequent ones. The dashed boxes represent noncanonical codes. Subgraphs that do not occur even once are not shown. The figure also shows the DFS codes and their corresponding graphs.
The mining process begins with the empty DFS code $C_0$ corresponding to the empty subgraph. The set of possible 1-edge extensions comprises the new set of candidates. Among these, $C_3$ is pruned because it is not canonical (it is isomorphic to $C_2$), whereas $C_4$ is pruned because it is not frequent. The remaining two candidates, $C_1$ and $C_2$, are both frequent and canonical, and are thus considered for further extension. The depth-first search considers $C_1$ before $C_2$, with the rightmost path extensions of $C_1$ being $C_5$ and $C_6$. However, $C_6$ is not canonical; it is isomorphic to $C_5$, which has the canonical DFS code. Further extensions of $C_5$ are processed recursively. Once the recursion from $C_1$ completes, gSpan moves on to $C_2$, which will be recursively extended via rightmost edge extensions as illustrated by the subtree under $C_2$. After processing $C_2$, gSpan terminates because no other frequent and canonical extensions are found. In this example, $C_{12}$ is a maximal frequent subgraph, that is, no supergraph of $C_{12}$ is frequent.

This example also shows the importance of duplicate elimination via canonical checking. The groups of isomorphic subgraphs encountered during the execution of gSpan are as follows: $\{C_2, C_3\}$, $\{C_5, C_6, C_{17}\}$, $\{C_7, C_{19}\}$, $\{C_9, C_{25}\}$, $\{C_{20}, C_{21}, C_{22}, C_{24}\}$, and $\{C_{12}, C_{13}, C_{14}\}$. Within each group the first graph is canonical and thus the remaining codes are pruned.
For a complete description of gSpan we have to specify the algorithm for
enumerating the rightmost path extensions and their support, so that infrequent
patterns can be eliminated, and the procedure for checking whether a given DFS code
is canonical, so that duplicate patterns can be pruned. These are detailed next.
Figure 11.8. Frequent graph mining: $minsup = 2$. Solid boxes indicate the frequent subgraphs, dotted the infrequent, and dashed the noncanonical subgraphs. The figure shows the search tree of candidates $C_0$ through $C_{25}$, each with its DFS code and corresponding graph.
11.3.1 Extension and Support Computation

The support computation task is to find the number of graphs in the database $D$ that contain a candidate subgraph, which is very expensive because it involves subgraph isomorphism checks. gSpan combines the tasks of enumerating candidate extensions and support computation.
Assume that $D = \{G_1, G_2, \ldots, G_n\}$ comprises $n$ graphs. Let $C = \{t_1, t_2, \ldots, t_k\}$ denote a frequent canonical DFS code comprising $k$ edges, and let $G(C)$ denote the graph corresponding to code $C$. The task is to compute the set of possible rightmost path extensions from $C$, along with their support values, which is accomplished via the pseudo-code in Algorithm 11.2.
Given code $C$, gSpan first records the nodes on the rightmost path ($R$), and the rightmost child ($u_r$). Next, gSpan considers each graph $G_i \in D$. If $C = \emptyset$, then each distinct label tuple of the form $\langle L(x), L(y), L(x, y)\rangle$ for adjacent nodes $x$ and $y$ in $G_i$ contributes a forward extension $\langle 0, 1, L(x), L(y), L(x, y)\rangle$ (lines 6-8). On the other hand, if $C$ is not empty, then gSpan enumerates all possible subgraph isomorphisms $\Phi_i$ between the code $C$ and graph $G_i$ via the function SUBGRAPHISOMORPHISMS (line 10). Given subgraph isomorphism $\phi \in \Phi_i$, gSpan finds all possible forward and backward edge extensions, and stores them in the extension set $E$.
Backward extensions (lines 12-15) are allowed only from the rightmost child $u_r$ in $C$ to some other node on the rightmost path $R$. The method considers each neighbor $x$ of $\phi(u_r)$ in $G_i$ and checks whether it is a mapping for some vertex $v = \phi^{-1}(x)$ along the rightmost path $R$ in $C$. If the edge $(u_r, v)$ does not already exist in $C$, it is a new extension, and the extended tuple $b = \langle u_r, v, L(u_r), L(v), L(u_r, v)\rangle$ is added to the set of extensions $E$, along with the graph id $i$ that contributed to that extension.
Forward extensions (lines 16-19) are allowed only from nodes on the rightmost path $R$ to new nodes. For each node $u$ in $R$, the algorithm finds a neighbor $x$ in $G_i$ that is not in a mapping from some node in $C$. For each such node $x$, the forward extension $f = \langle u, u_r + 1, L(\phi(u)), L(x), L(\phi(u), x)\rangle$ is added to $E$, along with the graph id $i$. Because a forward extension adds a new vertex to the graph $G(C)$, the id of the new node in $C$ must be $u_r + 1$, that is, one more than the highest numbered node in $C$, which by definition is the rightmost child $u_r$.
Once all the backward and forward extensions have been cataloged over all graphs $G_i$ in the database $D$, we compute their support by counting the number of distinct graph ids that contribute to each extension. Finally, the method returns the set of all extensions and their supports in sorted order (increasing) based on the tuple comparison operator in Eq. (11.1).
Example 11.9. Consider the canonical code $C$ and the corresponding graph $G(C)$ shown in Figure 11.9a. For this code all the vertices are on the rightmost path, that is, $R = \{0, 1, 2\}$, and the rightmost child is $u_r = 2$.

The sets of all possible isomorphisms from $C$ to graphs $G_1$ and $G_2$ in the database (shown in Figure 11.7) are listed in Figure 11.9b as $\Phi_1$ and $\Phi_2$. For example, the first isomorphism $\phi_1: G(C) \to G_1$ is defined as
$$\phi_1(0) = 10 \qquad \phi_1(1) = 30 \qquad \phi_1(2) = 20$$
ALGORITHM 11.2. Rightmost Path Extensions and Their Support

RIGHTMOSTPATH-EXTENSIONS (C, D):
1   R ← nodes on the rightmost path in C
2   u_r ← rightmost child in C  // dfs number
3   E ← ∅  // set of extensions from C
4   foreach G_i ∈ D, i = 1, ..., n do
5       if C = ∅ then
            // add distinct label tuples in G_i as forward extensions
6           foreach distinct ⟨L(x), L(y), L(x,y)⟩ ∈ G_i do
7               f = ⟨0, 1, L(x), L(y), L(x,y)⟩
8               Add tuple f to E along with graph id i
9       else
10          Φ_i = SUBGRAPHISOMORPHISMS(C, G_i)
11          foreach isomorphism φ ∈ Φ_i do
                // backward extensions from rightmost child
12              foreach x ∈ N_{G_i}(φ(u_r)) such that ∃ v ← φ⁻¹(x) do
13                  if v ∈ R and (u_r, v) ∉ G(C) then
14                      b = ⟨u_r, v, L(u_r), L(v), L(u_r, v)⟩
15                      Add tuple b to E along with graph id i
                // forward extensions from nodes on rightmost path
16              foreach u ∈ R do
17                  foreach x ∈ N_{G_i}(φ(u)) such that ∄ φ⁻¹(x) do
18                      f = ⟨u, u_r + 1, L(φ(u)), L(x), L(φ(u), x)⟩
19                      Add tuple f to E along with graph id i
    // Compute the support of each extension
20  foreach distinct extension s ∈ E do
21      sup(s) = number of distinct graph ids that support tuple s
22  return set of pairs ⟨s, sup(s)⟩ for extensions s ∈ E, in tuple sorted order
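A condensed Python sketch of this procedure is given below. It assumes graphs stored as adjacency dicts (node to neighbor set), label lookup tables node_label and edge_label, an isomorphism enumerator iso_fn (Algorithm 11.3), and a hypothetical helper has_edge that tests whether the code already contains a given edge; the rightmost path R and rightmost child u_r are passed in rather than recomputed from C.

```python
from collections import defaultdict

def rightmost_path_extensions(C, D, R, u_r, iso_fn, node_label, edge_label):
    """Collect rightmost-path extensions of code C over the database D.

    has_edge(C, u, v) is an assumed helper that checks whether the code C
    already contains the edge (u, v)."""
    ext = defaultdict(set)                     # extension tuple -> graph ids
    for gid, G in enumerate(D):
        if not C:                              # empty code: all 1-edge tuples
            for x in G:
                for y in G[x]:
                    t = (0, 1, node_label[x], node_label[y], edge_label[(x, y)])
                    ext[t].add(gid)
        else:
            for phi in iso_fn(C, G):
                inv = {img: v for v, img in phi.items()}
                # backward extensions from the rightmost child u_r
                for x in G[phi[u_r]]:
                    v = inv.get(x)
                    if v is not None and v in R and not has_edge(C, u_r, v):
                        ext[(u_r, v, node_label[phi[u_r]], node_label[x],
                             edge_label[(phi[u_r], x)])].add(gid)
                # forward extensions from any node on the rightmost path
                for u in R:
                    for x in G[phi[u]]:
                        if x not in inv:
                            ext[(u, u_r + 1, node_label[phi[u]], node_label[x],
                                 edge_label[(phi[u], x)])].add(gid)
    # support = number of distinct contributing graph ids; note that gSpan
    # sorts by the DFS tuple order of Eq. (11.1), not Python's default order
    return sorted((t, len(ids)) for t, ids in ext.items())
```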
The list of possible backward and forward extensions for each isomorphism is shown in Figure 11.9c. For example, there are two possible edge extensions from the isomorphism $\phi_1$. The first is a backward edge extension $\langle 2, 0, b, a\rangle$, as $(20, 10)$ is a valid backward edge in $G_1$. That is, the node $x = 10$ is a neighbor of $\phi(2) = 20$ in $G_1$, $\phi^{-1}(10) = 0 = v$ is on the rightmost path, and the edge $(2, 0)$ is not already in $G(C)$, which satisfy the backward extension steps in lines 12-15 in Algorithm 11.2. The second extension is a forward one $\langle 1, 3, a, b\rangle$, as $\langle 30, 40, a, b\rangle$ is a valid extended edge in $G_1$. That is, $x = 40$ is a neighbor of $\phi(1) = 30$ in $G_1$, and node 40 has not already been mapped to any node in $G(C)$, that is, $\phi_1^{-1}(40)$ does not exist. These conditions satisfy the forward extension steps in lines 16-19 in Algorithm 11.2.
Figure 11.9. Rightmost path extensions.

(a) Code $C$ and graph $G(C)$: $t_1 = \langle 0, 1, a, a\rangle$, $t_2 = \langle 1, 2, a, b\rangle$; $G(C)$ consists of nodes $a^0$, $a^1$, $b^2$.

(b) Subgraph isomorphisms:

    φ        0    1    2
    Φ1: φ1   10   30   20
        φ2   10   30   40
        φ3   30   10   20
    Φ2: φ4   60   80   70
        φ5   80   60   50
        φ6   80   60   70

(c) Edge extensions:

    Id   φ    Extensions
    G1   φ1   {⟨2,0,b,a⟩, ⟨1,3,a,b⟩}
         φ2   {⟨1,3,a,b⟩, ⟨0,3,a,b⟩}
         φ3   {⟨2,0,b,a⟩, ⟨0,3,a,b⟩}
    G2   φ4   {⟨2,0,b,a⟩, ⟨2,3,b,b⟩, ⟨0,3,a,b⟩}
         φ5   {⟨2,3,b,b⟩, ⟨1,3,a,b⟩}
         φ6   {⟨2,0,b,a⟩, ⟨2,3,b,b⟩, ⟨1,3,a,b⟩}

(d) Extensions (sorted) and supports:

    Extension     Support
    ⟨2,0,b,a⟩     2
    ⟨2,3,b,b⟩     1
    ⟨1,3,a,b⟩     2
    ⟨0,3,a,b⟩     2
Given the set of all the edge extensions, and the graph ids that contribute to them, we obtain the support for each extension by counting how many graphs contribute to it. The final set of extensions, in sorted order, along with their support values is shown in Figure 11.9d. With $minsup = 2$, the only infrequent extension is $\langle 2, 3, b, b\rangle$.
Subgraph Isomorphisms
The key step in listing the edge extensions for a given code $C$ is to enumerate all the possible isomorphisms from $C$ to each graph $G_i \in D$. The function SUBGRAPHISOMORPHISMS, shown in Algorithm 11.3, accepts a code $C$ and a graph $G$, and returns the set of all isomorphisms between $C$ and $G$. The set of isomorphisms $\Phi$ is initialized by mapping vertex 0 in $C$ to each vertex $x$ in $G$ that shares the same label as 0, that is, if $L(x) = L(0)$ (line 1). The method considers each tuple $t_i$ in $C$ and extends the current set of partial isomorphisms. Let $t_i = \langle u, v, L(u), L(v), L(u, v)\rangle$. We have to check if each isomorphism $\phi \in \Phi$ can be extended in $G$ using the information from $t_i$ (lines 5-12). If $t_i$ is a forward edge, then we seek a neighbor $x$ of $\phi(u)$ in $G$ such that $x$ has not already been mapped to some vertex in $C$, that is, $\phi^{-1}(x)$ should not exist, and the node and edge labels should match, that is, $L(x) = L(v)$, and $L(\phi(u), x) = L(u, v)$. If so, $\phi$ can be extended with the mapping $\phi(v) \to x$. The new extended isomorphism, denoted $\phi'$, is added to the initially empty set of isomorphisms $\Phi'$. If $t_i$ is a backward edge, we have to check if $\phi(v)$ is a neighbor of $\phi(u)$ in $G$. If so, we add the current isomorphism $\phi$ to $\Phi'$.
ALGORITHM 11.3. Enumerate Subgraph Isomorphisms

SUBGRAPHISOMORPHISMS (C = {t_1, t_2, ..., t_k}, G):
1   Φ ← {φ(0) → x | x ∈ G and L(x) = L(0)}
2   foreach t_i ∈ C, i = 1, ..., k do
3       ⟨u, v, L(u), L(v), L(u,v)⟩ ← t_i  // expand extended edge t_i
4       Φ′ ← ∅  // partial isomorphisms including t_i
5       foreach partial isomorphism φ ∈ Φ do
6           if v > u then
                // forward edge
7               foreach x ∈ N_G(φ(u)) do
8                   if ∄ φ⁻¹(x) and L(x) = L(v) and L(φ(u), x) = L(u, v) then
9                       φ′ ← φ ∪ {φ(v) → x}
10                      Add φ′ to Φ′
11          else
                // backward edge
12              if φ(v) ∈ N_G(φ(u)) then Add φ to Φ′  // valid isomorphism
13      Φ ← Φ′  // update partial isomorphisms
14  return Φ
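Algorithm 11.3 translates almost line for line into Python. The sketch below assumes $G$ is an adjacency dict and that Lnode and Ledge are label tables with edges stored symmetrically; these names are placeholders for whatever representation the caller uses.

```python
# Sketch of SUBGRAPHISOMORPHISMS. C is a list of extended edge tuples
# (u, v, l_u, l_v, l_uv); G maps each node to its neighbor set; Lnode and
# Ledge are dicts giving node labels and (symmetric) edge labels.

def subgraph_isomorphisms(C, G, Lnode, Ledge):
    # initialize with every label-preserving mapping for code vertex 0
    Phi = [{0: x} for x in G if Lnode[x] == C[0][2]]
    for (u, v, l_u, l_v, l_uv) in C:
        Phi_new = []
        for phi in Phi:
            if v > u:                                  # forward edge
                for x in G[phi[u]]:
                    if (x not in phi.values()          # x not yet mapped
                            and Lnode[x] == l_v
                            and Ledge[(phi[u], x)] == l_uv):
                        phi2 = dict(phi)
                        phi2[v] = x                    # extend with phi(v) -> x
                        Phi_new.append(phi2)
            else:                                      # backward edge
                if phi[v] in G[phi[u]]:                # edge must exist in G
                    Phi_new.append(phi)
        Phi = Phi_new                                  # keep only extendable maps
    return Phi
```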
Thus, only those isomorphisms that can be extended in the forward case, or those that satisfy the backward edge, are retained for further checking. Once all the extended edges in $C$ have been processed, the set $\Phi$ contains all the valid isomorphisms from $C$ to $G$.
Example 11.10. Figure 11.10 illustrates the subgraph isomorphism enumeration algorithm from the code $C$ to each of the graphs $G_1$ and $G_2$ in the database shown in Figure 11.7.

For $G_1$, the set of isomorphisms $\Phi$ is initialized by mapping the first node of $C$ to all nodes labeled $a$ in $G_1$ because $L(0) = a$. Thus, $\Phi = \{\phi_1(0) \to 10, \phi_2(0) \to 30\}$. We next consider each tuple in $C$, and see which isomorphisms can be extended. The first tuple $t_1 = \langle 0, 1, a, a\rangle$ is a forward edge, thus for $\phi_1$, we consider neighbors $x$ of 10 that are labeled $a$ and not included in the isomorphism yet. The only other vertex that satisfies this condition is 30; thus the isomorphism is extended by mapping $\phi_1(1) \to 30$. In a similar manner the second isomorphism $\phi_2$ is extended by adding $\phi_2(1) \to 10$, as shown in Figure 11.10. For the second tuple $t_2 = \langle 1, 2, a, b\rangle$, the isomorphism $\phi_1$ has two possible extensions, as 30 has two neighbors labeled $b$, namely 20 and 40. The extended mappings are denoted $\phi_1'$ and $\phi_1''$. For $\phi_2$ there is only one extension.

The isomorphisms of $C$ in $G_2$ can be found in a similar manner. The complete sets of isomorphisms in each database graph are shown in Figure 11.10.
Figure 11.11. Canonicality checking. The graph $G$ has DFS code $C$ with tuples $t_1 = \langle 0, 1, a, a\rangle$, $t_2 = \langle 1, 2, a, b\rangle$, $t_3 = \langle 1, 3, a, b\rangle$, and $t_4 = \langle 3, 0, b, a\rangle$. The three steps build the candidate canonical code $C^*$: Step 1 yields $s_1 = \langle 0, 1, a, a\rangle$, Step 2 adds $s_2 = \langle 1, 2, a, b\rangle$, and Step 3 adds $s_3 = \langle 2, 0, b, a\rangle$.
Example 11.11. Consider the subgraph candidate $C_{14}$ from Figure 11.8, which is replicated as graph $G$ in Figure 11.11, along with its DFS code $C$. From an initial canonical code $C^* = \emptyset$, the smallest rightmost edge extension $s_1$ is added in Step 1. Because $s_1 = t_1$, we proceed to the next step, which finds the smallest edge extension $s_2$. Once again $s_2 = t_2$, so we proceed to the third step. The least possible edge extension for $G^*$ is the extended edge $s_3$. However, we find that $s_3 < t_3$, which means that $C$ cannot be the canonical code for $G$, and the candidate can therefore be pruned.

An itemset $X$ is closed if all its supersets have strictly lower support, that is,
$$sup(X) > sup(Y), \text{ for all } Y \supset X$$
An itemset $X$ is a minimal generator if all its subsets have strictly higher support, that is,
$$sup(X) < sup(Y), \text{ for all } Y \subset X$$
If an itemset $X$ is not a minimal generator, then it implies that it has some redundant items, that is, we can find some subset $Y \subset X$, which can be replaced with an even smaller subset $W \subset Y$ without changing the support of $X$, that is, there exists a $W \subset Y$, such that
$$sup(X) = sup(Y \cup (X \setminus Y)) = sup(W \cup (X \setminus Y))$$
One can show that all subsets of a minimal generator must themselves be minimal generators.
Table 12.16. Closed itemsets and minimal generators

    sup   Closed Itemset   Minimal Generators
    3     ABDE             AD, DE
    3     BCE              CE
    4     ABE              A
    4     BC               C
    4     BD               D
    5     BE               E
    6     B                B
Example 12.13. Consider the dataset in Table 12.1 and the set of frequent itemsets with $minsup = 3$ as shown in Table 12.2. There are only two maximal frequent itemsets, namely $ABDE$ and $BCE$, which capture essential information about whether another itemset is frequent or not: an itemset is frequent only if it is a subset of one of these two.

Table 12.16 shows the seven closed itemsets and the corresponding minimal generators. Both of these sets allow one to infer the exact support of any other frequent itemset. The support of an itemset $X$ is the maximum support among all closed itemsets that contain it. Alternatively, the support of $X$ is the minimum support among all minimal generators that are subsets of $X$. For example, the itemset $AE$ is a subset of the closed sets $ABE$ and $ABDE$, and it is a superset of the minimal generators $A$ and $E$; we can observe that
$$sup(AE) = \max\{sup(ABE), sup(ABDE)\} = 4$$
$$sup(AE) = \min\{sup(A), sup(E)\} = 4$$
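Either structure supports these support-inference queries directly. The following sketch hard-codes Table 12.16 and checks both derivations on the itemset $AE$.

```python
# Support inference from closed itemsets and from minimal generators,
# using the contents of Table 12.16.

closed = {frozenset('ABDE'): 3, frozenset('BCE'): 3, frozenset('ABE'): 4,
          frozenset('BC'): 4, frozenset('BD'): 4, frozenset('BE'): 5,
          frozenset('B'): 6}
mingen = {frozenset('AD'): 3, frozenset('DE'): 3, frozenset('CE'): 3,
          frozenset('A'): 4, frozenset('C'): 4, frozenset('D'): 4,
          frozenset('E'): 5, frozenset('B'): 6}

def sup_from_closed(X):
    # maximum support over all closed supersets of X
    return max(s for c, s in closed.items() if X <= c)

def sup_from_mingen(X):
    # minimum support over all minimal generators contained in X
    return min(s for g, s in mingen.items() if g <= X)

X = frozenset('AE')
print(sup_from_closed(X), sup_from_mingen(X))   # both print 4
```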
Productive Itemsets
An itemset $X$ is productive if its relative support is higher than the expected relative support over all of its bipartitions, assuming they are independent. More formally, let $|X| \geq 2$, and let $\{X_1, X_2\}$ be a bipartition of $X$. We say that $X$ is productive provided
$$rsup(X) > rsup(X_1) \times rsup(X_2), \text{ for all bipartitions } \{X_1, X_2\} \text{ of } X \qquad (12.3)$$
This immediately implies that $X$ is productive if its minimum lift is greater than one, as
$$MinLift(X) = \min_{X_1, X_2} \left\{ \frac{rsup(X)}{rsup(X_1) \cdot rsup(X_2)} \right\} > 1$$
In terms of leverage, $X$ is productive if its minimum leverage is above zero because
$$MinLeverage(X) = \min_{X_1, X_2} \big\{ rsup(X) - rsup(X_1) \times rsup(X_2) \big\} > 0$$
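Checking productivity thus reduces to a loop over bipartitions, as in the sketch below, which assumes a dict rsup mapping frozensets (including singletons) to relative supports. Each bipartition is visited twice (with $X_1$ and $X_2$ exchanged), which is harmless when taking the minimum.

```python
# Minimum lift over all bipartitions of an itemset X (Eq. 12.3);
# rsup is an assumed dict from frozenset to relative support.

from itertools import combinations

def min_lift(X, rsup):
    X = frozenset(X)
    items = sorted(X)
    best = float('inf')
    # every nonempty proper subset X1, paired with its complement X2 = X - X1
    for r in range(1, len(items)):
        for sub in combinations(items, r):
            X1 = frozenset(sub)
            X2 = X - X1
            best = min(best, rsup[X] / (rsup[X1] * rsup[X2]))
    return best

def is_productive(X, rsup):
    return min_lift(X, rsup) > 1.0
```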
Example 12.14. Considering the frequent itemsets in Table 12.2, the set $ABDE$ is not productive because there exists a bipartition with lift value of 1. For instance, for its bipartition $\{B, ADE\}$ we have
$$lift(B \longrightarrow ADE) = \frac{rsup(ABDE)}{rsup(B) \cdot rsup(ADE)} = \frac{3/6}{(6/6) \cdot (3/6)} = 1$$
On the other hand, $ADE$ is productive because it has three distinct bipartitions and all of them have lift above 1:
$$lift(A \longrightarrow DE) = \frac{rsup(ADE)}{rsup(A) \cdot rsup(DE)} = \frac{3/6}{(4/6) \cdot (3/6)} = 1.5$$
$$lift(D \longrightarrow AE) = \frac{rsup(ADE)}{rsup(D) \cdot rsup(AE)} = \frac{3/6}{(4/6) \cdot (4/6)} = 1.125$$
$$lift(E \longrightarrow AD) = \frac{rsup(ADE)}{rsup(E) \cdot rsup(AD)} = \frac{3/6}{(5/6) \cdot (3/6)} = 1.2$$
Comparing Rules
Given two rules $R: X \longrightarrow Y$ and $R': W \longrightarrow Y$ that have the same consequent, we say that $R$ is more specific than $R'$, or equivalently, that $R'$ is more general than $R$, provided $W \subset X$.

Nonredundant Rules
We say that a rule $R: X \longrightarrow Y$ is redundant provided there exists a more general rule $R': W \longrightarrow Y$ that has the same support, that is, $W \subset X$ and $sup(R) = sup(R')$. On the other hand, if $sup(R) < sup(R')$ over all its generalizations $R'$, then $R$ is nonredundant.
Improvement and Productive Rules
Define the improvement of a rule $X \longrightarrow Y$ as follows:
$$imp(X \longrightarrow Y) = conf(X \longrightarrow Y) - \max_{W \subset X} \big\{ conf(W \longrightarrow Y) \big\}$$
Improvement quantifies the minimum difference between the confidence of a rule and any of its generalizations. A rule $R: X \longrightarrow Y$ is productive if its improvement is greater than zero, which implies that for all more general rules $R': W \longrightarrow Y$ we have $conf(R) > conf(R')$. On the other hand, if there exists a more general rule $R'$ with $conf(R') \geq conf(R)$, then $R$ is unproductive. If a rule is redundant, it is also unproductive because its improvement is zero.

The smaller the improvement of a rule $R: X \longrightarrow Y$, the more likely it is to be unproductive. We can generalize this notion to consider rules that have at least some minimum level of improvement, that is, we may require that $imp(X \longrightarrow Y) \geq t$, where $t$ is a user-specified minimum improvement threshold.
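Improvement is likewise a short maximization over generalizations. The sketch below restricts the maximization to nonempty proper subsets of the antecedent, matching the generalizations considered in Example 12.15, and assumes a dict sup from frozensets to supports.

```python
# Rule improvement: conf(X -> Y) minus the best confidence among the
# (nonempty, proper) generalizations W -> Y, for W a subset of X.

from itertools import combinations

def conf(W, Y, sup):
    return sup[W | Y] / sup[W]

def improvement(X, Y, sup):
    X, Y = frozenset(X), frozenset(Y)
    items = sorted(X)
    gens = [frozenset(c) for r in range(1, len(items))
            for c in combinations(items, r)]
    return conf(X, Y, sup) - max(conf(W, Y, sup) for W in gens)
```

With the supports of Example 12.15 (sup(BE) = 5, sup(BCE) = 3, sup(E) = 5, sup(CE) = 3, sup(B) = 6, sup(BC) = 4), improvement('BE', 'C', sup) evaluates to roughly -0.07, matching the example.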
Example 12.15. Consider the example dataset in Table 12.1, and the set of frequent itemsets in Table 12.2. Consider rule $R: BE \longrightarrow C$, which has support 3, and confidence $3/5 = 0.60$. It has two generalizations, namely
$$R_1': E \longrightarrow C, \quad sup = 3, \; conf = 3/5 = 0.6$$
$$R_2': B \longrightarrow C, \quad sup = 4, \; conf = 4/6 = 0.67$$
Thus, $BE \longrightarrow C$ is redundant w.r.t. $E \longrightarrow C$ because they have the same support, that is, $sup(BCE) = sup(CE)$. Further, $BE \longrightarrow C$ is also unproductive, since $imp(BE \longrightarrow C) = 0.6 - \max\{0.6, 0.67\} = -0.07$; it has a more general rule, namely $R_2'$, with higher confidence.
12.2 SIGNIFICANCE TESTING AND CONFIDENCE INTERVALS
We now consider how to assess the statistical significance of patterns and rules, and
how to derive confidence intervals for a given assessment measure.
12.2.1 Fisher Exact Test for Productive Rules

We begin by discussing the Fisher exact test for rule improvement. That is, we directly test whether the rule $R: X \longrightarrow Y$ is productive by comparing its confidence with that of each of its generalizations $R': W \longrightarrow Y$, including the default or trivial rule $\emptyset \longrightarrow Y$.
Let $R: X \longrightarrow Y$ be an association rule. Consider its generalization $R': W \longrightarrow Y$, where $W = X \setminus Z$ is the new antecedent formed by removing from $X$ the subset $Z \subseteq X$. Given an input dataset $D$, conditional on the fact that $W$ occurs, we can create a $2 \times 2$ contingency table between $Z$ and the consequent $Y$ as shown in Table 12.17. The different cell values are as follows:
$$a = sup(WZY) = sup(XY) \qquad b = sup(WZ\neg Y) = sup(X\neg Y)$$
$$c = sup(W\neg Z Y) \qquad d = sup(W\neg Z\neg Y)$$
Here, $a$ denotes the number of transactions that contain both $X$ and $Y$, $b$ denotes the number of transactions that contain $X$ but not $Y$, $c$ denotes the number of transactions that contain $W$ and $Y$ but not $Z$, and finally $d$ denotes the number of transactions that contain $W$ but neither $Z$ nor $Y$. The marginal counts are given as
row marginals: $a + b = sup(WZ) = sup(X)$, $\quad c + d = sup(W\neg Z)$
column marginals: $a + c = sup(WY)$, $\quad b + d = sup(W\neg Y)$
where the row marginals give the occurrence frequency of $W$ with and without $Z$, and the column marginals specify the occurrence counts of $W$ with and without $Y$. Finally, we can observe that the sum of all the cells is simply $n = a + b + c + d = sup(W)$. Notice that when $Z = X$, we have $W = \emptyset$, and the contingency table defaults to the one shown in Table 12.8.
Given a contingency table conditional on $W$, we are interested in the odds ratio obtained by comparing the presence and absence of $Z$, that is,
$$oddsratio = \frac{a/(a+b) \,\big/\, b/(a+b)}{c/(c+d) \,\big/\, d/(c+d)} = \frac{a/b}{c/d} = \frac{ad}{bc} \qquad (12.4)$$
Table 12.17. Contingency table for $Z$ and $Y$, conditional on $W = X \setminus Z$

    W      Y        ¬Y
    Z      a        b        a + b
    ¬Z     c        d        c + d
           a + c    b + d    n = sup(W)
Recall that the odds ratio measures the odds of $X$, that is, $W$ and $Z$, occurring with $Y$ versus the odds of its subset $W$, but not $Z$, occurring with $Y$. Under the null hypothesis $H_0$ that $Z$ and $Y$ are independent given $W$, the odds ratio is 1. To see this, note that under the independence assumption the count in a cell of the contingency table is equal to the product of the corresponding row and column marginal counts divided by $n$, that is, under $H_0$:
$$a = (a+b)(a+c)/n \qquad b = (a+b)(b+d)/n$$
$$c = (c+d)(a+c)/n \qquad d = (c+d)(b+d)/n$$
Plugging these values in Eq. (12.4), we obtain
$$oddsratio = \frac{ad}{bc} = \frac{(a+b)(a+c)(c+d)(b+d)}{(a+b)(b+d)(c+d)(a+c)} = 1$$
The null hypothesis therefore corresponds to $H_0: oddsratio = 1$, and the alternative hypothesis is $H_a: oddsratio > 1$. Under the null hypothesis, if we further assume that the row and column marginals are fixed, then $a$ uniquely determines the other three values $b$, $c$, and $d$, and the probability mass function of observing the value $a$ in the contingency table is given by the hypergeometric distribution. Recall that the hypergeometric distribution gives the probability of choosing $s$ successes in $t$ trials if we sample without replacement from a finite population of size $T$ that has $S$ successes in total, given as
$$P(s \mid t, S, T) = \frac{\binom{S}{s}\binom{T-S}{t-s}}{\binom{T}{t}}$$
In our context, we take the occurrence of $Z$ as a success. The population size is $T = sup(W) = n$ because we assume that $W$ always occurs, and the total number of successes is the support of $Z$ given $W$, that is, $S = a + b$. In $t = a + c$ trials, the hypergeometric distribution gives the probability of $s = a$ successes:
$$P\big(a \mid (a+c), (a+b), n\big) = \frac{\binom{a+b}{a}\binom{n-(a+b)}{(a+c)-a}}{\binom{n}{a+c}} = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!} \qquad (12.5)$$
Table 12.18. Contingency table: increase $a$ by $i$

    W      Y        ¬Y
    Z      a + i    b − i    a + b
    ¬Z     c − i    d + i    c + d
           a + c    b + d    n = sup(W)
Our aim is to contrast the null hypothesis $H_0$ that $oddsratio = 1$ with the alternative hypothesis $H_a$ that $oddsratio > 1$. Because $a$ determines the rest of the cells under fixed row and column marginals, we can see from Eq. (12.4) that the larger the $a$ the larger the odds ratio, and consequently the greater the evidence for $H_a$. We can obtain the $p$-value for a contingency table as extreme as that in Table 12.17 by summing Eq. (12.5) over all possible values $a$ or larger:
$$p\text{-value}(a) = \sum_{i=0}^{\min(b,c)} P\big(a + i \mid (a+c), (a+b), n\big) = \sum_{i=0}^{\min(b,c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$$
which follows from the fact that when we increase the count of $a$ by $i$, then because the row and column marginals are fixed, $b$ and $c$ must decrease by $i$, and $d$ must increase by $i$, as shown in Table 12.18. The lower the $p$-value the stronger the evidence that the odds ratio is greater than one, and thus, we may reject the null hypothesis $H_0$ if $p\text{-value} \leq \alpha$, where $\alpha$ is the significance threshold (e.g., $\alpha = 0.01$). This test is known as the Fisher Exact Test.
In summary, to check whether a rule $R: X \longrightarrow Y$ is productive, we must compute $p\text{-value}(a) = p\text{-value}(sup(XY))$ of the contingency tables obtained from each of its generalizations $R': W \longrightarrow Y$, where $W = X \setminus Z$, for $Z \subseteq X$. If $p\text{-value}(sup(XY)) > \alpha$ for any of these comparisons, then we can reject the rule $R: X \longrightarrow Y$ as nonproductive. On the other hand, if $p\text{-value}(sup(XY)) \leq \alpha$ for all the generalizations, then $R$ is productive. However, note that if $|X| = k$, then there are $2^k - 1$ possible generalizations; to avoid this exponential complexity for large antecedents, we typically restrict our attention to only the immediate generalizations of the form $R': X \setminus z \longrightarrow Y$, where $z \in X$ is one of the attribute values in the antecedent. However, we do include the trivial rule $\emptyset \longrightarrow Y$ because the conditional probability $P(Y \mid X) = conf(X \longrightarrow Y)$ should also be higher than the prior probability $P(Y) = conf(\emptyset \longrightarrow Y)$.
Example 12.16. Consider the rule $R: pw_2 \longrightarrow c_2$ obtained from the discretized Iris dataset. To test if it is productive, because there is only a single item in the antecedent, we compare it only with the default rule $\emptyset \longrightarrow c_2$. Using Table 12.17, the various cell values are
$$a = sup(pw_2, c_2) = 49 \qquad b = sup(pw_2, \neg c_2) = 5$$
$$c = sup(\neg pw_2, c_2) = 1 \qquad d = sup(\neg pw_2, \neg c_2) = 95$$
with the contingency table given as

           c2    ¬c2
    pw2    49    5      54
    ¬pw2   1     95     96
           50    100    150

Thus the $p$-value is given as
$$p\text{-value} = \sum_{i=0}^{\min(b,c)} P\big(a + i \mid (a+c), (a+b), n\big) = P(49 \mid 50, 54, 150) + P(50 \mid 50, 54, 150)$$
$$= \frac{\binom{54}{49}\binom{96}{95}}{\binom{150}{50}} + \frac{\binom{54}{50}\binom{96}{96}}{\binom{150}{50}} = 1.51 \times 10^{-32} + 1.57 \times 10^{-35} = 1.51 \times 10^{-32}$$
Since the $p$-value is extremely small, we can safely reject the null hypothesis that the odds ratio is 1. Instead, there is a strong relationship between $X = pw_2$ and $Y = c_2$, and we conclude that $R: pw_2 \longrightarrow c_2$ is a productive rule.
Example 12.17. Consider another rule $\{sw_1, pw_2\} \longrightarrow c_2$, with $X = \{sw_1, pw_2\}$ and $Y = c_2$. Consider its three generalizations, and the corresponding contingency tables and $p$-values:

$R_1': pw_2 \longrightarrow c_2$, with $Z = \{sw_1\}$, $W = X \setminus Z = \{pw_2\}$, $p$-value $= 0.84$:

    W = pw2    c2    ¬c2
    sw1        34    4     38
    ¬sw1       15    1     16
               49    5     54

$R_2': sw_1 \longrightarrow c_2$, with $Z = \{pw_2\}$, $W = X \setminus Z = \{sw_1\}$, $p$-value $= 1.39 \times 10^{-11}$:

    W = sw1    c2    ¬c2
    pw2        34    4     38
    ¬pw2       0     19    19
               34    23    57

$R_3': \emptyset \longrightarrow c_2$, with $Z = \{sw_1, pw_2\}$, $W = X \setminus Z = \emptyset$, $p$-value $= 3.55 \times 10^{-17}$:

    W = ∅           c2    ¬c2
    {sw1, pw2}      34    4     38
    ¬{sw1, pw2}     16    96    112
                    50    100   150

We can see that whereas the $p$-value with respect to $R_2'$ and $R_3'$ is small, for $R_1'$ we have $p$-value $= 0.84$, which is too high and thus we cannot reject the null hypothesis. We conclude that $R: \{sw_1, pw_2\} \longrightarrow c_2$ is not productive. In fact, its generalization $R_1'$ is the one that is productive, as shown in Example 12.16.
Multiple Hypothesis Testing
Given an input dataset $D$, there can be an exponentially large number of rules that need to be tested to check whether they are productive or not. We thus run into the multiple hypothesis testing problem, that is, just by the sheer number of hypothesis tests some unproductive rules will pass the $p\text{-value} \leq \alpha$ threshold by random chance. A strategy for overcoming this problem is to use the Bonferroni correction of the significance level that explicitly takes into account the number of experiments performed during the hypothesis testing process. Instead of using the given $\alpha$ threshold, we should use an adjusted threshold $\alpha' = \alpha / \#r$, where $\#r$ is the number of rules to be tested or its estimate. This correction ensures that the rule false discovery rate is bounded by $\alpha$, where a false discovery is to claim that a rule is productive when it is not.
Example 12.18. Consider the discretized Iris dataset, using the discretization shown in Table 12.10. Let us focus only on class-specific rules, that is, rules of the form $X \longrightarrow c_i$. Since each example can take on only one value at a time for a given attribute, the maximum antecedent length is four, and the maximum number of class-specific rules that can be generated from the Iris dataset is given as
$$\#r = c \times \sum_{i=1}^{4} \binom{4}{i} b^i$$
where $c$ is the number of Iris classes, and $b$ is the maximum number of bins for any attribute. The summation is over the antecedent size $i$, that is, the number of attributes to be used in the antecedent. Finally, there are $b^i$ possible combinations for the chosen set of $i$ attributes. Because there are three Iris classes, and because each attribute has three bins, we have $c = 3$ and $b = 3$, and the number of possible rules is
$$\#r = 3 \times \sum_{i=1}^{4} \binom{4}{i} 3^i = 3 \times (12 + 54 + 108 + 81) = 3 \cdot 255 = 765$$
Thus, if the input significance level is $\alpha = 0.01$, then the adjusted significance level using the Bonferroni correction is $\alpha' = \alpha / \#r = 0.01/765 = 1.31 \times 10^{-5}$. The rule $pw_2 \longrightarrow c_2$ in Example 12.16 has $p\text{-value} = 1.51 \times 10^{-32}$, and thus it remains productive even when we use $\alpha'$.
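The rule count and adjusted threshold are easy to verify numerically:

```python
# Check of the rule-count estimate and Bonferroni-adjusted threshold
# from Example 12.18.

from math import comb

c, b = 3, 3                                   # classes, bins per attribute
num_rules = c * sum(comb(4, i) * b**i for i in range(1, 5))
print(num_rules)                              # 765
print(0.01 / num_rules)                       # adjusted alpha ~ 1.31e-05
```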
12.2.2 Permutation Test for Significance

A permutation or randomization test determines the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample
of datasets, which can in turn be used for significance testing. In the context of pattern assessment, given an input dataset $D$, we first generate $k$ randomly permuted datasets $D_1, D_2, \ldots, D_k$. We can then perform different types of significance tests. For instance, given a pattern or rule we can check whether it is statistically significant by first computing the empirical probability mass function (EPMF) for the test statistic $\Theta$ by computing its value $\theta_i$ in the $i$th randomized dataset $D_i$ for all $i \in [1, k]$. From these values we can generate the empirical cumulative distribution function
$$\hat{F}(x) = \hat{P}(\Theta \leq x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \leq x)$$
where $I$ is an indicator variable that takes on the value 1 when its argument is true, and is 0 otherwise. Let $\theta$ be the value of the test statistic in the input dataset $D$; then $p\text{-value}(\theta)$, that is, the probability of obtaining a value as high as $\theta$ by random chance, can be computed as
$$p\text{-value}(\theta) = 1 - \hat{F}(\theta)$$
Given a significance level $\alpha$, if $p\text{-value}(\theta) > \alpha$, then we accept the null hypothesis that the pattern/rule is not statistically significant. On the other hand, if $p\text{-value}(\theta) \leq \alpha$, then we can reject the null hypothesis and conclude that the pattern is significant because a value as high as $\theta$ is highly improbable. The permutation test approach can also be used to assess an entire set of rules or patterns. For instance, we may test a collection of frequent itemsets by comparing the number of frequent itemsets in $D$ with the distribution of the number of frequent itemsets empirically derived from the permuted datasets $D_i$. We may also do this analysis as a function of $minsup$, and so on.
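The testing loop itself is short. The following sketch assumes caller-supplied helpers statistic and permute (for instance, the swap randomization of the next subsection) and returns the empirical $p$-value defined above.

```python
# Empirical p-value via a permutation test: compare the statistic on the
# input dataset against its distribution over k randomized datasets.

def permutation_p_value(D, statistic, permute, k=100):
    theta = statistic(D)                       # value on the input dataset
    samples = [statistic(permute(D)) for _ in range(k)]
    F_hat = sum(1 for s in samples if s <= theta) / k   # empirical CDF at theta
    return 1.0 - F_hat                         # P(value as high as theta)
```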
Swap Randomization
A key question in generating the permuted datasets $D_i$ is which characteristics of the input dataset $D$ we should preserve. The swap randomization approach maintains as invariant the column and row margins for a given dataset, that is, the permuted datasets preserve the support of each item (the column margin) as well as the number of items in each transaction (the row margin). Given a dataset $D$, we randomly create $k$ datasets that have the same row and column margins. We then mine frequent patterns in $D$ and check whether the pattern statistics are different from those obtained using the randomized datasets. If the differences are not significant, we may conclude that the patterns arise solely from the row and column margins, and not from any interesting properties of the data.

Given a binary matrix $D \subseteq T \times I$, the swap randomization method exchanges two nonzero cells of the matrix via a swap that leaves the row and column margins unchanged. To illustrate how swap works, consider any two transactions $t_a, t_b \in T$ and any two items $i_a, i_b \in I$ such that $(t_a, i_a), (t_b, i_b) \in D$ and $(t_a, i_b), (t_b, i_a) \notin D$, which corresponds to the $2 \times 2$ submatrix in $D$, given as
$$D(t_a, i_a; t_b, i_b) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
ALGORITHM 12.1. Generate Swap Randomized Dataset

SWAPRANDOMIZATION (t, D ⊆ T × I):
1   while t > 0 do
2       Select pairs (t_a, i_a), (t_b, i_b) ∈ D randomly
3       if (t_a, i_b) ∉ D and (t_b, i_a) ∉ D then
4           D ← D \ {(t_a, i_a), (t_b, i_b)} ∪ {(t_a, i_b), (t_b, i_a)}
5           t = t − 1
6   return D
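A Python sketch of Algorithm 12.1, with the dataset kept as a set of (transaction, item) pairs, might look as follows; here, as in the pseudo-code, the counter is decremented only on successful swaps.

```python
# Swap randomization of a binary dataset represented as a set of
# (transaction, item) pairs; row and column margins are preserved.

import random

def swap_randomize(D, t):
    D = set(D)
    pairs = list(D)
    while t > 0:
        (ta, ia), (tb, ib) = random.sample(pairs, 2)
        # a swap is valid only if the two "crossed" cells are absent
        if (ta, ib) not in D and (tb, ia) not in D:
            D -= {(ta, ia), (tb, ib)}
            D |= {(ta, ib), (tb, ia)}
            pairs = list(D)          # refresh the pair list after a swap
            t -= 1
    return D
```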
After a swap operation we obtain the new submatrix
$$D(t_a, i_b; t_b, i_a) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
where we exchange the elements in $D$ so that $(t_a, i_b), (t_b, i_a) \in D$, and $(t_a, i_a), (t_b, i_b) \notin D$. We denote this operation as $\mathrm{Swap}(t_a, i_a; t_b, i_b)$. Notice that a swap does not affect the row and column margins, and we can thus generate a permuted dataset with the same row and column sums as $D$ through a sequence of swaps. Algorithm 12.1 shows the pseudo-code for generating a swap randomized dataset. The algorithm performs $t$ swap trials by selecting two pairs $(t_a, i_a), (t_b, i_b) \in D$ at random; a swap is successful only if both $(t_a, i_b), (t_b, i_a) \notin D$.
Example 12.19. Consider the input binary dataset $D$ shown in Table 12.19a, whose row and column sums are also shown. Table 12.19b shows the resulting dataset after a single swap operation $\mathrm{Swap}(1, D; 4, C)$, highlighted by the gray cells. When we apply another swap, namely $\mathrm{Swap}(2, C; 4, A)$, we obtain the data in Table 12.19c. We can observe that the marginal counts remain invariant.

From the input dataset $D$ in Table 12.19a we generated $k = 100$ swap randomized datasets, each of which is obtained by performing 150 swaps (the product of all possible transaction pairs and item pairs, that is, $\binom{6}{2} \cdot \binom{5}{2} = 150$). Let the test statistic be the total number of frequent itemsets using $minsup = 3$. Mining $D$ results in $|F| = 19$ frequent itemsets. Likewise, mining each of the $k = 100$ permuted datasets results in the following empirical PMF for $|F|$:
$$P(|F| = 19) = 0.67 \qquad P(|F| = 17) = 0.33$$
Because $p\text{-value}(19) = 0.67$, we may conclude that the set of frequent itemsets is essentially determined by the row and column marginals.
Focusing on a specific itemset, consider $ABDE$, which is one of the maximal frequent itemsets in $D$, with $sup(ABDE) = 3$. The probability that $ABDE$ is frequent is $17/100 = 0.17$ because it is frequent in 17 of the 100 swapped datasets. As this probability is not very low, we may conclude that $ABDE$ is not a statistically significant pattern; it has a relatively high chance of being frequent in random datasets. Consider another itemset $BCD$ that is not frequent in $D$ because
$sup(BCD) = 2$. The empirical PMF for the support of $BCD$ is given as
$$P(sup = 2) = 0.54 \qquad P(sup = 3) = 0.44 \qquad P(sup = 4) = 0.02$$
In a majority of the datasets $BCD$ is infrequent, and if $minsup = 4$, then $p\text{-value}(sup = 4) = 0.02$ implies that $BCD$ is highly unlikely to be a frequent pattern.
we have $\hat{F}(10) = P(sup < 10) = 0.517$. Put differently, $P(sup \geq 10) = 1 - 0.517 = 0.483$, that is, 48.3% of the itemsets that occur at least once are frequent using $minsup = 10$.
Define the test statistic to be the relative lift, defined as the relative change in the lift value of itemset $X$ when comparing the input dataset $D$ and a randomized dataset $D_i$, that is,
$$rlift(X, D, D_i) = \frac{lift(X, D) - lift(X, D_i)}{lift(X, D)}$$
For an $m$-itemset $X = \{x_1, \ldots, x_m\}$, by Eq. (12.2) note that
$$lift(X, D) = \frac{rsup(X, D)}{\prod_{j=1}^{m} rsup(x_j, D)}$$
Because the swap randomization process leaves item supports (the column margins) intact, and does not change the number of transactions, we have $rsup(x_j, D) = rsup(x_j, D_i)$, and $|D| = |D_i|$. We can thus rewrite the relative lift statistic as
$$rlift(X, D, D_i) = \frac{sup(X, D) - sup(X, D_i)}{sup(X, D)} = 1 - \frac{sup(X, D_i)}{sup(X, D)}$$
We generate $k = 100$ randomized datasets and compute the average relative lift for each of the 140 frequent itemsets of size two or more in the input dataset, as lift values are not defined for single items. Figure 12.4 shows the cumulative distribution for average relative lift, which ranges from $-0.55$ to $0.998$. An average relative lift close to 1 means that the corresponding frequent pattern hardly ever occurs in any of the randomized datasets. On the other hand, a larger negative average relative lift value means that the support in randomized datasets is higher than in the input dataset. Finally, a value close to zero means that the support of the itemset is the same in both the original and randomized datasets; it is mainly a consequence of the marginal counts, and thus of little interest.
Figure 12.4. Cumulative distribution for average relative lift.
Figure 12.5. PMF for relative lift for $\{sl_1, pw_2\}$.
Figure 12.4 indicates that 44% of the frequent itemsets have average relative lift values above 0.8. These patterns are likely to be of interest. The pattern with the highest lift value of 0.998 is $\{sl_1, sw_3, pl_1, pw_1, c_1\}$. The itemset that has more or less the same support in the input and randomized datasets is $\{sl_2, c_3\}$; its average relative lift is $-0.002$. On the other hand, 5% of the frequent itemsets have average relative lift below $-0.2$. These are also of interest because they indicate more of a dis-association among the items, that is, the itemsets are more frequent by random chance. An example of such a pattern is $\{sl_1, pw_2\}$. Figure 12.5 shows the empirical probability mass function for its relative lift values across the 100 swap randomized datasets. Its average relative lift value is $-0.55$, and $p\text{-value}(-0.2) = 0.069$, which indicates a high probability that the itemset is disassociative.
12.2.3 Bootstrap Sampling for Confidence Interval

Typically the input transaction database $D$ is just a sample from some population, and it is not enough to claim that a pattern $X$ is frequent in $D$ with support $sup(X)$. What can we say about the range of possible support values for $X$? Likewise, for a rule $R$ with a given lift value in $D$, what can we say about the range of lift values in different samples? In general, given a test assessment statistic $\Theta$, bootstrap sampling allows one to infer the confidence interval for the possible values of $\Theta$ at a desired confidence level $\alpha$.

The main idea is to generate $k$ bootstrap samples from $D$ using sampling with replacement, that is, assuming $|D| = n$, each sample $D_i$ is obtained by selecting at random $n$ transactions from $D$ with replacement. Given pattern $X$ or rule $R: X \longrightarrow Y$, we can obtain the value of the test statistic in each of the bootstrap samples; let $\theta_i$ denote the value in sample $D_i$. From these values we can generate the empirical
cumulative distribution function for the statistic
$$\hat{F}(x) = \hat{P}(\Theta \leq x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \leq x)$$
where $I$ is an indicator variable that takes on the value 1 when its argument is true, and 0 otherwise. Given a desired confidence level $\alpha$ (e.g., $\alpha = 0.95$) we can compute the interval for the test statistic by discarding values from the tail ends of $\hat{F}$ on both sides that encompass $(1-\alpha)/2$ of the probability mass. Formally, let $v_t$ denote the critical value such that $\hat{F}(v_t) = t$, which can be obtained from the quantile function as $v_t = \hat{F}^{-1}(t)$. We then have
$$P\big(\Theta \in [v_{(1-\alpha)/2}, v_{(1+\alpha)/2}]\big) = \hat{F}\big((1+\alpha)/2\big) - \hat{F}\big((1-\alpha)/2\big) = (1+\alpha)/2 - (1-\alpha)/2 = \alpha$$
Thus, the $\alpha$% confidence interval for the chosen test statistic $\Theta$ is $[v_{(1-\alpha)/2}, v_{(1+\alpha)/2}]$.
The pseudo-code for bootstrap sampling for estimating the confidence interval is shown in Algorithm 12.2.
ALGORITHM 12.2. Bootstrap Resampling Method

BOOTSTRAP-CONFIDENCEINTERVAL (X, α, k, D):
1   for i ∈ [1, k] do
2       D_i ← sample of size n with replacement from D
3       θ_i ← compute test statistic for X on D_i
4   F̂(x) = P̂(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θ_i ≤ x)
5   v_{(1−α)/2} = F̂⁻¹((1 − α)/2)
6   v_{(1+α)/2} = F̂⁻¹((1 + α)/2)
7   return [v_{(1−α)/2}, v_{(1+α)/2}]
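The bootstrap loop is equally direct. The sketch below approximates the quantile function by indexing into the sorted statistic values; D is a list of transactions and statistic any function of a sample.

```python
# Bootstrap confidence interval for an arbitrary test statistic, following
# Algorithm 12.2; quantiles are approximated from the sorted sample values.

import random

def bootstrap_ci(D, statistic, alpha=0.9, k=100):
    n = len(D)
    thetas = sorted(statistic([random.choice(D) for _ in range(n)])
                    for _ in range(k))
    lo = thetas[int(((1 - alpha) / 2) * k)]          # v_{(1-alpha)/2}
    hi = thetas[min(int(((1 + alpha) / 2) * k), k - 1)]  # v_{(1+alpha)/2}
    return lo, hi
```

For instance, with statistic set to the relative support of a fixed itemset, this reproduces the kind of interval computed in Example 12.21.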
Example 12.21. Let the relative support $rsup$ be the test statistic. Consider the itemset $X = \{sw_1, pl_3, pw_3, cl_3\}$, which has relative support $rsup(X, D) = 0.113$ (or $sup(X, D) = 17$) in the Iris dataset.

Using $k = 100$ bootstrap samples, we first compute the relative support of $X$ in each of the samples ($rsup(X, D_i)$). The empirical probability mass function for the relative support of $X$ is shown in Figure 12.6 and the corresponding empirical cumulative distribution is shown in Figure 12.7. Let the confidence level be $\alpha = 0.9$. To obtain the confidence interval we have to discard the values that account for 0.05 of the probability mass at both ends of the relative support values. The critical values
Figure 12.6. Empirical PMF for relative support.

Figure 12.7. Empirical cumulative distribution for relative support (with the critical values $v_{0.05}$ and $v_{0.95}$ marked).
at the left and right ends are as follows:
$$v_{(1-\alpha)/2} = v_{0.05} = 0.073 \qquad v_{(1+\alpha)/2} = v_{0.95} = 0.16$$
Thus, the 90% confidence interval for the relative support of $X$ is $[0.073, 0.16]$, which corresponds to the interval $[11, 24]$ for its absolute support. Note that the relative support of $X$ in the input dataset is 0.113, which has $p\text{-value}(0.113) = 0.45$, and the expected relative support value of $X$ is $\mu_{rsup} = 0.115$.
12.3 FURTHER READING
Reviews of various measures for rule and pattern interestingness appear in Tan, Kumar, and Srivastava (2002); Geng and Hamilton (2006); and Lallich, Teytaud, and Prudhomme (2007). Randomization and resampling methods for significance testing and confidence intervals are described in Megiddo and Srikant (1998) and Gionis et al. (2007). Statistical testing and validation approaches also appear in Webb (2006) and Lallich, Teytaud, and Prudhomme (2007).

Geng, L. and Hamilton, H. J. (2006). "Interestingness measures for data mining: A survey." ACM Computing Surveys, 38(3): 9.
Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. (2007). "Assessing data mining results via swap randomization." ACM Transactions on Knowledge Discovery from Data, 1(3): 14.
Lallich, S., Teytaud, O., and Prudhomme, E. (2007). "Association rule interestingness: measure and statistical validation." In Quality Measures in Data Mining, pp. 251-275. New York: Springer Science+Business Media.
Megiddo, N. and Srikant, R. (1998). "Discovering predictive association rules." In Proceedings of the 4th International Conference on Knowledge Discovery in Databases and Data Mining, pp. 274-278.
Tan, P.-N., Kumar, V., and Srivastava, J. (2002). "Selecting the right interestingness measure for association patterns." In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 32-41.
Webb, G. I. (2006). "Discovering significant rules." In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 434-443.
12.4 EXERCISES
Q1. Show that if $X$ and $Y$ are independent, then $conv(X \longrightarrow Y) = 1$.

Q2. Show that if $X$ and $Y$ are independent, then $oddsratio(X \longrightarrow Y) = 1$.

Q3. Show that for a frequent itemset $X$, the value of the relative lift statistic defined in Example 12.20 lies in the range
$$\big[\, 1 - |D|/minsup, \; 1 \,\big]$$
Q4. Prove that all subsets of a minimal generator must themselves be minimal generators.

Q5. Let $D$ be a binary database spanning one billion ($10^9$) transactions. Because it is too time consuming to mine it directly, we use Monte Carlo sampling to find the bounds on the frequency of a given itemset $X$. We run 200 sampling trials $D_i$ ($i = 1 \ldots 200$), with each sample of size 100,000, and we obtain the support values for $X$ in the various samples, as shown in Table 12.20. The table shows the number of samples where the support of the itemset was a given value. For instance, in 5 samples its support was 10,000. Answer the following questions:
Table 12.20. Data for Q5

    Support    No. of samples
    10,000     5
    15,000     20
    20,000     40
    25,000     50
    30,000     20
    35,000     50
    40,000     5
    45,000     10
(a) Draw a histogram for the table, and calculate the mean and variance of the support across the different samples.
(b) Find the lower and upper bound on the support of $X$ at the 95% confidence level. The support values given should be for the entire database $D$.
(c) Assume that $minsup = 0.25$, and let the observed support of $X$ in a sample be $sup(X) = 32{,}500$. Set up a hypothesis testing framework to check if the support of $X$ is significantly higher than the $minsup$ value. What is the $p$-value?
Q6. Let $A$ and $B$ be two binary attributes. While mining association rules at 30% minimum support and 60% minimum confidence, the following rule was mined: $A \longrightarrow B$, with $sup = 0.4$, and $conf = 0.66$. Assume that there are a total of 10,000 customers, and that 4000 of them buy both $A$ and $B$; 2000 buy $A$ but not $B$, 3500 buy $B$ but not $A$, and 500 buy neither $A$ nor $B$.

Compute the dependence between $A$ and $B$ via the $\chi^2$-statistic from the corresponding contingency table. Do you think the discovered association is truly a strong rule, that is, does $A$ predict $B$ strongly? Set up a hypothesis testing framework, writing down the null and alternate hypotheses, to answer the above question, at the 95% confidence level. Here are some values of the chi-squared statistic for the 95% confidence level for various degrees of freedom (df):

    df    χ²
    1     3.84
    2     5.99
    3     7.82
    4     9.49
    5     11.07
    6     12.59
PART THREE: CLUSTERING
CHAPTER 13
Representative-based Clustering
Given a dataset with $n$ points in a $d$-dimensional space, $D = \{x_i\}_{i=1}^{n}$, and given the number of desired clusters $k$, the goal of representative-based clustering is to partition the dataset into $k$ groups or clusters, which is called a clustering and is denoted as $C = \{C_1, C_2, \ldots, C_k\}$. Further, for each cluster $C_i$ there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) $\mu_i$ of all points in the cluster, that is,
$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$
where $n_i = |C_i|$ is the number of points in cluster $C_i$.
A brute-force or exhaustive algorithm for finding a good clustering is simply to generate all possible partitions of $n$ points into $k$ clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score. The exact number of ways of partitioning $n$ points into $k$ nonempty and disjoint parts is given by the Stirling numbers of the second kind, given as
$$S(n, k) = \frac{1}{k!} \sum_{t=0}^{k} (-1)^t \binom{k}{t} (k - t)^n$$
Informally, each point can be assigned to any one of the $k$ clusters, so there are at most $k^n$ possible clusterings. However, any permutation of the $k$ clusters within a given clustering yields an equivalent clustering; therefore, there are $O(k^n / k!)$ clusterings of $n$ points into $k$ groups. It is clear that exhaustive enumeration and scoring of all possible clusterings is not practically feasible. In this chapter we describe two approaches for representative-based clustering, namely the K-means and expectation-maximization algorithms.
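As a quick illustration of how fast the Stirling numbers grow, the formula can be evaluated exactly with integer arithmetic:

```python
# Stirling number of the second kind: the number of ways to partition
# n points into k nonempty, disjoint groups.

from math import comb, factorial

def stirling2(n, k):
    total = sum((-1)**t * comb(k, t) * (k - t)**n for t in range(k + 1))
    return total // factorial(k)    # the sum is always divisible by k!

print(stirling2(4, 2))    # 7 ways to split 4 points into 2 clusters
print(stirling2(20, 3))   # already 580606446
```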
13.1 K-MEANS ALGORITHM

Given a clustering $C = \{C_1, C_2, \ldots, C_k\}$ we need some scoring function that evaluates its quality or goodness. The sum of squared errors scoring function is defined as
$$SSE(C) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \qquad (13.1)$$
The goal is to find the clustering that minimizes the SSE score:
$$C^* = \arg\min_{C} \{ SSE(C) \}$$
K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective [Eq. (13.1)]. As such it can converge to a local optimum instead of the globally optimal clustering.
K-means initializes the cluster means by randomly generating $k$ points in the data space. This is typically done by generating a value uniformly at random within the range for each dimension. Each iteration of K-means consists of two steps: (1) cluster assignment, and (2) centroid update. Given the $k$ cluster means, in the cluster assignment step, each point $x_j \in D$ is assigned to the closest mean, which induces a clustering, with each cluster $C_i$ comprising points that are closer to $\mu_i$ than to any other cluster mean. That is, each point $x_j$ is assigned to cluster $C_{j^*}$, where
$$j^* = \arg\min_{i=1}^{k} \big\{ \| x_j - \mu_i \|^2 \big\} \qquad (13.2)$$
Given a set of clusters $C_i$, $i = 1, \ldots, k$, in the centroid update step, new mean values are computed for each cluster from the points in $C_i$. The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum. Practically speaking, one can assume that K-means has converged if the centroids do not change from one iteration to the next. For instance, we can stop if $\sum_{i=1}^{k} \| \mu_i^t - \mu_i^{t-1} \|^2 \leq \epsilon$, where $\epsilon > 0$ is the convergence threshold, $t$ denotes the current iteration, and $\mu_i^t$ denotes the mean for cluster $C_i$ in iteration $t$.
The pseudo-code for K-means is given in Algorithm 13.1. Because the method
starts with a random guess for the initial centroids, K-means is typically run several
times, and the run with the lowest SSE value is chosen to report the final clustering. It
is also worth noting that K-means generatesconvex-shapedclusters because the region
in the data space corresponding to each cluster can be obtained as the intersection of
half-spaces resulting from hyperplanes that bisect and are normal to the line segments
that join pairs of cluster centroids.
In terms of the computational complexity of K-means, we can see that the cluster assignment step takes $O(nkd)$ time because for each of the $n$ points we have to compute its distance to each of the $k$ clusters, which takes $d$ operations in $d$ dimensions. The centroid re-computation step takes $O(nd)$ time because we have to add a total of $n$ $d$-dimensional points. Assuming that there are $t$ iterations, the total time for K-means is given as $O(tnkd)$. In terms of the I/O cost it requires $O(t)$ full database scans, because we have to read the entire database in each iteration.
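As a concrete illustration of Algorithm 13.1 (given later in this section), the following NumPy sketch implements the two steps described above; the function name and the guard for empty clusters are our own additions, not part of the book's pseudo-code:

    import numpy as np

    def kmeans(D, k, eps=1e-6, rng=np.random.default_rng(0)):
        """K-means: iterate cluster assignment and centroid update."""
        n, d = D.shape
        # Initialize centroids uniformly at random within each dimension's range
        mu = rng.uniform(D.min(axis=0), D.max(axis=0), size=(k, d))
        while True:
            # Cluster assignment step: each point goes to its closest centroid
            dist = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
            assign = dist.argmin(axis=1)
            # Centroid update step: mean of the points assigned to each cluster
            # (an empty cluster keeps its old centroid -- a practical guard)
            mu_new = np.array([D[assign == i].mean(axis=0)
                               if np.any(assign == i) else mu[i]
                               for i in range(k)])
            if ((mu_new - mu) ** 2).sum() <= eps:   # convergence test
                return assign, mu_new
            mu = mu_new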
Example 13.1. Consider the one-dimensional data shown in Figure 13.1a. Assume that we want to cluster the data into $k = 2$ groups. Let the initial centroids be $\mu_1 = 2$ and $\mu_2 = 4$. In the first iteration, we first compute the clusters, assigning each point
ALGORITHM 13.1. K-means Algorithm

K-MEANS(D, k, ε):
  t = 0
  Randomly initialize k centroids: μ_1^t, μ_2^t, ..., μ_k^t ∈ R^d
  repeat
    t ← t + 1
    C_j ← ∅ for all j = 1, ..., k
    // Cluster Assignment Step
    foreach x_j ∈ D do
      j* ← argmin_i { ||x_j − μ_i^t||^2 }   // Assign x_j to closest centroid
      C_{j*} ← C_{j*} ∪ {x_j}
    // Centroid Update Step
    foreach i = 1 to k do
      μ_i^t ← (1/|C_i|) Σ_{x_j ∈ C_i} x_j
  until Σ_{i=1}^k ||μ_i^t − μ_i^{t−1}||^2 ≤ ε
to the closest mean, to obtain

$$C_1 = \{2, 3\} \qquad C_2 = \{4, 10, 11, 12, 20, 25, 30\}$$
We next update the means as follows:

$$\mu_1 = \frac{2+3}{2} = \frac{5}{2} = 2.5$$
$$\mu_2 = \frac{4+10+11+12+20+25+30}{7} = \frac{112}{7} = 16$$
The new centroids and clusters after the first iteration are shown in Figure 13.1b. For the second step, we repeat the cluster assignment and centroid update steps, as shown in Figure 13.1c, to obtain the new clusters:

$$C_1 = \{2, 3, 4\} \qquad C_2 = \{10, 11, 12, 20, 25, 30\}$$
and the new means:

$$\mu_1 = \frac{2+3+4}{3} = \frac{9}{3} = 3$$
$$\mu_2 = \frac{10+11+12+20+25+30}{6} = \frac{108}{6} = 18$$
The complete process until convergence is illustrated in Figure 13.1. The final clusters are given as

$$C_1 = \{2, 3, 4, 10, 11, 12\} \qquad C_2 = \{20, 25, 30\}$$

with representatives $\mu_1 = 7$ and $\mu_2 = 25$.
[Figure 13.1. K-means in one dimension. Panels show the dataset {2, 3, 4, 10, 11, 12, 20, 25, 30} at each stage: (a) initial dataset; (b) iteration t = 1 (μ1 = 2, μ2 = 4); (c) iteration t = 2 (μ1 = 2.5, μ2 = 16); (d) iteration t = 3 (μ1 = 3, μ2 = 18); (e) iteration t = 4 (μ1 = 4.75, μ2 = 19.60); (f) iteration t = 5, converged (μ1 = 7, μ2 = 25).]
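The run in Example 13.1 can be reproduced with a few lines of NumPy by fixing the initial centroids instead of drawing them at random (a sketch under the same setup as the example):

    import numpy as np

    D = np.array([2., 3., 4., 10., 11., 12., 20., 25., 30.])
    mu = np.array([2., 4.])                      # initial centroids
    for _ in range(10):                          # a few iterations suffice
        assign = np.abs(D[:, None] - mu[None, :]).argmin(axis=1)
        mu = np.array([D[assign == i].mean() for i in range(2)])
    print(assign)   # first six points in cluster 0, last three in cluster 1
    print(mu)       # [ 7. 25.], matching Figure 13.1f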
Example 13.2 (K-means in Two Dimensions). In Figure 13.2 we illustrate the K-means algorithm on the Iris dataset, using the first two principal components as the two dimensions. Iris has $n = 150$ points, and we want to find $k = 3$ clusters, corresponding to the three types of Irises. A random initialization of the cluster means yields

$$\mu_1 = (-0.98, -1.24)^T \qquad \mu_2 = (-2.96, 1.16)^T \qquad \mu_3 = (-1.69, -0.80)^T$$
as shown in Figure 13.2a. With these initial clusters, K-means takes eight iterations to converge. Figure 13.2b shows the clusters and their means after one iteration:

$$\mu_1 = (1.56, -0.08)^T \qquad \mu_2 = (-2.86, 0.53)^T \qquad \mu_3 = (-1.50, -0.05)^T$$
Finally, Figure 13.2c shows the clusters on convergence. The final means are as follows:

$$\mu_1 = (2.64, 0.19)^T \qquad \mu_2 = (-2.35, 0.27)^T \qquad \mu_3 = (-0.66, -0.33)^T$$
[Figure 13.2. K-means in two dimensions: Iris principal components dataset, plotted over the principal components u1 and u2. Panels: (a) random initialization, t = 0; (b) iteration t = 1; (c) iteration t = 8 (converged).]
Figure 13.2 shows the cluster means as black points, and shows the convex regions of data space that correspond to each of the three clusters. The dashed lines (hyperplanes) are the perpendicular bisectors of the line segments joining two cluster centers. The resulting convex partition of the points comprises the clustering.

Figure 13.2c shows the final three clusters: $C_1$ as circles, $C_2$ as squares, and $C_3$ as triangles. White points indicate a wrong grouping when compared to the known Iris types. Thus, we can see that $C_1$ perfectly corresponds to iris-setosa, and the majority of the points in $C_2$ correspond to iris-virginica, and in $C_3$ to iris-versicolor. For example, three points (white squares) of type iris-versicolor are wrongly clustered in $C_2$, and 14 points from iris-virginica are wrongly clustered in $C_3$ (white triangles). Of course, because the Iris class label is not used in clustering, it is reasonable to expect that we will not obtain a perfect clustering.
13.2 KERNEL K-MEANS
In K-means, the separating boundary between clusters is linear. Kernel K-means
allows one to extract nonlinear boundaries between clusters via the use of the kernel
trick outlined in Chapter 5. This way the method can be used to detect nonconvex
clusters.
In kernel K-means, the main idea is to conceptually map a data point $\mathbf{x}_i$ in input space to a point $\phi(\mathbf{x}_i)$ in some high-dimensional feature space, via an appropriate nonlinear mapping $\phi$. However, the kernel trick allows us to carry out the clustering in feature space purely in terms of the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$, which can be computed in input space, but corresponds to a dot (or inner) product $\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ in feature space.
Assume for the moment that all points $\mathbf{x}_i \in \mathbf{D}$ have been mapped to their corresponding images $\phi(\mathbf{x}_i)$ in feature space. Let $\mathbf{K} = \{K(\mathbf{x}_i, \mathbf{x}_j)\}_{i,j=1,\ldots,n}$ denote the $n \times n$ symmetric kernel matrix, where $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$. Let $\{C_1, \ldots, C_k\}$ specify the partitioning of the $n$ points into $k$ clusters, and let the corresponding cluster means in feature space be given as $\{\mu_1^\phi, \ldots, \mu_k^\phi\}$, where

$$\mu_i^\phi = \frac{1}{n_i}\sum_{\mathbf{x}_j \in C_i} \phi(\mathbf{x}_j)$$

is the mean of cluster $C_i$ in feature space, with $n_i = |C_i|$.
In feature space, the kernel K-means sum of squared errors objective can be written as

$$\min_{\mathcal{C}} \mathrm{SSE}(\mathcal{C}) = \sum_{i=1}^{k}\sum_{\mathbf{x}_j \in C_i} \|\phi(\mathbf{x}_j) - \mu_i^\phi\|^2$$
Expanding the kernel SSE objective in terms of the kernel function, we get

$$
\begin{aligned}
\mathrm{SSE}(\mathcal{C}) &= \sum_{i=1}^{k}\sum_{\mathbf{x}_j \in C_i} \|\phi(\mathbf{x}_j) - \mu_i^\phi\|^2 \\
&= \sum_{i=1}^{k}\sum_{\mathbf{x}_j \in C_i} \Big( \|\phi(\mathbf{x}_j)\|^2 - 2\,\phi(\mathbf{x}_j)^T\mu_i^\phi + \|\mu_i^\phi\|^2 \Big) \\
&= \sum_{i=1}^{k} \left( \sum_{\mathbf{x}_j \in C_i} \|\phi(\mathbf{x}_j)\|^2 - 2 n_i \Big(\frac{1}{n_i}\sum_{\mathbf{x}_j \in C_i}\phi(\mathbf{x}_j)\Big)^T \mu_i^\phi + n_i\|\mu_i^\phi\|^2 \right) \\
&= \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in C_i} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_j) - \sum_{i=1}^{k} n_i\|\mu_i^\phi\|^2 \\
&= \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in C_i} K(\mathbf{x}_j, \mathbf{x}_j) - \sum_{i=1}^{k} \frac{1}{n_i}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i} K(\mathbf{x}_a, \mathbf{x}_b) \\
&= \sum_{j=1}^{n} K(\mathbf{x}_j, \mathbf{x}_j) - \sum_{i=1}^{k} \frac{1}{n_i}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i} K(\mathbf{x}_a, \mathbf{x}_b)
\end{aligned}
\tag{13.3}
$$
Thus, the kernel K-means SSE objective function can be expressed purely in terms of the kernel function. Like K-means, to minimize the SSE objective we adopt a greedy iterative approach. The basic idea is to assign each point to the closest mean in feature space, resulting in a new clustering, which in turn can be used to obtain new estimates for the cluster means. However, the main difficulty is that we cannot explicitly compute the mean of each cluster in feature space. Fortunately, explicitly obtaining the cluster means is not required; all operations can be carried out in terms of the kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$.
Consider the distance of a point $\phi(\mathbf{x}_j)$ to the mean $\mu_i^\phi$ in feature space, which can be computed as

$$
\begin{aligned}
\|\phi(\mathbf{x}_j) - \mu_i^\phi\|^2 &= \|\phi(\mathbf{x}_j)\|^2 - 2\,\phi(\mathbf{x}_j)^T\mu_i^\phi + \|\mu_i^\phi\|^2 \\
&= \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_j) - \frac{2}{n_i}\sum_{\mathbf{x}_a \in C_i}\phi(\mathbf{x}_j)^T\phi(\mathbf{x}_a) + \frac{1}{n_i^2}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i}\phi(\mathbf{x}_a)^T\phi(\mathbf{x}_b) \\
&= K(\mathbf{x}_j, \mathbf{x}_j) - \frac{2}{n_i}\sum_{\mathbf{x}_a \in C_i} K(\mathbf{x}_a, \mathbf{x}_j) + \frac{1}{n_i^2}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i} K(\mathbf{x}_a, \mathbf{x}_b)
\end{aligned}
\tag{13.4}
$$
Thus, the distance of a point to a cluster mean in feature space can be computed using only kernel operations. In the cluster assignment step of kernel K-means, we assign a point to the closest cluster mean as follows:

$$
\begin{aligned}
C^*(\mathbf{x}_j) &= \arg\min_i \left\{ \|\phi(\mathbf{x}_j) - \mu_i^\phi\|^2 \right\} \\
&= \arg\min_i \left\{ K(\mathbf{x}_j,\mathbf{x}_j) - \frac{2}{n_i}\sum_{\mathbf{x}_a \in C_i} K(\mathbf{x}_a, \mathbf{x}_j) + \frac{1}{n_i^2}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i} K(\mathbf{x}_a, \mathbf{x}_b) \right\} \\
&= \arg\min_i \left\{ \frac{1}{n_i^2}\sum_{\mathbf{x}_a \in C_i}\sum_{\mathbf{x}_b \in C_i} K(\mathbf{x}_a, \mathbf{x}_b) - \frac{2}{n_i}\sum_{\mathbf{x}_a \in C_i} K(\mathbf{x}_a, \mathbf{x}_j) \right\}
\end{aligned}
\tag{13.5}
$$
where we drop the $K(\mathbf{x}_j, \mathbf{x}_j)$ term because it remains the same for all $k$ clusters and does not impact the cluster assignment decision. Also note that the first term is simply the average pairwise kernel value for cluster $C_i$ and is independent of the point $\mathbf{x}_j$. It is in fact the squared norm of the cluster mean in feature space. The second term is twice the average kernel value for points in $C_i$ with respect to $\mathbf{x}_j$.
Algorithm 13.2 shows the pseudo-code for the kernel K-means method. It starts from an initial random partitioning of the points into $k$ clusters. It then iteratively updates the cluster assignments by reassigning each point to the closest mean in feature space via Eq. (13.5). To facilitate the distance computation, it first computes the average kernel value, that is, the squared norm of the cluster mean, for each cluster (for loop in line 5). Next, it computes the average kernel value for each point $\mathbf{x}_j$ with points in cluster $C_i$ (for loop in line 7). The main cluster assignment step uses these values to compute the distance of $\mathbf{x}_j$ from each of the clusters $C_i$ and assigns $\mathbf{x}_j$ to the closest mean. This reassignment information is used to re-partition the points into a new set of clusters. That is, all points $\mathbf{x}_j$ that are closer to the mean for $C_i$ make up the new cluster for the next iteration. This iterative process is repeated until convergence.

For convergence testing, we check if there is any change in the cluster assignments of the points. The number of points that do not change clusters is given as the sum $\sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|$, where $t$ specifies the current iteration. The fraction of points
ALGORITHM 13.2. Kernel K-means Algorithm

KERNEL-KMEANS(K, k, ε):
1   t ← 0
2   C^t ← {C_1^t, ..., C_k^t}   // Randomly partition points into k clusters
3   repeat
4     t ← t + 1
5     foreach C_i ∈ C^{t−1} do   // Compute squared norm of cluster means
6       sqnorm_i ← (1/n_i^2) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)
7     foreach x_j ∈ D do   // Average kernel value for x_j and C_i
8       foreach C_i ∈ C^{t−1} do
9         avg_{ji} ← (1/n_i) Σ_{x_a ∈ C_i} K(x_a, x_j)
      // Find closest cluster for each point
10    foreach x_j ∈ D do
11      foreach C_i ∈ C^{t−1} do
12        d(x_j, C_i) ← sqnorm_i − 2·avg_{ji}
13      j* ← argmin_i { d(x_j, C_i) }
14      C_{j*}^t ← C_{j*}^t ∪ {x_j}   // Cluster reassignment
15    C^t ← {C_1^t, ..., C_k^t}
16  until 1 − (1/n) Σ_{i=1}^k |C_i^t ∩ C_i^{t−1}| ≤ ε
[Figure 13.3. Kernel K-means: linear versus Gaussian kernel, over attributes X1 and X2. Panels: (a) linear kernel, t = 5 iterations; (b) Gaussian kernel, t = 4 iterations.]
reassigned to a different cluster in the current iteration is given as

$$\frac{n - \sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|}{n} = 1 - \frac{1}{n}\sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|$$

Kernel K-means stops when the fraction of points with new cluster assignments falls below some threshold $\epsilon \ge 0$. For example, one can iterate until no points change clusters.
Computational Complexity

Computing the average kernel value for each cluster $C_i$ takes time $O(n^2)$ across all clusters. Computing the average kernel value of each point with respect to each of the $k$ clusters also takes $O(n^2)$ time. Finally, computing the closest mean for each point and cluster reassignment takes $O(kn)$ time. The total computational complexity of kernel K-means is thus $O(tn^2)$, where $t$ is the number of iterations until convergence. The I/O complexity is $O(t)$ scans of the kernel matrix $\mathbf{K}$.
Example 13.3. Figure 13.3 shows an application of the kernel K-means approach on a synthetic dataset with three embedded clusters. Each cluster has 100 points, for a total of $n = 300$ points in the dataset.

Using the linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$ is equivalent to the K-means algorithm because in this case Eq. (13.5) is the same as Eq. (13.2). Figure 13.3a shows the resulting clusters; points in $C_1$ are shown as squares, in $C_2$ as triangles, and in $C_3$ as circles. We can see that K-means is not able to separate the three clusters due to the presence of the parabolic shaped cluster. The white points are those that are wrongly clustered, comparing with the ground truth in terms of the generated cluster labels.

Using the Gaussian kernel

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left\{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right\}$$

from Eq. (5.10), with $\sigma = 1.5$, results in a near-perfect clustering, as shown in Figure 13.3b. Only four points (white triangles) are grouped incorrectly with cluster $C_2$, whereas they should belong to cluster $C_1$. We can see from this example that kernel K-means is able to handle nonlinear cluster boundaries. One caveat is that the value of the spread parameter $\sigma$ has to be set by trial and error.
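To reproduce a setting like Example 13.3 one only needs the kernel matrix; a minimal sketch of the Gaussian kernel construction follows (the data X is a stand-in, since the synthetic dataset itself is not listed in the text):

    import numpy as np

    def gaussian_kernel_matrix(X, sigma=1.5):
        """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    # e.g., with the kernel_kmeans sketch above:
    # assign = kernel_kmeans(gaussian_kernel_matrix(X, sigma=1.5), k=3)
    # The linear kernel, K = X @ X.T, would instead recover ordinary K-means.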
13.3 EXPECTATION-MAXIMIZATION CLUSTERING
The K-means approach is an example of a hard assignment clustering, where each point can belong to only one cluster. We now generalize the approach to consider soft assignment of points to clusters, so that each point has a probability of belonging to each cluster.

Let $\mathbf{D}$ consist of $n$ points $\mathbf{x}_j$ in $d$-dimensional space $\mathbb{R}^d$. Let $X_a$ denote the random variable corresponding to the $a$th attribute. We also use $X_a$ to denote the $a$th column vector, corresponding to the $n$ data samples from $X_a$. Let $\mathbf{X} = (X_1, X_2, \ldots, X_d)$ denote the vector random variable across the $d$ attributes, with $\mathbf{x}_j$ being a data sample from $\mathbf{X}$.
Gaussian Mixture Model

We assume that each cluster $C_i$ is characterized by a multivariate normal distribution, that is,

$$f_i(\mathbf{x}) = f(\mathbf{x}|\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left\{-\frac{(\mathbf{x} - \mu_i)^T\Sigma_i^{-1}(\mathbf{x} - \mu_i)}{2}\right\} \tag{13.6}$$
where the cluster mean $\mu_i \in \mathbb{R}^d$ and covariance matrix $\Sigma_i \in \mathbb{R}^{d\times d}$ are both unknown parameters. $f_i(\mathbf{x})$ is the probability density at $\mathbf{x}$ attributable to cluster $C_i$. We assume that the probability density function of $\mathbf{X}$ is given as a Gaussian mixture model over all the $k$ cluster normals, defined as

$$f(\mathbf{x}) = \sum_{i=1}^{k} f_i(\mathbf{x})P(C_i) = \sum_{i=1}^{k} f(\mathbf{x}|\mu_i, \Sigma_i)P(C_i) \tag{13.7}$$
(13.7)
where the prior probabilities
P(
C
i
)
are called the
mixture parameters
, which must
satisfy the condition
k
i
=
1
P(
C
i
)
=
1
The Gaussian mixture model is thus characterized by the mean $\mu_i$, the covariance matrix $\Sigma_i$, and the mixture probability $P(C_i)$ for each of the $k$ normal distributions. We write the set of all the model parameters compactly as

$$\theta = \{\mu_1, \Sigma_1, P(C_1), \ldots, \mu_k, \Sigma_k, P(C_k)\}$$
Maximum Likelihood Estimation

Given the dataset $\mathbf{D}$, we define the likelihood of $\theta$ as the conditional probability of the data $\mathbf{D}$ given the model parameters $\theta$, denoted as $P(\mathbf{D}|\theta)$. Because each of the $n$ points $\mathbf{x}_j$ is considered to be a random sample from $\mathbf{X}$ (i.e., independent and identically distributed as $\mathbf{X}$), the likelihood of $\theta$ is given as

$$P(\mathbf{D}|\theta) = \prod_{j=1}^{n} f(\mathbf{x}_j)$$
The goal of maximum likelihood estimation (MLE) is to choose the parameters $\theta$ that maximize the likelihood, that is,

$$\theta^* = \arg\max_{\theta}\{P(\mathbf{D}|\theta)\}$$

It is typical to maximize the log of the likelihood function because it turns the product over the points into a summation and the maximum value of the likelihood and log-likelihood coincide. That is, MLE maximizes

$$\theta^* = \arg\max_{\theta}\{\ln P(\mathbf{D}|\theta)\}$$
where the log-likelihood function is given as

$$\ln P(\mathbf{D}|\theta) = \sum_{j=1}^{n}\ln f(\mathbf{x}_j) = \sum_{j=1}^{n}\ln\left(\sum_{i=1}^{k} f(\mathbf{x}_j|\mu_i, \Sigma_i)P(C_i)\right) \tag{13.8}$$
Directly maximizing the log-likelihood over $\theta$ is hard. Instead, we can use the expectation-maximization (EM) approach for finding the maximum likelihood estimates for the parameters $\theta$. EM is a two-step iterative approach that starts from an initial guess for the parameters $\theta$. Given the current estimates for $\theta$, in the expectation step EM computes the cluster posterior probabilities $P(C_i|\mathbf{x}_j)$ via the Bayes theorem:

$$P(C_i|\mathbf{x}_j) = \frac{P(C_i \text{ and } \mathbf{x}_j)}{P(\mathbf{x}_j)} = \frac{P(\mathbf{x}_j|C_i)P(C_i)}{\sum_{a=1}^{k} P(\mathbf{x}_j|C_a)P(C_a)}$$
Because each cluster is modeled as a multivariate normal distribution [Eq. (13.6)], the probability of $\mathbf{x}_j$ given cluster $C_i$ can be obtained by considering a small interval $\epsilon > 0$ centered at $\mathbf{x}_j$, as follows:

$$P(\mathbf{x}_j|C_i) \simeq 2\epsilon\cdot f(\mathbf{x}_j|\mu_i, \Sigma_i) = 2\epsilon\cdot f_i(\mathbf{x}_j)$$
The posterior probability of $C_i$ given $\mathbf{x}_j$ is thus given as

$$P(C_i|\mathbf{x}_j) = \frac{f_i(\mathbf{x}_j)\cdot P(C_i)}{\sum_{a=1}^{k} f_a(\mathbf{x}_j)\cdot P(C_a)} \tag{13.9}$$
and $P(C_i|\mathbf{x}_j)$ can be considered as the weight or contribution of the point $\mathbf{x}_j$ to cluster $C_i$. Next, in the maximization step, using the weights $P(C_i|\mathbf{x}_j)$ EM re-estimates $\theta$, that is, it re-estimates the parameters $\mu_i$, $\Sigma_i$, and $P(C_i)$ for each cluster $C_i$. The re-estimated mean is given as the weighted average of all the points, the re-estimated covariance matrix is given as the weighted covariance over all pairs of dimensions, and the re-estimated prior probability for each cluster is given as the fraction of weights that contribute to that cluster. In Section 13.3.3 we formally derive the expressions for the MLE estimates for the cluster parameters, and in Section 13.3.4 we describe the generic EM approach in more detail. We begin with the application of the EM clustering algorithm for the one-dimensional and general $d$-dimensional cases.
13.3.1 EM in One Dimension
Consider a dataset $\mathbf{D}$ consisting of a single attribute $X$, where each point $x_j \in \mathbb{R}$ ($j = 1,\ldots,n$) is a random sample from $X$. For the mixture model [Eq. (13.7)], we use univariate normals for each cluster:

$$f_i(x) = f(x|\mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left\{-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right\}$$

with the cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$. The EM approach consists of three steps:
initialization, expectation step, and maximization step.
Initialization

For each cluster $C_i$, with $i = 1, 2, \ldots, k$, we can randomly initialize the cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$. The mean $\mu_i$ is selected uniformly at random from the range of possible values for $X$. It is typical to assume that the initial variance is given as $\sigma_i^2 = 1$. Finally, the cluster prior probabilities are initialized to $P(C_i) = \frac{1}{k}$, so that each cluster has an equal probability.
Expectation Step

Assume that for each of the $k$ clusters, we have an estimate for the parameters, namely the mean $\mu_i$, variance $\sigma_i^2$, and prior probability $P(C_i)$. Given these values the cluster posterior probabilities are computed using Eq. (13.9):

$$P(C_i|x_j) = \frac{f(x_j|\mu_i, \sigma_i^2)\cdot P(C_i)}{\sum_{a=1}^{k} f(x_j|\mu_a, \sigma_a^2)\cdot P(C_a)}$$
For convenience, we use the notation $w_{ij} = P(C_i|x_j)$, treating the posterior probability as the weight or contribution of the point $x_j$ to cluster $C_i$. Further, let $\mathbf{w}_i = (w_{i1}, \ldots, w_{in})^T$ denote the weight vector for cluster $C_i$ across all the $n$ points.
Maximization Step

Assuming that all the posterior probability values or weights $w_{ij} = P(C_i|x_j)$ are known, the maximization step, as the name implies, computes the maximum likelihood estimates of the cluster parameters by re-estimating $\mu_i$, $\sigma_i^2$, and $P(C_i)$.

The re-estimated value for the cluster mean, $\mu_i$, is computed as the weighted mean of all the points:

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$
In terms of the weight vector $\mathbf{w}_i$ and the attribute vector $X = (x_1, x_2, \ldots, x_n)^T$, we can rewrite the above as

$$\mu_i = \frac{\mathbf{w}_i^T X}{\mathbf{w}_i^T\mathbf{1}}$$
The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

$$\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij}(x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}$$
Let $Z_i = X - \mu_i\mathbf{1} = (x_1 - \mu_i, x_2 - \mu_i, \ldots, x_n - \mu_i)^T = (z_{i1}, z_{i2}, \ldots, z_{in})^T$ be the centered attribute vector for cluster $C_i$, and let $Z_i^s$ be the squared vector given as $Z_i^s = (z_{i1}^2, \ldots, z_{in}^2)^T$. The variance can be expressed compactly in terms of the dot product between the weight vector and the squared centered vector:

$$\sigma_i^2 = \frac{\mathbf{w}_i^T Z_i^s}{\mathbf{w}_i^T\mathbf{1}}$$
Finally, the prior probability of cluster $C_i$ is re-estimated as the fraction of the total weight belonging to $C_i$, computed as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{\sum_{a=1}^{k}\sum_{j=1}^{n} w_{aj}} = \frac{\sum_{j=1}^{n} w_{ij}}{\sum_{j=1}^{n} 1} = \frac{\sum_{j=1}^{n} w_{ij}}{n} \tag{13.10}$$

where we made use of the fact that

$$\sum_{i=1}^{k} w_{ij} = \sum_{i=1}^{k} P(C_i|x_j) = 1$$
In vector notation the prior probability can be written as

$$P(C_i) = \frac{\mathbf{w}_i^T\mathbf{1}}{n}$$
Iteration

Starting from an initial set of values for the cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$ for all $i = 1,\ldots,k$, the EM algorithm applies the expectation step to compute the weights $w_{ij} = P(C_i|x_j)$. These values are then used in the maximization step to compute the updated cluster parameters $\mu_i$, $\sigma_i^2$, and $P(C_i)$. Both the expectation and maximization steps are iteratively applied until convergence, for example, until the means change very little from one iteration to the next.
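A compact NumPy sketch of these three steps for the univariate case follows (the function name and convergence details are our own choices; x is a one-dimensional NumPy array):

    import numpy as np

    def em_1d(x, k, eps=1e-4, rng=np.random.default_rng(0)):
        """EM for a mixture of k univariate normals."""
        n = len(x)
        mu = rng.uniform(x.min(), x.max(), size=k)    # random means
        var = np.ones(k)                              # sigma_i^2 = 1
        prior = np.full(k, 1.0 / k)                   # P(C_i) = 1/k
        while True:
            # Expectation step: w[i, j] = P(C_i | x_j) via Eq. (13.9)
            dens = (prior[:, None]
                    * np.exp(-(x - mu[:, None]) ** 2 / (2 * var[:, None]))
                    / np.sqrt(2 * np.pi * var[:, None]))
            w = dens / dens.sum(axis=0)
            # Maximization step: weighted mean, variance, and prior
            wsum = w.sum(axis=1)
            mu_new = (w * x).sum(axis=1) / wsum
            var = (w * (x - mu_new[:, None]) ** 2).sum(axis=1) / wsum
            prior = wsum / n
            if ((mu_new - mu) ** 2).sum() <= eps:
                return mu_new, var, prior, w
            mu = mu_new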
Example 13.4 (EM in 1D). Figure 13.4 illustrates the EM algorithm on the one-dimensional dataset:

$$x_1 = 1.0 \quad x_2 = 1.3 \quad x_3 = 2.2 \quad x_4 = 2.6 \quad x_5 = 2.8 \quad x_6 = 5.0$$
$$x_7 = 7.3 \quad x_8 = 7.4 \quad x_9 = 7.5 \quad x_{10} = 7.7 \quad x_{11} = 7.9$$
We assume that $k = 2$. The initial random means are shown in Figure 13.4a, with the initial parameters given as

$$\mu_1 = 6.63 \quad \sigma_1^2 = 1 \quad P(C_1) = 0.5$$
$$\mu_2 = 7.57 \quad \sigma_2^2 = 1 \quad P(C_2) = 0.5$$
After repeated expectation and maximization steps, the EM method converges after five iterations. After $t = 1$ (see Figure 13.4b) we have

$$\mu_1 = 3.72 \quad \sigma_1^2 = 6.13 \quad P(C_1) = 0.71$$
$$\mu_2 = 7.4 \quad \sigma_2^2 = 0.69 \quad P(C_2) = 0.29$$
After the final iteration ($t = 5$), as shown in Figure 13.4c, we have

$$\mu_1 = 2.48 \quad \sigma_1^2 = 1.69 \quad P(C_1) = 0.55$$
$$\mu_2 = 7.56 \quad \sigma_2^2 = 0.05 \quad P(C_2) = 0.45$$
One of the main advantages of the EM algorithm over K-means is that it returns the probability $P(C_i|x_j)$ of each cluster $C_i$ for each point $x_j$. However, in this one-dimensional example, these values are essentially binary; assigning each point to the cluster with the highest posterior probability, we obtain the hard clustering

$$C_1 = \{x_1, x_2, x_3, x_4, x_5, x_6\} \text{ (white points)} \qquad C_2 = \{x_7, x_8, x_9, x_{10}, x_{11}\} \text{ (gray points)}$$

as illustrated in Figure 13.4c.
13.3.2 EM in $d$ Dimensions
We now consider the EM method in $d$ dimensions, where each cluster is characterized by a multivariate normal distribution [Eq. (13.6)], with parameters $\mu_i$, $\Sigma_i$, and $P(C_i)$. For each cluster $C_i$, we thus need to estimate the $d$-dimensional mean vector:

$$\mu_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{id})^T$$
[Figure 13.4. EM in one dimension. Panels: (a) initialization, t = 0 (μ1 = 6.63, μ2 = 7.57); (b) iteration t = 1 (μ1 = 3.72, μ2 = 7.4); (c) iteration t = 5, converged (μ1 = 2.48, μ2 = 7.56).]
and the $d\times d$ covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & \sigma_{12}^i & \cdots & \sigma_{1d}^i \\ \sigma_{21}^i & (\sigma_2^i)^2 & \cdots & \sigma_{2d}^i \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1}^i & \sigma_{d2}^i & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$
Because the covariance matrix is symmetric, we have to estimate $\binom{d}{2} = \frac{d(d-1)}{2}$ pairwise covariances and $d$ variances, for a total of $\frac{d(d+1)}{2}$ parameters for $\Sigma_i$. This may be too many parameters for practical purposes because we may not have enough data to estimate all of them reliably. For example, if $d = 100$, then we have to estimate $100\cdot 101/2 = 5050$ parameters! One simplification is to assume that all dimensions are
independent, which leads to a diagonal covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & 0 & \cdots & 0 \\ 0 & (\sigma_2^i)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$

Under the independence assumption we have only $d$ parameters to estimate for the diagonal covariance matrix.
Initialization

For each cluster $C_i$, with $i = 1, 2, \ldots, k$, we randomly initialize the mean $\mu_i$ by selecting a value $\mu_{ia}$ for each dimension $X_a$ uniformly at random from the range of $X_a$. The covariance matrix is initialized as the $d\times d$ identity matrix, $\Sigma_i = \mathbf{I}$. Finally, the cluster prior probabilities are initialized to $P(C_i) = \frac{1}{k}$, so that each cluster has an equal probability.
Expectation Step

In the expectation step, we compute the posterior probability of cluster $C_i$ given point $\mathbf{x}_j$ using Eq. (13.9), with $i = 1,\ldots,k$ and $j = 1,\ldots,n$. As before, we use the shorthand notation $w_{ij} = P(C_i|\mathbf{x}_j)$ to denote the fact that $P(C_i|\mathbf{x}_j)$ can be considered as the weight or contribution of point $\mathbf{x}_j$ to cluster $C_i$, and we use the notation $\mathbf{w}_i = (w_{i1}, w_{i2}, \ldots, w_{in})^T$ to denote the weight vector for cluster $C_i$, across all the $n$ points.
Maximization Step

Given the weights $w_{ij}$, in the maximization step, we re-estimate $\Sigma_i$, $\mu_i$, and $P(C_i)$. The mean $\mu_i$ for cluster $C_i$ can be estimated as

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\cdot\mathbf{x}_j}{\sum_{j=1}^{n} w_{ij}} \tag{13.11}$$
which can be expressed compactly in matrix form as

$$\mu_i = \frac{\mathbf{D}^T\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{1}}$$
Let $\mathbf{Z}_i = \mathbf{D} - \mathbf{1}\cdot\mu_i^T$ be the centered data matrix for cluster $C_i$. Let $\mathbf{z}_{ji} = \mathbf{x}_j - \mu_i \in \mathbb{R}^d$ denote the $j$th centered point in $\mathbf{Z}_i$. We can express $\Sigma_i$ compactly using the outer-product form

$$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij}\,\mathbf{z}_{ji}\mathbf{z}_{ji}^T}{\mathbf{w}_i^T\mathbf{1}} \tag{13.12}$$
Considering the pairwise attribute view, the covariance between dimensions $X_a$ and $X_b$ is estimated as

$$\sigma_{ab}^i = \frac{\sum_{j=1}^{n} w_{ij}(x_{ja} - \mu_{ia})(x_{jb} - \mu_{ib})}{\sum_{j=1}^{n} w_{ij}}$$
where $x_{ja}$ and $\mu_{ia}$ denote the values of the $a$th dimension for $\mathbf{x}_j$ and $\mu_i$, respectively.
Finally, the prior probability $P(C_i)$ for each cluster is the same as in the one-dimensional case [Eq. (13.10)], given as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} = \frac{\mathbf{w}_i^T\mathbf{1}}{n} \tag{13.13}$$
A formal derivation of these re-estimates for $\mu_i$ [Eq. (13.11)], $\Sigma_i$ [Eq. (13.12)], and $P(C_i)$ [Eq. (13.13)] is given in Section 13.3.3.
EM Clustering Algorithm

The pseudo-code for the multivariate EM clustering algorithm is given in Algorithm 13.3. After initialization of $\mu_i$, $\Sigma_i$, and $P(C_i)$ for all $i = 1,\ldots,k$, the expectation and maximization steps are repeated until convergence. For the convergence test, we check whether $\sum_i \|\mu_i^t - \mu_i^{t-1}\|^2 \le \epsilon$, where $\epsilon > 0$ is the convergence threshold, and $t$ denotes the iteration. In words, the iterative process continues until the change in the cluster means becomes very small.
ALGORITHM 13.3. Expectation-Maximization (EM) Algorithm

EXPECTATION-MAXIMIZATION(D, k, ε):
  t ← 0
  // Initialization
  Randomly initialize μ_1^t, ..., μ_k^t
  Σ_i^t ← I, ∀i = 1, ..., k
  P^t(C_i) ← 1/k, ∀i = 1, ..., k
  repeat
    t ← t + 1
    // Expectation Step
    for i = 1, ..., k and j = 1, ..., n do
      w_ij ← f(x_j | μ_i, Σ_i)·P(C_i) / Σ_{a=1}^k f(x_j | μ_a, Σ_a)·P(C_a)
        // posterior probability P^t(C_i | x_j)
    // Maximization Step
    for i = 1, ..., k do
      μ_i^t ← Σ_{j=1}^n w_ij·x_j / Σ_{j=1}^n w_ij                         // re-estimate mean
      Σ_i^t ← Σ_{j=1}^n w_ij (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^n w_ij    // re-estimate covariance matrix
      P^t(C_i) ← Σ_{j=1}^n w_ij / n                                       // re-estimate priors
  until Σ_{i=1}^k ||μ_i^t − μ_i^{t−1}||^2 ≤ ε
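A NumPy/SciPy sketch of Algorithm 13.3 with full covariance matrices is given below; for simplicity it initializes the means with randomly chosen data points rather than uniform draws over each dimension's range, and all names are ours:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_cluster(D, k, eps=1e-3, rng=np.random.default_rng(0)):
        """EM clustering with a full covariance matrix per cluster."""
        n, d = D.shape
        mu = D[rng.choice(n, size=k, replace=False)].astype(float)
        Sigma = np.array([np.eye(d) for _ in range(k)])   # Sigma_i = I
        prior = np.full(k, 1.0 / k)                       # P(C_i) = 1/k
        while True:
            # Expectation step: posterior weights w[i, j] = P(C_i | x_j)
            dens = np.array([prior[i] * multivariate_normal.pdf(D, mu[i], Sigma[i])
                             for i in range(k)])          # shape (k, n)
            w = dens / dens.sum(axis=0)
            # Maximization step: Eqs. (13.11), (13.12), (13.13)
            wsum = w.sum(axis=1)
            mu_new = (w @ D) / wsum[:, None]
            for i in range(k):
                Z = D - mu_new[i]                         # centered data
                Sigma[i] = (w[i, :, None] * Z).T @ Z / wsum[i]
            prior = wsum / n
            if ((mu_new - mu) ** 2).sum() <= eps:
                return mu_new, Sigma, prior, w
            mu = mu_new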
Example 13.5 (EM in 2D). Figure 13.5 illustrates the EM algorithm for the two-dimensional Iris dataset, where the two attributes are its first two principal components. The dataset consists of $n = 150$ points, and EM was run using $k = 3$, with a full covariance matrix for each cluster. The initial cluster parameters are $\Sigma_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $P(C_i) = 1/3$, with the means chosen as

$$\mu_1 = (-3.59, 0.25)^T \qquad \mu_2 = (-1.09, -0.46)^T \qquad \mu_3 = (0.75, 1.07)^T$$
The cluster means (shown in black) and the joint probability density function are shown in Figure 13.5a.

The EM algorithm took 36 iterations to converge (using $\epsilon = 0.001$). An intermediate stage of the clustering is shown in Figure 13.5b, for $t = 1$. Finally at iteration $t = 36$, shown in Figure 13.5c, the three clusters have been correctly identified, with the following parameters:
$$\mu_1 = (-2.02, 0.017)^T \qquad \mu_2 = (-0.51, -0.23)^T \qquad \mu_3 = (2.64, 0.19)^T$$

$$\Sigma_1 = \begin{pmatrix} 0.56 & -0.29 \\ -0.29 & 0.23 \end{pmatrix} \qquad \Sigma_2 = \begin{pmatrix} 0.36 & -0.22 \\ -0.22 & 0.19 \end{pmatrix} \qquad \Sigma_3 = \begin{pmatrix} 0.05 & -0.06 \\ -0.06 & 0.21 \end{pmatrix}$$

$$P(C_1) = 0.36 \qquad P(C_2) = 0.31 \qquad P(C_3) = 0.33$$
To see the effect of a full versus diagonal covariance matrix, we ran the EM algorithm on the Iris principal components dataset under the independence assumption, which took $t = 29$ iterations to converge. The final cluster parameters were
$$\mu_1 = (-2.1, 0.28)^T \qquad \mu_2 = (-0.67, -0.40)^T \qquad \mu_3 = (2.64, 0.19)^T$$

$$\Sigma_1 = \begin{pmatrix} 0.59 & 0 \\ 0 & 0.11 \end{pmatrix} \qquad \Sigma_2 = \begin{pmatrix} 0.49 & 0 \\ 0 & 0.11 \end{pmatrix} \qquad \Sigma_3 = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.21 \end{pmatrix}$$

$$P(C_1) = 0.30 \qquad P(C_2) = 0.37 \qquad P(C_3) = 0.33$$
Figure 13.6b shows the clustering results. Also shown are the contours of the normal density function for each cluster (plotted so that the contours do not intersect). The results for the full covariance matrix are shown in Figure 13.6a, which is a projection of Figure 13.5c onto the 2D plane. Points in $C_1$ are shown as squares, in $C_2$ as triangles, and in $C_3$ as circles.

One can observe that the diagonal assumption leads to axis parallel contours for the normal density, contrasted with the rotated contours for the full covariance matrix. The full matrix yields much better clustering, which can be observed by considering the number of points grouped with the wrong Iris type (the white points). For the full covariance matrix only three points are in the wrong group, whereas for the diagonal covariance matrix 25 points are in the wrong cluster, 15 from iris-virginica (white triangles) and 10 from iris-versicolor (white squares). The points corresponding to iris-setosa are correctly clustered as $C_3$ in both approaches.
[Figure 13.5. EM algorithm in two dimensions: mixture of k = 3 Gaussians over attributes X1 and X2, showing the density f(x). Panels: (a) iteration t = 0; (b) iteration t = 1; (c) iteration t = 36.]
[Figure 13.6. Iris principal components dataset: full versus diagonal covariance matrix. Panels: (a) full covariance matrix (t = 36); (b) diagonal covariance matrix (t = 29).]
Computational Complexity

For the expectation step, to compute the cluster posterior probabilities, we need to invert $\Sigma_i$ and compute its determinant $|\Sigma_i|$, which takes $O(d^3)$ time. Across the $k$ clusters the time is $O(kd^3)$. For the expectation step, evaluating the density $f(\mathbf{x}_j|\mu_i,\Sigma_i)$ takes $O(d^2)$ time, for a total time of $O(knd^2)$ over the $n$ points and $k$ clusters. For the maximization step, the time is dominated by the update for $\Sigma_i$, which takes $O(knd^2)$ time over all $k$ clusters. The computational complexity of the EM method is thus $O(t(kd^3 + nkd^2))$, where $t$ is the number of iterations. If we use a diagonal covariance matrix, then the inverse and determinant of $\Sigma_i$ can be computed in $O(d)$ time. Density computation per point takes $O(d)$ time, so that the time for the expectation step is $O(knd)$. The maximization step also takes $O(knd)$ time to re-estimate $\Sigma_i$. The total time for a diagonal covariance matrix is therefore $O(tnkd)$. The I/O complexity for the EM algorithm is $O(t)$ complete database scans because we read the entire set of points in each iteration.
K-means as Specialization of EM

Although we assumed a normal mixture model for the clusters, the EM approach can be applied with other models for the cluster density distribution $P(\mathbf{x}_j|C_i)$. For instance, K-means can be considered as a special case of the EM algorithm, obtained as follows:

$$P(\mathbf{x}_j|C_i) = \begin{cases} 1 & \text{if } C_i = \arg\min_{C_a}\|\mathbf{x}_j - \mu_a\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Using Eq. (13.9), the posterior probability $P(C_i|\mathbf{x}_j)$ is given as

$$P(C_i|\mathbf{x}_j) = \frac{P(\mathbf{x}_j|C_i)P(C_i)}{\sum_{a=1}^{k} P(\mathbf{x}_j|C_a)P(C_a)}$$

One can see that if $P(\mathbf{x}_j|C_i) = 0$, then $P(C_i|\mathbf{x}_j) = 0$. Otherwise, if $P(\mathbf{x}_j|C_i) = 1$, then $P(\mathbf{x}_j|C_a) = 0$ for all $a \ne i$, and thus $P(C_i|\mathbf{x}_j) = \frac{1\cdot P(C_i)}{1\cdot P(C_i)} = 1$. Putting it all together, the posterior probability is given as

$$P(C_i|\mathbf{x}_j) = \begin{cases} 1 & \text{if } \mathbf{x}_j \in C_i, \text{ i.e., if } C_i = \arg\min_{C_a}\|\mathbf{x}_j - \mu_a\|^2 \\ 0 & \text{otherwise} \end{cases}$$

It is clear that for K-means the cluster parameters are $\mu_i$ and $P(C_i)$; we can ignore the covariance matrix.
13.3.3 Maximum Likelihood Estimation
In this section, we derive the maximum likelihood estimates for the cluster parameters $\mu_i$, $\Sigma_i$, and $P(C_i)$. We do this by taking the derivative of the log-likelihood function with respect to each of these parameters and setting the derivative to zero.
The partial derivative of the log-likelihood function [Eq. (13.8)] with respect to some parameter $\theta_i$ for cluster $C_i$ is given as

$$
\begin{aligned}
\frac{\partial}{\partial\theta_i}\ln P(\mathbf{D}|\theta) &= \frac{\partial}{\partial\theta_i}\left(\sum_{j=1}^{n}\ln f(\mathbf{x}_j)\right) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial f(\mathbf{x}_j)}{\partial\theta_i} \\
&= \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\sum_{a=1}^{k}\frac{\partial}{\partial\theta_i}\Big(f(\mathbf{x}_j|\mu_a,\Sigma_a)P(C_a)\Big) \\
&= \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial}{\partial\theta_i}\Big(f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)\Big)
\end{aligned}
$$

The last step follows from the fact that because $\theta_i$ is a parameter for the $i$th cluster the mixture components for the other clusters are constants with respect to $\theta_i$. Using the
fact that $|\Sigma_i| = \frac{1}{|\Sigma_i^{-1}|}$, the multivariate normal density in Eq. (13.6) can be written as

$$f(\mathbf{x}_j|\mu_i,\Sigma_i) = (2\pi)^{-\frac{d}{2}}\,|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\big\{g(\mu_i,\Sigma_i)\big\} \tag{13.14}$$

where

$$g(\mu_i,\Sigma_i) = -\frac{1}{2}(\mathbf{x}_j - \mu_i)^T\Sigma_i^{-1}(\mathbf{x}_j - \mu_i) \tag{13.15}$$
Thus, the derivative of the log-likelihood function can be written as

$$\frac{\partial}{\partial\theta_i}\ln P(\mathbf{D}|\theta) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial}{\partial\theta_i}\Big((2\pi)^{-\frac{d}{2}}|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\{g(\mu_i,\Sigma_i)\}P(C_i)\Big) \tag{13.16}$$
Below, we make use of the fact that

$$\frac{\partial}{\partial\theta_i}\exp\{g(\mu_i,\Sigma_i)\} = \exp\{g(\mu_i,\Sigma_i)\}\cdot\frac{\partial}{\partial\theta_i}g(\mu_i,\Sigma_i) \tag{13.17}$$
Estimation of Mean

To derive the maximum likelihood estimate for the mean $\mu_i$, we have to take the derivative of the log-likelihood with respect to $\theta_i = \mu_i$. As per Eq. (13.16), the only term involving $\mu_i$ is $\exp\{g(\mu_i,\Sigma_i)\}$. Using the fact that

$$\frac{\partial}{\partial\mu_i}g(\mu_i,\Sigma_i) = \Sigma_i^{-1}(\mathbf{x}_j - \mu_i) \tag{13.18}$$
with respect to
µ
i
is
∂
∂
µ
i
ln
(P(
D
|
θ
))
=
n
j
=
1
1
f(
x
j
)
(
2
π)
−
d
2
|
−
1
i
|
1
2
exp
g(
µ
i
,
i
)
P(
C
i
)
−
1
i
(
x
j
−
µ
i
)
=
n
j
=
1
f(
x
j
|
µ
i
,
i
)P(
C
i
)
f(
x
j
)
·
−
1
i
(
x
j
−
µ
i
)
=
n
j
=
1
w
ij
−
1
i
(
x
j
−
µ
i
)
where we made use of Eqs.(13.14) and (13.9), and the fact that
w
ij
=
P(
C
i
|
x
j
)
=
f(
x
j
|
µ
i
,
i
)P(
C
i
)
f(
x
j
)
Setting the partial derivative of the log-likelihood to the zero vector, and multiplying both sides by $\Sigma_i$, we get

$$\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j - \mu_i) = \mathbf{0}, \quad\text{which implies that}\quad \sum_{j=1}^{n} w_{ij}\mathbf{x}_j = \mu_i\sum_{j=1}^{n} w_{ij},$$

and therefore

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\mathbf{x}_j}{\sum_{j=1}^{n} w_{ij}} \tag{13.19}$$
which is precisely the re-estimation formula we used in Eq.(13.11).
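The density gradient used in this derivation, $\frac{\partial}{\partial\mu_i}f(\mathbf{x}_j|\mu_i,\Sigma_i) = f(\mathbf{x}_j|\mu_i,\Sigma_i)\,\Sigma_i^{-1}(\mathbf{x}_j - \mu_i)$ (which reappears as Exercise Q7), is easy to sanity-check numerically with central finite differences; the values below are arbitrary and the snippet is only a sketch:

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(1)
    d = 3
    x, mu = rng.normal(size=d), rng.normal(size=d)
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T + d * np.eye(d)      # random positive-definite covariance

    f = lambda m: multivariate_normal.pdf(x, m, Sigma)
    analytic = f(mu) * np.linalg.solve(Sigma, x - mu)   # f * Sigma^{-1} (x - mu)

    h = 1e-6
    numeric = np.array([(f(mu + h * e) - f(mu - h * e)) / (2 * h)
                        for e in np.eye(d)])
    print(np.allclose(analytic, numeric))   # expected: True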
Estimation of Covariance Matrix

To re-estimate the covariance matrix $\Sigma_i$, we take the partial derivative of Eq. (13.16) with respect to $\Sigma_i^{-1}$ using the product rule for the differentiation of the term $|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\{g(\mu_i,\Sigma_i)\}$.
Using the fact that for any square matrix $\mathbf{A}$, we have $\frac{\partial|\mathbf{A}|}{\partial\mathbf{A}} = |\mathbf{A}|\cdot(\mathbf{A}^{-1})^T$, the derivative of $|\Sigma_i^{-1}|^{\frac{1}{2}}$ with respect to $\Sigma_i^{-1}$ is

$$\frac{\partial|\Sigma_i^{-1}|^{\frac{1}{2}}}{\partial\Sigma_i^{-1}} = \frac{1}{2}\cdot|\Sigma_i^{-1}|^{-\frac{1}{2}}\cdot|\Sigma_i^{-1}|\cdot\Sigma_i = \frac{1}{2}\cdot|\Sigma_i^{-1}|^{\frac{1}{2}}\cdot\Sigma_i \tag{13.20}$$
Next, using the fact that for the square matrix $\mathbf{A} \in \mathbb{R}^{d\times d}$ and vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, we have $\frac{\partial}{\partial\mathbf{A}}\mathbf{a}^T\mathbf{A}\mathbf{b} = \mathbf{a}\mathbf{b}^T$, the derivative of $\exp\{g(\mu_i,\Sigma_i)\}$ with respect to $\Sigma_i^{-1}$ is obtained from Eq. (13.17) as follows:

$$\frac{\partial}{\partial\Sigma_i^{-1}}\exp\{g(\mu_i,\Sigma_i)\} = -\frac{1}{2}\exp\{g(\mu_i,\Sigma_i)\}(\mathbf{x}_j - \mu_i)(\mathbf{x}_j - \mu_i)^T \tag{13.21}$$
Using the product rule on Eqs. (13.20) and (13.21), we get

$$
\begin{aligned}
\frac{\partial}{\partial\Sigma_i^{-1}}\Big(|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\{g(\mu_i,\Sigma_i)\}\Big) &= \frac{1}{2}|\Sigma_i^{-1}|^{\frac{1}{2}}\,\Sigma_i\exp\{g(\mu_i,\Sigma_i)\} - \frac{1}{2}|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\{g(\mu_i,\Sigma_i)\}(\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T \\
&= \frac{1}{2}\cdot|\Sigma_i^{-1}|^{\frac{1}{2}}\cdot\exp\{g(\mu_i,\Sigma_i)\}\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big)
\end{aligned}
\tag{13.22}
$$
Plugging Eq. (13.22) into Eq. (13.16) the derivative of the log-likelihood function with respect to $\Sigma_i^{-1}$ is given as

$$
\begin{aligned}
\frac{\partial}{\partial\Sigma_i^{-1}}\ln(P(\mathbf{D}|\theta)) &= \frac{1}{2}\sum_{j=1}^{n}\frac{(2\pi)^{-\frac{d}{2}}|\Sigma_i^{-1}|^{\frac{1}{2}}\exp\{g(\mu_i,\Sigma_i)\}P(C_i)}{f(\mathbf{x}_j)}\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big) \\
&= \frac{1}{2}\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)}{f(\mathbf{x}_j)}\cdot\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big) \\
&= \frac{1}{2}\sum_{j=1}^{n} w_{ij}\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big)
\end{aligned}
$$
Setting the derivative to the $d\times d$ zero matrix $\mathbf{0}_{d\times d}$, we can solve for $\Sigma_i$:

$$\sum_{j=1}^{n} w_{ij}\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big) = \mathbf{0}_{d\times d}, \quad\text{which implies that}$$

$$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T}{\sum_{j=1}^{n} w_{ij}} \tag{13.23}$$

Thus, we can see that the maximum likelihood estimate for the covariance matrix is given as the weighted outer-product form in Eq. (13.12).
Estimating the Prior Probability: Mixture Parameters

To obtain a maximum likelihood estimate for the mixture parameters or the prior probabilities $P(C_i)$, we have to take the partial derivative of the log-likelihood [Eq. (13.16)] with respect to $P(C_i)$. However, we have to introduce a Lagrange multiplier $\alpha$ for the constraint that $\sum_{a=1}^{k} P(C_a) = 1$. We thus take the following derivative:

$$\frac{\partial}{\partial P(C_i)}\left(\ln(P(\mathbf{D}|\theta)) + \alpha\Big(\sum_{a=1}^{k} P(C_a) - 1\Big)\right) \tag{13.24}$$
The partial derivative of the log-likelihood in Eq. (13.16) with respect to $P(C_i)$ gives

$$\frac{\partial}{\partial P(C_i)}\ln(P(\mathbf{D}|\theta)) = \sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\mu_i,\Sigma_i)}{f(\mathbf{x}_j)}$$
The derivative in Eq. (13.24) thus evaluates to

$$\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\mu_i,\Sigma_i)}{f(\mathbf{x}_j)} + \alpha$$
Setting the derivative to zero, and multiplying on both sides by $P(C_i)$, we get

$$\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)}{f(\mathbf{x}_j)} = -\alpha P(C_i)$$

$$\sum_{j=1}^{n} w_{ij} = -\alpha P(C_i) \tag{13.25}$$
Taking the summation of Eq. (13.25) over all clusters yields

$$\sum_{i=1}^{k}\sum_{j=1}^{n} w_{ij} = -\alpha\sum_{i=1}^{k} P(C_i) \qquad\text{or}\qquad n = -\alpha \tag{13.26}$$
The last step follows from the fact that $\sum_{i=1}^{k} w_{ij} = 1$. Plugging Eq. (13.26) into Eq. (13.25) gives us the maximum likelihood estimate for $P(C_i)$ as follows:

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} \tag{13.27}$$

which matches the formula in Eq. (13.13).
We can see that all three parameters $\mu_i$, $\Sigma_i$, and $P(C_i)$ for cluster $C_i$ depend on the weights $w_{ij}$, which correspond to the cluster posterior probabilities $P(C_i|\mathbf{x}_j)$. Equations (13.19), (13.23), and (13.27) thus do not represent a closed-form solution for maximizing the log-likelihood function. Instead, we use the iterative EM approach to compute the $w_{ij}$ in the expectation step, and we then re-estimate $\mu_i$, $\Sigma_i$, and $P(C_i)$ in the maximization step. Next, we describe the EM framework in some more detail.
13.3.4 EM Approach
Maximizing the log-likelihood function [Eq. (13.8)] directly is hard because the mixture term appears inside the logarithm. The problem is that for any point $\mathbf{x}_j$ we do not know which normal, or mixture component, it comes from. Suppose that we knew this information, that is, suppose each point $\mathbf{x}_j$ had an associated value indicating the cluster that generated the point. As we shall see, it is much easier to maximize the log-likelihood given this information.
The categorical attribute corresponding to the cluster label can be modeled as a vector random variable $\mathbf{C} = (C_1, C_2, \ldots, C_k)$, where $C_i$ is a Bernoulli random variable (see Section 3.1.2 for details on how to model a categorical variable). If a given point is generated from cluster $C_i$, then $C_i = 1$, otherwise $C_i = 0$. The parameter $P(C_i)$ gives the probability $P(C_i = 1)$. Because each point can be generated from only one cluster, if $C_a = 1$ for a given point, then $C_i = 0$ for all $i \ne a$. It follows that $\sum_{i=1}^{k} P(C_i) = 1$.
For each point $\mathbf{x}_j$, let its cluster vector be $\mathbf{c}_j = (c_{j1}, \ldots, c_{jk})^T$. Only one component of $\mathbf{c}_j$ has value 1. If $c_{ji} = 1$, it means that $C_i = 1$, that is, the cluster $C_i$ generates the point $\mathbf{x}_j$. The probability mass function of $\mathbf{C}$ is given as

$$P(\mathbf{C} = \mathbf{c}_j) = \prod_{i=1}^{k} P(C_i)^{c_{ji}}$$
Given the cluster information $\mathbf{c}_j$ for each point $\mathbf{x}_j$, the conditional probability density function for $\mathbf{X}$ is given as

$$f(\mathbf{x}_j|\mathbf{c}_j) = \prod_{i=1}^{k} f(\mathbf{x}_j|\mu_i,\Sigma_i)^{c_{ji}}$$
Only one cluster can generate $\mathbf{x}_j$, say $C_a$, in which case $c_{ja} = 1$, and the above expression would simplify to $f(\mathbf{x}_j|\mathbf{c}_j) = f(\mathbf{x}_j|\mu_a,\Sigma_a)$.
The pair $(\mathbf{x}_j, \mathbf{c}_j)$ is a random sample drawn from the joint distribution of vector random variables $\mathbf{X} = (X_1, \ldots, X_d)$ and $\mathbf{C} = (C_1, \ldots, C_k)$, corresponding to the $d$ data attributes and $k$ cluster attributes. The joint density function of $\mathbf{X}$ and $\mathbf{C}$ is given as

$$f(\mathbf{x}_j \text{ and } \mathbf{c}_j) = f(\mathbf{x}_j|\mathbf{c}_j)P(\mathbf{c}_j) = \prod_{i=1}^{k}\Big(f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)\Big)^{c_{ji}}$$
The log-likelihood for the data given the cluster information is as follows:

$$
\begin{aligned}
\ln P(\mathbf{D}|\theta) &= \ln\prod_{j=1}^{n} f(\mathbf{x}_j \text{ and } \mathbf{c}_j|\theta) = \sum_{j=1}^{n}\ln f(\mathbf{x}_j \text{ and } \mathbf{c}_j|\theta) \\
&= \sum_{j=1}^{n}\ln\prod_{i=1}^{k}\Big(f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)\Big)^{c_{ji}} \\
&= \sum_{j=1}^{n}\sum_{i=1}^{k} c_{ji}\Big(\ln f(\mathbf{x}_j|\mu_i,\Sigma_i) + \ln P(C_i)\Big)
\end{aligned}
\tag{13.28}
$$
Expectation Step

In the expectation step, we compute the expected value of the log-likelihood for the labeled data given in Eq. (13.28). The expectation is over the missing cluster information $\mathbf{c}_j$ treating $\mu_i$, $\Sigma_i$, $P(C_i)$, and $\mathbf{x}_j$ as fixed. Owing to the linearity of expectation, the expected value of the log-likelihood is given as

$$E[\ln P(\mathbf{D}|\theta)] = \sum_{j=1}^{n}\sum_{i=1}^{k} E[c_{ji}]\Big(\ln f(\mathbf{x}_j|\mu_i,\Sigma_i) + \ln P(C_i)\Big)$$
The expected value $E[c_{ji}]$ can be computed as

$$
\begin{aligned}
E[c_{ji}] &= 1\cdot P(c_{ji} = 1|\mathbf{x}_j) + 0\cdot P(c_{ji} = 0|\mathbf{x}_j) = P(c_{ji} = 1|\mathbf{x}_j) = P(C_i|\mathbf{x}_j) \\
&= \frac{P(\mathbf{x}_j|C_i)P(C_i)}{P(\mathbf{x}_j)} = \frac{f(\mathbf{x}_j|\mu_i,\Sigma_i)P(C_i)}{f(\mathbf{x}_j)} = w_{ij}
\end{aligned}
\tag{13.29}
$$
Thus, in the expectation step we use the values of $\theta = \{\mu_i, \Sigma_i, P(C_i)\}_{i=1}^{k}$ to estimate the posterior probabilities or weights $w_{ij}$ for each point for each cluster. Using $E[c_{ji}] = w_{ij}$, the expected value of the log-likelihood function can be rewritten as

$$E[\ln P(\mathbf{D}|\theta)] = \sum_{j=1}^{n}\sum_{i=1}^{k} w_{ij}\Big(\ln f(\mathbf{x}_j|\mu_i,\Sigma_i) + \ln P(C_i)\Big) \tag{13.30}$$
Maximization Step

In the maximization step, we maximize the expected value of the log-likelihood [Eq. (13.30)]. Taking the derivative with respect to $\mu_i$, $\Sigma_i$, or $P(C_i)$ we can ignore the terms for all the other clusters.
The derivative of Eq. (13.30) with respect to $\mu_i$ is given as

$$
\begin{aligned}
\frac{\partial}{\partial\mu_i}E[\ln P(\mathbf{D}|\theta)] &= \frac{\partial}{\partial\mu_i}\sum_{j=1}^{n} w_{ij}\ln f(\mathbf{x}_j|\mu_i,\Sigma_i) \\
&= \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\mu_i,\Sigma_i)}\cdot\frac{\partial}{\partial\mu_i}f(\mathbf{x}_j|\mu_i,\Sigma_i) \\
&= \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\mu_i,\Sigma_i)}\cdot f(\mathbf{x}_j|\mu_i,\Sigma_i)\,\Sigma_i^{-1}(\mathbf{x}_j-\mu_i) \\
&= \sum_{j=1}^{n} w_{ij}\,\Sigma_i^{-1}(\mathbf{x}_j-\mu_i)
\end{aligned}
$$

where we used the observation that

$$\frac{\partial}{\partial\mu_i}f(\mathbf{x}_j|\mu_i,\Sigma_i) = f(\mathbf{x}_j|\mu_i,\Sigma_i)\,\Sigma_i^{-1}(\mathbf{x}_j-\mu_i)$$

which follows from Eqs. (13.14), (13.17), and (13.18). Setting the derivative of the
which follows from Eqs.(13.14), (13.17), and (13.18). Setting the derivative of the
expected value of the log-likelihood to the zero vector, and multiplying on both sides
by
i
, we get
µ
i
=
n
j
=
1
w
ij
x
j
n
j
=
1
w
ij
matching the formula in Eq.(13.11).
Making use of Eqs. (13.22) and (13.14), we obtain the derivative of Eq. (13.30) with respect to $\Sigma_i^{-1}$ as follows:

$$
\begin{aligned}
\frac{\partial}{\partial\Sigma_i^{-1}}E[\ln P(\mathbf{D}|\theta)] &= \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\mu_i,\Sigma_i)}\cdot\frac{1}{2}f(\mathbf{x}_j|\mu_i,\Sigma_i)\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big) \\
&= \frac{1}{2}\sum_{j=1}^{n} w_{ij}\cdot\Big(\Sigma_i - (\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T\Big)
\end{aligned}
$$
Setting the derivative to the $d\times d$ zero matrix and solving for $\Sigma_i$ yields

$$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j-\mu_i)(\mathbf{x}_j-\mu_i)^T}{\sum_{j=1}^{n} w_{ij}}$$

which is the same as that in Eq. (13.12).
Using the Lagrange multiplier $\alpha$ for the constraint $\sum_{i=1}^{k} P(C_i) = 1$, and noting that in the log-likelihood function [Eq. (13.30)], the term $\ln f(\mathbf{x}_j|\mu_i,\Sigma_i)$ is a constant with respect to $P(C_i)$, we obtain the following:

$$\frac{\partial}{\partial P(C_i)}\left(E[\ln P(\mathbf{D}|\theta)] + \alpha\Big(\sum_{i=1}^{k} P(C_i) - 1\Big)\right) = \frac{\partial}{\partial P(C_i)}\left(\sum_{j=1}^{n} w_{ij}\ln P(C_i) + \alpha P(C_i)\right) = \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{P(C_i)} + \alpha$$

Setting the derivative to zero, we get

$$\sum_{j=1}^{n} w_{ij} = -\alpha\cdot P(C_i)$$

Using the same derivation as in Eq. (13.26) we obtain

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$

which is identical to the re-estimation formula in Eq. (13.13).
13.4 FURTHER READING
The K-means algorithm was proposed in several contexts during the 1950s and 1960s; among the first works to develop the method are MacQueen (1967), Lloyd (1982), and Hartigan (1975). Kernel K-means was first proposed in Schölkopf, Smola, and Müller (1996). The EM algorithm was proposed in Dempster, Laird, and Rubin (1977). A good review of the EM method can be found in McLachlan and Krishnan (2008). For a scalable and incremental representative-based clustering method that can also generate hierarchical clusterings see Zhang, Ramakrishnan, and Livny (1996).
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society, Series B, 39 (1): 1–38.

Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons.

Lloyd, S. (1982). "Least squares quantization in PCM." IEEE Transactions on Information Theory, 28 (2): 129–137.

MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations." In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. Berkeley: University of California Press.

McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. Hoboken, NJ: John Wiley & Sons.

Schölkopf, B., Smola, A., and Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem. Technical Report No. 44. Tübingen, Germany: Max-Planck-Institut für biologische Kybernetik.

Zhang, T., Ramakrishnan, R., and Livny, M. (1996). "BIRCH: an efficient data clustering method for very large databases." ACM SIGMOD Record, 25 (2): 103–114.
13.5 EXERCISES
Q1. Given the following points: 2, 4, 10, 12, 3, 20, 30, 11, 25. Assume $k = 3$, and that we randomly pick the initial means $\mu_1 = 2$, $\mu_2 = 4$, and $\mu_3 = 6$. Show the clusters obtained using the K-means algorithm after one iteration, and show the new means for the next iteration.
Table 13.1. Dataset for Q2

    x    P(C1|x)   P(C2|x)
    2    0.9       0.1
    3    0.8       0.1
    7    0.3       0.7
    9    0.1       0.9
    2    0.9       0.1
    1    0.8       0.2
Q2. Given the data points in Table 13.1, and their probability of belonging to two clusters. Assume that these points were produced by a mixture of two univariate normal distributions. Answer the following questions:

(a) Find the maximum likelihood estimate of the means $\mu_1$ and $\mu_2$.

(b) Assume that $\mu_1 = 2$, $\mu_2 = 7$, and $\sigma_1 = \sigma_2 = 1$. Find the probability that the point $x = 5$ belongs to cluster $C_1$ and to cluster $C_2$. You may assume that the prior probability of each cluster is equal (i.e., $P(C_1) = P(C_2) = 0.5$), and the prior probability $P(x = 5) = 0.029$.
Table 13.2. Dataset for Q3

         X1    X2
    x1   0     2
    x2   0     0
    x3   1.5   0
    x4   5     0
    x5   5     2
Q3. Given the two-dimensional points in Table 13.2, assume that $k = 2$, and that initially the points are assigned to clusters as follows: $C_1 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_4\}$ and $C_2 = \{\mathbf{x}_3, \mathbf{x}_5\}$. Answer the following questions:

(a) Apply the K-means algorithm until convergence, that is, the clusters do not change, assuming (1) the usual Euclidean distance or the $L_2$-norm as the distance
Q5. Given the points in Table 13.4, assume that there are two clusters: $C_1$ and $C_2$, with $\mu_1 = (0.5, 4.5, 2.5)^T$ and $\mu_2 = (2.5, 2, 1.5)^T$. Initially assign each point to the closest mean, and compute the covariance matrices $\Sigma_i$ and the prior probabilities $P(C_i)$ for $i = 1, 2$. Next, answer which cluster is more likely to have produced $\mathbf{x}_8$?
Q6. Consider the data in Table 13.5. Answer the following questions:

(a) Compute the kernel matrix $\mathbf{K}$ between the points assuming the following kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = 1 + \mathbf{x}_i^T\mathbf{x}_j$

(b) Assume initial cluster assignments of $C_1 = \{\mathbf{x}_1, \mathbf{x}_2\}$ and $C_2 = \{\mathbf{x}_3, \mathbf{x}_4\}$. Using kernel K-means, which cluster should $\mathbf{x}_1$ belong to in the next step?
Table 13.5. Data for Q6

         X1    X2    X3
    x1   0.4   0.9   0.6
    x2   0.5   0.1   0.6
    x3   0.6   0.3   0.6
    x4   0.4   0.8   0.5
Q7. Prove the following equivalence for the multivariate normal density function:

$$\frac{\partial}{\partial\mu_i}f(\mathbf{x}_j|\mu_i,\Sigma_i) = f(\mathbf{x}_j|\mu_i,\Sigma_i)\,\Sigma_i^{-1}(\mathbf{x}_j - \mu_i)$$
CHAPTER 14
Hierarchical Clustering
Given $n$ points in a $d$-dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram. The clusters in the hierarchy range from the fine-grained to the coarse-grained – the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster. Both of these may be considered to be trivial clusterings. At some intermediate level, we may find meaningful clusters. If the user supplies $k$, the desired number of clusters, we can choose the level at which there are $k$ clusters.

There are two main algorithmic approaches to mine hierarchical clusters: agglomerative and divisive. Agglomerative strategies work in a bottom-up manner. That is, starting with each of the $n$ points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster. Divisive strategies do just the opposite, working in a top-down manner. Starting with all the points in the same cluster, they recursively split the clusters until all points are in separate clusters. In this chapter we focus on agglomerative strategies. We discuss some divisive strategies in Chapter 16, in the context of graph partitioning.
14.1 PRELIMINARIES
Given a dataset $\mathbf{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_i \in \mathbb{R}^d$, a clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ is a partition of $\mathbf{D}$, that is, each cluster is a set of points $C_i \subseteq \mathbf{D}$, such that the clusters are pairwise disjoint, $C_i \cap C_j = \emptyset$ (for all $i \ne j$), and $\cup_{i=1}^{k} C_i = \mathbf{D}$. A clustering $\mathcal{A} = \{A_1, \ldots, A_r\}$ is said to be nested in another clustering $\mathcal{B} = \{B_1, \ldots, B_s\}$ if and only if $r > s$, and for each cluster $A_i \in \mathcal{A}$, there exists a cluster $B_j \in \mathcal{B}$, such that $A_i \subseteq B_j$.
Hierarchical clustering yields a sequence of $n$ nested partitions $\mathcal{C}_1, \ldots, \mathcal{C}_n$, ranging from the trivial clustering $\mathcal{C}_1 = \{\{\mathbf{x}_1\}, \ldots, \{\mathbf{x}_n\}\}$, where each point is in a separate cluster, to the other trivial clustering $\mathcal{C}_n = \{\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\}$, where all points are in one cluster. In general, the clustering $\mathcal{C}_{t-1}$ is nested in the clustering $\mathcal{C}_t$. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between cluster $C_i \in \mathcal{C}_{t-1}$ and cluster $C_j \in \mathcal{C}_t$ if $C_i$ is nested in $C_j$, that is, if $C_i \subset C_j$. In this way the dendrogram captures the entire sequence of nested clusterings.
[Figure 14.1. Hierarchical clustering dendrogram: leaves A, B, C, D, E, with internal nodes AB, CD, ABCD, and root ABCDE.]
Example 14.1. Figure 14.1 shows an example of hierarchical clustering of five labeled points: A, B, C, D, and E. The dendrogram represents the following sequence of nested partitions:

    Clustering   Clusters
    C1           {A}, {B}, {C}, {D}, {E}
    C2           {AB}, {C}, {D}, {E}
    C3           {AB}, {CD}, {E}
    C4           {ABCD}, {E}
    C5           {ABCDE}

with $\mathcal{C}_{t-1} \subset \mathcal{C}_t$ for $t = 2, \ldots, 5$. We assume that A and B are merged before C and D.
Number of Hierarchical Clusterings

The number of different nested or hierarchical clusterings corresponds to the number of different binary rooted trees or dendrograms with $n$ leaves with distinct labels. Any tree with $t$ nodes has $t - 1$ edges. Also, any rooted binary tree with $m$ leaves has $m - 1$ internal nodes. Thus, a dendrogram with $m$ leaf nodes has a total of $t = m + m - 1 = 2m - 1$ nodes, and consequently $t - 1 = 2m - 2$ edges. To count the number of different dendrogram topologies, let us consider how we can extend a dendrogram with $m$ leaves by adding an extra leaf, to yield a dendrogram with $m + 1$ leaves. Note that we can add the extra leaf by splitting (i.e., branching from) any of the $2m - 2$ edges. Further, we can also add the new leaf as a child of a new root, giving $2m - 2 + 1 = 2m - 1$ new dendrograms with $m + 1$ leaves. The total number of different dendrograms with $n$ leaves is thus obtained by the following product:

$$\prod_{m=1}^{n-1}(2m - 1) = 1\times 3\times 5\times 7\times\cdots\times(2n - 3) = (2n - 3)!! \tag{14.1}$$
[Figure 14.2. Number of hierarchical clusterings: trees with (a) m = 1, (b) m = 2, and (c) m = 3 leaves.]
The index $m$ in Eq. (14.1) goes up to $n - 1$ because the last term in the product denotes the number of dendrograms one obtains when we extend a dendrogram with $n - 1$ leaves by adding one more leaf, to yield dendrograms with $n$ leaves.

The number of possible hierarchical clusterings is thus given as $(2n - 3)!!$, which grows extremely rapidly. It is obvious that a naive approach of enumerating all possible hierarchical clusterings is simply infeasible.
Example 14.2. Figure 14.2 shows the number of trees with one, two, and three leaves. The gray nodes are the virtual roots, and the black dots indicate locations where a new leaf can be added. There is only one tree possible with a single leaf, as shown in Figure 14.2a. It can be extended in only one way to yield the unique tree with two leaves in Figure 14.2b. However, this tree has three possible locations where the third leaf can be added. Each of these cases is shown in Figure 14.2c. We can further see that each of the trees with $m = 3$ leaves has five locations where the fourth leaf can be added, and so on, which confirms the equation for the number of hierarchical clusterings in Eq. (14.1).
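Evaluating the double factorial in Eq. (14.1) takes only a few lines; a small Python sketch (the function name is ours) shows the growth:

    def num_dendrograms(n: int) -> int:
        """(2n-3)!! = 1 * 3 * 5 * ... * (2n-3), per Eq. (14.1)."""
        result = 1
        for m in range(1, n):     # extending m leaves to m+1 gives 2m-1 choices
            result *= 2 * m - 1
        return result

    for n in (3, 5, 10, 20):
        print(n, num_dendrograms(n))
    # 3 -> 3, 5 -> 105, 10 -> 34459425, 20 -> about 8.2e21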
14.2 AGGLOMERATIVE HIERARCHICAL CLUSTERING
In agglomerative hierarchical clustering, we begin with each of the $n$ points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster, as shown in the pseudo-code given in Algorithm 14.1. Formally, given a set of clusters $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$, we find the closest pair of clusters $C_i$ and $C_j$ and merge them into a new cluster $C_{ij} = C_i \cup C_j$. Next, we update the set of clusters by removing $C_i$ and $C_j$ and adding $C_{ij}$, as follows: $\mathcal{C} = \big(\mathcal{C}\setminus\{C_i, C_j\}\big)\cup\{C_{ij}\}$. We repeat the process until $\mathcal{C}$ contains only one cluster. Because the number of clusters decreases by one in each step, this process results in a sequence of $n$ nested clusterings. If specified, we can stop the merging process when there are exactly $k$ clusters remaining.
ALGORITHM 14.1. Agglomerative Hierarchical Clustering Algorithm

AGGLOMERATIVECLUSTERING(D, k):
  C ← {C_i = {x_i} | x_i ∈ D}          // Each point in separate cluster
  Δ ← {δ(x_i, x_j) : x_i, x_j ∈ D}     // Compute distance matrix
  repeat
    Find the closest pair of clusters C_i, C_j ∈ C
    C_ij ← C_i ∪ C_j                    // Merge the clusters
    C ← (C \ {C_i, C_j}) ∪ {C_ij}       // Update the clustering
    Update distance matrix Δ to reflect new clustering
  until |C| = k
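In practice one rarely codes the merge loop by hand; assuming SciPy is available, its hierarchy module provides the same agglomerative procedure with the distance measures discussed next (the data below is purely illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0., 0.], [0., 1.], [4., 0.], [4., 1.], [9., 0.]])
    Z = linkage(X, method='single')     # also 'complete', 'average', 'ward'
    labels = fcluster(Z, t=3, criterion='maxclust')   # cut at k = 3 clusters
    print(labels)   # three groups: the two close pairs and the outlier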
14.2.1 Distance between Clusters
The main step in the algorithm is to determine the closest pair of clusters. Several distance measures, such as single link, complete link, group average, and others discussed in the following paragraphs, can be used to compute the distance between any two clusters. The between-cluster distances are ultimately based on the distance between two points, which is typically computed using the Euclidean distance or $L_2$-norm, defined as

$$\delta(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \left(\sum_{i=1}^{d}(x_i - y_i)^2\right)^{1/2}$$

However, one may use other distance metrics, or if available one may use a user-specified distance matrix.
Single Link

Given two clusters $C_i$ and $C_j$, the distance between them, denoted $\delta(C_i, C_j)$, is defined as the minimum distance between a point in $C_i$ and a point in $C_j$:

$$\delta(C_i, C_j) = \min\{\delta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$$

The name single link comes from the observation that if we choose the minimum distance between points in the two clusters and connect those points, then (typically) only a single link would exist between those clusters because all other pairs of points would be farther away.
Complete Link

The distance between two clusters is defined as the maximum distance between a point in $C_i$ and a point in $C_j$:

$$\delta(C_i, C_j) = \max\{\delta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$$

The name complete link conveys the fact that if we connect all pairs of points from the two clusters with distance at most $\delta(C_i, C_j)$, then all possible pairs would be connected, that is, we get a complete linkage.
Group Average

The distance between two clusters is defined as the average pairwise distance between points in $C_i$ and $C_j$:

$$\delta(C_i, C_j) = \frac{\sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j}\delta(\mathbf{x}, \mathbf{y})}{n_i\cdot n_j}$$

where $n_i = |C_i|$ denotes the number of points in cluster $C_i$.
Mean Distance

The distance between two clusters is defined as the distance between the means or centroids of the two clusters:

$$\delta(C_i, C_j) = \delta(\mu_i, \mu_j) \tag{14.2}$$

where $\mu_i = \frac{1}{n_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}$.
Minimum Variance: Ward's Method

The distance between two clusters is defined as the increase in the sum of squared errors (SSE) when the two clusters are merged. The SSE for a given cluster $C_i$ is given as

$$\mathrm{SSE}_i = \sum_{\mathbf{x}\in C_i}\|\mathbf{x} - \mu_i\|^2$$

which can also be written as

$$\mathrm{SSE}_i = \sum_{\mathbf{x}\in C_i}\|\mathbf{x} - \mu_i\|^2 = \sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x} - 2\sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mu_i + \sum_{\mathbf{x}\in C_i}\mu_i^T\mu_i = \sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x} - n_i\,\mu_i^T\mu_i \tag{14.3}$$
The SSE for a clustering $\mathcal{C} = \{C_1, \ldots, C_m\}$ is given as

$$\mathrm{SSE} = \sum_{i=1}^{m}\mathrm{SSE}_i = \sum_{i=1}^{m}\sum_{\mathbf{x}\in C_i}\|\mathbf{x} - \mu_i\|^2$$

Ward's measure defines the distance between two clusters $C_i$ and $C_j$ as the net change in the SSE value when we merge $C_i$ and $C_j$ into $C_{ij}$, given as

$$\delta(C_i, C_j) = \Delta\mathrm{SSE}_{ij} = \mathrm{SSE}_{ij} - \mathrm{SSE}_i - \mathrm{SSE}_j \tag{14.4}$$
We can obtain a simpler expression for Ward's measure by plugging Eq. (14.3) into Eq. (14.4), and noting that because C_ij = C_i ∪ C_j and C_i ∩ C_j = ∅, we have |C_ij| = n_ij = n_i + n_j, and therefore

$$\begin{aligned}
\delta(C_i, C_j) = \Delta SSE_{ij} &= \sum_{\mathbf{z} \in C_{ij}}\|\mathbf{z} - \boldsymbol{\mu}_{ij}\|^2 - \sum_{\mathbf{x} \in C_i}\|\mathbf{x} - \boldsymbol{\mu}_i\|^2 - \sum_{\mathbf{y} \in C_j}\|\mathbf{y} - \boldsymbol{\mu}_j\|^2 \\
&= \sum_{\mathbf{z} \in C_{ij}}\mathbf{z}^T\mathbf{z} - n_{ij}\,\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij} - \sum_{\mathbf{x} \in C_i}\mathbf{x}^T\mathbf{x} + n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - \sum_{\mathbf{y} \in C_j}\mathbf{y}^T\mathbf{y} + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j \\
&= n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j - (n_i + n_j)\,\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij}
\end{aligned} \tag{14.5}$$

The last step follows from the fact that $\sum_{\mathbf{z} \in C_{ij}}\mathbf{z}^T\mathbf{z} = \sum_{\mathbf{x} \in C_i}\mathbf{x}^T\mathbf{x} + \sum_{\mathbf{y} \in C_j}\mathbf{y}^T\mathbf{y}$. Noting that

$$\boldsymbol{\mu}_{ij} = \frac{n_i\,\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j}{n_i + n_j}$$

we obtain

$$\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij} = \frac{1}{(n_i + n_j)^2}\left(n_i^2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + 2\,n_i n_j\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + n_j^2\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\right)$$
Plugging the above into Eq. (14.5), we finally obtain

$$\begin{aligned}
\delta(C_i, C_j) = \Delta SSE_{ij} &= n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j - \frac{1}{n_i + n_j}\left(n_i^2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + 2\,n_i n_j\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + n_j^2\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\right) \\
&= \frac{n_i(n_i + n_j)\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + n_j(n_i + n_j)\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j - n_i^2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - 2\,n_i n_j\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j - n_j^2\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j}{n_i + n_j} \\
&= \frac{n_i n_j\left(\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - 2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + \boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\right)}{n_i + n_j} \\
&= \frac{n_i n_j}{n_i + n_j}\,\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2
\end{aligned}$$
Ward's measure is therefore a weighted version of the mean distance measure because if we use Euclidean distance, the mean distance in Eq. (14.2) can be rewritten as

$$\delta(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j) = \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2 \tag{14.6}$$

We can see that the only difference is that Ward's measure weights the distance between the means by half of the harmonic mean of the cluster sizes, where the harmonic mean of two numbers n_1 and n_2 is given as $\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}} = \frac{2\,n_1 n_2}{n_1 + n_2}$.
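As a quick numerical sanity check (our own illustration, not part of the text), we can verify the closed form above against the SSE-based definition in Eq. (14.4) on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
Ci, Cj = rng.normal(0, 1, (5, 2)), rng.normal(3, 1, (8, 2))
ni, nj = len(Ci), len(Cj)
mi, mj = Ci.mean(axis=0), Cj.mean(axis=0)

def sse(C):  # sum of squared errors around the cluster mean
    return ((C - C.mean(axis=0)) ** 2).sum()

ward = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)          # Eq. (14.4)
closed_form = ni * nj / (ni + nj) * ((mi - mj) ** 2).sum()   # simplified Eq. (14.5)
assert np.isclose(ward, closed_form)
```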
Example 14.3 (Single Link). Consider the single link clustering shown in Figure 14.3 on a dataset of five points, whose pairwise distances are also shown on the bottom left. Initially, all points are in their own cluster. The closest pairs of points are (A, B) and (C, D), both with δ = 1. We choose to first merge A and B, and derive a new distance matrix for the merged cluster. Essentially, we have to compute the distances of the new cluster AB to all other clusters. For example, δ(AB, E) = 3 because δ(AB, E) = min{δ(A, E), δ(B, E)} = min{4, 3} = 3. In the next step we merge C and D because they are the closest clusters, and we obtain a new distance matrix for the resulting set of clusters. After this, AB and CD are merged, and finally, E is merged with ABCD. In the distance matrices, we have shown (circled) the minimum distance used at each iteration that results in a merging of the two closest pairs of clusters.

Figure 14.3. Single link agglomerative clustering. The initial distance matrix (bottom left of the figure) is:

    δ   B   C   D   E
    A   1   3   2   4
    B       3   2   3
    C           1   3
    D               5
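One way to reproduce this example programmatically (our own illustration, assuming SciPy is available) is with SciPy's hierarchical clustering routines, passing the pairwise distances of Figure 14.3 in condensed form:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed pairwise distances for A, B, C, D, E read off Figure 14.3:
# (A,B) (A,C) (A,D) (A,E) (B,C) (B,D) (B,E) (C,D) (C,E) (D,E)
dists = np.array([1, 3, 2, 4, 3, 2, 3, 1, 3, 5], dtype=float)
Z = linkage(dists, method='single')
print(Z)   # each row: the two merged clusters, the merge distance, the new size
```

The rows of Z trace the same merge order as the example: AB at distance 1, CD at distance 1, then AB with CD at distance 2, and finally E at distance 3.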
14.2.2 Updating Distance Matrix

Whenever two clusters C_i and C_j are merged into C_ij, we need to update the distance matrix by recomputing the distances from the newly created cluster C_ij to all other clusters C_r (r ≠ i and r ≠ j). The Lance–Williams formula provides a general equation to recompute the distances for all of the cluster proximity measures we considered earlier; it is given as

$$\delta(C_{ij}, C_r) = \alpha_i \cdot \delta(C_i, C_r) + \alpha_j \cdot \delta(C_j, C_r) + \beta \cdot \delta(C_i, C_j) + \gamma \cdot \left|\delta(C_i, C_r) - \delta(C_j, C_r)\right| \tag{14.7}$$
Table 14.1. Lance–Williams formula for cluster proximity

Measure          α_i                            α_j                            β                            γ
Single link      1/2                            1/2                            0                            −1/2
Complete link    1/2                            1/2                            0                            1/2
Group average    n_i/(n_i + n_j)                n_j/(n_i + n_j)                0                            0
Mean distance    n_i/(n_i + n_j)                n_j/(n_i + n_j)                −(n_i · n_j)/(n_i + n_j)²    0
Ward's measure   (n_i + n_r)/(n_i + n_j + n_r)  (n_j + n_r)/(n_i + n_j + n_r)  −n_r/(n_i + n_j + n_r)       0

Figure 14.4. Iris dataset: complete link.
The coefficients α_i, α_j, β, and γ differ from one measure to another. Let n_i = |C_i| denote the cardinality of cluster C_i; then the coefficients for the different distance measures are as shown in Table 14.1.
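A direct transcription of Eq. (14.7) and two rows of Table 14.1 into Python might look as follows (a sketch; the function names are our own):

```python
def lance_williams(d_ir, d_jr, d_ij, a_i, a_j, beta, gamma):
    """Distance from the merged cluster C_ij to another cluster C_r, Eq. (14.7)."""
    return a_i * d_ir + a_j * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

def single_link_coeffs(n_i, n_j, n_r):
    return 0.5, 0.5, 0.0, -0.5                     # first row of Table 14.1

def ward_coeffs(n_i, n_j, n_r):
    t = n_i + n_j + n_r                            # last row of Table 14.1
    return (n_i + n_r) / t, (n_j + n_r) / t, -n_r / t, 0.0
```

With these updates, each merge step only touches the distances involving the new cluster, instead of recomputing the whole matrix.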
Example 14.4. Consider the two-dimensional Iris principal components dataset shown in Figure 14.4, which also illustrates the results of hierarchical clustering using the complete-link method, with k = 3 clusters. Table 14.2 shows the contingency table comparing the clustering results with the ground-truth Iris types (which are not used in clustering). We can observe that 15 points are misclustered in total; these points are shown in white in Figure 14.4. Whereas iris-setosa is well separated, the other two Iris types are harder to separate.
Table 14.2. Contingency table: clusters versus Iris types

               iris-setosa   iris-virginica   iris-versicolor
C_1 (circle)       50              0                 0
C_2 (triangle)      0              1                36
C_3 (square)        0             49                14

14.2.3 Computational Complexity

In agglomerative clustering, we need to compute the distance of each cluster to all other clusters, and at each step the number of clusters decreases by 1. Initially it takes O(n²) time to create the pairwise distance matrix, unless it is specified as an input to the algorithm.

At each merge step, the distances from the merged cluster to the other clusters have to be recomputed, whereas the distances between the other clusters remain the same. This means that in step t, we compute O(n − t) distances. The other main operation is to find the closest pair in the distance matrix. For this we can keep the n² distances in a heap data structure, which allows us to find the minimum distance in O(1) time; creating the heap takes O(n²) time. Deleting/updating distances from the heap takes O(log n) time for each operation, for a total time across all merge steps of O(n² log n). Thus, the computational complexity of hierarchical clustering is O(n² log n).
14.3 FURTHER READING

Hierarchical clustering has a long history, especially in taxonomy or classificatory systems, and phylogenetics; see, for example, Sokal and Sneath (1963). The generic Lance–Williams formula for distance updates appears in Lance and Williams (1967). Ward's measure is from Ward (1963). Efficient methods for single-link and complete-link measures with O(n²) complexity are given in Sibson (1973) and Defays (1977), respectively. For a good discussion of hierarchical clustering, and clustering in general, see Jain and Dubes (1988).
Defays, D. (1977). "An efficient algorithm for a complete link method." The Computer Journal, 20(4): 364–366.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall.
Lance, G. N. and Williams, W. T. (1967). "A general theory of classificatory sorting strategies 1. Hierarchical systems." The Computer Journal, 9(4): 373–380.
Sibson, R. (1973). "SLINK: An optimally efficient algorithm for the single-link cluster method." The Computer Journal, 16(1): 30–34.
Sokal, R. R. and Sneath, P. H. (1963). Principles of Numerical Taxonomy. San Francisco: W. H. Freeman.
Ward, J. H. (1963). "Hierarchical grouping to optimize an objective function." Journal of the American Statistical Association, 58(301): 236–244.
14.4 Exercises and Projects
373
14.4 EXERCISES AND PROJECTS

Q1. Consider the 5-dimensional categorical data shown in Table 14.3.

Table 14.3. Data for Q1

Point   X1  X2  X3  X4  X5
x1       1   0   1   1   0
x2       1   1   0   1   0
x3       0   0   1   1   0
x4       0   1   0   1   0
x5       1   0   1   0   1
x6       0   1   1   0   0
The similarity between categorical data points can be computed in terms of the number of matches and mismatches for the different attributes. Let n11 be the number of attributes on which two points x_i and x_j assume the value 1, and let n10 denote the number of attributes where x_i takes value 1, but x_j takes on the value of 0. Define n01 and n00 in a similar manner. The contingency table for measuring the similarity is then given as

              x_j = 1   x_j = 0
x_i = 1        n11       n10
x_i = 0        n01       n00
Define the following similarity measures:

• Simple matching coefficient: SMC(x_i, x_j) = (n11 + n00) / (n11 + n10 + n01 + n00)
• Jaccard coefficient: JC(x_i, x_j) = n11 / (n11 + n10 + n01)
• Rao's coefficient: RC(x_i, x_j) = n11 / (n11 + n10 + n01 + n00)
Find the cluster dendrograms produced by the hierarchical clustering algorithm under the following scenarios:
(a) We use single link with RC.
(b) We use complete link with SMC.
(c) We use group average with JC.
Q2. Given the dataset in Figure 14.5, show the dendrogram resulting from the single-link hierarchical agglomerative clustering approach, using the L1-norm as the distance between points:

$$\delta(\mathbf{x}, \mathbf{y}) = \sum_{a=1}^{2}|x_{ia} - y_{ia}|$$

Whenever there is a choice, merge the cluster that has the lexicographically smallest labeled point. Show the cluster merge order in the tree, stopping when you have k = 4 clusters. Show the full distance matrix at each step.
Figure 14.5. Dataset for Q2.
Table 14.4. Dataset for Q3

     A  B  C  D  E
A    0  1  3  2  4
B       0  3  2  3
C          0  1  3
D             0  5
E                0
Q3. Using the distance matrix from Table 14.4, use the average link method to generate hierarchical clusters. Show the merging distance thresholds.
Q4. Prove that in the Lance–Williams formula [Eq. (14.7)]:
(a) If α_i = n_i/(n_i + n_j), α_j = n_j/(n_i + n_j), β = 0, and γ = 0, then we obtain the group average measure.
(b) If α_i = (n_i + n_r)/(n_i + n_j + n_r), α_j = (n_j + n_r)/(n_i + n_j + n_r), β = −n_r/(n_i + n_j + n_r), and γ = 0, then we obtain Ward's measure.
Q5. If we treat each point as a vertex, and add edges between two nodes with distance less than some threshold value, then the single-link method corresponds to a well-known graph algorithm. Describe this graph-based algorithm to hierarchically cluster the nodes via the single-link measure, using successively higher distance thresholds.
CHAPTER 15
Density-based Clustering
The representative-based clustering methods like K-means and expectation-
maximization are suitable for finding ellipsoid-shaped clusters, or at best convex
clusters. However, for nonconvex clusters, such as those shown in Figure 15.1, these
methods have trouble finding the true clusters, as two points from different clusters
may be closer than two points in the same cluster. The density-based methods we
consider in this chapter are able to mine such nonconvex clusters.
15.1 THE DBSCAN ALGORITHM

Density-based clustering uses the local density of points to determine the clusters, rather than using only the distance between points. We define a ball of radius ε around a point x ∈ R^d, called the ε-neighborhood of x, as follows:

$$N_\epsilon(\mathbf{x}) = B_d(\mathbf{x}, \epsilon) = \{\mathbf{y} \mid \delta(\mathbf{x}, \mathbf{y}) \le \epsilon\}$$

Here δ(x, y) represents the distance between points x and y, which is usually assumed to be the Euclidean distance, that is, δ(x, y) = ‖x − y‖₂. However, other distance metrics can also be used.
For any point x ∈ D, we say that x is a core point if there are at least minpts points in its ε-neighborhood. In other words, x is a core point if |N_ε(x)| ≥ minpts, where minpts is a user-defined local density or frequency threshold. A border point is defined as a point that does not meet the minpts threshold, that is, it has |N_ε(x)| < minpts, but it belongs to the ε-neighborhood of some core point z, that is, x ∈ N_ε(z). Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.
Example 15.1. Figure 15.2a shows the ε-neighborhood of the point x, using the Euclidean distance metric. Figure 15.2b shows the three different types of points, using minpts = 6. Here x is a core point because |N_ε(x)| = 6, y is a border point because |N_ε(y)| < minpts, but it belongs to the ε-neighborhood of the core point x, i.e., y ∈ N_ε(x). Finally, z is a noise point.
We say that a point x is directly density reachable from another point y if x ∈ N_ε(y) and y is a core point. We say that x is density reachable from y if there exists a chain of points, x_0, x_1, ..., x_l, such that x = x_0 and y = x_l, and x_i is directly density reachable from x_{i−1} for all i = 1, ..., l. In other words, there is a set of core points leading from y to x. Note that density reachability is an asymmetric or directed relationship. Define any two points x and y to be density connected if there exists a core point z such that both x and y are density reachable from z. A density-based cluster is defined as a maximal set of density connected points.

Figure 15.1. Density-based dataset.

Figure 15.2. (a) Neighborhood of a point. (b) Core, border, and noise points.
The pseudo-code for the DBSCAN density-based clustering method is shown in Algorithm 15.1. First, DBSCAN computes the ε-neighborhood N_ε(x_i) for each point x_i in the dataset D, and checks if it is a core point (lines 2–5). It also sets the cluster id id(x_i) = ∅ for all points, indicating that they are not assigned to any cluster. Next, starting from each unassigned core point, the method recursively finds all its density connected points, which are assigned to the same cluster (line 10).
ALGORITHM 15.1. Density-based Clustering Algorithm

DBSCAN(D, ε, minpts):
1   Core ← ∅
2   foreach x_i ∈ D do  // Find the core points
3       Compute N_ε(x_i)
4       id(x_i) ← ∅  // cluster id for x_i
5       if |N_ε(x_i)| ≥ minpts then Core ← Core ∪ {x_i}
6   k ← 0  // cluster id
7   foreach x_i ∈ Core, such that id(x_i) = ∅ do
8       k ← k + 1
9       id(x_i) ← k  // assign x_i to cluster id k
10      DENSITYCONNECTED(x_i, k)
11  C ← {C_i}_{i=1..k}, where C_i ← {x ∈ D | id(x) = i}
12  Noise ← {x ∈ D | id(x) = ∅}
13  Border ← D \ {Core ∪ Noise}
14  return C, Core, Border, Noise

DENSITYCONNECTED(x, k):
15  foreach y ∈ N_ε(x) do
16      id(y) ← k  // assign y to cluster id k
17      if y ∈ Core then DENSITYCONNECTED(y, k)
Some border points may be reachable from core points in more than one cluster; they may either be arbitrarily assigned to one of the clusters or to all of them (if overlapping clusters are allowed). Those points that do not belong to any cluster are treated as outliers or noise.

DBSCAN can also be considered as a search for the connected components in a graph where the vertices correspond to the core points in the dataset, and there exists an (undirected) edge between two vertices (core points) if the distance between them is less than ε, that is, each of them is in the ε-neighborhood of the other point. The connected components of this graph correspond to the core points of each cluster. Next, each core point incorporates into its cluster any border points in its neighborhood.
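A compact Python sketch of the method (our own illustration of Algorithm 15.1, using an iterative rather than recursive cluster expansion and a brute-force neighborhood computation) is shown below:

```python
import numpy as np

def dbscan(D, eps, minpts):
    """Sketch of Algorithm 15.1; returns one cluster id per point (0 = noise)."""
    n = len(D)
    # eps-neighborhoods under Euclidean distance; O(n^2) without a spatial index
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= minpts}
    ids = np.zeros(n, dtype=int)     # 0 means unassigned; leftovers are noise
    k = 0
    for i in core:
        if ids[i] != 0:
            continue
        k += 1
        ids[i] = k
        stack = [i]                  # iterative DENSITYCONNECTED expansion
        while stack:
            x = stack.pop()
            for y in neighbors[x]:
                if ids[y] == 0:
                    ids[y] = k       # border or core point joins cluster k
                    if y in core:
                        stack.append(y)
    return ids
```

In this sketch a border point keeps the id of the first cluster that reaches it, which is one of the arbitrary assignment policies mentioned above.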
One limitation of DBSCAN is that it is sensitive to the choice of ε, in particular if clusters have different densities. If ε is too small, sparser clusters will be categorized as noise. If ε is too large, denser clusters may be merged together. In other words, if there are clusters with different local densities, then a single ε value may not suffice.
Example 15.2. Figure 15.3 shows the clusters discovered by DBSCAN on the density-based dataset in Figure 15.1. For the parameter values ε = 15 and minpts = 10, found after parameter tuning, DBSCAN yields a near-perfect clustering comprising all nine clusters. Clusters are shown using different symbols and shading; noise points are shown as plus symbols.
Figure 15.3. Density-based clusters.
Figure 15.4. DBSCAN clustering: Iris dataset. (a) ε = 0.2, minpts = 5. (b) ε = 0.36, minpts = 3.
Example 15.3. Figure 15.4 shows the clusterings obtained via DBSCAN on the two-dimensional Iris dataset (over the sepal length and sepal width attributes) for two different parameter settings. Figure 15.4a shows the clusters obtained with radius ε = 0.2 and core threshold minpts = 5. The three clusters are plotted using differently shaped points, namely circles, squares, and triangles. Shaded points are core points, whereas the border points for each cluster are shown unshaded (white). Noise points are shown as plus symbols. Figure 15.4b shows the clusters obtained with a larger value of radius ε = 0.36, with minpts = 3. Two clusters are found, corresponding to the two dense regions of points.

For this dataset tuning the parameters is not that easy, and DBSCAN is not very effective in discovering the three Iris classes. For instance, it identifies too many points (47 of them) as noise in Figure 15.4a. However, DBSCAN is able to find the two main dense sets of points, distinguishing iris-setosa (in triangles) from the other types of Irises, in Figure 15.4b. Increasing the radius beyond ε = 0.36 collapses all points into a single large cluster.
Computational Complexity

The main cost in DBSCAN is for computing the ε-neighborhood for each point. If the dimensionality is not too high this can be done efficiently using a spatial index structure in O(n log n) time. When dimensionality is high, it takes O(n²) to compute the neighborhood for each point. Once N_ε(x) has been computed the algorithm needs only a single pass over all the points to find the density connected clusters. Thus, the overall complexity of DBSCAN is O(n²) in the worst case.
15.2 KERNEL DENSITY ESTIMATION

There is a close connection between density-based clustering and density estimation. The goal of density estimation is to determine the unknown probability density function by finding the dense regions of points, which can in turn be used for clustering. Kernel density estimation is a nonparametric technique that does not assume any fixed probability model of the clusters, as in the case of K-means or the mixture model assumed in the EM algorithm. Instead, it tries to directly infer the underlying probability density at each point in the dataset.
15.2.1 Univariate Density Estimation

Assume that X is a continuous random variable, and let x_1, x_2, ..., x_n be a random sample drawn from the underlying probability density function f(x), which is assumed to be unknown. We can directly estimate the cumulative distribution function from the data by counting how many points are less than or equal to x:

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n}I(x_i \le x)$$

where I is an indicator function that has value 1 only when its argument is true, and 0 otherwise. We can estimate the density function by taking the derivative of F̂(x), by considering a window of small width h centered at x, that is,

$$\hat{f}(x) = \frac{\hat{F}\left(x + \frac{h}{2}\right) - \hat{F}\left(x - \frac{h}{2}\right)}{h} = \frac{k/n}{h} = \frac{k}{nh} \tag{15.1}$$

where k is the number of points that lie in the window of width h centered at x, that is, within the closed interval [x − h/2, x + h/2]. Thus, the density estimate is the ratio of the fraction of points in the window (k/n) to the volume of the window (h). Here h plays the role of "influence." That is, a large h estimates the probability density over a large window by considering many points, which has the effect of smoothing the estimate. On the other hand, if h is small, then only the points in close proximity to x are considered. In general we want a small value of h, but not too small, as in that case no points will fall in the window and we will not be able to get an accurate estimate of the probability density.
Kernel Estimator

Kernel density estimation relies on a kernel function K that is non-negative, symmetric, and integrates to 1, that is, K(x) ≥ 0, K(−x) = K(x) for all values x, and $\int K(x)\,dx = 1$. Thus, K is essentially a probability density function. Note that K should not be confused with the positive semidefinite kernel mentioned in Chapter 5.
Discrete Kernel. The density estimate f̂(x) from Eq. (15.1) can also be rewritten in terms of the kernel function as follows:

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x - x_i}{h}\right)$$

where the discrete kernel function K computes the number of points in a window of width h, and is defined as

$$K(z) = \begin{cases} 1 & \text{if } |z| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \tag{15.2}$$
We can see that if $|z| = \left|\frac{x - x_i}{h}\right| \le \frac{1}{2}$, then the point x_i is within a window of width h centered at x, as $\left|\frac{x - x_i}{h}\right| \le \frac{1}{2}$ implies that $-\frac{1}{2} \le \frac{x_i - x}{h} \le \frac{1}{2}$, or $-\frac{h}{2} \le x_i - x \le \frac{h}{2}$, and finally $x - \frac{h}{2} \le x_i \le x + \frac{h}{2}$.
Example 15.4. Figure 15.5 shows the kernel density estimates using the discrete kernel for different values of the influence parameter h, for the one-dimensional Iris dataset comprising the sepal length attribute. The x-axis plots the n = 150 data points. Because several points have the same value, they are shown stacked, where the stack height corresponds to the frequency of that value.

When h is small, as shown in Figure 15.5a, the density function has many local maxima or modes. However, as we increase h from 0.25 to 2, the number of modes decreases, until h becomes large enough to yield a unimodal distribution, as shown in Figure 15.5d. We can observe that the discrete kernel yields a non-smooth (or jagged) density function.
Gaussian Kernel. The width h is a parameter that denotes the spread or smoothness of the density estimate. If the spread is too large we get a more averaged value. If it is too small we do not have enough points in the window. Further, the kernel function in Eq. (15.2) has an abrupt influence. For points within the window (|z| ≤ 1/2) there is a net contribution of 1/(hn) to the probability estimate f̂(x). On the other hand, points outside the window (|z| > 1/2) contribute 0.
Figure 15.5. Kernel density estimation: discrete kernel (varying h): (a) h = 0.25, (b) h = 0.5, (c) h = 1.0, (d) h = 2.0.
Instead of the discrete kernel, we can define a more smooth transition of influence via a Gaussian kernel:

$$K(z) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{z^2}{2}\right\}$$

Thus, we have

$$K\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x - x_i)^2}{2h^2}\right\}$$

Here x, which is at the center of the window, plays the role of the mean, and h acts as the standard deviation.
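A small Python sketch of the univariate estimator (our own illustration, with our own names and a toy sample), supporting both the discrete kernel of Eq. (15.2) and the Gaussian kernel, is:

```python
import numpy as np

def kde_1d(x, data, h, kernel="gaussian"):
    """Univariate estimate f_hat(x) = (1/nh) * sum_i K((x - x_i)/h)."""
    z = (x - data) / h
    if kernel == "gaussian":
        K = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    else:  # discrete (box) kernel, Eq. (15.2)
        K = (np.abs(z) <= 0.5).astype(float)
    return K.sum() / (len(data) * h)

data = np.array([4.9, 5.0, 5.1, 5.8, 6.0, 6.1, 6.4, 7.0])  # toy sample
print(kde_1d(5.5, data, h=0.5))
```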
Example 15.5. Figure 15.6 shows the univariate density function for the one-dimensional Iris dataset (over sepal length) using the Gaussian kernel. Plots are shown for increasing values of the spread parameter h. The data points are shown stacked along the x-axis, with the heights corresponding to the value frequencies.

As h varies from 0.1 to 0.5, we can see the smoothing effect of increasing h on the density function. For instance, for h = 0.1 there are many local maxima, whereas for h = 0.5 there is only one density peak. Compared to the discrete kernel case shown in Figure 15.5, we can clearly see that the Gaussian kernel yields much smoother estimates, without discontinuities.
Figure 15.6. Kernel density estimation: Gaussian kernel (varying h): (a) h = 0.1, (b) h = 0.15, (c) h = 0.25, (d) h = 0.5.
15.2.2 Multivariate Density Estimation

To estimate the probability density at a d-dimensional point x = (x_1, x_2, ..., x_d)^T, we define the d-dimensional "window" as a hypercube in d dimensions, that is, a hypercube centered at x with edge length h. The volume of such a d-dimensional hypercube is given as

$$\text{vol}(H_d(h)) = h^d$$

The density is then estimated as the fraction of the point weight lying within the d-dimensional window centered at x, divided by the volume of the hypercube:

$$\hat{f}(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \tag{15.3}$$

where the multivariate kernel function K satisfies the condition $\int K(\mathbf{z})\,d\mathbf{z} = 1$.
Discrete Kernel. For any d-dimensional vector z = (z_1, z_2, ..., z_d)^T, the discrete kernel function in d dimensions is given as

$$K(\mathbf{z}) = \begin{cases} 1 & \text{if } |z_j| \le \frac{1}{2}, \text{ for all dimensions } j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$

Figure 15.7. Density estimation: 2D Iris dataset (varying h): (a) h = 0.1, (b) h = 0.2, (c) h = 0.35, (d) h = 0.6.

For z = (x − x_i)/h, we see that the kernel computes the number of points within the hypercube centered at x because K((x − x_i)/h) = 1 if and only if |(x_j − x_ij)/h| ≤ 1/2 for all dimensions j. Each point within the hypercube thus contributes a weight of 1/n to the density estimate.
Gaussian Kernel. The d-dimensional Gaussian kernel is given as

$$K(\mathbf{z}) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{\mathbf{z}^T\mathbf{z}}{2}\right\} \tag{15.4}$$

where we assume that the covariance matrix is the d × d identity matrix, that is, Σ = I_d. Plugging z = (x − x_i)/h in Eq. (15.4), we have

$$K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{(\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x} - \mathbf{x}_i)}{2h^2}\right\}$$

Each point contributes a weight to the density estimate inversely proportional to its distance from x, tempered by the width parameter h.
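The multivariate estimate in Eq. (15.3) with the Gaussian kernel of Eq. (15.4) translates almost directly into Python (a sketch with our own function name):

```python
import numpy as np

def kde(x, data, h):
    """d-dimensional Gaussian-kernel density estimate, Eq. (15.3) with Eq. (15.4)."""
    n, d = data.shape
    z = (x - data) / h                                   # one row per sample point
    K = np.exp(-(z * z).sum(axis=1) / 2) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h**d)
```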
Example 15.6. Figure 15.7 shows the probability density function for the 2D Iris dataset comprising the sepal length and sepal width attributes, using the Gaussian kernel. As expected, for small values of h the density function has several local maxima, whereas for larger values the number of maxima reduces, and ultimately for a large enough value we obtain a unimodal distribution.
Figure 15.8. Density estimation: density-based dataset.

Example 15.7. Figure 15.8 shows the kernel density estimate for the density-based dataset in Figure 15.1, using a Gaussian kernel with h = 20. One can clearly discern that the density peaks closely correspond to regions with higher density of points.
15.2.3 Nearest Neighbor Density Estimation

In the preceding density estimation formulation we implicitly fixed the volume by fixing the width h, and we used the kernel function to find out the number or weight of points that lie inside the fixed volume region. An alternative approach to density estimation is to fix k, the number of points required to estimate the density, and allow the volume of the enclosing region to vary to accommodate those k points. This approach is called the k nearest neighbors (KNN) approach to density estimation. Like kernel density estimation, KNN density estimation is also a nonparametric approach.

Given k, the number of neighbors, we estimate the density at x as follows:

$$\hat{f}(\mathbf{x}) = \frac{k}{n \cdot \text{vol}(S_d(h_\mathbf{x}))}$$

where h_x is the distance from x to its kth nearest neighbor, and vol(S_d(h_x)) is the volume of the d-dimensional hypersphere S_d(h_x) centered at x, with radius h_x [Eq. (6.4)]. In other words, the width (or radius) h_x is now a variable, which depends on x and the chosen value k.
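A minimal sketch of the KNN density estimate (our own illustration; it spells out the standard closed form for the volume of a d-dimensional hypersphere, which the text cites as Eq. (6.4)) is:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """f_hat(x) = k / (n * vol(S_d(h_x))), h_x = distance to the k-th nearest neighbor."""
    n, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    h = dists[k - 1]                                   # a point is its own nearest neighbor
    vol = pi ** (d / 2) / gamma(d / 2 + 1) * h ** d    # hypersphere volume
    return k / (n * vol)
```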
15.3 DENSITY-BASED CLUSTERING: DENCLUE

Having laid the foundations of kernel density estimation, we can develop a general formulation of density-based clustering. The basic approach is to find the peaks in the density landscape via gradient-based optimization, and find the regions with density above a given threshold.

Density Attractors and Gradient

A point x* is called a density attractor if it is a local maximum of the probability density function f. A density attractor can be found via a gradient ascent approach starting at some point x. The idea is to compute the density gradient, the direction of the largest increase in the density, and to move in the direction of the gradient in small steps, until we reach a local maximum.
The gradient at a point x can be computed as the multivariate derivative of the probability density estimate in Eq. (15.3), given as

$$\nabla\hat{f}(\mathbf{x}) = \frac{\partial}{\partial\mathbf{x}}\hat{f}(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n}\frac{\partial}{\partial\mathbf{x}}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \tag{15.5}$$
For the Gaussian kernel [Eq. (15.4)], we have

$$\frac{\partial}{\partial\mathbf{x}}K(\mathbf{z}) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{\mathbf{z}^T\mathbf{z}}{2}\right\}\cdot(-\mathbf{z})\cdot\frac{\partial\mathbf{z}}{\partial\mathbf{x}} = K(\mathbf{z})\cdot(-\mathbf{z})\cdot\frac{\partial\mathbf{z}}{\partial\mathbf{x}}$$

Setting z = (x − x_i)/h above, we get

$$\frac{\partial}{\partial\mathbf{x}}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)\cdot\frac{1}{h}$$

which follows from the fact that $\frac{\partial}{\partial\mathbf{x}}\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \frac{1}{h}$. Substituting the above in Eq. (15.5), the gradient at a point x is given as

$$\nabla\hat{f}(\mathbf{x}) = \frac{1}{nh^{d+2}}\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot(\mathbf{x}_i - \mathbf{x}) \tag{15.6}$$
This equation can be thought of as having two parts: a vector (x_i − x) and a scalar influence value K((x − x_i)/h). For each point x_i, we first compute the direction away from x, that is, the vector (x_i − x). Next, we scale it using the Gaussian kernel value as the weight K((x − x_i)/h). Finally, the vector ∇f̂(x) is the net influence at x, as illustrated in Figure 15.9, that is, the weighted sum of the difference vectors.
We say that x* is a density attractor for x, or alternatively that x is density attracted to x*, if a hill climbing process started at x converges to x*. That is, there exists a sequence of points x = x_0 → x_1 → ... → x_m, starting from x and ending at x_m, such that ‖x_m − x*‖ ≤ ε, that is, x_m converges to the attractor x*.

The typical approach is to use the gradient-ascent method to compute x*, that is, starting from x, we iteratively update it at each step t via the update rule:

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \delta\cdot\nabla\hat{f}(\mathbf{x}_t)$$
Figure 15.9. The gradient vector ∇f̂(x) (shown in thick black) obtained as the sum of difference vectors x_i − x (shown in gray).
where δ > 0 is the step size. That is, each intermediate point is obtained after a small move in the direction of the gradient vector. However, the gradient-ascent approach can be slow to converge. Instead, one can directly optimize the move direction by setting the gradient [Eq. (15.6)] to the zero vector:

$$\nabla\hat{f}(\mathbf{x}) = \mathbf{0}$$
$$\frac{1}{nh^{d+2}}\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot(\mathbf{x}_i - \mathbf{x}) = \mathbf{0}$$
$$\mathbf{x}\cdot\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\mathbf{x}_i$$
$$\mathbf{x} = \frac{\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n}K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)}$$

The point x is involved on both the left- and right-hand sides above; however, it can be used to obtain the following iterative update rule:

$$\mathbf{x}_{t+1} = \frac{\sum_{i=1}^{n}K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n}K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)} \tag{15.7}$$
where t denotes the current iteration and x_{t+1} is the updated value for the current vector x_t. This direct update rule is essentially a weighted average of the influence (computed via the kernel function K) of each point x_i ∈ D on the current point x_t. The direct update rule results in much faster convergence of the hill-climbing process.
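A sketch of the hill-climbing step using the direct update rule (our own illustration; note that the Gaussian kernel's normalization constant cancels in the ratio of Eq. (15.7), so it is omitted) is:

```python
import numpy as np

def find_attractor(x, data, h, eps=1e-6, max_iter=100):
    """Hill climbing with the direct update rule, Eq. (15.7) (mean-shift style)."""
    for _ in range(max_iter):
        z = (x - data) / h
        K = np.exp(-(z * z).sum(axis=1) / 2)    # Gaussian kernel, constants cancel
        x_new = (K[:, None] * data).sum(axis=0) / K.sum()
        if np.linalg.norm(x_new - x) <= eps:
            return x_new
        x = x_new
    return x
```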
Center-defined Cluster

A cluster C ⊆ D is called a center-defined cluster if all the points x ∈ C are density attracted to a unique density attractor x*, such that f̂(x*) ≥ ξ, where ξ is a user-defined minimum density threshold. In other words,

$$\hat{f}(\mathbf{x}^*) = \frac{1}{nh^d}\sum_{i=1}^{n}K\left(\frac{\mathbf{x}^* - \mathbf{x}_i}{h}\right) \ge \xi$$
Density-based Cluster

An arbitrary-shaped cluster C ⊆ D is called a density-based cluster if there exists a set of density attractors x*_1, x*_2, ..., x*_m, such that

1. Each point x ∈ C is attracted to some attractor x*_i.
2. Each density attractor has density above ξ. That is, f̂(x*_i) ≥ ξ.
3. Any two density attractors x*_i and x*_j are density reachable, that is, there exists a path from x*_i to x*_j, such that for all points y on the path, f̂(y) ≥ ξ.
DENCLUE Algorithm

The pseudo-code for DENCLUE is shown in Algorithm 15.2. The first step is to compute the density attractor x* for each point x in the dataset (line 4). If the density at x* is above the minimum density threshold ξ, the attractor is added to the set of attractors A. The data point x is also added to the set of points R(x*) attracted to x*
ALGORITHM 15.2. DENCLUE Algorithm

DENCLUE(D, h, ξ, ε):
1   A ← ∅
2   foreach x ∈ D do  // find density attractors
4       x* ← FINDATTRACTOR(x, D, h, ε)
5       if f̂(x*) ≥ ξ then
7           A ← A ∪ {x*}
9           R(x*) ← R(x*) ∪ {x}
11  C ← {maximal C ⊆ A | ∀ x*_i, x*_j ∈ C, x*_i and x*_j are density reachable}
12  foreach C ∈ C do  // density-based clusters
13      foreach x* ∈ C do C ← C ∪ R(x*)
14  return C

FINDATTRACTOR(x, D, h, ε):
16  t ← 0
17  x_t ← x
18  repeat
20      x_{t+1} ← (Σ_{i=1}^{n} K((x_t − x_i)/h) · x_i) / (Σ_{i=1}^{n} K((x_t − x_i)/h))
21      t ← t + 1
22  until ‖x_t − x_{t−1}‖ ≤ ε
24  return x_t
(line 9). In the second step, DENCLUE finds all the maximal subsets of attractors C ⊆ A, such that any pair of attractors in C is density-reachable from each other (line 11). These maximal subsets of mutually reachable attractors form the seed for each density-based cluster. Finally, for each attractor x* ∈ C, we add to the cluster all of the points R(x*) that are attracted to x*, which results in the final set of clusters C.
The FINDATTRACTOR method implements the hill-climbing process using the direct update rule [Eq. (15.7)], which results in fast convergence. To further speed up the influence computation, it is possible to compute the kernel values for only the nearest neighbors of x_t. That is, we can index the points in the dataset D using a spatial index structure, so that we can quickly compute all the nearest neighbors of x_t within some radius r. For the Gaussian kernel, we can set r = h·z, where h is the influence parameter that plays the role of standard deviation, and z specifies the number of standard deviations. Let B_d(x_t, r) denote the set of all points in D that lie within a d-dimensional ball of radius r centered at x_t. The nearest neighbor based update rule can then be expressed as

$$\mathbf{x}_{t+1} = \frac{\sum_{\mathbf{x}_i \in B_d(\mathbf{x}_t, r)}K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{\mathbf{x}_i \in B_d(\mathbf{x}_t, r)}K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)}$$

which can be used in line 20 in Algorithm 15.2. When the data dimensionality is not high, this can result in a significant speedup. However, the effectiveness deteriorates rapidly with increasing number of dimensions. This is due to two effects. The first is that finding B_d(x_t, r) reduces to a linear scan of the data taking O(n) time for each query. Second, due to the curse of dimensionality (see Chapter 6), nearly all points appear to be equally close to x_t, thereby nullifying any benefits of computing the nearest neighbors.
Example 15.8. Figure 15.10 shows the DENCLUE clustering for the two-dimensional Iris dataset comprising the sepal length and sepal width attributes. The results were obtained with h = 0.2 and ξ = 0.08, using a Gaussian kernel. The clustering is obtained by thresholding the probability density function in Figure 15.7b at ξ = 0.08. The two peaks correspond to the two final clusters. Whereas iris-setosa is well separated, it is hard to separate the other two types of Irises.
Example 15.9. Figure 15.11 shows the clusters obtained by DENCLUE on the density-based dataset from Figure 15.1. Using the parameters h = 10 and ξ = 9.5 × 10⁻⁵, with a Gaussian kernel, we obtain eight clusters. The figure is obtained by slicing the density function at the density value ξ; only the regions above that value are plotted. All the clusters are correctly identified, with the exception of the two semicircular clusters on the lower right that appear merged into one cluster.
DENCLUE: Special Cases

It can be shown that DBSCAN is a special case of the general kernel density estimate based clustering approach, DENCLUE. If we let h = ε and ξ = minpts, then using a discrete kernel DENCLUE yields exactly the same clusters as DBSCAN. Each density attractor corresponds to a core point, and the set of connected core points defines the attractors of a density-based cluster. It can also be shown that K-means is a special case of density-based clustering for appropriate values of h and ξ, with the density attractors corresponding to the cluster centroids. Further, it is worth noting that the density-based approach can produce hierarchical clusters, by varying the ξ threshold. For example, decreasing ξ can result in the merging of several clusters found at higher threshold values. At the same time it can also lead to new clusters if the peak density satisfies the lower ξ value.

Figure 15.10. DENCLUE: Iris 2D dataset.

Figure 15.11. DENCLUE: density-based dataset.
Computational Complexity

The time for DENCLUE is dominated by the cost of the hill-climbing process. For each point x ∈ D, finding the density attractor takes O(nt) time, where t is the maximum number of hill-climbing iterations. This is because each iteration takes O(n) time for computing the sum of the influence function over all the points x_i ∈ D. The total cost to compute density attractors is therefore O(n²t). We assume that for reasonable values of h and ξ, there are only a few density attractors, that is, |A| = m ≪ n. The cost of finding the maximal reachable subsets of attractors is O(m²), and the final clusters can be obtained in O(n) time.
15.4 FURTHER READING

Kernel density estimation was developed independently in Rosenblatt (1956) and Parzen (1962). For an excellent description of density estimation techniques see Silverman (1986). The density-based DBSCAN algorithm was introduced in Ester et al. (1996). The DENCLUE method was proposed in Hinneburg and Keim (1998), with the faster direct update rule appearing in Hinneburg and Gabriel (2007). However, the direct update rule is essentially the mean-shift algorithm first proposed in Fukunaga and Hostetler (1975). See Cheng (1995) for convergence properties and generalizations of the mean-shift method.
Cheng, Y. (1995). "Mean shift, mode seeking, and clustering." IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8): 790–799.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise." In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (pp. 226–231), edited by E. Simoudis, J. Han, and U. M. Fayyad. Palo Alto, CA: AAAI Press.
Fukunaga, K. and Hostetler, L. (1975). "The estimation of the gradient of a density function, with applications in pattern recognition." IEEE Transactions on Information Theory, 21(1): 32–40.
Hinneburg, A. and Gabriel, H.-H. (2007). "Denclue 2.0: Fast clustering based on kernel density estimation." In Proceedings of the 7th International Symposium on Intelligent Data Analysis (pp. 70–80). New York: Springer Science+Business Media.
Hinneburg, A. and Keim, D. A. (1998). "An efficient approach to clustering in large multimedia databases with noise." In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (pp. 58–65), edited by R. Agrawal and P. E. Stolorz. Palo Alto, CA: AAAI Press.
Parzen, E. (1962). "On estimation of a probability density function and mode." The Annals of Mathematical Statistics, 33(3): 1065–1076.
Rosenblatt, M. (1956). "Remarks on some nonparametric estimates of a density function." The Annals of Mathematical Statistics, 27(3): 832–837.
Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman and Hall/CRC.
15.5 EXERCISES

Q1. Consider Figure 15.12 and answer the following questions, assuming that we use the Euclidean distance between points, and that ε = 2 and minpts = 3.
(a) List all the core points.
(b) Is a directly density reachable from d?
(c) Is o density reachable from i? Show the intermediate points on the chain or the point where the chain breaks.
(d) Is density reachable a symmetric relationship, that is, if x is density reachable from y, does it imply that y is density reachable from x? Why or why not?
(e) Is l density connected to x? Show the intermediate points that make them density connected or violate the property, respectively.
(f) Is density connected a symmetric relationship?
(g) Show the density-based clusters and the noise points.

Figure 15.12. Dataset for Q1.
Q2. Consider the points in Figure 15.13. Define the following distance measures:

$$L_\infty(\mathbf{x}, \mathbf{y}) = \max_{i=1}^{d}|x_i - y_i|$$

$$L_{\frac{1}{2}}(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d}|x_i - y_i|^{1/2}\right)^2$$
$$L_{\min}(\mathbf{x}, \mathbf{y}) = \min_{i=1}^{d}|x_i - y_i|$$

$$L_{pow}(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d}2^{i-1}(x_i - y_i)^2\right)^{1/2}$$

(a) Using ε = 2, minpts = 5, and the L∞ distance, find all core, border, and noise points.
(b) Show the shape of the ball of radius ε = 4 using the L_{1/2} distance. Using minpts = 3, show all the clusters found by DBSCAN.
(c) Using ε = 1, minpts = 6, and L_min, list all core, border, and noise points.
(d) Using ε = 4, minpts = 3, and L_pow, show all clusters found by DBSCAN.

Figure 15.13. Dataset for Q2 and Q3.
Q3. Consider the points shown in Figure 15.13. Define the following two kernels:

$$K_1(\mathbf{z}) = \begin{cases} 1 & \text{if } L_\infty(\mathbf{z}, \mathbf{0}) \le 1 \\ 0 & \text{otherwise} \end{cases}$$

$$K_2(\mathbf{z}) = \begin{cases} 1 & \text{if } \sum_{j=1}^{d}|z_j| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Using each of the two kernels K_1 and K_2, answer the following questions assuming that h = 2:
(a) What is the probability density at e?
(b) What is the gradient at e?
(c) List all the density attractors for this dataset.
Q4. The Hessian matrix is defined as the set of partial derivatives of the gradient vector with respect to x. What is the Hessian matrix for the Gaussian kernel? Use the gradient in Eq. (15.6).
Q5. Let us compute the probability density at a point x using the k-nearest neighbor approach, given as

$$\hat{f}(x) = \frac{k}{n\,V_x}$$

where k is the number of nearest neighbors, n is the total number of points, and V_x is the volume of the region encompassing the k nearest neighbors of x. In other words, we fix k and allow the volume to vary based on those k nearest neighbors of x. Given the following points:

2, 2.5, 3, 4, 4.5, 5, 6.1

Find the peak density in this dataset, assuming k = 4. Keep in mind that this may happen at a point other than those given above. Also, a point is its own nearest neighbor.
CHAPTER 16
Spectral and Graph Clustering
In this chapter we consider clustering over graph data, that is, given a graph, the
goal is to cluster the nodes by using the edges and their weights, which represent
the similarity between the incident nodes. Graph clustering is related to divisive
hierarchical clustering, as many methods partition the set of nodes to obtain the final
clusters using the pairwise similarity matrix between nodes. As we shall see, graph
clustering also has a very strong connection to spectral decomposition of graph-based
matrices. Finally, if the similarity matrix is positive semidefinite, it can be considered
as a kernel matrix, and graph clustering is therefore also related to kernel-based
clustering.
16.1 GRAPHS AND MATRICES

Given a dataset D = {x_i}_{i=1}^{n} consisting of n points in R^d, let A denote the n × n symmetric similarity matrix between the points, given as

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \tag{16.1}$$
where A(i, j) = a_ij denotes the similarity or affinity between points x_i and x_j. We require the similarity to be symmetric and non-negative, that is, a_ij = a_ji and a_ij ≥ 0, respectively. The matrix A may be considered to be a weighted adjacency matrix of the weighted (undirected) graph G = (V, E), where each vertex is a point and each edge joins a pair of points, that is,

V = {x_i | i = 1, ..., n}
E = {(x_i, x_j) | 1 ≤ i, j ≤ n}

Further, the similarity matrix A gives the weight on each edge, that is, a_ij denotes the weight of the edge (x_i, x_j). If all affinities are 0 or 1, then A represents the regular adjacency relationship between the vertices.
For a vertex x_i, let d_i denote the degree of the vertex, defined as

$$d_i = \sum_{j=1}^{n}a_{ij}$$

We define the degree matrix Δ of graph G as the n × n diagonal matrix:

$$\boldsymbol{\Delta} = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{n}a_{1j} & 0 & \cdots & 0 \\ 0 & \sum_{j=1}^{n}a_{2j} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sum_{j=1}^{n}a_{nj} \end{pmatrix}$$

Δ can be compactly written as Δ(i, i) = d_i for all 1 ≤ i ≤ n.
Example 16.1. Figure 16.1 shows the similarity graph for the Iris dataset, obtained as follows. Each of the n = 150 points x_i ∈ R⁴ in the Iris dataset is represented by a node in G. To create the edges, we first compute the pairwise similarity between the points using the Gaussian kernel [Eq. (5.10)]:

$$a_{ij} = \exp\left\{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right\}$$

using σ = 1. Each edge (x_i, x_j) has the weight a_ij. Next, for each node x_i we compute the top q nearest neighbors in terms of the similarity value, given as

$$N_q(\mathbf{x}_i) = \{\mathbf{x}_j \in V : a_{ij} \ge a_{iq}\}$$

where a_iq represents the similarity value between x_i and its qth nearest neighbor. We used a value of q = 16, as in this case each node records at least 15 nearest neighbors (not including the node itself), which corresponds to 10% of the nodes. An edge is added between nodes x_i and x_j if and only if both nodes are mutual nearest neighbors, that is, if x_j ∈ N_q(x_i) and x_i ∈ N_q(x_j). Finally, if the resulting graph is disconnected, we add the top q most similar (i.e., highest weighted) edges between any two connected components.
The resulting Iris similarity graph is shown in Figure 16.1. It has |V| = n = 150 nodes and |E| = m = 1730 edges. Edges with similarity a_ij ≥ 0.9 are shown in black, and the remaining edges are shown in gray. Although a_ii = 1.0 for all nodes, we do not show the self-edges or loops.

Figure 16.1. Iris similarity graph.

Normalized Adjacency Matrix

The normalized adjacency matrix is obtained by dividing each row of the adjacency matrix by the degree of the corresponding node. Given the weighted adjacency matrix
A for a graph G, its normalized adjacency matrix is defined as

$$\mathbf{M} = \boldsymbol{\Delta}^{-1}\mathbf{A} = \begin{pmatrix} \frac{a_{11}}{d_1} & \frac{a_{12}}{d_1} & \cdots & \frac{a_{1n}}{d_1} \\ \frac{a_{21}}{d_2} & \frac{a_{22}}{d_2} & \cdots & \frac{a_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{a_{n1}}{d_n} & \frac{a_{n2}}{d_n} & \cdots & \frac{a_{nn}}{d_n} \end{pmatrix} \tag{16.2}$$
Because A is assumed to have non-negative elements, this implies that each element of M, namely m_ij, is also non-negative, as m_ij = a_ij/d_i ≥ 0. Consider the sum of the ith row in M; we have

$$\sum_{j=1}^{n}m_{ij} = \sum_{j=1}^{n}\frac{a_{ij}}{d_i} = \frac{d_i}{d_i} = 1 \tag{16.3}$$

Thus, each row in M sums to 1. This implies that 1 is an eigenvalue of M. In fact, λ_1 = 1 is the largest eigenvalue of M, and the other eigenvalues satisfy the property that |λ_i| ≤ 1. Also, if G is connected then the eigenvector corresponding to λ_1 is u_1 = (1/√n)(1, 1, ..., 1)^T = (1/√n)1. Because M is not symmetric, its eigenvectors are not necessarily orthogonal.
Figure 16.2. Example graph.

Example 16.2. Consider the graph in Figure 16.2. Its adjacency and degree matrices are given as

$$\mathbf{A} = \begin{pmatrix} 0&1&0&1&0&1&0 \\ 1&0&1&1&0&0&0 \\ 0&1&0&1&0&0&1 \\ 1&1&1&0&1&0&0 \\ 0&0&0&1&0&1&1 \\ 1&0&0&0&1&0&1 \\ 0&0&1&0&1&1&0 \end{pmatrix} \qquad \boldsymbol{\Delta} = \begin{pmatrix} 3&0&0&0&0&0&0 \\ 0&3&0&0&0&0&0 \\ 0&0&3&0&0&0&0 \\ 0&0&0&4&0&0&0 \\ 0&0&0&0&3&0&0 \\ 0&0&0&0&0&3&0 \\ 0&0&0&0&0&0&3 \end{pmatrix}$$

The normalized adjacency matrix is as follows:

$$\mathbf{M} = \boldsymbol{\Delta}^{-1}\mathbf{A} = \begin{pmatrix} 0&0.33&0&0.33&0&0.33&0 \\ 0.33&0&0.33&0.33&0&0&0 \\ 0&0.33&0&0.33&0&0&0.33 \\ 0.25&0.25&0.25&0&0.25&0&0 \\ 0&0&0&0.33&0&0.33&0.33 \\ 0.33&0&0&0&0.33&0&0.33 \\ 0&0&0.33&0&0.33&0.33&0 \end{pmatrix}$$

The eigenvalues of M sorted in decreasing order are as follows:

λ_1 = 1, λ_2 = 0.483, λ_3 = 0.206, λ_4 = −0.045, λ_5 = −0.405, λ_6 = −0.539, λ_7 = −0.7

The eigenvector corresponding to λ_1 = 1 is

u_1 = (1/√7)(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T
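These computations are easy to reproduce with NumPy (our own illustration, using the adjacency matrix of Figure 16.2):

```python
import numpy as np

A = np.array([[0,1,0,1,0,1,0],     # adjacency matrix of the graph in Figure 16.2
              [1,0,1,1,0,0,0],
              [0,1,0,1,0,0,1],
              [1,1,1,0,1,0,0],
              [0,0,0,1,0,1,1],
              [1,0,0,0,1,0,1],
              [0,0,1,0,1,1,0]], dtype=float)
M = A / A.sum(axis=1, keepdims=True)             # M = Delta^{-1} A
print(M.sum(axis=1))                             # every row sums to 1, Eq. (16.3)
evals = np.linalg.eigvals(M)                     # M is not symmetric: use eigvals
print(np.round(np.sort(evals.real)[::-1], 3))    # 1, 0.483, ..., -0.7
```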
Graph Laplacian Matrices

The Laplacian matrix of a graph is defined as

$$\mathbf{L} = \boldsymbol{\Delta} - \mathbf{A} = \begin{pmatrix} \sum_{j \ne 1}a_{1j} & -a_{12} & \cdots & -a_{1n} \\ -a_{21} & \sum_{j \ne 2}a_{2j} & \cdots & -a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -a_{n1} & -a_{n2} & \cdots & \sum_{j \ne n}a_{nj} \end{pmatrix} \tag{16.4}$$
It is interesting to note that L is a symmetric, positive semidefinite matrix, as for any c ∈ Rⁿ, we have

$$\begin{aligned}
\mathbf{c}^T\mathbf{L}\mathbf{c} &= \mathbf{c}^T(\boldsymbol{\Delta} - \mathbf{A})\mathbf{c} = \mathbf{c}^T\boldsymbol{\Delta}\mathbf{c} - \mathbf{c}^T\mathbf{A}\mathbf{c} = \sum_{i=1}^{n}d_i c_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n}c_i c_j a_{ij} \\
&= \frac{1}{2}\left(\sum_{i=1}^{n}d_i c_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}c_i c_j a_{ij} + \sum_{j=1}^{n}d_j c_j^2\right) \\
&= \frac{1}{2}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}c_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n}c_i c_j a_{ij} + \sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}c_j^2\right) \\
&= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}(c_i - c_j)^2 \\
&\ge 0 \quad \text{because } a_{ij} \ge 0 \text{ and } (c_i - c_j)^2 \ge 0
\end{aligned} \tag{16.5}$$
This means that L has n real, non-negative eigenvalues, which can be arranged in decreasing order as follows: λ_1 ≥ λ_2 ≥ ··· ≥ λ_n ≥ 0. Because L is symmetric, its eigenvectors are orthonormal. Further, from Eq. (16.4) we can see that the first column (and the first row) is a linear combination of the remaining columns (rows). That is, if L_i denotes the ith column of L, then we can observe that L_1 + L_2 + L_3 + ··· + L_n = 0. This implies that the rank of L is at most n − 1, and the smallest eigenvalue is λ_n = 0, with the corresponding eigenvector given as u_n = (1/√n)(1, 1, ..., 1)^T = (1/√n)1, provided the graph is connected. If the graph is disconnected, then the number of eigenvalues equal to zero specifies the number of connected components in the graph.
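The following NumPy sketch (our own toy example, not from the text) constructs L = Δ − A for a small connected graph and confirms that its smallest eigenvalue is zero:

```python
import numpy as np

# Toy graph: two triangles joined by one edge (a connected graph)
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A             # graph Laplacian L = Delta - A
evals = np.linalg.eigvalsh(L)              # real, non-negative (L is symmetric PSD)
print(np.round(evals, 3))                  # smallest eigenvalue is 0
print(int(np.sum(np.isclose(evals, 0))))   # = number of connected components (1 here)
```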
Example 16.3. Consider the graph in Figure 16.2, whose adjacency and degree matrices are shown in Example 16.2. The graph Laplacian is given as

$$\mathbf{L} = \boldsymbol{\Delta} - \mathbf{A} = \begin{pmatrix} 3&-1&0&-1&0&-1&0 \\ -1&3&-1&-1&0&0&0 \\ 0&-1&3&-1&0&0&-1 \\ -1&-1&-1&4&-1&0&0 \\ 0&0&0&-1&3&-1&-1 \\ -1&0&0&0&-1&3&-1 \\ 0&0&-1&0&-1&-1&3 \end{pmatrix}$$

The eigenvalues of L are as follows:

λ_1 = 5.618, λ_2 = 4.618, λ_3 = 4.414, λ_4 = 3.382, λ_5 = 2.382, λ_6 = 1.586, λ_7 = 0

The eigenvector corresponding to λ_7 = 0 is

u_7 = (1/√7)(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T
The normalized symmetric Laplacian matrix of a graph is defined as

$$\mathbf{L}^s = \boldsymbol{\Delta}^{-1/2}\mathbf{L}\boldsymbol{\Delta}^{-1/2} \tag{16.6}$$
$$= \boldsymbol{\Delta}^{-1/2}(\boldsymbol{\Delta} - \mathbf{A})\boldsymbol{\Delta}^{-1/2} = \boldsymbol{\Delta}^{-1/2}\boldsymbol{\Delta}\boldsymbol{\Delta}^{-1/2} - \boldsymbol{\Delta}^{-1/2}\mathbf{A}\boldsymbol{\Delta}^{-1/2} = \mathbf{I} - \boldsymbol{\Delta}^{-1/2}\mathbf{A}\boldsymbol{\Delta}^{-1/2}$$

where Δ^{1/2} is the diagonal matrix given as Δ^{1/2}(i, i) = √d_i, and Δ^{−1/2} is the diagonal matrix given as Δ^{−1/2}(i, i) = 1/√d_i (assuming that d_i ≠ 0), for 1 ≤ i ≤ n. In other words, the normalized Laplacian is given as

$$\mathbf{L}^s = \boldsymbol{\Delta}^{-1/2}\mathbf{L}\boldsymbol{\Delta}^{-1/2} = \begin{pmatrix} \frac{\sum_{j \ne 1}a_{1j}}{\sqrt{d_1 d_1}} & -\frac{a_{12}}{\sqrt{d_1 d_2}} & \cdots & -\frac{a_{1n}}{\sqrt{d_1 d_n}} \\ -\frac{a_{21}}{\sqrt{d_2 d_1}} & \frac{\sum_{j \ne 2}a_{2j}}{\sqrt{d_2 d_2}} & \cdots & -\frac{a_{2n}}{\sqrt{d_2 d_n}} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{a_{n1}}{\sqrt{d_n d_1}} & -\frac{a_{n2}}{\sqrt{d_n d_2}} & \cdots & \frac{\sum_{j \ne n}a_{nj}}{\sqrt{d_n d_n}} \end{pmatrix} \tag{16.7}$$

Like the derivation in Eq. (16.5), we can show that L^s is also positive semidefinite because for any c ∈ R^d, we get

$$\mathbf{c}^T\mathbf{L}^s\mathbf{c} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}\left(\frac{c_i}{\sqrt{d_i}} - \frac{c_j}{\sqrt{d_j}}\right)^2 \ge 0 \tag{16.8}$$
Like the derivation in Eq.(16.5), we can show that
L
s
is also positive semidefinite
because for any
c
∈
R
d
, we get
c
T
L
s
c
=
1
2
n
i
=
1
n
j
=
1
a
ij
c
i
√
d
i
−
c
j
d
j
2
≥
0 (16.8)
400
Spectral and Graph Clustering
Further, if
L
s
i
denotes the
i
th column of
L
s
, then from Eq.(16.7) we can see that
d
1
L
s
1
+
d
2
L
s
2
+
d
3
L
s
3
+···+
d
n
L
s
n
=
0
That is, the first column is a linear combination of the other columns, which means that
L
s
has rank at most
n
−
1, with the smallest eigenvalue
λ
n
=
0, and the corresponding
eigenvector
1
√
i
d
i
(
√
d
1
,
√
d
2
,...,
√
d
n
)
T
=
1
√
i
d
i
1
/
2
1
. Combined with the fact that
L
s
is positive semidefinite, we conclude that
L
s
has
n
(not necessarily distinct) real,
positive eigenvalues
λ
1
≥
λ
2
≥···≥
λ
n
=
0.
Example 16.4. We continue with Example 16.3. For the graph in Figure 16.2, its normalized symmetric Laplacian is given as

$$\mathbf{L}^s = \begin{pmatrix} 1&-0.33&0&-0.29&0&-0.33&0 \\ -0.33&1&-0.33&-0.29&0&0&0 \\ 0&-0.33&1&-0.29&0&0&-0.33 \\ -0.29&-0.29&-0.29&1&-0.29&0&0 \\ 0&0&0&-0.29&1&-0.33&-0.33 \\ -0.33&0&0&0&-0.33&1&-0.33 \\ 0&0&-0.33&0&-0.33&-0.33&1 \end{pmatrix}$$

The eigenvalues of L^s are as follows:

λ_1 = 1.7, λ_2 = 1.539, λ_3 = 1.405, λ_4 = 1.045, λ_5 = 0.794, λ_6 = 0.517, λ_7 = 0

The eigenvector corresponding to λ_7 = 0 is

u_7 = (1/√22)(√3, √3, √3, √4, √3, √3, √3)^T = (0.37, 0.37, 0.37, 0.43, 0.37, 0.37, 0.37)^T
The normalized asymmetric Laplacian matrix is defined as

$$\mathbf{L}^a = \boldsymbol{\Delta}^{-1}\mathbf{L} = \boldsymbol{\Delta}^{-1}(\boldsymbol{\Delta} - \mathbf{A}) = \mathbf{I} - \boldsymbol{\Delta}^{-1}\mathbf{A} = \begin{pmatrix} \frac{\sum_{j \ne 1}a_{1j}}{d_1} & -\frac{a_{12}}{d_1} & \cdots & -\frac{a_{1n}}{d_1} \\ -\frac{a_{21}}{d_2} & \frac{\sum_{j \ne 2}a_{2j}}{d_2} & \cdots & -\frac{a_{2n}}{d_2} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{a_{n1}}{d_n} & -\frac{a_{n2}}{d_n} & \cdots & \frac{\sum_{j \ne n}a_{nj}}{d_n} \end{pmatrix} \tag{16.9}$$

Consider the eigenvalue equation for the symmetric Laplacian L^s:

$$\mathbf{L}^s\mathbf{u} = \lambda\mathbf{u}$$
Left multiplying by Δ^{−1/2} on both sides, we get

$$\boldsymbol{\Delta}^{-1/2}\mathbf{L}^s\mathbf{u} = \lambda\boldsymbol{\Delta}^{-1/2}\mathbf{u}$$
$$\boldsymbol{\Delta}^{-1/2}\left(\boldsymbol{\Delta}^{-1/2}\mathbf{L}\boldsymbol{\Delta}^{-1/2}\right)\mathbf{u} = \lambda\boldsymbol{\Delta}^{-1/2}\mathbf{u}$$
$$\boldsymbol{\Delta}^{-1}\mathbf{L}\left(\boldsymbol{\Delta}^{-1/2}\mathbf{u}\right) = \lambda\left(\boldsymbol{\Delta}^{-1/2}\mathbf{u}\right)$$
$$\mathbf{L}^a\mathbf{v} = \lambda\mathbf{v}$$

where v = Δ^{−1/2}u is an eigenvector of L^a, and u is an eigenvector of L^s. Further, L^a has the same set of eigenvalues as L^s, which means that L^a is a positive semi-definite matrix with n real eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_n = 0. From Eq. (16.9) we can see that if L^a_i denotes the ith column of L^a, then L^a_1 + L^a_2 + ··· + L^a_n = 0, which implies that v_n = (1/√n)1 is the eigenvector corresponding to the smallest eigenvalue λ_n = 0.
Example 16.5. For the graph in Figure 16.2, its normalized asymmetric Laplacian matrix is given as

$$\mathbf{L}^a = \boldsymbol{\Delta}^{-1}\mathbf{L} = \begin{pmatrix} 1&-0.33&0&-0.33&0&-0.33&0 \\ -0.33&1&-0.33&-0.33&0&0&0 \\ 0&-0.33&1&-0.33&0&0&-0.33 \\ -0.25&-0.25&-0.25&1&-0.25&0&0 \\ 0&0&0&-0.33&1&-0.33&-0.33 \\ -0.33&0&0&0&-0.33&1&-0.33 \\ 0&0&-0.33&0&-0.33&-0.33&1 \end{pmatrix}$$

The eigenvalues of L^a are identical to those for L^s, namely

λ_1 = 1.7, λ_2 = 1.539, λ_3 = 1.405, λ_4 = 1.045, λ_5 = 0.794, λ_6 = 0.517, λ_7 = 0

The eigenvector corresponding to λ_7 = 0 is

u_7 = (1/√7)(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T
16.2 CLUSTERING AS GRAPH CUTS
A k-way cut in a graph is a partitioning or clustering of the vertex set, given as C = {C_1, ..., C_k}, such that C_i ≠ ∅ for all i, C_i ∩ C_j = ∅ for all i, j, and V = ∪_i C_i. We require C to optimize some objective function that captures the intuition that nodes within a cluster should have high similarity, and nodes from different clusters should have low similarity.

Given a weighted graph G defined by its similarity matrix [Eq. (16.1)], let S, T ⊆ V be any two subsets of the vertices. We denote by W(S, T) the sum of the weights on all
edges with one vertex in S and the other in T, given as

$$W(S, T) = \sum_{v_i \in S}\sum_{v_j \in T}a_{ij}$$

Given S ⊆ V, we denote by S̄ the complementary set of vertices, that is, S̄ = V − S. A (vertex) cut in a graph is defined as a partitioning of V into S ⊂ V and S̄. The weight of the cut or cut weight is defined as the sum of all the weights on edges between vertices in S and S̄, given as W(S, S̄).
Given a clustering C = {C_1, ..., C_k} comprising k clusters, the size of a cluster C_i is the number of nodes in the cluster, given as |C_i|. The volume of a cluster C_i is defined as the sum of all the weights on edges with one end in cluster C_i:

$$\text{vol}(C_i) = \sum_{v_j \in C_i}d_j = \sum_{v_j \in C_i}\sum_{v_r \in V}a_{jr} = W(C_i, V)$$
Let c_i ∈ {0, 1}ⁿ be the cluster indicator vector that records the cluster membership for cluster C_i, defined as

$$c_{ij} = \begin{cases} 1 & \text{if } v_j \in C_i \\ 0 & \text{if } v_j \notin C_i \end{cases}$$

Because a clustering creates pairwise disjoint clusters, we immediately have

$$\mathbf{c}_i^T\mathbf{c}_j = 0$$

Further, the cluster size can be written as

$$|C_i| = \mathbf{c}_i^T\mathbf{c}_i = \|\mathbf{c}_i\|^2$$
The following identities allow us to express the weight of a cut in terms of matrix operations. Let us derive an expression for the sum of the weights for all edges with one end in C_i. These edges include internal cluster edges (with both ends in C_i), as well as external cluster edges (with the other end in another cluster C_{j≠i}):

$$\text{vol}(C_i) = W(C_i, V) = \sum_{v_r \in C_i}d_r = \sum_{v_r \in C_i}c_{ir}\,d_r\,c_{ir} = \sum_{r=1}^{n}\sum_{s=1}^{n}c_{ir}\,\Delta_{rs}\,c_{is} = \mathbf{c}_i^T\boldsymbol{\Delta}\mathbf{c}_i \tag{16.10}$$
Consider the sum of weights of all internal edges:

$$W(C_i, C_i) = \sum_{v_r \in C_i}\sum_{v_s \in C_i}a_{rs} = \sum_{r=1}^{n}\sum_{s=1}^{n}c_{ir}\,a_{rs}\,c_{is} = \mathbf{c}_i^T\mathbf{A}\mathbf{c}_i \tag{16.11}$$
We can get the sum of weights for all the external edges, or the cut weight, by subtracting Eq. (16.11) from Eq. (16.10), as follows:

$$W(C_i, \overline{C}_i) = \sum_{v_r \in C_i}\sum_{v_s \in V - C_i}a_{rs} = W(C_i, V) - W(C_i, C_i) = \mathbf{c}_i^T(\boldsymbol{\Delta} - \mathbf{A})\mathbf{c}_i = \mathbf{c}_i^T\mathbf{L}\mathbf{c}_i \tag{16.12}$$
Example 16.6. Consider the graph in Figure 16.2. Assume that C_1 = {1, 2, 3, 4} and C_2 = {5, 6, 7} are two clusters. Their cluster indicator vectors are given as

c_1 = (1, 1, 1, 1, 0, 0, 0)^T      c_2 = (0, 0, 0, 0, 1, 1, 1)^T

As required, we have c_1^T c_2 = 0, and c_1^T c_1 = ‖c_1‖² = 4 and c_2^T c_2 = 3 give the cluster sizes.
Consider the cut weight between C_1 and C_2. Because there are three edges between the two clusters, we have W(C_1, C̄_1) = W(C_1, C_2) = 3. Using the Laplacian matrix from Example 16.3, by Eq. (16.12) we have

W(C_1, C̄_1) = c_1^T L c_1
  = (1, 1, 1, 1, 0, 0, 0)
    [  3  −1   0  −1   0  −1   0
      −1   3  −1  −1   0   0   0
       0  −1   3  −1   0   0  −1
      −1  −1  −1   4  −1   0   0
       0   0   0  −1   3  −1  −1
      −1   0   0   0  −1   3  −1
       0   0  −1   0  −1  −1   3 ]
    (1, 1, 1, 1, 0, 0, 0)^T
  = (1, 0, 1, 1, −1, −1, −1)(1, 1, 1, 1, 0, 0, 0)^T = 3
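The matrix identities above are easy to sanity-check numerically. The following is a minimal Python sketch (ours, assuming numpy is available) that reproduces the numbers in Example 16.6 from the adjacency matrix of the graph in Figure 16.2:

import numpy as np

# Adjacency matrix of the graph in Figure 16.2 (as in Example 16.3).
A = np.array([
    [0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0]])
D = np.diag(A.sum(axis=1))            # degree matrix Delta
L = D - A                             # graph Laplacian

c1 = np.array([1, 1, 1, 1, 0, 0, 0])  # indicator vector for C1 = {1,2,3,4}
print(c1 @ D @ c1)   # vol(C1) = 13, by Eq. (16.10)
print(c1 @ A @ c1)   # W(C1,C1) = 10, by Eq. (16.11)
print(c1 @ L @ c1)   # cut weight W(C1, C1-bar) = 3, by Eq. (16.12)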
16.2.1 Clustering Objective Functions: Ratio and Normalized Cut

The clustering objective function can be formulated as an optimization problem over the k-way cut C = {C_1, ..., C_k}. We consider two common minimization objectives, namely ratio and normalized cut. We consider maximization objectives in Section 16.2.3, after describing the spectral clustering algorithm.
Ratio Cut
The ratio cut objective is defined over a k-way cut as follows:

min_C J_rc(C) = Σ_{i=1}^k W(C_i, C̄_i)/|C_i| = Σ_{i=1}^k (c_i^T L c_i)/(c_i^T c_i) = Σ_{i=1}^k (c_i^T L c_i)/‖c_i‖²      (16.13)

where we make use of Eq. (16.12), that is, W(C_i, C̄_i) = c_i^T L c_i.
Ratio cut tries to minimize the sum of the similarities from a cluster C_i to other points not in the cluster C_i, taking into account the size of each cluster. One can observe that the objective function has a lower value when the cut weight is minimized and when the cluster size is large.
Unfortunately, for binary cluster indicator vectors c_i, the ratio cut objective is NP-hard. An obvious relaxation is to allow c_i to take on any real value. In this case, we can rewrite the objective as

min_C J_rc(C) = Σ_{i=1}^k (c_i^T L c_i)/‖c_i‖² = Σ_{i=1}^k (c_i/‖c_i‖)^T L (c_i/‖c_i‖) = Σ_{i=1}^k u_i^T L u_i      (16.14)

where u_i = c_i/‖c_i‖ is the unit vector in the direction of c_i ∈ R^n, that is, c_i is assumed to be an arbitrary real vector.
To minimize J_rc we take its derivative with respect to u_i and set it to the zero vector. To incorporate the constraint that u_i^T u_i = 1, we introduce the Lagrange multiplier λ_i for each cluster C_i. We have

∂/∂u_i ( Σ_{i=1}^k u_i^T L u_i + Σ_{i=1}^k λ_i (1 − u_i^T u_i) ) = 0, which implies that

2 L u_i − 2 λ_i u_i = 0, and thus

L u_i = λ_i u_i      (16.15)
This implies that u_i is one of the eigenvectors of the Laplacian matrix L, corresponding to the eigenvalue λ_i. Using Eq. (16.15), we can see that

u_i^T L u_i = u_i^T λ_i u_i = λ_i

which in turn implies that to minimize the ratio cut objective [Eq. (16.14)], we should choose the k smallest eigenvalues, and the corresponding eigenvectors, so that

min_C J_rc(C) = u_n^T L u_n + ··· + u_{n−k+1}^T L u_{n−k+1} = λ_n + ··· + λ_{n−k+1}      (16.16)
where we assume that the eigenvalues have been sorted so that λ_1 ≥ λ_2 ≥ ··· ≥ λ_n. Noting that the smallest eigenvalue of L is λ_n = 0, the k smallest eigenvalues are as follows: 0 = λ_n ≤ λ_{n−1} ≤ ··· ≤ λ_{n−k+1}. The corresponding eigenvectors u_n, u_{n−1}, ..., u_{n−k+1} represent the relaxed cluster indicator vectors. However, because u_n = (1/√n)·1, it does not provide any guidance on how to separate the graph nodes if the graph is connected.
Normalized Cut
Normalized cut is similar to ratio cut, except that it divides the cut weight of each cluster by the volume of a cluster instead of its size. The objective function is given as

min_C J_nc(C) = Σ_{i=1}^k W(C_i, C̄_i)/vol(C_i) = Σ_{i=1}^k (c_i^T L c_i)/(c_i^T Δ c_i)      (16.17)

where we use Eqs. (16.12) and (16.10), that is, W(C_i, C̄_i) = c_i^T L c_i and vol(C_i) = c_i^T Δ c_i, respectively. The J_nc objective function has lower values when the cut weight is low and when the cluster volume is high, as desired.
As in the case of ratio cut, we can obtain an optimal solution to the normalized cut objective if we relax the condition that c_i be a binary cluster indicator vector. Instead we assume c_i to be an arbitrary real vector. Using the observation that the diagonal degree matrix Δ can be written as Δ = Δ^{1/2}Δ^{1/2}, and using the fact that I = Δ^{1/2}Δ^{−1/2} and Δ^T = Δ (because Δ is diagonal), we can rewrite the normalized cut objective in terms of the normalized symmetric Laplacian, as follows:

min_C J_nc(C) = Σ_{i=1}^k (c_i^T L c_i)/(c_i^T Δ c_i)
  = Σ_{i=1}^k ( c_i^T Δ^{1/2} (Δ^{−1/2} L Δ^{−1/2}) Δ^{1/2} c_i ) / ( c_i^T Δ^{1/2} Δ^{1/2} c_i )
  = Σ_{i=1}^k ( (Δ^{1/2}c_i)^T (Δ^{−1/2} L Δ^{−1/2}) (Δ^{1/2}c_i) ) / ( (Δ^{1/2}c_i)^T (Δ^{1/2}c_i) )
  = Σ_{i=1}^k ( Δ^{1/2}c_i/‖Δ^{1/2}c_i‖ )^T L^s ( Δ^{1/2}c_i/‖Δ^{1/2}c_i‖ )
  = Σ_{i=1}^k u_i^T L^s u_i

where u_i = Δ^{1/2}c_i/‖Δ^{1/2}c_i‖ is the unit vector in the direction of Δ^{1/2}c_i. Following the same approach as in Eq. (16.15), we conclude that the normalized cut objective is optimized by selecting the k smallest eigenvalues of the normalized Laplacian matrix L^s, namely 0 = λ_n ≤ ··· ≤ λ_{n−k+1}.
The normalized cut objective [Eq. (16.17)] can also be expressed in terms of the normalized asymmetric Laplacian, by differentiating Eq. (16.17) with respect to c_i and setting the result to the zero vector. Noting that all terms other than that for c_i are constant with respect to c_i, we have:

∂/∂c_i ( Σ_{j=1}^k (c_j^T L c_j)/(c_j^T Δ c_j) ) = ∂/∂c_i ( (c_i^T L c_i)/(c_i^T Δ c_i) ) = 0

( L c_i (c_i^T Δ c_i) − Δ c_i (c_i^T L c_i) ) / (c_i^T Δ c_i)² = 0

L c_i = ( (c_i^T L c_i)/(c_i^T Δ c_i) ) Δ c_i

Δ^{−1} L c_i = λ_i c_i

L^a c_i = λ_i c_i

where λ_i = (c_i^T L c_i)/(c_i^T Δ c_i) is the eigenvalue corresponding to the i-th eigenvector c_i of the asymmetric Laplacian matrix L^a. To minimize the normalized cut objective we therefore choose the k smallest eigenvalues of L^a, namely, 0 = λ_n ≤ ··· ≤ λ_{n−k+1}.
To derive the clustering, for L^a, we can use the corresponding eigenvectors u_n, ..., u_{n−k+1}, with c_i = u_i representing the real-valued cluster indicator vectors.
However, note that for L^a, we have c_n = u_n = (1/√n)·1. Further, for the normalized symmetric Laplacian L^s, the real-valued cluster indicator vectors are given as c_i = Δ^{−1/2} u_i, which again implies that c_n = (1/√n)·1. This means that the eigenvector u_n corresponding to the smallest eigenvalue λ_n = 0 does not by itself contain any useful information for clustering if the graph is connected.
16.2.2 Spectral Clustering Algorithm
Algorithm 16.1 gives the pseudo-code for the spectral clustering approach. We assume that the underlying graph is connected. The method takes a dataset D as input and computes the similarity matrix A. Alternatively, the matrix A may be directly input as well. Depending on the objective function, we choose the corresponding matrix B. For instance, for normalized cut B is chosen to be either L^s or L^a, whereas for ratio cut we choose B = L. Next, we compute the k smallest eigenvalues and eigenvectors of B. However, the main problem we face is that the eigenvectors u_i are not binary, and thus it is not immediately clear how we can assign points to clusters. One solution to this problem is to treat the n × k matrix of eigenvectors as a new data matrix:

U = ( u_n  u_{n−1}  ···  u_{n−k+1} ) =
  [ u_{n,1}  u_{n−1,1}  ···  u_{n−k+1,1}
    u_{n,2}  u_{n−1,2}  ···  u_{n−k+1,2}
       ···       ···     ···      ···
    u_{n,n}  u_{n−1,n}  ···  u_{n−k+1,n} ]      (16.18)
Next, we normalize each row of U to obtain the unit vector:

y_i = ( 1/√(Σ_{j=1}^k u²_{n−j+1,i}) ) (u_{n,i}, u_{n−1,i}, ..., u_{n−k+1,i})^T      (16.19)
which yields the new normalized data matrix Y ∈ R^{n×k} comprising n points in a reduced k-dimensional space:

Y = ( y_1^T ; y_2^T ; ··· ; y_n^T )
ALGORITHM 16.1. Spectral Clustering Algorithm

SPECTRAL CLUSTERING (D, k):
1   Compute the similarity matrix A ∈ R^{n×n}
2   if ratio cut then B ← L
3   else if normalized cut then B ← L^s or L^a
4   Solve B u_i = λ_i u_i for i = n, ..., n−k+1, where λ_n ≤ λ_{n−1} ≤ ··· ≤ λ_{n−k+1}
5   U ← ( u_n  u_{n−1}  ···  u_{n−k+1} )
6   Y ← normalize rows of U using Eq. (16.19)
7   C ← {C_1, ..., C_k} via K-means on Y
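A compact Python rendering of Algorithm 16.1 may make the steps concrete. This is a sketch under stated assumptions, not the book's reference code: it assumes numpy, scipy, and scikit-learn, implements only the ratio cut (B = L) and symmetric normalized cut (B = L^s) choices, and takes the similarity matrix A as input directly:

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k, objective="ratio"):
    # Assumes a connected graph with symmetric similarity matrix A.
    d = A.sum(axis=1)                    # vertex degrees
    L = np.diag(d) - A                   # Laplacian L = Delta - A
    if objective == "ratio":
        B = L                            # ratio cut
    else:
        Dh = np.diag(1.0 / np.sqrt(d))   # Delta^{-1/2}
        B = Dh @ L @ Dh                  # normalized symmetric Laplacian L^s
    evals, evecs = eigh(B)               # eigenvalues in ascending order
    U = evecs[:, :k]                     # eigenvectors of the k smallest eigenvalues
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize, Eq. (16.19)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)

Run on the adjacency matrix of Figure 16.2 with k = 2, this should recover the split {1, 2, 3, 4} versus {5, 6, 7} found in Example 16.7.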
We can now cluster the new points in Y into k clusters via the K-means algorithm or any other fast clustering method, as it is expected that the clusters are well-separated in the k-dimensional eigen-space. Note that for L, L^s, and L^a, the cluster indicator vector corresponding to the smallest eigenvalue λ_n = 0 is a vector of all 1's, which does not provide any information about how to separate the nodes. The real information for clustering is contained in eigenvectors starting from the second smallest eigenvalue. However, if the graph is disconnected, then even the eigenvector corresponding to λ_n can contain information valuable for clustering. Thus, we retain all k eigenvectors in U in Eq. (16.18).
Strictly speaking, the normalization step [Eq. (16.19)] is recommended only for the normalized symmetric Laplacian L^s. This is because the eigenvectors of L^s and the cluster indicator vectors are related as Δ^{1/2} c_i = u_i. The j-th entry of u_i, corresponding to vertex v_j, is given as

u_ij = √d_j c_ij / √( Σ_{r=1}^n d_r c²_ir )

If vertex degrees vary a lot, vertices with small degrees would have very small values u_ij. This can cause problems for K-means for correctly clustering these vertices. The normalization step helps alleviate this problem for L^s, though it can also help other objectives.
Computational Complexity
The computational complexity of the spectral clustering algorithm is O(n³), because computing the eigenvectors takes that much time. However, if the graph is sparse, the complexity to compute the eigenvectors is O(mn), where m is the number of edges in the graph. In particular, if m = O(n), then the complexity reduces to O(n²). Running the K-means method on Y takes O(tnk²) time, where t is the number of iterations K-means takes to converge.
Example 16.7. Consider the normalized cut approach applied to the graph in Figure 16.2. Assume that we want to find k = 2 clusters. For the normalized asymmetric Laplacian matrix from Example 16.5, we compute the eigenvectors, v_7 and v_6, corresponding to the two smallest eigenvalues, λ_7 = 0 and λ_6 = 0.517. The matrix composed of both the eigenvectors is given as

U = ( u_1  u_2 ) =
  [ −0.378  −0.226
    −0.378  −0.499
    −0.378  −0.226
    −0.378  −0.272
    −0.378   0.425
    −0.378   0.444
    −0.378   0.444 ]
[Figure 16.3. K-means on spectral dataset Y.]
We treat the i-th component of u_1 and u_2 as the i-th point (u_{1i}, u_{2i}) ∈ R², and after normalizing all points to have unit length we obtain the new dataset:

Y =
  [ −0.859  −0.513
    −0.604  −0.797
    −0.859  −0.513
    −0.812  −0.584
    −0.664   0.747
    −0.648   0.761
    −0.648   0.761 ]

For instance the first point is computed as

y_1 = ( 1/√((−0.378)² + (−0.226)²) ) (−0.378, −0.226)^T = (−0.859, −0.513)^T

Figure 16.3 plots the new dataset Y. Clustering the points into k = 2 groups using K-means yields the two clusters C_1 = {1, 2, 3, 4} and C_2 = {5, 6, 7}.
Example 16.8. We apply spectral clustering on the Iris graph in Figure 16.1 using the normalized cut objective with the asymmetric Laplacian matrix L^a. Figure 16.4 shows the k = 3 clusters. Comparing them with the true Iris classes (not used in the clustering), we obtain the contingency table shown in Table 16.1, indicating the number of points clustered correctly (on the main diagonal) and incorrectly (off-diagonal). We can see that cluster C_1 corresponds mainly to iris-setosa, C_2 to iris-virginica, and C_3 to iris-versicolor. The latter two are more difficult to separate. In total there are 18 points that are misclustered when compared to the true Iris types.
[Figure 16.4. Normalized cut on Iris graph.]
Table 16.1. Contingency table: clusters versus Iris types

                 iris-setosa   iris-virginica   iris-versicolor
C_1 (triangle)        50              0                 4
C_2 (square)           0             36                 0
C_3 (circle)           0             14                46
16.2.3 Maximization Objectives: Average Cut and Modularity

We now discuss two clustering objective functions that can be formulated as maximization problems over the k-way cut C = {C_1, ..., C_k}. These include average weight and modularity. We also explore their connections with normalized cut and kernel K-means.
Average Weight
The average weight objective is defined as

max_C J_aw(C) = Σ_{i=1}^k W(C_i, C_i)/|C_i| = Σ_{i=1}^k (c_i^T A c_i)/(c_i^T c_i)      (16.20)
where we used the equivalence W(C_i, C_i) = c_i^T A c_i established in Eq. (16.11). Instead of trying to minimize the weights on edges between clusters as in ratio cut, average weight tries to maximize the within-cluster weights. The problem of maximizing J_aw for binary cluster indicator vectors is also NP-hard; we can obtain a solution by relaxing
the constraint on c_i, by assuming that it can take on any real values for its elements. This leads to the relaxed objective

max_C J_aw(C) = Σ_{i=1}^k u_i^T A u_i      (16.21)

where u_i = c_i/‖c_i‖. Following the same approach as in Eq. (16.15), we can maximize the objective by selecting the k largest eigenvalues of A, and the corresponding eigenvectors:

max_C J_aw(C) = u_1^T A u_1 + ··· + u_k^T A u_k = λ_1 + ··· + λ_k

where λ_1 ≥ λ_2 ≥ ··· ≥ λ_n.
If we assume that A is the weighted adjacency matrix obtained from a symmetric and positive semidefinite kernel, that is, with a_ij = K(x_i, x_j), then A will be positive semidefinite and will have non-negative real eigenvalues. In general, if we threshold A or if A is the unweighted adjacency matrix for an undirected graph, then even though A is symmetric, it may not be positive semidefinite. This means that in general A can have negative eigenvalues, though they are all real. Because J_aw is a maximization problem, this means that we must consider only the positive eigenvalues and the corresponding eigenvectors.
Example 16.9. For the graph in Figure 16.2, with the adjacency matrix shown in Example 16.3, its eigenvalues are as follows:

λ_1 = 3.18   λ_2 = 1.49   λ_3 = 0.62   λ_4 = −0.15   λ_5 = −1.27   λ_6 = −1.62   λ_7 = −2.25

We can see that the eigenvalues can be negative, as A is the adjacency matrix and is not positive semidefinite.
Average Weight and Kernel K-means
The average weight objective leads to an interesting connection between kernel K-means and graph cuts. If the weighted adjacency matrix A represents the kernel value between a pair of points, so that a_ij = K(x_i, x_j), then we may use the sum of squared errors objective [Eq. (13.3)] of kernel K-means for graph clustering. The SSE objective is given as
min_C J_sse(C) = Σ_{j=1}^n K(x_j, x_j) − Σ_{i=1}^k (1/|C_i|) Σ_{x_r∈C_i} Σ_{x_s∈C_i} K(x_r, x_s)
  = Σ_{j=1}^n a_jj − Σ_{i=1}^k (1/|C_i|) Σ_{v_r∈C_i} Σ_{v_s∈C_i} a_rs
  = Σ_{j=1}^n a_jj − Σ_{i=1}^k (c_i^T A c_i)/(c_i^T c_i)
  = Σ_{j=1}^n a_jj − J_aw(C)      (16.22)
We can observe that because Σ_{j=1}^n a_jj is independent of the clustering, minimizing the SSE objective is the same as maximizing the average weight objective. In particular, if a_ij represents the linear kernel x_i^T x_j between the nodes, then maximizing the average weight objective [Eq. (16.20)] is equivalent to minimizing the regular K-means SSE objective [Eq. (13.1)]. Thus, spectral clustering using J_aw and kernel K-means represent two different approaches to solve the same problem. Kernel K-means tries to solve the NP-hard problem by using a greedy iterative approach to directly optimize the SSE objective, whereas the graph cut formulation tries to solve the same NP-hard problem by optimally solving a relaxed problem.
Modularity
Informally, modularity is defined as the difference between the observed and expected fraction of edges within a cluster. It measures the extent to which nodes of the same type (in our case, the same cluster) are linked to each other.
Unweighted Graphs
Let us assume for the moment that the graph G is unweighted, and that A is its binary adjacency matrix. The number of edges within a cluster C_i is given as

(1/2) Σ_{v_r∈C_i} Σ_{v_s∈C_i} a_rs

where we divide by 2 because each edge is counted twice in the summation. Over all the clusters, the observed number of edges within the same cluster is given as

(1/2) Σ_{i=1}^k Σ_{v_r∈C_i} Σ_{v_s∈C_i} a_rs      (16.23)
Let us compute the expected number of edges between any two vertices v_r and v_s, assuming that edges are placed at random, and allowing multiple edges between the same pair of vertices. Let |E| = m be the total number of edges in the graph. The probability that one end of an edge is v_r is given as d_r/2m, where d_r is the degree of v_r. The probability that one end is v_r and the other v_s (in either order) is then given as

p_rs = 2 · (d_r/2m) · (d_s/2m) = d_r d_s / (2m²)

The number of edges between v_r and v_s follows a binomial distribution with success probability p_rs over the m edges. The expected number of edges between v_r and v_s is therefore given as

m · p_rs = d_r d_s / 2m
The expected number of edges within a cluster C_i is then

(1/2) Σ_{v_r∈C_i} Σ_{v_s∈C_i} d_r d_s / 2m

and the expected number of edges within the same cluster, summed over all k clusters, is given as

(1/2) Σ_{i=1}^k Σ_{v_r∈C_i} Σ_{v_s∈C_i} d_r d_s / 2m      (16.24)
where we divide by 2 because each edge is counted twice. The modularity of the clustering C is defined as the difference between the observed and expected fraction of edges within the same cluster, obtained by subtracting Eq. (16.24) from Eq. (16.23), and dividing by the number of edges:

Q = (1/2m) Σ_{i=1}^k Σ_{v_r∈C_i} Σ_{v_s∈C_i} ( a_rs − d_r d_s/2m )

Because 2m = Σ_{i=1}^n d_i, we can rewrite modularity as follows:

Q = Σ_{i=1}^k Σ_{v_r∈C_i} Σ_{v_s∈C_i} ( a_rs/(Σ_{j=1}^n d_j) − d_r d_s/(Σ_{j=1}^n d_j)² )      (16.25)
Weighted Graphs
One advantage of the modularity formulation in Eq. (16.25) is that it directly generalizes to weighted graphs. Assume that A is the weighted adjacency matrix; we interpret the modularity of a clustering as the difference between the observed and expected fraction of weights on edges within the clusters.
From Eq. (16.11) we have

Σ_{v_r∈C_i} Σ_{v_s∈C_i} a_rs = W(C_i, C_i)

and from Eq. (16.10) we have

Σ_{v_r∈C_i} Σ_{v_s∈C_i} d_r d_s = ( Σ_{v_r∈C_i} d_r )( Σ_{v_s∈C_i} d_s ) = W(C_i, V)²

Further, note that

Σ_{j=1}^n d_j = W(V, V)
Using the above equivalences, we can write the modularity objective [Eq. (16.25)] in terms of the weight function W as follows:

max_C J_Q(C) = Σ_{i=1}^k [ W(C_i, C_i)/W(V, V) − ( W(C_i, V)/W(V, V) )² ]      (16.26)
We now express the modularity objective [Eq. (16.26)] in matrix terms. From Eq. (16.11), we have

W(C_i, C_i) = c_i^T A c_i

Also note that

W(C_i, V) = Σ_{v_r∈C_i} d_r = Σ_{v_r∈C_i} d_r c_ir = Σ_{j=1}^n d_j c_ij = d^T c_i

where d = (d_1, d_2, ..., d_n)^T is the vector of vertex degrees. Further, we have

W(V, V) = Σ_{j=1}^n d_j = tr(Δ)

where tr(Δ) is the trace of Δ, that is, the sum of the diagonal entries of Δ.
The clustering objective based on modularity can then be written as

max_C J_Q(C) = Σ_{i=1}^k [ (c_i^T A c_i)/tr(Δ) − (d^T c_i)²/tr(Δ)² ]
  = Σ_{i=1}^k [ c_i^T ( A/tr(Δ) ) c_i − c_i^T ( d·d^T/tr(Δ)² ) c_i ]
  = Σ_{i=1}^k c_i^T Q c_i      (16.27)

where Q is the modularity matrix:

Q = (1/tr(Δ)) ( A − d·d^T/tr(Δ) )
Directly maximizing objective Eq. (16.27) for binary cluster vectors c_i is hard. We resort to the approximation that elements of c_i can take on real values. Further, we require that c_i^T c_i = ‖c_i‖² = 1 to ensure that J_Q does not increase without bound. Following the approach in Eq. (16.15), we conclude that c_i is an eigenvector of Q. However, because this is a maximization problem, instead of selecting the k smallest eigenvalues, we select the k largest eigenvalues and the corresponding eigenvectors to obtain

max_C J_Q(C) = u_1^T Q u_1 + ··· + u_k^T Q u_k = λ_1 + ··· + λ_k

where u_i is the eigenvector corresponding to λ_i, and the eigenvalues are sorted so that λ_1 ≥ ··· ≥ λ_n. The relaxed cluster indicator vectors are given as c_i = u_i. Note that the modularity matrix Q is symmetric, but it is not positive semidefinite. This means that although it has real eigenvalues, they may be negative too. Also note that if Q_i denotes the i-th column of Q, then we have Q_1 + Q_2 + ··· + Q_n = 0, which implies that 0 is an eigenvalue of Q with the corresponding eigenvector (1/√n)·1. Thus, for maximizing the modularity one should use only the positive eigenvalues.
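Building the modularity matrix and inspecting its spectrum takes only a few lines. The following Python sketch (ours, assuming numpy) does this for the graph of Figure 16.2; the eigenvalues should match those listed in Example 16.10 below:

import numpy as np

def modularity_matrix(A):
    # Q = (1/tr(Delta)) * (A - d d^T / tr(Delta)), as defined above.
    d = A.sum(axis=1)
    t = d.sum()                        # tr(Delta) = sum of degrees
    return (A - np.outer(d, d) / t) / t

# Edge list of the graph in Figure 16.2 (1-based vertex labels).
edges = [(1,2),(1,4),(1,6),(2,3),(2,4),(3,4),(3,7),(4,5),(5,6),(5,7),(6,7)]
A = np.zeros((7, 7))
for i, j in edges:
    A[i-1, j-1] = A[j-1, i-1] = 1.0

evals = np.linalg.eigvalsh(modularity_matrix(A))
print(np.round(evals[::-1], 4))        # descending; only positive ones are used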
Example 16.10. Consider the graph in Figure 16.2. The degree vector is d = (3, 3, 3, 4, 3, 3, 3)^T, and the sum of degrees is tr(Δ) = 22. The modularity matrix is given as

Q = (1/tr(Δ)) A − (1/tr(Δ)²) d·d^T

  = (1/22)
    [ 0 1 0 1 0 1 0
      1 0 1 1 0 0 0
      0 1 0 1 0 0 1
      1 1 1 0 1 0 0
      0 0 0 1 0 1 1
      1 0 0 0 1 0 1
      0 0 1 0 1 1 0 ]
  − (1/484)
    [  9  9  9 12  9  9  9
       9  9  9 12  9  9  9
       9  9  9 12  9  9  9
      12 12 12 16 12 12 12
       9  9  9 12  9  9  9
       9  9  9 12  9  9  9
       9  9  9 12  9  9  9 ]

  = [ −0.019   0.027  −0.019   0.021  −0.019   0.027  −0.019
       0.027  −0.019   0.027   0.021  −0.019  −0.019  −0.019
      −0.019   0.027  −0.019   0.021  −0.019  −0.019   0.027
       0.021   0.021   0.021  −0.033   0.021  −0.025  −0.025
      −0.019  −0.019  −0.019   0.021  −0.019   0.027   0.027
       0.027  −0.019  −0.019  −0.025   0.027  −0.019   0.027
      −0.019  −0.019   0.027  −0.025   0.027   0.027  −0.019 ]

The eigenvalues of Q are as follows:

λ_1 = 0.0678   λ_2 = 0.0281   λ_3 = 0   λ_4 = −0.0068   λ_5 = −0.0579   λ_6 = −0.0736   λ_7 = −0.1024

The eigenvector corresponding to λ_3 = 0 is

u_3 = (1/√7)(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T
Modularity as Average Weight
Consider what happens to the modularity matrix Q if we use the normalized adjacency matrix M = Δ^{−1}A in place of the standard adjacency matrix A in Eq. (16.27). In this case, we know by Eq. (16.3) that each row of M sums to 1, that is,

Σ_{j=1}^n m_ij = d_i = 1, for all i = 1, ..., n

We thus have tr(Δ) = Σ_{i=1}^n d_i = n, and further d·d^T = 1_{n×n}, where 1_{n×n} is the n×n matrix of all 1's. The modularity matrix can then be written as

Q = (1/n) M − (1/n²) 1_{n×n}

For large graphs with many nodes, n is large and the second term practically vanishes, as 1/n² will be very small. Thus, the modularity matrix can be reasonably approximated as

Q ≃ (1/n) M      (16.28)
Substituting the above in the modularity objective [Eq. (16.27)], we get

max_C J_Q(C) = Σ_{i=1}^k c_i^T Q c_i = Σ_{i=1}^k c_i^T M c_i      (16.29)

where we dropped the 1/n factor because it is a constant for a given graph; it only scales the eigenvalues without affecting the eigenvectors.
In conclusion, if we use the normalized adjacency matrix, maximizing the modularity is equivalent to selecting the k largest eigenvalues and the corresponding eigenvectors of the normalized adjacency matrix M. Note that in this case modularity is also equivalent to the average weight objective and kernel K-means as established in Eq. (16.22).
Normalized Modularity as Normalized Cut
Define the normalized modularity objective as follows:

max_C J_nQ(C) = Σ_{i=1}^k (1/W(C_i, V)) [ W(C_i, C_i)/W(V, V) − ( W(C_i, V)/W(V, V) )² ]      (16.30)

We can observe that the main difference from the modularity objective [Eq. (16.26)] is that we divide by vol(C_i) = W(C_i, V) for each cluster. Simplifying the above, we obtain

J_nQ(C) = (1/W(V, V)) Σ_{i=1}^k [ W(C_i, C_i)/W(C_i, V) − W(C_i, V)/W(V, V) ]
  = (1/W(V, V)) [ Σ_{i=1}^k W(C_i, C_i)/W(C_i, V) − Σ_{i=1}^k W(C_i, V)/W(V, V) ]
  = (1/W(V, V)) [ ( Σ_{i=1}^k W(C_i, C_i)/W(C_i, V) ) − 1 ]
Now consider the expression (k − 1) − W(V, V)·J_nQ(C). We have

(k − 1) − W(V, V) J_nQ(C) = (k − 1) − [ ( Σ_{i=1}^k W(C_i, C_i)/W(C_i, V) ) − 1 ]
  = k − Σ_{i=1}^k W(C_i, C_i)/W(C_i, V)
  = Σ_{i=1}^k ( 1 − W(C_i, C_i)/W(C_i, V) )
  = Σ_{i=1}^k ( W(C_i, V) − W(C_i, C_i) )/W(C_i, V)
  = Σ_{i=1}^k W(C_i, C̄_i)/W(C_i, V)
  = Σ_{i=1}^k W(C_i, C̄_i)/vol(C_i)
  = J_nc(C)
In other words, the normalized cut objective [Eq. (16.17)] is related to the normalized modularity objective [Eq. (16.30)] by the following equation:

J_nc(C) = (k − 1) − W(V, V) · J_nQ(C)

Since W(V, V) is a constant for a given graph, we observe that minimizing normalized cut is equivalent to maximizing normalized modularity.
Spectral Clustering Algorithm
Both average weight and modularity are maximization objectives; therefore we have to slightly modify Algorithm 16.1 for spectral clustering to use these objectives. The matrix B is chosen to be A if we are maximizing average weight, or Q for the modularity objective. Next, instead of computing the k smallest eigenvalues we have to select the k largest eigenvalues and their corresponding eigenvectors. Because both A and Q can have negative eigenvalues, we must select only the positive eigenvalues. The rest of the algorithm remains the same.
16.3 MARKOV CLUSTERING
We now consider a graph clustering method based on simulating a random walk on a weighted graph. The basic intuition is that if node transitions reflect the weights on the edges, then transitions from one node to another within a cluster are much more likely than transitions between nodes from different clusters. This is because nodes within a cluster have higher similarities or weights, and nodes across clusters have lower similarities.
Given the weighted adjacency matrix A for a graph G, the normalized adjacency matrix [Eq. (16.2)] is given as M = Δ^{−1}A. The matrix M can be interpreted as the n×n transition matrix where the entry m_ij = a_ij/d_i can be interpreted as the probability of transitioning or jumping from node i to node j in the graph G. This is because M is a row stochastic or Markov matrix, which satisfies the following conditions: (1) elements of the matrix are non-negative, that is, m_ij ≥ 0, which follows from the fact that A is non-negative, and (2) rows of M are probability vectors, that is, row elements add to 1, because

Σ_{j=1}^n m_ij = Σ_{j=1}^n a_ij/d_i = 1
The matrix M is thus the transition matrix for a Markov chain or a Markov random walk on graph G. A Markov chain is a discrete-time stochastic process over a set of states, in our case the set of vertices V. The Markov chain makes a transition from one node to another at discrete timesteps t = 1, 2, ..., with the probability of making a transition from node i to node j given as m_ij. Let the random variable X_t denote the state at time t. The Markov property means that the probability distribution of X_t over the states at time t depends only on the probability distribution of X_{t−1}, that is,

P(X_t = i | X_0, X_1, ..., X_{t−1}) = P(X_t = i | X_{t−1})

Further, we assume that the Markov chain is homogeneous, that is, the transition probability

P(X_t = j | X_{t−1} = i) = m_ij

is independent of the time step t.
Given node i, the transition matrix M specifies the probabilities of reaching any other node j in one time step. Starting from node i at t = 0, let us consider the probability of being at node j at t = 2, that is, after two steps. We denote by m_ij(2) the probability of reaching j from i in two time steps. We can compute this as follows:

m_ij(2) = P(X_2 = j | X_0 = i) = Σ_{a=1}^n P(X_1 = a | X_0 = i) P(X_2 = j | X_1 = a) = Σ_{a=1}^n m_ia m_aj = m_i^T M_j      (16.31)

where m_i = (m_i1, m_i2, ..., m_in)^T denotes the vector corresponding to the i-th row of M and M_j = (m_1j, m_2j, ..., m_nj)^T denotes the vector corresponding to the j-th column of M.
Consider the product of M with itself:

M² = M·M = ( m_1^T ; m_2^T ; ··· ; m_n^T ) ( M_1  M_2  ···  M_n ) = [ m_i^T M_j ]_{i,j=1}^n = [ m_ij(2) ]_{i,j=1}^n      (16.32)

Equations (16.31) and (16.32) imply that M² is precisely the transition probability matrix for the Markov chain over two time-steps. Likewise, the three-step transition matrix is M²·M = M³. In general, the transition probability matrix for t time steps is given as

M^{t−1}·M = M^t      (16.33)
A random walk on G thus corresponds to taking successive powers of the transition matrix M. Let π_0 specify the initial state probability vector at time t = 0, that is, π_0(i) = P(X_0 = i) is the probability of starting at node i, for all i = 1, ..., n. Starting from π_0, we can obtain the state probability vector for X_t, that is, the probability of being at node i at time-step t, as follows:

π_t^T = π_{t−1}^T M = π_{t−2}^T M·M = π_{t−2}^T M² = π_{t−3}^T M²·M = π_{t−3}^T M³ = ··· = π_0^T M^t
Equivalently, taking the transpose on both sides, we get

π_t = (M^t)^T π_0 = (M^T)^t π_0

The state probability vector thus converges to the dominant eigenvector of M^T, reflecting the steady-state probability of reaching any node in the graph, regardless of the starting node. Note that if the graph is directed, then the steady-state vector is equivalent to the normalized prestige vector [Eq. (4.6)].
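The convergence of π_t can be demonstrated in a few lines of Python (a sketch, ours; assumes numpy and an aperiodic chain so the iteration settles):

import numpy as np

def steady_state(M, tol=1e-10, max_iter=10000):
    # Iterate pi_t = M^T pi_{t-1}, starting from the uniform distribution.
    pi = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(max_iter):
        nxt = M.T @ pi
        if np.abs(nxt - pi).sum() <= tol:
            return nxt
        pi = nxt
    return pi

For an undirected graph the steady-state probability of node i works out to d_i / Σ_j d_j, which gives an easy way to check the function.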
Transition Probability Inflation
We now consider a variation of the random walk, where the probability of transitioning from node i to j is inflated by taking each element m_ij to the power r ≥ 1. Given a transition matrix M, define the inflation operator ϒ as follows:

ϒ(M, r) = [ (m_ij)^r / Σ_{a=1}^n (m_ia)^r ]_{i,j=1}^n      (16.34)

The inflation operation results in a transformed or inflated transition probability matrix, because the elements remain non-negative and each row is normalized to sum to 1. The net effect of the inflation operator is to increase the higher probability transitions and decrease the lower probability transitions.
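In code the inflation operator is essentially a one-liner; a Python sketch (ours, assuming numpy):

import numpy as np

def inflate(M, r):
    # Upsilon(M, r) from Eq. (16.34): elementwise power, then row renormalization.
    P = np.power(M, r)
    return P / P.sum(axis=1, keepdims=True)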
16.3.1 Markov Clustering Algorithm
The Markov clustering algorithm (MCL) is an iterative method that interleaves matrix expansion and inflation steps. Matrix expansion corresponds to taking successive powers of the transition matrix, leading to random walks of longer lengths. On the other hand, matrix inflation makes the higher probability transitions even more likely and reduces the lower probability transitions. Because nodes in the same cluster are expected to have higher weights, and consequently higher transition probabilities between them, the inflation operator makes it more likely to stay within the cluster. It thus limits the extent of the random walk.
The pseudo-code for MCL is given in Algorithm 16.2. The method works on the weighted adjacency matrix for a graph. Instead of relying on a user-specified value for k, the number of output clusters, MCL takes as input the inflation parameter r ≥ 1. Higher values lead to more, smaller clusters, whereas smaller values lead to fewer, but larger clusters. However, the exact number of clusters cannot be pre-determined. Given the adjacency matrix A, MCL first adds loops or self-edges to A if they do not exist.
ALGORITHM 16.2. Markov Clustering Algorithm (MCL)

MARKOV CLUSTERING (A, r, ε):
1   t ← 0
2   Add self-edges to A if they do not exist
3   M_t ← Δ^{−1}A
4   repeat
5       t ← t + 1
6       M_t ← M_{t−1} · M_{t−1}
7       M_t ← ϒ(M_t, r)
8   until ‖M_t − M_{t−1}‖_F ≤ ε
9   G_t ← directed graph induced by M_t
10  C ← {weakly connected components in G_t}
If A is a similarity matrix, then this is not required, as a node is most similar to itself, and thus A should have high values on the diagonals. For simple, undirected graphs, if A is the adjacency matrix, then adding self-edges associates return probabilities with each node.
The iterative MCL expansion and inflation process stops when the transition matrix converges, that is, when the difference between the transition matrices from two successive iterations falls below some threshold ε ≥ 0. The matrix difference is given in terms of the Frobenius norm:

‖M_t − M_{t−1}‖_F = √( Σ_{i=1}^n Σ_{j=1}^n ( M_t(i,j) − M_{t−1}(i,j) )² )

The MCL process stops when ‖M_t − M_{t−1}‖_F ≤ ε.
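Putting the pieces together, the following Python sketch of Algorithm 16.2 (ours; assumes numpy and scipy, and reuses the inflate function sketched above) covers the common case where the final clusters are simply the weakly connected components, without separating overlapping clusters by attractor:

import numpy as np
from scipy.sparse.csgraph import connected_components

def mcl(A, r=2.5, eps=1e-3, max_iter=100):
    A = A + np.diag(A.diagonal() == 0)        # add missing self-edges
    M = A / A.sum(axis=1, keepdims=True)      # M_0 = Delta^{-1} A
    for _ in range(max_iter):
        M_new = inflate(M @ M, r)             # expansion, then inflation
        if np.linalg.norm(M_new - M, 'fro') <= eps:
            M = M_new
            break
        M = M_new
    # Clusters = weakly connected components of the induced directed graph G_t.
    n_comp, labels = connected_components(M > 0, directed=True, connection='weak')
    return labels

On the graph of Figure 16.2 with r = 2.5 and eps = 0.001 this should yield the two clusters found in Example 16.11.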
MCL Graph
The final clusters are found by enumerating the weakly connected components in the directed graph induced by the converged transition matrix M_t. The directed graph induced by M_t is denoted as G_t = (V_t, E_t). The vertex set is the same as the set of nodes in the original graph, that is, V_t = V, and the edge set is given as

E_t = { (i, j) | M_t(i, j) > 0 }

In other words, a directed edge (i, j) exists only if node i can transition to node j within t steps of the expansion and inflation process. A node j is called an attractor if M_t(j, j) > 0, and we say that node i is attracted to attractor j if M_t(i, j) > 0. The MCL process yields a set of attractor nodes, V_a ⊆ V, such that other nodes are attracted to at least one attractor in V_a. That is, for all nodes i there exists a node j ∈ V_a, such that (i, j) ∈ E_t. A strongly connected component in a directed graph is defined as a maximal subgraph such that there exists a directed path between all pairs of vertices in the subgraph. To extract the clusters from G_t, MCL first finds the strongly connected components S_1, S_2, ..., S_q over the set of attractors V_a. Next, for each strongly connected set of attractors S_j, MCL finds the weakly connected components consisting of all nodes i ∈ V_t − V_a attracted to an attractor in S_j. If a node i is attracted to multiple strongly connected components, it is added to each such cluster, resulting in possibly overlapping clusters.
Example 16.11. We apply the MCL method to find k = 2 clusters for the graph shown in Figure 16.2. We add the self-loops to the graph to obtain the adjacency matrix:

A = [ 1 1 0 1 0 1 0
      1 1 1 1 0 0 0
      0 1 1 1 0 0 1
      1 1 1 1 1 0 0
      0 0 0 1 1 1 1
      1 0 0 0 1 1 1
      0 0 1 0 1 1 1 ]

The corresponding Markov matrix is given as

M_0 = Δ^{−1}A =
  [ 0.25  0.25  0     0.25  0     0.25  0
    0.25  0.25  0.25  0.25  0     0     0
    0     0.25  0.25  0.25  0     0     0.25
    0.20  0.20  0.20  0.20  0.20  0     0
    0     0     0     0.25  0.25  0.25  0.25
    0.25  0     0     0     0.25  0.25  0.25
    0     0     0.25  0     0.25  0.25  0.25 ]

In the first iteration, we apply expansion and then inflation (with r = 2.5) to obtain

M_1 = M_0 · M_0 =
  [ 0.237  0.175  0.113  0.175  0.113  0.125  0.062
    0.175  0.237  0.175  0.237  0.050  0.062  0.062
    0.113  0.175  0.237  0.175  0.113  0.062  0.125
    0.140  0.190  0.140  0.240  0.090  0.100  0.100
    0.113  0.050  0.113  0.113  0.237  0.188  0.188
    0.125  0.062  0.062  0.125  0.188  0.250  0.188
    0.062  0.062  0.125  0.125  0.188  0.188  0.250 ]

M_1 = ϒ(M_1, 2.5) =
  [ 0.404  0.188  0.062  0.188  0.062  0.081  0.014
    0.154  0.331  0.154  0.331  0.007  0.012  0.012
    0.062  0.188  0.404  0.188  0.062  0.014  0.081
    0.109  0.234  0.109  0.419  0.036  0.047  0.047
    0.060  0.008  0.060  0.060  0.386  0.214  0.214
    0.074  0.013  0.013  0.074  0.204  0.418  0.204
    0.013  0.013  0.074  0.074  0.204  0.204  0.418 ]
[Figure 16.5. MCL attractors and clusters.]
MCL converges in 10 iterations (using ε = 0.001), with the final transition matrix

         1  2  3  4  5  6    7
M =  1 [ 0  0  0  1  0  0    0
     2   0  0  0  1  0  0    0
     3   0  0  0  1  0  0    0
     4   0  0  0  1  0  0    0
     5   0  0  0  0  0  0.5  0.5
     6   0  0  0  0  0  0.5  0.5
     7   0  0  0  0  0  0.5  0.5 ]

Figure 16.5 shows the directed graph induced by the converged M matrix, where an edge (i, j) exists if and only if M(i, j) > 0. The nonzero diagonal elements of M are the attractors (nodes with self-loops, shown in gray). We can observe that M(4, 4), M(6, 6), and M(7, 7) are all greater than zero, making nodes 4, 6, and 7 the three attractors. Because both 6 and 7 can reach each other, the equivalence classes of attractors are {4} and {6, 7}. Nodes 1, 2, and 3 are attracted to 4, and node 5 is attracted to both 6 and 7. Thus, the two weakly connected components that make up the two clusters are C_1 = {1, 2, 3, 4} and C_2 = {5, 6, 7}.
Example 16.12. Figure 16.6a shows the clusters obtained via the MCL algorithm on the Iris graph from Figure 16.1, using r = 1.3 in the inflation step. MCL yields three attractors (shown as gray nodes; self-loops omitted), which separate the graph into three clusters. The contingency table for the discovered clusters versus the true Iris types is given in Table 16.2. One point with class iris-versicolor is (wrongly) grouped with iris-setosa in C_1, but 14 points from iris-virginica are misclustered.
Notice that the only parameter for MCL is r, the exponent for the inflation step. The number of clusters is not explicitly specified, but higher values of r result in more clusters. The value of r = 1.3 was used above because it resulted in three clusters. Figure 16.6b shows the results for r = 2. MCL yields nine clusters, where one of the clusters (top-most) has two attractors.
Table 16.2. Contingency table: MCL clusters versus Iris types

                 iris-setosa   iris-virginica   iris-versicolor
C_1 (triangle)        50              0                 1
C_2 (square)           0             36                 0
C_3 (circle)           0             14                49
[Figure 16.6. MCL on Iris graph: (a) r = 1.3, (b) r = 2.]
Computational Complexity
The computational complexity of the MCL algorithm is O(tn³), where t is the number of iterations until convergence. This follows from the fact that whereas the inflation operation takes O(n²) time, the expansion operation requires matrix multiplication, which takes O(n³) time. However, the matrices become sparse very quickly, and it is possible to use sparse matrix multiplication to obtain O(n²) complexity for expansion in later iterations. On convergence, the weakly connected components in G_t can be found in O(n + m) time, where m is the number of edges. Because G_t is very sparse, with m = O(n), the final clustering step takes O(n) time.
16.4 FURTHER READING
Spectral partitioning of graphs was first proposed in Donath and Hoffman (1973). Properties of the second smallest eigenvalue of the Laplacian matrix, also called algebraic connectivity, were studied in Fiedler (1973). A recursive bipartitioning approach to find k clusters using the normalized cut objective was given in Shi and Malik (2000). The direct k-way partitioning approach for normalized cut, using the normalized symmetric Laplacian matrix, was proposed in Ng, Jordan, and Weiss (2001). The connection between the spectral clustering objective and kernel K-means was established in Dhillon, Guan, and Kulis (2007). The modularity objective was introduced in Newman (2003), where it was called the assortativity coefficient. The spectral algorithm using the modularity matrix was first proposed in Smyth and White (2005). The relationship between modularity and normalized cut was shown in Yu and Ding (2010). For an excellent tutorial on spectral clustering techniques see Luxburg (2007). The Markov clustering algorithm was originally proposed in van Dongen (2000). For an extensive review of graph clustering methods see Fortunato (2010).
Dhillon, I. S., Guan, Y., and Kulis, B. (2007). "Weighted graph cuts without eigenvectors: A multilevel approach." IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (11): 1944–1957.
Donath, W. E. and Hoffman, A. J. (September 1973). "Lower bounds for the partitioning of graphs." IBM Journal of Research and Development, 17 (5): 420–425.
Fiedler, M. (1973). "Algebraic connectivity of graphs." Czechoslovak Mathematical Journal, 23 (2): 298–305.
Fortunato, S. (2010). "Community detection in graphs." Physics Reports, 486 (3): 75–174.
Luxburg, U. (December 2007). "A tutorial on spectral clustering." Statistics and Computing, 17 (4): 395–416.
Newman, M. E. (2003). "Mixing patterns in networks." Physical Review E, 67 (2): 026126.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems 14 (pp. 849–856). Cambridge, MA: MIT Press.
Shi, J. and Malik, J. (August 2000). "Normalized cuts and image segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (8): 888–905.
Smyth, S. and White, S. (2005). "A spectral clustering approach to finding communities in graphs." In Proceedings of the 5th SIAM International Conference on Data Mining, vol. 119, p. 274.
van Dongen, S. M. (2000). "Graph clustering by flow simulation." PhD thesis. The University of Utrecht, The Netherlands.
Yu, L. and Ding, C. (2010). "Network community discovery: solving modularity clustering via normalized cut." In Proceedings of the 8th Workshop on Mining and Learning with Graphs. ACM, pp. 34–36.
16.5 EXERCISES
Q1. Show that if Q_i denotes the i-th column of the modularity matrix Q, then Σ_{i=1}^n Q_i = 0.

Q2. Prove that both the normalized symmetric and asymmetric Laplacian matrices L^s [Eq. (16.6)] and L^a [Eq. (16.9)] are positive semidefinite. Also show that the smallest eigenvalue is λ_n = 0 for both.

Q3. Prove that the largest eigenvalue of the normalized adjacency matrix M [Eq. (16.2)] is 1, and further that all eigenvalues satisfy the condition that |λ_i| ≤ 1.

Q4. Show that Σ_{v_r∈C_i} c_ir d_r c_ir = Σ_{r=1}^n Σ_{s=1}^n c_ir Δ_rs c_is, where c_i is the cluster indicator vector for cluster C_i and Δ is the degree matrix for the graph.

Q5. For the normalized symmetric Laplacian L^s, show that for the normalized cut objective the real-valued cluster indicator vector corresponding to the smallest eigenvalue λ_n = 0 is given as c_n = (1/√n)·1.

[Figure 16.7. Graph for Q6.]

Q6. Given the graph in Figure 16.7, answer the following questions:
(a) Cluster the graph into two clusters using ratio cut and normalized cut.
(b) Use the normalized adjacency matrix M for the graph and cluster it into two clusters using average weight and kernel K-means, using K = M.
(c) Cluster the graph using the MCL algorithm with inflation parameters r = 2 and r = 2.5.

Table 16.3. Data for Q7

       X1    X2    X3
x1    0.4   0.9   0.6
x2    0.5   0.1   0.6
x3    0.6   0.3   0.6
x4    0.4   0.8   0.5

Q7. Consider Table 16.3. Assuming these are nodes in a graph, define the weighted adjacency matrix A using the linear kernel

A(i, j) = 1 + x_i^T x_j

Cluster the data into two groups using the modularity objective.
CHAPTER 17
Clustering Validation
There exist many different clustering methods, depending on the type of clusters sought and on the inherent data characteristics. Given the diversity of clustering algorithms and their parameters it is important to develop objective approaches to assess clustering results. Cluster validation and assessment encompasses three main tasks: clustering evaluation seeks to assess the goodness or quality of the clustering; clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters; and clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure. There are a number of validity measures and statistics that have been proposed for each of the aforementioned tasks, which can be divided into three main types:

External: External validation measures employ criteria that are not inherent to the dataset. This can be in the form of prior or expert-specified knowledge about the clusters, for example, class labels for each point.

Internal: Internal validation measures employ criteria that are derived from the data itself. For instance, we can use intracluster and intercluster distances to obtain measures of cluster compactness (e.g., how similar are the points in the same cluster?) and separation (e.g., how far apart are the points in different clusters?).

Relative: Relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.

In this chapter we study some of the main techniques for clustering validation and assessment spanning all three types of measures.
17.1 EXTERNAL MEASURES
As the name implies, external measures assume that the correct or ground-truth clustering is known a priori. The true cluster labels play the role of external information that is used to evaluate a given clustering. In general, we would not know the correct clustering; however, external measures can serve as a way to test and validate different methods. For instance, classification datasets that specify the class for each point can be used to evaluate the quality of a clustering. Likewise, synthetic datasets with known cluster structure can be created to evaluate various clustering algorithms by quantifying the extent to which they can recover the known groupings.
Let D = {x_i}_{i=1}^n be a dataset consisting of n points in a d-dimensional space, partitioned into k clusters. Let y_i ∈ {1, 2, ..., k} denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as T = {T_1, T_2, ..., T_k}, where the cluster T_j consists of all the points with label j, i.e., T_j = {x_i ∈ D | y_i = j}. Also, let C = {C_1, ..., C_r} denote a clustering of the same dataset into r clusters, obtained via some clustering algorithm, and let ŷ_i ∈ {1, 2, ..., r} denote the cluster label for x_i. For clarity, henceforth, we will refer to T as the ground-truth partitioning, and to each T_i as a partition. We will call C a clustering, with each C_i referred to as a cluster. Because the ground truth is assumed to be known, typically clustering methods will be run with the correct number of clusters, that is, with r = k. However, to keep the discussion more general, we allow r to be different from k.
External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters. There is usually a trade-off between these two goals, which is either explicitly captured by a measure or is implicit in its computation. All of the external measures rely on the r × k contingency table N that is induced by a clustering C and the ground-truth partitioning T, defined as follows:

N(i, j) = n_ij = |C_i ∩ T_j|

In other words, the count n_ij denotes the number of points that are common to cluster C_i and ground-truth partition T_j. Further, for clarity, let n_i = |C_i| denote the number of points in cluster C_i, and let m_j = |T_j| denote the number of points in partition T_j. The contingency table can be computed from T and C in O(n) time by examining the partition and cluster labels, y_i and ŷ_i, for each point x_i ∈ D and incrementing the corresponding count n_{y_i ŷ_i}.
17.1.1 Matching Based Measures

Purity
Purity quantifies the extent to which a cluster C_i contains entities from only one partition. In other words, it measures how "pure" each cluster is. The purity of cluster C_i is defined as

purity_i = (1/n_i) max_{j=1,...,k} { n_ij }

The purity of clustering C is defined as the weighted sum of the clusterwise purity values:

purity = Σ_{i=1}^r (n_i/n) purity_i = (1/n) Σ_{i=1}^r max_{j=1,...,k} { n_ij }
where the ratio n_i/n denotes the fraction of points in cluster C_i. The larger the purity of C, the better the agreement with the ground truth. The maximum value of purity is 1, when each cluster comprises points from only one partition. When r = k, a purity value of 1 indicates a perfect clustering, with a one-to-one correspondence between the clusters and partitions. However, purity can be 1 even for r > k, when each of the clusters is a subset of a ground-truth partition. When r < k, purity can never be 1, because at least one cluster must contain points from more than one partition.
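Purity is straightforward to compute from the label vectors; a short Python sketch (ours, assuming numpy and 0-based integer labels):

import numpy as np

def contingency(y_true, y_pred):
    # r x k table N with N[i, j] = |C_i intersect T_j|, built in O(n).
    N = np.zeros((y_pred.max() + 1, y_true.max() + 1), dtype=int)
    for yj, cj in zip(y_true, y_pred):
        N[cj, yj] += 1
    return N

def purity(y_true, y_pred):
    N = contingency(y_true, y_pred)
    return N.max(axis=1).sum() / N.sum()   # (1/n) * sum_i max_j n_ij

Applied to the contingency table of Example 17.1 below, this gives 133/150 ≈ 0.887 for the good clustering.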
Maximum Matching
The maximum matching measure selects the mapping between clusters and partitions, such that the sum of the number of common points (n_ij) is maximized, provided that only one cluster can match with a given partition. This is unlike purity, where two different clusters may share the same majority partition.
Formally, we treat the contingency table as a complete weighted bipartite graph G = (V, E), where each partition and cluster is a node, that is, V = C ∪ T, and there exists an edge (C_i, T_j) ∈ E, with weight w(C_i, T_j) = n_ij, for all C_i ∈ C and T_j ∈ T. A matching M in G is a subset of E, such that the edges in M are pairwise nonadjacent, that is, they do not have a common vertex. The maximum matching measure is defined as the maximum weight matching in G:

match = argmax_M { w(M)/n }

where the weight of a matching M is simply the sum of all the edge weights in M, given as w(M) = Σ_{e∈M} w(e). The maximum matching can be computed in time O(|V|²·|E|) = O((r + k)² rk), which is equivalent to O(k⁴) if r = O(k).
F-Measure
Given cluster C_i, let j_i denote the partition that contains the maximum number of points from C_i, that is, j_i = argmax_{j=1,...,k} { n_ij }. The precision of a cluster C_i is the same as its purity:

prec_i = (1/n_i) max_{j=1,...,k} { n_ij } = n_{i j_i}/n_i

It measures the fraction of points in C_i from the majority partition T_{j_i}.
The recall of cluster C_i is defined as

recall_i = n_{i j_i}/|T_{j_i}| = n_{i j_i}/m_{j_i}

where m_{j_i} = |T_{j_i}|. It measures the fraction of points in partition T_{j_i} shared in common with cluster C_i.
The F-measure is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_i is therefore given as

F_i = 2 / ( 1/prec_i + 1/recall_i ) = ( 2 · prec_i · recall_i )/( prec_i + recall_i ) = 2 n_{i j_i} / ( n_i + m_{j_i} )      (17.1)
The F-measure for the clustering C is the mean of the clusterwise F-measure values:

F = (1/r) Σ_{i=1}^r F_i

The F-measure thus tries to balance the precision and recall values across all the clusters. For a perfect clustering, when r = k, the maximum value of the F-measure is 1.
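Given the contingency table, the F-measure follows directly; a Python sketch (ours, assuming numpy):

import numpy as np

def f_measure(N):
    # Clustering F-measure from an r x k contingency table N, via Eq. (17.1).
    n_i = N.sum(axis=1)                        # cluster sizes |C_i|
    m_j = N.sum(axis=0)                        # partition sizes |T_j|
    j_star = N.argmax(axis=1)                  # majority partition j_i per cluster
    n_ij = N[np.arange(N.shape[0]), j_star]    # n_{i j_i}
    return np.mean(2.0 * n_ij / (n_i + m_j[j_star]))

N = np.array([[0, 47, 14], [50, 0, 0], [0, 3, 36]])  # table from Example 17.1(a)
print(round(f_measure(N), 3))                        # 0.885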
Example 17.1. Figure 17.1 shows two different clusterings obtained via the K-means algorithm on the Iris dataset, using the first two principal components as the two dimensions. Here n = 150, and k = 3. Visual inspection confirms that Figure 17.1a is a better clustering than that in Figure 17.1b. We now examine how the different contingency table based measures can be used to evaluate these two clusterings.
Consider the clustering in Figure 17.1a. The three clusters are illustrated with different symbols; the gray points are in the correct partition, whereas the white ones are wrongly clustered compared to the ground-truth Iris types. For instance, C_3 mainly corresponds to partition T_3 (iris-virginica), but it has three points (the white triangles) from T_2. The complete contingency table is as follows:

                  T_1 (iris-setosa)   T_2 (iris-versicolor)   T_3 (iris-virginica)    n_i
C_1 (squares)            0                    47                      14               61
C_2 (circles)           50                     0                       0               50
C_3 (triangles)          0                     3                      36               39
m_j                     50                    50                      50           n = 150

To compute purity, we first note for each cluster the partition with the maximum overlap. We have the correspondence (C_1, T_2), (C_2, T_1), and (C_3, T_3). Thus, purity is given as

purity = (1/150)(47 + 50 + 36) = 133/150 = 0.887

For this contingency table, the maximum matching measure gives the same result, as the correspondence above is in fact a maximum weight matching. Thus, match = 0.887.
The cluster C_1 contains n_1 = 47 + 14 = 61 points, whereas its corresponding partition T_2 contains m_2 = 47 + 3 = 50 points. Thus, the precision and recall for C_1 are given as

prec_1 = 47/61 = 0.77      recall_1 = 47/50 = 0.94

The F-measure for C_1 is therefore

F_1 = (2 · 0.77 · 0.94)/(0.77 + 0.94) = 1.45/1.71 = 0.85
[Figure 17.1. K-means: Iris principal components dataset. (a) K-means: good. (b) K-means: bad.]
We can also directly compute F_1 using Eq. (17.1):

F_1 = 2 n_12/(n_1 + m_2) = (2 · 47)/(61 + 50) = 94/111 = 0.85

Likewise, we obtain F_2 = 1.0 and F_3 = 0.81. Thus, the F-measure value for the clustering is given as

F = (1/3)(F_1 + F_2 + F_3) = 2.66/3 = 0.88

For the clustering in Figure 17.1b, we have the following contingency table:

       T_1 (iris-setosa)   T_2 (iris-versicolor)   T_3 (iris-virginica)    n_i
C_1          30                    0                       0                30
C_2          20                    4                       0                24
C_3           0                   46                      50                96
m_j          50                   50                      50           n = 150

For the purity measure, the partition with which each cluster shares the most points is given as (C_1, T_1), (C_2, T_1), and (C_3, T_3). Thus, the purity value for this clustering is

purity = (1/150)(30 + 20 + 50) = 100/150 = 0.67

We can see that both C_1 and C_2 choose partition T_1 as the maximum overlapping partition. However, the maximum weight matching is different; it yields the correspondence (C_1, T_1), (C_2, T_2), and (C_3, T_3), and thus

match = (1/150)(30 + 4 + 50) = 84/150 = 0.56

The table below compares the different contingency based measures for the two clusterings shown in Figure 17.1.

           purity   match     F
(a) Good   0.887    0.887   0.885
(b) Bad    0.667    0.560   0.658

As expected, the good clustering in Figure 17.1a has higher scores for the purity, maximum matching, and F-measure.
17.1.2 Entropy-based Measures

Conditional Entropy
The entropy of a clustering C is defined as

H(C) = −Σ_{i=1}^r p_Ci log p_Ci

where p_Ci = n_i/n is the probability of cluster C_i. Likewise, the entropy of the partitioning T is defined as

H(T) = −Σ_{j=1}^k p_Tj log p_Tj

where p_Tj = m_j/n is the probability of partition T_j.
The cluster-specific entropy of T, that is, the conditional entropy of T with respect to cluster C_i, is defined as

H(T|C_i) = −Σ_{j=1}^k (n_ij/n_i) log (n_ij/n_i)

The conditional entropy of T given clustering C is then defined as the weighted sum:

H(T|C) = Σ_{i=1}^r (n_i/n) H(T|C_i) = −Σ_{i=1}^r Σ_{j=1}^k (n_ij/n) log (n_ij/n_i)
       = −Σ_{i=1}^r Σ_{j=1}^k p_ij log (p_ij/p_Ci)      (17.2)

where p_ij = n_ij/n is the probability that a point in cluster i also belongs to partition j. The more a cluster's members are split into different partitions, the higher the conditional entropy. For a perfect clustering, the conditional entropy value is zero, whereas the worst possible conditional entropy value is log k. Further, expanding Eq. (17.2), we can see that

H(T|C) = −Σ_{i=1}^r Σ_{j=1}^k p_ij ( log p_ij − log p_Ci )
       = −Σ_{i=1}^r Σ_{j=1}^k p_ij log p_ij + Σ_{i=1}^r ( log p_Ci Σ_{j=1}^k p_ij )
       = −Σ_{i=1}^r Σ_{j=1}^k p_ij log p_ij + Σ_{i=1}^r p_Ci log p_Ci
       = H(C, T) − H(C)      (17.3)

where H(C, T) = −Σ_{i=1}^r Σ_{j=1}^k p_ij log p_ij is the joint entropy of C and T. The conditional entropy H(T|C) thus measures the remaining entropy of T given the clustering C. In particular, H(T|C) = 0 if and only if T is completely determined by C, corresponding to the ideal clustering. On the other hand, if C and T are independent of each other, then H(T|C) = H(T), which means that C provides no information about T.
Normalized Mutual Information
The mutual information tries to quantify the amount of shared information between the clustering C and partitioning T, and it is defined as

I(C, T) = Σ_{i=1}^r Σ_{j=1}^k p_ij log ( p_ij / (p_Ci · p_Tj) )      (17.4)

It measures the dependence between the observed joint probability p_ij of C and T, and the expected joint probability p_Ci · p_Tj under the independence assumption. When C and T are independent then p_ij = p_Ci · p_Tj, and thus I(C, T) = 0. However, there is no upper bound on the mutual information.
Expanding Eq. (17.4) we observe that I(C, T) = H(C) + H(T) − H(C, T). Using Eq. (17.3), we obtain the two equivalent expressions:

I(C, T) = H(T) − H(T|C)
I(C, T) = H(C) − H(C|T)

Finally, because H(C|T) ≥ 0 and H(T|C) ≥ 0, we have the inequalities I(C, T) ≤ H(C) and I(C, T) ≤ H(T). We can obtain a normalized version of mutual information by considering the ratios I(C, T)/H(C) and I(C, T)/H(T), both of which can be at most one. The normalized mutual information (NMI) is defined as the geometric mean of these two ratios:

NMI(C, T) = √( (I(C, T)/H(C)) · (I(C, T)/H(T)) ) = I(C, T) / √( H(C) · H(T) )

The NMI value lies in the range [0, 1]. Values close to 1 indicate a good clustering.
Variation of Information
This criterion is based on the mutual information between the clustering $C$ and the ground-truth partitioning $T$, and their entropy; it is defined as
$$VI(C,T) = \bigl(H(T) - I(C,T)\bigr) + \bigl(H(C) - I(C,T)\bigr) = H(T) + H(C) - 2I(C,T)\tag{17.5}$$
Variation of information (VI) is zero only when $C$ and $T$ are identical. Thus, the lower the VI value the better the clustering $C$.
Using the equivalence $I(C,T) = H(T) - H(T|C) = H(C) - H(C|T)$, we can also express Eq. (17.5) as
$$VI(C,T) = H(T|C) + H(C|T)$$
Finally, noting that $H(T|C) = H(T,C) - H(C)$, another expression for VI is given as
$$VI(C,T) = 2H(T,C) - H(T) - H(C)$$
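The entropy-based measures above reduce to a few array operations on the contingency table. The sketch below (Python with numpy, both assumed; the function name entropy_measures is mine) computes the conditional entropy, NMI, and VI using base-2 logarithms, matching the example that follows.

```python
import numpy as np

def entropy_measures(N):
    """Conditional entropy H(T|C), NMI, and VI from an r x k
    contingency table N with N[i, j] = n_ij. A minimal sketch
    using base-2 logarithms, as in Example 17.2."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    p_ij = N / n                      # joint probabilities p_ij
    p_C = N.sum(axis=1) / n           # cluster probabilities p_Ci
    p_T = N.sum(axis=0) / n           # partition probabilities p_Tj

    def H(p):                         # entropy, skipping zero entries
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_C, H_T, H_CT = H(p_C), H(p_T), H(p_ij.ravel())
    cond_H = H_CT - H_C               # H(T|C), via Eq. (17.3)
    I = H_C + H_T - H_CT              # mutual information, Eq. (17.4)
    nmi = I / np.sqrt(H_C * H_T)      # normalized mutual information
    vi = H_C + H_T - 2 * I            # variation of information, Eq. (17.5)
    return cond_H, nmi, vi

# Contingency table for the "good" clustering (see Example 17.2):
N = [[0, 47, 14], [50, 0, 0], [0, 3, 36]]
print(entropy_measures(N))  # approx (0.418, 0.742, 0.812)
```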
Example 17.2. We continue with Example 17.1, which compares the two clusterings shown in Figure 17.1. For the entropy-based measures, we use base 2 for the logarithms; the formulas remain valid for any base.
For the clustering in Figure 17.1a, we have the following contingency table:

              iris-setosa   iris-versicolor   iris-virginica
                  T_1             T_2              T_3          n_i
    C_1            0               47               14           61
    C_2           50                0                0           50
    C_3            0                3               36           39
    m_j           50               50               50        n = 150
Consider the conditional entropy for cluster $C_1$:
$$H(T|C_1) = -\frac{0}{61}\log_2\frac{0}{61} - \frac{47}{61}\log_2\frac{47}{61} - \frac{14}{61}\log_2\frac{14}{61} = -0 - 0.77\log_2(0.77) - 0.23\log_2(0.23) = 0.29 + 0.49 = 0.78$$
In a similar manner, we obtain $H(T|C_2) = 0$ and $H(T|C_3) = 0.39$. The conditional entropy for the clustering $C$ is then given as
$$H(T|C) = \frac{61}{150}\cdot 0.78 + \frac{50}{150}\cdot 0 + \frac{39}{150}\cdot 0.39 = 0.32 + 0 + 0.10 = 0.42$$
To compute the normalized mutual information, note that
$$H(T) = -3\left(\frac{50}{150}\log_2\frac{50}{150}\right) = 1.585$$
$$H(C) = -\left(\frac{61}{150}\log_2\frac{61}{150} + \frac{50}{150}\log_2\frac{50}{150} + \frac{39}{150}\log_2\frac{39}{150}\right) = 0.528 + 0.528 + 0.505 = 1.561$$
$$
\begin{aligned}
I(C,T) &= \frac{47}{150}\log_2\left(\frac{47\cdot 150}{61\cdot 50}\right) + \frac{14}{150}\log_2\left(\frac{14\cdot 150}{61\cdot 50}\right) + \frac{50}{150}\log_2\left(\frac{50\cdot 150}{50\cdot 50}\right)\\
&\quad + \frac{3}{150}\log_2\left(\frac{3\cdot 150}{39\cdot 50}\right) + \frac{36}{150}\log_2\left(\frac{36\cdot 150}{39\cdot 50}\right)\\
&= 0.379 - 0.05 + 0.528 - 0.042 + 0.353 = 1.167
\end{aligned}
$$
Thus, the NMI and VI values are
$$NMI(C,T) = \frac{I(C,T)}{\sqrt{H(T)\cdot H(C)}} = \frac{1.167}{\sqrt{1.585\times 1.561}} = 0.742$$
$$VI(C,T) = H(T) + H(C) - 2I(C,T) = 1.585 + 1.561 - 2\cdot 1.167 = 0.812$$
We can likewise compute these measures for the other clustering in Figure 17.1b, whose contingency table is shown in Example 17.1.
The table below compares the entropy-based measures for the two clusterings shown in Figure 17.1.

               H(T|C)    NMI     VI
    (a) Good    0.418    0.742   0.812
    (b) Bad     0.743    0.587   1.200

As expected, the good clustering in Figure 17.1a has a higher score for normalized mutual information, and lower scores for conditional entropy and variation of information.
17.1.3 Pairwise Measures

Given clustering $C$ and ground-truth partitioning $T$, the pairwise measures utilize the partition and cluster label information over all pairs of points. Let $x_i, x_j \in D$ be any two points, with $i \neq j$. Let $y_i$ denote the true partition label and let $\hat{y}_i$ denote the cluster label for point $x_i$. If both $x_i$ and $x_j$ belong to the same cluster, that is, $\hat{y}_i = \hat{y}_j$, we call it a positive event, and if they do not belong to the same cluster, that is, $\hat{y}_i \neq \hat{y}_j$, we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:
• True Positives: $x_i$ and $x_j$ belong to the same partition in $T$, and they are also in the same cluster in $C$. This is a true positive pair because the positive event, $\hat{y}_i = \hat{y}_j$, corresponds to the ground truth, $y_i = y_j$. The number of true positive pairs is given as
$$TP = \bigl|\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j\}\bigr|$$
• False Negatives: $x_i$ and $x_j$ belong to the same partition in $T$, but they do not belong to the same cluster in $C$. That is, the negative event, $\hat{y}_i \neq \hat{y}_j$, does not correspond to the truth, $y_i = y_j$. This pair is thus a false negative, and the number of all false negative pairs is given as
$$FN = \bigl|\{(x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}\bigr|$$
• False Positives: $x_i$ and $x_j$ do not belong to the same partition in $T$, but they do belong to the same cluster in $C$. This pair is a false positive because the positive event, $\hat{y}_i = \hat{y}_j$, is actually false, that is, it does not agree with the ground-truth partitioning, which indicates that $y_i \neq y_j$. The number of false positive pairs is given as
$$FP = \bigl|\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j\}\bigr|$$
• True Negatives: $x_i$ and $x_j$ neither belong to the same partition in $T$, nor do they belong to the same cluster in $C$. This pair is thus a true negative, that is, $\hat{y}_i \neq \hat{y}_j$ and $y_i \neq y_j$. The number of such true negative pairs is given as
$$TN = \bigl|\{(x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j\}\bigr|$$
Because there are $N = \binom{n}{2} = \frac{n(n-1)}{2}$ pairs of points, we have the following identity:
$$N = TP + FN + FP + TN \tag{17.6}$$
A naive computation of the preceding four cases requires $O(n^2)$ time. However, they can be computed more efficiently using the contingency table $\mathbf{N} = \{n_{ij}\}$, with $1 \le i \le r$ and $1 \le j \le k$. The number of true positives is given as
$$TP = \sum_{i=1}^{r}\sum_{j=1}^{k}\binom{n_{ij}}{2} = \sum_{i=1}^{r}\sum_{j=1}^{k}\frac{n_{ij}(n_{ij}-1)}{2} = \frac{1}{2}\left(\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}\right) = \frac{1}{2}\left(\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 - n\right)\tag{17.7}$$
This follows from the fact that each pair of points among the $n_{ij}$ share the same cluster label ($i$) and the same partition label ($j$). The last step follows from the fact that the sum of all the entries in the contingency table must add to $n$, that is, $\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij} = n$.
To compute the total number of false negatives, we remove the number of true positives from the number of pairs that belong to the same partition. Because two points $x_i$ and $x_j$ that belong to the same partition have $y_i = y_j$, if we remove the true positives, that is, pairs with $\hat{y}_i = \hat{y}_j$, we are left with pairs for whom $\hat{y}_i \neq \hat{y}_j$, that is, the false negatives. We thus have
$$FN = \sum_{j=1}^{k}\binom{m_j}{2} - TP = \frac{1}{2}\left(\sum_{j=1}^{k} m_j^2 - \sum_{j=1}^{k} m_j - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2 + n\right) = \frac{1}{2}\left(\sum_{j=1}^{k} m_j^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)\tag{17.8}$$
The last step follows from the fact that $\sum_{j=1}^{k} m_j = n$.
The number of false positives can be obtained in a similar manner by subtracting the number of true positives from the number of point pairs that are in the same cluster:
$$FP = \sum_{i=1}^{r}\binom{n_i}{2} - TP = \frac{1}{2}\left(\sum_{i=1}^{r} n_i^2 - \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)\tag{17.9}$$
Finally, the number of true negatives can be obtained via Eq. (17.6) as follows:
$$TN = N - (TP + FN + FP) = \frac{1}{2}\left(n^2 - \sum_{i=1}^{r} n_i^2 - \sum_{j=1}^{k} m_j^2 + \sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}^2\right)\tag{17.10}$$
Each of the four values can be computed in $O(rk)$ time. Because the contingency table can be obtained in linear time, the total time to compute the four values is $O(n + rk)$, which is much better than the naive $O(n^2)$ bound. We next consider pairwise assessment measures based on these four values.
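These counting formulas are straightforward to implement. Below is a minimal numpy sketch (the helper name pair_counts is mine) that computes the four pair counts from the contingency table via Eqs. (17.7)–(17.10); the printed values match Example 17.3.

```python
import numpy as np

def pair_counts(N):
    """TP, FN, FP, TN from an r x k contingency table N,
    following Eqs. (17.7)-(17.10)."""
    N = np.asarray(N, dtype=np.int64)
    n = int(N.sum())
    sum_sq = int(np.sum(N ** 2))                 # sum of n_ij^2
    ni_sq = int(np.sum(N.sum(axis=1) ** 2))      # sum of n_i^2 (cluster sizes)
    mj_sq = int(np.sum(N.sum(axis=0) ** 2))      # sum of m_j^2 (partition sizes)
    TP = (sum_sq - n) // 2                       # Eq. (17.7)
    FN = (mj_sq - sum_sq) // 2                   # Eq. (17.8)
    FP = (ni_sq - sum_sq) // 2                   # Eq. (17.9)
    TN = n * (n - 1) // 2 - (TP + FN + FP)       # Eq. (17.10)
    return TP, FN, FP, TN

# Contingency table from Example 17.3:
print(pair_counts([[0, 47, 14], [50, 0, 0], [0, 3, 36]]))
# -> (3030, 645, 766, 6734)
```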
Jaccard Coefficient
The Jaccard coefficient measures the fraction of true positive point pairs, but after ignoring the true negatives. It is defined as follows:
$$Jaccard = \frac{TP}{TP + FN + FP}\tag{17.11}$$
For a perfect clustering $C$ (i.e., total agreement with the partitioning $T$), the Jaccard coefficient has value 1, as in that case there are no false positives or false negatives. The Jaccard coefficient is asymmetric in terms of the true positives and negatives because it ignores the true negatives. In other words, it emphasizes the similarity in terms of the point pairs that belong together in both the clustering and ground-truth partitioning, but it discounts the point pairs that do not belong together.
Rand Statistic
The Rand statistic measures the fraction of true positives and true negatives over all point pairs; it is defined as
$$Rand = \frac{TP + TN}{N}\tag{17.12}$$
The Rand statistic, which is symmetric, measures the fraction of point pairs where both $C$ and $T$ agree. A perfect clustering has a value of 1 for the statistic.
Fowlkes–Mallows Measure
Define the overall pairwise precision and pairwise recall values for a clustering $C$, as follows:
$$prec = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$
Precision measures the fraction of true or correctly clustered point pairs compared to all the point pairs in the same cluster. On the other hand, recall measures the fraction of correctly labeled point pairs compared to all the point pairs in the same partition.
The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall:
$$FM = \sqrt{prec \cdot recall} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}\tag{17.13}$$
The FM measure is also asymmetric in terms of the true positives and negatives because it ignores the true negatives. Its highest value is also 1, achieved when there are no false positives or negatives.
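Given the four counts, the three pairwise measures are one-liners. A small sketch, reusing pair_counts from the earlier snippet (the function name pairwise_measures is also mine):

```python
import math

def pairwise_measures(TP, FN, FP, TN):
    """Jaccard (Eq. 17.11), Rand (Eq. 17.12), and Fowlkes-Mallows
    (Eq. 17.13) from the four pair counts."""
    N = TP + FN + FP + TN                          # Eq. (17.6)
    jaccard = TP / (TP + FN + FP)
    rand = (TP + TN) / N
    fm = TP / math.sqrt((TP + FN) * (TP + FP))
    return jaccard, rand, fm

print(pairwise_measures(3030, 645, 766, 6734))
# -> approx (0.6823, 0.8737, 0.8112)
```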
Example 17.3. Let us continue with Example 17.1. Consider again the contingency table for the clustering in Figure 17.1a:

              iris-setosa   iris-versicolor   iris-virginica
                  T_1             T_2              T_3
    C_1            0               47               14
    C_2           50                0                0
    C_3            0                3               36
Using Eq. (17.7), we can obtain the number of true positives as follows:
$$TP = \binom{47}{2} + \binom{14}{2} + \binom{50}{2} + \binom{3}{2} + \binom{36}{2} = 1081 + 91 + 1225 + 3 + 630 = 3030$$
Using Eqs. (17.8), (17.9), and (17.10), we obtain
$$FN = 645 \qquad FP = 766 \qquad TN = 6734$$
Note that there are a total of $N = \binom{150}{2} = 11175$ point pairs.
We can now compute the different pairwise measures for clustering evaluation. The Jaccard coefficient [Eq. (17.11)], Rand statistic [Eq. (17.12)], and Fowlkes–Mallows measure [Eq. (17.13)] are given as
$$Jaccard = \frac{3030}{3030 + 645 + 766} = \frac{3030}{4441} = 0.68$$
$$Rand = \frac{3030 + 6734}{11175} = \frac{9764}{11175} = 0.87$$
$$FM = \frac{3030}{\sqrt{3675\cdot 3796}} = \frac{3030}{3735} = 0.81$$
Using the contingency table for the clustering in Figure 17.1b from Example 17.1, we obtain
$$TP = 2891 \qquad FN = 784 \qquad FP = 2380 \qquad TN = 5120$$
The table below compares the different contingency-based measures on the two clusterings in Figure 17.1.

              Jaccard   Rand    FM
    (a) Good   0.682    0.873   0.811
    (b) Bad    0.477    0.717   0.657

As expected, the clustering in Figure 17.1a has higher scores for all three measures.
17.1.4 Correlation Measures

Let $X$ and $Y$ be two symmetric $n \times n$ matrices, and let $N = \binom{n}{2}$. Let $x, y \in \mathbb{R}^N$ denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of $X$ and $Y$ (e.g., in a row-wise manner), respectively. Let $\mu_X$ denote the element-wise mean of $x$, given as
$$\mu_X = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i,j) = \frac{1}{N}\, x^T \mathbf{1}$$
and let $z_x$ denote the centered $x$ vector, defined as
$$z_x = x - \mathbf{1}\cdot\mu_X$$
where $\mathbf{1} \in \mathbb{R}^N$ is the vector of all ones. Likewise, let $\mu_Y$ be the element-wise mean of $y$, and $z_y$ the centered $y$ vector.
The Hubert statistic is defined as the averaged element-wise product between $X$ and $Y$:
$$\Gamma = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} X(i,j)\cdot Y(i,j) = \frac{1}{N}\, x^T y\tag{17.14}$$
The normalized Hubert statistic is defined as the element-wise correlation between $X$ and $Y$:
$$\Gamma_n = \frac{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(X(i,j)-\mu_X\bigr)\bigl(Y(i,j)-\mu_Y\bigr)}{\sqrt{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(X(i,j)-\mu_X\bigr)^2}\ \sqrt{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(Y(i,j)-\mu_Y\bigr)^2}} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2\,\sigma_Y^2}}$$
where $\sigma_X^2$ and $\sigma_Y^2$ are the variances, and $\sigma_{XY}$ the covariance, for the vectors $x$ and $y$, defined as
$$\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(X(i,j)-\mu_X\bigr)^2 = \frac{1}{N}\, z_x^T z_x = \frac{1}{N}\,\|z_x\|^2$$
$$\sigma_Y^2 = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(Y(i,j)-\mu_Y\bigr)^2 = \frac{1}{N}\, z_y^T z_y = \frac{1}{N}\,\|z_y\|^2$$
$$\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\bigl(X(i,j)-\mu_X\bigr)\bigl(Y(i,j)-\mu_Y\bigr) = \frac{1}{N}\, z_x^T z_y$$
Thus, the normalized Hubert statistic can be rewritten as
$$\Gamma_n = \frac{z_x^T z_y}{\|z_x\|\cdot\|z_y\|} = \cos\theta\tag{17.15}$$
where $\theta$ is the angle between the two centered vectors $z_x$ and $z_y$. It follows immediately that $\Gamma_n$ ranges from $-1$ to $+1$.
When $X$ and $Y$ are arbitrary $n \times n$ matrices the above expressions can be easily modified to range over all the $n^2$ elements of the two matrices. The (normalized) Hubert statistic can be used as an external evaluation measure, with appropriately defined matrices $X$ and $Y$, as described next.
Discretized Hubert Statistic
Let $T$ and $C$ be the $n \times n$ matrices defined as
$$T(i,j) = \begin{cases} 1 & \text{if } y_i = y_j,\ i \neq j\\ 0 & \text{otherwise}\end{cases}\qquad C(i,j) = \begin{cases} 1 & \text{if } \hat{y}_i = \hat{y}_j,\ i \neq j\\ 0 & \text{otherwise}\end{cases}$$
Also, let $t, c \in \mathbb{R}^N$ denote the $N$-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of $T$ and $C$, respectively, where $N = \binom{n}{2}$ denotes the number of distinct point pairs. Finally, let $z_t$ and $z_c$ denote the centered $t$ and $c$ vectors.
The discretized Hubert statistic is computed via Eq. (17.14), by setting $x = t$ and $y = c$:
$$\Gamma = \frac{1}{N}\, t^T c = \frac{TP}{N}\tag{17.16}$$
Because the $i$th element of $t$ is 1 only when the $i$th pair of points belongs to the same partition, and, likewise, the $i$th element of $c$ is 1 only when the $i$th pair of points also belongs to the same cluster, the dot product $t^T c$ is simply the number of true positives, and thus the $\Gamma$ value is equivalent to the fraction of all pairs that are true positives. It follows that the higher the agreement between the ground-truth partitioning $T$ and clustering $C$, the higher the $\Gamma$ value.
Normalized Discretized Hubert Statistic
The normalized version of the discretized Hubert statistic is simply the correlation between $t$ and $c$ [Eq. (17.15)]:
$$\Gamma_n = \frac{z_t^T z_c}{\|z_t\|\cdot\|z_c\|} = \cos\theta\tag{17.17}$$
Note that $\mu_T = \frac{1}{N}\, t^T t$ is the fraction of point pairs that belong to the same partition, that is, with $y_i = y_j$, regardless of whether $\hat{y}_i$ matches $\hat{y}_j$ or not. Thus, we have
$$\mu_T = \frac{t^T t}{N} = \frac{TP + FN}{N}$$
Similarly, $\mu_C = \frac{1}{N}\, c^T c$ is the fraction of point pairs that belong to the same cluster, that is, with $\hat{y}_i = \hat{y}_j$, regardless of whether $y_i$ matches $y_j$ or not, so that
$$\mu_C = \frac{c^T c}{N} = \frac{TP + FP}{N}$$
Substituting these into the numerator in Eq. (17.17), we get
$$
\begin{aligned}
z_t^T z_c &= (t - \mathbf{1}\cdot\mu_T)^T (c - \mathbf{1}\cdot\mu_C)\\
&= t^T c - \mu_C\, t^T\mathbf{1} - \mu_T\, c^T\mathbf{1} + \mathbf{1}^T\mathbf{1}\,\mu_T\mu_C\\
&= t^T c - N\mu_C\mu_T - N\mu_T\mu_C + N\mu_T\mu_C\\
&= t^T c - N\mu_T\mu_C\\
&= TP - N\mu_T\mu_C
\end{aligned}\tag{17.18}
$$
where $\mathbf{1} \in \mathbb{R}^N$ is the vector of all 1's. We also made use of the identities $t^T\mathbf{1} = t^T t$ and $c^T\mathbf{1} = c^T c$. Likewise, we can derive
$$\|z_t\|^2 = z_t^T z_t = t^T t - N\mu_T^2 = N\mu_T - N\mu_T^2 = N\mu_T(1 - \mu_T)\tag{17.19}$$
$$\|z_c\|^2 = z_c^T z_c = c^T c - N\mu_C^2 = N\mu_C - N\mu_C^2 = N\mu_C(1 - \mu_C)\tag{17.20}$$
Plugging Eqs. (17.18), (17.19), and (17.20) into Eq. (17.17), the normalized, discretized Hubert statistic can be written as
$$\Gamma_n = \frac{\frac{TP}{N} - \mu_T\mu_C}{\sqrt{\mu_T\mu_C(1-\mu_T)(1-\mu_C)}}\tag{17.21}$$
Because $\mu_T = \frac{TP+FN}{N}$ and $\mu_C = \frac{TP+FP}{N}$, the normalized $\Gamma_n$ statistic can be computed using only the $TP$, $FN$, and $FP$ values. The maximum value of $\Gamma_n = +1$ is obtained when there are no false positives or negatives, that is, when $FN = FP = 0$. The minimum value of $\Gamma_n = -1$ is when there are no true positives and negatives, that is, when $TP = TN = 0$.
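Since $\Gamma$ and $\Gamma_n$ depend only on the pair counts, they can be computed directly from $TP$, $FN$, $FP$, and $TN$. A minimal sketch (the function name is mine), whose output matches Example 17.4:

```python
import math

def hubert_discretized(TP, FN, FP, TN):
    """Discretized Hubert statistic (Eq. 17.16) and its normalized
    version (Eq. 17.21) from the four pair counts."""
    N = TP + FN + FP + TN
    gamma = TP / N                        # Eq. (17.16)
    mu_T = (TP + FN) / N                  # fraction of same-partition pairs
    mu_C = (TP + FP) / N                  # fraction of same-cluster pairs
    gamma_n = (gamma - mu_T * mu_C) / math.sqrt(
        mu_T * mu_C * (1 - mu_T) * (1 - mu_C))   # Eq. (17.21)
    return gamma, gamma_n

print(hubert_discretized(3030, 645, 766, 6734))  # approx (0.271, 0.717)
```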
Example 17.4. Continuing Example 17.3, for the good clustering in Figure 17.1a, we have
$$TP = 3030 \qquad FN = 645 \qquad FP = 766 \qquad TN = 6734$$
From these values, we obtain
$$\mu_T = \frac{TP+FN}{N} = \frac{3675}{11175} = 0.33 \qquad \mu_C = \frac{TP+FP}{N} = \frac{3796}{11175} = 0.34$$
Using Eqs. (17.16) and (17.21) the Hubert statistic values are
$$\Gamma = \frac{3030}{11175} = 0.271 \qquad \Gamma_n = \frac{0.27 - 0.33\cdot 0.34}{\sqrt{0.33\cdot 0.34\cdot(1-0.33)\cdot(1-0.34)}} = \frac{0.159}{0.222} = 0.717$$
Likewise, for the bad clustering in Figure 17.1b, we have
$$TP = 2891 \qquad FN = 784 \qquad FP = 2380 \qquad TN = 5120$$
and the values for the discretized Hubert statistic are given as
$$\Gamma = 0.258 \qquad \Gamma_n = 0.442$$
We observe that the good clustering has higher values, though the normalized statistic is more discerning than the unnormalized version, that is, the good clustering has a much higher value of $\Gamma_n$ than the bad clustering, whereas the difference in $\Gamma$ for the two clusterings is not that high.
17.2 INTERNAL MEASURES

Internal evaluation measures do not have recourse to the ground-truth partitioning, which is the typical scenario when clustering a dataset. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, with usually a trade-off in maximizing these two aims. The internal measures are based on the $n \times n$ distance matrix, also called the proximity matrix, of all pairwise distances among the $n$ points:
$$W = \bigl\{\delta(x_i, x_j)\bigr\}_{i,j=1}^{n}\tag{17.22}$$
where
$$\delta(x_i, x_j) = \|x_i - x_j\|_2$$
is the Euclidean distance between $x_i, x_j \in D$, although other distance metrics can also be used. Because $W$ is symmetric and $\delta(x_i, x_i) = 0$, usually only the upper triangular elements of $W$ (excluding the diagonal) are used in the internal measures.
The proximity matrix $W$ can also be considered as the adjacency matrix of the weighted complete graph $G$ over the $n$ points, that is, with nodes $V = \{x_i \mid x_i \in D\}$, edges $E = \{(x_i, x_j) \mid x_i, x_j \in D\}$, and edge weights $w_{ij} = W(i,j)$ for all $x_i, x_j \in D$. There is thus a close connection between the internal evaluation measures and the graph clustering objectives we examined in Chapter 16.
For internal measures, we assume that we do not have access to a ground-truth partitioning. Instead, we assume that we are given a clustering $C = \{C_1, \ldots, C_k\}$ comprising $r = k$ clusters, with cluster $C_i$ containing $n_i = |C_i|$ points. Let $\hat{y}_i \in \{1, 2, \ldots, k\}$ denote the cluster label for point $x_i$. The clustering $C$ can be considered as a $k$-way cut in $G$ because $C_i \neq \emptyset$ for all $i$, $C_i \cap C_j = \emptyset$ for all $i, j$, and $\bigcup_i C_i = V$. Given any subsets $S, R \subset V$, define $W(S, R)$ as the sum of the weights on all edges with one vertex in $S$ and the other in $R$, given as
$$W(S, R) = \sum_{x_i\in S}\sum_{x_j\in R} w_{ij}$$
Also, given $S \subseteq V$, we denote by $\overline{S}$ the complementary set of vertices, that is, $\overline{S} = V - S$.
The internal measures are based on various functions over the intracluster and
intercluster weights. In particular, note that the sum of all the intracluster weights over
all clusters is given as
$$W_{in} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, C_i)\tag{17.23}$$
We divide by 2 because each edge within $C_i$ is counted twice in the summation given by $W(C_i, C_i)$. Also note that the sum of all intercluster weights is given as
$$W_{out} = \frac{1}{2}\sum_{i=1}^{k} W(C_i, \overline{C_i}) = \sum_{i=1}^{k-1}\sum_{j>i} W(C_i, C_j)\tag{17.24}$$
Here too we divide by 2 because each edge is counted twice in the summation across clusters. The number of distinct intracluster edges, denoted $N_{in}$, and intercluster edges, denoted $N_{out}$, are given as
$$N_{in} = \sum_{i=1}^{k}\binom{n_i}{2} = \frac{1}{2}\sum_{i=1}^{k} n_i(n_i - 1)$$
$$N_{out} = \sum_{i=1}^{k-1}\sum_{j=i+1}^{k} n_i\cdot n_j = \frac{1}{2}\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k} n_i\cdot n_j$$
Note that the total number of distinct pairs of points $N$ satisfies the identity
$$N = N_{in} + N_{out} = \binom{n}{2} = \frac{1}{2}\, n(n-1)$$
Example 17.5. Figure 17.2 shows the graphs corresponding to the two K-means clusterings shown in Figure 17.1. Here, each vertex corresponds to a point $x_i \in D$, and an edge $(x_i, x_j)$ exists between each pair of points. However, only the intracluster edges are shown (with intercluster edges omitted) to avoid clutter. Because internal measures do not have access to a ground truth labeling, the goodness of a clustering is measured based on intracluster and intercluster statistics.
BetaCV Measure
The BetaCV measure is the ratio of the mean intracluster distance to the mean intercluster distance:
$$BetaCV = \frac{W_{in}/N_{in}}{W_{out}/N_{out}} = \frac{N_{out}}{N_{in}}\cdot\frac{W_{in}}{W_{out}} = \frac{N_{out}}{N_{in}}\cdot\frac{\sum_{i=1}^{k} W(C_i, C_i)}{\sum_{i=1}^{k} W(C_i, \overline{C_i})}$$
The smaller the BetaCV ratio, the better the clustering, as it indicates that intracluster distances are on average smaller than intercluster distances.
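As a concrete illustration, the following sketch computes $W_{in}$, $W_{out}$, $N_{in}$, $N_{out}$, and BetaCV from a data matrix and cluster labels, assuming Euclidean distances and numpy (the function name betacv is mine):

```python
import numpy as np

def betacv(X, labels):
    """BetaCV from a data matrix X (n x d) and cluster labels,
    using the Euclidean proximity matrix W (Eq. 17.22)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full pairwise Euclidean distance matrix
    W = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    same = labels[:, None] == labels[None, :]   # same-cluster mask
    iu = np.triu_indices(len(X), k=1)           # distinct pairs only
    W_in = W[iu][same[iu]].sum()                # intracluster weight, Eq. (17.23)
    W_out = W[iu][~same[iu]].sum()              # intercluster weight, Eq. (17.24)
    N_in = int(same[iu].sum())                  # number of intracluster pairs
    N_out = int((~same[iu]).sum())              # number of intercluster pairs
    return (W_in / N_in) / (W_out / N_out)
```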
C-index
Let $W_{min}(N_{in})$ be the sum of the smallest $N_{in}$ distances in the proximity matrix $W$, where $N_{in}$ is the total number of intracluster edges, or point pairs. Let $W_{max}(N_{in})$ be the sum of the largest $N_{in}$ distances in $W$.
[Figure 17.2. Clusterings as graphs: Iris. (a) K-means: good. (b) K-means: bad.]
The C-index measures to what extent the clustering puts together the $N_{in}$ points that are the closest across the $k$ clusters. It is defined as
$$Cindex = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}$$
where $W_{in}$ is the sum of all the intracluster distances [Eq. (17.23)]. The C-index lies in the range $[0, 1]$. The smaller the C-index, the better the clustering, as it indicates more compact clusters with relatively smaller distances within clusters rather than between clusters.
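A sketch of the C-index along the same lines, sorting all pairwise distances once (the function name c_index is mine):

```python
import numpy as np

def c_index(X, labels):
    """C-index: compare W_in against the sums of the N_in smallest
    and largest pairwise distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    W = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)
    d = np.sort(W[iu])                          # all distances, ascending
    same = (labels[:, None] == labels[None, :])[iu]
    N_in = int(same.sum())
    W_in = W[iu][same].sum()
    W_min, W_max = d[:N_in].sum(), d[-N_in:].sum()
    return (W_in - W_min) / (W_max - W_min)
```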
Normalized Cut Measure
The normalized cut objective [Eq. (16.17)] for graph clustering can also be used as an internal clustering evaluation measure:
$$NC = \sum_{i=1}^{k}\frac{W(C_i, \overline{C_i})}{vol(C_i)} = \sum_{i=1}^{k}\frac{W(C_i, \overline{C_i})}{W(C_i, V)}$$
where $vol(C_i) = W(C_i, V)$ is the volume of cluster $C_i$, that is, the total weights on edges with at least one end in the cluster. However, because we are using the proximity or distance matrix $W$, instead of the affinity or similarity matrix $A$, the higher the normalized cut value the better.
To see this, we make use of the observation that $W(C_i, V) = W(C_i, C_i) + W(C_i, \overline{C_i})$, so that
$$NC = \sum_{i=1}^{k}\frac{W(C_i, \overline{C_i})}{W(C_i, C_i) + W(C_i, \overline{C_i})} = \sum_{i=1}^{k}\frac{1}{\dfrac{W(C_i, C_i)}{W(C_i, \overline{C_i})} + 1}$$
We can see that NC is maximized when the ratios $\frac{W(C_i, C_i)}{W(C_i, \overline{C_i})}$ (across the $k$ clusters) are as small as possible, which happens when the intracluster distances are much smaller compared to intercluster distances, that is, when the clustering is good. The maximum possible value of NC is $k$.
Modularity
The modularity objective for graph clustering [Eq. (16.26)] can also be used as an internal measure:
$$Q = \sum_{i=1}^{k}\left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right)$$
where
$$W(V, V) = \sum_{i=1}^{k} W(C_i, V) = \sum_{i=1}^{k} W(C_i, C_i) + \sum_{i=1}^{k} W(C_i, \overline{C_i}) = 2(W_{in} + W_{out})$$
The last step follows from Eqs. (17.23) and (17.24). Modularity measures the difference between the observed and expected fraction of weights on edges within the clusters. Since we are using the distance matrix, the smaller the modularity measure the better the clustering, which indicates that the intracluster distances are lower than expected.
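Both graph-based measures can be evaluated from the distance matrix and the cluster labels. A minimal sketch (the function name nc_and_modularity is mine); note that $W(C_i, C_i)$ here counts each intracluster edge twice, consistent with Eq. (17.23):

```python
import numpy as np

def nc_and_modularity(X, labels):
    """Normalized cut and modularity over the distance-based
    complete graph, following the definitions above."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    W = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W_V = W.sum()                              # W(V, V); diagonal is 0
    nc = q = 0.0
    for c in np.unique(labels):
        m = labels == c
        W_ii = W[np.ix_(m, m)].sum()           # W(C_i, C_i), each edge twice
        W_iV = W[m, :].sum()                   # W(C_i, V) = vol(C_i)
        nc += (W_iV - W_ii) / W_iV             # W(C_i, C_i-bar) / vol(C_i)
        q += W_ii / W_V - (W_iV / W_V) ** 2    # modularity contribution
    return nc, q
```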
Dunn Index
The Dunn index is defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster. More formally, we have
$$Dunn = \frac{W_{out}^{min}}{W_{in}^{max}}$$
where $W_{out}^{min}$ is the minimum intercluster distance:
$$W_{out}^{min} = \min_{i,\, j>i}\bigl\{w_{ab} \mid x_a \in C_i,\ x_b \in C_j\bigr\}$$
and $W_{in}^{max}$ is the maximum intracluster distance:
$$W_{in}^{max} = \max_{i}\bigl\{w_{ab} \mid x_a, x_b \in C_i\bigr\}$$
The larger the Dunn index the better the clustering because it means even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster. However, the Dunn index may be insensitive because the minimum intercluster and maximum intracluster distances do not capture all the information about a clustering.
Davies–Bouldin Index
Let $\mu_i$ denote the cluster mean, given as
$$\mu_i = \frac{1}{n_i}\sum_{x_j\in C_i} x_j\tag{17.25}$$
Further, let $\sigma_{\mu_i}$ denote the dispersion or spread of the points around the cluster mean, given as
$$\sigma_{\mu_i} = \sqrt{\frac{\sum_{x_j\in C_i}\delta(x_j, \mu_i)^2}{n_i}} = \sqrt{var(C_i)}$$
where $var(C_i)$ is the total variance [Eq. (1.4)] of cluster $C_i$.
The Davies–Bouldin measure for a pair of clusters $C_i$ and $C_j$ is defined as the ratio
$$DB_{ij} = \frac{\sigma_{\mu_i} + \sigma_{\mu_j}}{\delta(\mu_i, \mu_j)}$$
$DB_{ij}$ measures how compact the clusters are compared to the distance between the cluster means. The Davies–Bouldin index is then defined as
$$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j\neq i}\{DB_{ij}\}$$
That is, for each cluster $C_i$, we pick the cluster $C_j$ that yields the largest $DB_{ij}$ ratio. The smaller the DB value the better the clustering, as it means that the clusters are well separated (i.e., the distance between cluster means is large), and each cluster is well represented by its mean (i.e., has a small spread).
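A combined sketch for the Dunn and Davies–Bouldin indexes (the function name dunn_and_db is mine), following the definitions above:

```python
import numpy as np

def dunn_and_db(X, labels):
    """Dunn index and Davies-Bouldin index; a minimal numpy sketch."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    W = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Dunn: min intercluster vs. max intracluster pairwise distance
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(X), k=1)
    dunn = W[iu][~same[iu]].min() / W[iu][same[iu]].max()
    # Davies-Bouldin: per-cluster means (Eq. 17.25) and dispersions
    mus = np.array([X[labels == c].mean(axis=0) for c in ks])
    sig = np.array([np.sqrt(((X[labels == c] - mus[i]) ** 2).sum(1).mean())
                    for i, c in enumerate(ks)])
    k = len(ks)
    db = 0.0
    for i in range(k):
        db += max((sig[i] + sig[j]) / np.linalg.norm(mus[i] - mus[j])
                  for j in range(k) if j != i)   # worst-case DB_ij
    return dunn, db / k
```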
Silhouette Coefficient
The silhouette coefficient is a measure of both cohesion and separation of clusters, and is based on the difference between the average distance to points in the closest cluster and to points in the same cluster. For each point $x_i$ we calculate its silhouette coefficient $s_i$ as
$$s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\bigl\{\mu_{out}^{min}(x_i),\ \mu_{in}(x_i)\bigr\}}\tag{17.26}$$
where $\mu_{in}(x_i)$ is the mean distance from $x_i$ to points in its own cluster $\hat{y}_i$:
$$\mu_{in}(x_i) = \frac{\sum_{x_j\in C_{\hat{y}_i},\, j\neq i}\delta(x_i, x_j)}{n_{\hat{y}_i} - 1}$$
and $\mu_{out}^{min}(x_i)$ is the mean of the distances from $x_i$ to points in the closest cluster:
$$\mu_{out}^{min}(x_i) = \min_{j\neq \hat{y}_i}\left\{\frac{\sum_{y\in C_j}\delta(x_i, y)}{n_j}\right\}$$
The $s_i$ value of a point lies in the interval $[-1, +1]$. A value close to $+1$ indicates that $x_i$ is much closer to points in its own cluster and is far from other clusters. A value close to zero indicates that $x_i$ is close to the boundary between two clusters. Finally, a value close to $-1$ indicates that $x_i$ is much closer to another cluster than its own cluster, and therefore, the point may be mis-clustered.
The silhouette coefficient is defined as the mean $s_i$ value across all the points:
$$SC = \frac{1}{n}\sum_{i=1}^{n} s_i\tag{17.27}$$
A value close to $+1$ indicates a good clustering.
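A per-point silhouette sketch (the function name silhouette is mine; scikit-learn's silhouette_score is a library alternative, if available):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette coefficients s_i (Eq. 17.26) and their
    mean SC (Eq. 17.27). Assumes every cluster has at least 2 points."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    W = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    ks = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        mu_in = W[i, own].sum() / (own.sum() - 1)   # d(x_i, x_i) = 0 drops out
        mu_out = min(W[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
    return s, s.mean()
```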
Hubert Statistic
The Hubert $\Gamma$ statistic [Eq. (17.14)], and its normalized version $\Gamma_n$ [Eq. (17.15)], can both be used as internal evaluation measures by letting $X = W$ be the pairwise distance matrix, and by defining $Y$ as the matrix of distances between the cluster means:
$$Y = \bigl\{\delta(\mu_{\hat{y}_i}, \mu_{\hat{y}_j})\bigr\}_{i,j=1}^{n}\tag{17.28}$$
Because both $W$ and $Y$ are symmetric, both $\Gamma$ and $\Gamma_n$ are computed over their upper triangular elements.
Example 17.6. Consider the two clusterings for the Iris principal components dataset shown in Figure 17.1, along with their corresponding graph representations in Figure 17.2. Let us evaluate these two clusterings using internal measures.
The good clustering shown in Figure 17.1a and Figure 17.2a has clusters with the following sizes: $n_1 = 61$, $n_2 = 50$, and $n_3 = 39$. Thus, the number of intracluster and intercluster edges (i.e., point pairs) is given as
$$N_{in} = \binom{61}{2} + \binom{50}{2} + \binom{39}{2} = 1830 + 1225 + 741 = 3796$$
$$N_{out} = 61\cdot 50 + 61\cdot 39 + 50\cdot 39 = 3050 + 2379 + 1950 = 7379$$
In total there are $N = N_{in} + N_{out} = 3796 + 7379 = 11175$ distinct point pairs.
The weights on edges within each cluster, $W(C_i, C_i)$, and those from one cluster to another, $W(C_i, C_j)$, are as given in the intercluster weight matrix:

    W        C_1        C_2        C_3
    C_1     3265.69   10402.30    4418.62
    C_2    10402.30    1523.10    9792.45
    C_3     4418.62    9792.45    1252.36        (17.29)
Thus, the sum of all the intracluster and intercluster edge weights is
$$W_{in} = \frac{1}{2}(3265.69 + 1523.10 + 1252.36) = 3020.57$$
$$W_{out} = 10402.30 + 4418.62 + 9792.45 = 24613.37$$
The BetaCV measure can then be computed as
$$BetaCV = \frac{N_{out}\cdot W_{in}}{N_{in}\cdot W_{out}} = \frac{7379\times 3020.57}{3796\times 24613.37} = 0.239$$
For the C-index, we first compute the sum of the $N_{in}$ smallest and largest pairwise distances, given as
$$W_{min}(N_{in}) = 2535.96 \qquad W_{max}(N_{in}) = 16889.57$$
Thus, the C-index is given as
$$Cindex = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})} = \frac{3020.57 - 2535.96}{16889.57 - 2535.96} = \frac{484.61}{14353.61} = 0.0338$$
For the normalized cut and modularity measures, we compute $W(C_i, \overline{C_i})$, $W(C_i, V) = \sum_{j=1}^{k} W(C_i, C_j)$, and $W(V, V) = \sum_{i=1}^{k} W(C_i, V)$, using the intercluster weight matrix [Eq. (17.29)]:
$$
\begin{aligned}
W(C_1, \overline{C_1}) &= 10402.30 + 4418.62 = 14820.91\\
W(C_2, \overline{C_2}) &= 10402.30 + 9792.45 = 20194.75\\
W(C_3, \overline{C_3}) &= 4418.62 + 9792.45 = 14211.07\\
W(C_1, V) &= 3265.69 + W(C_1, \overline{C_1}) = 18086.61\\
W(C_2, V) &= 1523.10 + W(C_2, \overline{C_2}) = 21717.85\\
W(C_3, V) &= 1252.36 + W(C_3, \overline{C_3}) = 15463.43\\
W(V, V) &= W(C_1, V) + W(C_2, V) + W(C_3, V) = 55267.89
\end{aligned}
$$
The normalized cut and modularity values are given as
$$NC = \frac{14820.91}{18086.61} + \frac{20194.75}{21717.85} + \frac{14211.07}{15463.43} = 0.819 + 0.93 + 0.919 = 2.67$$
$$
\begin{aligned}
Q &= \left(\frac{3265.69}{55267.89} - \left(\frac{18086.61}{55267.89}\right)^2\right) + \left(\frac{1523.10}{55267.89} - \left(\frac{21717.85}{55267.89}\right)^2\right)\\
&\quad + \left(\frac{1252.36}{55267.89} - \left(\frac{15463.43}{55267.89}\right)^2\right)\\
&= -0.048 - 0.1269 - 0.0556 = -0.2305
\end{aligned}
$$
The Dunn index can be computed from the minimum and maximum distances between pairs of points from two clusters $C_i$ and $C_j$, computed as follows:

    W_min    C_1     C_2     C_3        W_max    C_1     C_2     C_3
    C_1      0       1.62    0.198      C_1      2.50    4.85    4.81
    C_2      1.62    0       3.49       C_2      4.85    2.33    7.06
    C_3      0.198   3.49    0          C_3      4.81    7.06    2.55

The Dunn index value for the clustering is given as
$$Dunn = \frac{W_{out}^{min}}{W_{in}^{max}} = \frac{0.198}{2.55} = 0.078$$
To compute the Davies–Bouldin index, we compute the cluster mean and dispersion values:
$$\mu_1 = \begin{pmatrix}-0.664\\ -0.33\end{pmatrix}\qquad \mu_2 = \begin{pmatrix}2.64\\ 0.19\end{pmatrix}\qquad \mu_3 = \begin{pmatrix}-2.35\\ 0.27\end{pmatrix}$$
$$\sigma_{\mu_1} = 0.723 \qquad \sigma_{\mu_2} = 0.512 \qquad \sigma_{\mu_3} = 0.695$$
and the $DB_{ij}$ values for pairs of clusters:

    DB_ij    C_1      C_2      C_3
    C_1      –        0.369    0.794
    C_2      0.369    –        0.242
    C_3      0.794    0.242    –

For example, $DB_{12} = \frac{\sigma_{\mu_1} + \sigma_{\mu_2}}{\delta(\mu_1, \mu_2)} = \frac{1.235}{3.346} = 0.369$. Finally, the DB index is given as
$$DB = \frac{1}{3}(0.794 + 0.369 + 0.794) = 0.652$$
The silhouette coefficient [Eq. (17.26)] for a chosen point, say $x_1$, is given as
$$s_1 = \frac{1.902 - 0.701}{\max\{1.902,\ 0.701\}} = \frac{1.201}{1.902} = 0.632$$
The average value across all points is $SC = 0.598$.
The Hubert statistic can be computed by taking the dot product over the upper triangular elements of the proximity matrix $W$ [Eq. (17.22)] and the $n \times n$ matrix of distances among cluster means $Y$ [Eq. (17.28)], and then dividing by the number of distinct point pairs $N$:
$$\Gamma = \frac{w^T y}{N} = \frac{91545.85}{11175} = 8.19$$
where $w, y \in \mathbb{R}^N$ are vectors comprising the upper triangular elements of $W$ and $Y$. The normalized Hubert statistic can be obtained as the correlation between $w$ and $y$ [Eq. (17.15)]:
$$\Gamma_n = \frac{z_w^T z_y}{\|z_w\|\cdot\|z_y\|} = 0.918$$
where $z_w, z_y$ are the centered vectors corresponding to $w$ and $y$, respectively.
The following table summarizes the various internal measure values for the good and bad clusterings shown in Figure 17.1 and Figure 17.2.

                       Lower better              |        Higher better
              BetaCV  Cindex    Q      DB        |  NC    Dunn   SC    Γ     Γ_n
    (a) Good   0.24   0.034   -0.23   0.65       |  2.67  0.08   0.60  8.19  0.92
    (b) Bad    0.33   0.08    -0.20   1.11       |  2.56  0.03   0.55  7.32  0.83

Despite the fact that these internal measures do not have access to the ground-truth partitioning, we can observe that the good clustering has higher values for normalized cut, Dunn, silhouette coefficient, and the Hubert statistics, and lower values for BetaCV, C-index, modularity, and Davies–Bouldin measures. These measures are thus capable of discerning good versus bad clusterings of the data.
17.3 RELATIVE MEASURES

Relative measures are used to compare different clusterings obtained by varying different parameters for the same algorithm, for example, to choose the number of clusters $k$.
Silhouette Coefficient
The silhouette coefficient for each point $s_j$ [Eq. (17.26)], and the average SC value [Eq. (17.27)], can be used to estimate the number of clusters in the data. The approach consists of plotting the $s_j$ values in descending order for each cluster, and noting the overall $SC$ value for a particular value of $k$, as well as the clusterwise SC values:
$$SC_i = \frac{1}{n_i}\sum_{x_j\in C_i} s_j$$
We can then pick the value $k$ that yields the best clustering, with many points having high $s_j$ values within each cluster, as well as high values for $SC$ and $SC_i$ ($1 \le i \le k$).
[Figure 17.3. Iris K-means: silhouette coefficient plot. (a) k = 2, SC = 0.706, with SC_1 = 0.706 (n_1 = 97) and SC_2 = 0.662 (n_2 = 53). (b) k = 3, SC = 0.598, with SC_1 = 0.466 (n_1 = 61), SC_2 = 0.818 (n_2 = 50), and SC_3 = 0.52 (n_3 = 39). (c) k = 4, SC = 0.559, with SC_1 = 0.376 (n_1 = 49), SC_2 = 0.534 (n_2 = 28), SC_3 = 0.787 (n_3 = 50), and SC_4 = 0.484 (n_4 = 23).]
Example 17.7. Figure 17.3 shows the silhouette coefficient plot for the best clustering results for the K-means algorithm on the Iris principal components dataset for three different values of $k$, namely $k = 2, 3, 4$. The silhouette coefficient values $s_i$ for points
within each cluster are plotted in decreasing order. The overall average ($SC$) and clusterwise averages ($SC_i$, for $1 \le i \le k$) are also shown, along with the cluster sizes.
Figure 17.3a shows that $k = 2$ has the highest average silhouette coefficient, $SC = 0.706$. It shows two well separated clusters. The points in cluster $C_1$ start out with high $s_i$ values, which gradually drop as we get to border points. The second cluster $C_2$ is even better separated, since it has a higher silhouette coefficient and the pointwise scores are all high, except for the last three points, suggesting that almost all the points are well clustered.
The silhouette plot in Figure 17.3b, with $k = 3$, corresponds to the "good" clustering shown in Figure 17.1a. We can see that cluster $C_1$ from Figure 17.3a has been split into two clusters for $k = 3$, namely $C_1$ and $C_3$. Both of these have many bordering points, whereas $C_2$ is well separated with high silhouette coefficients across all points.
Finally, the silhouette plot for $k = 4$ is shown in Figure 17.3c. Here $C_3$ is the well separated cluster, corresponding to $C_2$ above, and the remaining clusters are essentially subclusters of $C_1$ for $k = 2$ (Figure 17.3a). Cluster $C_1$ also has two points with negative $s_i$ values, indicating that they are probably misclustered.
Because $k = 2$ yields the highest silhouette coefficient, and the two clusters are essentially well separated, in the absence of prior knowledge, we would choose $k = 2$ as the best number of clusters for this dataset.
Calinski–Harabasz Index
Given the dataset $D = \{x_i\}_{i=1}^{n}$, the scatter matrix for $D$ is given as
$$S = n\Sigma = \sum_{j=1}^{n}(x_j - \mu)(x_j - \mu)^T$$
where $\mu = \frac{1}{n}\sum_{j=1}^{n} x_j$ is the mean and $\Sigma$ is the covariance matrix. The scatter matrix can be decomposed into two matrices $S = S_W + S_B$, where $S_W$ is the within-cluster scatter matrix and $S_B$ is the between-cluster scatter matrix, given as
$$S_W = \sum_{i=1}^{k}\sum_{x_j\in C_i}(x_j - \mu_i)(x_j - \mu_i)^T \qquad S_B = \sum_{i=1}^{k} n_i(\mu_i - \mu)(\mu_i - \mu)^T$$
where $\mu_i = \frac{1}{n_i}\sum_{x_j\in C_i} x_j$ is the mean for cluster $C_i$.
The Calinski–Harabasz (CH) variance ratio criterion for a given value of $k$ is defined as follows:
$$CH(k) = \frac{tr(S_B)/(k-1)}{tr(S_W)/(n-k)} = \frac{n-k}{k-1}\cdot\frac{tr(S_B)}{tr(S_W)}$$
where $tr(S_W)$ and $tr(S_B)$ are the traces (the sum of the diagonal elements) of the within-cluster and between-cluster scatter matrices.
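A minimal sketch of $CH(k)$ from the scatter-matrix traces (the function name ch_index is mine; only the traces are needed, so the full matrices are never formed):

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz variance ratio criterion CH(k)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    n, k = len(X), len(ks)
    mu = X.mean(axis=0)
    tr_W = tr_B = 0.0
    for c in ks:
        Xi = X[labels == c]
        mu_i = Xi.mean(axis=0)
        tr_W += ((Xi - mu_i) ** 2).sum()            # tr(S_W) contribution
        tr_B += len(Xi) * ((mu_i - mu) ** 2).sum()  # tr(S_B) contribution
    return (n - k) / (k - 1) * tr_B / tr_W
```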
[Figure 17.4. Calinski–Harabasz variance ratio criterion.]

For a good value of $k$, we expect the within-cluster scatter to be smaller relative to the between-cluster scatter, which should result in a higher $CH(k)$ value. On the other
hand, we do not desire a very large value of $k$; thus the term $\frac{n-k}{k-1}$ penalizes larger values of $k$. We could choose a value of $k$ that maximizes $CH(k)$. Alternatively, we can plot the $CH$ values and look for a large increase in the value followed by little or no gain. For instance, we can choose the value $k > 3$ that minimizes the term
$$\Delta(k) = \bigl(CH(k+1) - CH(k)\bigr) - \bigl(CH(k) - CH(k-1)\bigr)$$
The intuition is that we want to find the value of $k$ for which $CH(k)$ is much higher than $CH(k-1)$ and there is only a little improvement or a decrease in the $CH(k+1)$ value.
Example 17.8. Figure 17.4 shows the CH ratio for various values of $k$ on the Iris principal components dataset, using the K-means algorithm, with the best results chosen from 200 runs.
For $k = 3$, the within-cluster and between-cluster scatter matrices are given as
$$S_W = \begin{pmatrix}39.14 & -13.62\\ -13.62 & 24.73\end{pmatrix}\qquad S_B = \begin{pmatrix}590.36 & 13.62\\ 13.62 & 11.36\end{pmatrix}$$
Thus, we have
$$CH(3) = \frac{(150-3)}{(3-1)}\cdot\frac{(590.36 + 11.36)}{(39.14 + 24.73)} = (147/2)\cdot\frac{601.72}{63.87} = 73.5\cdot 9.42 = 692.4$$
The successive $CH(k)$ and $\Delta(k)$ values are as follows:

    k        2        3        4        5        6        7        8        9
    CH(k)  570.25   692.40   717.79   683.14   708.26   700.17   738.05   728.63
    Δ(k)     –      −96.78   −60.03    59.78   −33.22    45.97   −47.30     –
If we choose the first large peak before a decrease we would choose $k = 4$. However, $\Delta(k)$ suggests $k = 3$ as the best (lowest) value, representing the "knee-of-the-curve". One limitation of the $\Delta(k)$ criterion is that values less than $k = 3$ cannot be evaluated, since $\Delta(2)$ depends on $CH(1)$, which is not defined.
Gap Statistic
The gap statistic compares the sum of intracluster weights $W_{in}$ [Eq. (17.23)] for different values of $k$ with their expected values assuming no apparent clustering structure, which forms the null hypothesis.
Let $C_k$ be the clustering obtained for a specified value of $k$, using a chosen clustering algorithm. Let $W_{in}^{k}(D)$ denote the sum of intracluster weights (over all clusters) for $C_k$ on the input dataset $D$. We would like to compute the probability of the observed $W_{in}^{k}$ value under the null hypothesis that the points are randomly placed in the same data space as $D$. Unfortunately, the sampling distribution of $W_{in}$ is not known. Further, it depends on the number of clusters $k$, the number of points $n$, and other characteristics of $D$.
To obtain an empirical distribution for $W_{in}$, we resort to Monte Carlo simulations of the sampling process. That is, we generate $t$ random samples comprising $n$ randomly distributed points within the same $d$-dimensional data space as the input dataset $D$. That is, for each dimension of $D$, say $X_j$, we compute its range $[\min(X_j), \max(X_j)]$ and generate values for the $n$ points (for the $j$th dimension) uniformly at random within the given range. Let $R_i \in \mathbb{R}^{n\times d}$, $1 \le i \le t$, denote the $i$th sample. Let $W_{in}^{k}(R_i)$ denote the sum of intracluster weights for a given clustering of $R_i$ into $k$ clusters. From each sample dataset $R_i$, we generate clusterings for different values of $k$ using the same algorithm and record the intracluster values $W_{in}^{k}(R_i)$. Let $\mu_W(k)$ and $\sigma_W(k)$ denote the mean and standard deviation of these intracluster weights for each value of $k$, given as
$$\mu_W(k) = \frac{1}{t}\sum_{i=1}^{t}\log W_{in}^{k}(R_i)$$
$$\sigma_W(k) = \sqrt{\frac{1}{t}\sum_{i=1}^{t}\Bigl(\log W_{in}^{k}(R_i) - \mu_W(k)\Bigr)^2}$$
where we use the logarithm of the $W_{in}$ values, as they can be quite large.
The gap statistic for a given $k$ is then defined as
$$gap(k) = \mu_W(k) - \log W_{in}^{k}(D)$$
It measures the deviation of the observed $W_{in}^{k}$ value from its expected value under the null hypothesis. We can select the value of $k$ that yields the largest gap statistic because that indicates a clustering structure far away from the uniform distribution of points. A more robust approach is to choose $k$ as follows:
$$k^* = \arg\min_k \bigl\{ gap(k) \ge gap(k+1) - \sigma_W(k+1) \bigr\}$$
That is, we select the least value of $k$ such that the gap statistic is within one standard deviation of the gap at $k+1$.
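The whole Monte Carlo procedure fits in a short sketch. The version below assumes numpy and scikit-learn's KMeans as the clustering algorithm, and uses base-2 logarithms as in Example 17.9; all function names are mine:

```python
import numpy as np
from sklearn.cluster import KMeans  # any clustering algorithm works here

def log2_W_in(X, labels):
    """log2 of the sum of intracluster weights W_in^k (Eq. 17.23)."""
    total = 0.0
    for c in np.unique(labels):
        Xi = X[labels == c]
        D = np.sqrt(((Xi[:, None, :] - Xi[None, :, :]) ** 2).sum(-1))
        total += D.sum() / 2           # each intracluster edge counted once
    return np.log2(total)

def gap_statistic(X, k_max=9, t=200, seed=0):
    """Monte Carlo gap statistic over k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sigmas = {}, {}
    for k in range(1, k_max + 1):
        obs = log2_W_in(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
        null = np.array([
            log2_W_in(R, KMeans(n_clusters=k, n_init=10).fit_predict(R))
            for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(t))])
        gaps[k], sigmas[k] = null.mean() - obs, null.std()
    return gaps, sigmas
```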
[Figure 17.5. Gap statistic. (a) Randomly generated data (k = 3). (b) Intracluster weights for different k: expected µ_W(k) versus observed log_2 W^k_in. (c) Gap statistic as a function of k.]
Example 17.9. To compute the gap statistic we have to generate $t$ random samples of $n$ points drawn from the same data space as the Iris principal components dataset. A random sample of $n = 150$ points is shown in Figure 17.5a, which does not have any apparent cluster structure. However, when we run K-means on this dataset it will output some clustering, an example of which is also shown, with $k = 3$. From this clustering, we can compute the $\log_2 W_{in}^{k}(R_i)$ value; we use base 2 for all logarithms.
For Monte Carlo sampling, we generate $t = 200$ such random datasets, and compute the mean or expected intracluster weight $\mu_W(k)$ under the null hypothesis, for each value of $k$. Figure 17.5b shows the expected intracluster weights for different values of $k$. It also shows the observed value of $\log_2 W_{in}^{k}$ computed from the K-means clustering of the Iris principal components dataset. For the Iris dataset, and each of the uniform random samples, we run K-means 100 times and select the best possible clustering, from which the $W_{in}^{k}(R_i)$ values are computed. We can see that the observed $W_{in}^{k}(D)$ values are smaller than the expected values $\mu_W(k)$.
Table 17.1. Gap statistic values as a function of k

    k    gap(k)   σ_W(k)   gap(k) − σ_W(k)
    1    0.093    0.0456   0.047
    2    0.346    0.0486   0.297
    3    0.679    0.0529   0.626
    4    0.753    0.0701   0.682
    5    0.586    0.0711   0.515
    6    0.715    0.0654   0.650
    7    0.808    0.0611   0.746
    8    0.680    0.0597   0.620
    9    0.632    0.0606   0.571
From these values, we then compute the gap statistic $gap(k)$ for different values of $k$, which are plotted in Figure 17.5c. Table 17.1 lists the gap statistic and standard deviation values. The optimal value for the number of clusters is $k = 4$ because
$$gap(4) = 0.753 > gap(5) - \sigma_W(5) = 0.515$$
However, if we had relaxed the gap test to be within two standard deviations, then the optimal value would have been $k = 3$ because
$$gap(3) = 0.679 > gap(4) - 2\sigma_W(4) = 0.753 - 2\cdot 0.0701 = 0.613$$
Essentially, there is still some subjectivity in selecting the right number of clusters, but the gap statistic plot can help in this task.
17.3.1 Cluster Stability

The main idea behind cluster stability is that the clusterings obtained from several datasets sampled from the same underlying distribution as $D$ should be similar or "stable." The cluster stability approach can be used to find good parameter values for a given clustering algorithm; we will focus on the task of finding a good value for $k$, the correct number of clusters.
The joint probability distribution for $D$ is typically unknown. Therefore, to sample a dataset from the same distribution we can try a variety of methods, including random perturbations, subsampling, or bootstrap resampling. Let us consider the bootstrapping approach; we generate $t$ samples of size $n$ by sampling from $D$ with replacement, which allows the same point to be chosen possibly multiple times, and thus each sample $D_i$ will be different. Next, for each sample $D_i$ we run the same clustering algorithm with different cluster values $k$ ranging from 2 to $k^{max}$.
Let $C_k(D_i)$ denote the clustering obtained from sample $D_i$, for a given value of $k$. Next, the method compares the distance between all pairs of clusterings $C_k(D_i)$ and $C_k(D_j)$ via some distance function. Several of the external cluster evaluation measures can be used as distance measures, by setting, for example, $C = C_k(D_i)$ and $T = C_k(D_j)$, or vice versa. From these values we compute the expected pairwise distance for each value of $k$. Finally, the value $k^*$ that exhibits the least deviation between the clusterings
ALGORITHM 17.1. Clustering Stability Algorithm for Choosing k

CLUSTERINGSTABILITY (A, t, k^max, D):
 1  n ← |D|
    // Generate t samples
 2  for i = 1, 2, ..., t do
 3      D_i ← sample n points from D with replacement
    // Generate clusterings for different values of k
 4  for i = 1, 2, ..., t do
 5      for k = 2, 3, ..., k^max do
 6          C_k(D_i) ← cluster D_i into k clusters using algorithm A
    // Compute mean difference between clusterings for each k
 7  foreach pair D_i, D_j with j > i do
 8      D_ij ← D_i ∩ D_j   // create common dataset using Eq. (17.30)
 9      for k = 2, 3, ..., k^max do
10          d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij)   // distance between clusterings
11  for k = 2, 3, ..., k^max do
12      µ_d(k) ← (2 / (t(t−1))) · Σ_{i=1}^{t} Σ_{j>i} d_ij(k)   // expected pairwise distance
    // Choose best k
13  k* ← argmin_k µ_d(k)
obtained from the resampled datasets is the best choice for $k$ because it exhibits the most stability.
There is, however, one complication when evaluating the distance between a pair of clusterings $C_k(D_i)$ and $C_k(D_j)$, namely that the underlying datasets $D_i$ and $D_j$ are different. That is, the set of points being clustered is different because each sample $D_i$ is different. Before computing the distance between the two clusterings, we have to restrict the clusterings only to the points common to both $D_i$ and $D_j$, denoted as $D_{ij}$. Because sampling with replacement allows multiple instances of the same point, we also have to account for this when creating $D_{ij}$. For each point $x_a$ in the input dataset $D$, let $m_a^i$ and $m_a^j$ denote the number of occurrences of $x_a$ in $D_i$ and $D_j$, respectively. Define
$$D_{ij} = D_i \cap D_j = \bigl\{ m_a \text{ instances of } x_a \mid x_a \in D,\ m_a = \min\{m_a^i, m_a^j\} \bigr\}\tag{17.30}$$
That is, the common dataset $D_{ij}$ is created by selecting the minimum number of instances of the point $x_a$ in $D_i$ or $D_j$.
Algorithm 17.1 shows the pseudo-code for the clustering stability method for choosing the best $k$ value. It takes as input the clustering algorithm $A$, the number of samples $t$, the maximum number of clusters $k^{max}$, and the input dataset $D$.
[Figure 17.6. Clustering stability: Iris dataset. Expected pairwise similarity µ_s(k) for FM and expected pairwise distance µ_d(k) for VI, as functions of k.]
It first generates the $t$ bootstrap samples and clusters them using algorithm $A$. Next, it computes the distance between the clusterings for each pair of datasets $D_i$ and $D_j$, for each value of $k$. Finally, the method computes the expected pairwise distance $\mu_d(k)$ in line 12. We assume that the clustering distance function $d$ is symmetric. If $d$ is not symmetric, then the expected difference should be computed over all ordered pairs, that is, $\mu_d(k) = \frac{1}{t(t-1)}\sum_{i=1}^{t}\sum_{j\neq i} d_{ij}(k)$.
Instead of a distance function $d$, we can also evaluate clustering stability via a similarity measure, in which case, after computing the average similarity between pairs of clusterings for a given $k$, we can choose the best value $k^*$ as the one that maximizes the expected similarity $\mu_s(k)$. In general, those external measures that yield lower values for better agreement between $C_k(D_i)$ and $C_k(D_j)$ can be used as distance functions, whereas those that yield higher values for better agreement can be used as similarity functions. Examples of distance functions include normalized mutual information, variation of information, and conditional entropy (which is asymmetric). Examples of similarity functions include Jaccard, Fowlkes–Mallows, the Hubert $\Gamma$ statistic, and so on.
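The sketch below illustrates the stability idea with a deliberate simplification: instead of restricting each pair of clusterings to the common dataset $D_{ij}$ of Eq. (17.30), it extends each bootstrap clustering to all of $D$ by nearest-centroid assignment and then compares full label vectors with the Fowlkes–Mallows similarity. K-means and scikit-learn are assumptions, and the function name is mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def stability_best_k(X, k_max=9, t=50, seed=0):
    """Simplified bootstrap clustering stability (cf. Algorithm 17.1):
    pick the k that maximizes the expected pairwise FM similarity."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = {}
    for k in range(2, k_max + 1):
        labelings = []
        for _ in range(t):
            boot = X[rng.integers(0, n, size=n)]      # bootstrap sample D_i
            km = KMeans(n_clusters=k, n_init=10).fit(boot)
            labelings.append(km.predict(X))           # extend C_k(D_i) to all of D
        sims = [fowlkes_mallows_score(labelings[i], labelings[j])
                for i in range(t) for j in range(i + 1, t)]
        scores[k] = np.mean(sims)                     # expected similarity mu_s(k)
    return max(scores, key=scores.get), scores
```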
Example 17.10. We study the clustering stability for the Iris principal components dataset, with $n = 150$, using the K-means algorithm. We use $t = 500$ bootstrap samples. For each dataset $D_i$, and each value of $k$, we run K-means with 100 initial starting configurations, and select the best clustering.
For the distance function, we used the variation of information [Eq. (17.5)] between each pair of clusterings. We also used the Fowlkes–Mallows measure [Eq. (17.13)] as an example of a similarity measure. The expected values of the pairwise distance $\mu_d(k)$ for the VI measure, and the pairwise similarity $\mu_s(k)$ for the FM measure, are plotted in Figure 17.6. Both measures indicate that $k = 2$ is the best value, as for the VI measure this leads to the least expected distance between pairs of clusterings, and for the FM measure this choice leads to the most expected similarity between clusterings.
17.3.2 Clustering Tendency

Clustering tendency or clusterability aims to determine whether the dataset $D$ has any meaningful groups to begin with. This is usually a hard task given the different definitions of what it means to be a cluster, for example, partitional, hierarchical, density-based, graph-based, and so on. Even if we fix the cluster type, it is still a hard task to define the appropriate null model (e.g., the one without any clustering structure) for a given dataset $D$. Furthermore, if we do determine that the data is clusterable, then we are still faced with the question of how many clusters there are. Nevertheless, it is still worthwhile to assess the clusterability of a dataset; we look at some approaches to answer the question whether the data is clusterable or not.
Spatial Histogram
One simple approach is to contrast the $d$-dimensional spatial histogram of the input dataset $D$ with the histogram from samples generated randomly in the same data space. Let $X_1, X_2, \ldots, X_d$ denote the $d$ dimensions. Given $b$, the number of bins for each dimension, we divide each dimension $X_j$ into $b$ equi-width bins, and simply count how many points lie in each of the $b^d$ $d$-dimensional cells. From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset $D$, which is an approximation of the unknown joint probability density function. The EPMF is given as
$$f(\mathbf{i}) = P(x_j \in \text{cell } \mathbf{i}) = \frac{\bigl|\{x_j \in \text{cell } \mathbf{i}\}\bigr|}{n}$$
where $\mathbf{i} = (i_1, i_2, \ldots, i_d)$ denotes a cell index, with $i_j$ denoting the bin index along dimension $X_j$.
Next, we generate $t$ random samples, each comprising $n$ points within the same $d$-dimensional space as the input dataset $D$. That is, for each dimension $X_j$, we compute its range $[\min(X_j), \max(X_j)]$, and generate values uniformly at random within the given range. Let $R_j$ denote the $j$th such random sample. We can then compute the corresponding EPMF $g_j(\mathbf{i})$ for each $R_j$, $1 \le j \le t$.
Finally, we can compute how much the distribution $f$ differs from $g_j$ (for $j = 1, \ldots, t$), using the Kullback–Leibler (KL) divergence from $f$ to $g_j$, defined as
$$KL(f \| g_j) = \sum_{\mathbf{i}} f(\mathbf{i})\log\left(\frac{f(\mathbf{i})}{g_j(\mathbf{i})}\right)\tag{17.31}$$
The KL divergence is zero only when $f$ and $g_j$ are the same distributions. Using these divergence values, we can compute how much the dataset $D$ differs from a random dataset.
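A sketch of the spatial-histogram test (function names mine; numpy assumed). One practical assumption: cells where either EPMF is zero are skipped when evaluating Eq. (17.31), to avoid division by zero.

```python
import numpy as np

def spatial_epmf(X, b, lo, hi):
    """EPMF over a b^d grid of equi-width cells spanning [lo, hi]."""
    # map each coordinate to a bin index in 0..b-1
    bins = np.clip(((X - lo) / (hi - lo) * b).astype(int), 0, b - 1)
    flat = np.ravel_multi_index(bins.T, (b,) * X.shape[1])
    return np.bincount(flat, minlength=b ** X.shape[1]) / len(X)

def spatial_histogram_test(X, b=5, t=500, seed=0):
    """KL divergences (Eq. 17.31, base 2) between the EPMF of D and
    those of t uniform random datasets in the same data space."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    f = spatial_epmf(X, b, lo, hi)
    kls = []
    for _ in range(t):
        g = spatial_epmf(rng.uniform(lo, hi, size=X.shape), b, lo, hi)
        mask = (f > 0) & (g > 0)                # skip zero cells (assumption)
        kls.append(np.sum(f[mask] * np.log2(f[mask] / g[mask])))
    return np.mean(kls), np.std(kls)
```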
The main limitation of this approach is that as dimensionality increases, the number of cells ($b^d$) increases exponentially, and with a fixed sample size $n$, most of the cells will be empty, or will have only one point, making it hard to estimate the divergence. The method is also sensitive to the choice of parameter $b$.
Instead of histograms, and the corresponding EPMF, we can also use density estimation methods (see Section 15.2) to determine the joint probability density function (PDF) for the dataset $D$, and see how it differs from the PDF for the random datasets. However, the curse of dimensionality also causes problems for density estimation.

[Figure 17.7. Iris dataset: spatial histogram. (a) Iris: spatial cells. (b) Uniform: spatial cells. (c) Empirical probability mass function. (d) KL-divergence distribution.]
Example 17.11. Figure 17.7c shows the empirical joint probability mass function for the Iris principal components dataset that has $n = 150$ points in $d = 2$ dimensions. It also shows the EPMF for one of the datasets generated uniformly at random in the same data space. Both EPMFs were computed using $b = 5$ bins in each dimension, for a total of 25 spatial cells. The spatial grids/cells for the Iris dataset $D$, and the random sample $R$, are shown in Figures 17.7a and 17.7b, respectively. The cells are numbered starting from 0, from bottom to top, and then left to right. Thus, the bottom left cell is 0, top left is 4, bottom right is 19, and top right is 24. These indices are used along the $x$-axis in the EPMF plot in Figure 17.7c.
We generated $t = 500$ random samples from the null distribution, and computed the KL divergence from $f$ to $g_j$ for each $1 \le j \le t$ (using logarithm with base 2). The distribution of the KL values is plotted in Figure 17.7d. The mean KL value was $\mu_{KL} = 1.17$, with a standard deviation of $\sigma_{KL} = 0.18$, indicating that the Iris data is indeed far from the randomly generated data, and thus is clusterable.
Distance Distribution
Instead of trying to estimate the density, another approach to determine clusterability is to compare the pairwise point distances from $D$ with those from the randomly generated samples $R_i$ from the null distribution. That is, we create the EPMF from the proximity matrix $W$ for $D$ [Eq. (17.22)] by binning the distances into $b$ bins:
$$f(i) = P(w_{pq} \in \text{bin } i \mid x_p, x_q \in D,\ p < q) = \frac{\bigl|\{w_{pq} \in \text{bin } i\}\bigr|}{n(n-1)/2}$$
Likewise, for each of the samples $R_j$, we can determine the EPMF for the pairwise distances, denoted $g_j$. Finally, we can compute the KL divergences between $f$ and $g_j$ using Eq. (17.31). The expected divergence indicates the extent to which $D$ differs from the null (random) distribution.
Example 17.12. Figure 17.8a shows the distance distribution for the Iris principal components dataset $D$ and the random sample $R_j$ from Figure 17.7b. The distance distribution is obtained by binning the edge weights between all pairs of points using $b = 25$ bins.
We then compute the KL divergence from $D$ to each $R_j$, over $t = 500$ samples. The distribution of the KL divergences (using logarithm with base 2) is shown in Figure 17.8b. The mean divergence is $\mu_{KL} = 0.18$, with standard deviation $\sigma_{KL} = 0.017$. Even though the Iris dataset has a good clustering tendency, the KL divergence is not very large. We conclude that, at least for the Iris dataset, the distance distribution is not as discriminative as the spatial histogram approach for clusterability analysis.
[Figure 17.8. Iris dataset: distance distribution. (a) Pairwise distance EPMFs for Iris (f) and a uniform sample (g_j). (b) KL-divergence distribution.]
Hopkins Statistic
The Hopkins statistic is a sparse sampling test for spatial randomness. Given a dataset $D$ comprising $n$ points, we generate $t$ random subsamples $R_i$ of $m$ points each, where $m \ll n$. These samples are drawn from the same data space as $D$, generated uniformly at random along each dimension. Further, we also generate $t$ subsamples of $m$ points directly from $D$, using sampling without replacement. Let $D_i$ denote the $i$th direct subsample. Next, we compute the minimum distance between each point $x_j \in D_i$ and points in $D$:
$$\delta_{min}(x_j) = \min_{x_i\in D,\ x_i\neq x_j}\bigl\{\delta(x_j, x_i)\bigr\}$$
Likewise, we compute the minimum distance $\delta_{min}(y_j)$ between a point $y_j \in R_i$ and points in $D$.
The Hopkins statistic (in $d$ dimensions) for the $i$th pair of samples $R_i$ and $D_i$ is then defined as
$$HS_i = \frac{\sum_{y_j\in R_i}\bigl(\delta_{min}(y_j)\bigr)^d}{\sum_{y_j\in R_i}\bigl(\delta_{min}(y_j)\bigr)^d + \sum_{x_j\in D_i}\bigl(\delta_{min}(x_j)\bigr)^d}$$
[Figure 17.9. Iris dataset: Hopkins statistic distribution.]
This statistic compares the nearest-neighbor distribution of randomly generated points to the same distribution for random subsets of points from $D$. If the data is well clustered we expect $\delta_{min}(x_j)$ values to be smaller compared to the $\delta_{min}(y_j)$ values, and in this case $HS_i$ tends to 1. If both nearest-neighbor distances are similar, then $HS_i$ takes on values close to 0.5, which indicates that the data is essentially random, and there is no apparent clustering. Finally, if $\delta_{min}(x_j)$ values are larger compared to $\delta_{min}(y_j)$ values, then $HS_i$ tends to 0, and it indicates point repulsion, with no clustering. From the $t$ different values of $HS_i$ we may then compute the mean and variance of the statistic to determine whether $D$ is clusterable or not.
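A direct sketch of the Hopkins statistic (the function name hopkins is mine; numpy assumed), returning the mean and standard deviation over the t sample pairs:

```python
import numpy as np

def hopkins(X, m=30, t=500, seed=0):
    """Hopkins statistic HS_i over t sample pairs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    stats = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=(m, d))         # uniform subsample R_i
        Didx = rng.choice(n, size=m, replace=False)  # direct subsample D_i
        # min distance from each y_j in R_i to points of D
        dR = np.sqrt(((R[:, None, :] - X[None, :, :]) ** 2).sum(-1)).min(axis=1)
        # min distance from each x_j in D_i to other points of D
        dD = np.sqrt(((X[Didx][:, None, :] - X[None, :, :]) ** 2).sum(-1))
        dD[np.arange(m), Didx] = np.inf              # exclude x_i = x_j
        dD = dD.min(axis=1)
        stats.append((dR ** d).sum() / ((dR ** d).sum() + (dD ** d).sum()))
    return np.mean(stats), np.std(stats)
```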
Example 17.13. Figure 17.9 plots the distribution of the Hopkins statistic values over $t = 500$ pairs of samples: $R_j$ generated uniformly at random, and $D_j$ subsampled from the input dataset $D$. The subsample size was set as $m = 30$, using 20% of the points in $D$, that is, the Iris principal components dataset, which has $n = 150$ points in $d = 2$ dimensions. The mean of the Hopkins statistic is $\mu_{HS} = 0.935$, with a standard deviation of $\sigma_{HS} = 0.025$. Given the high value of the statistic, we conclude that the Iris dataset has a good clustering tendency.
17.4 FURTHER READING

For an excellent introduction to clustering validation see Jain and Dubes (1988); the book describes many of the external, internal, and relative measures discussed in this chapter, including clustering tendency. Other good reviews appear in Halkidi, Batistakis, and Vazirgiannis (2001) and Theodoridis and Koutroumbas (2008). For recent work on formal properties for comparing clusterings via external measures see Amigó et al. (2009) and Meilă (2007). For the silhouette plot see Rousseeuw (1987), and for the gap statistic see Tibshirani, Walther, and Hastie (2001). For an overview of cluster stability methods see Luxburg (2009). A recent review of clusterability appears in Ackerman and Ben-David (2009). Overall reviews of clustering methods appear in Xu and Wunsch (2005) and Jain, Murty, and Flynn (1999). See Kriegel, Kröger, and Zimek (2009) for a review of subspace clustering methods.
Ackerman, M. and Ben-David, S. (2009). "Clusterability: A theoretical study." In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.
Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. (2009). "A comparison of extrinsic clustering evaluation metrics based on formal constraints." Information Retrieval, 12(4): 461–486.
Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). "On clustering validation techniques." Journal of Intelligent Information Systems, 17(2–3): 107–145.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). "Data clustering: A review." ACM Computing Surveys, 31(3): 264–323.
Kriegel, H.-P., Kröger, P., and Zimek, A. (2009). "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." ACM Transactions on Knowledge Discovery from Data, 3(1): 1.
Luxburg, U. von (2009). "Clustering stability: An overview." Foundations and Trends in Machine Learning, 2(3): 235–274.
Meilă, M. (2007). "Comparing clusterings – an information based distance." Journal of Multivariate Analysis, 98(5): 873–895.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis." Journal of Computational and Applied Mathematics, 20: 53–65.
Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition, 4th ed. San Diego: Academic Press.
Tibshirani, R., Walther, G., and Hastie, T. (2001). "Estimating the number of clusters in a dataset via the gap statistic." Journal of the Royal Statistical Society B, 63: 411–423.
Xu, R. and Wunsch, D. (2005). "Survey of clustering algorithms." IEEE Transactions on Neural Networks, 16(3): 645–678.
17.5 EXERCISES

Q1. Prove that the maximum value of the entropy measure in Eq. (17.2) is log k.

Q2. Show that if C and T are independent of each other then H(T|C) = H(T), and further that H(C,T) = H(C) + H(T).

Q3. Show that H(T|C) = 0 if and only if T is completely determined by C.

Q4. Show that I(C,T) = H(C) + H(T) − H(T,C).

Q5. Show that the variation of information is 0 only when C and T are identical.
Q6. Prove that the maximum value of the normalized discretized Hubert statistic in Eq. (17.21) is obtained when FN = FP = 0, and the minimum value is obtained when TP = TN = 0.
Q7. Show that the Fowlkes–Mallows measure can be considered as the correlation between the pairwise indicator matrices for C and T, respectively. Define C(i,j) = 1 if x_i and x_j (with i ≠ j) are in the same cluster, and 0 otherwise. Define T similarly for the ground-truth partitions. Define ⟨C,T⟩ = Σ_{i,j=1}^n C_ij T_ij. Show that

$$FM = \frac{\langle C, T\rangle}{\sqrt{\langle T, T\rangle\,\langle C, C\rangle}}$$
Q8. Show that the silhouette coefficient of a point lies in the interval [−1, +1].
Q9. Show that the scatter matrix can be decomposed as S = S_W + S_B, where S_W and S_B are the within-cluster and between-cluster scatter matrices.
Figure 17.10. Data for Q10: points labeled a through k, plotted on a grid ranging from 1 to 9 in each dimension.
Q10. Consider the dataset in Figure 17.10. Compute the silhouette coefficient for the point labeled c.

Q11. Describe how one may apply the gap statistic methodology for determining the parameters of density-based clustering algorithms, such as DBSCAN and DENCLUE (see Chapter 15).
PART FOUR
CLASSIFICATION
CHAPTER 18
Probabilistic Classification
Classification refers to the task of predicting a class label for a given unlabeled point. In this chapter we consider three examples of the probabilistic classification approach. The (full) Bayes classifier uses the Bayes theorem to predict the class as the one that maximizes the posterior probability. The main task is to estimate the joint probability density function for each class, which is modeled via a multivariate normal distribution. The naive Bayes classifier assumes that attributes are independent, but it is still surprisingly powerful for many applications. We also describe the nearest neighbors classifier, which uses a non-parametric approach to estimate the density.
18.1 BAYES CLASSIFIER

Let the training dataset D consist of n points x_i in a d-dimensional space, and let y_i denote the class for each point, with y_i ∈ {c_1, c_2, ..., c_k}. The Bayes classifier directly uses the Bayes theorem to predict the class for a new test instance, x. It estimates the posterior probability P(c_i|x) for each class c_i, and chooses the class that has the largest probability. The predicted class for x is given as

$$\hat{y} = \arg\max_{c_i} \{P(c_i|\mathbf{x})\} \tag{18.1}$$
The Bayes theorem allows us to invert the posterior probability in terms of the likelihood and prior probability, as follows:

$$P(c_i|\mathbf{x}) = \frac{P(\mathbf{x}|c_i) \cdot P(c_i)}{P(\mathbf{x})}$$

where P(x|c_i) is the likelihood, defined as the probability of observing x assuming that the true class is c_i, P(c_i) is the prior probability of class c_i, and P(x) is the probability of observing x from any of the k classes, given as

$$P(\mathbf{x}) = \sum_{j=1}^{k} P(\mathbf{x}|c_j) \cdot P(c_j)$$
Because P(x) is fixed for a given point, Bayes rule [Eq. (18.1)] can be rewritten as

$$\hat{y} = \arg\max_{c_i} \{P(c_i|\mathbf{x})\} = \arg\max_{c_i} \left\{\frac{P(\mathbf{x}|c_i)P(c_i)}{P(\mathbf{x})}\right\} = \arg\max_{c_i} \{P(\mathbf{x}|c_i)P(c_i)\} \tag{18.2}$$

In other words, the predicted class essentially depends on the likelihood of that class, taking its prior probability into account.
18.1.1 Estimating the Prior Probability

To classify points, we have to estimate the likelihood and prior probabilities directly from the training dataset D. Let D_i denote the subset of points in D that are labeled with class c_i:

$$D_i = \{\mathbf{x}_j \in D \mid \mathbf{x}_j \text{ has class } y_j = c_i\}$$

Let the size of the dataset D be given as |D| = n, and let the size of each class-specific subset D_i be given as |D_i| = n_i. The prior probability for class c_i can be estimated as follows:

$$\hat{P}(c_i) = \frac{n_i}{n}$$
18.1.2 Estimating the Likelihood

To estimate the likelihood P(x|c_i), we have to estimate the joint probability of x across all the d dimensions, that is, we have to estimate P(x = (x_1, x_2, ..., x_d) | c_i).
Numeric Attributes
Assuming all dimensions are numeric, we can estimate the joint probability of x via either a nonparametric or a parametric approach. We consider the non-parametric approach in Section 18.3.

In the parametric approach we typically assume that each class c_i is normally distributed around some mean µ_i with a corresponding covariance matrix Σ_i, both of which are estimated from D_i. For class c_i, the probability density at x is thus given as

$$f_i(\mathbf{x}) = f(\mathbf{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{|\boldsymbol{\Sigma}_i|}} \exp\left\{-\frac{(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)}{2}\right\} \tag{18.3}$$

Because c_i is characterized by a continuous distribution, the probability of any given point must be zero, i.e., P(x|c_i) = 0. However, we can compute the likelihood by considering a small interval ε > 0 centered at x:

$$P(\mathbf{x}|c_i) = 2\epsilon \cdot f_i(\mathbf{x})$$
The posterior probability is then given as

$$P(c_i|\mathbf{x}) = \frac{2\epsilon \cdot f_i(\mathbf{x})P(c_i)}{\sum_{j=1}^{k} 2\epsilon \cdot f_j(\mathbf{x})P(c_j)} = \frac{f_i(\mathbf{x})P(c_i)}{\sum_{j=1}^{k} f_j(\mathbf{x})P(c_j)} \tag{18.4}$$

Further, because Σ_{j=1}^k f_j(x)P(c_j) remains fixed for x, we can predict the class for x by modifying Eq. (18.2) as follows:

$$\hat{y} = \arg\max_{c_i} \{f_i(\mathbf{x})P(c_i)\}$$
To classify a numeric test point x, the Bayes classifier estimates the parameters via the sample mean and sample covariance matrix. The sample mean for the class c_i can be estimated as

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{\mathbf{x}_j \in D_i} \mathbf{x}_j$$

and the sample covariance matrix for each class can be estimated using Eq. (2.30), as follows:

$$\hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i} \mathbf{Z}_i^T \mathbf{Z}_i$$

where Z_i is the centered data matrix for class c_i given as Z_i = D_i − 1 · µ̂_i^T. These values can be used to estimate the probability density in Eq. (18.3) as f̂_i(x) = f(x | µ̂_i, Σ̂_i).
Algorithm 18.1 shows the pseudo-code for the Bayes classifier. Given an input dataset D, the method estimates the prior probability, mean, and covariance matrix for each class. For testing, given a test point x, it simply returns the class with the maximum posterior probability. The cost of training is dominated by the covariance matrix computation step, which takes O(nd^2) time.
ALGORITHM 18.1. Bayes Classifier

BAYESCLASSIFIER(D = {(x_j, y_j)}_{j=1}^n):
1  for i = 1, ..., k do
2      D_i ← {x_j | y_j = c_i, j = 1, ..., n}  // class-specific subsets
3      n_i ← |D_i|  // cardinality
4      P̂(c_i) ← n_i / n  // prior probability
5      µ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j  // mean
6      Z_i ← D_i − 1_{n_i} · µ̂_i^T  // centered data
7      Σ̂_i ← (1/n_i) Z_i^T Z_i  // covariance matrix
8  return P̂(c_i), µ̂_i, Σ̂_i, for all i = 1, ..., k

TESTING(x and P̂(c_i), µ̂_i, Σ̂_i, for all i ∈ [1, k]):
9  ŷ ← argmax_{c_i} {f(x | µ̂_i, Σ̂_i) · P̂(c_i)}
10 return ŷ
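As a concrete illustration of Algorithm 18.1, here is a minimal sketch in Python, using NumPy for the estimates and SciPy's multivariate normal density for f(x | µ̂_i, Σ̂_i); the function names and data layout are assumptions, not the book's code.

import numpy as np
from scipy.stats import multivariate_normal

def bayes_fit(D, y):
    """Estimate prior, mean, and covariance per class (Algorithm 18.1, training)."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Dc = D[y == c]                 # class-specific subset D_i
        prior = len(Dc) / n            # P(c_i) = n_i / n
        mu = Dc.mean(axis=0)           # sample mean
        Z = Dc - mu                    # centered data
        cov = (Z.T @ Z) / len(Dc)      # sample covariance (1/n_i) Z^T Z
        params[c] = (prior, mu, cov)
    return params

def bayes_predict(x, params):
    """Return argmax_i f(x | mu_i, Sigma_i) * P(c_i); assumes nonsingular covariances."""
    def score(c):
        prior, mu, cov = params[c]
        return multivariate_normal.pdf(x, mean=mu, cov=cov) * prior
    return max(params, key=score)

For instance, with the Iris sample of Example 18.1 below, bayes_predict(np.array([6.75, 4.25]), bayes_fit(D, y)) would reproduce the prediction ŷ = c_2.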
Figure 18.1. Iris data: X_1 (sepal length) versus X_2 (sepal width). The class means are shown in black; the density contours are also shown. The square represents the test point x = (6.75, 4.25)^T.
Example 18.1. Consider the 2-dimensional Iris data, with attributes sepal length and sepal width, shown in Figure 18.1. Class c_1, which corresponds to iris-setosa (shown as circles), has n_1 = 50 points, whereas the other class c_2 (shown as triangles) has n_2 = 100 points. The prior probabilities for the two classes are

$$\hat{P}(c_1) = \frac{n_1}{n} = \frac{50}{150} = 0.33 \qquad \hat{P}(c_2) = \frac{n_2}{n} = \frac{100}{150} = 0.67$$

The means for c_1 and c_2 (shown as a black circle and triangle) are given as

$$\hat{\boldsymbol{\mu}}_1 = \begin{pmatrix} 5.01\\ 3.42 \end{pmatrix} \qquad \hat{\boldsymbol{\mu}}_2 = \begin{pmatrix} 6.26\\ 2.87 \end{pmatrix}$$

and the corresponding covariance matrices are as follows:

$$\hat{\boldsymbol{\Sigma}}_1 = \begin{pmatrix} 0.122 & 0.098\\ 0.098 & 0.142 \end{pmatrix} \qquad \hat{\boldsymbol{\Sigma}}_2 = \begin{pmatrix} 0.435 & 0.121\\ 0.121 & 0.110 \end{pmatrix}$$

Figure 18.1 shows the contour or level curve (corresponding to 1% of the peak density) for the multivariate normal distribution modeling the probability density for both classes.

Let x = (6.75, 4.25)^T be a test point (shown as a white square). The posterior probabilities for c_1 and c_2 can be computed using Eq. (18.4):

$$\hat{P}(c_1|\mathbf{x}) \propto \hat{f}(\mathbf{x}|\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1)\hat{P}(c_1) = (4.914 \times 10^{-7}) \times 0.33 = 1.622 \times 10^{-7}$$
$$\hat{P}(c_2|\mathbf{x}) \propto \hat{f}(\mathbf{x}|\hat{\boldsymbol{\mu}}_2, \hat{\boldsymbol{\Sigma}}_2)\hat{P}(c_2) = (2.589 \times 10^{-5}) \times 0.67 = 1.735 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the class for x is predicted as ŷ = c_2.
Categorical Attributes
If the attributes are categorical, the likelihood can be computed using the categorical data modeling approach presented in Chapter 3. Formally, let X_j be a categorical attribute over the domain dom(X_j) = {a_{j1}, a_{j2}, ..., a_{jm_j}}, that is, attribute X_j can take on m_j distinct categorical values. Each categorical attribute X_j is modeled as an m_j-dimensional multivariate Bernoulli random variable X_j that takes on m_j distinct vector values e_{j1}, e_{j2}, ..., e_{jm_j}, where e_{jr} is the rth standard basis vector in R^{m_j} and corresponds to the rth value or symbol a_{jr} ∈ dom(X_j). The entire d-dimensional dataset is modeled as the vector random variable X = (X_1, X_2, ..., X_d)^T. Let d' = Σ_{j=1}^d m_j; a categorical point x = (x_1, x_2, ..., x_d)^T is therefore represented as the d'-dimensional binary vector

$$\mathbf{v} = \begin{pmatrix} \mathbf{v}_1\\ \vdots\\ \mathbf{v}_d \end{pmatrix} = \begin{pmatrix} \mathbf{e}_{1r_1}\\ \vdots\\ \mathbf{e}_{dr_d} \end{pmatrix}$$

where v_j = e_{jr_j} provided x_j = a_{jr_j} is the r_jth value in the domain of X_j. The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:

$$P(\mathbf{x}|c_i) = f(\mathbf{v}|c_i) = f\left(\mathbf{X}_1 = \mathbf{e}_{1r_1}, \ldots, \mathbf{X}_d = \mathbf{e}_{dr_d} \mid c_i\right) \tag{18.5}$$
The above joint PMF can be estimated directly from the data D_i for each class c_i as follows:

$$\hat{f}(\mathbf{v}|c_i) = \frac{n_i(\mathbf{v})}{n_i}$$

where n_i(v) is the number of times the value v occurs in class c_i. Unfortunately, if the probability mass at the point v is zero for one or both classes, it would lead to a zero value for the posterior probability. To avoid zero probabilities, one approach is to introduce a small prior probability for all the possible values of the vector random variable X. One simple approach is to assume a pseudo-count of 1 for each value, that is, to assume that each value of X occurs at least one time, and to augment this base count of 1 with the actual number of occurrences of the observed value v in class c_i. The adjusted probability mass at v is then given as

$$\hat{f}(\mathbf{v}|c_i) = \frac{n_i(\mathbf{v}) + 1}{n_i + \prod_{j=1}^{d} m_j} \tag{18.6}$$
where Π_{j=1}^d m_j gives the number of possible values of X. Extending the code in Algorithm 18.1 to incorporate categorical attributes is relatively straightforward; all that is required is to compute the joint PMF for each class using Eq. (18.6).
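For illustration, a small Python sketch of the pseudo-count estimate of Eq. (18.6) follows; the helper name and its arguments are hypothetical.

from collections import Counter

def smoothed_joint_pmf(class_points, v, num_joint_values):
    """Pseudo-count estimate of Eq. (18.6): (n_i(v) + 1) / (n_i + prod_j m_j).
    class_points: list of categorical value tuples for class c_i;
    num_joint_values: the product prod_j m_j of domain sizes."""
    counts = Counter(class_points)  # n_i(v) for each observed tuple v
    return (counts[v] + 1) / (len(class_points) + num_joint_values)

For example, with the 50 class-c_1 tuples of the discretized Iris data and 12 possible joint values, smoothed_joint_pmf(points_c1, ('Long', 'Long'), 12) would reproduce the estimate (0 + 1)/(50 + 12) used below.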
probability mass at v is zero for both classes. We adjust the PMF via pseudo-counts [Eq. (18.6)]; note that the number of possible values is m_1 × m_2 = 4 × 3 = 12. The likelihood and prior probability can then be computed as

$$\hat{P}(\mathbf{x}|c_1) = \hat{f}(\mathbf{v}|c_1) = \frac{0+1}{50+12} = 1.61 \times 10^{-2}$$
$$\hat{P}(\mathbf{x}|c_2) = \hat{f}(\mathbf{v}|c_2) = \frac{0+1}{100+12} = 8.93 \times 10^{-3}$$
$$\hat{P}(c_1|\mathbf{x}) \propto (1.61 \times 10^{-2}) \times 0.33 = 5.32 \times 10^{-3}$$
$$\hat{P}(c_2|\mathbf{x}) \propto (8.93 \times 10^{-3}) \times 0.67 = 5.98 \times 10^{-3}$$

Thus, the predicted class is ŷ = c_2.
Challenges
The main problem with the Bayes classifier is the lack of enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data. For instance, for numeric attributes we have to estimate O(d^2) covariances, and as the dimensionality increases, this requires us to estimate too many parameters. For categorical attributes we have to estimate the joint probability for all the possible values of v, given as Π_j |dom(X_j)|. Even if each categorical attribute has only two values, we would need to estimate the probability for 2^d values. However, because there can be at most n distinct values for v, most of the counts will be zero. To address some of these concerns we can use a reduced set of parameters in practice, as described next.
18.2 NAIVE BAYES CLASSIFIER

We saw earlier that the full Bayes approach is fraught with estimation-related problems, especially with a large number of dimensions. The naive Bayes approach makes the simple assumption that all the attributes are independent. This leads to a much simpler, though surprisingly effective, classifier in practice. The independence assumption immediately implies that the likelihood can be decomposed into a product of dimension-wise probabilities:

$$P(\mathbf{x}|c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^{d} P(x_j|c_i) \tag{18.7}$$
Numeric Attributes
For numeric attributes we make the default assumption that each of them is normally distributed for each class c_i. Let µ_ij and σ²_ij denote the mean and variance for attribute X_j, for class c_i. The likelihood for class c_i, for dimension X_j, is given as

$$P(x_j|c_i) \propto f(x_j|\mu_{ij}, \sigma_{ij}^2) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left\{-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right\}$$
Incidentally, the naive assumption corresponds to setting all the covariances in Σ_i to zero, that is,

$$\boldsymbol{\Sigma}_i = \begin{pmatrix} \sigma_{i1}^2 & 0 & \cdots & 0\\ 0 & \sigma_{i2}^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_{id}^2 \end{pmatrix}$$

This yields

$$|\boldsymbol{\Sigma}_i| = \det(\boldsymbol{\Sigma}_i) = \sigma_{i1}^2 \sigma_{i2}^2 \cdots \sigma_{id}^2 = \prod_{j=1}^{d} \sigma_{ij}^2$$

Also, we have

$$\boldsymbol{\Sigma}_i^{-1} = \begin{pmatrix} \tfrac{1}{\sigma_{i1}^2} & 0 & \cdots & 0\\ 0 & \tfrac{1}{\sigma_{i2}^2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \tfrac{1}{\sigma_{id}^2} \end{pmatrix}$$

assuming that σ²_ij ≠ 0 for all j. Finally,

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) = \sum_{j=1}^{d} \frac{(x_j-\mu_{ij})^2}{\sigma_{ij}^2}$$

Plugging these into Eq. (18.3) gives us

$$P(\mathbf{x}|c_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{\prod_{j=1}^{d}\sigma_{ij}^2}} \exp\left\{-\sum_{j=1}^{d} \frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right\} = \prod_{j=1}^{d} \left(\frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left\{-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right\}\right) = \prod_{j=1}^{d} P(x_j|c_i)$$

which is equivalent to Eq. (18.7). In other words, the joint probability has been decomposed into a product of the probability along each dimension, as required by the independence assumption.
The naive Bayes classifier uses the sample mean µ̂_i = (µ̂_i1, ..., µ̂_id)^T and a diagonal sample covariance matrix Σ̂_i = diag(σ²_i1, ..., σ²_id) for each class c_i. Thus, in total 2d parameters have to be estimated per class, corresponding to the sample mean and sample variance for each dimension X_j.
Algorithm 18.2 shows the pseudo-code for the naive Bayes classifier. Given an input dataset D, the method estimates the prior probability and mean for each class. Next, it computes the variance σ̂²_ij for each of the attributes X_j, with all the d variances for class c_i stored in the vector σ̂_i. The variance for attribute X_j is obtained by first centering the data for class D_i via Z_i = D_i − 1 · µ̂_i^T. We denote by Z_ij the centered data for class c_i corresponding to attribute X_j. The variance is then given as σ̂²_ij = (1/n_i) Z_ij^T Z_ij.
ALGORITHM 18.2. Naive Bayes Classifier

NAIVEBAYES(D = {(x_j, y_j)}_{j=1}^n):
1  for i = 1, ..., k do
2      D_i ← {x_j | y_j = c_i, j = 1, ..., n}  // class-specific subsets
3      n_i ← |D_i|  // cardinality
4      P̂(c_i) ← n_i / n  // prior probability
5      µ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j  // mean
6      Z_i ← D_i − 1 · µ̂_i^T  // centered data for class c_i
7      for j = 1, ..., d do  // class-specific variance for X_j
8          σ̂²_ij ← (1/n_i) Z_ij^T Z_ij  // variance
9      σ̂_i ← (σ̂²_i1, ..., σ̂²_id)^T  // class-specific attribute variances
10 return P̂(c_i), µ̂_i, σ̂_i, for all i = 1, ..., k

TESTING(x and P̂(c_i), µ̂_i, σ̂_i, for all i ∈ [1, k]):
11 ŷ ← argmax_{c_i} {P̂(c_i) Π_{j=1}^d f(x_j | µ̂_ij, σ̂²_ij)}
12 return ŷ
Training the naive Bayes classifier is very fast, with O(nd) computational complexity. For testing, given a test point x, it simply returns the class with the maximum posterior probability, obtained as a product of the likelihood for each dimension and the class prior probability.
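A minimal Python sketch of Algorithm 18.2 for numeric attributes follows; working in the log domain is a standard numerical-stability choice rather than the book's formulation, and the names are illustrative.

import numpy as np

def naive_bayes_fit(D, y):
    """Per class: prior, per-dimension means, and per-dimension variances."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Dc = D[y == c]
        # var(axis=0) uses the 1/n_i normalization, matching (1/n_i) Z^T Z
        params[c] = (len(Dc) / n, Dc.mean(axis=0), Dc.var(axis=0))
    return params

def naive_bayes_predict(x, params):
    """argmax_i P(c_i) * prod_j f(x_j | mu_ij, sigma2_ij); assumes nonzero variances."""
    def log_score(c):
        prior, mu, var = params[c]
        # sum of log normal densities along each dimension, plus the log prior
        return (np.log(prior)
                - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))
    return max(params, key=log_score)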
Example 18.3. Consider Example 18.1. In the naive Bayes approach the prior probabilities P̂(c_i) and means µ̂_i remain unchanged. The key difference is that the covariance matrices are assumed to be diagonal, as follows:

$$\hat{\boldsymbol{\Sigma}}_1 = \begin{pmatrix} 0.122 & 0\\ 0 & 0.142 \end{pmatrix} \qquad \hat{\boldsymbol{\Sigma}}_2 = \begin{pmatrix} 0.435 & 0\\ 0 & 0.110 \end{pmatrix}$$

Figure 18.2 shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution for both classes. One can see that the diagonal assumption leads to contours that are axis-parallel ellipses; contrast these with the contours in Figure 18.1 for the full Bayes classifier.

For the test point x = (6.75, 4.25)^T, the posterior probabilities for c_1 and c_2 are as follows:

$$\hat{P}(c_1|\mathbf{x}) \propto \hat{f}(\mathbf{x}|\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1)\hat{P}(c_1) = (3.99 \times 10^{-7}) \times 0.33 = 1.32 \times 10^{-7}$$
$$\hat{P}(c_2|\mathbf{x}) \propto \hat{f}(\mathbf{x}|\hat{\boldsymbol{\mu}}_2, \hat{\boldsymbol{\Sigma}}_2)\hat{P}(c_2) = (9.597 \times 10^{-5}) \times 0.67 = 6.43 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the class for x is predicted as ŷ = c_2.
Figure 18.2. Naive Bayes: X_1 (sepal length) versus X_2 (sepal width). The class means are shown in black; the density contours are also shown. The square represents the test point x = (6.75, 4.25)^T.
Categorical Attributes
The independence assumption leads to a simplification of the joint probability mass function in Eq. (18.5), which can be rewritten as

$$P(\mathbf{x}|c_i) = \prod_{j=1}^{d} P(x_j|c_i) = \prod_{j=1}^{d} f\left(\mathbf{X}_j = \mathbf{e}_{jr_j} \mid c_i\right)$$

where f(X_j = e_{jr_j} | c_i) is the probability mass function for X_j, which can be estimated from D_i as follows:

$$\hat{f}(\mathbf{v}_j|c_i) = \frac{n_i(\mathbf{v}_j)}{n_i}$$

where n_i(v_j) is the observed frequency of the value v_j = e_{jr_j} corresponding to the r_jth categorical value a_{jr_j} for the attribute X_j for class c_i. As in the full Bayes case, if the count is zero, we can use the pseudo-count method to obtain a prior probability. The adjusted estimates with pseudo-counts are given as

$$\hat{f}(\mathbf{v}_j|c_i) = \frac{n_i(\mathbf{v}_j) + 1}{n_i + m_j}$$

where m_j = |dom(X_j)|. Extending the code in Algorithm 18.2 to incorporate categorical attributes is straightforward.
Example 18.4. Continuing Example 18.2, the class-specific PMF for each discretized attribute is shown in Table 18.2. In particular, these correspond to the row and column marginal probabilities f̂_{X_1} and f̂_{X_2}, respectively.
The test point x = (6.75, 4.25), corresponding to (Long, Long) or v = (e_13, e_23), is classified as follows:

$$\hat{P}(\mathbf{v}|c_1) = \hat{P}(\mathbf{e}_{13}|c_1) \cdot \hat{P}(\mathbf{e}_{23}|c_1) = \left(\frac{0+1}{50+4}\right) \cdot \frac{13}{50} = 4.81 \times 10^{-3}$$
$$\hat{P}(\mathbf{v}|c_2) = \hat{P}(\mathbf{e}_{13}|c_2) \cdot \hat{P}(\mathbf{e}_{23}|c_2) = \frac{43}{100} \cdot \frac{2}{100} = 8.60 \times 10^{-3}$$
$$\hat{P}(c_1|\mathbf{v}) \propto (4.81 \times 10^{-3}) \times 0.33 = 1.59 \times 10^{-3}$$
$$\hat{P}(c_2|\mathbf{v}) \propto (8.60 \times 10^{-3}) \times 0.67 = 5.76 \times 10^{-3}$$

Thus, the predicted class is ŷ = c_2.
18.3 K NEAREST NEIGHBORS CLASSIFIER

In the preceding sections we considered a parametric approach for estimating the likelihood P(x|c_i). In this section, we consider a non-parametric approach, which does not make any assumptions about the underlying joint probability density function. Instead, it directly uses the data sample to estimate the density, for example, using the density estimation methods from Chapter 15. We illustrate the non-parametric approach using nearest neighbors density estimation from Section 15.2.3, which leads to the K nearest neighbors (KNN) classifier.
Let D be a training dataset comprising n points x_i ∈ R^d, and let D_i denote the subset of points in D that are labeled with class c_i, with n_i = |D_i|. Given a test point x ∈ R^d, and K, the number of neighbors to consider, let r denote the distance from x to its Kth nearest neighbor in D.

Consider the d-dimensional hyperball of radius r around the test point x, defined as

$$B_d(\mathbf{x}, r) = \{\mathbf{x}_i \in D \mid \delta(\mathbf{x}, \mathbf{x}_i) \le r\}$$

Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖_2. However, other distance metrics can also be used. We assume that |B_d(x, r)| = K.
Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is,

$$K_i = \left|\{\mathbf{x}_j \in B_d(\mathbf{x}, r) \mid y_j = c_i\}\right|$$

The class conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball, divided by its volume, that is,

$$\hat{f}(\mathbf{x}|c_i) = \frac{K_i/n_i}{V} = \frac{K_i}{n_i V}$$

where V = vol(B_d(x, r)) is the volume of the d-dimensional hyperball [Eq. (6.4)]. Using Eq. (18.4), the posterior probability P(c_i|x) can be estimated as

$$P(c_i|\mathbf{x}) = \frac{\hat{f}(\mathbf{x}|c_i)\hat{P}(c_i)}{\sum_{j=1}^{k} \hat{f}(\mathbf{x}|c_j)\hat{P}(c_j)}$$
However, because P̂(c_i) = n_i/n, we have

$$\hat{f}(\mathbf{x}|c_i)\hat{P}(c_i) = \frac{K_i}{n_i V} \cdot \frac{n_i}{n} = \frac{K_i}{nV}$$

Thus the posterior probability is given as

$$P(c_i|\mathbf{x}) = \frac{K_i/(nV)}{\sum_{j=1}^{k} K_j/(nV)} = \frac{K_i}{K}$$

Finally, the predicted class for x is

$$\hat{y} = \arg\max_{c_i}\{P(c_i|\mathbf{x})\} = \arg\max_{c_i}\left\{\frac{K_i}{K}\right\} = \arg\max_{c_i}\{K_i\}$$

Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
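The decision rule reduces to a few lines; the following is a minimal Python sketch of the KNN classifier under the Euclidean distance assumption, with illustrative names (D is an n × d NumPy array, y an array of labels).

import numpy as np
from collections import Counter

def knn_predict(x, D, y, K=5):
    """Predict the class of x as the majority class among its K nearest neighbors."""
    dist = np.linalg.norm(D - x, axis=1)   # Euclidean distances to all points
    nearest = np.argsort(dist)[:K]         # indices of the K nearest neighbors
    return Counter(y[nearest]).most_common(1)[0][0]   # argmax_i K_i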
Example 18.5. Consider the 2D Iris dataset shown in Figure 18.3. The two classes are: c_1 (circles) with n_1 = 50 points and c_2 (triangles) with n_2 = 100 points. Let us classify the test point x = (6.75, 4.25)^T using its K = 5 nearest neighbors. The distance from x to its 5th nearest neighbor, namely (6.2, 3.4)^T, is given as r = √1.025 = 1.012. The enclosing ball or circle of radius r is shown in the figure. It encompasses K_1 = 1 point from class c_1 and K_2 = 4 points from class c_2. Therefore, the predicted class for x is ŷ = c_2.
Figure 18.3. Iris data: K nearest neighbors classifier. The ball of radius r around the test point x = (6.75, 4.25)^T encloses its K = 5 nearest neighbors.
18.4 FURTHER READING

The naive Bayes classifier is surprisingly effective even though the independence assumption is usually violated in real datasets. Comparisons of the naive Bayes classifier against other classification approaches, and reasons for why it works well, have appeared in Langley, Iba, and Thompson (1992); Domingos and Pazzani (1997); Zhang (2005); Hand and Yu (2001); and Rish (2001). For the long history of naive Bayes in information retrieval see Lewis (1998). The K nearest neighbor classification approach was first proposed in Fix and Hodges, Jr. (1951).
Domingos, P. and Pazzani, M. (1997). "On the optimality of the simple Bayesian classifier under zero-one loss." Machine Learning, 29(2–3): 103–130.
Fix, E. and Hodges, Jr., J. L. (1951). Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, TX, Project 21-49-004, Report 4, Contract AF41(128)-31.
Hand, D. J. and Yu, K. (2001). "Idiot's Bayes – not so stupid after all?" International Statistical Review, 69(3): 385–398.
Langley, P., Iba, W., and Thompson, K. (1992). "An analysis of Bayesian classifiers." In Proceedings of the National Conference on Artificial Intelligence, pp. 223–228.
Lewis, D. D. (1998). "Naive (Bayes) at forty: The independence assumption in information retrieval." In Proceedings of the 10th European Conference on Machine Learning, pp. 4–15.
Rish, I. (2001). "An empirical study of the naive Bayes classifier." In Proceedings of the IJCAI Workshop on Empirical Methods in Artificial Intelligence, pp. 41–46.
Zhang, H. (2005). "Exploring conditions for the optimality of naive Bayes." International Journal of Pattern Recognition and Artificial Intelligence, 19(2): 183–198.
18.5 EXERCISES

Q1. Consider the dataset in Table 18.3. Classify the new point (Age = 23, Car = truck) via the full and naive Bayes approaches. You may assume that the domain of Car is given as {sports, vintage, suv, truck}.

Table 18.3. Data for Q1

x_i    Age    Car      Class
x_1    25     sports   L
x_2    20     vintage  H
x_3    25     sports   L
x_4    45     suv      H
x_5    20     sports   H
x_6    25     suv      H

Q2. Given the dataset in Table 18.4, use the naive Bayes classifier to classify the new point (T, F, 1.0).
Table 18.4. Data for Q2

x_i    a_1   a_2   a_3   Class
x_1    T     T     5.0   Y
x_2    T     T     7.0   Y
x_3    T     F     8.0   N
x_4    F     F     3.0   Y
x_5    F     T     7.0   N
x_6    F     T     4.0   N
x_7    F     F     5.0   N
x_8    T     F     6.0   Y
x_9    F     T     1.0   N
Q3. Consider the class means and covariance matrices for classes c_1 and c_2:

$$\boldsymbol{\mu}_1 = (1, 3)^T \qquad \boldsymbol{\mu}_2 = (5, 5)^T$$
$$\boldsymbol{\Sigma}_1 = \begin{pmatrix} 5 & 3\\ 3 & 2 \end{pmatrix} \qquad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0\\ 0 & 1 \end{pmatrix}$$

Classify the point (3, 4)^T via the (full) Bayesian approach, assuming normally distributed classes, and P(c_1) = P(c_2) = 0.5. Show all steps. Recall that the inverse of a 2 × 2 matrix

$$\mathbf{A} = \begin{pmatrix} a & b\\ c & d \end{pmatrix} \text{ is given as } \mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}\begin{pmatrix} d & -b\\ -c & a \end{pmatrix}$$
CHAPTER 19
Decision Tree Classifier
Let the training dataset D = {x_i, y_i}_{i=1}^n consist of n points in a d-dimensional space, with y_i being the class label for point x_i. We assume that the dimensions or the attributes X_j are numeric or categorical, and that there are k distinct classes, so that y_i ∈ {c_1, c_2, ..., c_k}. A decision tree classifier is a recursive, partition-based tree model that predicts the class ŷ_i for each point x_i. Let R denote the data space that encompasses the set of input points D. A decision tree uses an axis-parallel hyperplane to split the data space R into two resulting half-spaces or regions, say R_1 and R_2, which also induces a partition of the input points into D_1 and D_2, respectively. Each of these regions is recursively split via axis-parallel hyperplanes until the points within an induced partition are relatively pure in terms of their class labels, that is, until most of the points belong to the same class. The resulting hierarchy of split decisions constitutes the decision tree model, with the leaf nodes labeled with the majority class among points in those regions. To classify a new test point we recursively evaluate which half-space it belongs to until we reach a leaf node in the decision tree, at which point we predict its class as the label of the leaf.
Example 19.1. Consider the Iris dataset shown in Figure 19.1a, which plots the attributes sepal length (X_1) and sepal width (X_2). The classification task is to discriminate between c_1, corresponding to iris-setosa (in circles), and c_2, corresponding to the other two types of Irises (in triangles). The input dataset D has n = 150 points that lie in the data space, which is given as the rectangle R = range(X_1) × range(X_2) = [4.3, 7.9] × [2.0, 4.4].

The recursive partitioning of the space R via axis-parallel hyperplanes is illustrated in Figure 19.1a. In two dimensions a hyperplane is simply a line. The first split corresponds to hyperplane h_0, shown as a black line. The resulting left and right half-spaces are further split via hyperplanes h_2 and h_3, respectively (shown as gray lines). The bottom half-space for h_2 is further split via h_4, and the top half-space for h_3 is split via h_5; these third-level hyperplanes, h_4 and h_5, are shown as dashed lines. The set of hyperplanes and the set of six leaf regions, namely R_1, ..., R_6, constitute the decision tree model. Note also the induced partitioning of the input points into these six regions.
Figure 19.1. Decision trees: recursive partitioning via axis-parallel hyperplanes. (a) Recursive splits: the hyperplanes h_0, h_2, h_3, h_4, and h_5 partition the space over X_1 and X_2 into the leaf regions R_1, ..., R_6; the test point z is shown as a white square. (b) Decision tree: the root tests X_1 ≤ 5.45; its Yes branch tests X_2 ≤ 2.8, whose Yes branch tests X_1 ≤ 4.7 with leaves R_3 (c_1: 1, c_2: 0) and R_4 (c_1: 0, c_2: 6), and whose No branch is leaf R_1 (c_1: 44, c_2: 1); the root's No branch tests X_2 ≤ 3.45, whose Yes branch is leaf R_2 (c_1: 0, c_2: 90), and whose No branch tests X_1 ≤ 6.5 with leaves R_5 (c_1: 5, c_2: 0) and R_6 (c_1: 0, c_2: 3).
Consider the test point z = (6.75, 4.25)^T (shown as a white square). To predict its class, the decision tree first checks which side of h_0 it lies on. Because the point lies in the right half-space, the decision tree next checks h_3 to determine that z is in the top half-space. Finally, we check and find that z is in the right half-space of h_5, and we reach the leaf region R_6. The predicted class is c_2, as that leaf region has all points (three of them) with class c_2 (triangles).
19.1 DECISION TREES

A decision tree consists of internal nodes that represent the decisions corresponding to the hyperplanes or split points (i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which are labeled with the majority class. A region is characterized by the subset of data points that lie in that region.

Axis-Parallel Hyperplanes
A hyperplane h(x) is defined as the set of all points x that satisfy the following equation:

$$h(\mathbf{x}): \mathbf{w}^T\mathbf{x} + b = 0 \tag{19.1}$$

Here w ∈ R^d is a weight vector that is normal to the hyperplane, and b is the offset of the hyperplane from the origin. A decision tree considers only axis-parallel hyperplanes, that is, the weight vector must be parallel to one of the original dimensions or axes X_j. Put differently, the weight vector w is restricted a priori to one of the standard basis vectors {e_1, e_2, ..., e_d}, where e_j ∈ R^d has a 1 for the jth dimension, and 0 for all other dimensions. If x = (x_1, x_2, ..., x_d)^T and assuming w = e_j, we can rewrite Eq. (19.1) as

$$h(\mathbf{x}): \mathbf{e}_j^T\mathbf{x} + b = 0$$

which implies that

$$h(\mathbf{x}): x_j + b = 0$$

where the choice of the offset b yields different hyperplanes along dimension X_j.
Split Points
A hyperplane specifies a decision or split point because it splits the data space R into two half-spaces. All points x such that h(x) ≤ 0 are on the hyperplane or to one side of the hyperplane, whereas all points such that h(x) > 0 are on the other side. The split point associated with an axis-parallel hyperplane can be written as h(x) ≤ 0, which implies that x_j + b ≤ 0, or x_j ≤ −b. Because x_j is some value from dimension X_j and the offset b can be chosen to be any value, the generic form of a split point for a numeric attribute X_j is given as

$$X_j \le v$$

where v = −b is some value in the domain of attribute X_j. The decision or split point X_j ≤ v thus splits the input data space R into two regions R_Y and R_N, which denote the sets of all possible points that satisfy the decision and those that do not.
Data Partition
Each split of R into R_Y and R_N also induces a binary partition of the corresponding input data points D. That is, a split point of the form X_j ≤ v induces the data partition

$$D_Y = \{\mathbf{x} \mid \mathbf{x} \in D,\, x_j \le v\} \qquad D_N = \{\mathbf{x} \mid \mathbf{x} \in D,\, x_j > v\}$$

where D_Y is the subset of data points that lie in region R_Y and D_N is the subset of input points that lie in R_N.
Purity
The purity of a region R_j is defined in terms of the mixture of classes for points in the corresponding data partition D_j. Formally, purity is the fraction of points with the majority label in D_j, that is,

$$purity(D_j) = \max_i \left\{\frac{n_{ji}}{n_j}\right\} \tag{19.2}$$

where n_j = |D_j| is the total number of data points in the region R_j, and n_ji is the number of points in D_j with class label c_i.
Example 19.2. Figure 19.1b shows the resulting decision tree that corresponds to the recursive partitioning of the space via axis-parallel hyperplanes illustrated in Figure 19.1a. The recursive splitting terminates when appropriate stopping conditions are met, usually taking into account the size and purity of the regions. In this example, we use a size threshold of 5 and a purity threshold of 0.95. That is, a region will be split further only if the number of points is more than five and the purity is less than 0.95.

The very first hyperplane to be considered is h_0(x): x_1 − 5.45 = 0, which corresponds to the decision X_1 ≤ 5.45 at the root of the decision tree. The two resulting half-spaces are recursively split into smaller half-spaces. For example, the region X_1 ≤ 5.45 is further split using the hyperplane h_2(x): x_2 − 2.8 = 0, corresponding to the decision X_2 ≤ 2.8, which forms the left child of the root. Notice how this hyperplane is restricted only to the region X_1 ≤ 5.45. This is because each region is considered independently after the split, as if it were a separate dataset. There are seven points that satisfy the condition X_2 ≤ 2.8, of which one is from class c_1 (circle) and six are from class c_2 (triangles). The purity of this region is therefore 6/7 = 0.857. Because the region has more than five points, and its purity is less than 0.95, it is further split via the hyperplane h_4(x): x_1 − 4.7 = 0, yielding the left-most decision node X_1 ≤ 4.7 in the decision tree shown in Figure 19.1b.

Returning to the right half-space corresponding to h_2, namely the region X_2 > 2.8, it has 45 points, of which only one is a triangle. The size of the region is 45, but the purity is 44/45 = 0.98. Because the region exceeds the purity threshold it is not split further. Instead, it becomes a leaf node in the decision tree, and the entire region (R_1) is labeled with the majority class c_1. The frequency for each class is also noted at a leaf node so that the potential error rate for that leaf can be computed. For example, we can expect that the probability of misclassification in region R_1 is 1/45 = 0.022, which is the error rate for that leaf.
Categorical Attributes
In addition to numeric attributes, a decision tree can also handle categorical data. For a categorical attribute X_j, the split points or decisions are of the form X_j ∈ V, where V ⊂ dom(X_j), and dom(X_j) denotes the domain for X_j. Intuitively, this split can be considered to be the categorical analog of a hyperplane. It results in two "half-spaces," one region R_Y consisting of points x that satisfy the condition x_j ∈ V, and the other region R_N comprising points that satisfy the condition x_j ∉ V.
Decision Rules
One of the advantages of decision trees is that they produce models that are relatively easy to interpret. In particular, a tree can be read as a set of decision rules, with each rule's antecedent comprising the decisions on the internal nodes along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.
Example 19.3. Consider the decision tree in Figure 19.1b. It can be interpreted as the following set of disjunctive rules, one per leaf region R_i:

R_3: If X_1 ≤ 5.45 and X_2 ≤ 2.8 and X_1 ≤ 4.7, then class is c_1, or
R_4: If X_1 ≤ 5.45 and X_2 ≤ 2.8 and X_1 > 4.7, then class is c_2, or
R_1: If X_1 ≤ 5.45 and X_2 > 2.8, then class is c_1, or
R_2: If X_1 > 5.45 and X_2 ≤ 3.45, then class is c_2, or
R_5: If X_1 > 5.45 and X_2 > 3.45 and X_1 ≤ 6.5, then class is c_1, or
R_6: If X_1 > 5.45 and X_2 > 3.45 and X_1 > 6.5, then class is c_2
19.2 DECISION TREE ALGORITHM

The pseudo-code for decision tree model construction is shown in Algorithm 19.1. It takes as input a training dataset D, and two parameters η and π, where η is the leaf size and π the leaf purity threshold. Different split points are evaluated for each attribute in D. Numeric decisions are of the form X_j ≤ v for some value v in the value range for attribute X_j, and categorical decisions are of the form X_j ∈ V for some subset of values in the domain of X_j. The best split point is chosen to partition the data into two subsets, D_Y and D_N, where D_Y corresponds to all points x ∈ D that satisfy the split decision, and D_N corresponds to all points that do not satisfy the split decision. The decision tree method is then called recursively on D_Y and D_N. A number of stopping conditions can be used to stop the recursive partitioning process. The simplest condition is based on the size of the partition D. If the number of points n in D drops below the user-specified size threshold η, then we stop the partitioning process and make D a leaf. This condition prevents over-fitting the model to the training set, by avoiding the modeling of very small subsets of the data. Size alone is not sufficient because if the partition is already pure then it does not make sense to split it further. Thus, the recursive partitioning is also terminated if the purity of D is above the purity threshold π. Details of how the split points are evaluated and chosen are given next.
ALGORITHM 19.1. Decision Tree Algorithm

DECISIONTREE(D, η, π):
1  n ← |D|  // partition size
2  n_i ← |{x_j | x_j ∈ D, y_j = c_i}|  // size of class c_i
3  purity(D) ← max_i {n_i / n}
4  if n ≤ η or purity(D) ≥ π then  // stopping condition
5      c* ← argmax_{c_i} {n_i / n}  // majority class
6      create leaf node, and label it with class c*
7      return
8  (split point*, score*) ← (∅, 0)  // initialize best split point
9  foreach (attribute X_j) do
10     if (X_j is numeric) then
11         (v, score) ← EVALUATE-NUMERIC-ATTRIBUTE(D, X_j)
12         if score > score* then (split point*, score*) ← (X_j ≤ v, score)
13     else if (X_j is categorical) then
14         (V, score) ← EVALUATE-CATEGORICAL-ATTRIBUTE(D, X_j)
15         if score > score* then (split point*, score*) ← (X_j ∈ V, score)
   // partition D into D_Y and D_N using split point*, and call recursively
16 D_Y ← {x ∈ D | x satisfies split point*}
17 D_N ← {x ∈ D | x does not satisfy split point*}
18 create internal node split point*, with two child nodes, D_Y and D_N
19 DECISIONTREE(D_Y); DECISIONTREE(D_N)
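A compact recursive sketch of Algorithm 19.1 in Python follows, restricted to numeric attributes and parameterized by a split evaluation helper compatible with the evaluate_numeric_attribute sketch given later in this chapter; the tuple-based tree representation is an illustrative choice (D is an n × d NumPy array, y an array of labels).

from collections import Counter

def decision_tree(D, y, eta, pi, evaluate):
    """Recursively grow a tree in the spirit of Algorithm 19.1 (numeric splits only)."""
    n = len(y)
    counts = Counter(y)
    purity = max(counts.values()) / n
    if n <= eta or purity >= pi:                 # stopping condition
        return ("leaf", counts.most_common(1)[0][0])   # majority class
    best_score, best_j, best_v = 0.0, None, None
    for j in range(D.shape[1]):                  # evaluate each attribute X_j
        v, score = evaluate(D[:, j], y)          # e.g., the sketch of Algorithm 19.2
        if v is not None and score > best_score:
            best_score, best_j, best_v = score, j, v
    if best_j is None:
        return ("leaf", counts.most_common(1)[0][0])   # no useful split found
    mask = D[:, best_j] <= best_v                # split point X_j <= v
    return ("node", best_j, best_v,
            decision_tree(D[mask], y[mask], eta, pi, evaluate),
            decision_tree(D[~mask], y[~mask], eta, pi, evaluate))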
19.2.1 Split Point Evaluation Measures

Given a split point of the form X_j ≤ v or X_j ∈ V for a numeric or categorical attribute, respectively, we need an objective criterion for scoring the split point. Intuitively, we want to select a split point that gives the best separation or discrimination between the different class labels.

Entropy
Entropy, in general, measures the amount of disorder or uncertainty in a system. In the classification setting, a partition has lower entropy (or low disorder) if it is relatively pure, that is, if most of the points have the same label. On the other hand, a partition has higher entropy (or more disorder) if the class labels are mixed, and there is no majority class as such.
The entropy of a set of labeled points D is defined as follows:

$$H(D) = -\sum_{i=1}^{k} P(c_i|D) \log_2 P(c_i|D) \tag{19.3}$$

where P(c_i|D) is the probability of class c_i in D, and k is the number of classes. If a region is pure, that is, has points from the same class, then the entropy is zero. On the other hand, if the classes are all mixed up, and each appears with equal probability P(c_i|D) = 1/k, then the entropy has the highest value, H(D) = log_2 k.
Assume that a split point partitions D into D_Y and D_N. Define the split entropy as the weighted entropy of each of the resulting partitions, given as

$$H(D_Y, D_N) = \frac{n_Y}{n} H(D_Y) + \frac{n_N}{n} H(D_N) \tag{19.4}$$

where n = |D| is the number of points in D, and n_Y = |D_Y| and n_N = |D_N| are the number of points in D_Y and D_N.
To see if the split point results in a reduced overall entropy, we define the information gain for a given split point as follows:

$$Gain(D, D_Y, D_N) = H(D) - H(D_Y, D_N) \tag{19.5}$$

The higher the information gain, the more the reduction in entropy, and the better the split point. Thus, given split points and their corresponding partitions, we can score each split point and choose the one that gives the highest information gain.
Gini Index
Another common measure to gauge the purity of a split point is the Gini index, defined as follows:

$$G(D) = 1 - \sum_{i=1}^{k} P(c_i|D)^2 \tag{19.6}$$

If the partition is pure, then the probability of the majority class is 1 and the probability of all other classes is 0, and thus the Gini index is 0. On the other hand, when each class is equally represented, with probability P(c_i|D) = 1/k, then the Gini index has value (k−1)/k. Thus, higher values of the Gini index indicate more disorder, and lower values indicate more order in terms of the class labels.
We can compute the weighted Gini index of a split point as follows:

$$G(D_Y, D_N) = \frac{n_Y}{n} G(D_Y) + \frac{n_N}{n} G(D_N)$$

where n, n_Y, and n_N denote the number of points in D, D_Y, and D_N, respectively. The lower the Gini index value, the better the split point.
Other measures can also be used instead of entropy and Gini index to evaluate the splits. For example, the Classification And Regression Trees (CART) measure is given as

$$CART(D_Y, D_N) = 2\,\frac{n_Y}{n}\,\frac{n_N}{n} \sum_{i=1}^{k} \left|P(c_i|D_Y) - P(c_i|D_N)\right| \tag{19.7}$$

This measure thus prefers a split point that maximizes the difference between the class probability mass functions for the two partitions; the higher the CART measure, the better the split point.
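The three measures translate directly into code; the following Python sketch scores a split from the class counts of the two partitions, with illustrative names.

import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 log 0 is taken as 0
    return -np.sum(p * np.log2(p))     # Eq. (19.3)

def gini(counts):
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)        # Eq. (19.6)

def gain(counts_Y, counts_N):
    nY, nN = counts_Y.sum(), counts_N.sum()
    n = nY + nN
    split_H = (nY / n) * entropy(counts_Y) + (nN / n) * entropy(counts_N)  # Eq. (19.4)
    return entropy(counts_Y + counts_N) - split_H                          # Eq. (19.5)

def cart(counts_Y, counts_N):
    nY, nN = counts_Y.sum(), counts_N.sum()
    n = nY + nN
    pY, pN = counts_Y / nY, counts_N / nN
    return 2 * (nY / n) * (nN / n) * np.sum(np.abs(pY - pN))               # Eq. (19.7)

For example, gain(np.array([45., 7.]), np.array([5., 93.])) reproduces the score 0.53 computed for the split X_1 ≤ 5.45 in Example 19.4 below.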
19.2.2 Evaluating Split Points

All of the split point evaluation measures, such as entropy [Eq. (19.3)], Gini index [Eq. (19.6)], and CART [Eq. (19.7)], considered in the preceding section depend on the class probability mass function (PMF) for D, namely P(c_i|D), and the class PMFs for the resulting partitions D_Y and D_N, namely P(c_i|D_Y) and P(c_i|D_N). Note that we have to compute the class PMFs for all possible split points; scoring each of them independently would result in significant computational overhead. Instead, one can incrementally compute the PMFs as described in the following paragraphs.
Numeric Attributes
If X is a numeric attribute, we have to evaluate split points of the form X ≤ v. Even if we restrict v to lie within the value range of attribute X, there are still an infinite number of choices for v. One reasonable approach is to consider only the midpoints between two successive distinct values for X in the sample D. This is because split points of the form X ≤ v, for v ∈ [x_a, x_b), where x_a and x_b are two successive distinct values of X in D, produce the same partitioning of D into D_Y and D_N, and thus yield the same scores. Because there can be at most n distinct values for X, there are at most n − 1 midpoint values to consider.

Let {v_1, ..., v_m} denote the set of all such midpoints, such that v_1 < v_2 < ··· < v_m.
X
≤
v
, we have to estimate the class PMFs:
ˆ
P(c
i
|
D
Y
)
=
ˆ
P(c
i
|
X
≤
v)
(19.8)
ˆ
P(c
i
|
D
N
)
=
ˆ
P(c
i
|
X
>v)
(19.9)
Let I() be an indicator variable that takes on the value 1 only when its argument is true, and is 0 otherwise. Using the Bayes theorem, we have

$$\hat{P}(c_i|X \le v) = \frac{\hat{P}(X \le v|c_i)\hat{P}(c_i)}{\hat{P}(X \le v)} = \frac{\hat{P}(X \le v|c_i)\hat{P}(c_i)}{\sum_{j=1}^{k} \hat{P}(X \le v|c_j)\hat{P}(c_j)} \tag{19.10}$$
The prior probability for each class in D can be estimated as follows:

$$\hat{P}(c_i) = \frac{1}{n}\sum_{j=1}^{n} I(y_j = c_i) = \frac{n_i}{n} \tag{19.11}$$

where y_j is the class for point x_j, n = |D| is the total number of points, and n_i is the number of points in D with class c_i. Define N_vi as the number of points x_j ≤ v with class c_i, where x_j is the value of data point x_j for the attribute X, given as

$$N_{vi} = \sum_{j=1}^{n} I(x_j \le v \text{ and } y_j = c_i) \tag{19.12}$$
We can then estimate P(X ≤ v|c_i) as follows:

$$\hat{P}(X \le v|c_i) = \frac{\hat{P}(X \le v \text{ and } c_i)}{\hat{P}(c_i)} = \frac{\frac{1}{n}\sum_{j=1}^{n} I(x_j \le v \text{ and } y_j = c_i)}{n_i/n} = \frac{N_{vi}}{n_i} \tag{19.13}$$
Plugging Eqs. (19.11) and (19.13) into Eq. (19.10), and using Eq. (19.8), we have

$$\hat{P}(c_i|D_Y) = \hat{P}(c_i|X \le v) = \frac{N_{vi}}{\sum_{j=1}^{k} N_{vj}} \tag{19.14}$$
We can estimate P̂(X > v|c_i) as follows:

$$\hat{P}(X > v|c_i) = 1 - \hat{P}(X \le v|c_i) = 1 - \frac{N_{vi}}{n_i} = \frac{n_i - N_{vi}}{n_i} \tag{19.15}$$
Using Eqs. (19.11) and (19.15), the class PMF P̂(c_i|D_N) is given as

$$\hat{P}(c_i|D_N) = \hat{P}(c_i|X > v) = \frac{\hat{P}(X > v|c_i)\hat{P}(c_i)}{\sum_{j=1}^{k} \hat{P}(X > v|c_j)\hat{P}(c_j)} = \frac{n_i - N_{vi}}{\sum_{j=1}^{k} (n_j - N_{vj})} \tag{19.16}$$
Algorithm 19.2 shows the split point evaluation method for numeric attributes. The for loop on line 4 iterates through all the points and computes the midpoint values v and the number of points N_vi from class c_i such that x_j ≤ v. The for loop on line 12 enumerates all possible split points of the form X ≤ v, one for each midpoint v, and scores them using the gain criterion [Eq. (19.5)]; the best split point and score are recorded and returned. Any of the other evaluation measures can also be used. However, for Gini index and CART a lower score is better, unlike for gain, where a higher score is better.

In terms of computational complexity, the initial sorting of values of X (line 1) takes time O(n log n). The cost of computing the midpoints and the class-specific counts N_vi takes time O(nk) (for loop on line 4). The cost of computing the score is also bounded by O(nk), because the total number of midpoints v can be at most n (for loop on line 12). The total cost of evaluating a numeric attribute is therefore O(n log n + nk). Ignoring k, because it is usually a small constant, the total cost of numeric split point evaluation is O(n log n).
ALGORITHM 19.2. Evaluate Numeric Attribute (Using Gain)

EVALUATE-NUMERIC-ATTRIBUTE(D, X):
1  sort D on attribute X, so that x_j ≤ x_{j+1}, for all j = 1, ..., n−1
2  M ← ∅  // set of midpoints
3  for i = 1, ..., k do n_i ← 0
4  for j = 1, ..., n−1 do
5      if y_j = c_i then n_i ← n_i + 1  // running count for class c_i
6      if x_{j+1} ≠ x_j then
7          v ← (x_{j+1} + x_j)/2; M ← M ∪ {v}  // midpoints
8          for i = 1, ..., k do
9              N_vi ← n_i  // number of points such that x_j ≤ v and y_j = c_i
10 if y_n = c_i then n_i ← n_i + 1
   // evaluate split points of the form X ≤ v
11 v* ← ∅; score* ← 0  // initialize best split point
12 forall v ∈ M do
13     for i = 1, ..., k do
14         P̂(c_i | D_Y) ← N_vi / Σ_{j=1}^k N_vj
15         P̂(c_i | D_N) ← (n_i − N_vi) / Σ_{j=1}^k (n_j − N_vj)
16     score(X ≤ v) ← Gain(D, D_Y, D_N)  // use Eq. (19.5)
17     if score(X ≤ v) > score* then
18         v* ← v; score* ← score(X ≤ v)
19 return (v*, score*)

Example 19.4 (Numeric Attributes). Consider the 2-dimensional Iris dataset shown in Figure 19.1a. In the initial invocation of Algorithm 19.1, the entire dataset D with n = 150 points is considered at the root of the decision tree. The task is to find the best split point considering both the attributes, X_1 (sepal length) and X_2 (sepal width). Because there are n_1 = 50 points labeled c_1 (iris-setosa), the other class c_2 has n_2 = 100 points. We thus have

$$\hat{P}(c_1) = 50/150 = 1/3 \qquad \hat{P}(c_2) = 100/150 = 2/3$$
The entropy [Eq. (19.3)] of the dataset D is therefore

$$H(D) = -\left(\frac{1}{3}\log_2\frac{1}{3} + \frac{2}{3}\log_2\frac{2}{3}\right) = 0.918$$

Consider split points for attribute X_1. To evaluate the splits we first compute the frequencies N_vi using Eq. (19.12), which are plotted in Figure 19.2 for both the classes. For example, consider the split point X_1 ≤ 5.45. From Figure 19.2, we see that N_v1 = 45 and N_v2 = 7. Plugging these values into Eq. (19.14) we get

$$\hat{P}(c_1|D_Y) = \frac{N_{v1}}{N_{v1} + N_{v2}} = \frac{45}{45+7} = 0.865$$
$$\hat{P}(c_2|D_Y) = \frac{N_{v2}}{N_{v1} + N_{v2}} = \frac{7}{45+7} = 0.135$$
Figure 19.2. Iris: frequencies N_vi for classes c_1 (iris-setosa) and c_2 (other) for attribute sepal length; at the midpoint v = 5.45, N_v1 = 45 and N_v2 = 7.
and using Eq. (19.16), we obtain

$$\hat{P}(c_1|D_N) = \frac{n_1 - N_{v1}}{(n_1 - N_{v1}) + (n_2 - N_{v2})} = \frac{50-45}{(50-45)+(100-7)} = 0.051$$
$$\hat{P}(c_2|D_N) = \frac{n_2 - N_{v2}}{(n_1 - N_{v1}) + (n_2 - N_{v2})} = \frac{100-7}{(50-45)+(100-7)} = 0.949$$
We can now compute the entropy of the partitions D_Y and D_N as follows:

$$H(D_Y) = -(0.865\log_2 0.865 + 0.135\log_2 0.135) = 0.571$$
$$H(D_N) = -(0.051\log_2 0.051 + 0.949\log_2 0.949) = 0.291$$
The entropy of the split point X_1 ≤ 5.45 is given via Eq. (19.4) as

$$H(D_Y, D_N) = \frac{52}{150} H(D_Y) + \frac{98}{150} H(D_N) = 0.388$$

where n_Y = |D_Y| = 52 and n_N = |D_N| = 98. The information gain for the split point is therefore

$$Gain = H(D) - H(D_Y, D_N) = 0.918 - 0.388 = 0.53$$

In a similar manner, we can evaluate all of the split points for both attributes X_1 and X_2. Figure 19.3 plots the gain values for the different split points for the two attributes. We can observe that X_1 ≤ 5.45 is the best split point, and it is thus chosen as the root of the decision tree in Figure 19.1b.

The recursive tree growth process continues and yields the final decision tree and the split points as shown in Figure 19.1b. In this example, we use a leaf size threshold of 5 and a purity threshold of 0.95.
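A compact Python sketch of the midpoint sweep in Algorithm 19.2 follows, reusing the gain function from the earlier sketch; names are illustrative (x is one attribute column, y the label array).

import numpy as np

def evaluate_numeric_attribute(x, y):
    """Best midpoint split X <= v for one attribute, in the spirit of Algorithm 19.2."""
    classes = list(np.unique(y))
    order = np.argsort(x, kind="stable")       # sort D on attribute X
    x, y = x[order], y[order]
    total = np.array([np.sum(y == c) for c in classes], dtype=float)  # n_i per class
    Nv = np.zeros(len(classes))                # running counts N_vi
    best_v, best_score = None, 0.0
    for j in range(len(x) - 1):
        Nv[classes.index(y[j])] += 1           # point x_j now satisfies x_j <= v
        if x[j + 1] != x[j]:                   # midpoint between distinct values
            v = (x[j] + x[j + 1]) / 2
            score = gain(Nv, total - Nv)       # Eq. (19.5), from the earlier sketch
            if score > best_score:
                best_v, best_score = v, score
    return best_v, best_score

On the Iris data of Example 19.4, this sweep would recover the split X_1 ≤ 5.45 with gain 0.53.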
Figure 19.3. Iris: information gain for the different split points X_i ≤ v, for sepal length (X_1) and sepal width (X_2); the maximum gain is attained at X_1 ≤ 5.45.
Categorical Attributes
If X is a categorical attribute we evaluate split points of the form X ∈ V, where V ⊂ dom(X) and V ≠ ∅. In words, all distinct partitions of the set of values of X are considered. Because the split point X ∈ V yields the same partition as X ∈ V̄, where V̄ = dom(X) \ V is the complement of V, the total number of distinct partitions is given as

$$\sum_{i=1}^{\lfloor m/2 \rfloor} \binom{m}{i} = O(2^{m-1}) \tag{19.17}$$

where m is the number of values in the domain of X, that is, m = |dom(X)|. The number of possible split points to consider is therefore exponential in m, which can pose problems if m is large. One simplification is to restrict V to be of size one, so that there are only m split points of the form X_j ∈ {v}, where v ∈ dom(X_j).
To evaluate a given split point X ∈ V we have to compute the following class probability mass functions:

$$P(c_i|D_Y) = P(c_i|X \in V) \qquad P(c_i|D_N) = P(c_i|X \notin V)$$
Making use of the Bayes theorem, we have

$$P(c_i|X \in V) = \frac{P(X \in V|c_i)P(c_i)}{P(X \in V)} = \frac{P(X \in V|c_i)P(c_i)}{\sum_{j=1}^{k} P(X \in V|c_j)P(c_j)}$$

However, note that a given point x can take on only one value in the domain of X, and thus the values v ∈ dom(X) are mutually exclusive. Therefore, we have

$$P(X \in V|c_i) = \sum_{v \in V} P(X = v|c_i)$$
and we can rewrite P(c_i|D_Y) as

$$P(c_i|D_Y) = \frac{\sum_{v \in V} P(X = v|c_i)P(c_i)}{\sum_{j=1}^{k} \sum_{v \in V} P(X = v|c_j)P(c_j)} \tag{19.18}$$
Define n_vi as the number of points x_j ∈ D, with value x_j = v for attribute X and having class y_j = c_i:

$$n_{vi} = \sum_{j=1}^{n} I(x_j = v \text{ and } y_j = c_i) \tag{19.19}$$
The class conditional empirical PMF for X is then given as

$$\hat{P}(X = v|c_i) = \frac{\hat{P}(X = v \text{ and } c_i)}{\hat{P}(c_i)} = \frac{\frac{1}{n}\sum_{j=1}^{n} I(x_j = v \text{ and } y_j = c_i)}{n_i/n} = \frac{n_{vi}}{n_i} \tag{19.20}$$
Note that the class prior probabilities can be estimated using Eq. (19.11) as discussed earlier, that is, P̂(c_i) = n_i/n. Thus, substituting Eq. (19.20) in Eq. (19.18), the class PMF for the partition D_Y for the split point X ∈ V is given as

$$\hat{P}(c_i|D_Y) = \frac{\sum_{v \in V} \hat{P}(X = v|c_i)\hat{P}(c_i)}{\sum_{j=1}^{k} \sum_{v \in V} \hat{P}(X = v|c_j)\hat{P}(c_j)} = \frac{\sum_{v \in V} n_{vi}}{\sum_{j=1}^{k} \sum_{v \in V} n_{vj}} \tag{19.21}$$
In a similar manner, the class PMF for the partition D_N is given as

$$\hat{P}(c_i|D_N) = \hat{P}(c_i|X \notin V) = \frac{\sum_{v \notin V} n_{vi}}{\sum_{j=1}^{k} \sum_{v \notin V} n_{vj}} \tag{19.22}$$
Algorithm 19.3 shows the split point evaluation method for categorical attributes. The for loop on line 4 iterates through all the points and computes n_vi, that is, the number of points having value v ∈ dom(X) and class c_i. The for loop on line 7 enumerates all possible split points of the form X ∈ V for V ⊂ dom(X), such that |V| ≤ l, where l is a user-specified parameter denoting the maximum cardinality of V. For example, to control the number of split points, we can restrict V to be a single item, that is, l = 1, so that splits are of the form X ∈ {v}, with v ∈ dom(X). If l = ⌊m/2⌋, we have to consider all possible distinct partitions V. Given a split point X ∈ V, the method scores it using information gain [Eq. (19.5)], although any of the other scoring criteria can also be used. The best split point and score are recorded and returned.
criteria can also be used. The best split point and score are recorded and returned.
In terms of computational complexity the class-specific counts for each value
n
vi
takes
O
(n)
time (for loop on line 4). With
m
= |
dom(
X
)
|
, the maximum number of
partitions
V
is
O
(
2
m
−
1
)
, and because each split point can be evaluated in time
O
(mk)
,
the for loop in line 7 takes time
O
(mk
2
m
−
1
)
. The total cost for categorical attributes
is therefore
O
(n
+
mk
2
m
−
1
)
. If we make the assumption that 2
m
−
1
=
O
(n)
, that is, if
we bound the maximum size of
V
to
l
=
O
(
log
n)
, then the cost of categorical splits is
bounded as
O
(n
log
n)
, ignoring
k
.
ALGORITHM 19.3. Evaluate Categorical Attribute (Using Gain)

EVALUATE-CATEGORICAL-ATTRIBUTE(D, X, l):
1  for i = 1, ..., k do
2      n_i ← 0
3      forall v ∈ dom(X) do n_vi ← 0
4  for j = 1, ..., n do
5      if x_j = v and y_j = c_i then n_vi ← n_vi + 1  // frequency statistics
   // evaluate split points of the form X ∈ V
6  V* ← ∅; score* ← 0  // initialize best split point
7  forall V ⊂ dom(X), such that 1 ≤ |V| ≤ l do
8      for i = 1, ..., k do
9          P̂(c_i | D_Y) ← Σ_{v∈V} n_vi / Σ_{j=1}^k Σ_{v∈V} n_vj
10         P̂(c_i | D_N) ← Σ_{v∉V} n_vi / Σ_{j=1}^k Σ_{v∉V} n_vj
11     score(X ∈ V) ← Gain(D, D_Y, D_N)  // use Eq. (19.5)
12     if score(X ∈ V) > score* then
13         V* ← V; score* ← score(X ∈ V)
14 return (V*, score*)
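For comparison, a Python sketch of categorical split evaluation in the spirit of Algorithm 19.3 follows, restricted to the singleton case l = 1 discussed above and reusing the earlier gain function; names are illustrative.

import numpy as np

def evaluate_categorical_attribute(x, y):
    """Best singleton split X in {v} (the l = 1 case of Algorithm 19.3)."""
    values = np.unique(x)
    classes = np.unique(y)
    if len(values) < 2:
        return None, 0.0                       # no non-trivial split exists
    # n_vi: number of points with value v and class c_i
    nvi = {v: np.array([np.sum((x == v) & (y == c)) for c in classes], dtype=float)
           for v in values}
    total = sum(nvi.values())                  # n_i per class
    best_V, best_score = None, 0.0
    for v in values:                           # splits of the form X in {v}
        score = gain(nvi[v], total - nvi[v])   # Eq. (19.5), from the earlier sketch
        if score > best_score:
            best_V, best_score = {v}, score
    return best_V, best_score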
Example 19.5 (Categorical Attributes). Consider the 2-dimensional Iris dataset comprising the sepal length and sepal width attributes. Let us assume that sepal length has been discretized as shown in Table 19.1. The class frequencies n_vi are also shown. For instance, n_{a_1 2} = 6 denotes the fact that there are 6 points in D with value v = a_1 and class c_2.
Consider the split point X_1 ∈ {a_1, a_3}. From Table 19.1 we can compute the class PMF for partition D_Y using Eq. (19.21):

$$\hat{P}(c_1|D_Y) = \frac{n_{a_1 1} + n_{a_3 1}}{(n_{a_1 1} + n_{a_3 1}) + (n_{a_1 2} + n_{a_3 2})} = \frac{39+0}{(39+0)+(6+43)} = 0.443$$
$$\hat{P}(c_2|D_Y) = 1 - \hat{P}(c_1|D_Y) = 0.557$$

with the entropy given as

$$H(D_Y) = -(0.443\log_2 0.443 + 0.557\log_2 0.557) = 0.991$$
To compute the class PMF for D_N [Eq. (19.22)], we sum up the frequencies over values v ∉ V = {a_1, a_3}, that is, we sum over v = a_2 and v = a_4, as follows:

$$\hat{P}(c_1|D_N) = \frac{n_{a_2 1} + n_{a_4 1}}{(n_{a_2 1} + n_{a_4 1}) + (n_{a_2 2} + n_{a_4 2})} = \frac{11+0}{(11+0)+(39+12)} = 0.177$$
$$\hat{P}(c_2|D_N) = 1 - \hat{P}(c_1|D_N) = 0.823$$
attribute is O(n log n), where n = |D| is the size of the dataset. Given D, the decision tree algorithm evaluates all d attributes, with cost O(dn log n). The total cost depends on the depth of the decision tree. In the worst case, the tree can have depth n, and thus the total cost is O(dn^2 log n).
19.3 FURTHER READING

Among the earliest works on decision trees are Hunt, Marin, and Stone (1966); Breiman et al. (1984); and Quinlan (1986). The description in this chapter is largely based on the C4.5 method described in Quinlan (1993), which is an excellent reference for further details, such as how to prune decision trees to prevent overfitting, how to handle missing attribute values, and other implementation issues. A survey of methods for simplifying decision trees appears in Breslow and Aha (1997). Scalable implementation techniques are described in Mehta, Agrawal, and Rissanen (1996) and Gehrke et al. (1999).
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. (1984). Classification and Regression Trees. Boca Raton, FL: Chapman and Hall/CRC Press.
Breslow, L. A. and Aha, D. W. (1997). "Simplifying decision trees: A survey." Knowledge Engineering Review, 12(1): 1–40.
Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.-Y. (1999). "BOAT-optimistic decision tree construction." ACM SIGMOD Record, 28(2): 169–180.
Hunt, E. B., Marin, J., and Stone, P. J. (1966). Experiments in Induction. New York: Academic Press.
Mehta, M., Agrawal, R., and Rissanen, J. (1996). "SLIQ: A fast scalable classifier for data mining." In Proceedings of the International Conference on Extending Database Technology (pp. 18–32). New York: Springer-Verlag.
Quinlan, J. R. (1986). "Induction of decision trees." Machine Learning, 1(1): 81–106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. New York: Morgan Kaufmann.
19.4 EXERCISES

Q1. True or False:
(a) High entropy means that the partitions in classification are "pure."
(b) Multiway split of a categorical attribute generally results in more pure partitions than a binary split.

Q2. Given Table 19.3, construct a decision tree using a purity threshold of 100%. Use information gain as the split point evaluation measure. Next, classify the point (Age = 27, Car = Vintage).

Q3. What is the maximum and minimum value of the CART measure [Eq. (19.7)], and under what conditions?
CHAPTER 20
Linear Discriminant Analysis
Given labeled data consisting of $d$-dimensional points $x_i$ along with their classes $y_i$, the goal of linear discriminant analysis (LDA) is to find a vector $w$ that maximizes the separation between the classes after projection onto $w$. Recall from Chapter 7 that the first principal component is the vector that maximizes the projected variance of the points. The key difference between principal component analysis and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.
20.1 OPTIMAL LINEAR DISCRIMINANT

Let us assume that the dataset $D$ consists of $n$ labeled points $\{x_i, y_i\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{c_1, c_2, \ldots, c_k\}$. Let $D_i$ denote the subset of points labeled with class $c_i$, i.e., $D_i = \{x_j \mid y_j = c_i\}$, and let $|D_i| = n_i$ denote the number of points with class $c_i$. We assume that there are only $k = 2$ classes. Thus, the dataset $D$ can be partitioned into $D_1$ and $D_2$.

Let $w$ be a unit vector, that is, $w^T w = 1$. By Eq. (1.7), the projection of any $d$-dimensional point $x_i$ onto the vector $w$ is given as

$$x'_i = \left(\frac{w^T x_i}{w^T w}\right) w = (w^T x_i)\, w = a_i\, w$$

where $a_i$ specifies the offset or coordinate of $x'_i$ along the line $w$:

$$a_i = w^T x_i$$

Thus, the set of $n$ scalars $\{a_1, a_2, \ldots, a_n\}$ represents the mapping from $\mathbb{R}^d$ to $\mathbb{R}$, that is, from the original $d$-dimensional space to a 1-dimensional space (along $w$).
Example 20.1. Consider Figure 20.1, which shows the 2-dimensional Iris dataset with sepal length and sepal width as the attributes, and iris-setosa as class $c_1$ (circles), and the other two Iris types as class $c_2$ (triangles). There are $n_1 = 50$ points in $c_1$ and $n_2 = 100$ points in $c_2$. One possible vector $w$ is shown, along with the projection of all the points onto $w$. The projected means of the two classes are shown in black. Here $w$ has been translated so that it passes through the mean of the entire data. One can observe that $w$ is not very good in discriminating between the two classes because the projection of the points onto $w$ are all mixed up in terms of their class labels. The optimal linear discriminant direction is shown in Figure 20.2.

[Figure 20.1. Projection onto $w$: scatter plot of sepal width versus sepal length, with all points projected onto a candidate direction $w$.]
Each point coordinate $a_i$ has associated with it the original class label $y_i$, and thus we can compute, for each of the two classes, the mean of the projected points as follows:

$$m_1 = \frac{1}{n_1} \sum_{x_i \in D_1} a_i = \frac{1}{n_1} \sum_{x_i \in D_1} w^T x_i = w^T \left(\frac{1}{n_1} \sum_{x_i \in D_1} x_i\right) = w^T \mu_1$$

where $\mu_1$ is the mean of all points in $D_1$. Likewise, we can obtain

$$m_2 = w^T \mu_2$$

In other words, the mean of the projected points is the same as the projection of the mean.
[Figure 20.2. Linear discriminant direction $w$.]
To maximize the separation between the classes, it seems reasonable to maximize the difference between the projected means, $|m_1 - m_2|$. However, this is not enough. For good separation, the variance of the projected points for each class should also not be too large. A large variance would lead to possible overlaps among the points of the two classes due to the large spread of the points, and thus we may fail to have a good separation. LDA maximizes the separation by ensuring that the scatter $s_i^2$ for the projected points within each class is small, where scatter is defined as

$$s_i^2 = \sum_{x_j \in D_i} (a_j - m_i)^2$$

Scatter is the total squared deviation from the mean, as opposed to the variance, which is the average deviation from the mean. In other words,

$$s_i^2 = n_i\, \sigma_i^2$$

where $n_i = |D_i|$ is the size, and $\sigma_i^2$ is the variance, for class $c_i$.

We can incorporate the two LDA criteria, namely, maximizing the distance between projected means and minimizing the sum of projected scatter, into a single maximization criterion called the Fisher LDA objective:

$$\max_{w} J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} \tag{20.1}$$
The goal of LDA is to find the vector $w$ that maximizes $J(w)$, that is, the direction that maximizes the separation between the two means $m_1$ and $m_2$, and minimizes the total scatter $s_1^2 + s_2^2$ of the two classes. The vector $w$ is also called the optimal linear discriminant (LD). The optimization objective [Eq. (20.1)] is in the projected space. To solve it, we have to rewrite it in terms of the input data, as described next.

Note that we can rewrite $(m_1 - m_2)^2$ as follows:

$$(m_1 - m_2)^2 = \left(w^T (\mu_1 - \mu_2)\right)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T B w \tag{20.2}$$

where $B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is a $d \times d$ rank-one matrix called the between-class scatter matrix.
As for the projected scatter for class $c_1$, we can compute it as follows:

$$s_1^2 = \sum_{x_i \in D_1} (a_i - m_1)^2 = \sum_{x_i \in D_1} (w^T x_i - w^T \mu_1)^2 = \sum_{x_i \in D_1} \left(w^T (x_i - \mu_1)\right)^2 = w^T \left(\sum_{x_i \in D_1} (x_i - \mu_1)(x_i - \mu_1)^T\right) w = w^T S_1 w \tag{20.3}$$

where $S_1$ is the scatter matrix for $D_1$. Likewise, we can obtain

$$s_2^2 = w^T S_2 w \tag{20.4}$$

Notice again that the scatter matrix is essentially the same as the covariance matrix, but instead of recording the average deviation from the mean, it records the total deviation, that is,

$$S_i = n_i\, \Sigma_i \tag{20.5}$$

where $\Sigma_i$ denotes the covariance matrix for class $c_i$.
Combining Eqs. (20.3) and (20.4), the denominator in Eq. (20.1) can be rewritten as

$$s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S w \tag{20.6}$$

where $S = S_1 + S_2$ denotes the within-class scatter matrix for the pooled data. Because both $S_1$ and $S_2$ are $d \times d$ symmetric positive semidefinite matrices, $S$ has the same properties.
Using Eqs. (20.2) and (20.6), we write the LDA objective function [Eq. (20.1)] as follows:

$$\max_{w} J(w) = \frac{w^T B w}{w^T S w} \tag{20.7}$$
To solve for the best direction $w$, we differentiate the objective function with respect to $w$, and set the result to zero. We do not explicitly have to deal with the constraint that $w^T w = 1$ because in Eq. (20.7) the terms related to the magnitude of $w$ cancel out in the numerator and the denominator.

Recall that if $f(x)$ and $g(x)$ are two functions then we have

$$\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)\,g(x) - g'(x)\,f(x)}{g(x)^2}$$

where $f'(x)$ denotes the derivative of $f(x)$. Taking the derivative of Eq. (20.7) with respect to the vector $w$, and setting the result to the zero vector, gives us

$$\frac{d}{d w} J(w) = \frac{2 B w\,(w^T S w) - 2 S w\,(w^T B w)}{(w^T S w)^2} = 0$$

which yields

$$B w\,(w^T S w) = S w\,(w^T B w)$$
$$B w = S w\, \frac{w^T B w}{w^T S w}$$
$$B w = J(w)\, S w$$
$$B w = \lambda\, S w \tag{20.8}$$
where $\lambda = J(w)$. Eq. (20.8) represents a generalized eigenvalue problem where $\lambda$ is a generalized eigenvalue of $B$ and $S$; the eigenvalue $\lambda$ satisfies the equation $\det(B - \lambda S) = 0$. Because the goal is to maximize the objective [Eq. (20.7)], $J(w) = \lambda$ should be chosen to be the largest generalized eigenvalue, and $w$ to be the corresponding eigenvector. If $S$ is nonsingular, that is, if $S^{-1}$ exists, then Eq. (20.8) leads to the regular eigenvalue–eigenvector equation, as

$$B w = \lambda S w$$
$$S^{-1} B w = \lambda S^{-1} S w$$
$$\left(S^{-1} B\right) w = \lambda w \tag{20.9}$$
Thus, if $S^{-1}$ exists, then $\lambda = J(w)$ is an eigenvalue, and $w$ is an eigenvector of the matrix $S^{-1}B$. To maximize $J(w)$ we look for the largest eigenvalue $\lambda$, and the corresponding dominant eigenvector $w$ specifies the best linear discriminant vector.

Algorithm 20.1 shows the pseudo-code for linear discriminant analysis. Here, we assume that there are two classes, and that $S$ is nonsingular (i.e., $S^{-1}$ exists). The vector $\mathbf{1}_{n_i}$ is the vector of all ones, with the appropriate dimension for each class, i.e., $\mathbf{1}_{n_i} \in \mathbb{R}^{n_i}$ for class $i = 1, 2$. After dividing $D$ into the two groups $D_1$ and $D_2$, LDA proceeds to compute the between-class and within-class scatter matrices, $B$ and $S$. The optimal LD vector is obtained as the dominant eigenvector of $S^{-1}B$. In terms of computational complexity, computing $S$ takes $O(nd^2)$ time, and computing the dominant eigenvalue–eigenvector pair takes $O(d^3)$ time in the worst case. Thus, the total time is $O(d^3 + nd^2)$.
ALGORITHM 20.1. Linear Discriminant Analysis

LINEARDISCRIMINANT(D = {(x_i, y_i)}_{i=1}^n):
1   D_i ← {x_j | y_j = c_i, j = 1,...,n}, i = 1, 2  // class-specific subsets
2   μ_i ← mean(D_i), i = 1, 2  // class means
3   B ← (μ_1 − μ_2)(μ_1 − μ_2)^T  // between-class scatter matrix
4   Z_i ← D_i − 1_{n_i} μ_i^T, i = 1, 2  // center class matrices
5   S_i ← Z_i^T Z_i, i = 1, 2  // class scatter matrices
6   S ← S_1 + S_2  // within-class scatter matrix
7   (λ_1, w) ← eigen(S^{-1} B)  // compute dominant eigenvector
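As a concrete companion to Algorithm 20.1, here is a minimal NumPy sketch for the two-class case; the function name linear_discriminant and the label handling are illustrative assumptions, not from the text.

    import numpy as np

    def linear_discriminant(X, y):
        """Two-class LDA: return the dominant eigenvector of S^{-1} B.

        X: (n, d) data matrix; y: length-n labels with two distinct values.
        """
        c1, c2 = np.unique(y)
        D1, D2 = X[y == c1], X[y == c2]
        mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
        d = (mu1 - mu2).reshape(-1, 1)
        B = d @ d.T                           # between-class scatter (rank one)
        Z1, Z2 = D1 - mu1, D2 - mu2           # center each class
        S = Z1.T @ Z1 + Z2.T @ Z2             # within-class scatter S = S1 + S2
        evals, evecs = np.linalg.eig(np.linalg.inv(S) @ B)
        w = np.real(evecs[:, np.argmax(np.real(evals))])
        return w / np.linalg.norm(w)          # normalize to unit length

In the two-class case the same direction can also be obtained directly as $S^{-1}(\mu_1 - \mu_2)$, as derived in Eq. (20.10) below.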
Example 20.2 (Linear Discriminant Analysis). Consider the 2-dimensional Iris data (with attributes sepal length and sepal width) shown in Example 20.1. Class $c_1$, corresponding to iris-setosa, has $n_1 = 50$ points, whereas the other class $c_2$ has $n_2 = 100$ points. The means for the two classes $c_1$ and $c_2$, and their difference, are given as

$$\mu_1 = \begin{pmatrix} 5.01 \\ 3.42 \end{pmatrix} \qquad \mu_2 = \begin{pmatrix} 6.26 \\ 2.87 \end{pmatrix} \qquad \mu_1 - \mu_2 = \begin{pmatrix} -1.256 \\ 0.546 \end{pmatrix}$$

The between-class scatter matrix is

$$B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix} -1.256 \\ 0.546 \end{pmatrix} \begin{pmatrix} -1.256 & 0.546 \end{pmatrix} = \begin{pmatrix} 1.587 & -0.693 \\ -0.693 & 0.303 \end{pmatrix}$$

and the within-class scatter matrix is

$$S_1 = \begin{pmatrix} 6.09 & 4.91 \\ 4.91 & 7.11 \end{pmatrix} \qquad S_2 = \begin{pmatrix} 43.5 & 12.09 \\ 12.09 & 10.96 \end{pmatrix} \qquad S = S_1 + S_2 = \begin{pmatrix} 49.58 & 17.01 \\ 17.01 & 18.08 \end{pmatrix}$$

$S$ is nonsingular, with its inverse given as

$$S^{-1} = \begin{pmatrix} 0.0298 & -0.028 \\ -0.028 & 0.0817 \end{pmatrix}$$

Therefore, we have

$$S^{-1}B = \begin{pmatrix} 0.0298 & -0.028 \\ -0.028 & 0.0817 \end{pmatrix} \begin{pmatrix} 1.587 & -0.693 \\ -0.693 & 0.303 \end{pmatrix} = \begin{pmatrix} 0.066 & -0.029 \\ -0.100 & 0.044 \end{pmatrix}$$

The direction of most separation between $c_1$ and $c_2$ is the dominant eigenvector corresponding to the largest eigenvalue of the matrix $S^{-1}B$. The solution is

$$J(w) = \lambda_1 = 0.11 \qquad w = \begin{pmatrix} 0.551 \\ -0.834 \end{pmatrix}$$

Figure 20.2 plots the optimal linear discriminant direction $w$, translated to the mean of the data. The projected means for the two classes are shown in black. We can clearly observe that along $w$ the circles appear together as a group, and are quite well separated from the triangles. Except for one outlying circle corresponding to the point $(4.5, 2.3)^T$, all points in $c_1$ are perfectly separated from points in $c_2$.
For the two-class scenario, if $S$ is nonsingular, we can directly solve for $w$ without computing the eigenvalues and eigenvectors. Note that $B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is a $d \times d$ rank-one matrix, and thus $Bw$ must point in the same direction as $(\mu_1 - \mu_2)$ because

$$B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\left((\mu_1 - \mu_2)^T w\right) = b\,(\mu_1 - \mu_2)$$

where $b = (\mu_1 - \mu_2)^T w$ is just a scalar multiplier.

We can then rewrite Eq. (20.9) as

$$B w = \lambda S w$$
$$b\,(\mu_1 - \mu_2) = \lambda S w$$
$$w = \frac{b}{\lambda}\, S^{-1} (\mu_1 - \mu_2)$$

Because $\frac{b}{\lambda}$ is just a scalar, we can solve for the best linear discriminant as

$$w = S^{-1}(\mu_1 - \mu_2) \tag{20.10}$$

Once the direction $w$ has been found we can normalize it to be a unit vector. Thus, instead of solving for the eigenvalue/eigenvector, in the two-class case we immediately obtain the direction $w$ using Eq. (20.10). Intuitively, the direction that maximizes the separation between the classes can be viewed as a linear transformation (by $S^{-1}$) of the vector joining the two class means $(\mu_1 - \mu_2)$.
Example 20.3. Continuing Example 20.2, we can directly compute $w$ as follows:

$$w = S^{-1}(\mu_1 - \mu_2) = \begin{pmatrix} 0.0298 & -0.028 \\ -0.028 & 0.0817 \end{pmatrix} \begin{pmatrix} -1.256 \\ 0.546 \end{pmatrix} = \begin{pmatrix} -0.0527 \\ 0.0798 \end{pmatrix}$$

After normalizing, we have

$$w = \frac{w}{\|w\|} = \frac{1}{0.0956} \begin{pmatrix} -0.0527 \\ 0.0798 \end{pmatrix} = \begin{pmatrix} -0.551 \\ 0.834 \end{pmatrix}$$

Note that even though the sign is reversed for $w$, compared to that in Example 20.2, they represent the same direction; only the scalar multiplier is different.
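As a quick check, the computation in Examples 20.2 and 20.3 can be reproduced in a few lines of NumPy (a sketch using the rounded values quoted above):

    import numpy as np

    S_inv = np.array([[0.0298, -0.028],
                      [-0.028,  0.0817]])   # inverse within-class scatter
    diff = np.array([-1.256, 0.546])        # mu1 - mu2
    w = S_inv @ diff                        # Eq. (20.10): w = S^{-1}(mu1 - mu2)
    w_unit = w / np.linalg.norm(w)
    print(w, w_unit)  # approx [-0.0527, 0.0798] and [-0.551, 0.834]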
20.2 KERNEL DISCRIMINANT ANALYSIS
Kernel discriminant analysis, like linear discriminant analysis, tries to find a direction that maximizes the separation between the classes. However, it does so in feature space via the use of kernel functions.

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a point in input space and $y_i \in \{c_1, c_2\}$ is the class label, let $D_i = \{x_j \mid y_j = c_i\}$ denote the data subset restricted to class $c_i$, and let $n_i = |D_i|$. Further, let $\phi(x_i)$ denote the corresponding point in feature space, and let $K$ be a kernel function.

The goal of kernel LDA is to find the direction vector $w$ in feature space that maximizes

$$\max_{w} J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} \tag{20.11}$$

where $m_1$ and $m_2$ are the projected means, and $s_1^2$ and $s_2^2$ are the projected scatter values in feature space. We first show that $w$ can be expressed as a linear combination of the points in feature space, and then we transform the LDA objective in terms of the kernel matrix.
Optimal LD: Linear Combination of Feature Points
The mean for class $c_i$ in feature space is given as

$$\mu_i^\phi = \frac{1}{n_i} \sum_{x_j \in D_i} \phi(x_j) \tag{20.12}$$

and the covariance matrix for class $c_i$ in feature space is

$$\Sigma_i^\phi = \frac{1}{n_i} \sum_{x_j \in D_i} \left(\phi(x_j) - \mu_i^\phi\right)\left(\phi(x_j) - \mu_i^\phi\right)^T$$

Using a derivation similar to Eq. (20.2) we obtain an expression for the between-class scatter matrix in feature space

$$B_\phi = \left(\mu_1^\phi - \mu_2^\phi\right)\left(\mu_1^\phi - \mu_2^\phi\right)^T = d_\phi\, d_\phi^T \tag{20.13}$$

where $d_\phi = \mu_1^\phi - \mu_2^\phi$ is the difference between the two class mean vectors. Likewise, using Eqs. (20.5) and (20.6) the within-class scatter matrix in feature space is given as

$$S_\phi = n_1 \Sigma_1^\phi + n_2 \Sigma_2^\phi$$

$S_\phi$ is a $d \times d$ symmetric, positive semidefinite matrix, where $d$ is the dimensionality of the feature space. From Eq. (20.9), we conclude that the best linear discriminant vector $w$ in feature space is the dominant eigenvector, which satisfies the expression

$$\left(S_\phi^{-1} B_\phi\right) w = \lambda w \tag{20.14}$$

where we assume that $S_\phi$ is nonsingular. Let $\delta_i$ denote the $i$th eigenvalue and $u_i$ the $i$th eigenvector of $S_\phi$, for $i = 1, \ldots, d$. The eigen-decomposition of $S_\phi$ yields $S_\phi = U \Delta U^T$, with the inverse of $S_\phi$ given as $S_\phi^{-1} = U \Delta^{-1} U^T$. Here $U$ is the matrix whose columns are the eigenvectors of $S_\phi$ and $\Delta$ is the diagonal matrix of eigenvalues of $S_\phi$. The inverse $S_\phi^{-1}$ can thus be expressed as the spectral sum

$$S_\phi^{-1} = \sum_{r=1}^{d} \frac{1}{\delta_r}\, u_r u_r^T \tag{20.15}$$
Plugging Eqs. (20.13) and (20.15) into Eq. (20.14), we obtain

$$\lambda w = \sum_{r=1}^{d} \frac{1}{\delta_r}\, u_r u_r^T\, d_\phi d_\phi^T\, w = \sum_{r=1}^{d} \frac{1}{\delta_r}\, u_r \left(u_r^T d_\phi\right)\left(d_\phi^T w\right) = \sum_{r=1}^{d} b_r u_r$$

where $b_r = \frac{1}{\delta_r}\left(u_r^T d_\phi\right)\left(d_\phi^T w\right)$ is a scalar value. Using a derivation similar to that in Eq. (7.32), the $r$th eigenvector of $S_\phi$ can be expressed as a linear combination of the feature points, say $u_r = \sum_{j=1}^{n} c_{rj}\, \phi(x_j)$, where $c_{rj}$ is a scalar coefficient. Thus, we can rewrite $w$ as

$$w = \frac{1}{\lambda} \sum_{r=1}^{d} b_r \left(\sum_{j=1}^{n} c_{rj}\, \phi(x_j)\right) = \sum_{j=1}^{n} \phi(x_j) \left(\sum_{r=1}^{d} \frac{b_r c_{rj}}{\lambda}\right) = \sum_{j=1}^{n} a_j\, \phi(x_j)$$

where $a_j = \sum_{r=1}^{d} b_r c_{rj} / \lambda$ is a scalar value for the feature point $\phi(x_j)$. Therefore, the direction vector $w$ can be expressed as a linear combination of the points in feature space.
LDA Objective via Kernel Matrix
We now rewrite the kernel LDA objective [Eq. (20.11)] in terms of the kernel matrix. Projecting the mean for class $c_i$ given in Eq. (20.12) onto the LD direction $w$, we have

$$m_i = w^T \mu_i^\phi = \left(\sum_{j=1}^{n} a_j \phi(x_j)\right)^T \left(\frac{1}{n_i} \sum_{x_k \in D_i} \phi(x_k)\right) = \frac{1}{n_i} \sum_{j=1}^{n} \sum_{x_k \in D_i} a_j\, \phi(x_j)^T \phi(x_k) = \frac{1}{n_i} \sum_{j=1}^{n} \sum_{x_k \in D_i} a_j\, K(x_j, x_k) = a^T m_i \tag{20.16}$$

where $a = (a_1, a_2, \ldots, a_n)^T$ is the weight vector, and

$$m_i = \frac{1}{n_i} \begin{pmatrix} \sum_{x_k \in D_i} K(x_1, x_k) \\ \sum_{x_k \in D_i} K(x_2, x_k) \\ \vdots \\ \sum_{x_k \in D_i} K(x_n, x_k) \end{pmatrix} = \frac{1}{n_i}\, K^{c_i}\, \mathbf{1}_{n_i} \tag{20.17}$$

where $K^{c_i}$ is the $n \times n_i$ subset of the kernel matrix, restricted to columns belonging to points only in $D_i$, and $\mathbf{1}_{n_i}$ is the $n_i$-dimensional vector all of whose entries are one. The $n$-length vector $m_i$ thus stores for each point in $D$ its average kernel value with respect to the points in $D_i$.
We can rewrite the separation between the projected means in feature space as follows:

$$(m_1 - m_2)^2 = \left(w^T \mu_1^\phi - w^T \mu_2^\phi\right)^2 = \left(a^T m_1 - a^T m_2\right)^2 = a^T (m_1 - m_2)(m_1 - m_2)^T a = a^T M a \tag{20.18}$$

where $M = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter matrix.
We can also compute the projected scatter for each class, $s_1^2$ and $s_2^2$, purely in terms of the kernel function, as

$$s_1^2 = \sum_{x_i \in D_1} \left(w^T \phi(x_i) - w^T \mu_1^\phi\right)^2 = \sum_{x_i \in D_1} \left(w^T \phi(x_i)\right)^2 - 2 \sum_{x_i \in D_1} w^T \phi(x_i) \cdot w^T \mu_1^\phi + \sum_{x_i \in D_1} \left(w^T \mu_1^\phi\right)^2$$

$$= \sum_{x_i \in D_1} \left(\sum_{j=1}^{n} a_j\, \phi(x_j)^T \phi(x_i)\right)^2 - 2 \cdot n_1 \cdot \left(w^T \mu_1^\phi\right)^2 + n_1 \cdot \left(w^T \mu_1^\phi\right)^2$$

$$= \sum_{x_i \in D_1} a^T K_i K_i^T a - n_1 \cdot a^T m_1 m_1^T a \qquad \text{by using Eq. (20.16)}$$

$$= a^T \left(\sum_{x_i \in D_1} K_i K_i^T - n_1 m_1 m_1^T\right) a = a^T N_1 a$$

where $K_i$ is the $i$th column of the kernel matrix, and $N_1$ is the class scatter matrix for $c_1$. Let $K(x_i, x_j) = K_{ij}$. We can express $N_1$ more compactly in matrix notation as follows:

$$N_1 = \sum_{x_i \in D_1} K_i K_i^T - n_1 m_1 m_1^T = K^{c_1} \left(I_{n_1} - \frac{1}{n_1}\, \mathbf{1}_{n_1 \times n_1}\right) \left(K^{c_1}\right)^T \tag{20.19}$$

where $I_{n_1}$ is the $n_1 \times n_1$ identity matrix and $\mathbf{1}_{n_1 \times n_1}$ is the $n_1 \times n_1$ matrix all of whose entries are 1's.
In a similar manner we get $s_2^2 = a^T N_2 a$, where

$$N_2 = K^{c_2} \left(I_{n_2} - \frac{1}{n_2}\, \mathbf{1}_{n_2 \times n_2}\right) \left(K^{c_2}\right)^T$$

where $I_{n_2}$ is the $n_2 \times n_2$ identity matrix and $\mathbf{1}_{n_2 \times n_2}$ is the $n_2 \times n_2$ matrix all of whose entries are 1's.

The sum of projected scatter values is then given as

$$s_1^2 + s_2^2 = a^T (N_1 + N_2)\, a = a^T N a \tag{20.20}$$

where $N$ is the $n \times n$ within-class scatter matrix.
Substituting Eqs. (20.18) and (20.20) in Eq. (20.11), we obtain the kernel LDA maximization condition

$$\max_{w} J(w) = \max_{a} J(a) = \frac{a^T M a}{a^T N a}$$

Notice how all the terms in the expression above involve only kernel functions. The weight vector $a$ is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem:

$$M a = \lambda_1 N a \tag{20.21}$$

If $N$ is nonsingular, $a$ is the dominant eigenvector corresponding to the largest eigenvalue for the system

$$\left(N^{-1} M\right) a = \lambda_1 a$$

As in the case of linear discriminant analysis [Eq. (20.10)], when there are only two classes we do not have to solve for the eigenvector because $a$ can be obtained directly:

$$a = N^{-1}(m_1 - m_2)$$

Once $a$ has been obtained, we can normalize $w$ to be a unit vector by ensuring that $w^T w = 1$, which implies that $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, \phi(x_i)^T \phi(x_j) = 1$, or

$$a^T K a = 1$$

Put differently, we can ensure that $w$ is a unit vector if we scale $a$ by $\frac{1}{\sqrt{a^T K a}}$.
Finally, we can project any point $x$ onto the discriminant direction, as follows:

$$w^T \phi(x) = \sum_{j=1}^{n} a_j\, \phi(x_j)^T \phi(x) = \sum_{j=1}^{n} a_j\, K(x_j, x) \tag{20.22}$$
Algorithm 20.2 shows the pseudo-code for kernel discriminant analysis. The method proceeds by computing the $n \times n$ kernel matrix $K$, and the $n \times n_i$ class-specific kernel matrices $K^{c_i}$ for each class $c_i$. After computing the between-class and within-class scatter matrices $M$ and $N$, the weight vector $a$ is obtained as the dominant eigenvector of $N^{-1}M$. The last step scales $a$ so that $w$ will be normalized to be unit length. The complexity of kernel discriminant analysis is $O(n^3)$, with the dominant steps being the computation of $N$ and solving for the dominant eigenvector of $N^{-1}M$, both of which take $O(n^3)$ time.
ALGORITHM 20.2. Kernel Discriminant Analysis

KERNELDISCRIMINANT(D = {(x_i, y_i)}_{i=1}^n, K):
1   K ← {K(x_i, x_j)}_{i,j=1,...,n}  // compute n × n kernel matrix
2   K^{c_i} ← {K(j, k) | y_k = c_i, 1 ≤ j, k ≤ n}, i = 1, 2  // class kernel matrix
3   m_i ← (1/n_i) K^{c_i} 1_{n_i}, i = 1, 2  // class means
4   M ← (m_1 − m_2)(m_1 − m_2)^T  // between-class scatter matrix
5   N_i ← K^{c_i} (I_{n_i} − (1/n_i) 1_{n_i×n_i})(K^{c_i})^T, i = 1, 2  // class scatter matrices
6   N ← N_1 + N_2  // within-class scatter matrix
7   (λ_1, a) ← eigen(N^{-1} M)  // compute weight vector
8   a ← a / √(a^T K a)  // normalize w to be unit vector
Example 20.4 (Kernel Discriminant Analysis). Consider the 2-dimensional Iris dataset comprising the sepal length and sepal width attributes. Figure 20.3a shows the points projected onto the first two principal components. The points have been divided into two classes: $c_1$ (circles) corresponds to iris-virginica and $c_2$ (triangles) corresponds to the other two Iris types. Here $n_1 = 50$ and $n_2 = 100$, with a total of $n = 150$ points.

Because $c_1$ is surrounded by points in $c_2$, a good linear discriminant will not be found. Instead, we apply kernel discriminant analysis using the homogeneous quadratic kernel

$$K(x_i, x_j) = \left(x_i^T x_j\right)^2$$

Solving for $a$ via Eq. (20.21) yields $\lambda_1 = 0.0511$. However, we do not show $a$ because it lies in $\mathbb{R}^{150}$. Figure 20.3a shows the contours of constant projections onto the best kernel discriminant. The contours are obtained by solving Eq. (20.22), that is, by solving $w^T \phi(x) = \sum_{j=1}^{n} a_j K(x_j, x) = c$ for different values of the scalars $c$. The contours are hyperbolic, and thus form pairs starting from the center. For instance, the first curve on the left and right of the origin $(0, 0)^T$ forms the same contour, that is, points along both the curves have the same value when projected onto $w$. We can see that contours or pairs of curves starting with the fourth curve (on the left and right) from the center all relate to class $c_2$, whereas the first three contours deal mainly with class $c_1$, indicating good discrimination with the homogeneous quadratic kernel.
[Figure 20.3. Kernel discriminant analysis: quadratic homogeneous kernel. Panel (a) shows the points in the $(u_1, u_2)$ principal component space with contours of constant projection; panel (b) shows the points projected onto $w$.]
A better picture emerges when we plot the coordinates of all the points $x_i \in D$ when projected onto $w$, as shown in Figure 20.3b. We can observe that $w$ is able to separate the two classes reasonably well; all the circles ($c_1$) are concentrated on the left, whereas the triangles ($c_2$) are spread out on the right. The projected means are shown in white. The scatters and means for both classes after projection are as follows:

$$m_1 = 0.338 \qquad m_2 = 4.476$$
$$s_1^2 = 13.862 \qquad s_2^2 = 320.934$$

The value of $J(w)$ is given as

$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{(0.338 - 4.476)^2}{13.862 + 320.934} = \frac{17.123}{334.796} = 0.0511$$

which, as expected, matches $\lambda_1 = 0.0511$ from above.
In general, it is not desirable or possible to obtain an explicit discriminant vector $w$, since it lies in feature space. However, because each point $x = (x_1, x_2)^T \in \mathbb{R}^2$ in input space is mapped to the point $\phi(x) = \left(\sqrt{2}\,x_1 x_2,\; x_1^2,\; x_2^2\right)^T \in \mathbb{R}^3$ in feature space via the homogeneous quadratic kernel, for our example it is possible to visualize the feature space, as illustrated in Figure 20.4. The projection of each point $\phi(x_i)$ onto the discriminant vector $w$ is also shown, where

$$w = 0.511\, x_1 x_2 + 0.761\, x_1^2 - 0.4\, x_2^2$$

The projections onto $w$ are identical to those shown in Figure 20.3b.

[Figure 20.4. Homogeneous quadratic kernel feature space, with axes $X_1 X_2$, $X_1^2$, and $X_2^2$.]
20.3 FURTHER READING

Linear discriminant analysis was introduced in Fisher (1936). Its extension to kernel discriminant analysis was proposed in Mika et al. (1999). The 2-class LDA approach can be generalized to $k > 2$ classes by finding the optimal $(k-1)$-dimensional subspace projection that best discriminates between the $k$ classes; see Duda, Hart, and Stork (2012) for details.

Duda, R. O., Hart, P. E., and Stork, D. G. (2012). Pattern Classification. New York: Wiley-Interscience.
Fisher, R. A. (1936). "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2): 179–188.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Mullers, K. (1999). "Fisher discriminant analysis with kernels." In Proceedings of the IEEE Neural Networks for Signal Processing Workshop, IEEE, pp. 41–48.
20.4 EXERCISES

Q1. Consider the data shown in Table 20.1. Answer the following questions:
(a) Compute $\mu_{+1}$ and $\mu_{-1}$, and $S_B$, the between-class scatter matrix.
(b) Compute $S_{+1}$ and $S_{-1}$, and $S_W$, the within-class scatter matrix.
(c) Find the best direction $w$ that discriminates between the classes. Use the fact that the inverse of the matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is given as $A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$.
(d) Having found the direction $w$, find the point on $w$ that best separates the two classes.

Table 20.1. Dataset for Q1

    i      x_i        y_i
    x_1    (4, 2.9)    1
    x_2    (3.5, 4)    1
    x_3    (2.5, 1)   −1
    x_4    (2, 2.1)   −1

Q2. Given the labeled points (from two classes) shown in Figure 20.5, and given that the inverse of the within-class scatter matrix is

$$\begin{pmatrix} 0.056 & -0.029 \\ -0.029 & 0.052 \end{pmatrix}$$

find the best linear discriminant line $w$, and sketch it.

[Figure 20.5. Dataset for Q2.]

Q3. Maximize the objective in Eq. (20.7) by explicitly considering the constraint $w^T w = 1$, that is, by using a Lagrange multiplier for that constraint.

Q4. Prove the equality in Eq. (20.19). That is, show that

$$N_1 = \sum_{x_i \in D_1} K_i K_i^T - n_1 m_1 m_1^T = K^{c_1} \left(I_{n_1} - \frac{1}{n_1}\, \mathbf{1}_{n_1 \times n_1}\right) \left(K^{c_1}\right)^T$$
CHAPTER 21
Support Vector Machines
In this chapter we describe Support Vector Machines (SVMs), a classification method
based on maximum margin linear discriminants, that is, the goal is to find the optimal
hyperplane that maximizes the gap or margin between the classes. Further, we can use
the kernel trick to find the optimal nonlinear decision boundary between classes, which corresponds to a hyperplane in some high-dimensional "nonlinear" space.
21.1 SUPPORT VECTORS AND MARGINS
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be a classification dataset, with $n$ points in a $d$-dimensional space. Further, let us assume that there are only two class labels, that is, $y_i \in \{+1, -1\}$, denoting the positive and negative classes.
Hyperplanes
A hyperplane in $d$ dimensions is given as the set of all points $x \in \mathbb{R}^d$ that satisfy the equation $h(x) = 0$, where $h(x)$ is the hyperplane function, defined as follows:

$$h(x) = w^T x + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b \tag{21.1}$$

Here, $w$ is a $d$-dimensional weight vector and $b$ is a scalar, called the bias. For points that lie on the hyperplane, we have

$$h(x) = w^T x + b = 0 \tag{21.2}$$

The hyperplane is thus defined as the set of all points such that $w^T x = -b$. To see the role played by $b$, assuming that $w_1 \neq 0$, and setting $x_i = 0$ for all $i > 1$, we can obtain the offset where the hyperplane intersects the first axis, as by Eq. (21.2), we have

$$w_1 x_1 = -b \quad \text{or} \quad x_1 = \frac{-b}{w_1}$$

In other words, the point $\left(\frac{-b}{w_1}, 0, \ldots, 0\right)$ lies on the hyperplane. In a similar manner, we can obtain the offset where the hyperplane intersects each of the axes, which is given as $\frac{-b}{w_i}$ (provided $w_i \neq 0$).
Separating Hyperplane
A hyperplane splits the original $d$-dimensional space into two half-spaces. A dataset is said to be linearly separable if each half-space has points only from a single class. If the input dataset is linearly separable, then we can find a separating hyperplane $h(x) = 0$, such that for all points labeled $y_i = -1$, we have $h(x_i) < 0$, and for all points labeled $y_i = +1$, we have $h(x_i) > 0$. In fact, the hyperplane function $h(x)$ serves as a linear classifier or a linear discriminant, which predicts the class $y$ for any given point $x$, according to the decision rule:

$$y = \begin{cases} +1 & \text{if } h(x) > 0 \\ -1 & \text{if } h(x) < 0 \end{cases} \tag{21.3}$$

Let $a_1$ and $a_2$ be two arbitrary points that lie on the hyperplane. From Eq. (21.2) we have

$$h(a_1) = w^T a_1 + b = 0$$
$$h(a_2) = w^T a_2 + b = 0$$

Subtracting one from the other we obtain

$$w^T (a_1 - a_2) = 0$$

This means that the weight vector $w$ is orthogonal to the hyperplane because it is orthogonal to any arbitrary vector $(a_1 - a_2)$ on the hyperplane. In other words, the weight vector $w$ specifies the direction that is normal to the hyperplane, which fixes the orientation of the hyperplane, whereas the bias $b$ fixes the offset of the hyperplane in the $d$-dimensional space. Because both $w$ and $-w$ are normal to the hyperplane, we remove this ambiguity by requiring that $h(x_i) > 0$ when $y_i = 1$, and $h(x_i) < 0$ when $y_i = -1$.
Distance of a Point to the Hyperplane
Consider a point $x \in \mathbb{R}^d$, such that $x$ does not lie on the hyperplane. Let $x_p$ be the orthogonal projection of $x$ on the hyperplane, and let $r = x - x_p$; then as shown in Figure 21.1 we can write $x$ as

$$x = x_p + r = x_p + r\, \frac{w}{\|w\|} \tag{21.4}$$

where $r$ is the directed distance of the point $x$ from $x_p$, that is, $r$ gives the offset of $x$ from $x_p$ in terms of the unit weight vector $\frac{w}{\|w\|}$. The offset $r$ is positive if $r$ is in the same direction as $w$, and $r$ is negative if $r$ is in a direction opposite to $w$.
.
Plugging Eq.(21.4) into the hyperplane function [Eq.(21.1)], we get
h(
x
)
=
h
x
p
+
r
w
w
=
w
T
x
p
+
r
w
w
+
b
516
Support Vector Machines
1
2
3
4
5
1 2 3 4 5
h
(
x
)
=
0
x
x
p
0
r
=
r
w
w
h(
x
) <
0
h(
x
) >
0
w
w
b
w
Figure 21.1.
Geometry of a separating hyperplane in 2D. Points labeled
+
1 are shown as circles, and those
labeled
−
1areshownastriangles. Thehyperplane
h
(
x
)
=
0dividesthespace intotwohalf-spaces. Theshaded
region comprises all points
x
satisfying
h
(
x
)<
0, whereas the unshaded region consists of all points satisfying
h
(
x
) >
0. The unit weight vector
w
w
(in gray) is orthogonal to the hyperplane. The directed distance of the
origin to the hyperplane is
b
w
.
=
w
T
x
p
+
b
h(
x
p
)
+
r
w
T
w
w
=
h(
x
p
)
0
+
r
w
=
r
w
The last step follows from the fact that
h(
x
p
)
=
0 because
x
p
lies on the hyperplane.
Using the result above, we obtain an expression for the directed distance of a point to
the hyperplane:
r
=
h(
x
)
w
To obtain distance, which must be non-negative,we can convenientlymultiply
r
by
the class label
y
of the point because when
h(
x
) <
0, the class is
−
1, and when
h(
x
) >
0
the class is
+
1. The distance of a point
x
from the hyperplane
h(
x
)
=
0 is thus given as
δ
=
y r
=
y h(
x
)
w
(21.5)
In particular, for the origin $x = \mathbf{0}$, the directed distance is

$$r = \frac{h(\mathbf{0})}{\|w\|} = \frac{w^T \mathbf{0} + b}{\|w\|} = \frac{b}{\|w\|}$$

as illustrated in Figure 21.1.
Example 21.1. Consider the example shown in Figure 21.1. In this 2-dimensional example, the hyperplane is just a line, defined as the set of all points $x = (x_1, x_2)^T$ that satisfy the following equation:

$$h(x) = w^T x + b = w_1 x_1 + w_2 x_2 + b = 0$$

Rearranging the terms we get

$$x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{b}{w_2}$$

where $-\frac{w_1}{w_2}$ is the slope of the line, and $-\frac{b}{w_2}$ is the intercept along the second dimension.

Consider any two points on the hyperplane, say $p = (p_1, p_2) = (4, 0)$ and $q = (q_1, q_2) = (2, 5)$. The slope is given as

$$-\frac{w_1}{w_2} = \frac{q_2 - p_2}{q_1 - p_1} = \frac{5 - 0}{2 - 4} = -\frac{5}{2}$$

which implies that $w_1 = 5$ and $w_2 = 2$. Given any point on the hyperplane, say $(4, 0)$, we can compute the offset $b$ directly as follows:

$$b = -5 x_1 - 2 x_2 = -5 \cdot 4 - 2 \cdot 0 = -20$$

Thus, $w = \begin{pmatrix} 5 \\ 2 \end{pmatrix}$ is the weight vector, and $b = -20$ is the bias, and the equation of the hyperplane is given as

$$h(x) = w^T x + b = \begin{pmatrix} 5 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - 20 = 0$$

One can verify that the distance of the origin $\mathbf{0}$ from the hyperplane is given as

$$\delta = y\, r = -1 \cdot r = \frac{-b}{\|w\|} = \frac{-(-20)}{\sqrt{29}} = 3.71$$
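These numbers are easy to verify programmatically; a short NumPy check (illustrative, not from the text):

    import numpy as np

    w, b = np.array([5.0, 2.0]), -20.0
    h = lambda x: w @ x + b                 # hyperplane function, Eq. (21.1)
    print(h(np.array([4.0, 0.0])))          # 0.0: the point lies on the line
    print(h(np.array([2.0, 5.0])))          # 0.0: so does this one
    origin = np.zeros(2)
    delta = -1 * h(origin) / np.linalg.norm(w)   # Eq. (21.5) with y = -1
    print(delta)                            # approx 3.71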
Margin and Support Vectors of a Hyperplane
Given a training dataset of labeled points, $D = \{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{+1, -1\}$, and given a separating hyperplane $h(x) = 0$, for each point $x_i$ we can find its distance to the hyperplane by Eq. (21.5):

$$\delta_i = \frac{y_i\, h(x_i)}{\|w\|} = \frac{y_i (w^T x_i + b)}{\|w\|}$$
Over all the $n$ points, we define the margin of the linear classifier as the minimum distance of a point from the separating hyperplane, given as

$$\delta^* = \min_{x_i} \left\{ \frac{y_i (w^T x_i + b)}{\|w\|} \right\} \tag{21.6}$$

Note that $\delta^* \neq 0$, since $h(x)$ is assumed to be a separating hyperplane, and Eq. (21.3) must be satisfied.

All the points (or vectors) that achieve this minimum distance are called support vectors for the hyperplane. In other words, a support vector $x^*$ is a point that lies precisely on the margin of the classifier, and thus satisfies the condition

$$\delta^* = \frac{y^* (w^T x^* + b)}{\|w\|}$$

where $y^*$ is the class label for $x^*$. The numerator $y^* (w^T x^* + b)$ gives the absolute distance of the support vector to the hyperplane, whereas the denominator $\|w\|$ makes it a relative distance in terms of $w$.
Canonical Hyperplane
Consider the equation of the hyperplane [Eq. (21.2)]. Multiplying on both sides by some scalar $s$ yields an equivalent hyperplane:

$$s\, h(x) = s\, w^T x + s\, b = (s\, w)^T x + (s\, b) = 0$$

To obtain the unique or canonical hyperplane, we choose the scalar $s$ such that the absolute distance of a support vector from the hyperplane is 1. That is,

$$s\, y^* (w^T x^* + b) = 1$$

which implies

$$s = \frac{1}{y^* (w^T x^* + b)} = \frac{1}{y^*\, h(x^*)} \tag{21.7}$$

Henceforth, we will assume that any separating hyperplane is canonical. That is, it has already been suitably rescaled so that $y^*\, h(x^*) = 1$ for a support vector $x^*$, and the margin is given as

$$\delta^* = \frac{y^*\, h(x^*)}{\|w\|} = \frac{1}{\|w\|}$$

For the canonical hyperplane, for each support vector $x_i^*$ (with label $y_i^*$), we have $y_i^*\, h(x_i^*) = 1$, and for any point that is not a support vector we have $y_i\, h(x_i) > 1$, because, by definition, it must be farther from the hyperplane than a support vector. Over all the $n$ points in the dataset $D$, we thus obtain the following set of inequalities:

$$y_i (w^T x_i + b) \ge 1, \quad \text{for all points } x_i \in D \tag{21.8}$$
[Figure 21.2. Margin of a separating hyperplane: $\frac{1}{\|w\|}$ is the margin, and the shaded points are the support vectors.]
Example 21.2. Figure 21.2 gives an illustration of the support vectors and the margin of a hyperplane. The equation of the separating hyperplane is

$$h(x) = \begin{pmatrix} 5 \\ 2 \end{pmatrix}^T x - 20 = 0$$

Consider the support vector $x^* = (2, 2)^T$, with class $y^* = -1$. To find the canonical hyperplane equation, we have to rescale the weight vector and bias by the scalar $s$, obtained using Eq. (21.7):

$$s = \frac{1}{y^*\, h(x^*)} = \frac{1}{-1\left(\begin{pmatrix} 5 \\ 2 \end{pmatrix}^T \begin{pmatrix} 2 \\ 2 \end{pmatrix} - 20\right)} = \frac{1}{6}$$

Thus, the rescaled weight vector is

$$w = \frac{1}{6} \begin{pmatrix} 5 \\ 2 \end{pmatrix} = \begin{pmatrix} 5/6 \\ 2/6 \end{pmatrix}$$

and the rescaled bias is

$$b = \frac{-20}{6}$$

The canonical form of the hyperplane is therefore

$$h(x) = \begin{pmatrix} 5/6 \\ 2/6 \end{pmatrix}^T x - 20/6 = \begin{pmatrix} 0.833 \\ 0.333 \end{pmatrix}^T x - 3.33$$

and the margin of the canonical hyperplane is

$$\delta^* = \frac{y^*\, h(x^*)}{\|w\|} = \frac{1}{\sqrt{\left(\frac{5}{6}\right)^2 + \left(\frac{2}{6}\right)^2}} = \frac{6}{\sqrt{29}} = 1.114$$

In this example there are five support vectors (shown as shaded points), namely, $(2, 2)^T$ and $(2.5, 0.75)^T$ with class $y = -1$ (shown as triangles), and $(3.5, 4.25)^T$, $(4, 3)^T$, and $(4.5, 1.75)^T$ with class $y = +1$ (shown as circles), as illustrated in Figure 21.2.
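The rescaling to canonical form is equally mechanical; a small NumPy sketch using the numbers from this example:

    import numpy as np

    w, b = np.array([5.0, 2.0]), -20.0
    x_star, y_star = np.array([2.0, 2.0]), -1       # a support vector
    s = 1.0 / (y_star * (w @ x_star + b))           # Eq. (21.7): s = 1/6
    w_c, b_c = s * w, s * b                         # canonical hyperplane
    margin = 1.0 / np.linalg.norm(w_c)              # delta* = 1/||w||
    print(w_c, b_c, margin)                         # [0.833 0.333] -3.333 1.114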
21.2 SVM: LINEAR AND SEPARABLE CASE

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$, let us assume for the moment that the points are linearly separable, that is, there exists a separating hyperplane that perfectly classifies each point. In other words, all points labeled $y_i = +1$ lie on one side ($h(x) > 0$) and all points labeled $y_i = -1$ lie on the other side ($h(x) < 0$) of the hyperplane. It is obvious that in the linearly separable case, there are in fact an infinite number of such separating hyperplanes. Which one should we choose?
Maximum Margin Hyperplane
The fundamental idea behind SVMs is to choose the canonical hyperplane, specified by the weight vector $w$ and the bias $b$, that yields the maximum margin among all possible separating hyperplanes. If $\delta_h^*$ represents the margin for hyperplane $h(x) = 0$, then the goal is to find the optimal hyperplane $h^*$:

$$h^* = \arg\max_{h} \left\{\delta_h^*\right\} = \arg\max_{w, b} \left\{\frac{1}{\|w\|}\right\}$$

The SVM task is to find the hyperplane that maximizes the margin $\frac{1}{\|w\|}$, subject to the $n$ constraints given in Eq. (21.8), namely, $y_i (w^T x_i + b) \ge 1$, for all points $x_i \in D$. Notice that instead of maximizing the margin $\frac{1}{\|w\|}$, we can minimize $\|w\|$. In fact, we can obtain an equivalent minimization formulation given as follows:

$$\text{Objective Function:} \quad \min_{w, b} \left\{\frac{\|w\|^2}{2}\right\}$$
$$\text{Linear Constraints:} \quad y_i (w^T x_i + b) \ge 1,\ \forall x_i \in D$$

We can directly solve the above primal convex minimization problem with the $n$ linear constraints using standard optimization algorithms, as outlined later in Section 21.5. However, it is more common to solve the dual problem, which is obtained via the use of Lagrange multipliers. The main idea is to introduce a Lagrange multiplier $\alpha_i$ for each constraint, which satisfies the Karush–Kuhn–Tucker (KKT) conditions at the optimal solution:

$$\alpha_i \left(y_i (w^T x_i + b) - 1\right) = 0 \quad \text{and} \quad \alpha_i \ge 0$$

Incorporating all the $n$ constraints, the new objective function, called the Lagrangian, then becomes

$$\min L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left(y_i (w^T x_i + b) - 1\right) \tag{21.9}$$

$L$ should be minimized with respect to $w$ and $b$, and it should be maximized with respect to $\alpha_i$.
Taking the derivative of $L$ with respect to $w$ and $b$, and setting those to zero, we obtain

$$\frac{\partial}{\partial w} L = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \quad \text{or} \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{21.10}$$

$$\frac{\partial}{\partial b} L = \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{21.11}$$

The above equations give important intuition about the optimal weight vector $w$. In particular, Eq. (21.10) implies that $w$ can be expressed as a linear combination of the data points $x_i$, with the signed Lagrange multipliers, $\alpha_i y_i$, serving as the coefficients. Further, Eq. (21.11) implies that the sum of the signed Lagrange multipliers, $\alpha_i y_i$, must be zero.
Plugging these into Eq. (21.9), we obtain the dual Lagrangian objective function, which is specified purely in terms of the Lagrange multipliers:

$$L_{\text{dual}} = \frac{1}{2} w^T w - w^T \underbrace{\left(\sum_{i=1}^{n} \alpha_i y_i x_i\right)}_{w} - b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{0} + \sum_{i=1}^{n} \alpha_i = -\frac{1}{2} w^T w + \sum_{i=1}^{n} \alpha_i = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

The dual objective is thus given as

$$\text{Objective Function:} \quad \max_{\alpha} L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{Linear Constraints:} \quad \alpha_i \ge 0,\ \forall i \in D, \ \text{and} \ \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{21.12}$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$ is the vector comprising the Lagrange multipliers. $L_{\text{dual}}$ is a convex quadratic programming problem (note the $\alpha_i \alpha_j$ terms), which can be solved using standard optimization techniques. See Section 21.5 for a gradient-based method for solving the dual formulation.
Weight Vector and Bias
Once we have obtained the $\alpha_i$ values for $i = 1, \ldots, n$, we can solve for the weight vector $w$ and the bias $b$. Note that according to the KKT conditions, we have

$$\alpha_i \left(y_i (w^T x_i + b) - 1\right) = 0$$

which gives rise to two cases:
(1) $\alpha_i = 0$, or
(2) $y_i (w^T x_i + b) - 1 = 0$, which implies $y_i (w^T x_i + b) = 1$.

This is a very important result because if $\alpha_i > 0$, then $y_i (w^T x_i + b) = 1$, and thus the point $x_i$ must be a support vector. On the other hand, if $y_i (w^T x_i + b) > 1$, then $\alpha_i = 0$, that is, if a point is not a support vector, then $\alpha_i = 0$.

Once we know $\alpha_i$ for all points, we can compute the weight vector $w$ using Eq. (21.10), but by taking the summation only for the support vectors:

$$w = \sum_{i,\, \alpha_i > 0} \alpha_i y_i x_i \tag{21.13}$$

In other words, $w$ is obtained as a linear combination of the support vectors, with the $\alpha_i y_i$'s representing the weights. The rest of the points (with $\alpha_i = 0$) are not support vectors and thus do not play a role in determining $w$.

To compute the bias $b$, we first compute one solution $b_i$, per support vector, as follows:

$$\alpha_i \left(y_i (w^T x_i + b) - 1\right) = 0$$
$$y_i (w^T x_i + b_i) = 1$$
$$b_i = \frac{1}{y_i} - w^T x_i = y_i - w^T x_i \tag{21.14}$$

We can take $b$ as the average bias value over all the support vectors:

$$b = \text{avg}_{\alpha_i > 0}\{b_i\} \tag{21.15}$$
SVM Classifier
Given the optimal hyperplane function $h(x) = w^T x + b$, for any new point $z$, we predict its class as

$$\hat{y} = \text{sign}(h(z)) = \text{sign}(w^T z + b) \tag{21.16}$$

where the $\text{sign}(\cdot)$ function returns $+1$ if its argument is positive, and $-1$ if its argument is negative.
For each support vector $x_i$, we compute the bias $b_i = y_i - w^T x_i$ as follows:

    x_i     w^T x_i    b_i = y_i − w^T x_i
    x_1     4.332      −3.332
    x_2     4.331      −3.331
    x_4     4.331      −3.331
    x_13    2.333      −3.333
    x_14    2.332      −3.332

    b = avg{b_i} = −3.332

Thus, the optimal hyperplane is given as follows:

$$h(x) = \begin{pmatrix} 0.833 \\ 0.334 \end{pmatrix}^T x - 3.332 = 0$$

which matches the canonical hyperplane in Example 21.2.
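The dual in Eq. (21.12) can be attacked with very simple methods. Below is a minimal NumPy sketch of projected gradient ascent on $L_{\text{dual}}$ for linearly separable data; following the device introduced in Section 21.5, each point is first mapped to $(x_1, \ldots, x_d, 1)^T$ so that the bias is absorbed into $w$ and the equality constraint $\sum_i \alpha_i y_i = 0$ drops out. The step size eta, iteration count, and support-vector tolerance tol are illustrative choices, and because the augmented formulation also regularizes $b$, the recovered hyperplane may differ slightly from the canonical one above.

    import numpy as np

    def svm_dual_train(X, y, eta=0.01, iters=5000, tol=1e-5):
        """Hard-margin SVM dual via projected gradient ascent (a sketch)."""
        Xa = np.hstack([X, np.ones((len(y), 1))])    # map to R^{d+1}, Eq. (21.34)
        G = (y[:, None] * y[None, :]) * (Xa @ Xa.T)  # G_ij = y_i y_j x_i'^T x_j'
        alpha = np.zeros(len(y))
        for _ in range(iters):
            grad = 1.0 - G @ alpha                   # gradient of L_dual
            alpha = np.maximum(0.0, alpha + eta * grad)  # project onto alpha >= 0
        w_aug = (alpha * y) @ Xa                     # Eq. (21.13) in R^{d+1}
        return w_aug[:-1], w_aug[-1], alpha          # (w, b, alpha)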
21.3 SOFT MARGIN SVM: LINEAR AND NONSEPARABLE CASE

So far we have assumed that the dataset is perfectly linearly separable. Here we consider the case where the classes overlap to some extent so that a perfect separation is not possible, as depicted in Figure 21.3.
[Figure 21.3. Soft margin hyperplane: the shaded points are the support vectors. The margin is $1/\|w\|$ as illustrated, and points with positive slack values are also shown (thin black line).]
Recall that when points are linearly separable we can find a separating hyperplane so that all points satisfy the condition $y_i (w^T x_i + b) \ge 1$. SVMs can handle non-separable points by introducing slack variables $\xi_i$ in Eq. (21.8), as follows:

$$y_i (w^T x_i + b) \ge 1 - \xi_i$$

where $\xi_i \ge 0$ is the slack variable for point $x_i$, which indicates how much the point violates the separability condition, that is, the point may no longer be at least $1/\|w\|$ away from the hyperplane. The slack values indicate three types of points. If $\xi_i = 0$, then the corresponding point $x_i$ is at least $\frac{1}{\|w\|}$ away from the hyperplane. If $0 < \xi_i < 1$, then the point is within the margin and still correctly classified, that is, it is on the correct side of the hyperplane. However, if $\xi_i \ge 1$ then the point is misclassified and appears on the wrong side of the hyperplane.

In the nonseparable case, also called the soft margin case, the goal of SVM classification is to find the hyperplane with maximum margin that also minimizes the slack terms. The new objective function is given as

$$\text{Objective Function:} \quad \min_{w, b, \xi_i} \left\{\frac{\|w\|^2}{2} + C \sum_{i=1}^{n} (\xi_i)^k\right\}$$
$$\text{Linear Constraints:} \quad y_i (w^T x_i + b) \ge 1 - \xi_i,\ \forall x_i \in D; \qquad \xi_i \ge 0,\ \forall x_i \in D \tag{21.17}$$

where $C$ and $k$ are constants that incorporate the cost of misclassification. The term $\sum_{i=1}^{n} (\xi_i)^k$ gives the loss, that is, an estimate of the deviation from the separable case. The scalar $C$, which is chosen empirically, is a regularization constant that controls the trade-off between maximizing the margin (corresponding to minimizing $\|w\|^2/2$) or minimizing the loss (corresponding to minimizing the sum of the slack terms $\sum_{i=1}^{n} (\xi_i)^k$). For example, if $C \to 0$, then the loss component essentially disappears, and the objective defaults to maximizing the margin. On the other hand, if $C \to \infty$, then the margin ceases to have much effect, and the objective function tries to minimize the loss. The constant $k$ governs the form of the loss. Typically $k$ is set to 1 or 2. When $k = 1$, called hinge loss, the goal is to minimize the sum of the slack variables, whereas when $k = 2$, called quadratic loss, the goal is to minimize the sum of the squared slack variables.
21.3.1 Hinge Loss

Assuming $k = 1$, we can compute the Lagrangian for the optimization problem in Eq. (21.17) by introducing Lagrange multipliers $\alpha_i$ and $\beta_i$ that satisfy the following KKT conditions at the optimal solution:

$$\alpha_i \left(y_i (w^T x_i + b) - 1 + \xi_i\right) = 0 \ \text{with} \ \alpha_i \ge 0$$
$$\beta_i (\xi_i - 0) = 0 \ \text{with} \ \beta_i \ge 0 \tag{21.18}$$

The Lagrangian is then given as

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left(y_i (w^T x_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{n} \beta_i \xi_i \tag{21.19}$$
We turn this into a dual Lagrangian by taking its partial derivative with respect to $w$, $b$ and $\xi_i$, and setting those to zero:

$$\frac{\partial}{\partial w} L = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \quad \text{or} \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
$$\frac{\partial}{\partial b} L = \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\frac{\partial}{\partial \xi_i} L = C - \alpha_i - \beta_i = 0 \quad \text{or} \quad \beta_i = C - \alpha_i \tag{21.20}$$
Plugging these values into Eq. (21.19), we get

$$L_{\text{dual}} = \frac{1}{2} w^T w - w^T \underbrace{\left(\sum_{i=1}^{n} \alpha_i y_i x_i\right)}_{w} - b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{0} + \sum_{i=1}^{n} \alpha_i + \sum_{i=1}^{n} \underbrace{(C - \alpha_i - \beta_i)}_{0}\, \xi_i = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

The dual objective is thus given as

$$\text{Objective Function:} \quad \max_{\alpha} L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{Linear Constraints:} \quad 0 \le \alpha_i \le C,\ \forall i \in D \ \text{and} \ \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{21.21}$$

Notice that the objective is the same as the dual Lagrangian in the linearly separable case [Eq. (21.12)]. However, the constraints on the $\alpha_i$'s are different because we now require that $\alpha_i + \beta_i = C$ with $\alpha_i \ge 0$ and $\beta_i \ge 0$, which implies that $0 \le \alpha_i \le C$. Section 21.5 describes a gradient ascent approach for solving this dual objective function.
Weight Vector and Bias
Once we solve for $\alpha_i$, we have the same situation as before, namely, $\alpha_i = 0$ for points that are not support vectors, and $\alpha_i > 0$ only for the support vectors, which comprise all points $x_i$ for which we have

$$y_i (w^T x_i + b) = 1 - \xi_i \tag{21.22}$$

Notice that the support vectors now include all points that are on the margin, which have zero slack ($\xi_i = 0$), as well as all points with positive slack ($\xi_i > 0$).

We can obtain the weight vector from the support vectors as before:

$$w = \sum_{i,\, \alpha_i > 0} \alpha_i y_i x_i \tag{21.23}$$

We can also solve for the $\beta_i$ using Eq. (21.20): $\beta_i = C - \alpha_i$. Replacing $\beta_i$ in the KKT conditions [Eq. (21.18)] with the expression from above we obtain

$$(C - \alpha_i)\, \xi_i = 0 \tag{21.24}$$

Thus, for the support vectors with $\alpha_i > 0$, we have two cases to consider:
(1) $\xi_i > 0$, which implies that $C - \alpha_i = 0$, that is, $\alpha_i = C$, or
(2) $C - \alpha_i > 0$, that is $\alpha_i < C$. In this case, from Eq. (21.24) we must have $\xi_i = 0$. In other words, these are precisely those support vectors that are on the margin.

Using those support vectors that are on the margin, that is, those that have $0 < \alpha_i < C$ and $\xi_i = 0$, we can solve for $b_i$:

$$\alpha_i \left(y_i (w^T x_i + b_i) - 1\right) = 0$$
$$y_i (w^T x_i + b_i) = 1$$
$$b_i = \frac{1}{y_i} - w^T x_i = y_i - w^T x_i \tag{21.25}$$

To obtain the final bias $b$, we can take the average over all the $b_i$ values. From Eqs. (21.23) and (21.25), both the weight vector $w$ and the bias term $b$ can be computed without explicitly computing the slack terms $\xi_i$ for each point.
Once the optimal hyperplane has been determined, the SVM model predicts the class for a new point $z$ as follows:

$$\hat{y} = \text{sign}(h(z)) = \text{sign}(w^T z + b)$$
Example 21.4. Let us consider the data points shown in Figure 21.3. There are four new points in addition to the 14 points from Table 21.1 that we considered in Example 21.3; these points are

    x_i     x_{i1}   x_{i2}   y_i
    x_15    4        2        +1
    x_16    2        3        +1
    x_17    3        2        −1
    x_18    5        3        −1

Let $k = 1$ and $C = 1$; then solving the $L_{\text{dual}}$ yields the following support vectors and Lagrangian values $\alpha_i$:

One can see that this is essentially the same as the canonical hyperplane we found in Example 21.3.

It is instructive to see what the slack variables are in this case. Note that $\xi_i = 0$ for all points that are not support vectors, and also for those support vectors that are on the margin. So the slack is positive only for the remaining support vectors, for whom the slack can be computed directly from Eq. (21.22), as follows:

$$\xi_i = 1 - y_i (w^T x_i + b)$$

Thus, for all support vectors not on the margin, we have

    x_i     w^T x_i   w^T x_i + b   ξ_i = 1 − y_i(w^T x_i + b)
    x_15    4.001      0.667        0.333
    x_16    2.667     −0.667        1.667
    x_17    3.167     −0.167        0.833
    x_18    5.168      1.834        2.834

As expected, the slack variable $\xi_i > 1$ for those points that are misclassified (i.e., are on the wrong side of the hyperplane), namely $x_{16} = (2, 3)^T$ and $x_{18} = (5, 3)^T$. The other two points are correctly classified, but lie within the margin, and thus satisfy $0 < \xi_i < 1$. The total slack is given as

$$\sum_i \xi_i = \xi_{15} + \xi_{16} + \xi_{17} + \xi_{18} = 0.333 + 1.667 + 0.833 + 2.834 = 5.667$$
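The slack computation is a one-liner in NumPy; a small sketch using the values from this example (the max with 0 mirrors the fact that non-violating points have zero slack):

    import numpy as np

    w, b = np.array([0.833, 0.334]), -3.332     # hyperplane from Example 21.3
    X_new = np.array([[4, 2], [2, 3], [3, 2], [5, 3]], dtype=float)
    y_new = np.array([1, 1, -1, -1])
    xi = np.maximum(0.0, 1.0 - y_new * (X_new @ w + b))   # Eq. (21.22) slacks
    print(xi, xi.sum())   # approx [0.333, 1.667, 0.833, 2.834], total 5.667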
21.3.2 Quadratic Loss

For quadratic loss, we have $k = 2$ in the objective function [Eq. (21.17)]. In this case we can drop the positivity constraint $\xi_i \ge 0$ due to the fact that (1) the sum of the slack terms $\sum_{i=1}^{n} \xi_i^2$ is always positive, and (2) a potential negative value of slack will be ruled out during optimization because a choice of $\xi_i = 0$ leads to a smaller value of the primary objective, and it still satisfies the constraint $y_i (w^T x_i + b) \ge 1 - \xi_i$ whenever $\xi_i < 0$. In other words, the optimization process will replace any negative slack variables by zero values. Thus, the SVM objective for quadratic loss is given as

$$\text{Objective Function:} \quad \min_{w, b, \xi_i} \left\{\frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i^2\right\}$$
$$\text{Linear Constraints:} \quad y_i (w^T x_i + b) \ge 1 - \xi_i,\ \forall x_i \in D$$
The Lagrangian is then given as

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2 - \sum_{i=1}^{n} \alpha_i \left(y_i (w^T x_i + b) - 1 + \xi_i\right) \tag{21.26}$$

Differentiating with respect to $w$, $b$, and $\xi_i$ and setting them to zero results in the following conditions, respectively:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad \xi_i = \frac{1}{2C}\, \alpha_i$$

Substituting these back into Eq. (21.26) yields the dual objective

$$L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j - \frac{1}{4C} \sum_{i=1}^{n} \alpha_i^2 = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left(x_i^T x_j + \frac{1}{2C}\, \delta_{ij}\right)$$

where $\delta$ is the Kronecker delta function, defined as $\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ otherwise. Thus, the dual objective is given as

$$\max_{\alpha} L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \left(x_i^T x_j + \frac{1}{2C}\, \delta_{ij}\right)$$

subject to the constraints $\alpha_i \ge 0,\ \forall i \in D$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$. (21.27)

Once we solve for $\alpha_i$ using the methods from Section 21.5, we can recover the weight vector and bias as follows:

$$w = \sum_{i,\, \alpha_i > 0} \alpha_i y_i x_i \qquad b = \text{avg}_{i,\, C > \alpha_i > 0}\left\{y_i - w^T x_i\right\}$$
21.4 KERNEL SVM: NONLINEAR CASE

The linear SVM approach can be used for datasets with a nonlinear decision boundary via the kernel trick from Chapter 5. Conceptually, the idea is to map the original $d$-dimensional points $x_i$ in the input space to points $\phi(x_i)$ in a high-dimensional feature space via some nonlinear transformation $\phi$. Given the extra flexibility, it is more likely that the points $\phi(x_i)$ might be linearly separable in the feature space. Note, however, that a linear decision surface in feature space actually corresponds to a nonlinear decision surface in the input space. Further, the kernel trick allows us to carry out all operations via the kernel function computed in input space, rather than having to map the points into feature space.
[Figure 21.4. Nonlinear SVM: shaded points are the support vectors.]
Example 21.5. Consider the set of points shown in Figure 21.4. There is no linear classifier that can discriminate between the points. However, there exists a perfect quadratic classifier that can separate the two classes. Given the input space over the two dimensions $X_1$ and $X_2$, if we transform each point $x = (x_1, x_2)^T$ into a point in the feature space consisting of the dimensions $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$, via the transformation $\phi(x) = \left(\sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2\right)^T$, then it is possible to find a separating hyperplane in feature space. For this dataset, it is possible to map the hyperplane back to the input space, where it is seen as an ellipse (thick black line) that separates the two classes (circles and triangles). The support vectors are those points (shown in gray) that lie on the margin (dashed ellipses).
To apply the kernel trick for nonlinear SVM classification, we have to show that all operations require only the kernel function:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

Let the original database be given as $D = \{(x_i, y_i)\}_{i=1}^{n}$. Applying $\phi$ to each point, we can obtain the new dataset in the feature space $D_\phi = \{(\phi(x_i), y_i)\}_{i=1}^{n}$.
The SVM objective function [Eq. (21.17)] in feature space is given as

$$\text{Objective Function:} \quad \min_{w, b, \xi_i} \left\{\frac{\|w\|^2}{2} + C \sum_{i=1}^{n} (\xi_i)^k\right\}$$
$$\text{Linear Constraints:} \quad y_i \left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i \ \text{and} \ \xi_i \ge 0,\ \forall x_i \in D \tag{21.28}$$

where $w$ is the weight vector, $b$ is the bias, and $\xi_i$ are the slack variables, all in feature space.
Hinge Loss
For hinge loss, the dual Lagrangian [Eq. (21.21)] in feature space is given as

$$\max_{\alpha} L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \tag{21.29}$$

subject to the constraints that $0 \le \alpha_i \le C$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$. Notice that the dual Lagrangian depends only on the dot product between two vectors in feature space $\phi(x_i)^T \phi(x_j) = K(x_i, x_j)$, and thus we can solve the optimization problem using the kernel matrix $K = \{K(x_i, x_j)\}_{i,j=1,\ldots,n}$. Section 21.5 describes a stochastic gradient-based approach for solving the dual objective function.
Quadratic Loss
For quadratic loss, the dual Lagrangian [Eq. (21.27)] corresponds to a change of kernel. Define a new kernel function $K_q$, as follows:

$$K_q(x_i, x_j) = \phi(x_i)^T \phi(x_j) + \frac{1}{2C}\, \delta_{ij} = K(x_i, x_j) + \frac{1}{2C}\, \delta_{ij}$$

which affects only the diagonal entries of the kernel matrix $K$, as $\delta_{ij} = 1$ iff $i = j$, and zero otherwise. Thus, the dual Lagrangian is given as

$$\max_{\alpha} L_{\text{dual}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K_q(x_i, x_j) \tag{21.30}$$

subject to the constraints that $\alpha_i \ge 0$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$. The above optimization can be solved using the same approach as for hinge loss, with a simple change of kernel.
Weight Vector and Bias
We can solve for $w$ in feature space as follows:

$$w = \sum_{\alpha_i > 0} \alpha_i y_i\, \phi(x_i) \tag{21.31}$$

Because $w$ uses $\phi(x_i)$ directly, in general, we may not be able or willing to compute $w$ explicitly. However, as we shall see next, it is not necessary to explicitly compute $w$ for classifying the points.

Let us now see how to compute the bias via kernel operations. Using Eq. (21.25), we compute $b$ as the average over the support vectors that are on the margin, that is, those with $0 < \alpha_i < C$ and $\xi_i = 0$:

$$b = \text{avg}_{i,\, 0 < \alpha_i < C}\{b_i\} = \text{avg}_{i,\, 0 < \alpha_i < C}\left\{y_i - w^T \phi(x_i)\right\} \tag{21.32}$$

Substituting $w$ from Eq. (21.31), we obtain a new expression for $b_i$ as

$$b_i = y_i - \sum_{\alpha_j > 0} \alpha_j y_j\, \phi(x_j)^T \phi(x_i) = y_i - \sum_{\alpha_j > 0} \alpha_j y_j\, K(x_j, x_i) \tag{21.33}$$

Notice that $b_i$ is a function of the dot product between two vectors in feature space and therefore it can be computed via the kernel function in the input space.
Kernel SVM Classifier
We can predict the class for a new point $z$ as follows:

$$\hat{y} = \text{sign}\left(w^T \phi(z) + b\right) = \text{sign}\left(\sum_{\alpha_i > 0} \alpha_i y_i\, \phi(x_i)^T \phi(z) + b\right) = \text{sign}\left(\sum_{\alpha_i > 0} \alpha_i y_i\, K(x_i, z) + b\right)$$

Once again we see that $\hat{y}$ uses only dot products in feature space.
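A minimal sketch of this decision rule, assuming the multipliers alpha and bias b are already available; the kernel shown is the inhomogeneous quadratic kernel of Example 21.6 below, and all names are illustrative:

    import numpy as np

    def quad_kernel(x, z):
        """Inhomogeneous quadratic kernel K(x, z) = (1 + x^T z)^2."""
        return (1.0 + x @ z) ** 2

    def kernel_svm_predict(z, X, y, alpha, b, kernel=quad_kernel):
        """Predict the class of z via sign(sum_i alpha_i y_i K(x_i, z) + b)."""
        sv = alpha > 0                       # only support vectors contribute
        s = sum(a * yi * kernel(xi, z)
                for a, yi, xi in zip(alpha[sv], y[sv], X[sv]))
        return np.sign(s + b)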
Based on the above derivations, we can see that, to train and test the SVM classifier, the mapped points $\phi(x_i)$ are never needed in isolation. Instead, all operations can be carried out in terms of the kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Thus, any nonlinear kernel function can be used to do nonlinear classification in the input space. Examples of such nonlinear kernels include the polynomial kernel [Eq. (5.9)], and the Gaussian kernel [Eq. (5.10)], among others.
Example 21.6. Let us consider the example dataset shown in Figure 21.4; it has 29 points in total. Although it is generally too expensive or infeasible (depending on the choice of the kernel) to compute an explicit representation of the hyperplane in feature space, and to map it back into input space, we will illustrate the application of SVMs in both input and feature space to aid understanding.

We use an inhomogeneous polynomial kernel [Eq. (5.9)] of degree $q = 2$, that is, we use the kernel

$$K(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = \left(1 + \mathbf{x}_i^T\mathbf{x}_j\right)^2$$

With $C = 4$, solving the $L_{dual}$ quadratic program [Eq. (21.30)] in input space yields the following six support vectors, shown as the shaded (gray) points in Figure 21.4.
x_i   (x_i1, x_i2)^T   φ(x_i)                                 y_i   α_i
x_1   (1, 2)^T         (1, 1.41, 2.83, 1, 4, 2.83)^T          +1    0.6198
x_2   (4, 1)^T         (1, 5.66, 1.41, 16, 1, 5.66)^T         +1    2.069
x_3   (6, 4.5)^T       (1, 8.49, 6.36, 36, 20.25, 38.18)^T    +1    3.803
x_4   (7, 2)^T         (1, 9.90, 2.83, 49, 4, 19.80)^T        +1    0.3182
x_5   (4, 4)^T         (1, 5.66, 5.66, 16, 16, 15.91)^T       −1    2.9598
x_6   (6, 3)^T         (1, 8.49, 4.24, 36, 9, 25.46)^T        −1    3.8502
For the inhomogeneous quadratic kernel, the mapping $\phi$ maps an input point $\mathbf{x}_i$ into feature space as follows:

$$\phi(\mathbf{x}) = \phi\left((x_1, x_2)^T\right) = \left(1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2\right)^T$$

The table above shows all the mapped points, which reside in feature space. For example, $\mathbf{x}_1 = (1, 2)^T$ is transformed into

$$\phi(\mathbf{x}_1) = \left(1,\ \sqrt{2}\cdot 1,\ \sqrt{2}\cdot 2,\ 1^2,\ 2^2,\ \sqrt{2}\cdot 1\cdot 2\right)^T = (1, 1.41, 2.83, 1, 4, 2.83)^T$$
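As a quick numerical sanity check (our illustration, not from the text), one can verify that this explicit map reproduces the kernel value exactly:

```python
import numpy as np

def phi(x):
    # feature map for the inhomogeneous quadratic kernel (1 + x^T y)^2
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([4.0, 1.0])
assert np.isclose(phi(a) @ phi(b), (1 + a @ b) ** 2)  # both sides equal 49
```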
We compute the weight vector for the hyperplane using Eq. (21.31):

$$\mathbf{w} = \sum_{i,\,\alpha_i > 0}\alpha_i y_i\,\phi(\mathbf{x}_i) = (0, -1.413, -3.298, 0.256, 0.82, -0.018)^T$$

and the bias is computed using Eq. (21.32), which yields $b = -8.841$.
For the quadratic polynomial kernel, the decision boundary in input space corresponds to an ellipse. For our example, the center of the ellipse is given as $(4.046, 2.907)$, the semimajor axis length is $2.78$, and the semiminor axis length is $1.55$. The resulting decision boundary is the ellipse shown in Figure 21.4. We emphasize that in this example we explicitly transformed all the points into the feature space just for illustration purposes. The kernel trick allows us to achieve the same goal using only the kernel function.
21.5 SVM TRAINING ALGORITHMS
We now turn our attention to algorithms for solving the SVM optimization problems.
We will consider simple optimization approaches for solving the dual as well as the
primal formulations. It is important to note that these methods are not the most
efficient. However, since they are relatively simple, they can serve as a starting point
for more sophisticated methods.
For the SVM algorithms in this section, instead of dealing explicitly with the bias $b$, we map each point $\mathbf{x}_i \in \mathbb{R}^d$ to the point $\mathbf{x}_i' \in \mathbb{R}^{d+1}$ as follows:

$$\mathbf{x}_i' = (x_{i1}, \dots, x_{id}, 1)^T \quad (21.34)$$
Furthermore, we also map the weight vector to $\mathbb{R}^{d+1}$, with $w_{d+1} = b$, so that

$$\mathbf{w} = (w_1, \dots, w_d, b)^T \quad (21.35)$$
The equation of the hyperplane [Eq. (21.1)] is then given as follows:

$$h(\mathbf{x}') : \mathbf{w}^T\mathbf{x}' = 0$$
$$h(\mathbf{x}') : \begin{pmatrix} w_1 & \cdots & w_d & b \end{pmatrix}\begin{pmatrix} x_{i1} \\ \vdots \\ x_{id} \\ 1 \end{pmatrix} = 0$$
$$h(\mathbf{x}') : w_1 x_{i1} + \cdots + w_d x_{id} + b = 0$$
In the discussion below we assume that the bias term has been included in $\mathbf{w}$, and that each point has been mapped to $\mathbb{R}^{d+1}$ as per Eqs. (21.34) and (21.35). Thus, the last component of $\mathbf{w}$ yields the bias $b$. Another consequence of mapping the points to $\mathbb{R}^{d+1}$ is that the constraint $\sum_{i=1}^{n}\alpha_i y_i = 0$ does not apply in the SVM dual formulations given in Eqs. (21.21), (21.27), (21.29), and (21.30), as there is no explicit bias term $b$ for the linear constraints in the SVM objective given in Eq. (21.17). The new set of constraints is given as

$$y_i\,\mathbf{w}^T\mathbf{x}_i' \ge 1 - \xi_i$$
21.5.1 Dual Solution: Stochastic Gradient Ascent
We consider only the hinge loss case because quadratic loss can be handled by a change of kernel, as shown in Eq. (21.30). The dual optimization objective for hinge loss [Eq. (21.29)] is given as

$$\max_{\boldsymbol{\alpha}} J(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j)$$

subject to the constraints $0 \le \alpha_i \le C$ for all $i = 1, \dots, n$. Here $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_n)^T \in \mathbb{R}^n$.
Let us consider the terms in $J(\boldsymbol{\alpha})$ that involve the Lagrange multiplier $\alpha_k$:

$$J(\alpha_k) = \alpha_k - \frac{1}{2}\alpha_k^2 y_k^2 K(\mathbf{x}_k,\mathbf{x}_k) - \alpha_k y_k\sum_{\substack{i=1 \\ i \ne k}}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)$$
The gradient or the rate of change in the objective function at $\boldsymbol{\alpha}$ is given as the partial derivative of $J(\boldsymbol{\alpha})$ with respect to $\boldsymbol{\alpha}$, that is, with respect to each $\alpha_k$:

$$\nabla J(\boldsymbol{\alpha}) = \left(\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_1}, \frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_2}, \dots, \frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_n}\right)^T$$

where the $k$th component of the gradient is obtained by differentiating $J(\alpha_k)$ with respect to $\alpha_k$:

$$\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_k} = \frac{\partial J(\alpha_k)}{\partial\alpha_k} = 1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k) \quad (21.36)$$
Because we want to maximize the objective function $J(\boldsymbol{\alpha})$, we should move in the direction of the gradient $\nabla J(\boldsymbol{\alpha})$. Starting from an initial $\boldsymbol{\alpha}$, the gradient ascent approach successively updates it as follows:

$$\boldsymbol{\alpha}_{t+1} = \boldsymbol{\alpha}_t + \eta_t\,\nabla J(\boldsymbol{\alpha}_t)$$

where $\boldsymbol{\alpha}_t$ is the estimate at the $t$th step, and $\eta_t$ is the step size.
Instead of updating the entire $\boldsymbol{\alpha}$ vector in each step, in the stochastic gradient ascent approach we update each component $\alpha_k$ independently and immediately use the new value to update other components. This can result in faster convergence. The update rule for the $k$th component is given as

$$\alpha_k = \alpha_k + \eta_k\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_k} = \alpha_k + \eta_k\left(1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)\right) \quad (21.37)$$
where $\eta_k$ is the step size. We also have to ensure that the constraints $\alpha_k \in [0, C]$ are satisfied. Thus, in the update step above, if $\alpha_k < 0$ we reset it to $\alpha_k = 0$, and if $\alpha_k > C$ we reset it to $\alpha_k = C$. The pseudo-code for stochastic gradient ascent is given in Algorithm 21.1.
ALGORITHM 21.1. Dual SVM Algorithm: Stochastic Gradient Ascent

SVM-DUAL (D, K, C, ε):
 1  foreach x_i ∈ D do x_i ← (x_i^T, 1)^T  // map to R^{d+1}
 2  if loss = hinge then
 3      K ← {K(x_i, x_j)}_{i,j=1,...,n}  // kernel matrix, hinge loss
 4  else if loss = quadratic then
 5      K ← {K(x_i, x_j) + (1/(2C))·δ_ij}_{i,j=1,...,n}  // kernel matrix, quadratic loss
 6  for k = 1, ..., n do η_k ← 1/K(x_k, x_k)  // set step size
 7  t ← 0
 8  α_0 ← (0, ..., 0)^T
 9  repeat
10      α ← α_t
11      for k = 1 to n do
12          α_k ← α_k + η_k (1 − y_k Σ_{i=1}^{n} α_i y_i K(x_i, x_k))  // update kth component of α
13          if α_k < 0 then α_k ← 0
14          if α_k > C then α_k ← C
15      α_{t+1} ← α
16      t ← t + 1
17  until ‖α_t − α_{t−1}‖ ≤ ε
To determine the step size $\eta_k$, ideally we would like to choose it so that the gradient at $\alpha_k$ goes to zero, which happens when

$$\eta_k = \frac{1}{K(\mathbf{x}_k,\mathbf{x}_k)} \quad (21.38)$$
To see why, note that when only $\alpha_k$ is updated, the other $\alpha_i$ do not change. Thus, the new $\boldsymbol{\alpha}$ has a change only in $\alpha_k$, and from Eq. (21.36) we get

$$\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_k} = 1 - y_k\sum_{i \ne k}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k) - y_k\alpha_k y_k K(\mathbf{x}_k,\mathbf{x}_k)$$
Plugging in the value of $\alpha_k$ from Eq. (21.37), we have

$$\begin{aligned}
\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_k} &= 1 - y_k\sum_{i \ne k}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k) - \left(\alpha_k + \eta_k\Big(1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)\Big)\right)K(\mathbf{x}_k,\mathbf{x}_k) \\
&= 1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k) - \eta_k K(\mathbf{x}_k,\mathbf{x}_k)\left(1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)\right) \\
&= \left(1 - \eta_k K(\mathbf{x}_k,\mathbf{x}_k)\right)\left(1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)\right)
\end{aligned}$$
Substituting $\eta_k$ from Eq. (21.38), we have

$$\frac{\partial J(\boldsymbol{\alpha})}{\partial\alpha_k} = \left(1 - \frac{1}{K(\mathbf{x}_k,\mathbf{x}_k)}K(\mathbf{x}_k,\mathbf{x}_k)\right)\left(1 - y_k\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}_k)\right) = 0$$
In Algorithm 21.1, for better convergence, we thus choose $\eta_k$ according to Eq. (21.38). The method successively updates $\boldsymbol{\alpha}$ and stops when the change falls below a given threshold $\epsilon$. Since the above description assumes a general kernel function between any two points, we can recover the linear, nonseparable case by simply setting $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$. The computational complexity of the method is $O(n^2)$ per iteration.
Note that once we obtain the final $\boldsymbol{\alpha}$, we classify a new point $\mathbf{z} \in \mathbb{R}^{d+1}$ as follows:

$$\hat{y} = \text{sign}\left(h(\phi(\mathbf{z}))\right) = \text{sign}\left(\mathbf{w}^T\phi(\mathbf{z})\right) = \text{sign}\left(\sum_{\alpha_i > 0}\alpha_i y_i K(\mathbf{x}_i,\mathbf{z})\right)$$
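A direct transcription of Algorithm 21.1 in NumPy might look as follows (an illustrative sketch; the kernel matrix is assumed precomputed, and the iteration cap is an added safeguard not present in the pseudo-code):

```python
import numpy as np

def svm_dual_sga(K, y, C, eps, loss="hinge", max_iter=1000):
    """Dual SVM via stochastic gradient ascent (Algorithm 21.1).
    K: (n, n) kernel matrix over points already mapped to R^{d+1};
    y: (n,) labels in {-1, +1}. Returns the Lagrange multipliers alpha."""
    n = len(y)
    if loss == "quadratic":              # quadratic loss = change of kernel
        K = K + np.eye(n) / (2 * C)
    eta = 1.0 / np.diag(K)               # step sizes eta_k = 1 / K(x_k, x_k)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        for k in range(n):               # update each alpha_k in turn
            alpha[k] += eta[k] * (1 - y[k] * np.sum(alpha * y * K[:, k]))
            alpha[k] = min(max(alpha[k], 0.0), C)   # clamp to [0, C]
        if np.linalg.norm(alpha - alpha_prev) <= eps:
            break
    return alpha

def svm_dual_predict(K_test, y, alpha):
    """Classify test points given K_test[i, j] = K(x_i, z_j)."""
    return np.sign((alpha * y) @ K_test)
```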
Example 21.7 (Dual SVM: Linear Kernel). Figure 21.5 shows the $n = 150$ points from the Iris dataset, using sepal length and sepal width as the two attributes. The goal is to discriminate between Iris-setosa (shown as circles) and other types of Iris flowers (shown as triangles). Algorithm 21.1 was used to train the SVM classifier with a linear kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$ and convergence threshold $\epsilon = 0.0001$, with hinge loss. Two different values of $C$ were used; hyperplane $h_{10}$ is obtained by using $C = 10$, whereas $h_{1000}$ uses $C = 1000$; the hyperplanes are given as follows:
$$h_{10}(\mathbf{x}) : 2.74x_1 - 3.74x_2 - 3.09 = 0$$
$$h_{1000}(\mathbf{x}) : 8.56x_1 - 7.14x_2 - 23.12 = 0$$
Figure 21.5. SVM dual algorithm with linear kernel. [Figure: X1 versus X2 scatter plot showing the hyperplanes h_10 and h_1000.]
The hyperplane $h_{10}$ has a larger margin, but it has a larger slack; it misclassifies one of the circles. On the other hand, the hyperplane $h_{1000}$ has a smaller margin, but it minimizes the slack; it is a separating hyperplane. This example illustrates the fact that the higher the value of $C$, the more the emphasis on minimizing the slack.
Example 21.8 (Dual SVM: Quadratic Kernel). Figure 21.6 shows the $n = 150$ points from the Iris dataset projected on the first two principal components. The task is to separate Iris-versicolor (in circles) from the other two types of Irises (in triangles). The figure plots the decision boundaries obtained when using the linear kernel $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$ and the homogeneous quadratic kernel $K(\mathbf{x}_i,\mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j)^2$, where $\mathbf{x}_i \in \mathbb{R}^{d+1}$, as per Eq. (21.34). The optimal hyperplane in both cases was found via the gradient ascent approach in Algorithm 21.1, with $C = 10$, $\epsilon = 0.0001$, and using hinge loss.
The optimal hyperplane $h_l$ (shown in gray) for the linear kernel is given as

$$h_l(\mathbf{x}) : 0.16x_1 + 1.9x_2 + 0.8 = 0$$
As expected, $h_l$ is unable to separate the classes. On the other hand, the optimal hyperplane $h_q$ (shown as a clipped black ellipse) for the quadratic kernel is given as

$$h_q(\mathbf{x}) : \mathbf{w}^T\phi(\mathbf{x}) = 1.86x_1^2 + 1.87x_1x_2 + 0.14x_1 + 0.85x_2^2 - 1.22x_2 - 3.25 = 0$$
where $\mathbf{x} = (x_1, x_2)^T$, $\mathbf{w} = \left(1.86, 1.32, 0.099, 0.85, -0.87, -3.25\right)^T$, and $\phi(\mathbf{x}) = \left(x_1^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1,\ x_2^2,\ \sqrt{2}x_2,\ 1\right)^T$.
Figure 21.6. SVM dual algorithm with quadratic kernel. [Figure: u1 versus u2 scatter plot showing the boundaries h_l and h_q.]
The hyperplane $h_q$ is able to separate the two classes quite well. Here we explicitly reconstructed $\mathbf{w}$ for illustration purposes; note that the last element of $\mathbf{w}$ gives the bias term $b = -3.25$.
21.5.2 Primal Solution: Newton Optimization
The dual approach is the one most commonly used to train SVMs, but it is also possible to train using the primal formulation.

Consider the primal optimization function for the linear, but nonseparable case [Eq. (21.17)]. With $\mathbf{w}, \mathbf{x}_i \in \mathbb{R}^{d+1}$ as discussed earlier, we have to minimize the objective function

$$\min_{\mathbf{w}} J(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i)^k \quad (21.39)$$
subject to the linear constraints

$$y_i(\mathbf{w}^T\mathbf{x}_i) \ge 1 - \xi_i \quad\text{and}\quad \xi_i \ge 0 \text{ for all } i = 1, \dots, n$$

Rearranging the above, we obtain an expression for $\xi_i$:

$$\xi_i \ge 1 - y_i(\mathbf{w}^T\mathbf{x}_i) \quad\text{and}\quad \xi_i \ge 0,$$

which implies that

$$\xi_i = \max\left\{0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_i)\right\} \quad (21.40)$$
Plugging Eq. (21.40) into the objective function [Eq. (21.39)], we obtain

$$J(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\max\left\{0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_i)\right\}^k = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1}\left(1 - y_i(\mathbf{w}^T\mathbf{x}_i)\right)^k \quad (21.41)$$

The last step follows from Eq. (21.40) because $\xi_i > 0$ if and only if $1 - y_i(\mathbf{w}^T\mathbf{x}_i) > 0$, that is, $y_i(\mathbf{w}^T\mathbf{x}_i) < 1$. Unfortunately, the hinge loss formulation, with $k = 1$, is not differentiable. One could use a differentiable approximation to the hinge loss, but here we describe the quadratic loss formulation.
Quadratic Loss
For quadratic loss, we have $k = 2$, and the primal objective [Eq. (21.41)] can be written as

$$J(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1}\left(1 - y_i(\mathbf{w}^T\mathbf{x}_i)\right)^2$$
The gradient or the rate of change of the objective function at $\mathbf{w}$ is given as the partial derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$:

$$\begin{aligned}
\nabla_{\mathbf{w}} = \frac{\partial J(\mathbf{w})}{\partial\mathbf{w}} &= \mathbf{w} - 2C\sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1} y_i\mathbf{x}_i\left(1 - y_i(\mathbf{w}^T\mathbf{x}_i)\right) \\
&= \mathbf{w} - 2C\sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1} y_i\mathbf{x}_i + 2C\sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1}\mathbf{x}_i\mathbf{x}_i^T\mathbf{w} \\
&= \mathbf{w} - 2C\,\mathbf{v} + 2C\,\mathbf{S}\mathbf{w}
\end{aligned}$$

where the vector $\mathbf{v}$ and the matrix $\mathbf{S}$ are given as

$$\mathbf{v} = \sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1} y_i\mathbf{x}_i \qquad \mathbf{S} = \sum_{y_i(\mathbf{w}^T\mathbf{x}_i) < 1}\mathbf{x}_i\mathbf{x}_i^T$$
Note that the matrix $\mathbf{S}$ is the scatter matrix, and the vector $\mathbf{v}$ is $m$ times the mean of the, say, $m$ signed points $y_i\mathbf{x}_i$ that satisfy the condition $y_i h(\mathbf{x}_i) < 1$.
The Hessian matrix is defined as the matrix of second-order partial derivatives of $J(\mathbf{w})$ with respect to $\mathbf{w}$, which is given as

$$\mathbf{H}_{\mathbf{w}} = \frac{\partial\nabla_{\mathbf{w}}}{\partial\mathbf{w}} = \mathbf{I} + 2C\,\mathbf{S}$$
Because we want to minimize the objective function $J(\mathbf{w})$, we should move in the direction opposite to the gradient. The Newton optimization update rule for $\mathbf{w}$ is given as

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\,\mathbf{H}_{\mathbf{w}_t}^{-1}\nabla_{\mathbf{w}_t} \quad (21.42)$$

where $\eta_t > 0$ is a scalar value denoting the step size at iteration $t$. Normally one needs to use a line search method to find the optimal step size $\eta_t$, but the default value of $\eta_t = 1$ usually works for quadratic loss.
ALGORITHM 21.2. Primal SVM Algorithm: Newton Optimization, Quadratic Loss

SVM-PRIMAL (D, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T  // map to R^{d+1}
 3  t ← 0
 4  w_0 ← (0, ..., 0)^T  // initialize w_t ∈ R^{d+1}
 5  repeat
 6      v ← Σ_{y_i(w_t^T x_i) < 1} y_i x_i
 7      S ← Σ_{y_i(w_t^T x_i) < 1} x_i x_i^T
 8      ∇ ← (I + 2C·S) w_t − 2C·v  // gradient
 9      H ← I + 2C·S  // Hessian
10      w_{t+1} ← w_t − η_t H^{−1} ∇  // Newton update rule [Eq. (21.42)]
11      t ← t + 1
12  until ‖w_t − w_{t−1}‖ ≤ ε
The Newton optimization algorithm for training linear, nonseparable SVMs in the primal is given in Algorithm 21.2. The step size $\eta_t$ is set to 1 by default. After computing the gradient and Hessian at $\mathbf{w}_t$ (lines 6–9), the Newton update rule is used to obtain the new weight vector $\mathbf{w}_{t+1}$ (line 10). The iterations continue until there is very little change in the weight vector. Computing $\mathbf{S}$ requires $O(nd^2)$ steps; computing the gradient $\nabla$, the Hessian matrix $\mathbf{H}$, and updating the weight vector $\mathbf{w}_{t+1}$ takes time $O(d^2)$; and inverting the Hessian takes $O(d^3)$ operations, for a total computational complexity of $O(nd^2 + d^3)$ per iteration in the worst case.
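As a concrete sketch of Algorithm 21.2 (an illustrative NumPy transcription; solving the linear system H u = ∇ rather than inverting H explicitly is our implementation choice, not from the pseudo-code):

```python
import numpy as np

def svm_primal_newton(X, y, C, eps, max_iter=100):
    """Primal SVM, quadratic loss, via Newton optimization (Algorithm 21.2).
    X: (n, d) data; y: (n,) labels in {-1, +1}. Returns w in R^{d+1},
    whose last component is the bias b."""
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])       # map points to R^{d+1}
    w = np.zeros(d + 1)
    I = np.eye(d + 1)
    for _ in range(max_iter):
        w_prev = w.copy()
        active = y * (X1 @ w) < 1               # points with y_i (w^T x_i) < 1
        Xa, ya = X1[active], y[active]
        v = Xa.T @ ya                           # v = sum of signed active points
        S = Xa.T @ Xa                           # scatter matrix of active points
        grad = (I + 2 * C * S) @ w - 2 * C * v  # gradient
        H = I + 2 * C * S                       # Hessian
        w = w - np.linalg.solve(H, grad)        # Newton step with eta_t = 1
        if np.linalg.norm(w - w_prev) <= eps:
            break
    return w
```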
Example 21.9 (Primal SVM). Figure 21.7 plots the hyperplanes obtained using the dual and primal approaches for the 2-dimensional Iris dataset comprising the sepal length versus sepal width attributes. We used $C = 1000$ and $\epsilon = 0.0001$ with the quadratic loss function. The dual solution $h_d$ (gray line) and the primal solution $h_p$ (thick black line) are essentially identical; they are as follows:
$$h_d(\mathbf{x}) : 7.47x_1 - 6.34x_2 - 19.89 = 0$$
$$h_p(\mathbf{x}) : 7.47x_1 - 6.34x_2 - 19.91 = 0$$
Primal Kernel SVMs
In the preceding discussion we considered the linear, nonseparable case for primal
SVM learning. We now generalize the primal approach to learn kernel-based SVMs,
again for quadratic loss.
Figure 21.7. SVM primal algorithm with linear kernel. [Figure: X1 versus X2 scatter plot showing the nearly identical hyperplanes h_d and h_p.]
Let $\phi$ denote a mapping from the input space to the feature space; each input point $\mathbf{x}_i$ is mapped to the feature point $\phi(\mathbf{x}_i)$. Let $K(\mathbf{x}_i,\mathbf{x}_j)$ denote the kernel function, and let $\mathbf{w}$ denote the weight vector in feature space. The hyperplane in feature space is then given as

$$h(\mathbf{x}) : \mathbf{w}^T\phi(\mathbf{x}) = 0$$
Using Eqs. (21.28) and (21.40), the primal objective function in feature space can be written as

$$\min_{\mathbf{w}} J(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}L(y_i, h(\mathbf{x}_i)) \quad (21.43)$$

where $L(y_i, h(\mathbf{x}_i)) = \max\{0,\ 1 - y_i h(\mathbf{x}_i)\}^k$ is the loss function.
The gradient at $\mathbf{w}$ is given as

$$\nabla_{\mathbf{w}} = \mathbf{w} + C\sum_{i=1}^{n}\frac{\partial L(y_i, h(\mathbf{x}_i))}{\partial h(\mathbf{x}_i)}\cdot\frac{\partial h(\mathbf{x}_i)}{\partial\mathbf{w}}$$

where

$$\frac{\partial h(\mathbf{x}_i)}{\partial\mathbf{w}} = \frac{\partial\,\mathbf{w}^T\phi(\mathbf{x}_i)}{\partial\mathbf{w}} = \phi(\mathbf{x}_i)$$
At the optimal solution, the gradient vanishes, that is, $\nabla_{\mathbf{w}} = \mathbf{0}$, which yields

$$\mathbf{w} = -C\sum_{i=1}^{n}\frac{\partial L(y_i, h(\mathbf{x}_i))}{\partial h(\mathbf{x}_i)}\cdot\phi(\mathbf{x}_i) = \sum_{i=1}^{n}\beta_i\,\phi(\mathbf{x}_i) \quad (21.44)$$

where $\beta_i$ is the coefficient of the point $\phi(\mathbf{x}_i)$ in feature space. In other words, the optimal weight vector in feature space is expressed as a linear combination of the points $\phi(\mathbf{x}_i)$ in feature space.
Using Eq. (21.44), the distance to the hyperplane in feature space can be expressed as

$$y_i h(\mathbf{x}_i) = y_i\,\mathbf{w}^T\phi(\mathbf{x}_i) = y_i\sum_{j=1}^{n}\beta_j K(\mathbf{x}_j,\mathbf{x}_i) = y_i\,\mathbf{K}_i^T\boldsymbol{\beta} \quad (21.45)$$

where $\mathbf{K} = \left\{K(\mathbf{x}_i,\mathbf{x}_j)\right\}_{i,j=1}^{n}$ is the $n \times n$ kernel matrix, $\mathbf{K}_i$ is the $i$th column of $\mathbf{K}$, and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_n)^T$ is the coefficient vector.
Plugging Eqs. (21.44) and (21.45) into Eq. (21.43), with quadratic loss ($k = 2$), yields the primal kernel SVM formulation purely in terms of the kernel matrix:

$$\min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\beta_i\beta_j K(\mathbf{x}_i,\mathbf{x}_j) + C\sum_{i=1}^{n}\max\left\{0,\ 1 - y_i\mathbf{K}_i^T\boldsymbol{\beta}\right\}^2 = \frac{1}{2}\boldsymbol{\beta}^T\mathbf{K}\boldsymbol{\beta} + C\sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1}\left(1 - y_i\mathbf{K}_i^T\boldsymbol{\beta}\right)^2$$
The gradient of $J(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ is given as

$$\begin{aligned}
\nabla_{\boldsymbol{\beta}} = \frac{\partial J(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} &= \mathbf{K}\boldsymbol{\beta} - 2C\sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1} y_i\mathbf{K}_i\left(1 - y_i\mathbf{K}_i^T\boldsymbol{\beta}\right) \\
&= \mathbf{K}\boldsymbol{\beta} + 2C\sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1}\left(\mathbf{K}_i\mathbf{K}_i^T\right)\boldsymbol{\beta} - 2C\sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1} y_i\mathbf{K}_i \\
&= (\mathbf{K} + 2C\,\mathbf{S})\boldsymbol{\beta} - 2C\,\mathbf{v}
\end{aligned}$$

where the vector $\mathbf{v} \in \mathbb{R}^n$ and the matrix $\mathbf{S} \in \mathbb{R}^{n\times n}$ are given as

$$\mathbf{v} = \sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1} y_i\mathbf{K}_i \qquad \mathbf{S} = \sum_{y_i\mathbf{K}_i^T\boldsymbol{\beta} < 1}\mathbf{K}_i\mathbf{K}_i^T$$
Furthermore, the Hessian matrix is given as

$$\mathbf{H}_{\boldsymbol{\beta}} = \frac{\partial\nabla_{\boldsymbol{\beta}}}{\partial\boldsymbol{\beta}} = \mathbf{K} + 2C\,\mathbf{S}$$

We can now minimize $J(\boldsymbol{\beta})$ by Newton optimization using the following update rule:

$$\boldsymbol{\beta}_{t+1} = \boldsymbol{\beta}_t - \eta_t\,\mathbf{H}_{\boldsymbol{\beta}}^{-1}\nabla_{\boldsymbol{\beta}}$$
ALGORITHM 21.3. Primal Kernel SVM Algorithm: Newton Optimization, Quadratic Loss

SVM-PRIMAL-KERNEL (D, K, C, ε):
 1  foreach x_i ∈ D do
 2      x_i ← (x_i^T, 1)^T  // map to R^{d+1}
 3  K ← {K(x_i, x_j)}_{i,j=1,...,n}  // compute kernel matrix
 4  t ← 0
 5  β_0 ← (0, ..., 0)^T  // initialize β_t ∈ R^n
 6  repeat
 7      v ← Σ_{y_i(K_i^T β_t) < 1} y_i K_i
 8      S ← Σ_{y_i(K_i^T β_t) < 1} K_i K_i^T
 9      ∇ ← (K + 2C·S) β_t − 2C·v  // gradient
10      H ← K + 2C·S  // Hessian
11      β_{t+1} ← β_t − η_t H^{−1} ∇  // Newton update rule
12      t ← t + 1
13  until ‖β_t − β_{t−1}‖ ≤ ε
Note that if $\mathbf{H}_{\boldsymbol{\beta}}$ is singular, that is, if it does not have an inverse, then we add a small ridge to the diagonal to regularize it. That is, we make $\mathbf{H}$ invertible as follows:

$$\mathbf{H}_{\boldsymbol{\beta}} = \mathbf{H}_{\boldsymbol{\beta}} + \lambda\mathbf{I}$$

where $\lambda > 0$ is some small positive ridge value.
Once $\boldsymbol{\beta}$ has been found, it is easy to classify any test point $\mathbf{z}$ as follows:

$$\hat{y} = \text{sign}\left(\mathbf{w}^T\phi(\mathbf{z})\right) = \text{sign}\left(\sum_{i=1}^{n}\beta_i\,\phi(\mathbf{x}_i)^T\phi(\mathbf{z})\right) = \text{sign}\left(\sum_{i=1}^{n}\beta_i K(\mathbf{x}_i,\mathbf{z})\right)$$
The Newton optimization algorithm for kernel SVM in the primal is given in Algorithm 21.3. The step size $\eta_t$ is set to 1 by default, as in the linear case. In each iteration, the method first computes the gradient and Hessian (lines 7–10). Next, the Newton update rule is used to obtain the updated coefficient vector $\boldsymbol{\beta}_{t+1}$ (line 11). The iterations continue until there is very little change in $\boldsymbol{\beta}$. The computational complexity of the method is $O(n^3)$ per iteration in the worst case.
Example 21.10 (Primal SVM: Quadratic Kernel). Figure 21.8 plots the hyperplanes obtained using the dual and primal approaches on the Iris dataset projected onto the first two principal components. The task is to separate iris versicolor from the others, the same as in Example 21.8. Because a linear kernel is not suitable for this task, we employ the quadratic kernel. We further set $C = 10$ and $\epsilon = 0.0001$, with the quadratic loss function.
Figure 21.8. SVM quadratic kernel: dual and primal. [Figure: u1 versus u2 scatter plot showing the boundaries h_d and h_p.]
The dual solution $h_d$ (black contours) and the primal solution $h_p$ (gray contours) are given as follows:

$$h_d(\mathbf{x}) : 1.4x_1^2 + 1.34x_1x_2 - 0.05x_1 + 0.66x_2^2 - 0.96x_2 - 2.66 = 0$$
$$h_p(\mathbf{x}) : 0.87x_1^2 + 0.64x_1x_2 - 0.5x_1 + 0.43x_2^2 - 1.04x_2 - 2.398 = 0$$

Although the solutions are not identical, they are close, especially on the left decision boundary.
21.6 FURTHER READING
The origins of support vector machines can be found in Vapnik (1982). In particular, it introduced the generalized portrait approach for constructing an optimal separating hyperplane. The use of the kernel trick for SVMs was introduced in Boser, Guyon, and Vapnik (1992), and the soft margin SVM approach for nonseparable data was proposed in Cortes and Vapnik (1995). For a good introduction to support vector machines, including implementation techniques, see Cristianini and Shawe-Taylor (2000) and Schölkopf and Smola (2002). The primal training approach described in this chapter is from Chapelle (2007).
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers." In Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM, pp. 144–152.
Chapelle, O. (2007). "Training a support vector machine in the primal." Neural Computation, 19(5): 1155–1178.
Cortes, C. and Vapnik, V. (1995). "Support-vector networks." Machine Learning, 20(3): 273–297.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge, MA: MIT Press.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data, vol. 41. New York: Springer-Verlag.
21.7 EXERCISES
Q1. Consider the dataset in Figure 21.9, which has points from two classes $c_1$ (triangles) and $c_2$ (circles). Answer the questions below.
(a) Find the equations for the two hyperplanes $h_1$ and $h_2$.
(b) Show all the support vectors for $h_1$ and $h_2$.
(c) Which of the two hyperplanes is better at separating the two classes based on the margin computation?
(d) Find the equation of the best separating hyperplane for this dataset, and show the corresponding support vectors. You can do this without having to solve the Lagrangian by considering the convex hull of each class and the possible hyperplanes at the boundary of the two classes.
Figure 21.9. Dataset for Q1. [Figure: points plotted on a 9 × 9 grid, with the hyperplanes h_1(x) = 0 and h_2(x) = 0 shown.]
CHAPTER 22
Classification Assessment
We have seen different classifiers in the preceding chapters, such as decision trees, full and naive Bayes classifiers, nearest neighbors classifier, support vector machines, and so on. In general, we may think of the classifier as a model or function $M$ that predicts the class label $\hat{y}$ for a given input example $\mathbf{x}$:

$$\hat{y} = M(\mathbf{x})$$

where $\mathbf{x} = (x_1, x_2, \dots, x_d)^T \in \mathbb{R}^d$ is a point in $d$-dimensional space and $\hat{y} \in \{c_1, c_2, \dots, c_k\}$ is its predicted class.
To build the classification model $M$ we need a training set of points along with their known classes. Different classifiers are obtained depending on the assumptions used to build the model $M$. For instance, support vector machines use the maximum margin hyperplane to construct $M$. On the other hand, the Bayes classifier directly computes the posterior probability $P(c_j|\mathbf{x})$ for each class $c_j$, and predicts the class of $\mathbf{x}$ as the one with the maximum posterior probability, $\hat{y} = \text{argmax}_{c_j} P(c_j|\mathbf{x})$.

Once the model $M$ has been trained, we assess its performance over a separate testing set of points for which we know the true classes. Finally, the model can be deployed to predict the class for future points whose class we typically do not know.

In this chapter we look at methods to assess a classifier, and to compare multiple classifiers. We start by defining metrics of classifier accuracy. We then discuss how to determine bounds on the expected error. We finally discuss how to assess the performance of classifiers and compare them.
22.1 CLASSIFICATION PERFORMANCE MEASURES
Let $\mathbf{D}$ be the testing set comprising $n$ points in a $d$-dimensional space, let $\{c_1, c_2, \dots, c_k\}$ denote the set of $k$ class labels, and let $M$ be a classifier. For $\mathbf{x}_i \in \mathbf{D}$, let $y_i$ denote its true class, and let $\hat{y}_i = M(\mathbf{x}_i)$ denote its predicted class.
Error Rate
The error rate is the fraction of incorrect predictions for the classifier over the testing set, defined as

$$\text{Error Rate} = \frac{1}{n}\sum_{i=1}^{n}I(y_i \ne \hat{y}_i) \quad (22.1)$$

where $I$ is an indicator function that has the value 1 when its argument is true, and 0 otherwise. Error rate is an estimate of the probability of misclassification. The lower the error rate the better the classifier.
Accuracy
The accuracy of a classifier is the fraction of correct predictions over the testing set:

$$\text{Accuracy} = \frac{1}{n}\sum_{i=1}^{n}I(y_i = \hat{y}_i) = 1 - \text{Error Rate} \quad (22.2)$$

Accuracy gives an estimate of the probability of a correct prediction; thus, the higher the accuracy, the better the classifier.
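In code, both measures are one-liners over arrays of true and predicted labels (a sketch with hypothetical label vectors):

```python
import numpy as np

y_true = np.array([0, 1, 2, 1, 0, 2])   # hypothetical true labels
y_pred = np.array([0, 1, 1, 1, 0, 2])   # hypothetical predictions

error_rate = np.mean(y_true != y_pred)  # Eq. (22.1)
accuracy = np.mean(y_true == y_pred)    # Eq. (22.2), equals 1 - error_rate
```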
Example 22.1. Figure 22.1 shows the 2-dimensional Iris dataset, with the two attributes being sepal length and sepal width. It has 150 points, and has three equal-sized classes: Iris-setosa ($c_1$; circles), Iris-versicolor ($c_2$; squares) and Iris-virginica ($c_3$; triangles). The dataset is partitioned into training and testing sets, in the ratio 80:20. Thus, the training set has 120 points (shown in light gray), and the testing set $\mathbf{D}$ has $n = 30$ points (shown in black).
Figure 22.1. Iris dataset: three classes. [Figure: X1 versus X2 scatter plot with class-specific normal density contours.]
One can see that whereas $c_1$ is well separated from the other classes, $c_2$ and $c_3$ are not easy to separate. In fact, some points are labeled as both $c_2$ and $c_3$ (e.g., the point $(6, 2.2)^T$ appears twice, labeled as $c_2$ and $c_3$).

We classify the test points using the full Bayes classifier (see Chapter 18). Each class is modeled using a single normal distribution, whose mean (in white) and density contours (corresponding to one and two standard deviations) are also plotted in Figure 22.1. The classifier misclassifies 8 out of the 30 test cases. Thus, we have

$$\text{Error Rate} = 8/30 = 0.267 \qquad \text{Accuracy} = 22/30 = 0.733$$
22.1.1 Contingency Table–based Measures
The error rate (and, thus also the accuracy) is a global measure in that it does not explicitly consider the classes that contribute to the error. More informative measures can be obtained by tabulating the class-specific agreement and disagreement between the true and predicted labels over the testing set. Let $\mathcal{D} = \{\mathbf{D}_1, \mathbf{D}_2, \dots, \mathbf{D}_k\}$ denote a partitioning of the testing points based on their true class labels, where

$$\mathbf{D}_j = \{\mathbf{x}_i \in \mathbf{D} \mid y_i = c_j\}$$

Let $n_i = |\mathbf{D}_i|$ denote the size of true class $c_i$.
Let $\mathcal{R} = \{\mathbf{R}_1, \mathbf{R}_2, \dots, \mathbf{R}_k\}$ denote a partitioning of the testing points based on the predicted labels, that is,

$$\mathbf{R}_j = \{\mathbf{x}_i \in \mathbf{D} \mid \hat{y}_i = c_j\}$$

Let $m_j = |\mathbf{R}_j|$ denote the size of the predicted class $c_j$.
$\mathcal{R}$ and $\mathcal{D}$ induce a $k \times k$ contingency table $\mathbf{N}$, also called a confusion matrix, defined as follows:

$$\mathbf{N}(i,j) = n_{ij} = \left|\mathbf{R}_i \cap \mathbf{D}_j\right| = \left|\{\mathbf{x}_a \in \mathbf{D} \mid \hat{y}_a = c_i \text{ and } y_a = c_j\}\right|$$

where $1 \le i,j \le k$. The count $n_{ij}$ denotes the number of points with predicted class $c_i$ whose true label is $c_j$. Thus, $n_{ii}$ (for $1 \le i \le k$) denotes the number of cases where the classifier agrees on the true label $c_i$. The remaining counts $n_{ij}$, with $i \ne j$, are cases where the classifier and true labels disagree.
Accuracy/Precision
The class-specific accuracy or precision of the classifier $M$ for class $c_i$ is given as the fraction of correct predictions over all points predicted to be in class $c_i$:

$$\text{acc}_i = \text{prec}_i = \frac{n_{ii}}{m_i}$$

where $m_i$ is the number of examples predicted as $c_i$ by classifier $M$. The higher the accuracy on class $c_i$ the better the classifier.
The overall precision or accuracy of the classifier is the weighted average of the class-specific accuracy:

$$\text{Accuracy} = \text{Precision} = \sum_{i=1}^{k}\frac{m_i}{n}\,\text{acc}_i = \frac{1}{n}\sum_{i=1}^{k}n_{ii}$$

This is identical to the expression in Eq. (22.2).
Coverage/Recall
The class-specific coverage or recall of $M$ for class $c_i$ is the fraction of correct predictions over all points in class $c_i$:

$$\text{coverage}_i = \text{recall}_i = \frac{n_{ii}}{n_i}$$

where $n_i$ is the number of points in class $c_i$. The higher the coverage the better the classifier.
F-measure
Often there is a trade-off between the precision and recall of a classifier. For example, it is easy to make $\text{recall}_i = 1$ by predicting all testing points to be in class $c_i$. However, in this case $\text{prec}_i$ will be low. On the other hand, we can make $\text{prec}_i$ very high by predicting only a few points as $c_i$, for instance, for those predictions where $M$ has the most confidence, but in this case $\text{recall}_i$ will be low. Ideally, we would like both precision and recall to be high.
The class-specific F-measure tries to balance the precision and recall values by computing their harmonic mean for class $c_i$:

$$F_i = \frac{2}{\frac{1}{\text{prec}_i} + \frac{1}{\text{recall}_i}} = \frac{2\cdot\text{prec}_i\cdot\text{recall}_i}{\text{prec}_i + \text{recall}_i} = \frac{2\,n_{ii}}{n_i + m_i}$$

The higher the $F_i$ value the better the classifier.

The overall F-measure for the classifier $M$ is the mean of the class-specific values:

$$F = \frac{1}{k}\sum_{i=1}^{k}F_i$$

For a perfect classifier, the maximum value of the F-measure is 1.
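Given a confusion matrix $\mathbf{N}$ with $\mathbf{N}[i,j] = n_{ij}$ (rows indexed by predicted class, columns by true class), the class-specific measures fall out directly; a sketch, using Table 22.1 below as input:

```python
import numpy as np

def class_measures(N):
    """Per-class precision, recall, and F-measure from a k x k
    confusion matrix with N[i, j] = n_ij (rows: predicted, cols: true)."""
    m = N.sum(axis=1)              # m_i: points predicted as class c_i
    n_true = N.sum(axis=0)         # n_i: points truly in class c_i
    diag = np.diag(N)              # n_ii: correct predictions per class
    prec = diag / m
    recall = diag / n_true
    F = 2 * diag / (n_true + m)    # harmonic mean of prec and recall
    return prec, recall, F

# Table 22.1 from the text
N = np.array([[10, 0, 0], [0, 7, 5], [0, 3, 5]])
prec, recall, F = class_measures(N)
print(prec, recall, F, F.mean())   # F.mean() is the overall F-measure
```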
Example 22.2. Consider the 2-dimensional Iris dataset shown in Figure 22.1. In Example 22.1 we saw that the error rate was 26.7%. However, the error rate measure does not give much information about the classes or instances that are more difficult to classify. From the class-specific normal distribution in the figure, it is clear that the Bayes classifier should perform well for $c_1$, but it is likely to have problems discriminating some test cases that lie close to the decision boundary between $c_2$ and $c_3$. This information is better captured by the confusion matrix obtained on the testing set, as shown in Table 22.1. We can observe that all 10 points in $c_1$ are classified correctly. However, only 7 out of the 10 for $c_2$ and 5 out of the 10 for $c_3$ are classified correctly.
Table 22.1. Contingency table for Iris dataset: testing set

                                         True
Predicted              Iris-setosa (c1)  Iris-versicolor (c2)  Iris-virginica (c3)
Iris-setosa (c1)             10                  0                     0            m1 = 10
Iris-versicolor (c2)          0                  7                     5            m2 = 12
Iris-virginica (c3)           0                  3                     5            m3 = 8
                          n1 = 10             n2 = 10               n3 = 10         n = 30
From the confusion matrix we can compute the class-specific precision (or accuracy) values:

$$\text{prec}_1 = \frac{n_{11}}{m_1} = 10/10 = 1.0 \qquad \text{prec}_2 = \frac{n_{22}}{m_2} = 7/12 = 0.583 \qquad \text{prec}_3 = \frac{n_{33}}{m_3} = 5/8 = 0.625$$

The overall accuracy tallies with that reported in Example 22.1:

$$\text{Accuracy} = \frac{n_{11} + n_{22} + n_{33}}{n} = \frac{10 + 7 + 5}{30} = 22/30 = 0.733$$

The class-specific recall (or coverage) values are given as

$$\text{recall}_1 = \frac{n_{11}}{n_1} = 10/10 = 1.0 \qquad \text{recall}_2 = \frac{n_{22}}{n_2} = 7/10 = 0.7 \qquad \text{recall}_3 = \frac{n_{33}}{n_3} = 5/10 = 0.5$$

From these we can compute the class-specific F-measure values:

$$F_1 = \frac{2n_{11}}{n_1 + m_1} = 20/20 = 1.0 \qquad F_2 = \frac{2n_{22}}{n_2 + m_2} = 14/22 = 0.636 \qquad F_3 = \frac{2n_{33}}{n_3 + m_3} = 10/18 = 0.556$$

Thus, the overall F-measure for the classifier is

$$F = \frac{1}{3}(1.0 + 0.636 + 0.556) = \frac{2.192}{3} = 0.731$$
Table 22.2. Confusion matrix for two classes

                            True Class
Predicted Class     Positive (c1)          Negative (c2)
Positive (c1)       True Positive (TP)     False Positive (FP)
Negative (c2)       False Negative (FN)    True Negative (TN)
22.1.2 Binary Classification: Positive and Negative Class
When there are only $k = 2$ classes, we call class $c_1$ the positive class and $c_2$ the negative class. The entries of the resulting $2 \times 2$ confusion matrix, shown in Table 22.2, are given special names, as follows:

• True Positives (TP): The number of points that the classifier correctly predicts as positive: $TP = n_{11} = \left|\{\mathbf{x}_i \mid \hat{y}_i = y_i = c_1\}\right|$
• False Positives (FP): The number of points the classifier predicts to be positive, which in fact belong to the negative class: $FP = n_{12} = \left|\{\mathbf{x}_i \mid \hat{y}_i = c_1 \text{ and } y_i = c_2\}\right|$
• False Negatives (FN): The number of points the classifier predicts to be in the negative class, which in fact belong to the positive class: $FN = n_{21} = \left|\{\mathbf{x}_i \mid \hat{y}_i = c_2 \text{ and } y_i = c_1\}\right|$
• True Negatives (TN): The number of points that the classifier correctly predicts as negative: $TN = n_{22} = \left|\{\mathbf{x}_i \mid \hat{y}_i = y_i = c_2\}\right|$
Error Rate
The error rate [Eq. (22.1)] for the binary classification case is given as the fraction of mistakes (or false predictions):

$$\text{Error Rate} = \frac{FP + FN}{n}$$

Accuracy
The accuracy [Eq. (22.2)] is the fraction of correct predictions:

$$\text{Accuracy} = \frac{TP + TN}{n}$$

The above are global measures of classifier performance. We can obtain class-specific measures as follows.
Class-specific Precision
The precision for the positive and negative class is given as

$$\text{prec}_P = \frac{TP}{TP + FP} = \frac{TP}{m_1} \qquad \text{prec}_N = \frac{TN}{TN + FN} = \frac{TN}{m_2}$$

where $m_i = |\mathbf{R}_i|$ is the number of points predicted by $M$ as having class $c_i$.

Sensitivity: True Positive Rate
The true positive rate, also called sensitivity, is the fraction of correct predictions with respect to all points in the positive class, that is, it is simply the recall for the positive class:

$$TPR = \text{recall}_P = \frac{TP}{TP + FN} = \frac{TP}{n_1}$$

where $n_1$ is the size of the positive class.

Specificity: True Negative Rate
The true negative rate, also called specificity, is simply the recall for the negative class:

$$TNR = \text{specificity} = \text{recall}_N = \frac{TN}{FP + TN} = \frac{TN}{n_2}$$

where $n_2$ is the size of the negative class.

False Negative Rate
The false negative rate is defined as

$$FNR = \frac{FN}{TP + FN} = \frac{FN}{n_1} = 1 - \text{sensitivity}$$

False Positive Rate
The false positive rate is defined as

$$FPR = \frac{FP}{FP + TN} = \frac{FP}{n_2} = 1 - \text{specificity}$$
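All of these measures follow mechanically from the four counts of the $2 \times 2$ confusion matrix; a sketch, checked against Table 22.3 below:

```python
def binary_measures(TP, FP, FN, TN):
    """Binary classification measures from the 2 x 2 confusion matrix."""
    n = TP + FP + FN + TN
    return {
        "error_rate":  (FP + FN) / n,
        "accuracy":    (TP + TN) / n,
        "prec_P":      TP / (TP + FP),
        "prec_N":      TN / (TN + FN),
        "sensitivity": TP / (TP + FN),   # TPR = recall_P
        "specificity": TN / (FP + TN),   # TNR = recall_N
        "FNR":         FN / (TP + FN),   # 1 - sensitivity
        "FPR":         FP / (FP + TN),   # 1 - specificity
    }

# Table 22.3 from the text: TP = 7, FP = 7, FN = 3, TN = 13
print(binary_measures(7, 7, 3, 13))
```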
Example 22.3. Consider the Iris dataset projected onto its first two principal components, as shown in Figure 22.2. The task is to separate Iris-versicolor (class $c_1$; in circles) from the other two Irises (class $c_2$; in triangles). The points from class $c_1$ lie in-between the points from class $c_2$, making this a hard problem for (linear) classification. The dataset has been randomly split into 80% training (in gray) and 20% testing points (in black). Thus, the training set has 120 points and the testing set has $n = 30$ points.
Figure 22.2. Iris principal component dataset: training and testing sets. [Figure: u1 versus u2 scatter plot with class-specific normal density contours.]
Applying the naive Bayes classifier (with one normal per class) on the training set yields the following estimates for the mean, covariance matrix, and prior probability for each class:

$$\hat{P}(c_1) = 40/120 = 0.33 \qquad \hat{P}(c_2) = 80/120 = 0.67$$
$$\hat{\boldsymbol{\mu}}_1 = (-0.641, -0.204)^T \qquad \hat{\boldsymbol{\mu}}_2 = (0.27, 0.14)^T$$
$$\hat{\boldsymbol{\Sigma}}_1 = \begin{pmatrix} 0.29 & 0 \\ 0 & 0.18 \end{pmatrix} \qquad \hat{\boldsymbol{\Sigma}}_2 = \begin{pmatrix} 6.14 & 0 \\ 0 & 0.206 \end{pmatrix}$$

The mean (in white) and the contour plot of the normal distribution for each class are also shown in the figure; the contours are shown for one and two standard deviations along each axis.

For each of the 30 testing points, we classify them using the above parameter estimates (see Chapter 18). The naive Bayes classifier misclassified 10 out of the 30 test instances, resulting in an error rate and accuracy of

$$\text{Error Rate} = 10/30 = 0.33 \qquad \text{Accuracy} = 20/30 = 0.67$$

The confusion matrix for this binary classification problem is shown in Table 22.3. From this table, we can compute the various performance measures:

$$\text{prec}_P = \frac{TP}{TP + FP} = \frac{7}{14} = 0.5$$
Table 22.3. Iris PC dataset: contingency table for binary classification

                          True
Predicted         Positive (c1)   Negative (c2)
Positive (c1)        TP = 7          FP = 7       m1 = 14
Negative (c2)        FN = 3          TN = 13      m2 = 16
                     n1 = 10         n2 = 20      n = 30
$$\text{prec}_N = \frac{TN}{TN + FN} = \frac{13}{16} = 0.8125$$
$$\text{recall}_P = \text{sensitivity} = TPR = \frac{TP}{TP + FN} = \frac{7}{10} = 0.7$$
$$\text{recall}_N = \text{specificity} = TNR = \frac{TN}{TN + FP} = \frac{13}{20} = 0.65$$
$$FNR = 1 - \text{sensitivity} = 1 - 0.7 = 0.3$$
$$FPR = 1 - \text{specificity} = 1 - 0.65 = 0.35$$

We can observe that the precision for the positive class is rather low. The true positive rate is also low, and the false positive rate is relatively high. Thus, the naive Bayes classifier is not particularly effective on this testing dataset.
22.1.3 ROC Analysis
Receiver Operating Characteristic (ROC) analysis is a popular strategy for assessing the performance of classifiers when there are two classes. ROC analysis requires that a classifier output a score value for the positive class for each point in the testing set. These scores can then be used to order points in decreasing order. For instance, we can use the posterior probability $P(c_1|\mathbf{x}_i)$ as the score, for example, for the Bayes classifiers. For SVM classifiers, we can use the signed distance from the hyperplane as the score because large positive distances are high confidence predictions for $c_1$, and large negative distances are very low confidence predictions for $c_1$ (they are, in fact, high confidence predictions for the negative class $c_2$).

Typically, a binary classifier chooses some positive score threshold $\rho$, and classifies all points with score above $\rho$ as positive, with the remaining points classified as negative. However, such a threshold is likely to be somewhat arbitrary. Instead, ROC analysis plots the performance of the classifier over all possible values of the threshold parameter $\rho$. In particular, for each value of $\rho$, it plots the false positive rate (1-specificity) on the $x$-axis versus the true positive rate (sensitivity) on the $y$-axis. The resulting plot is called the ROC curve or ROC plot for the classifier.
Let $S(\mathbf{x}_i)$ denote the real-valued score for the positive class output by a classifier $M$ for the point $\mathbf{x}_i$. Let the maximum and minimum score thresholds observed on testing dataset $\mathbf{D}$ be as follows:

$$\rho^{\min} = \min_i\{S(\mathbf{x}_i)\} \qquad \rho^{\max} = \max_i\{S(\mathbf{x}_i)\}$$
Table 22.4. Different cases for 2 × 2 confusion matrix

(a) Initial: all negative     (b) Final: all positive     (c) Ideal classifier
           True                          True                         True
Predicted  Pos  Neg           Predicted  Pos  Neg          Predicted  Pos  Neg
Pos         0    0            Pos        TP   FP           Pos        TP    0
Neg        FN   TN            Neg         0    0           Neg         0   TN
Initially, we classify all points as negative. Both $TP$ and $FP$ are thus initially zero (as shown in Table 22.4a), resulting in $TPR$ and $FPR$ rates of zero, which correspond to the point $(0, 0)$ at the lower left corner in the ROC plot. Next, for each distinct value of $\rho$ in the range $[\rho^{\min}, \rho^{\max}]$, we tabulate the set of positive points:

$$\mathbf{R}_1(\rho) = \{\mathbf{x}_i \in \mathbf{D} : S(\mathbf{x}_i) > \rho\}$$

and we compute the corresponding true and false positive rates, to obtain a new point in the ROC plot. Finally, in the last step, we classify all points as positive. Both $FN$ and $TN$ are thus zero (as shown in Table 22.4b), resulting in $TPR$ and $FPR$ values of 1. This results in the point $(1, 1)$ at the top right-hand corner in the ROC plot. An ideal classifier corresponds to the top left point $(0, 1)$, which corresponds to the case $FPR = 0$ and $TPR = 1$, that is, the classifier has no false positives, and identifies all true positives (as a consequence, it also correctly predicts all the points in the negative class). This case is shown in Table 22.4c. As such, a ROC curve indicates the extent to which the classifier ranks positive instances higher than the negative instances. An ideal classifier should score all positive points higher than any negative point. Thus, a classifier with a curve closer to the ideal case, that is, closer to the upper left corner, is a better classifier.
Area Under ROC Curve
The area under the ROC curve, abbreviated AUC, can be used as a measure of classifier performance. Because the total area of the plot is 1, the AUC lies in the interval $[0, 1]$ – the higher the better. The AUC value is essentially the probability that the classifier will rank a random positive test case higher than a random negative test instance.
ROC/AUC Algorithm
Algorithm 22.1 shows the steps for plotting a ROC curve, and for computing the area under the curve. It takes as input the testing set $\mathbf{D}$, and the classifier $M$. The first step is to predict the score $S(\mathbf{x}_i)$ for the positive class ($c_1$) for each test point $\mathbf{x}_i \in \mathbf{D}$. Next, we sort the $(S(\mathbf{x}_i), y_i)$ pairs, that is, the score and the true class pairs, in decreasing order of the scores (line 3). Initially, we set the positive score threshold $\rho = \infty$ (line 7). The foreach loop (line 8) examines each pair $(S(\mathbf{x}_i), y_i)$ in sorted order, and for each distinct value of the score, it sets $\rho = S(\mathbf{x}_i)$ and plots the point

$$(FPR, TPR) = \left(\frac{FP}{n_2}, \frac{TP}{n_1}\right)$$
ALGORITHM 22.1. ROC Curve and Area under the Curve

ROC-CURVE (D, M):
 1  n_1 ← |{x_i ∈ D | y_i = c_1}|  // size of positive class
 2  n_2 ← |{x_i ∈ D | y_i = c_2}|  // size of negative class
    // classify, score, and sort all test points
 3  L ← sort the set {(S(x_i), y_i) : x_i ∈ D} by decreasing scores
 4  FP ← TP ← 0
 5  FP_prev ← TP_prev ← 0
 6  AUC ← 0
 7  ρ ← ∞
 8  foreach (S(x_i), y_i) ∈ L do
 9      if ρ > S(x_i) then
10          plot point (FP/n_2, TP/n_1)
11          AUC ← AUC + TRAPEZOID-AREA((FP_prev/n_2, TP_prev/n_1), (FP/n_2, TP/n_1))
12          ρ ← S(x_i)
13          FP_prev ← FP
14          TP_prev ← TP
15      if y_i = c_1 then TP ← TP + 1
16      else FP ← FP + 1
17  plot point (FP/n_2, TP/n_1)
18  AUC ← AUC + TRAPEZOID-AREA((FP_prev/n_2, TP_prev/n_1), (FP/n_2, TP/n_1))

TRAPEZOID-AREA ((x_1, y_1), (x_2, y_2)):
19  b ← |x_2 − x_1|  // base of trapezoid
20  h ← (1/2)(y_2 + y_1)  // average height of trapezoid
21  return (b · h)
As each test point is examined, the true and false positive values are adjusted based on the true class $y_i$ for the test point $\mathbf{x}_i$. If $y_i = c_1$, we increment the true positives; otherwise, we increment the false positives (lines 15–16). At the end of the foreach loop we plot the final point in the ROC curve (line 17).
The AUC value is computed as each new point is added to the ROC plot. The algorithm maintains the previous values of the false and true positives, $FP_{prev}$ and $TP_{prev}$, for the previous score threshold $\rho$. Given the current $FP$ and $TP$ values, we compute the area under the curve defined by the four points

$$(x_1, y_1) = \left(\frac{FP_{prev}}{n_2}, \frac{TP_{prev}}{n_1}\right) \qquad (x_2, y_2) = \left(\frac{FP}{n_2}, \frac{TP}{n_1}\right)$$
$$(x_1, 0) = \left(\frac{FP_{prev}}{n_2}, 0\right) \qquad (x_2, 0) = \left(\frac{FP}{n_2}, 0\right)$$

These four points define a trapezoid whenever $x_2 > x_1$ and $y_2 > y_1$; otherwise, they define a rectangle (which may be degenerate, with zero area).
Table 22.5. Sorted scores and true classes

S(x_i): 0.93  0.82  0.80  0.77  0.74  0.71  0.69  0.67  0.66  0.61
y_i:     c2    c1    c2    c1    c1    c1    c2    c1    c2    c2

S(x_i): 0.59  0.55  0.55  0.53  0.47  0.30  0.26  0.11  0.04  2.97e-03
y_i:     c2    c2    c1    c1    c1    c1    c1    c2    c2    c2

S(x_i): 1.28e-03  2.55e-07  6.99e-08  3.11e-08  3.109e-08
y_i:     c2        c2        c2        c2        c2

S(x_i): 1.53e-08  9.76e-09  2.08e-09  1.95e-09  7.83e-10
y_i:     c2        c2        c2        c2        c2
The function TRAPEZOID-AREA computes the area under the trapezoid, which is given as $b \cdot h$, where $b = |x_2 - x_1|$ is the length of the base of the trapezoid and $h = \frac{1}{2}(y_2 + y_1)$ is the average height of the trapezoid.
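A direct transcription of Algorithm 22.1 might look as follows (a sketch that collects the ROC points in a list instead of plotting them, assuming labels encoded in {+1, −1}):

```python
import numpy as np

def trapezoid_area(x1, y1, x2, y2):
    return abs(x2 - x1) * 0.5 * (y2 + y1)     # base times average height

def roc_curve_auc(scores, y):
    """ROC points and AUC via Algorithm 22.1. scores: (n,) positive-class
    scores; y: (n,) labels in {+1, -1}. Returns (points, AUC)."""
    n1, n2 = np.sum(y == 1), np.sum(y == -1)
    order = np.argsort(-scores)               # decreasing scores
    fp = tp = fp_prev = tp_prev = 0
    auc, rho = 0.0, np.inf
    points = []
    for i in order:
        if rho > scores[i]:                   # new distinct threshold
            points.append((fp / n2, tp / n1))
            auc += trapezoid_area(fp_prev / n2, tp_prev / n1, fp / n2, tp / n1)
            rho, fp_prev, tp_prev = scores[i], fp, tp
        if y[i] == 1:
            tp += 1
        else:
            fp += 1
    points.append((fp / n2, tp / n1))         # final point (1, 1)
    auc += trapezoid_area(fp_prev / n2, tp_prev / n1, fp / n2, tp / n1)
    return points, auc
```

Run on the five scores of Example 22.5 below, this returns an AUC of 0.833, matching the hand computation.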
Example 22.4. Consider the binary classification problem from Example 22.3 for the Iris principal components dataset. The test dataset $\mathbf{D}$ has $n = 30$ points, with $n_1 = 10$ points in the positive class and $n_2 = 20$ points in the negative class.

We use the naive Bayes classifier to compute the probability that each test point belongs to the positive class ($c_1$; iris-versicolor). The score of the classifier for test point $\mathbf{x}_i$ is therefore $S(\mathbf{x}_i) = P(c_1|\mathbf{x}_i)$. The sorted scores (in decreasing order) along with the true class labels are shown in Table 22.5.

The ROC curve for the test dataset is shown in Figure 22.3. Consider the positive score threshold $\rho = 0.71$. If we classify all points with a score above this value as positive, then we have the following counts for the true and false positives:

$$TP = 3 \qquad FP = 2$$

The false positive rate is therefore $FP/n_2 = 2/20 = 0.1$, and the true positive rate is $TP/n_1 = 3/10 = 0.3$. This corresponds to the point $(0.1, 0.3)$ in the ROC curve. Other points on the ROC curve are obtained in a similar manner, as shown in Figure 22.3. The total area under the curve is 0.775.
Example 22.5 (AUC). To see why we need to account for trapezoids when computing the AUC, consider the following sorted scores, along with the true class, for some testing dataset with $n = 5$, $n_1 = 3$ and $n_2 = 2$:

$$(0.9, c_1),\ (0.8, c_2),\ (0.8, c_1),\ (0.8, c_1),\ (0.1, c_2)$$
Figure 22.3. ROC plot for Iris principal components dataset. The ROC curves for the naive Bayes (black) and random (gray) classifiers are shown. [Figure: false positive rate versus true positive rate.]
Figure 22.4. ROC plot and AUC: trapezoid region. [Figure: false positive rate versus true positive rate, with the shaded area under the curve.]
Algorithm 22.1 yields the following points that are added to the ROC plot, along with the running AUC:

ρ     FP   TP   (FPR, TPR)    AUC
∞      0    0   (0, 0)        0
0.9    0    1   (0, 0.333)    0
0.8    1    3   (0.5, 1)      0.333
0.1    2    3   (1, 1)        0.833
Figure 22.4 shows the ROC plot, with the shaded region representing the AUC. We can observe that a trapezoid is obtained whenever there is at least one positive and one negative point with the same score. The total AUC is 0.833, obtained as the sum of the trapezoidal region on the left (0.333) and the rectangular region on the right (0.5).
Random Classifier
It is interesting to note that a random classifier corresponds to a diagonal line in the ROC plot. To see this, think of a classifier that randomly guesses the class of a point as positive half the time, and negative the other half. We then expect that half of the true positives and true negatives will be identified correctly, resulting in the point $(TPR, FPR) = (0.5, 0.5)$ for the ROC plot. If, on the other hand, the classifier guesses the class of a point as positive 90% of the time and as negative 10% of the time, then we expect 90% of the true positives and 10% of the true negatives to be labeled correctly, resulting in $TPR = 0.9$ and $FPR = 1 - TNR = 1 - 0.1 = 0.9$, that is, we get the point $(0.9, 0.9)$ in the ROC plot. In general, any fixed probability of prediction, say $r$, for the positive class yields the point $(r, r)$ in ROC space. The diagonal line thus represents the performance of a random classifier, over all possible positive class prediction thresholds $r$. It follows that if the ROC curve for any classifier is below the diagonal, it indicates performance worse than random guessing. For such cases, inverting the class assignment will produce a better classifier. As a consequence of the diagonal ROC curve, the AUC value for a random classifier is 0.5. Thus, if any classifier has an AUC value less than 0.5, that also indicates performance worse than random.
Example 22.6. In addition to the ROC curve for the naive Bayes classifier, Figure 22.3 also shows the ROC plot for the random classifier (the diagonal line in gray). We can see that the ROC curve for the naive Bayes classifier is much better than random. Its AUC value is 0.775, which is much better than the 0.5 AUC for a random classifier. However, at the very beginning naive Bayes performs worse than the random classifier because the highest scored point is from the negative class. As such, the ROC curve should be considered as a discrete approximation of a smooth curve that would be obtained for a very large (infinite) testing dataset.
Class Imbalance
It is worth remarking that ROC curves are insensitive to class skew. This is because the $TPR$, interpreted as the probability of predicting a positive point as positive, and the $FPR$, interpreted as the probability of predicting a negative point as positive, do not depend on the ratio of the positive to negative class size. This is a desirable property, since the ROC curve will essentially remain the same whether the classes are balanced (have relatively the same number of points) or skewed (when one class has many more points than the other).
22.2 CLASSIFIER EVALUATION
In this section we discuss how to evaluate a classifier $M$ using some performance measure $\theta$. Typically, the input dataset $\mathbf{D}$ is randomly split into a disjoint training set and testing set. The training set is used to learn the model $M$, and the testing set is used to evaluate the measure $\theta$. However, how confident can we be about the classification performance? The results may be due to an artifact of the random split; for example, by random chance the testing set may have particularly easy (or hard) to classify points, leading to good (or poor) classifier performance. As such, a fixed, pre-defined partitioning of the dataset is not a good strategy for evaluating classifiers. Also note that, in general, $\mathbf{D}$ is itself a $d$-dimensional multivariate random sample drawn from the true (unknown) joint probability density function $f(\mathbf{x})$ that represents the population of interest. Ideally, we would like to know the expected value $E[\theta]$ of the performance measure over all possible testing sets drawn from $f$. However, because $f$ is unknown, we have to estimate $E[\theta]$ from $\mathbf{D}$. Cross-validation and resampling are two common approaches to compute the expected value and variance of a given performance measure; we discuss these methods in the following sections.
22.2.1 K-fold Cross-Validation
Cross-validation divides the dataset $\mathbf{D}$ into $K$ equal-sized parts, called folds, namely $\mathbf{D}_1, \mathbf{D}_2, \dots, \mathbf{D}_K$. Each fold $\mathbf{D}_i$ is, in turn, treated as the testing set, with the remaining folds comprising the training set $\mathbf{D}\setminus\mathbf{D}_i = \bigcup_{j \ne i}\mathbf{D}_j$. After training the model $M_i$ on $\mathbf{D}\setminus\mathbf{D}_i$, we assess its performance on the testing set $\mathbf{D}_i$ to obtain the $i$th estimate $\theta_i$.
The expected value of the performance measure can then be estimated as

$$\hat{\mu}_\theta = E[\theta] = \frac{1}{K}\sum_{i=1}^{K}\theta_i \quad (22.3)$$

and its variance as

$$\hat{\sigma}_\theta^2 = \frac{1}{K}\sum_{i=1}^{K}(\theta_i - \hat{\mu}_\theta)^2 \quad (22.4)$$
Algorithm 22.2 shows the pseudo-code for $K$-fold cross-validation. After randomly shuffling the dataset $\mathbf{D}$, we partition it into $K$ equal folds (except for possibly the last one). Next, each fold $\mathbf{D}_i$ is used as the testing set on which we assess the performance $\theta_i$ of the classifier $M_i$ trained on $\mathbf{D}\setminus\mathbf{D}_i$. The estimated mean and variance of $\theta$ can then be reported. Note that the $K$-fold cross-validation can be repeated multiple times; the initial random shuffling ensures that the folds are different each time.

Usually $K$ is chosen to be 5 or 10. The special case, when $K = n$, is called leave-one-out cross-validation, where the testing set comprises a single point and the remaining data is used for training purposes.
ALGORITHM 22.2. K-fold Cross-Validation

CROSS-VALIDATION (K, D):
 1  D ← randomly shuffle D
 2  {D_1, D_2, ..., D_K} ← partition D in K equal parts
 3  foreach i ∈ [1, K] do
 4      M_i ← train classifier on D \ D_i
 5      θ_i ← assess M_i on D_i
 6  µ̂_θ = (1/K) Σ_{i=1}^{K} θ_i
 7  σ̂²_θ = (1/K) Σ_{i=1}^{K} (θ_i − µ̂_θ)²
 8  return µ̂_θ, σ̂²_θ
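In code, the procedure might look as follows (a sketch; `train` and `assess` are hypothetical stand-ins for any classifier-specific training and evaluation routines):

```python
import numpy as np

def cross_validation(K, X, y, train, assess, rng=np.random.default_rng()):
    """K-fold cross-validation (Algorithm 22.2). train(X, y) returns a
    model; assess(model, X, y) returns a performance measure theta."""
    n = len(y)
    idx = rng.permutation(n)                  # randomly shuffle D
    folds = np.array_split(idx, K)            # partition into K parts
    thetas = []
    for i in range(K):
        test = folds[i]                       # fold D_i is the test set
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train(X[train_idx], y[train_idx])
        thetas.append(assess(model, X[test], y[test]))
    thetas = np.array(thetas)
    return thetas.mean(), thetas.var()        # Eqs. (22.3) and (22.4)
```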
Example 22.7. Consider the 2-dimensional Iris dataset from Example 22.1 with $k = 3$ classes. We assess the error rate of the full Bayes classifier via 5-fold cross-validation, obtaining the following error rates when testing on each fold:

$$\theta_1 = 0.267 \quad \theta_2 = 0.133 \quad \theta_3 = 0.233 \quad \theta_4 = 0.367 \quad \theta_5 = 0.167$$

Using Eqs. (22.3) and (22.4), the mean and variance for the error rate are as follows:

$$\hat{\mu}_\theta = \frac{1.167}{5} = 0.233 \qquad \hat{\sigma}_\theta^2 = 0.00833$$

We can repeat the whole cross-validation approach multiple times, with a different permutation of the input points, and then we can compute the mean of the average error rate, and the mean of the variance. Performing ten 5-fold cross-validation runs for the Iris dataset results in the mean of the expected error rate as 0.232, and the mean of the variance as 0.00521, with the variance in both these estimates being less than $10^{-3}$.
22.2.2 Bootstrap Resampling
Another approach to estimate the expected performance of a classifier is to use the bootstrap resampling method. Instead of partitioning the input dataset $\mathbf{D}$ into disjoint folds, the bootstrap method draws $K$ random samples of size $n$ with replacement from $\mathbf{D}$. Each sample $\mathbf{D}_i$ is thus the same size as $\mathbf{D}$, and has several repeated points. Consider the probability that a point $\mathbf{x}_j \in \mathbf{D}$ is not selected for the $i$th bootstrap sample $\mathbf{D}_i$. Due to sampling with replacement, the probability that a given point is selected is given as $p = \frac{1}{n}$, and thus the probability that it is not selected is

$$q = 1 - p = 1 - \frac{1}{n}$$

Because $\mathbf{D}_i$ has $n$ points, the probability that $\mathbf{x}_j$ is not selected even after $n$ tries is given as

$$P(\mathbf{x}_j \notin \mathbf{D}_i) = q^n = \left(1 - \frac{1}{n}\right)^n \simeq e^{-1} = 0.368$$
ALGORITHM 22.3. Bootstrap Resampling Method

BOOTSTRAP-RESAMPLING (K, D):
 1  for i ∈ [1, K] do
 2      D_i ← sample of size n with replacement from D
 3      M_i ← train classifier on D_i
 4      θ_i ← assess M_i on D
 5  µ̂_θ = (1/K) Σ_{i=1}^{K} θ_i
 6  σ̂²_θ = (1/K) Σ_{i=1}^{K} (θ_i − µ̂_θ)²
 7  return µ̂_θ, σ̂²_θ
On the other hand, the probability that $\mathbf{x}_j \in \mathbf{D}_i$ is given as

$$P(\mathbf{x}_j \in \mathbf{D}_i) = 1 - P(\mathbf{x}_j \notin \mathbf{D}_i) = 1 - 0.368 = 0.632$$

This means that each bootstrap sample contains approximately 63.2% of the points from $\mathbf{D}$.
The bootstrap samples can be used to evaluate the classifier by training it on each of the samples $\mathbf{D}_i$ and then using the full input dataset $\mathbf{D}$ as the testing set, as shown in Algorithm 22.3. The expected value and variance of the performance measure $\theta$ can be obtained using Eqs. (22.3) and (22.4). However, it should be borne in mind that the estimates will be somewhat optimistic owing to the fairly large overlap between the training and testing datasets (63.2%). The cross-validation approach does not suffer from this limitation because it keeps the training and testing sets disjoint.
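A matching sketch of Algorithm 22.3, reusing the hypothetical `train` and `assess` callbacks from the cross-validation sketch:

```python
import numpy as np

def bootstrap_resampling(K, X, y, train, assess, rng=np.random.default_rng()):
    """Bootstrap resampling (Algorithm 22.3): train on each bootstrap
    sample, assess on the full dataset D."""
    n = len(y)
    thetas = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)    # sample of size n with replacement
        model = train(X[idx], y[idx])
        thetas.append(assess(model, X, y))  # test on all of D
    thetas = np.array(thetas)
    return thetas.mean(), thetas.var()
```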
Example 22.8. We continue with the Iris dataset from Example 22.7. However, we now apply bootstrap sampling to estimate the error rate for the full Bayes classifier, using $K = 50$ samples. The sampling distribution of error rates is shown in Figure 22.5.

Figure 22.5. Sampling distribution of error rates. [Figure: histogram of error rate (0.18 to 0.27) versus frequency.]
The expected value and variance of the error rate are

$$\hat{\mu}_\theta = 0.213 \qquad \hat{\sigma}_\theta^2 = 4.815 \times 10^{-4}$$

Due to the overlap between the training and testing sets, the estimates are more optimistic (i.e., lower) compared to those obtained via cross-validation in Example 22.7, where we had $\hat{\mu}_\theta = 0.233$ and $\hat{\sigma}_\theta^2 = 0.00833$.
22.2.3 Confidence Intervals
Having estimated the expected value and variance for a chosen performance measure, we would like to derive confidence bounds on how much the estimate may deviate from the true value.

To answer this question we make use of the central limit theorem, which states that the sum of a large number of independent and identically distributed (IID) random variables has approximately a normal distribution, regardless of the distribution of the individual random variables. More formally, let $\theta_1, \theta_2, \dots, \theta_K$ be a sequence of IID random variables, representing, for example, the error rate or some other performance measure over the $K$ folds in cross-validation or $K$ bootstrap samples. Assume that each $\theta_i$ has a finite mean $E[\theta_i] = \mu$ and finite variance $\text{var}(\theta_i) = \sigma^2$.

Let $\hat{\mu}$ denote the sample mean:

$$\hat{\mu} = \frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)$$
By linearity of expectation, we have

$$E[\hat{\mu}] = E\left[\frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)\right] = \frac{1}{K}\sum_{i=1}^{K}E[\theta_i] = \frac{1}{K}(K\mu) = \mu$$

Utilizing the linearity of variance for independent random variables, and noting that $\text{var}(aX) = a^2\,\text{var}(X)$ for $a \in \mathbb{R}$, the variance of $\hat{\mu}$ is given as

$$\text{var}(\hat{\mu}) = \text{var}\left(\frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)\right) = \frac{1}{K^2}\sum_{i=1}^{K}\text{var}(\theta_i) = \frac{1}{K^2}(K\sigma^2) = \frac{\sigma^2}{K}$$

Thus, the standard deviation of $\hat{\mu}$ is given as

$$\text{std}(\hat{\mu}) = \sqrt{\text{var}(\hat{\mu})} = \frac{\sigma}{\sqrt{K}}$$
We are interested in the distribution of the $z$-score of $\hat{\mu}$, which is itself a random variable:

$$Z_K = \frac{\hat{\mu} - E[\hat{\mu}]}{\text{std}(\hat{\mu})} = \frac{\hat{\mu} - \mu}{\sigma/\sqrt{K}} = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\sigma}\right)$$

$Z_K$ specifies the deviation of the estimated mean from the true mean in terms of its standard deviation. The central limit theorem states that as the sample size increases,
the random variable $Z_K$ converges in distribution to the standard normal distribution (which has mean 0 and variance 1). That is, as $K \to \infty$, for any $x \in \mathbb{R}$, we have

$$\lim_{K\to\infty} P(Z_K \le x) = \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function for the standard normal density function $f(x|0,1)$. Let $z_{\alpha/2}$ denote the $z$-score value that encompasses $\alpha/2$ of the probability mass for a standard normal distribution, that is,

$$P(0 \le Z_K \le z_{\alpha/2}) = \Phi(z_{\alpha/2}) - \Phi(0) = \alpha/2$$
then, because the normal distribution is symmetric about the mean, we have

$$\lim_{K\to\infty} P(-z_{\alpha/2} \le Z_K \le z_{\alpha/2}) = 2\cdot P(0 \le Z_K \le z_{\alpha/2}) = \alpha \quad (22.5)$$
Note that

$$\begin{aligned}
-z_{\alpha/2} \le Z_K \le z_{\alpha/2} &\implies -z_{\alpha/2} \le \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\sigma}\right) \le z_{\alpha/2} \\
&\implies -z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \hat{\mu} - \mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \\
&\implies \hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}
\end{aligned}$$
Substituting the above into Eq. (22.5) we obtain bounds on the value of the true mean $\mu$ in terms of the estimated value $\hat{\mu}$, that is,

$$\lim_{K\to\infty} P\left(\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}\right) = \alpha \quad (22.6)$$

Thus, for any given level of confidence $\alpha$, we can compute the probability that the true mean $\mu$ lies in the $\alpha\%$ confidence interval $\left(\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}},\ \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}\right)$. In other words, even though we do not know the true mean $\mu$, we can obtain a high-confidence estimate of the interval within which it must lie (e.g., by setting $\alpha = 0.95$ or $\alpha = 0.99$).
Unknown Variance
The analysis above assumes that we know the true variance $\sigma^2$, which is generally not the case. However, we can replace $\sigma^2$ by the sample variance

$$\hat{\sigma}^2 = \frac{1}{K}\sum_{i=1}^{K}(\theta_i - \hat{\mu})^2$$ (22.7)
because $\hat{\sigma}^2$ is a consistent estimator for $\sigma^2$, that is, as $K \to \infty$, $\hat{\sigma}^2$ converges with probability 1, also called converges almost surely, to $\sigma^2$. The central limit theorem then states that the random variable $Z^*_K$ defined below converges in distribution to the standard normal distribution:

$$Z^*_K = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\hat{\sigma}}\right)$$ (22.8)
and thus, we have

$$\lim_{K \to \infty} P\left(\hat{\mu} - z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}}\right) = \alpha$$ (22.9)

In other words, we say that $\left(\hat{\mu} - z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}},\ \hat{\mu} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}}\right)$ is the $\alpha\%$ confidence interval for $\mu$.
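As a quick illustration, the following sketch (ours, not the authors') computes the large-sample interval of Eq.(22.9) from a list of per-fold performance values; note that np.std uses the $1/K$ convention of Eq.(22.7) by default.

    import numpy as np

    def normal_confidence_interval(theta, z_half_alpha=1.96):
        # Large-sample alpha% CI for the true mean, Eq. (22.9).
        # theta: per-fold (or per-bootstrap-sample) performance values.
        # z_half_alpha: standard normal quantile, e.g. 1.96 for alpha = 0.95.
        theta = np.asarray(theta, dtype=float)
        K = len(theta)
        mu_hat = theta.mean()
        sigma_hat = theta.std()                  # square root of Eq. (22.7)
        half = z_half_alpha * sigma_hat / np.sqrt(K)
        return mu_hat - half, mu_hat + half

With $\hat{\mu} = 0.233$, $\hat{\sigma} = 0.0913$, and $K = 5$, this returns approximately $(0.153, 0.313)$, matching Example 22.9 below.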
Example 22.9. Consider Example 22.7, where we applied 5-fold cross-validation ($K = 5$) to assess the error rate of the full Bayes classifier. The estimated expected value and variance for the error rate were as follows:

$$\hat{\mu}_\theta = 0.233 \qquad \hat{\sigma}^2_\theta = 0.00833 \qquad \hat{\sigma}_\theta = \sqrt{0.00833} = 0.0913$$
Let $\alpha = 0.95$ be the confidence value. It is known that the standard normal distribution has 95% of the probability density within $z_{\alpha/2} = 1.96$ standard deviations from the mean. Thus, in the limit of large sample size, we have

$$P\left(\mu \in \left(\hat{\mu}_\theta - z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}},\ \hat{\mu}_\theta + z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}}\right)\right) = 0.95$$
Because $z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 1.96 \times \frac{0.0913}{\sqrt{5}} = 0.08$, we have

$$P\left(\mu \in (0.233 - 0.08,\ 0.233 + 0.08)\right) = P\left(\mu \in (0.153,\ 0.313)\right) = 0.95$$

Put differently, with 95% confidence, the true expected error rate lies in the interval $(0.153, 0.313)$.
If we want greater confidence, for example, for $\alpha = 0.99$, then the corresponding $z$-score value is $z_{\alpha/2} = 2.58$, and thus $z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 2.58 \times \frac{0.0913}{\sqrt{5}} = 0.105$. The 99% confidence interval for $\mu$ is therefore wider: $(0.128, 0.338)$.

Nevertheless, $K = 5$ is not a large sample size, and thus the above confidence intervals are not that reliable.
Small Sample Size
The confidence interval in Eq.(22.9) applies only when the sample size $K \to \infty$. We would like to obtain more precise confidence intervals for small samples. Consider the random variables $V_i$, for $i = 1, \ldots, K$, defined as

$$V_i = \frac{\theta_i - \hat{\mu}}{\sigma}$$
Further, consider the sum of their squares:

$$S = \sum_{i=1}^{K} V_i^2 = \sum_{i=1}^{K}\left(\frac{\theta_i - \hat{\mu}}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{K}(\theta_i - \hat{\mu})^2 = \frac{K\hat{\sigma}^2}{\sigma^2}$$ (22.10)
The last step follows from the definition of sample variance in Eq.(22.7).
If we assume that the $V_i$'s are IID with the standard normal distribution, then the sum $S$ follows a chi-squared distribution with $K - 1$ degrees of freedom, denoted $\chi^2(K-1)$, since $S$ is the sum of the squares of $K$ random variables $V_i$. There are only $K - 1$ degrees of freedom because each $V_i$ depends on $\hat{\mu}$ and the sum of the $\theta_i$'s is thus fixed.
Consider the random variable $Z^*_K$ in Eq.(22.8). We have

$$Z^*_K = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\hat{\sigma}}\right) = \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{K}}$$

Dividing the numerator and denominator in the expression above by $\sigma/\sqrt{K}$, we get

$$Z^*_K = \frac{(\hat{\mu} - \mu)\big/(\sigma/\sqrt{K})}{(\hat{\sigma}/\sqrt{K})\big/(\sigma/\sqrt{K})} = \frac{(\hat{\mu} - \mu)\big/(\sigma/\sqrt{K})}{\hat{\sigma}/\sigma} = \frac{Z_K}{\sqrt{S/K}}$$ (22.11)
The last step follows from Eq.(22.10) because $S = \frac{K\hat{\sigma}^2}{\sigma^2}$ implies that $\frac{\hat{\sigma}}{\sigma} = \sqrt{S/K}$.
Assuming that $Z_K$ follows a standard normal distribution, and noting that $S$ follows a chi-squared distribution with $K - 1$ degrees of freedom, the distribution of $Z^*_K$ is precisely the Student's $t$ distribution with $K - 1$ degrees of freedom. Thus, in the small sample case, instead of using the standard normal density to derive the confidence interval, we use the $t$ distribution. In particular, we choose the value $t_{\alpha/2,\,K-1}$ such that the cumulative $t$ distribution function with $K - 1$ degrees of freedom encompasses $\alpha/2$ of the probability mass, that is,
$$P(0 \le Z^*_K \le t_{\alpha/2,\,K-1}) = T_{K-1}(t_{\alpha/2}) - T_{K-1}(0) = \alpha/2$$
where $T_{K-1}$ is the cumulative distribution function for the Student's $t$ distribution with $K - 1$ degrees of freedom. Because the $t$ distribution is symmetric about the mean, we have
$$P\left(\hat{\mu} - t_{\alpha/2,\,K-1}\frac{\hat{\sigma}}{\sqrt{K}} \le \mu \le \hat{\mu} + t_{\alpha/2,\,K-1}\frac{\hat{\sigma}}{\sqrt{K}}\right) = \alpha$$ (22.12)
The $\alpha\%$ confidence interval for the true mean $\mu$ is thus

$$\left(\hat{\mu} - t_{\alpha/2,\,K-1}\frac{\hat{\sigma}}{\sqrt{K}},\ \hat{\mu} + t_{\alpha/2,\,K-1}\frac{\hat{\sigma}}{\sqrt{K}}\right)$$

Note the dependence of the interval on both $\alpha$ and the sample size $K$.
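The same computation with the Student's $t$ quantile gives the small-sample interval; a minimal sketch, assuming scipy is available (an assumption of this example, not a dependency of the text):

    import numpy as np
    from scipy.stats import t

    def t_confidence_interval(theta, alpha=0.95):
        # Small-sample alpha% CI for the true mean, Eq. (22.12).
        theta = np.asarray(theta, dtype=float)
        K = len(theta)
        mu_hat = theta.mean()
        sigma_hat = theta.std()                        # 1/K convention of Eq. (22.7)
        t_crit = t.ppf(1 - (1 - alpha) / 2, df=K - 1)  # t_{alpha/2, K-1}
        half = t_crit * sigma_hat / np.sqrt(K)
        return mu_hat - half, mu_hat + half

Applied to fold values with $\hat{\mu}_\theta = 0.233$ and $\hat{\sigma}_\theta = 0.0913$, it returns roughly $(0.12, 0.346)$, the interval derived in Example 22.10 below.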
Figure 22.6 shows the $t$ distribution density function for different values of $K$. It also shows the standard normal density function. We can observe that the $t$ distribution has more probability concentrated in its tails compared to the standard normal distribution. Further, as $K$ increases, the $t$ distribution very rapidly converges in distribution to the standard normal distribution, consistent with the large sample case. Thus, for large samples, we may use the usual $z_{\alpha/2}$ threshold.
Figure 22.6. Student's $t$ distribution: $K$ degrees of freedom, showing $t(1)$, $t(4)$, and $t(10)$ against the standard normal density $f(x \mid 0, 1)$. The thick solid line is the standard normal distribution.
Example 22.10. Consider Example 22.9. For 5-fold cross-validation, the estimated mean error rate is $\hat{\mu}_\theta = 0.233$, and the estimated standard deviation is $\hat{\sigma}_\theta = 0.0913$.
Due to the small sample size ($K = 5$), we can get a better confidence interval by using the $t$ distribution. For $K - 1 = 4$ degrees of freedom, for $\alpha = 0.95$, we use the quantile function for the Student's $t$-distribution to obtain $t_{\alpha/2,\,K-1} = 2.776$. Thus,

$$t_{\alpha/2,\,K-1}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 2.776 \times \frac{0.0913}{\sqrt{5}} = 0.113$$
The 95% confidence interval is therefore

$$(0.233 - 0.113,\ 0.233 + 0.113) = (0.12,\ 0.346)$$

which is much wider than the overly optimistic confidence interval $(0.153, 0.313)$ obtained for the large sample case in Example 22.9.
For $\alpha = 0.99$, we have $t_{\alpha/2,\,K-1} = 4.604$, and thus

$$t_{\alpha/2,\,K-1}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 4.604 \times \frac{0.0913}{\sqrt{5}} = 0.188$$

and the 99% confidence interval is

$$(0.233 - 0.188,\ 0.233 + 0.188) = (0.045,\ 0.421)$$

This is also much wider than the 99% confidence interval $(0.128, 0.338)$ obtained for the large sample case in Example 22.9.
22.2.4 Comparing Classifiers: Paired t-Test
In this section we look at a method that allows us to test for a significant difference in the classification performance of two alternative classifiers, $M^A$ and $M^B$. We want to assess which of them has a superior classification performance on a given dataset $\mathbf{D}$.
Following the evaluation methodology above, we can apply $K$-fold cross-validation (or bootstrap resampling) and tabulate their performance over each of the $K$ folds, with identical folds for both classifiers. That is, we perform a paired test, with both classifiers trained and tested on the same data. Let $\theta^A_1, \theta^A_2, \ldots, \theta^A_K$ and $\theta^B_1, \theta^B_2, \ldots, \theta^B_K$ denote the performance values for $M^A$ and $M^B$, respectively. To determine if the two classifiers have different or similar performance, define the random variable $\delta_i$ as the difference in their performance on the $i$th dataset:

$$\delta_i = \theta^A_i - \theta^B_i$$
Now consider the estimates for the expected difference and the variance of the differences:

$$\hat{\mu}_\delta = \frac{1}{K}\sum_{i=1}^{K}\delta_i \qquad \hat{\sigma}^2_\delta = \frac{1}{K}\sum_{i=1}^{K}(\delta_i - \hat{\mu}_\delta)^2$$
We can set up a hypothesis testing framework to determine if there is a statistically significant difference between the performance of $M^A$ and $M^B$. The null hypothesis $H_0$ is that their performance is the same, that is, the true expected difference is zero, whereas the alternative hypothesis $H_a$ is that they are not the same, that is, the true expected difference $\mu_\delta$ is not zero:

$$H_0: \mu_\delta = 0 \qquad H_a: \mu_\delta \ne 0$$
Let us define the $z$-score random variable for the estimated expected difference as

$$Z^*_\delta = \sqrt{K}\left(\frac{\hat{\mu}_\delta - \mu_\delta}{\hat{\sigma}_\delta}\right)$$
Following a similar argument as in Eq.(22.11), $Z^*_\delta$ follows a $t$ distribution with $K - 1$ degrees of freedom. However, under the null hypothesis we have $\mu_\delta = 0$, and thus

$$Z^*_\delta = \frac{\sqrt{K}\hat{\mu}_\delta}{\hat{\sigma}_\delta} \sim t_{K-1}$$

where the notation $Z^*_\delta \sim t_{K-1}$ means that $Z^*_\delta$ follows the $t$ distribution with $K - 1$ degrees of freedom.
Given a desired confidence level $\alpha$, we conclude that

$$P\left(-t_{\alpha/2,\,K-1} \le Z^*_\delta \le t_{\alpha/2,\,K-1}\right) = \alpha$$
Put another way, if $Z^*_\delta \notin \left(-t_{\alpha/2,\,K-1},\ t_{\alpha/2,\,K-1}\right)$, then we may reject the null hypothesis with $\alpha\%$ confidence. In this case, we conclude that there is a significant difference between the performance of $M^A$ and $M^B$. On the other hand, if $Z^*_\delta$ does lie in the above confidence interval, then we accept the null hypothesis that both $M^A$ and $M^B$ have essentially the same performance. The pseudo-code for the paired $t$-test is shown in Algorithm 22.4.
ALGORITHM 22.4. Paired t-Test via Cross-Validation

PAIRED-t-TEST ($\alpha$, $K$, $\mathbf{D}$):
1   $\mathbf{D}$ ← randomly shuffle $\mathbf{D}$
2   $\{\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_K\}$ ← partition $\mathbf{D}$ in $K$ equal parts
3   foreach $i \in [1, K]$ do
4       $M^A_i, M^B_i$ ← train the two different classifiers on $\mathbf{D} \setminus \mathbf{D}_i$
5       $\theta^A_i, \theta^B_i$ ← assess $M^A_i$ and $M^B_i$ on $\mathbf{D}_i$
6       $\delta_i = \theta^A_i - \theta^B_i$
7   $\hat{\mu}_\delta = \frac{1}{K}\sum_{i=1}^{K}\delta_i$
8   $\hat{\sigma}^2_\delta = \frac{1}{K}\sum_{i=1}^{K}(\delta_i - \hat{\mu}_\delta)^2$
9   $Z^*_\delta = \frac{\sqrt{K}\hat{\mu}_\delta}{\hat{\sigma}_\delta}$
10  if $Z^*_\delta \in \left(-t_{\alpha/2,\,K-1},\ t_{\alpha/2,\,K-1}\right)$ then
11      Accept $H_0$; both classifiers have similar performance
12  else
13      Reject $H_0$; classifiers have significantly different performance
Example 22.11. Consider the 2-dimensional Iris dataset from Example 22.1, with $k = 3$ classes. We compare the naive Bayes ($M^A$) with the full Bayes ($M^B$) classifier via cross-validation using $K = 5$ folds. Using error rate as the performance measure, we obtain the following values for the error rates and their difference over each of the $K$ folds:
    $i$             1       2       3       4       5
    $\theta^A_i$    0.233   0.267   0.1     0.4     0.3
    $\theta^B_i$    0.2     0.2     0.167   0.333   0.233
    $\delta_i$      0.033   0.067   -0.067  0.067   0.067
The estimated expected difference and variance of the differences are

$$\hat{\mu}_\delta = \frac{0.167}{5} = 0.033 \qquad \hat{\sigma}^2_\delta = 0.00333 \qquad \hat{\sigma}_\delta = \sqrt{0.00333} = 0.0577$$
The $z$-score value is given as

$$Z^*_\delta = \frac{\sqrt{K}\hat{\mu}_\delta}{\hat{\sigma}_\delta} = \frac{\sqrt{5} \times 0.033}{0.0577} = 1.28$$
From Example 22.10, for $\alpha = 0.95$ and $K - 1 = 4$ degrees of freedom, we have $t_{\alpha/2,\,K-1} = 2.776$. Because

$$Z^*_\delta = 1.28 \in (-2.776,\ 2.776) = \left(-t_{\alpha/2,\,K-1},\ t_{\alpha/2,\,K-1}\right)$$

we cannot reject the null hypothesis. Instead, we accept the null hypothesis that $\mu_\delta = 0$, that is, there is no significant difference between the naive and full Bayes classifier for this dataset.
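Algorithm 22.4 translates almost line for line into code. The sketch below is an illustration under our own assumptions (two scikit-learn-style classifiers, scipy for the $t$ quantile, error rate as the performance measure), not the authors' implementation:

    import numpy as np
    from scipy.stats import t

    def paired_t_test(clf_a, clf_b, X, y, K=5, alpha=0.95, seed=1):
        # Paired t-test over K cross-validation folds (Algorithm 22.4).
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), K)  # shuffle and partition D
        delta = np.empty(K)
        for i in range(K):
            test = folds[i]
            train = np.hstack([f for j, f in enumerate(folds) if j != i])
            errs = []
            for clf in (clf_a, clf_b):                      # identical folds for both
                clf.fit(X[train], y[train])
                errs.append(np.mean(clf.predict(X[test]) != y[test]))
            delta[i] = errs[0] - errs[1]                    # delta_i = theta^A_i - theta^B_i
        z_star = np.sqrt(K) * delta.mean() / delta.std()    # Z*_delta, 1/K variance convention
        t_crit = t.ppf(1 - (1 - alpha) / 2, df=K - 1)
        return z_star, abs(z_star) > t_crit                 # True means: reject H0

Using identical folds for both classifiers is precisely what makes this a paired test.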
22.3 BIAS-VARIANCE DECOMPOSITION
Given a training set $\mathbf{D} = \{\mathbf{x}_i, y_i\}_{i=1}^{n}$, comprising $n$ points $\mathbf{x}_i \in \mathbb{R}^d$, with their corresponding classes $y_i$, a learned classification model $M$ predicts the class for a given test point $\mathbf{x}$. The various performance measures we described above mainly focus on minimizing the prediction error by tabulating the fraction of misclassified points. However, in many applications, there may be costs associated with making wrong predictions. A loss function specifies the cost or penalty of predicting the class to be $\hat{y} = M(\mathbf{x})$, when the true class is $y$. A commonly used loss function for classification is the zero-one loss, defined as

$$L(y, M(\mathbf{x})) = I(M(\mathbf{x}) \ne y) = \begin{cases} 0 & \text{if } M(\mathbf{x}) = y \\ 1 & \text{if } M(\mathbf{x}) \ne y \end{cases}$$
Thus, zero-one loss assigns a cost of zero if the prediction is correct, and one otherwise. Another commonly used loss function is the squared loss, defined as

$$L(y, M(\mathbf{x})) = (y - M(\mathbf{x}))^2$$

where we assume that the classes are discrete valued, and not categorical.
Expected Loss
An ideal or optimal classifier is the one that minimizes the loss function. Because the true class is not known for a test case $\mathbf{x}$, the goal of learning a classification model can be cast as minimizing the expected loss:

$$E_y[L(y, M(\mathbf{x})) \mid \mathbf{x}] = \sum_y L(y, M(\mathbf{x})) \cdot P(y \mid \mathbf{x})$$ (22.13)

where $P(y \mid \mathbf{x})$ is the conditional probability of class $y$ given test point $\mathbf{x}$, and $E_y$ denotes that the expectation is taken over the different class values $y$.
Minimizing the expected zero-one loss corresponds to minimizing the error rate. This can be seen by expanding Eq.(22.13) with zero-one loss. Let $M(\mathbf{x}) = c_i$; then we have

$$E_y[L(y, M(\mathbf{x})) \mid \mathbf{x}] = E_y[I(y \ne M(\mathbf{x})) \mid \mathbf{x}] = \sum_y I(y \ne c_i) \cdot P(y \mid \mathbf{x}) = \sum_{y \ne c_i} P(y \mid \mathbf{x}) = 1 - P(c_i \mid \mathbf{x})$$
Thus, to minimize the expected loss we should choose $c_i$ as the class that maximizes the posterior probability, that is, $c_i = \mathrm{argmax}_y\, P(y \mid \mathbf{x})$. Because by definition [Eq.(22.1)], the error rate is simply an estimate of the expected zero-one loss, this choice also minimizes the error rate.
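As a tiny numeric illustration of this identity (the posterior values here are made up, not from the text), the expected zero-one loss of predicting class $c_i$ is just one minus its posterior:

    # Hypothetical posteriors P(y | x) at a fixed test point x
    posterior = {"c1": 0.7, "c2": 0.2, "c3": 0.1}

    # Expected zero-one loss of predicting c_i is 1 - P(c_i | x), per Eq. (22.13)
    expected_loss = {c: 1.0 - p for c, p in posterior.items()}  # c1: 0.3, c2: 0.8, c3: 0.9

    # The Bayes-optimal choice is argmax_y P(y | x), which minimizes the loss
    best = max(posterior, key=posterior.get)                    # "c1"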
Bias and Variance
The expected loss for the squared loss function offers important insight into the classification problem because it can be decomposed into bias and variance terms. Intuitively, the bias of a classifier refers to the systematic deviation of its predicted decision boundary from the true decision boundary, whereas the variance of a classifier refers to the deviation among the learned decision boundaries over different training sets. More formally, because $M$ depends on the training set, given a test point $\mathbf{x}$, we denote its predicted value as $M(\mathbf{x}, \mathbf{D})$. Consider the expected squared loss:
$$E_y\left[L\left(y, M(\mathbf{x}, \mathbf{D})\right) \mid \mathbf{x}, \mathbf{D}\right] = E_y\left[\left(y - M(\mathbf{x}, \mathbf{D})\right)^2 \mid \mathbf{x}, \mathbf{D}\right]$$
$$= E_y\Big[\big(\underbrace{y - E_y[y \mid \mathbf{x}] + E_y[y \mid \mathbf{x}]}_{\text{add and subtract same term}} - M(\mathbf{x}, \mathbf{D})\big)^2 \mid \mathbf{x}, \mathbf{D}\Big]$$
$$= E_y\left[\left(y - E_y[y \mid \mathbf{x}]\right)^2 \mid \mathbf{x}, \mathbf{D}\right] + E_y\left[\left(M(\mathbf{x}, \mathbf{D}) - E_y[y \mid \mathbf{x}]\right)^2 \mid \mathbf{x}, \mathbf{D}\right] + E_y\left[2\left(y - E_y[y \mid \mathbf{x}]\right) \cdot \left(E_y[y \mid \mathbf{x}] - M(\mathbf{x}, \mathbf{D})\right) \mid \mathbf{x}, \mathbf{D}\right]$$
$$= E_y\left[\left(y - E_y[y \mid \mathbf{x}]\right)^2 \mid \mathbf{x}, \mathbf{D}\right] + \left(M(\mathbf{x}, \mathbf{D}) - E_y[y \mid \mathbf{x}]\right)^2 + 2\left(E_y[y \mid \mathbf{x}] - M(\mathbf{x}, \mathbf{D})\right) \cdot \underbrace{\left(E_y[y \mid \mathbf{x}] - E_y[y \mid \mathbf{x}]\right)}_{0}$$
$$= \underbrace{E_y\left[\left(y - E_y[y \mid \mathbf{x}]\right)^2 \mid \mathbf{x}, \mathbf{D}\right]}_{\mathrm{var}(y \mid \mathbf{x})} + \underbrace{\left(M(\mathbf{x}, \mathbf{D}) - E_y[y \mid \mathbf{x}]\right)^2}_{\text{squared-error}}$$ (22.14)
Above, we made use of the fact that for any random variables $X$ and $Y$, and for any constant $a$, we have $E[X + Y] = E[X] + E[Y]$, $E[aX] = aE[X]$, and $E[a] = a$. The first term in Eq.(22.14) is simply the variance of $y$ given $\mathbf{x}$. The second term is the squared error between the predicted value $M(\mathbf{x}, \mathbf{D})$ and the expected value $E_y[y \mid \mathbf{x}]$. Because this term depends on the training set, we can eliminate this dependence by averaging over all possible training sets of size $n$. The average or expected squared error for a given test point $\mathbf{x}$ over all training sets is then given as
$$E_{\mathbf{D}}\left[\left(M(\mathbf{x}, \mathbf{D}) - E_y[y \mid \mathbf{x}]\right)^2\right]$$
$$= E_{\mathbf{D}}\Big[\big(M(\mathbf{x}, \mathbf{D}) \underbrace{- E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] + E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})]}_{\text{add and subtract same term}} - E_y[y \mid \mathbf{x}]\big)^2\Big]$$
$$= E_{\mathbf{D}}\left[\left(M(\mathbf{x}, \mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})]\right)^2\right] + \left(E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] - E_y[y \mid \mathbf{x}]\right)^2 + 2\left(E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] - E_y[y \mid \mathbf{x}]\right) \cdot \underbrace{\left(E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] - E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})]\right)}_{0}$$
$$= \underbrace{E_{\mathbf{D}}\left[\left(M(\mathbf{x}, \mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})]\right)^2\right]}_{\text{variance}} + \underbrace{\left(E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] - E_y[y \mid \mathbf{x}]\right)^2}_{\text{bias}}$$ (22.15)
This means that the expected squared error for a given test point can be decomposed into bias and variance terms. Combining Eqs.(22.14) and (22.15), the expected squared loss over all test points $\mathbf{x}$ and over all training sets $\mathbf{D}$ of size $n$ yields the following decomposition into noise, variance, and bias terms:

$$E_{\mathbf{x}, \mathbf{D}, y}\left[\left(y - M(\mathbf{x}, \mathbf{D})\right)^2\right] = E_{\mathbf{x}, \mathbf{D}, y}\left[\left(y - E_y[y \mid \mathbf{x}]\right)^2 \mid \mathbf{x}, \mathbf{D}\right] + E_{\mathbf{x}, \mathbf{D}}\left[\left(M(\mathbf{x}, \mathbf{D}) - E_y[y \mid \mathbf{x}]\right)^2\right]$$
$$= \underbrace{E_{\mathbf{x}, y}\left[\left(y - E_y[y \mid \mathbf{x}]\right)^2\right]}_{\text{noise}} + \underbrace{E_{\mathbf{x}, \mathbf{D}}\left[\left(M(\mathbf{x}, \mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})]\right)^2\right]}_{\text{average variance}} + \underbrace{E_{\mathbf{x}}\left[\left(E_{\mathbf{D}}[M(\mathbf{x}, \mathbf{D})] - E_y[y \mid \mathbf{x}]\right)^2\right]}_{\text{average bias}}$$ (22.16)
Thus, the expected squared loss over all test points and training sets can be decomposed into three terms: noise, average bias, and average variance. The noise term is the average variance $\mathrm{var}(y \mid \mathbf{x})$ over all test points $\mathbf{x}$. It contributes a fixed cost to the loss independent of the model, and can thus be ignored when comparing different classifiers. The classifier-specific loss can then be attributed to the variance and bias terms. In general, bias indicates whether the model $M$ is correct or incorrect. It also reflects our assumptions about the domain in terms of the decision boundary. For example, if the decision boundary is nonlinear, and we use a linear classifier, then it is likely to have high bias, that is, it will be consistently incorrect over different training sets. On the other hand, a nonlinear (or a more complex) classifier is more likely to capture the correct decision boundary, and is thus likely to have a low bias. Nevertheless, this does not necessarily mean that a complex classifier will be a better one, since we also have to consider the variance term, which measures the inconsistency of the classifier decisions. A complex classifier induces a more complex decision boundary and thus may be prone to overfitting, that is, it may try to model all the small nuances in the training data, and thus may be susceptible to small changes in the training set, which may result in high variance.

In general, the expected loss can be attributed to high bias or high variance, with typically a trade-off between these two terms. Ideally, we seek a balance between these opposing trends, that is, we prefer a classifier with an acceptable bias (reflecting domain or dataset specific assumptions) and as low a variance as possible.
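The average bias and variance of Eq.(22.16) can be estimated empirically, as done in the next example, by training the same classifier on many bootstrap samples. A minimal sketch, under the simplifying assumption of noise-free binary labels in $\{+1, -1\}$, so that $E_y[y \mid \mathbf{x}]$ is approximated by the observed label itself:

    import numpy as np

    def bias_variance(models, X, y):
        # models: the same classifier trained on K different bootstrap samples.
        preds = np.array([m.predict(X) for m in models])  # shape (K, n)
        avg_pred = preds.mean(axis=0)                     # estimate of E_D[M(x, D)]
        variance = np.mean((preds - avg_pred) ** 2)       # average variance term
        bias = np.mean((avg_pred - y) ** 2)               # average (squared) bias term
        return bias, variance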
Example 22.12. Figure 22.7 illustrates the trade-off between bias and variance, using the Iris principal components dataset, which has $n = 150$ points and $k = 2$ classes ($c_1 = +1$, and $c_2 = -1$). We construct $K = 10$ training datasets via bootstrap sampling, and use them to train SVM classifiers using a quadratic (homogeneous) kernel, varying the regularization constant $C$ from $10^{-2}$ to $10^{2}$.
Recall that $C$ controls the weight placed on the slack variables, as opposed to the margin of the hyperplane (see Section 21.3). A small value of $C$ emphasizes the margin, whereas a large value of $C$ tries to minimize the slack terms. Figures 22.7a, 22.7b, and 22.7c show that the variance of the SVM model increases as we increase $C$, as seen from the varying decision boundaries. Figure 22.7d plots the average variance and average bias for different values of $C$, as well as the expected loss. The bias-variance tradeoff is clearly visible, since as the bias reduces, the variance increases. The lowest expected loss is obtained when $C = 1$.

Figure 22.7. Bias-variance decomposition: SVM quadratic kernels. Decision boundaries plotted for $K = 10$ bootstrap samples. Panels: (a) $C = 0.01$, (b) $C = 1$, (c) $C = 100$, (d) bias-variance plot of loss, bias, and variance versus $C$.
22.3.1 Ensemble Classifiers
A classifier is called unstable if small perturbations in the training set result in large changes in the prediction or decision boundary. High variance classifiers are inherently unstable, since they tend to overfit the data. On the other hand, high bias methods typically underfit the data, and usually have low variance. In either case, the aim of learning is to reduce classification error by reducing the variance or bias, ideally both. Ensemble methods create a combined classifier using the output of multiple base classifiers, which are trained on different data subsets. Depending on how the training sets are selected, and on the stability of the base classifiers, ensemble classifiers can help reduce the variance and the bias, leading to a better overall performance.
Bagging
Bagging, which stands for Bootstrap Aggregation, is an ensemble classification method that employs multiple bootstrap samples (with replacement) from the input training data $\mathbf{D}$ to create slightly different training sets $\mathbf{D}_i$, $i = 1, 2, \ldots, K$. Different base classifiers $M_i$ are learned, with $M_i$ trained on $\mathbf{D}_i$. Given any test point $\mathbf{x}$, it is first classified using each of the $K$ base classifiers, $M_i$. Let the number of classifiers that predict the class of $\mathbf{x}$ as $c_j$ be given as

$$v_j(\mathbf{x}) = \left|\left\{M_i(\mathbf{x}) = c_j \mid i = 1, \ldots, K\right\}\right|$$
The combined classifier, denoted $\mathbf{M}^K$, predicts the class of a test point $\mathbf{x}$ by majority voting among the $k$ classes:

$$\mathbf{M}^K(\mathbf{x}) = \mathrm{argmax}_{c_j}\left\{v_j(\mathbf{x}) \mid j = 1, \ldots, k\right\}$$
For binary classification, assuming that the classes are given as $\{+1, -1\}$, the combined classifier $\mathbf{M}^K$ can be expressed more simply as

$$\mathbf{M}^K(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{K} M_i(\mathbf{x})\right)$$
Bagging can help reduce the variance, especially if the base classifiers are unstable, due to the averaging effect of majority voting. It does not, in general, have much effect on the bias.
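A minimal bagging sketch for the binary $\{+1, -1\}$ case, assuming scikit-learn-style base classifiers (an illustration, not the book's implementation):

    import numpy as np
    from copy import deepcopy

    def bagging_fit(base, X, y, K=10, seed=1):
        # Train K copies of the base classifier on bootstrap samples of D.
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(K):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample D_i
            models.append(deepcopy(base).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # Majority vote: sign of the summed {+1, -1} predictions
        # (a tie with even K maps to 0 here).
        return np.sign(sum(m.predict(X) for m in models))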
Example 22.13. Figure 22.8a shows the averaging effect of bagging for the Iris principal components dataset from Example 22.12. The figure shows the SVM decision boundaries for the quadratic kernel using $C = 1$. The base SVM classifiers are trained on $K = 10$ bootstrap samples. The combined (average) classifier is shown in bold.

Figure 22.8b shows the combined classifiers obtained for different values of $K$, keeping $C = 1$. The zero-one and squared loss for selected values of $K$ are shown below:

    $K$    Zero-one loss    Squared loss
    3      0.047            0.187
    5      0.04             0.16
    8      0.02             0.10
    10     0.027            0.113
    15     0.027            0.107

The worst training performance is obtained for $K = 3$ (in thick gray) and the best for $K = 8$ (in thick black).
Figure 22.8. Bagging: combined classifiers. (a) uses $K = 10$ bootstrap samples. (b) shows the average decision boundary for different values of $K$.
Boosting
Boosting is another ensemble technique that trains the base classifiers on different samples. However, the main idea is to carefully select the samples to boost the performance on hard to classify instances. Starting from an initial training sample $\mathbf{D}_1$, we train the base classifier $M_1$, and obtain its training error rate. To construct the next sample $\mathbf{D}_2$, we select the misclassified instances with higher probability, and after training $M_2$, we obtain its training error rate. To construct $\mathbf{D}_3$, those instances that are hard to classify by $M_1$ or $M_2$ have a higher probability of being selected. This process is repeated for $K$ iterations. Thus, unlike bagging, which uses independent random samples from the input dataset, boosting employs weighted or biased samples to construct the different training sets, with the current sample depending on the previous ones. Finally, the combined classifier is obtained via weighted voting over the output of the $K$ base classifiers $M_1, M_2, \ldots, M_K$.
Boosting is most beneficial when the base classifiers are weak, that is, have an error rate that is slightly less than that for a random classifier. The idea is that whereas $M_1$ may not be particularly good on all test instances, by design $M_2$ may help classify some cases where $M_1$ fails, and $M_3$ may help classify instances where $M_1$ and $M_2$ fail, and so on. Thus, boosting has more of a bias reducing effect. Each of the weak learners is likely to have high bias (it is only slightly better than random guessing), but the final combined classifier can have much lower bias, since different weak learners learn to classify instances in different regions of the input space. Several variants of boosting can be obtained based on how the instance weights are computed for sampling, how the base classifiers are combined, and so on. We discuss Adaptive Boosting (AdaBoost), which is one of the most popular variants.
Adaptive Boosting: AdaBoost
Let $\mathbf{D}$ be the input training set, comprising $n$ points $\mathbf{x}_i \in \mathbb{R}^d$. The boosting process will be repeated $K$ times. Let $t$ denote the iteration and let $\alpha_t$ denote the weight for the $t$th classifier $M_t$. Let $w^t_i$ denote the weight for $\mathbf{x}_i$, with $\mathbf{w}^t = (w^t_1, w^t_2, \ldots, w^t_n)^T$ being the weight vector over all the points for the $t$th iteration.
ALGORITHM 22.5. Adaptive Boosting Algorithm: AdaBoost

ADABOOST ($K$, $\mathbf{D}$):
1   $\mathbf{w}^0 \leftarrow \frac{1}{n} \cdot \mathbf{1} \in \mathbb{R}^n$
2   $t \leftarrow 1$
3   while $t \le K$ do
4       $\mathbf{D}_t \leftarrow$ weighted resampling with replacement from $\mathbf{D}$ using $\mathbf{w}^{t-1}$
5       $M_t \leftarrow$ train classifier on $\mathbf{D}_t$
6       $\epsilon_t \leftarrow \sum_{i=1}^{n} w^{t-1}_i \cdot I\left(M_t(\mathbf{x}_i) \ne y_i\right)$   // weighted error rate on $\mathbf{D}$
7       if $\epsilon_t = 0$ then break
8       else if $\epsilon_t < 0.5$ then
9           $\alpha_t = \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$   // classifier weight
10          foreach $i \in [1, n]$ do   // update point weights
11              $w^t_i = \begin{cases} w^{t-1}_i & \text{if } M_t(\mathbf{x}_i) = y_i \\ w^{t-1}_i\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) & \text{if } M_t(\mathbf{x}_i) \ne y_i \end{cases}$
12          $\mathbf{w}^t = \frac{\mathbf{w}^t}{\mathbf{1}^T \mathbf{w}^t}$   // normalize weights
13      $t \leftarrow t + 1$
14  return $\{M_1, M_2, \ldots, M_K\}$
In fact, $\mathbf{w}$ is a probability vector, whose elements sum to one. Initially all points have equal weights, that is,

$$\mathbf{w}^0 = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)^T = \frac{1}{n}\mathbf{1}$$

where $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector of all 1's.
The pseudo-code for AdaBoost is shown in Algorithm 22.5. During iteration $t$, the training sample $\mathbf{D}_t$ is obtained via weighted resampling using the distribution $\mathbf{w}^{t-1}$, that is, we draw a sample of size $n$ with replacement, such that the $i$th point is chosen according to its probability $w^{t-1}_i$. Next, we train the classifier $M_t$ using $\mathbf{D}_t$, and compute its weighted error rate $\epsilon_t$ on the entire input dataset $\mathbf{D}$:

$$\epsilon_t = \sum_{i=1}^{n} w^{t-1}_i \cdot I\left(M_t(\mathbf{x}_i) \ne y_i\right)$$
where $I$ is an indicator function that is 1 when its argument is true, that is, when $M_t$ misclassifies $\mathbf{x}_i$, and is 0 otherwise.
The weight for the $t$th classifier is then set as

$$\alpha_t = \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
and the weight for each point $\mathbf{x}_i \in \mathbf{D}$ is updated based on whether the point is misclassified or not:

$$w^t_i = w^{t-1}_i \cdot \exp\left\{\alpha_t \cdot I\left(M_t(\mathbf{x}_i) \ne y_i\right)\right\}$$
Thus, if the predicted class matches the true class, that is, if $M_t(\mathbf{x}_i) = y_i$, then $I(M_t(\mathbf{x}_i) \ne y_i) = 0$, and the weight for point $\mathbf{x}_i$ remains unchanged. On the other hand, if the point is misclassified, that is, $M_t(\mathbf{x}_i) \ne y_i$, then we have $I(M_t(\mathbf{x}_i) \ne y_i) = 1$, and

$$w^t_i = w^{t-1}_i \cdot \exp\{\alpha_t\} = w^{t-1}_i \exp\left\{\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)\right\} = w^{t-1}_i\left(\frac{1}{\epsilon_t} - 1\right)$$
We can observe that if the error rate $\epsilon_t$ is small, then there is a greater weight increment for $\mathbf{x}_i$. The intuition is that a point that is misclassified by a good classifier (with a low error rate) should be more likely to be selected for the next training dataset. On the other hand, if the error rate of the base classifier is close to 0.5, then there is only a small change in the weight, since a bad classifier (with a high error rate) is expected to misclassify many instances. Note that for a binary class problem, an error rate of 0.5 corresponds to a random classifier, that is, one that makes a random guess. Thus, we require that a base classifier has an error rate at least slightly better than random guessing, that is, $\epsilon_t < 0.5$. If the error rate $\epsilon_t \ge 0.5$, then the boosting method discards the classifier, and returns to line 4 to try another data sample. Alternatively, one can simply invert the predictions for binary classification. It is worth emphasizing that for a multi-class problem (with $k > 2$), the requirement that $\epsilon_t < 0.5$ is a significantly stronger requirement than for the binary ($k = 2$) class problem because in the multiclass case a random classifier is expected to have an error rate of $\frac{k-1}{k}$. Note also that if the error rate of the base classifier $\epsilon_t = 0$, then we can stop the boosting iterations.
Once the point weights have been updated, we re-normalize the weights so that $\mathbf{w}^t$ is a probability vector (line 12):

$$\mathbf{w}^t = \frac{\mathbf{w}^t}{\mathbf{1}^T \mathbf{w}^t} = \frac{1}{\sum_{j=1}^{n} w^t_j}\left(w^t_1, w^t_2, \ldots, w^t_n\right)^T$$
Combined Classifier
Given the set of boosted classifiers, $M_1, M_2, \ldots, M_K$, along with their weights $\alpha_1, \alpha_2, \ldots, \alpha_K$, the class for a test case $\mathbf{x}$ is obtained via weighted majority voting. Let $v_j(\mathbf{x})$ denote the weighted vote for class $c_j$ over the $K$ classifiers, given as

$$v_j(\mathbf{x}) = \sum_{t=1}^{K} \alpha_t \cdot I\left(M_t(\mathbf{x}) = c_j\right)$$
Because $I(M_t(\mathbf{x}) = c_j)$ is 1 only when $M_t(\mathbf{x}) = c_j$, the variable $v_j(\mathbf{x})$ simply obtains the tally for class $c_j$ among the $K$ base classifiers, taking into account the classifier weights. The combined classifier, denoted $\mathbf{M}^K$, then predicts the class for $\mathbf{x}$ as follows:

$$\mathbf{M}^K(\mathbf{x}) = \mathrm{argmax}_{c_j}\left\{v_j(\mathbf{x}) \mid j = 1, \ldots, k\right\}$$
In the case of binary classification, with classes $\{+1, -1\}$, the combined classifier $\mathbf{M}^K$ can be expressed more simply as

$$\mathbf{M}^K(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{K} \alpha_t M_t(\mathbf{x})\right)$$
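Putting Algorithm 22.5 and the combined classifier together, here is a compact sketch for the binary $\{+1, -1\}$ case (again assuming scikit-learn-style base learners; safeguards such as a cap on resampling retries are omitted):

    import numpy as np
    from copy import deepcopy

    def adaboost_fit(base, X, y, K=10, seed=1):
        # AdaBoost (Algorithm 22.5) with binary classes {+1, -1}.
        rng = np.random.default_rng(seed)
        n = len(y)
        w = np.full(n, 1.0 / n)                    # w^0 = (1/n) 1
        models, alphas = [], []
        while len(models) < K:
            idx = rng.choice(n, size=n, p=w)       # weighted resampling (line 4)
            M = deepcopy(base).fit(X[idx], y[idx])
            miss = M.predict(X) != y               # misclassified points in D
            eps = np.sum(w * miss)                 # weighted error rate (line 6)
            if eps == 0:
                break                              # perfect classifier; stop (line 7)
            if eps >= 0.5:
                continue                           # too weak; draw another sample
            alphas.append(np.log((1 - eps) / eps))         # classifier weight (line 9)
            w = w * np.where(miss, (1 - eps) / eps, 1.0)   # update point weights (line 11)
            w = w / w.sum()                                # normalize (line 12)
            models.append(M)
        return models, np.array(alphas)

    def adaboost_predict(models, alphas, X):
        # Weighted majority vote: sign(sum_t alpha_t M_t(x))
        return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))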
Example 22.14. Figure 22.9a illustrates the boosting approach on the Iris principal components dataset, using linear SVMs as the base classifiers. The regularization constant was set to $C = 1$. The hyperplane learned in iteration $t$ is denoted $h_t$; thus, the classifier model is given as $M_t(\mathbf{x}) = \mathrm{sign}(h_t(\mathbf{x}))$. As such, no individual linear hyperplane can discriminate between the classes very well, as seen from their error rates on the training set:

    $M_t$          $h_1$    $h_2$    $h_3$    $h_4$
    $\epsilon_t$   0.280    0.305    0.174    0.282
    $\alpha_t$     0.944    0.826    1.559    0.935
However, when we combine the decisions from successive hyperplanes weighted by $\alpha_t$, we observe a marked drop in the error rate for the combined classifier $\mathbf{M}^K(\mathbf{x})$ as $K$ increases:

    combined model        $\mathbf{M}^1$   $\mathbf{M}^2$   $\mathbf{M}^3$   $\mathbf{M}^4$
    training error rate   0.280            0.253            0.073            0.047
We can see, for example, that the combined classifier $\mathbf{M}^3$, comprising $h_1$, $h_2$ and $h_3$, has already captured the essential features of the nonlinear decision boundary between the two classes, yielding an error rate of 7.3%. Further reduction in the training error is obtained by increasing the number of boosting steps.

To assess the performance of the combined classifier on independent testing data, we employ 5-fold cross-validation, and plot the average testing and training error rates as a function of $K$ in Figure 22.9b. We can see that as the number of base classifiers $K$ increases, both the training and testing error rates reduce. However, while the training error essentially goes to 0, the testing error does not reduce beyond 0.02, which happens at $K = 110$. This example illustrates the effectiveness of boosting in reducing the bias.

Figure 22.9. (a) Boosting SVMs with linear kernel (hyperplanes $h_1$ through $h_4$). (b) Average testing and training error: 5-fold cross-validation.
Bagging as a Special Case of AdaBoost:
Bagging can be considered as a special case
of AdaBoost, where
w
t
=
1
n
1
, and
α
t
=
1 for all
K
iterations. In this case, the weighted
resampling defaults to regular resampling with replacement, and the predicted class
for a test case also defaults to simple majority voting.
22.4 FURTHER READING

The application of ROC analysis to classifier performance was introduced in Provost and Fawcett (1997), with an excellent introduction to ROC analysis given in Fawcett (2006). For an in-depth description of the bootstrap, cross-validation, and other methods for assessing classification accuracy see Efron and Tibshirani (1993). For many datasets simple rules, like one-level decision trees, can yield good classification performance; see Holte (1993) for details. For a recent review and comparison of classifiers over multiple datasets see Demšar (2006). A discussion of bias, variance, and zero-one loss for classification appears in Friedman (1997), with a unified decomposition of bias and variance for both squared and zero-one loss given in Domingos (2000). The concept of bagging was proposed in Breiman (1996), and that of adaptive boosting in Freund and Schapire (1997). Random forests is a tree-based ensemble approach that can be very effective; see Breiman (2001) for details. For a comprehensive overview on the evaluation of classification algorithms see Japkowicz and Shah (2011).
Breiman, L. (1996). "Bagging predictors." Machine Learning, 24(2): 123–140.
Breiman, L. (2001). "Random forests." Machine Learning, 45(1): 5–32.
Demšar, J. (2006). "Statistical comparisons of classifiers over multiple data sets." The Journal of Machine Learning Research, 7: 1–30.
Domingos, P. (2000). "A unified bias-variance decomposition for zero-one and squared loss." In Proceedings of the National Conference on Artificial Intelligence, 564–569.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, vol. 57. Boca Raton, FL: Chapman & Hall/CRC.
Fawcett, T. (2006). "An introduction to ROC analysis." Pattern Recognition Letters, 27(8): 861–874.
Freund, Y. and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences, 55(1): 119–139.
Friedman, J. H. (1997). "On bias, variance, 0/1-loss, and the curse-of-dimensionality." Data Mining and Knowledge Discovery, 1(1): 55–77.
22.5 EXERCISES
Figure 22.10. For Q4 (showing hyperplanes $h_1$ through $h_6$ on a grid).
Table 22.7. Critical values for $t$-test

    dof              1         2        3        4        5        6
    $t_{\alpha/2}$   12.7065   4.3026   3.1824   2.7764   2.5706   2.4469

...interval for the expected error rate, using the $t$-distribution critical values for different degrees of freedom (dof) given in Table 22.7.
Q5. Consider the probabilities $P(+1 \mid \mathbf{x}_i)$ for the positive class obtained for some classifier, and given the true class labels $y_i$:

                            $\mathbf{x}_1$  $\mathbf{x}_2$  $\mathbf{x}_3$  $\mathbf{x}_4$  $\mathbf{x}_5$  $\mathbf{x}_6$  $\mathbf{x}_7$  $\mathbf{x}_8$  $\mathbf{x}_9$  $\mathbf{x}_{10}$
    $y_i$                   +1              -1              +1              +1              -1              +1              -1              +1              -1              -1
    $P(+1 \mid \mathbf{x}_i)$  0.53         0.86            0.25            0.95            0.87            0.86            0.76            0.94            0.44            0.86

Plot the ROC curve for this classifier.