ACDH-CH Digital Prosopography Summer School 2020

Analysis III: Statistics, Plotting, Unsupervised ML

2020-07-09, Vienna / Zoom

Sabine Laszakovits | Data Analyst | ACDH-CH | ÖAW

What we’ll do

  • Basic statistics

    • Exploratory data analysis (EDA)

    • Characteristic numbers

    • Variables

    • Plotting in Python

  • Find clusters in the data

    • K-Means algorithm

The dataset

https://en.wikipedia.org/wiki/Iris_flower_data_set

We have a list of 150 flowers from 3 species of irises, with measurements of their sepals and petals.

For now, we want to see what we can observe about these measurements.

Ultimately, we will want to figure out from the measurements which species each flower belongs to.

Iris setosa

Iris versicolor

Iris virginica

[photographs of the three iris species]

source: https://alchetron.com/Iris-setosa

source: https://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg

source: https://commons.wikimedia.org/wiki/File:Iris_virginica.jpg

[1]:
import pandas

data_complete = pandas.read_csv('iris.csv')

data = data_complete.drop(columns=['species']) # drop the labels -- keeping them would be cheating
[2]:
data
[2]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

What is a “petal” [ˈpɛtəl] and what is a “sepal” [ˈsɛpəl]?

[diagram: petal vs. sepal] Source: https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg

Exploratory data analysis

  • How many variables are there?

  • What is the range of each variable?

  • What is the distribution of each variable?

  • Are the variables independent or dependent?

How many variables are there?

[3]:
data.head()
[3]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

What is the range of each variable?

[4]:
data.describe()
[4]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
[6]:
data['sepal_length'].plot.hist(bins=100)
[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1c32ffdf28>
[histogram of sepal_length]

Minimum, maximum

The smallest and largest value in the set.

[7]:
# note: don't name these min/max, which would shadow Python's built-ins
min_value = data['sepal_length'].min()
max_value = data['sepal_length'].max()

data['sepal_length'].plot.hist(bins=100)

import matplotlib.pyplot

matplotlib.pyplot.axvline(min_value, color='k', linestyle='dashed', linewidth=1)
matplotlib.pyplot.axvline(max_value, color='k', linestyle='dashed', linewidth=1)
[7]:
<matplotlib.lines.Line2D at 0x7f1c32e38d68>
[histogram with dashed lines at the minimum and maximum]

What is the distribution of each variable?

  • Mean

  • Quantiles

  • Standard deviation

Mean

The arithmetic mean, also known as average.

\(\bar{x} = \frac{x_1+x_2+\ldots+x_n}{n} = \frac{\sum_{i=1}^{n}x_i}{n}\)
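
As a sanity check, the formula can be evaluated directly and compared against pandas' built-in mean (a minimal sketch, using the data loaded above):

values = data['sepal_length']
print(sum(values) / len(values))  # the formula: sum of all values, divided by n
print(values.mean())              # pandas' built-in; both print 5.8433...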

[8]:
data['sepal_length'].plot.hist(bins=100)

matplotlib.pyplot.axvline(data['sepal_length'].mean(), color='k', linestyle='dashed', linewidth=1)
[8]:
<matplotlib.lines.Line2D at 0x7f1c32d2dcc0>
[histogram with a dashed line at the mean]

Standard deviation (SD, std)

Measures the dispersion of a dataset relative to its mean.

\(\sigma = \sqrt{\frac{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2}{n}}\)

Why square? Squaring makes every deviation positive (so deviations below and above the mean cannot cancel each other out) and weights large deviations more heavily than small ones.

Note that pandas’ .std() computes the sample standard deviation, dividing by \(n-1\) instead of \(n\); with 150 datapoints the difference is negligible.

The SD is also used to express the confidence in statistical conclusions and to calculate the margin of error.
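
The two conventions can be compared directly via the ddof parameter (a minimal sketch):

values = data['sepal_length']
print(values.std(ddof=0))  # divide by n, as in the formula above
print(values.std())        # pandas' default ddof=1: divide by n-1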

[9]:
data['sepal_length'].plot.hist(bins=100)

mean = data['sepal_length'].mean()
std = data['sepal_length'].std()

matplotlib.pyplot.axvline(mean, color='k', linestyle='dashed', linewidth=1)

matplotlib.pyplot.fill_between([mean-std, mean+std], 0, 11, facecolor='green', alpha=0.3)
[9]:
<matplotlib.collections.PolyCollection at 0x7f1c32a3ea90>
[histogram with the region mean ± 1 standard deviation shaded]
The effect of the standard deviation

The 2 distributions have the same mean but different standard deviations.

Comparison_standard_deviations.svg.png Source: https://commons.wikimedia.org/wiki/File:Comparison_standard_deviations.svg

Quantiles

Sort the sample in ascending order. Partition it into \(q\) groups with an equal number of datapoints in each group, so that a randomly drawn datapoint is equally likely to fall into any of the groups.

Some common quantiles:

  • quartile = partition into 4 groups = 4-quantile

  • percentile = partition into 100 groups = 100-quantile

  • median = partition into 2 groups = 2-quantile
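
For instance, the median coincides with the 0.5-quantile (a minimal sketch; describe() above showed 5.8 as the 50% value for sepal_length):

print(data['sepal_length'].median())       # 5.8
print(data['sepal_length'].quantile(0.5))  # the same value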

Example

25th percentile = the value of the datapoint that is ranked at position \(\frac{25}{100}n\)

= 1st quartile

[10]:
# pandas' built-in quantile calculation
q25 = data['sepal_length'].quantile(0.25)
print(f"Calculated value = {q25}")
Calculated value = 5.1
[11]:
sorted_values = list(data['sepal_length'].sort_values())
# the datapoint ranked at position (25/100) * n, rounding down;
# pandas interpolates between neighboring ranks, so the two methods can differ slightly
q25_index = int(len(sorted_values)*0.25)
q25_value = sorted_values[q25_index]
print(f"Value at index {q25_index} = {q25_value}")
Value at index 37 = 5.1
The partitions span intervals of different widths
[12]:
q50 = data['sepal_length'].quantile(0.50)
q75 = data['sepal_length'].quantile(0.75)

data['sepal_length'].plot.hist(bins=100)

matplotlib.pyplot.fill_between([min_value, q25], 0, 11, facecolor='red', alpha=0.2)
matplotlib.pyplot.fill_between([q25, q50], 0, 11, facecolor='red', alpha=0.4)
matplotlib.pyplot.fill_between([q50, q75], 0, 11, facecolor='red', alpha=0.6)
matplotlib.pyplot.fill_between([q75, max_value], 0, 11, facecolor='red', alpha=0.85)
[12]:
<matplotlib.collections.PolyCollection at 0x7f1c3292f3c8>
[histogram with the four quartile regions shaded in increasing intensity]

Dependencies

We have 4 variables:

  • sepal_length

  • sepal_width

  • petal_length

  • petal_width

Are there dependencies between these variables?

Examples of possible dependencies:

  • If sepal_length is large, then sepal_width is also large.

  • petal_width is always about twice as large as sepal_width.
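
Before plotting, a quick numerical check is the pairwise correlation matrix (a minimal sketch; values near +1 or -1 suggest a strong linear dependency):

data.corr()  # Pearson correlation between every pair of variables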

Dependency between sepal length & sepal width

[13]:
data.plot.scatter(x='sepal_length', y='sepal_width')
[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1c328e7a90>
[scatter plot of sepal_width against sepal_length]
[14]:
import seaborn
seaborn.set()
seaborn.pairplot(data, height=1.9)
[14]:
<seaborn.axisgrid.PairGrid at 0x7f1c328be358>
[pair plot: scatter plots of all pairs of the 4 variables, histograms on the diagonal]

It looks like:

  • There are dependencies between the 4 variables.

  • The dependencies cannot be expressed between pairs of variables.

  • We suspect that our sample of flowers consists of multiple clusters. Each cluster refers to a different species of flower.

  • The dependencies are best described within each cluster.

[15]:
seaborn.pairplot(data_complete, hue="species", height=1.9)
[15]:
<seaborn.axisgrid.PairGrid at 0x7f1c30568828>
[the same pair plot, datapoints colored by species]

Clustering

We apply a machine learning technique to find the clusters of datapoints.

Machine Learning

ML = An algorithm looks at data and draws generalizations from it.

Supervised ML = The algorithm gets a bunch of correctly labeled examples and learns the connection data <=> label.

Unsupervised ML = The algorithm gets unlabeled data and finds structure within the data.

Model

Algorithm + learned generalizations = model

We apply a model to new data, where it makes predictions by assigning a label.

K-means algorithm

This is an algorithm that finds \(k\) clusters in the given dataset. (You need to specify \(k\).)

How it works (a code sketch follows the steps):

  1. Initialize: Pick \(k\) datapoints from the dataset at random. These are the initial “centers”.

  2. Assign: Assign each datapoint to its closest “center”. Now you have temporary clusters.

  3. Update: For each cluster, calculate the mean of its datapoints. Shift the cluster’s center to this mean value.

  4. Iterate: Repeat the Assign and Update steps until the centers stop moving (“converge”). If they still haven’t converged after many iterations, start over with different random initial centers.
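
Here is a minimal from-scratch sketch of these four steps, using NumPy with plain Euclidean distance and assuming no cluster ever ends up empty (in practice we use scikit-learn, below):

import numpy

def k_means(points, k, max_iterations=100, seed=0):
    rng = numpy.random.default_rng(seed)
    # 1. Initialize: pick k datapoints at random as the starting centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # 2. Assign: Euclidean distance from every point to every center,
        # then each point joins the cluster of its closest center
        distances = numpy.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each center to the mean of its cluster
        # (assumes every cluster keeps at least one point)
        new_centers = numpy.array([points[labels == i].mean(axis=0) for i in range(k)])
        # 4. Iterate until the centers stop moving
        if numpy.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# e.g.: labels, centers = k_means(data.to_numpy(), k=3)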

[step-by-step illustrations of the Initialize, Assign, and Update steps] Source: https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-means-clustering-algorithm

Points of variation (a sketch for choosing \(k\) follows the list):

  • the value \(k\)

  • the randomness of the initialization

  • the definition of “closest”

  • the definition of “mean”

  • the number of iterations you wait for convergence
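
There is no single correct \(k\). One common informal aid is the “elbow” heuristic: fit the model for several values of \(k\) and plot the total within-cluster squared distance, which scikit-learn exposes as inertia_. A minimal sketch, reusing the data DataFrame from above:

from sklearn.cluster import KMeans

inertias = []
for k in range(1, 8):
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km_k.inertia_)  # sum of squared distances to the closest center

matplotlib.pyplot.plot(range(1, 8), inertias, 'o-')
matplotlib.pyplot.xlabel('k')
matplotlib.pyplot.ylabel('inertia')
# look for the k where the curve bends (the "elbow")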

Running the algorithm

Library: https://scikit-learn.org/

[16]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=1, random_state=245234) # declare k, fix the random seed
km = km.fit(data) # run the algorithm on the data
[17]:
km.labels_
[17]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0,
       0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)
[18]:
km.cluster_centers_
[18]:
array([[6.85384615, 3.07692308, 5.71538462, 2.05384615],
       [5.006     , 3.418     , 1.464     , 0.244     ],
       [5.88360656, 2.74098361, 4.38852459, 1.43442623]])
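
As described in the Model section, the fitted model can also label unseen data: predict() assigns each new datapoint to its closest center. A minimal sketch with a hypothetical new flower (measurements invented for illustration):

new_flower = pandas.DataFrame(
    [[5.0, 3.4, 1.5, 0.2]],  # hypothetical measurements
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
print(km.predict(new_flower))  # index of the closest cluster center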

Evaluate the result

[19]:
evaluation = {
    'Iris-setosa':     {'0': 0, '1': 0, '2': 0},
    'Iris-versicolor': {'0': 0, '1': 0, '2': 0},
    'Iris-virginica':  {'0': 0, '1': 0, '2': 0}
}

# count, for each species, how many of its flowers landed in each cluster
for i in range(len(data_complete)):
    evaluation[data_complete['species'][i]][str(km.labels_[i])] += 1
[20]:
evaluation
[20]:
{'Iris-setosa': {'0': 0, '1': 50, '2': 0},
 'Iris-versicolor': {'0': 3, '1': 0, '2': 47},
 'Iris-virginica': {'0': 36, '1': 0, '2': 14}}
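
The same species-by-cluster table can be built in one line with pandas (a sketch):

pandas.crosstab(data_complete['species'], pandas.Series(km.labels_, name='cluster'))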

Run the algorithm again

Let’s see whether we get a different result with different random initial values. Keep in mind that the cluster labels 0, 1, 2 are arbitrary: two runs can find (almost) the same partition while numbering the clusters differently.

[21]:
km2 = KMeans(n_clusters=3, n_init=1, random_state=123657651)
km2 = km2.fit(data)
[22]:
evaluation2 = {
    'Iris-setosa':     {'0': 0, '1': 0, '2': 0},
    'Iris-versicolor': {'0': 0, '1': 0, '2': 0},
    'Iris-virginica':  {'0': 0, '1': 0, '2': 0}
}

# same counting as above, using len(data_complete) instead of hard-coding 150
for i in range(len(data_complete)):
    evaluation2[data_complete['species'][i]][str(km2.labels_[i])] += 1
[23]:
evaluation, evaluation2
[23]:
({'Iris-setosa': {'0': 0, '1': 50, '2': 0},
  'Iris-versicolor': {'0': 3, '1': 0, '2': 47},
  'Iris-virginica': {'0': 36, '1': 0, '2': 14}},
 {'Iris-setosa': {'0': 50, '1': 0, '2': 0},
  'Iris-versicolor': {'0': 0, '1': 48, '2': 2},
  'Iris-virginica': {'0': 0, '1': 14, '2': 36}})
[24]:
km.cluster_centers_, km2.cluster_centers_
[24]:
(array([[6.85384615, 3.07692308, 5.71538462, 2.05384615],
        [5.006     , 3.418     , 1.464     , 0.244     ],
        [5.88360656, 2.74098361, 4.38852459, 1.43442623]]),
 array([[5.006     , 3.418     , 1.464     , 0.244     ],
        [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
        [6.85      , 3.07368421, 5.74210526, 2.07105263]]))
[25]:
km.n_iter_, km2.n_iter_
[25]:
(7, 4)

Plotting the centers

[26]:
# turn each center (an array of 4 coordinates) into a dict keyed by column name,
# so that matplotlib's data= interface can look the coordinates up
col_names = [ d for d in data ]  # iterating over a DataFrame yields its column names
col_idx_name = [ (i, col_names[i]) for i in range(len(col_names)) ]
centers = [ { v: c[i] for (i,v) in col_idx_name } for c in km.cluster_centers_ ]
[27]:
centers
[27]:
[{'sepal_length': 6.853846153846154,
  'sepal_width': 3.076923076923077,
  'petal_length': 5.7153846153846155,
  'petal_width': 2.053846153846154},
 {'sepal_length': 5.006,
  'sepal_width': 3.418,
  'petal_length': 1.4639999999999995,
  'petal_width': 0.24400000000000022},
 {'sepal_length': 5.883606557377049,
  'sepal_width': 2.740983606557377,
  'petal_length': 4.388524590163934,
  'petal_width': 1.4344262295081966}]
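
An equivalent, shorter route (a sketch) is to wrap the centers in a labeled DataFrame and take its rows:

centers_df = pandas.DataFrame(km.cluster_centers_, columns=data.columns)
# centers_df.iloc[0] is then a labeled view of the first center's coordinates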

Predicted species

[28]:
c = {0: 'g', 1: 'b', 2: 'brown'}
seaborn.scatterplot(x='sepal_length', y='sepal_width', data=data,
                    hue=km.labels_, palette=c)
# mark each cluster center with an 'x' ('brown' has no one-letter
# format code, so the center of cluster 2 is drawn in red instead)
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'gx', data=centers[0])
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'bx', data=centers[1])
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'rx', data=centers[2])
[28]:
[<matplotlib.lines.Line2D at 0x7f1c30568940>]
[scatter plot colored by predicted cluster, with the 3 centers marked by × symbols]

Correct species

[29]:
c = {'Iris-setosa': 'b', 'Iris-versicolor': 'brown', 'Iris-virginica': 'g'}
seaborn.scatterplot(x='sepal_length', y='sepal_width', data=data_complete,
                    hue='species', palette=c)
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'gx', data=centers[0])
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'bx', data=centers[1])
matplotlib.pyplot.plot('sepal_length', 'sepal_width', 'rx', data=centers[2])
[29]:
[<matplotlib.lines.Line2D at 0x7f1c2d31a400>]
[the same scatter plot colored by the actual species, with the same centers marked]

Questions?