Summer School: “People in the Digital Age” (Digital Prosopography)

Prepare

  • Jupyter notebook:

Clone the GitLab repository

[1]:
import os
from getpass import getpass
import urllib.parse

user = input('User name: ')
password = getpass('Password: ')
password = urllib.parse.quote(password)  # URL-encode special characters in the password

cmd_string = 'git clone https://{0}:{1}@gitlab.com/acdh-oeaw/summerschool2020.git'.format(user, password)

os.system(cmd_string)
cmd_string, password = "", ""  # overwrite, so the credentials do not linger in the notebook
User name:  YourName
Password:  ········

Importing Packages

[1]:
import os
import sys
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (20.0, 15.0)

import gensim
from gensim import corpora
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.models import KeyedVectors

import networkx as nx

from sklearn.manifold import TSNE
from sklearn.cluster import AffinityPropagation, DBSCAN, AgglomerativeClustering, MiniBatchKMeans
from sklearn.decomposition import PCA

from ipywidgets import IntProgress
from IPython.display import display

from bs4 import BeautifulSoup as bs
from xml.etree import ElementTree as ET

Some tools/libraries:

  • Gensim: A free Python library for building language models.

  • NetworkX: A Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Natural Language Processing (NLP) for Digital Humanities

  • Word embeddings

Word embeddings are one of the many ways of representing textual documents. They are capable of capturing the context in which words appear in a given corpus.

What are word embeddings exactly? Loosely speaking, they are vector representations of a particular word.

First, let’s see the most traditional way to represent words as vectors: a technique called “Bag of Words”:

Let’s suppose we have three documents:

  • Document 1 - “I am feeling very happy today”

  • Document 2 - “I am not well today”

  • Document 3 - “I wish I could go to play”

The first step would be to create a vocabulary using all unique words from all documents.

For this small corpus, it would be: [I, am, feeling, very, happy, today, not, well, wish, could, go, to, play]

Then, for each document, the frequency of each vocabulary word in that document is recorded.
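
A minimal sketch of this counting step in plain Python, using the three toy documents above:

[ ]:
# Bag of Words by hand: one count vector per document
docs = ["I am feeling very happy today",
        "I am not well today",
        "I wish I could go to play"]

# the vocabulary: every unique word, in order of first appearance
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)
print(vocab)

# count how often each vocabulary word occurs in each document
for doc in docs:
    tokens = doc.split()
    print([tokens.count(word) for word in vocab])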

Vector representation of documents

We could plot all words as vectors in an n-dimensional space (where n is the size of our vocabulary). Every document (sentence) would then be represented by the sum of its individual word vectors.

Vector space

But if the vocabulary is huge, this leads to very sparse vectors (composed mostly of zeros).

Word embeddings are an efficient and effective way of representing words as vectors. The whole body of the text is encapsulated in a space of much lower dimension. Instead of each word occupying its own dimension, all words are represented as combinations of a smaller set of dimensions (usually around 200-300), and their relationships with each other can be captured explicitly. This is done using a neural network architecture like these:

Word2vec CBOW and Skip-gram architectures
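
As a small preview of the gensim API used later in this notebook, here is a hedged toy example that trains a tiny skip-gram model on the three documents above (a real model needs a corpus many orders of magnitude larger; note that gensim 4 renamed the size parameter to vector_size):

[ ]:
# toy Word2Vec for illustration only -- meaningful embeddings require
# millions of sentences, not three
toy_sentences = [['i', 'am', 'feeling', 'very', 'happy', 'today'],
                 ['i', 'am', 'not', 'well', 'today'],
                 ['i', 'wish', 'i', 'could', 'go', 'to', 'play']]
toy_model = gensim.models.Word2Vec(toy_sentences, size=10, window=2,
                                   min_count=1, sg=1)  # sg=1: skip-gram
print(toy_model.wv['happy'])  # a 10-dimensional vector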

The vector space is abstract, but we could think of an analogy like this:

Analogy on the word vector components

The resulting vector space can be illustrated by the following figure, that depicts how this space is able to capture relationships:

Geometrical relationships of embedded words

If we have corpora that span different historical periods, we can analyse how the meaning of words evolves over time:

Semantic change through time

Now let’s see it in practice!

Let’s download the German model:

[ ]:
! wget http://cloud.devmount.de/d2bc5672c523b086/german.model -P ./data/

Run the next cell: you should see a file of approximately this size (in bytes): 738037646

[2]:
! ls -l ./data/german.model
-rw-rw-r-- 1 rsouza rsouza 738037646 jun 18  2018 ./data/german.model

Using a German word2vec model trained on the German Wikipedia and German news articles (version of 15.5.2015)

[2]:
# load the pretrained model (binary word2vec format, hence binary=True)
trained_model = gensim.models.KeyedVectors.load_word2vec_format('./data/german.model', binary=True)
[3]:
# replace the original vectors with their L2-normalized versions to free up memory
trained_model.init_sims(replace=True)

Checking the words that are present in the model:

Change the word as you like

[4]:
word = "Fruehstueck"
[k for k,w in trained_model.vocab.items() if k.startswith(word)][0:20]
[4]:
['Fruehstueck',
 'Fruehstueckstisch',
 'Fruehstuecksfernsehen',
 'Fruehstuecksbuffet',
 'Fruehstuecken',
 'Fruehstuecks',
 'Fruehstueck_Mittagessen',
 'Fruehstuecksei',
 'Fruehstueckspause',
 'Fruehstuecksbuefett',
 'Fruehstuecksraum',
 'Fruehstueckstreffen',
 'Fruehstueck_Tiffany',
 'Fruehstuecksflocken',
 'Fruehstueck_Abendessen',
 'Fruehstuecks-',
 'Fruehstuecksbroetchen',
 'Fruehstuecksbueffet',
 'Fruehstueck_Gruenen']

Examining the vector representation of a word

[5]:
word = "Wien"
print(trained_model[word].shape)
trained_model[word]
(300,)
[5]:
array([ 0.0238339 ,  0.04943422,  0.0424271 ,  0.0229144 , -0.00698469,
       -0.00395486,  0.0729572 ,  0.03326682, -0.05274953, -0.05048699,
       -0.01097974,  0.01941785, -0.00286674,  0.05530975, -0.01717546,
        0.00087026, -0.03554425,  0.08173548,  0.16878459,  0.05061062,
       -0.0181843 , -0.04035862,  0.06884448, -0.03644295, -0.01597526,
       -0.01427664, -0.03724374, -0.07765247, -0.04940229,  0.08755711,
        0.0406867 ,  0.10670323,  0.00047026, -0.08471008,  0.08366133,
       -0.02376188,  0.00558794,  0.01392431,  0.00027164,  0.02758369,
       -0.06885621, -0.09902706,  0.03808893,  0.04066271, -0.03511016,
        0.00529123,  0.00842345,  0.08330758,  0.07514405,  0.0469595 ,
        0.03891004,  0.02661347, -0.06389445, -0.13004696, -0.01151704,
        0.04386351, -0.10433266, -0.04676365, -0.09454329, -0.01589428,
        0.06392791, -0.13852148,  0.10145868,  0.04462037,  0.02761784,
        0.03245379,  0.03132826, -0.00032424, -0.04573069,  0.00376301,
       -0.03783711,  0.05745342, -0.00545467,  0.10306486,  0.08046119,
       -0.01117319,  0.07071265, -0.01459174, -0.03979302, -0.05731346,
       -0.0934009 ,  0.00143596,  0.04024531, -0.06594911,  0.01442153,
       -0.02157819,  0.00861198, -0.03360666, -0.00217473, -0.05325684,
       -0.06063867,  0.05235587, -0.07148876,  0.14030436,  0.08084998,
        0.05225494,  0.02232727, -0.0375154 , -0.05153104,  0.04204175,
        0.06927375, -0.0213337 , -0.0359464 , -0.0986684 , -0.10689045,
        0.0422255 , -0.04803819, -0.03605939,  0.0329151 ,  0.07575817,
        0.05432944, -0.01006523, -0.0031073 , -0.01198116, -0.00700702,
       -0.04663587, -0.06598163, -0.02608355,  0.03399949,  0.08767203,
       -0.07043833,  0.05742546,  0.02844732, -0.0641215 ,  0.03557531,
       -0.00807026, -0.01884014, -0.00610054,  0.10268665, -0.08756135,
        0.03461902,  0.01096097,  0.01700337,  0.02436678, -0.07595782,
        0.09562863,  0.02614196, -0.01443401, -0.01836962,  0.04879408,
       -0.03684777, -0.0711819 , -0.03422985, -0.07293234, -0.06803517,
       -0.01470779,  0.02610919,  0.03275155,  0.0098528 ,  0.04751247,
       -0.02650394, -0.04922485,  0.0966305 ,  0.06181309, -0.07539852,
        0.03954438,  0.06627589,  0.0742739 ,  0.01676205,  0.09427383,
        0.10264442,  0.010852  ,  0.01417513, -0.01856818, -0.02165503,
       -0.0018881 , -0.01918374,  0.03137087, -0.0257187 ,  0.04286152,
       -0.0495176 ,  0.01291669, -0.02210455,  0.04033438, -0.07039625,
       -0.01298227, -0.03088601, -0.0148988 ,  0.0213136 , -0.00308892,
       -0.03916396,  0.00842758, -0.07342014,  0.05616776, -0.0369697 ,
       -0.01181048,  0.06073389, -0.01649796, -0.0186084 , -0.00040346,
       -0.0586713 , -0.0343927 ,  0.01033644,  0.11091413, -0.0552993 ,
        0.01196672, -0.08076494, -0.00777858,  0.04654175,  0.04052089,
        0.03798108, -0.02726053, -0.10897127,  0.09802596, -0.02641043,
        0.04788393, -0.0549567 , -0.01355664, -0.10841926,  0.06741678,
       -0.04732709,  0.0523398 ,  0.0454935 ,  0.01515049, -0.02464675,
       -0.08356615,  0.04951958,  0.01018465,  0.04417115,  0.07852858,
        0.0225594 , -0.00074988, -0.04641408,  0.05589736, -0.1395837 ,
        0.08130305,  0.17868005,  0.01470505, -0.04527267, -0.01598396,
        0.01736509,  0.03316962, -0.09426542, -0.03690347,  0.11291514,
        0.02533004,  0.00281629, -0.05241821, -0.16159989,  0.06563755,
        0.01156138,  0.09131934, -0.05008974, -0.0002698 , -0.00804178,
        0.05430156, -0.0405793 , -0.15452953,  0.09096871,  0.00484005,
        0.08594571,  0.00039966, -0.05168523, -0.04028896, -0.01445882,
        0.03440348, -0.02790812, -0.0124769 , -0.08990064, -0.09347708,
        0.09387834, -0.03679283,  0.09682948,  0.03489142, -0.02784478,
       -0.13160774,  0.14424536,  0.00976484, -0.07370998, -0.0155739 ,
       -0.10138693,  0.00144212,  0.06416222, -0.04055014, -0.09581374,
       -0.0326473 , -0.00590264,  0.00297864,  0.0474886 ,  0.01101959,
        0.00588132,  0.05759835,  0.02765032, -0.03065836, -0.00759589,
        0.06519953, -0.03736394, -0.14547537,  0.00297634, -0.02188038,
        0.04868366,  0.02270741,  0.04986598, -0.03618561,  0.00272301,
       -0.01465227,  0.06472276, -0.03055525, -0.022267  ,  0.00160853],
      dtype=float32)

There are different metrics for vector distances

[6]:
print(trained_model.similarity('kopf', 'blau'))
print(trained_model.distance('kopf', 'blau'))
0.4494617
0.5505383014678955
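
Here similarity is the cosine similarity between the two word vectors, and distance is simply 1 - similarity. A quick check with numpy (a sketch, using the model loaded above):

[ ]:
# verify: similarity() returns the cosine of the angle between the two vectors,
# distance() returns 1 minus that value
v1, v2 = trained_model['kopf'], trained_model['blau']
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim, 1 - cos_sim)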

What are the words most similar to each word in the list?

Change the words as you like

[20]:
words = ['Werkzeug', 'blau', 'rot', 'kopf', 'Gewerbe']
series = []
for word in words:
    series.append(pd.DataFrame(trained_model.most_similar(word, topn=10),
                               columns=[f"Similar_{word}", "similarity"]))
df = pd.concat(series, axis=1)
df.head(10)
[20]:
   Similar_Werkzeug  similarity  Similar_blau  similarity  Similar_rot  similarity  Similar_kopf          similarity  Similar_Gewerbe          similarity
0  Werkzeuge         0.783259    rot           0.828190    gelb         0.835019    sperrangelweit_offen  0.623633    Gastronomie              0.690014
1  Schraubenzieher   0.734391    gruen         0.795948    gruen        0.829063    angewurzelt           0.593449    Handel_Handwerk          0.689797
2  Schweissgeraet    0.733918    gelb          0.783880    blau         0.828190    bombenfest            0.588438    Gewerbebetriebe          0.681497
3  Werkzeugen        0.722541    orange        0.774674    schwarz      0.782608    hinterm_Tresen        0.586510    Gewerbetreibende         0.673723
4  Akkuschrauber     0.719295    grau          0.762676    orange       0.771306    roten_Lettern         0.586199    Gewerbe_Industrie        0.673452
5  Werkzeugkoffer    0.712236    Blau          0.759318    grau         0.731490    Haende_Hosentaschen   0.584047    Kleingewerbe             0.669487
6  Stemmeisen        0.711309    schwarz       0.755070    rot_gelb     0.723444    Praesentierteller     0.579342    Handel_Gewerbe           0.668937
7  Bolzenschneider   0.701930    braun         0.744812    rote         0.722512    gluckst               0.577661    Einzelhandel             0.667275
8  Brecheisen        0.697887    tuerkis       0.736878    braun        0.721413    splitterfasernackt    0.576129    Dienstleistung           0.664609
9  Bohrmaschine      0.697744    rosa          0.733246    lila         0.719744    wackeligen_Beinen     0.574928    Dienstleistungsbetriebe  0.649108

What is the word that does not fit?

Change the words as you like

[8]:
word1, word2, word3, word4, word5 = 'blau','rot','feld','gruen','gelb'
#word1, word2, word3, word4, word5 = 'Fruehstueck', "Fenster", 'Abendessen','Mittagessen', "Soupe"
#word1, word2, word3, word4, word5 = "Vater", "Mutter", "Sohn", "Tochter", "Oma"
#word1, word2, word3, word4, word5 = "Frankreich","England","Deutschland","Berlin","Oesterreich"

print(trained_model.doesnt_match([word1, word2, word3, word4, word5]))
feld
/usr/local/lib/python3.8/dist-packages/gensim/models/keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)

Let’s do some vector arithmetic with words: the classic example is Koenig - mann + frau ≈ Koenigin.

Change the words as you like!

[9]:
positive_vectors = ['Koenig', 'frau']
negative_vectors = ['mann']

#positive_vectors = ['frau', 'blau']
#negative_vectors = ['mann']

for result in trained_model.most_similar(positive=positive_vectors,
                                          negative=negative_vectors):
    print(result)
('Prinzen', 0.6192535161972046)
('Prinzessin', 0.6132442951202393)
('Koenigin', 0.5914254188537598)
('Prinz', 0.5866174697875977)
('Koenigin_Niederlande', 0.5534396767616272)
('Regentin', 0.542670488357544)
('Maerchenprinzen', 0.5391709804534912)
('Majestaet', 0.5391337871551514)
('Thron', 0.5360094308853149)
('Thronfolger', 0.5285029411315918)
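
Another classic analogy is the capital-of relation. Whether it comes out cleanly depends on the model and its vocabulary (the words below are assumptions):

[ ]:
# capital-of analogy: Paris - Frankreich + Oesterreich = ?
for result in trained_model.most_similar(positive=['Paris', 'Oesterreich'],
                                         negative=['Frankreich'], topn=5):
    print(result)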

Let’s try reducing the dimensionality of the space to see a 2D projection of the vectors…

PCA Illustration

This helper function reduces the dimensionality and plots the 2D projections:

[11]:
def draw_words(model, words, pca=False, alternate=True, arrows=True, x1=3, x2=3, y1=3, y2=3, title=''):
    # get vectors for given words from model
    vectors = [model[word] for word in words]
    if pca:
        pca_model = PCA(n_components=2, whiten=True)
        vectors2d = pca_model.fit(vectors).transform(vectors)
    else:
        tsne = TSNE(n_components=2, random_state=0)
        vectors2d = tsne.fit_transform(vectors)

    # draw image
    plt.figure(figsize=(15,15))
    if pca:
        plt.axis([x1, x2, y1, y2])

    first = True # color alternation to divide given groups
    for point, word in zip(vectors2d , words):
        # plot points
        plt.scatter(point[0], point[1], c='r' if first else 'g')
        # plot word annotations
        plt.annotate(
            word,
            xy = (point[0], point[1]),
            xytext = (-7, -6) if first else (7, -6),
            textcoords = 'offset points',
            ha = 'right' if first else 'left',
            va = 'bottom',
            size = "x-large"
        )
        first = not first if alternate else first

    # draw arrows
    if arrows:
        for i in range(0, len(words)-1, 2):
            a = vectors2d[i][0] + 0.04
            b = vectors2d[i][1]
            c = vectors2d[i+1][0] - 0.04
            d = vectors2d[i+1][1]
            plt.arrow(a, b, c-a, d-b,
                shape='full',
                lw=0.1,
                edgecolor='#bbbbbb',
                facecolor='#bbbbbb',
                length_includes_head=True,
                head_width=0.08,
                width=0.01
            )

    # draw diagram title
    if title:
        plt.title(title)
[12]:
wordpairs = ["Mann", "Vater",
             "Frau",  "Mutter",
             "Mutter", "Oma",
             "Vater", "Grossvater",
             "Junge", "Mann",
             "Maedchen", "Frau",
            ]

draw_words(trained_model, wordpairs, True, True, True, -2.5, 2.5, -2.5, 2.5, 'PCA visualization')
PCA projection of the word pairs
[13]:
# plot currencies
wordpairs = ["Schweiz", "Franken",
             "Deutschland", "Euro",
             "Grossbritannien", "britische_Pfund",
             "Japan", "Yen",
             "Russland", "Rubel",
             "USA", "US-Dollar",
             "Kroatien", "Kuna",
             "Oesterreich", "Euro",]

draw_words(trained_model, wordpairs, True, True, True, -2, 2, -2, 2, 'PCA visualization')
PCA projection of the country-currency pairs

Now it is time to build your own projection…

Change the words as you like!

[14]:
# change any pairs of words and run the cell again

wordpairs = ["klein", "kleiner",
             "klein","kleinste",
             "gross", "grosser",
             "gross", "groessten"
            ]

draw_words(trained_model, wordpairs, True, True, True, -2.5, 2.5, -2.5, 2.5, 'PCA visualization')
PCA projection of the chosen word pairs

Now we are going to build a graph of similar words…

[15]:
def build_neighbors(word, model, nviz=10):
    g = nx.Graph()
    g.add_node(word, color='r')                 # the query word, in red
    viz1 = model.most_similar(word, topn=nviz)  # first-level neighbors, in blue
    for v in viz1:
        g.add_node(v[0], color='b')
    # connect the query word only to its strongest neighbors
    g.add_weighted_edges_from([(word, v, w) for v, w in viz1 if w > 0.65])
    for v in viz1:
        # second-level neighbors, in yellow
        for l in model.most_similar(v[0], topn=nviz):
            g.add_node(l[0], color='y')
        g.add_weighted_edges_from([(v[0], v2, w2) for v2, w2 in model.most_similar(v[0])])
    # re-add the first-level nodes and the query word, so their colors
    # are not overwritten by the second-level pass
    for v in viz1:
        g.add_node(v[0], color='b')
    g.add_node(word, color='r')
    return g

Change the word as you like!

[16]:
word = 'Oesterreich'
#word = 'Akademie'
#word = "Kant"
#word = "Rock"

G = build_neighbors(word, trained_model, 10) # number of similar words to display
pos = nx.spring_layout(G, iterations=100)
nx.draw_networkx(G,
                 pos=pos,
                 node_color=list(nx.get_node_attributes(G, 'color').values()),
                 node_size=1000,
                 alpha=0.8,
                 font_size=12,
                )
Graph of the words most similar to the chosen word

Some exercises:

  • Let’s suppose you want to find out whether the German Wikipedia has any gender biases, using vectorial analogies. Could you think of a test? (One possible probe is sketched after this list.)
    hint: see this paper
  • Can you train your own word2vec model from your own corpora?
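
One possible probe for the first exercise, in the spirit of the well-known “man is to computer programmer as woman is to homemaker” analogy studies: move an occupation word along the mann-frau direction and inspect what the model returns. The word choices below are assumptions; check that they exist in the vocabulary first.

[ ]:
# a hedged bias probe: which words does the model associate with 'Arzt'
# (physician) once we move from 'mann' towards 'frau'?
for result in trained_model.most_similar(positive=['frau', 'Arzt'],
                                         negative=['mann'], topn=5):
    print(result)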

Now for the second exercise: let’s use one of the corpora available in this course:

[10]:
path = './summerschool2020/datasets/MPR/TEI/'
teixmlfiles = [f for f in os.listdir(path) if f.endswith('xml')]
[11]:
class WordTrainer(object):
    def __init__(self, dir_name):
        self.dir_name = dir_name
    def __iter__(self):
        # yield one list of lowercased tokens per TEI/XML file
        for file_name in [f for f in os.listdir(self.dir_name) if f.endswith('xml')]:
            with open(os.path.join(self.dir_name, file_name), 'r') as fh:
                text = bs(fh, 'lxml').text  # strip the XML markup, keep the text
            yield [word.lower() for word in text.split()]

The next cell will train the model. It may take a while…

[12]:
generator = WordTrainer(path)

# Train a word2vec model: 200 dimensions, 10-word context window,
# ignoring words that occur fewer than 5 times
word_vector_model = gensim.models.Word2Vec(generator, size=200, window=10, min_count=5)
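
If the training succeeded, it is worth persisting the model so it does not have to be retrained in the next session (the file name below is an arbitrary choice):

[ ]:
# save the trained model for reuse; the path is arbitrary
word_vector_model.save('./data/mpr_word2vec.model')
# ... reload it later with:
# word_vector_model = gensim.models.Word2Vec.load('./data/mpr_word2vec.model')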
[16]:
# word_vector_model.wv.vocab.keys()
word = 'minist'
[k for k in word_vector_model.wv.vocab.keys() if k.startswith(word)][0:20]
[16]:
['ministerrates',
 'ministerium',
 'ministerratsprotokolle',
 'ministerratsprotokolle,',
 'ministerrat.dtd',
 'ministerkonferenz,',
 'ministerkonferenz',
 'ministerpräsidenten,',
 'ministers',
 'ministerien',
 'ministerrat,',
 'ministerrat',
 'minister',
 'ministerrate',
 'ministerium.',
 'ministern',
 'ministeriums',
 'ministerberatungen',
 'ministerkonferenz]',
 'ministerkonferenzen']
[18]:
words = ['ministerrates', 'ministerium', 'ministerratsprotokolle', 'ministerratsprotokolle,']
series = []
for word in words:
    series.append(pd.DataFrame(word_vector_model.wv.most_similar(word, topn=10),
                               columns=[f"Similar_{word}", "similarity"]))
df = pd.concat(series, axis=1)
df.head(10)
[18]:
   Similar_ministerrates  similarity  Similar_ministerium  similarity  Similar_ministerratsprotokolle                     similarity  Similar_ministerratsprotokolle,                    similarity
0  ministerrats           0.542947    ministeriums         0.698295    österreichisch-ungarischen                         0.832775    1848-1866                                          0.940745
1  ministerrates,         0.534776    ministerio           0.619384    (https://creativecommons.org/licenses/by/4.0/d...  0.775498    (teilbestand)                                      0.920847
2  protokolle             0.495456    armeeoberkommando,   0.549736    online                                             0.774431    https://www.archivinformationssystem.at/detail...  0.875206
3  abteilung              0.489594    justizministerium    0.519481    österreichs                                        0.744246    ömr-prot                                           0.858871
4  kriegsministeriums     0.477623    reichsministerium    0.511560    monarchie                                          0.730170    quelle                                             0.807843
5  1848–1867,             0.475620    ministern            0.508819    herausgeber/in:                                    0.717563    kk                                                 0.739119
6  ministerrats,          0.474603    ministers            0.507961    (mrp)                                              0.706194    tei-datei                                          0.704549
7  reichstages            0.470342    kriegsministerium    0.506486    heindl.                                            0.686432    ka                                                 0.698466
8  staatsministeriums     0.470340    handelsministerium   0.505275    wissenschaftlicher                                 0.677123    at-oesta/hhsta                                     0.632534
9  handelsministers       0.465075    minister             0.487464    malfèr.                                            0.676343    notesstmt                                          0.557569
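
Note that tokens such as 'ministerratsprotokolle,' and 'ministerium.' keep their trailing punctuation, because WordTrainer splits on whitespace only. A stricter tokenizer would merge these variants into a single vocabulary entry; here is a minimal sketch (the model would have to be retrained with it to take effect):

[ ]:
# a stricter tokenizer: keep only alphabetic sequences, drop digits and punctuation
def tokenize(text):
    return [w.lower() for w in re.findall(r'[^\W\d_]+', text)]

print(tokenize('Die Ministerratsprotokolle, 1848-1866.'))
# -> ['die', 'ministerratsprotokolle']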
[ ]: