The bigger picture

During the summer school, the question of the “bigger picture” came up several times: how do the individual pieces that we discuss and present at the school fit together in a project? To address that question, we created a very (very!) simplified flow chart of a typical text-based DH project.

It starts with digitizing the original source documents, e.g. by scanning and HTR (Transkribus). In a second step, the digitized data is transformed into the desired format (TEI, SQL, RDF, JSON etc.). In many projects the transformed documents are then ingested into a database (eXist-db for XML/TEI, MySQL for SQL, Blazegraph for RDF etc.) for further processing.
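To make the transformation step more concrete, here is a minimal sketch in Python that takes already transcribed paragraphs and writes them out as JSON and as a bare TEI skeleton. The function names, the placeholder header texts and the input structure are assumptions made for illustration; a real project would use its own schema and a much richer TEI header.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # write TEI as the default namespace


def _tei(tag: str) -> str:
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def document_to_json(title: str, paragraphs: list[str]) -> str:
    """Serialize a transcribed document as a simple JSON object."""
    return json.dumps({"title": title, "paragraphs": paragraphs}, ensure_ascii=False)


def document_to_tei(title: str, paragraphs: list[str]) -> str:
    """Wrap transcribed paragraphs in a bare TEI skeleton (illustrative only)."""
    root = ET.Element(_tei("TEI"))
    header = ET.SubElement(root, _tei("teiHeader"))
    file_desc = ET.SubElement(header, _tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _tei("titleStmt"))
    ET.SubElement(title_stmt, _tei("title")).text = title
    # Placeholder texts (assumptions); real headers describe publisher and source.
    pub_stmt = ET.SubElement(file_desc, _tei("publicationStmt"))
    ET.SubElement(pub_stmt, _tei("p")).text = "Unpublished sketch"
    source_desc = ET.SubElement(file_desc, _tei("sourceDesc"))
    ET.SubElement(source_desc, _tei("p")).text = "Transcribed from a scan"
    text = ET.SubElement(root, _tei("text"))
    body = ET.SubElement(text, _tei("body"))
    for para in paragraphs:
        ET.SubElement(body, _tei("p")).text = para
    return ET.tostring(root, encoding="unicode")
```

Both functions start from the same in-memory data, which is the point of the step: once the text is digitized, the target format is a choice made per project and per database.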

After digitization comes an enrichment process. Enrichment can be anything from manual annotation in TEI (e.g. for creating a digital edition) to sophisticated neural networks that extract information from the texts. Most of the time, the enrichment process involves so-called reference resources (GND, VIAF, Wikidata etc.). Reference resources are large knowledge bases that store metadata on entities (e.g. birth dates of people) and assign them unique identifiers [1].
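As a rough illustration of what linking to a reference resource means in practice, the following Python sketch attaches a Wikidata entity URI to a person record. The `Person` class, the `enrich` function and the hard-coded lookup table are hypothetical stand-ins; a real project would query the reference resource's own service and deal with ambiguous names.

```python
from dataclasses import dataclass, field

# Stand-in for a reference resource lookup (assumption): a real project
# would query the resource's service instead of a hard-coded dictionary.
REFERENCE_INDEX = {
    "douglas adams": "http://www.wikidata.org/entity/Q42",
}


@dataclass
class Person:
    """Hypothetical person record from a project database."""
    name: str
    same_as: list[str] = field(default_factory=list)


def enrich(person: Person, index: dict[str, str]) -> Person:
    """Add the reference resource URI for a person if the name is known."""
    uri = index.get(person.name.strip().lower())
    if uri and uri not in person.same_as:
        person.same_as.append(uri)
    return person
```

Storing the full URI rather than a bare number keeps the identifier unambiguous, because it also says which reference resource it comes from.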

Finally, there is a publication process, which sometimes involves another transformation step [2]. The final data is consumed by a web application. In addition, to comply with the FAIR data principles, we serialize the data to a common data format and publish it in a long-term archive (e.g. Zenodo, ARCHE, GAMS). Because prosopographical databases are often a side project of digital editions (as is the case in the MRP), the prosopographical data will often also be consumed by other applications, ideally via an API, e.g. by the digital edition web app.
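The idea of serving one record in several formats can be sketched in Python as a small registry of serializer functions. Everything here is an assumption for illustration: the record layout, the function names and the choice of a namespace-less TEI `person` fragment. It is not how any particular project implements its serializers.

```python
import json
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical person record; the field names are illustrative only.
record = {
    "id": "person_1",
    "name": "Douglas Adams",
    "birth": "1952-03-11",
    "same_as": ["http://www.wikidata.org/entity/Q42"],
}


def person_to_json(rec: dict) -> str:
    """Serialize the record as JSON with a stable key order."""
    return json.dumps(rec, ensure_ascii=False, sort_keys=True)


def person_to_tei(rec: dict) -> str:
    """Serialize the record as a TEI-style person fragment (no namespace)."""
    person = ET.Element("person", {f"{{{XML_NS}}}id": rec["id"]})
    ET.SubElement(person, "persName").text = rec["name"]
    ET.SubElement(person, "birth", {"when": rec["birth"]})
    for uri in rec["same_as"]:
        ET.SubElement(person, "idno", {"type": "URI"}).text = uri
    return ET.tostring(person, encoding="unicode")


# One entry per output format; an API could select by a format parameter.
SERIALIZERS = {"json": person_to_json, "tei": person_to_tei}


def serialize(rec: dict, fmt: str) -> str:
    """Look up the serializer for the requested format and apply it."""
    serializer = SERIALIZERS.get(fmt)
    if serializer is None:
        raise ValueError(f"unsupported format: {fmt}")
    return serializer(rec)
```

The same function could feed both the archive export and other applications, so that the web application and the long-term archive see consistent data.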


The bigger picture (please feel free to correct errors in the schema and/or add your own thoughts as comments in the original file)


[1] In theory, an identifier can be anything, but it has become good practice to use URLs in the namespace of the reference resource. The reference resources own their namespaces and can therefore guarantee that the identifiers are unique.


[2] E.g. in APIS we have several serializers for different formats (RDF, TEI, JSON etc.) whose output can be consumed via the API.