Text analysis workflow (Digital Humanities)

What this tutorial does

This tutorial shows how to carry out a typical Digital Humanities text analysis workflow using UCT eResearch services.

It connects data storage, transfer, computation, and sharing into a single end-to-end workflow.


Before you begin

You should:

  • have access to UCT eResearch services (HPC, storage)
  • have a text dataset (e.g. a corpus, archival material, or OCR outputs)
  • be working with scripts or tools for text processing

If you are unsure where to begin:

Start here


Workflow overview

This workflow follows four stages:

  1. Store and organise your data
  2. Transfer data to compute
  3. Run text analysis
  4. Manage and share outputs

Step 1 — Store and organise your data

Store your raw and working data in a reliable location before analysis.

Store and manage research data
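As an illustration, separating raw data, working copies, and outputs makes the later steps easier. The folder names below are only a suggested convention, not a UCT requirement; a layout like this can be set up with a short Python script:

```python
from pathlib import Path

# Suggested (not prescribed) layout: raw data stays untouched,
# working copies and outputs live in separate folders.
def create_project_layout(root: str) -> list[Path]:
    base = Path(root)
    folders = [
        base / "data" / "raw",      # untouched source texts (OCR, archives)
        base / "data" / "working",  # cleaned / tokenised copies
        base / "outputs",           # analysis results
        base / "scripts",           # processing code
    ]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders

created = create_project_layout("my_dh_project")
print(len(created))  # 4
```

Keeping raw data in its own read-only folder means any cleaning step can be re-run from the originals.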


Step 2 — Transfer data to compute

Move your dataset to the environment where analysis will run.

Transfer data
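In practice, transfers to HPC are usually done with a command-line tool such as rsync or scp (see the linked guide). As a purely local illustration of the staging idea, copying a dataset folder before analysis can be sketched in Python:

```python
import shutil
from pathlib import Path

# Illustration only: copy a corpus folder to a staging area before analysis.
# For a real transfer to HPC you would typically use rsync or scp instead.
def stage_dataset(src: str, dest: str) -> int:
    """Copy src into dest and return the number of files staged."""
    shutil.copytree(Path(src), Path(dest), dirs_exist_ok=True)
    return sum(1 for p in Path(dest).rglob("*") if p.is_file())

# Hypothetical example corpus with a single text file:
Path("corpus").mkdir(exist_ok=True)
(Path("corpus") / "doc1.txt").write_text("some archival text")
copied = stage_dataset("corpus", "staging/corpus")
print(copied)  # 1
```

Staging a copy (rather than moving the originals) keeps the stored dataset intact if a compute job fails mid-run.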


Step 3 — Run text analysis

Run your scripts or tools to process and analyse the text corpus.

Run large-scale analysis
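A minimal sketch of the kind of script this step runs: tokenising a text and counting word frequencies. Real projects would typically use a library such as NLTK or spaCy, but the core idea fits in a few lines of standard-library Python:

```python
import re
from collections import Counter

# Minimal tokenisation + word-frequency pass, the kind of per-document
# script that is scaled up across a corpus on HPC.
def word_frequencies(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

sample = "The archive holds letters; the letters describe the voyage."
print(word_frequencies(sample, top_n=3))
# [('the', 3), ('letters', 2), ('archive', 1)]
```

On HPC, a script like this would be wrapped in a job submission so each corpus file (or batch of files) is processed in parallel.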


Step 4 — Manage and share outputs

Store results, organise outputs, and share with collaborators if needed.

Store and share research data
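As a sketch of this step, analysis results can be written to a plain, shareable format such as CSV, which collaborators can open without special tooling (the file path here is illustrative):

```python
import csv
from pathlib import Path

# Illustration: save word-frequency results as CSV so collaborators
# can open them in a spreadsheet or load them into other tools.
def save_frequencies(rows: list[tuple[str, int]], path: str) -> Path:
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token", "count"])  # header row
        writer.writerows(rows)
    return out

result = save_frequencies([("the", 3), ("letters", 2)], "outputs/frequencies.csv")
print(result.read_text(encoding="utf-8"))
```

Dating or versioning output filenames (e.g. by run date) helps when the analysis is iterated many times, as DH workflows usually are.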


What this workflow looks like in practice

In Digital Humanities projects, this workflow often involves:

  • preparing and cleaning text corpora
  • iterating on scripts for analysis (e.g. tokenisation, topic modelling)
  • running compute-intensive tasks on HPC when datasets are large
  • managing multiple versions of datasets and outputs
  • sharing results with collaborators or publishing outputs

What you have done

You have:

  • organised your dataset for analysis
  • moved data to the appropriate compute environment
  • run analysis workflows
  • managed and shared outputs

Next steps

You can extend this workflow by: