Text analysis workflow (Digital Humanities)
What this tutorial does
This tutorial shows how a typical Digital Humanities text analysis workflow is carried out using UCT eResearch services.
It connects data storage, transfer, computation, and sharing into a single workflow.
Before you begin
You should:
- have access to UCT eResearch services (HPC, storage)
- have a text dataset (e.g. a corpus, archival material, or OCR outputs)
- be working with scripts or tools for text processing
Workflow overview
This workflow follows four stages:
- Store and organise your data
- Transfer data to compute
- Run text analysis
- Manage and share outputs
Step 1 — Store and organise your data
Store your raw and working data in a reliable location before analysis.
→ Store and manage research data
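A consistent directory layout makes it easier to keep raw data untouched while you work. The sketch below is one possible convention, not a required structure; the folder names (`raw`, `working`, `outputs`, `scripts`) are illustrative assumptions you can adapt to your project.

```python
from pathlib import Path

# A hypothetical layout convention; adjust the names to suit your project.
LAYOUT = ["raw", "working", "outputs", "scripts"]

def scaffold(project_root):
    """Create a simple raw/working/outputs split so raw data stays untouched."""
    root = Path(project_root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Return the directories that now exist, for confirmation.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For example, `scaffold("my_corpus_project")` creates the four folders and returns their names; keep untouched source files in `raw` and write all derived files to `working` or `outputs`.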
Step 2 — Transfer data to compute
Move your dataset to the environment where analysis will run.
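Transfers to a cluster are commonly done with `rsync` over SSH. The helper below only assembles the command; the username, hostname, and destination path shown are placeholders, not real UCT endpoints — substitute the details from your own HPC account.

```python
def build_rsync_cmd(src, user, host, dest):
    """Assemble an rsync command for copying a dataset to a remote cluster.

    -a preserves file metadata, -z compresses in transit, and --partial
    lets an interrupted transfer resume. To actually run the transfer,
    pass the result to subprocess.run(cmd, check=True).
    """
    return ["rsync", "-avz", "--partial", str(src), f"{user}@{host}:{dest}"]

# Hypothetical account and host, for illustration only:
cmd = build_rsync_cmd("corpus/", "jsmith", "hpc.example.ac.za", "/scratch/jsmith/corpus/")
```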
Step 3 — Run text analysis
Run your scripts or tools to process and analyse the text corpus.
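As a minimal illustration of what an analysis script might do, the sketch below tokenises a corpus and counts word frequencies using only the standard library. This is a deliberately simple baseline, not a recommended pipeline; real projects typically use dedicated tools such as spaCy or NLTK for tokenisation.

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase the text and extract runs of letters -- a simple baseline."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(docs):
    """Count token frequencies across a corpus (an iterable of strings)."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenise(doc))
    return counts

# Tiny illustrative corpus:
freqs = word_frequencies(["The cat sat.", "The cat ran."])
```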
Step 4 — Manage and share outputs
Store results, organise outputs, and share with collaborators if needed.
→ Store and share research data
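When sharing outputs, a checksum manifest lets collaborators verify that files arrived intact. The sketch below builds one with SHA-256 hashes; the function name and manifest format are illustrative, not a required convention.

```python
import hashlib
from pathlib import Path

def manifest(output_dir):
    """Map each file under output_dir (relative path) to its SHA-256 digest,
    so collaborators can confirm shared outputs transferred without corruption."""
    result = {}
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file():
            result[str(path.relative_to(output_dir))] = (
                hashlib.sha256(path.read_bytes()).hexdigest()
            )
    return result
```

A collaborator can recompute the same manifest on their copy and compare the two dictionaries; any mismatch pinpoints the corrupted or missing file.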
What this workflow looks like in practice
In Digital Humanities projects, this workflow often involves:
- preparing and cleaning text corpora
- iterating on scripts for analysis (e.g. tokenisation, topic modelling)
- running compute-intensive tasks on HPC when datasets are large
- managing multiple versions of datasets and outputs
- sharing results with collaborators or publishing outputs
What you have done
You have:
- organised your dataset for analysis
- moved data to the appropriate compute environment
- run analysis workflows
- managed and shared outputs
Next steps
You can extend this workflow by:
- collaborating on code and analysis workflows
→ Collaborate on code
- improving reproducibility and version control
→ Work with code repositories
- scaling analysis for larger datasets or more complex workflows
→ Run large-scale analysis