Data transfer and movement (HPC)¶
What this page covers¶
This page explains how data moves into, within, and out of the HPC environment.
It focuses on:
- how data transfer works conceptually
- constraints and performance considerations
- common patterns and trade-offs
It does not provide step-by-step instructions.
Why data transfer matters¶
Data transfer is often the hidden bottleneck in research workflows.
Poor transfer strategies can:
- slow down analysis significantly
- overload shared systems
- cause failed or incomplete jobs
- create unnecessary duplication of data
Efficient data movement is essential for:
- reproducibility
- performance
- responsible use of shared infrastructure
Types of data movement¶
1. Ingress (data into HPC)¶
Moving data from:
- personal machines (laptops/desktops)
- institutional storage (e.g. network drives)
- external systems (cloud, collaborators)
Typical use cases:
- uploading input datasets
- staging data before computation
2. Internal movement (within HPC)¶
Moving data between:
- directories
- storage tiers (e.g. home vs scratch)
- compute nodes and storage systems
Typical use cases:
- preparing data for jobs
- reorganising outputs
- optimising I/O performance
3. Egress (data out of HPC)¶
Moving data from HPC to:
- local machines
- institutional storage
- external collaborators or repositories
Typical use cases:
- retrieving results
- archiving outputs
- sharing datasets
Key concepts¶
Bandwidth vs latency¶
- Bandwidth: how much data can be transferred per unit time
- Latency: delay before transfer begins
Large single-file transfers are limited mainly by bandwidth.
Transfers of many small files are dominated by latency, since each file incurs its own setup overhead.
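To see why latency dominates for many small files, consider a rough back-of-envelope calculation (all figures below are illustrative assumptions, not measurements of any particular system):

```shell
# Illustrative only: 10,000 files of 1 MB each over a 100 MB/s link,
# with 50 ms of per-file setup latency (all figures are assumptions).
files=10000
mb_per_file=1
bandwidth_mb_s=100
overhead_ms_per_file=50

payload_s=$(( files * mb_per_file / bandwidth_mb_s ))   # time spent moving bytes
overhead_s=$(( files * overhead_ms_per_file / 1000 ))   # time spent just starting transfers
echo "payload: ${payload_s}s  overhead: ${overhead_s}s"  # → payload: 100s  overhead: 500s
```

With these numbers, per-file overhead (500 s) dwarfs the time spent actually moving bytes (100 s), which is why aggregating small files before transfer pays off.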
File size and structure¶
Performance depends strongly on how data is organised:
- Few large files → generally efficient
- Many small files → slow and inefficient
This is especially important on shared filesystems.
Network boundaries¶
Data movement crosses different network zones:
- local machine ↔ campus network
- campus network ↔ HPC environment
- HPC ↔ external systems
Each boundary may introduce:
- bandwidth limits
- security controls
- authentication requirements
Transfer nodes vs login nodes¶
HPC systems often distinguish between:
- Login nodes
  - intended for light interaction
  - not designed for heavy data transfer
- Transfer nodes (if available)
  - optimised for data movement
  - designed to handle large transfers efficiently
Using the wrong node type can degrade performance for all users.
Shared filesystems¶
HPC environments typically use shared storage systems.
Implications:
- many users access the same storage simultaneously
- metadata operations (e.g. listing files) can be expensive
- large numbers of small files can degrade performance
Common data transfer protocols¶
Different protocols are suited to different use cases:
- SCP / SFTP
  - simple, widely available
  - suitable for smaller transfers
- rsync
  - efficient for incremental updates
  - reduces redundant data transfer
- Globus (if available)
  - optimised for large-scale, reliable transfers
  - handles retries and parallelism
Each protocol has trade-offs in:
- performance
- reliability
- ease of use
Performance considerations¶
Many small files¶
This is one of the most common performance problems.
Issues:
- high overhead per file
- slow transfer speeds
- heavy load on filesystem metadata
Common mitigation approach:
- aggregate files before transfer (e.g. archive)
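One common way to aggregate is to bundle the files into a single compressed archive with tar before moving them (directory and file names below are illustrative):

```shell
# Create 100 small files, then bundle them into one compressed archive.
mkdir -p results
for i in $(seq 1 100); do echo "data $i" > "results/part_$i.txt"; done

tar czf results.tar.gz results/   # one archive instead of 100 small files

# Transfer the single archive instead (hypothetical host, shown commented):
#   scp results.tar.gz user@hpc.example.edu:/scratch/user/
```

The transfer then pays the per-file latency cost once, for one large file, instead of once per small file.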
Large datasets¶
Challenges:
- long transfer times
- potential interruptions
Considerations:
- use tools that support resuming transfers
- minimise repeated transfers of unchanged data
Parallel vs sequential transfer¶
Some tools support parallel transfer:
- can improve throughput
- may increase load on shared systems
Balance is required to avoid impacting other users.
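A minimal sketch of parallel transfer using shell background jobs, with local `cp` standing in for a network transfer tool (an assumption for demonstration; in a real workflow each copy might be an rsync to a transfer node):

```shell
# Prepare four chunks of data to move.
mkdir -p out
for i in 1 2 3 4; do
  mkdir -p "in/chunk_$i"
  echo "payload $i" > "in/chunk_$i/data.txt"
done

# Launch one copy per chunk in parallel; 'wait' blocks until all finish.
for d in in/chunk_*; do
  cp -r "$d" out/ &
done
wait
```

Keep the degree of parallelism modest on shared systems: each concurrent stream adds load on the network and filesystem that other users share.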
Data staging patterns¶
Stage-in → compute → stage-out¶
A common HPC workflow:
- transfer data into HPC (stage-in)
- run compute jobs
- transfer results out (stage-out)
Use of scratch storage¶
Temporary (scratch) storage is often:
- faster
- optimised for computation
Typical pattern:
- move data to scratch before running jobs
- write outputs to scratch
- move final results to long-term storage
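The stage-in → compute → stage-out pattern can be sketched end to end. Here `SCRATCH` and `PROJECT` are local stand-in directories for real storage tiers (an assumption; actual paths are site-specific), and a trivial `tr` command stands in for the compute step:

```shell
# Local stand-ins for scratch and long-term project storage (assumption).
SCRATCH=./scratch_demo
PROJECT=./project_demo
mkdir -p "$SCRATCH" "$PROJECT"
echo "raw input" > "$PROJECT/input.dat"

cp "$PROJECT/input.dat" "$SCRATCH/"                # stage-in
tr '[:lower:]' '[:upper:]' \
  < "$SCRATCH/input.dat" > "$SCRATCH/output.dat"   # "compute" step
cp "$SCRATCH/output.dat" "$PROJECT/"               # stage-out
rm -r "$SCRATCH"                                   # scratch is temporary
```

The key design point is that all heavy I/O happens on the fast scratch tier, and only the final result is written back to long-term storage.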
Constraints and policies¶
Data transfer is subject to system constraints:
- network bandwidth limits
- fair usage policies
- storage quotas
- security and access controls
Users are expected to:
- avoid excessive or unnecessary transfers
- use appropriate tools for the task
- minimise impact on shared infrastructure
Common pitfalls¶
- transferring large datasets via login nodes
- repeatedly copying the same data
- working with many small files without aggregation
- ignoring storage location (e.g. not using scratch)
- assuming local-machine performance applies to HPC
Relationship to other documentation¶
- Services → Data transfer: overview of available tools and when to use them
- How-to → Data transfer: step-by-step instructions for specific tools
- Reference → Storage and file systems: details on how storage is structured and behaves
Summary¶
Effective data transfer in HPC requires understanding:
- where data is moving (ingress, internal, egress)
- how it is structured (file size and layout)
- which systems and constraints apply
Good data movement practices:
- improve performance
- reduce system load
- support reproducible research workflows