
Architecting for Scale: Hybrid Workflows between Workstation and HPC

#general #phd #data engineering #infrastructure #open science

As the ingestion pipelines detailed in the previous entry (PubCrawl) reached maturity, the primary constraint on this research project shifted from network bandwidth to compute and storage I/O.

Processing a corpus of 10^8 documents involves distinct computational phases: stateful data management (databases) and stateless batch processing (inference/training). A standard academic laptop or a single cloud instance is insufficient for this duality. Consequently, I have spent the last quarter architecting a hybrid computing environment that bridges local institute resources with the university’s High-Performance Computing (HPC) infrastructure.

The Architectural Split: Service vs. Batch

The fundamental design challenge is that “Big Data” science requires two opposing types of workloads:

  1. The “Hot” Storage Layer: Requires random access, persistent services (PostgreSQL, Elasticsearch), and low-latency queries for data cleaning and exploration.
  2. The “Heavy” Compute Layer: Requires massive parallelization (GPU/CPU clusters) for embedding generation and graph training, typically managed via a batch scheduler.

HPC environments are optimized for the latter but hostile to the former; you cannot easily run a persistent SQL database on a SLURM cluster node that resets every 24 hours.

Tier 1: The Institute Workstation (The “Data Mule”)

To handle the “Hot” layer, I have provisioned a dedicated high-memory workstation within the institute network. This machine serves as the persistent backbone of the project.

  • Role: Database Host & Prototyping
  • Stack: PostgreSQL (Metadata), Elasticsearch (Full-text index).
  • Function: All raw data acquired by the crawlers is funneled here. This node handles the messy I/O of millions of small files, sanitizing them into structured SQL tables. It allows for rapid prototyping of algorithms on subsets of data (e.g., 100k papers) without waiting in the HPC queue.
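To make the “messy I/O” step concrete, here is a minimal sketch of how raw crawler output could be funneled into the metadata database. It is an illustration under assumptions, not the project’s actual code: the paths, connection string, and the papers table schema are hypothetical placeholders.

```python
import json
from pathlib import Path

import psycopg2

# Hypothetical paths, DSN, and schema -- placeholders, not the project's real layout.
RAW_DIR = Path("/data/crawl/raw")   # millions of small JSON files from the crawlers
DSN = "dbname=corpus user=ingest host=localhost"
INSERT_SQL = (
    "INSERT INTO papers (doi, title, year, abstract) "
    "VALUES (%s, %s, %s, %s) ON CONFLICT (doi) DO NOTHING"
)

def iter_records(raw_dir: Path):
    """Yield (doi, title, year, abstract) tuples from raw crawler dumps."""
    for path in raw_dir.glob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        yield doc.get("doi"), doc.get("title"), doc.get("year"), doc.get("abstract", "")

def ingest(batch_size: int = 10_000) -> None:
    """Sanitize raw crawler output into a structured SQL table in batches."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        batch = []
        for record in iter_records(RAW_DIR):
            batch.append(record)
            if len(batch) >= batch_size:
                cur.executemany(INSERT_SQL, batch)
                conn.commit()
                batch.clear()
        if batch:  # flush the final partial batch
            cur.executemany(INSERT_SQL, batch)
            conn.commit()

if __name__ == "__main__":
    ingest()
```

Once the metadata sits in PostgreSQL, prototyping on a 100k-paper subset reduces to a simple LIMIT query instead of another pass over millions of small files.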

Tier 2: The TUHH HPC (The Compute Engine)

Once an algorithm (e.g., a concept extraction pipeline) is validated on the workstation, it is promoted to the TUHH HPC cluster for execution on the full corpus.

  • Role: Batch Inference & Training
  • Stack: Singularity (Containerization), SLURM (Scheduling).
  • Function: We use Singularity containers to replicate the workstation’s Python environment on the cluster, ensuring reproducibility. The workflow follows a strict four-step cycle:
    1. Export: Batches of clean text are serialized (Parquet/Arrow) on the workstation.
    2. Transfer: Data is moved to the HPC high-speed scratch file system.
    3. Compute: SLURM jobs distribute the inference workload across multiple nodes.
    4. Aggregation: Result vectors/graphs are returned to the workstation for analysis.
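The sketch below illustrates steps 1 and 3 of this cycle under simplifying assumptions: the paths, batch size, column names, and the embed() function are hypothetical stand-ins for the actual export locations and inference code. On the cluster, run_array_task would be launched as a SLURM array job (sbatch --array=0-N), with each task picking its batch from SLURM_ARRAY_TASK_ID.

```python
import os
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

# Paths, batch size, and column names are illustrative placeholders.
EXPORT_DIR = Path("/data/export")
BATCH_SIZE = 50_000

def export_batches(rows) -> None:
    """Step 1 (workstation): serialize (doc_id, text) pairs into numbered Parquet batches."""
    EXPORT_DIR.mkdir(parents=True, exist_ok=True)
    buffer, batch_idx = [], 0
    for doc_id, text in rows:
        buffer.append({"doc_id": doc_id, "text": text})
        if len(buffer) == BATCH_SIZE:
            pq.write_table(pa.Table.from_pylist(buffer), EXPORT_DIR / f"batch_{batch_idx:05d}.parquet")
            buffer, batch_idx = [], batch_idx + 1
    if buffer:  # flush the final partial batch
        pq.write_table(pa.Table.from_pylist(buffer), EXPORT_DIR / f"batch_{batch_idx:05d}.parquet")

def embed(texts):
    """Placeholder for the real model inference; returns dummy one-dimensional vectors."""
    return [[float(len(t))] for t in texts]

def run_array_task(scratch_dir: str = "/scratch/export") -> None:
    """Step 3 (cluster): each SLURM array task processes exactly one batch."""
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # injected by SLURM for array jobs
    batch_path = Path(scratch_dir) / f"batch_{task_id:05d}.parquet"
    table = pq.read_table(batch_path)
    vectors = embed(table.column("text").to_pylist())
    out = pa.Table.from_pydict({"doc_id": table.column("doc_id"), "vector": pa.array(vectors)})
    # Result vectors are written alongside the inputs and copied back to the workstation (step 4).
    pq.write_table(out, Path(scratch_dir) / f"vectors_{task_id:05d}.parquet")
```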

The Reproducibility Challenge

Operating across two distinct environments introduces the risk of “dependency drift.” To mitigate this, I have adopted a strict Container-First development lifecycle. Code is never installed directly on the host OS. All libraries—from standard numpy to custom NLP tools—are packaged in Docker images (converted to Singularity for HPC), ensuring that the code running on the local node is bit-for-bit identical to the code running on the cluster.
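Containers remove most of this risk, but a cheap additional guard is to print an environment fingerprint at the start of every run on both tiers; if the workstation and cluster logs disagree, an image is stale. The sketch below is an illustrative version of such a check (the tracked package list is made up), not part of the project’s actual tooling.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

# Illustrative set of packages to check; a real project list would be longer.
TRACKED_PACKAGES = ["numpy", "pyarrow", "psycopg2-binary", "torch"]

def environment_fingerprint() -> dict:
    """Collect interpreter and package versions for comparison across environments."""
    fingerprint = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in TRACKED_PACKAGES:
        try:
            fingerprint[name] = version(name)
        except PackageNotFoundError:
            fingerprint[name] = "MISSING"
    return fingerprint

if __name__ == "__main__":
    # Printed at job start on both tiers; diffing the two outputs reveals dependency drift.
    for key, value in sorted(environment_fingerprint().items()):
        print(f"{key}: {value}", file=sys.stderr)
```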

Outlook

With this infrastructure in place, the physical limitations of memory and storage are largely resolved. We can now store the full corpus and scale compute as needed.

However, raw compute power cannot solve algorithmic inefficiency. As I begin testing the entity extraction modules on larger subsets, I am observing that standard Python dictionary lookups and regex operations are becoming the new bottleneck (O(n) complexity issues). The next phase of research will need to focus on algorithmic optimization to process text at this scale.
