
Back to Academia: A PhD at the Intersection of CS and Science of Science

#general #phd #llms #nlp #concept-extraction

The transition from industry to doctoral research is driven by a shift in objective: moving from immediate product application to fundamental inquiry. After spending time in both the corporate structure at Porsche and the fast-paced start-up ecosystem at Cargodaces, I have identified a critical gap in how we structure and understand scientific knowledge.

I have joined the Hamburg University of Technology (TUHH) to pursue a PhD at the intersection of Computer Science and the Science of Science (SciSci). My research focuses on applying advanced Natural Language Processing (NLP) and Large Language Models (LLMs) to map the diffusion of concepts across the global scientific landscape.

The Research Domain: Science of Science

Science of Science is an interdisciplinary field that uses quantitative methods to understand the evolution of scientific research. Historically, this field relied on bibliometric metadata—primarily citation counts and co-authorship networks—to evaluate impact and trends.

However, metadata is a low-fidelity signal. It captures the connections between documents but ignores the content within them. With the exponential growth of scientific literature (now estimated at over 100 million scholarly documents), we face a data scale that requires automated, semantic analysis.

The Technical Challenge: Concept Extraction at Scale

My research agenda centers on Concept Extraction and Taxonomy Generation. The core hypothesis is that by extracting granular concepts (methods, tasks, materials) from unstructured text, we can construct a dynamic map of innovation that static taxonomies cannot capture.
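To give a flavour of what that extraction step might look like, here is a minimal sketch that prompts an LLM to pull typed concepts out of a single abstract. It assumes an OpenAI-compatible endpoint; the model name, prompt, and JSON schema are placeholders for illustration, not the project's actual pipeline.

```python
# Minimal sketch: prompt-based concept extraction from one abstract.
# Assumes an OpenAI-compatible API; model name and schema are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the scientific concepts from the abstract below. "
    "Return a JSON object with the keys 'methods', 'tasks' and 'materials', "
    "each mapping to a list of short noun phrases.\n\nAbstract:\n{abstract}"
)

def extract_concepts(abstract: str) -> dict:
    """Return {'methods': [...], 'tasks': [...], 'materials': [...]} for one abstract."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
        response_format={"type": "json_object"},  # request parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    example = ("We fine-tune a transformer-based language model for named entity "
               "recognition on materials science abstracts.")
    print(extract_concepts(example))
```

In practice the same pattern has to run over millions of abstracts, which is where the infrastructure below comes in.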

To achieve this, I will be leveraging a stack focused on:

  1. Large Language Models (LLMs): Using transformer-based architectures to move beyond keyword matching into semantic representation.
  2. High-Performance Computing: Processing terabytes of unstructured PDF and HTML data requires robust infrastructure. I will be using the TUHH HPC cluster alongside local high-memory workstations to handle the ingestion and inference pipelines.
  3. Graph Theory: Modeling the extracted concepts as nodes in large-scale knowledge graphs to track their diffusion across disciplines over time (a minimal sketch follows this list).
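To make the third point concrete, here is a small sketch using networkx. The records are invented for illustration; in the actual pipeline they would come from the extraction step above, with disciplines and years taken from publication metadata.

```python
# Minimal sketch: concepts and disciplines as nodes, with one edge per
# observed (concept, discipline, year) to trace diffusion over time.
# The records below are illustrative, not real data.
import networkx as nx

records = [
    ("transformer", "computer science", 2017),
    ("transformer", "biomedicine", 2020),
    ("transformer", "materials science", 2022),
    ("graph neural network", "computer science", 2018),
    ("graph neural network", "chemistry", 2021),
]

G = nx.MultiDiGraph()
for concept, field, year in records:
    G.add_node(concept, kind="concept")
    G.add_node(field, kind="discipline")
    G.add_edge(concept, field, year=year)  # one diffusion event

# Summarise diffusion: when did each concept first appear, and how far has it spread?
for node, attrs in G.nodes(data=True):
    if attrs["kind"] != "concept":
        continue
    years = [data["year"] for _, _, data in G.out_edges(node, data=True)]
    fields = {target for _, target in G.out_edges(node)}
    print(f"{node}: first observed {min(years)}, present in {len(fields)} disciplines")
```

Timestamping the edges rather than the nodes keeps the graph append-only, which makes it straightforward to ask when a concept first crossed a disciplinary boundary.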

Outlook

The objective of the next three years is to build the tooling necessary to quantify how ideas migrate—specifically, how Artificial Intelligence methodologies diffuse into applied sciences.

This blog will document that process, covering the engineering of data pipelines, the development of custom NLP libraries, and the methodological findings of the research.
