By Sukalp Muzumdar
(Disclaimer – The views and opinions expressed in this article are solely those of the author and do not necessarily reflect the position of the author’s employer.)
Over two millennia ago, the philosopher Plato introduced a powerful thought experiment: the Allegory of the Cave. He imagined prisoners chained in a cave, facing a blank wall their entire lives. Behind them, a fire burns, and puppeteers walk back and forth, casting shadows of various objects onto the wall. For the prisoners, these flickering shadows are not just representations; they are the only reality they have ever known, a complete and unquestioned universe of two-dimensional forms.
The allegory illustrates that what we perceive is often just a simplified projection of a far more complex system, and that achieving a deeper understanding requires a fundamental shift in perspective.
This same challenge—moving from a simplified view to a comprehensive one—is central to modern science. Today, powerful methods from Artificial Intelligence, originally developed to understand the complexities of human language, are offering a blueprint for solving one of the most significant data challenges in biology. This is a story of how progress in one field can provide the architectural inspiration needed to revolutionize another.
Representation matters
The problem of representation is central to Natural Language Processing (NLP). An AI model might learn to associate the word “cat” with an image, but how does it internally represent the concept of a cat? Historically, different AI models developed their own unique internal “languages”—high-dimensional vector embeddings—creating isolated, incompatible views of meaning. This incompatibility is a major practical hurdle, preventing the seamless transfer of knowledge between models and making it difficult to compare their conceptual understanding of the world. Each model was, in effect, its own prisoner, seeing a unique set of shadows.
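To make the incompatibility concrete, here is a toy numpy sketch (with made-up vectors, not any real model’s embeddings). Two “models” view the same underlying “meaning” vectors through different coordinate systems: similarities agree within each space, but raw vectors are meaningless across spaces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared latent "meaning" vectors for three concepts (purely illustrative).
concepts = ["cat", "dog", "car"]
latent = rng.normal(size=(3, 8))

# Each model views the latent space through its own arbitrary rotation:
# its private "language" for the same underlying geometry.
rot_a = np.linalg.qr(rng.normal(size=(8, 8)))[0]
rot_b = np.linalg.qr(rng.normal(size=(8, 8)))[0]
emb_a = latent @ rot_a
emb_b = latent @ rot_b

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Relative similarities agree within each model (rotations preserve geometry)...
print(cos(emb_a[0], emb_a[1]), cos(emb_b[0], emb_b[1]))  # identical values
# ...but the same concept's vectors look unrelated across models.
print(cos(emb_a[0], emb_b[0]))  # essentially arbitrary
```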
The breakthrough came from pursuing a universal geometry of meaning, a concept formalized in ideas like the “Platonic Representation Hypothesis”. This hypothesis suggests that as different models are trained on the vastness of human language, they independently converge on similar hidden mathematical structures for representing concepts. The theory was powerfully demonstrated by vec2vec, an unsupervised method that can translate between the embedding spaces of different models. What makes it especially remarkable is that it learns this translation without any “paired” sentences: it needs no translation handbook, yet intuits the common structure. The success of this approach shows that we can build a translation engine for AI models, allowing them to communicate by mapping their internal “shadows” to a shared, underlying “Form” of meaning (to harken back to Plato’s allegory). This is the blueprint: a system that can find and translate between universal representations.
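For intuition, here is a minimal PyTorch sketch of the core unpaired-translation recipe (an illustrative toy; the actual vec2vec architecture and losses are more elaborate): train translators between two embedding spaces from unaligned batches alone, using an adversarial loss so translations look native to the target space, plus a cycle-consistency loss so round trips return to where they started.

```python
import torch
import torch.nn as nn

dim_a, dim_b = 64, 96  # illustrative embedding dimensions

f_ab = nn.Linear(dim_a, dim_b)                 # translator: space A -> space B
f_ba = nn.Linear(dim_b, dim_a)                 # translator: space B -> space A
disc_b = nn.Sequential(                        # discriminator on space B
    nn.Linear(dim_b, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam([*f_ab.parameters(), *f_ba.parameters()], lr=1e-3)
opt_d = torch.optim.Adam(disc_b.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Unpaired batches: embeddings of *different* texts from each model
# (random stand-ins here; no aligned pairs anywhere).
emb_a = torch.randn(256, dim_a)
emb_b = torch.randn(256, dim_b)

for step in range(200):
    # Discriminator: tell real B embeddings from translated A embeddings.
    fake_b = f_ab(emb_a).detach()
    d_loss = bce(disc_b(emb_b), torch.ones(256, 1)) + \
             bce(disc_b(fake_b), torch.zeros(256, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Translators: fool the discriminator + cycle consistency (A->B->A ~ A).
    fake_b = f_ab(emb_a)
    adv = bce(disc_b(fake_b), torch.ones(256, 1))
    cycle = ((f_ba(fake_b) - emb_a) ** 2).mean() + \
            ((f_ab(f_ba(emb_b)) - emb_b) ** 2).mean()
    g_loss = adv + 10.0 * cycle               # loss weight chosen arbitrarily here
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Note how both losses are computable from unaligned batches alone, which is exactly why no translation handbook is needed.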
Extending the lessons to biology
This exact representational challenge is mirrored in computational biology, particularly in the effort to understand the immense complexity contained within a living cell. Our view of cellular activity is fragmented across multiple “omic” layers, each providing a different kind of “shadow”. Genomics reveals the cell’s static DNA blueprint; transcriptomics shows the active RNA recipes being copied from that blueprint; proteomics details the functional protein machinery built from those recipes; metabolomics captures the real-time chemical activity of that machinery; and imaging captures the microstructural layout and ordering of cells within organs and tissues.
Individually, each “omic” dataset is powerful, but it provides an incomplete picture. A critical disease mechanism might not be visible in the genome alone, but only in the subtle interplay between a gene’s expression and a protein’s modification. The current fragmentation of this data means that our understanding of health and disease is fundamentally incomplete. We are watching different shadow plays and trying to guess the single, unified story happening behind the screen.
The success of universal translation in NLP provides a direct architectural model for biology. The ultimate goal is to build a “Universal Cell Embedding” (UCE)—a standardized, foundational representation of cellular states that is independent of the technology used to measure it. This would be more than just a data standard; it would be a powerful computational “index” of cell biology, allowing researchers to query all of history’s cell data with a new experiment. Ambitious projects like scGPT, Geneformer, and the UCE from Rosen et al. are already laying the groundwork for this by building foundation models using massive biological datasets.
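To make the “index” idea concrete, here is a hypothetical sketch in which `embed_cells` stands in for a foundation-model encoder (the real scGPT, Geneformer, and UCE interfaces differ): once every measured cell lives in one shared space, querying history’s data with a new experiment reduces to nearest-neighbor search.

```python
import numpy as np

def embed_cells(expression: np.ndarray) -> np.ndarray:
    """Placeholder encoder: a fixed random projection to a shared space.
    A real system would use a trained foundation model here."""
    rng = np.random.default_rng(0)                     # fixed seed = fixed map
    proj = rng.normal(size=(expression.shape[1], 128))
    z = expression @ proj
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Stand-in "atlas": previously indexed cells (cells x genes).
atlas_expr = np.random.rand(10_000, 2_000)
atlas_emb = embed_cells(atlas_expr)

# A new experiment's cells are embedded into the same space...
query_emb = embed_cells(np.random.rand(5, 2_000))

# ...and matched against everything indexed so far by cosine similarity.
scores = query_emb @ atlas_emb.T                       # rows are unit-norm
nearest = np.argsort(-scores, axis=1)[:, :10]          # top-10 atlas cells per query
print(nearest.shape)                                   # (5, 10)
```

At atlas scale the brute-force dot product would be swapped for an approximate nearest-neighbor index, but the query pattern stays the same.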
The NLP blueprint shows how to build the crucial translational layer on top of this foundation. By establishing a shared information space, we can enable powerful cross-modality conversions that unlock important biological insights. This is not merely theoretical; early versions of such translators are demonstrating remarkable success:
- From Histology to Transcriptomics: Standard, inexpensive histology slides (H&E stains) are a cornerstone of clinical pathology. AI models can now analyze the rich morphological patterns in these images to directly predict spatial gene expression (e.g. ResSAT, SCHAF). This allows researchers to unlock molecular data from vast clinical archives of histology slides, effectively creating rich spatial maps from routine samples without the need for costly and specialized sequencing technologies.
- Bridging Single-Cell and Spatial Data: The biological world is spatial. Single-cell RNA-sequencing provides an incredibly deep profile of individual cell types but discards their location. Spatial transcriptomics preserves location but often at a lower (non-single-cell) resolution. Pioneering methods like Tangram and Cell2location now use AI to computationally fuse these two data types, accurately mapping individual cell profiles back onto their original tissue architecture (a sketch of the underlying optimization follows this list). This allows us to answer previously intractable questions, such as pinpointing the precise signaling vocabulary used by a rare immune cell when it is in direct physical contact with a tumor cell.
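Here is a minimal sketch of the optimization idea behind such fusion, loosely in the spirit of Tangram (illustrative only, not the published implementation): learn a soft assignment of single cells to spatial spots so that the projected single-cell expression matches the measured spatial expression, gene by gene.

```python
import torch
import torch.nn.functional as F

n_cells, n_spots, n_genes = 500, 100, 200
sc_expr = torch.rand(n_cells, n_genes)        # single-cell profiles (deep, no location)
sp_expr = torch.rand(n_spots, n_genes)        # spatial measurements (located, coarse)

logits = torch.zeros(n_cells, n_spots, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(300):
    m = torch.softmax(logits, dim=1)          # each cell distributes over spots
    projected = m.T @ sc_expr                 # predicted expression per spot
    # Maximize per-gene cosine similarity between prediction and measurement.
    loss = -F.cosine_similarity(projected, sp_expr, dim=0).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each cell's most likely location in the tissue:
placement = torch.softmax(logits, dim=1).argmax(dim=1)
```

The learned assignment places each deeply profiled cell back into the tissue, which is what lets location-free single-cell data inherit spatial context.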
On a practical level, creating a unified space for biological data stands to transform drug discovery. It opens the door to building “virtual cells” for in silico testing, allowing researchers to predict a compound’s efficacy and toxicity much earlier. This could significantly shorten drug development timelines—a process that can take over a decade and cost billions—and reduce the high rate of clinical trial failures that currently plagues the industry. Beyond efficiency, a unified view enables new kinds of discovery. It allows us to spot novel drug targets and biomarkers that only emerge from the complex interplay across different “omic” layers, paving the way for truly personalized therapies designed for an individual’s unique biology.
Yet, Plato’s allegory also contains a warning. The journey out of the cave is disruptive, and profound knowledge brings profound responsibility. Models that can fluently interpret and translate our deepest biological information raise significant ethical questions, from data privacy and re-identification to the potential for generating predictive health data that could be used for discrimination. As we move from observing biology’s shadows to understanding its true “Forms”, the wise and ethical stewardship of that knowledge becomes our most critical task.