Pumpkinseed, The Biology Mining Company

by Jen Dionne

Sequencing the first human genome, completed in 2003, cost about $3 billion and took 13 years. Today, it costs between $100 and $600 and is often done in a single day. Genomic sequencing gave rise to a new industry that underpins nearly every corner of modern medicine and biological research.

That revolution was made possible by one thing: the first ‘reader’ of molecules. Once we could sequence DNA directly and precisely at scale, the data flowed, and a rich seam of biological innovation followed.

Proteins are a different story. And for the past decade, it's a story I've been determined to change. Because the proteins we can't read are often the ones that determine whether a disease responds to treatment, whether a cell is aging normally, and whether a biothreat is present in the environment.

Proteins are what biology actually runs on. Every function in a living cell — every enzymatic reaction, every immune response, every signal that tells a cell to divide or die — is executed by proteins. DNA is the instruction manual. Proteins are the machinery. And despite decades of proteomics research, we have only ever been able to read a small, well-lit corner of that machinery. The vast majority of proteoforms, modifications, and low-abundance signals that drive disease, aging, and biological function have remained largely invisible.

Not because the data isn't there, but because the ability to mine it hasn’t existed.

The richest seam in biology

Think about the scale of what is inaccessible to us. The human genome contains about 20,000 genes, which produce around 200,000 RNA transcripts. But the proteome (the full set of proteins those transcripts give rise to, as modified and shaped by cellular context) contains more than one million distinct proteoforms. And unlike DNA, which is largely static, the proteome is dynamic: it changes with disease state, drug treatment, age, and environment. It is the most information-rich layer of biology, and it has been the least legible.
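To make that scale concrete, here is a small arithmetic sketch in Python that uses only the approximate counts quoted above. The variable and function names are illustrative, and the proteoform figure is a lower bound ("more than one million"), so the last two ratios are lower bounds too.

```python
# Approximate counts quoted in the text; orders of magnitude, not exact values.
GENES = 20_000
TRANSCRIPTS = 200_000
PROTEOFORMS = 1_000_000  # "more than one million", so treat as a lower bound


def fold(larger: int, smaller: int) -> float:
    """Return how many times larger one count is than another."""
    return larger / smaller


transcripts_per_gene = fold(TRANSCRIPTS, GENES)
proteoforms_per_transcript = fold(PROTEOFORMS, TRANSCRIPTS)
proteoforms_per_gene = fold(PROTEOFORMS, GENES)

print(f"transcripts per gene:       ~{transcripts_per_gene:.0f}")
print(f"proteoforms per transcript: at least ~{proteoforms_per_transcript:.0f}")
print(f"proteoforms per gene:       at least ~{proteoforms_per_gene:.0f}")
```

On these figures, each layer multiplies the one before it: roughly ten transcripts per gene, and at least fifty proteoforms per gene.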

Existing approaches were built not to solve this problem, but to work around it. Mass spectrometry matches protein fragments against reference databases, and can only find what it already knows to look for. Affinity-based methods measure a predetermined panel of proteins. The result is a field that has generated increasingly detailed maps of a small fraction of the proteome, while the most important biological signals have gone unread and untranslated.
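The limitation of reference matching can be illustrated with a deliberately simplified Python sketch. It is a toy, not how any real search engine works: real tools score whole fragment spectra, while this one compares a single peptide mass against a tiny made-up reference list. The peptide sequences, the function names, and the 0.01 Da tolerance are all assumptions chosen for illustration; the residue masses are standard monoisotopic values.

```python
from typing import Optional

# Standard monoisotopic residue masses in daltons (subset, for illustration).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056  # mass of one water, added once per peptide


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


# Hypothetical reference database: the only peptides the method "knows".
REFERENCE = {seq: peptide_mass(seq) for seq in ("GASP", "VTLK", "PEAK")}


def identify(observed_mass: float, reference: dict,
             tolerance: float = 0.01) -> Optional[str]:
    """Return the first reference peptide within tolerance, else None."""
    for sequence, mass in reference.items():
        if abs(mass - observed_mass) <= tolerance:
            return sequence
    return None


known = identify(peptide_mass("PEAK"), REFERENCE)  # present in the reference
novel = identify(peptide_mass("GRAS"), REFERENCE)  # absent from the reference

print("known peptide identified as:", known)
print("novel peptide identified as:", novel)
```

The peptide that is in the reference list is recovered, while the one that is not comes back as `None`, even though it produced a perfectly good measurement. The signal was there; the lookup simply had nothing to match it against.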

This is the gap that Jack, Nhat, and I founded Pumpkinseed to close — not by building a better workaround, but by solving the underlying problem directly.

How we’re mining the ore

deSIPHR — Pumpkinseed's proprietary nanophotonic chip platform — reads proteins the way physics reads matter. Using Raman spectroscopy, it detects the molecular vibrations of individual amino acids, sequencing proteins directly, one residue at a time, without reference databases or elaborate sample preparation. Fabricated at semiconductor scale with over 100 billion sensors per wafer, deSIPHR generates more biologically useful information per cell per dollar than any existing proteomics platform — and the cost per data point drops as the platform scales. Each spectrum is a molecular sentence, a record of what a protein is, what it has been through, and what it is doing in the cell at any particular moment. deSIPHR doesn't just detect proteins; it reads the language biology is written in. Comprehensive biological mining becomes not just possible but inevitable.

This is not an incremental improvement on existing proteomics measurements. It is a shift from confirmation to discovery, from sampling to mining. It is the difference between a field that can only find what it already suspects is there, and one that can read biology directly and completely for the first time.

Building deSIPHR required bringing together disciplines that don't often share a lab: nanophotonics, biochemistry, and machine learning. That convergence is what makes deSIPHR possible, and what makes it hard to replicate. Jack Hu is the kind of scientist who won't stop until the instrument actually works — not in theory, but on the bench, with real samples. Nhat Vu built the computational layer that turns raw Raman signals into biological sequence, a problem that required inventing new approaches rather than adapting existing ones. Together, we've built something none of us could have built alone. The result is a platform that is already generating revenue, already in the hands of partners, and already finding proteins that no other platform can see.

Three markets, one platform

The opportunity that opens when the proteome becomes legible is not confined to a single application. It runs across three of the most consequential areas of science and technology today.

Biopharma. The implications are immediate. Most drugs either are proteins or act on proteins, yet drug discovery has relied on genomic proxies (DNA and RNA) to understand them. deSIPHR reads what a cell is actually doing, not what its genome suggests it might do. Our collaboration with Genentech's Discovery Proteomics group, and contracts with DARPA and BARDA, validate the platform across commercial drug discovery and defense biosecurity. The near-term opportunity in biopharma alone represents a market we believe will rival DNA sequencing.

Biosecurity. deSIPHR addresses a problem that no existing platform can solve. Because deSIPHR doesn't require a reference catalog, it reads proteins that no catalog-dependent method can detect — including non-canonical amino acids and entirely novel sequences. That capability has applications across basic research, but it is also why DARPA and BARDA are partners: in biosecurity, the threats most worth detecting are precisely the ones designed to be invisible to existing detection methods.

Bio-AI. AI has opened new vistas in biotechnology. But already, we know that algorithms are not what’s holding back the promise of AI. It's the data: high-resolution measurements that capture not just gene expression, but the language cells use to execute biology (what proteins are actually doing, in real time, under perturbation). The next generation of virtual cell models and AI biology platforms will be trained on single-cell multi-omic data. We are building the high-throughput platform that generates it.

Why now

The convergence that makes deSIPHR possible — nanophotonic fabrication at semiconductor scale, advances in Raman signal enhancement, and the machine learning infrastructure to interpret raw spectral data at scale — did not exist five years ago. It exists now. The biological AI platforms hungry for proteomic training data did not exist five years ago. They exist now. The drug discovery programs pushing against the limits of what genomic proxies can tell them have been building for a decade. The demand for this data has been building for decades. The platform to generate it has finally arrived.

With active contracts across pharma and defense, Pumpkinseed is scaling its platform from peptide sequencing today to full-length proteins, spatial proteomics, and a comprehensive biological dataset that grows in value with every experiment we run.

Biology has always held the answers. Pumpkinseed is building the miner.
