DNA to Phenotype
Our technology predicts phenotypes from a human genome. This site currently only supports whole genome sequencing data in VCF format.
What is a phenotype?
Phenotypes are the expressed traits, characteristics, and diseases of an organism that arise from its DNA.
What is a polygenic score?
A polygenic score (also called a polygenic index or polygenic predictor) is a statistical tool that estimates the underlying continuous genetic axis of heritable predisposition to a trait conferred by common genetic polymorphisms.
Polygenic scores are the natural scientific outgrowth of the infinitesimal model of population genetics, which traces back to Francis Galton's biometric work, was formalized by Ronald Fisher in 1918, and is supported by modern large-scale GWAS. The infinitesimal model postulates that complex traits are influenced by a multitude of DNA variants, each with a small effect, that combine linearly to explain a significant fraction of individual differences within and between families.
Statistical analysis of large samples of phenotyped and genotyped individuals lets us estimate these genetic effects. Quantitative prediction from DNA alone becomes possible by aggregating these effects into a polygenic score: the alleles an individual carries are weighted by their estimated effects on a trait and then summed to produce the individual's estimated position, relative to the population, along the genetic axis. This latent genetic axis underpinning a polygenic trait is typically approximately Gaussian (bell-curve shaped) in the population, as the central limit theorem predicts for a sum of many small effects.
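The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's pipeline: the effect weights, the reference panel, and the function names `polygenic_score` and `standardize` are all hypothetical, and genotypes are assumed to be coded as effect-allele counts (0, 1, or 2).

```python
from statistics import mean, pstdev

def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele counts (0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))

def standardize(score, reference_scores):
    """Express a raw score in standard deviations from the reference mean."""
    mu = mean(reference_scores)
    sd = pstdev(reference_scores)
    return (score - mu) / sd

# Hypothetical per-variant effect estimates (one weight per variant).
weights = [0.12, -0.05, 0.30, 0.08]

# Hypothetical reference panel: one list of allele counts per person.
reference_panel = [
    [0, 1, 0, 2],
    [1, 1, 1, 0],
    [2, 0, 0, 1],
    [0, 2, 1, 1],
    [1, 0, 2, 2],
]
reference_scores = [polygenic_score(g, weights) for g in reference_panel]

# Score one individual and place them relative to the reference panel.
individual = [2, 0, 1, 1]
raw = polygenic_score(individual, weights)
z = standardize(raw, reference_scores)
print(f"raw score = {raw:.2f}, z-score = {z:.2f}")
```

A real score sums over many thousands of variants and is standardized against a large reference population; the four-variant panel here only shows the arithmetic.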
As the cumulative size of genotyping studies grows through the 21st century, our models will converge on the linear genetic architecture of important heritable characteristics. As new biobanks come online and large genotyping studies are completed with high-quality phenotyping, we will map out the genomic structure of human variation with increasing resolution. For some traits, we will reach the long-sought goal of accounting for all of the heritability. Capturing all the additive heritable information in an individual genome means we can do, from DNA alone, the equivalent of meeting that genome's identical twin. This has nearly occurred already for height, for which we've captured about 85% of the total heritability.
For height and some biomarkers, the models are close to saturation: additional training data yields only modest improvements in predictive performance. For nearly all other complex traits and diseases, however (e.g., type 2 diabetes, prostate cancer, and breast cancer), our predictive ability is still heavily data-limited. The predictor continues to improve as the number of cases in the training data increases, and this improvement shows no sign of plateauing even at the current sample size. Scaling laws provide more precise estimates, but as a rule of thumb you can expect our genetic effect estimates to converge once the training data has about 100,000 cases for a complex disease and about 1,000,000 well-phenotyped genomes for a normally distributed quantitative trait. For almost all important phenotypes, we have not yet reached the limit of our ability to predict from DNA.
What is pleiotropy?
Pleiotropy occurs when a DNA polymorphism influences multiple traits. A key finding of recent advances in population genetics is that pleiotropy between unrelated traits and diseases is typically small. Between related traits and diseases, e.g., diseases that frequently occur together, the field has found overwhelmingly synergistic pleiotropy. This means that when a change to the DNA increases the risk of one disease, it tends to also increase the risk of a related disease.
In a future version of this site, you will be able to see the pleiotropy between each pair of predictors. We expect the multitude of traits to simplify into roughly a dozen sparse genetic constructs, e.g., a shared factor of cardiovascular disease liability. That is, the polygenic scores are jointly distributed not as spheres but as ellipsoids, because the variants underlying each trait are shared with the variants underlying related traits.
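One simple way to picture the "ellipsoid rather than sphere" idea is the correlation between two polygenic scores across individuals: a value near zero corresponds to a round cloud of points, and a strongly positive value to an elongated one. The sketch below assumes Pearson correlation as the summary measure, which is an illustrative choice and not necessarily how this site will report pleiotropy; the scores and the function name `pearson` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical standardized polygenic scores for six people.
pgs_disease_a = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
pgs_disease_b = [-0.9, -0.6, 0.2, 0.1, 1.1, 1.0]  # a related disease
pgs_unrelated = [0.4, -1.0, 1.3, -0.7, 0.2, -0.1]  # an unrelated trait

r_related = pearson(pgs_disease_a, pgs_disease_b)
r_unrelated = pearson(pgs_disease_a, pgs_unrelated)
print(f"related pair: r = {r_related:.2f}, unrelated pair: r = {r_unrelated:.2f}")
```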
How do the genetics of my annual physical measurements affect my health?
The creator of this site also trained predictors of 10-year disease risk from age, sex, and routine annual physical measurements for a wide range of common late-in-life diseases. The 26 physical measurements included BMI, blood pressure, and standard biomarkers, e.g., glucose, cholesterol, liver and kidney markers, and blood counts. It turns out that all of those measurements can be predicted to some degree from DNA. An obvious question is: can your annual physical results be predicted from your DNA and plugged into the biometric-based disease predictors?
Surprisingly, this method of stitching the output of the genes-to-biomarkers predictors to the input of the biomarkers-to-disease predictor works quite well across a range of diseases. For some diseases, e.g., non-alcoholic fatty liver disease and chronic kidney disease, it predicts disease more accurately than the regular genes-to-disease polygenic score. When limited training data is available, it is often easier for linear models to learn the relationship between genetics and biometrics and, separately, the relationship between biometrics and disease than to regress disease status directly on genetics, skipping the inner layer of additional biological information. For most diseases, performance improves slightly when the biometric polygenic scores are included.
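The stitched architecture described above can be sketched as two small functions chained together. Everything numeric here is hypothetical, as are the function and dictionary names; the logistic form of the second stage is also an assumption made for illustration, since the source only says the models are linear. Biomarkers are assumed to be expressed as standardized deviations from the population average.

```python
import math

def predict_biomarkers(pgs_z, pgs_to_biomarker):
    """Stage 1: predict each standardized biomarker from its polygenic z-score."""
    return {name: slope * pgs_z[name] for name, slope in pgs_to_biomarker.items()}

def ten_year_risk(age, is_male, biomarkers, model):
    """Stage 2: 10-year disease risk from age, sex, and biomarkers (logistic)."""
    logit = model["intercept"] + model["age"] * age + model["male"] * is_male
    for name, value in biomarkers.items():
        logit += model["biomarkers"].get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical stage-1 slopes: biomarker SDs per SD of polygenic score.
pgs_to_biomarker = {"bmi": 0.35, "egfr": 0.30}

# Hypothetical stage-2 coefficients for one disease.
disease_model = {
    "intercept": -6.0,
    "age": 0.06,
    "male": 0.3,
    "biomarkers": {"bmi": 0.4, "egfr": -0.5},
}

# Stitch the stages: DNA-predicted biomarkers feed the disease predictor.
average_person = predict_biomarkers({"bmi": 0.0, "egfr": 0.0}, pgs_to_biomarker)
this_person = predict_biomarkers({"bmi": 1.0, "egfr": -1.0}, pgs_to_biomarker)

risk_average = ten_year_risk(70, 1, average_person, disease_model)
risk_person = ten_year_risk(70, 1, this_person, disease_model)
print(f"average: {risk_average:.3f}, this person: {risk_person:.3f}")
```

Because each stage is fit separately, the first stage can be trained on everyone with genotypes and biomarker measurements, and the second on everyone with biomarkers and disease outcomes, which is one plausible reason the approach helps when disease cases are scarce.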
Using this concatenated model architecture, we can evaluate how the polygenic score for an annual physical measurement modifies overall disease risk. We take a polygenic score, e.g., for eGFR or mean arterial pressure, and estimate the change in absolute 10-year risk of a given disease, e.g., coronary artery disease or hypothyroidism, at age 70 that corresponds to a one-standard-deviation increase in the polygenic score. We then do this for each disease and sum the effects on absolute risk. The net change in absolute risk tells us what it means to have a higher polygenic score for one of these routine physical measurements. It measures the pleiotropy of a biometric polygenic score with respect to general common disease risk. For example, we find that men with a polygenic score for BMI one standard deviation above the mean have an overall absolute risk of late-in-life disease, summed across diseases, that is 10 percentage points higher than that of men with an average score.
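The summation over diseases can be sketched as follows. The baseline log-odds and per-standard-deviation effects are hypothetical numbers chosen for illustration, and the logistic risk form is an assumption; only the procedure (shift the score by one standard deviation, take the difference in absolute risk per disease, then sum) follows the description above.

```python
import math

def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical per-disease parameters at age 70: baseline log-odds of disease
# within 10 years, and change in log-odds per +1 SD of the polygenic score.
diseases = {
    "coronary artery disease": {"baseline_logit": -2.0, "logit_per_sd": 0.10},
    "type 2 diabetes": {"baseline_logit": -2.5, "logit_per_sd": 0.20},
    "hypothyroidism": {"baseline_logit": -3.0, "logit_per_sd": 0.00},
}

def risk_change_per_sd(params):
    """Absolute risk at +1 SD minus absolute risk at the average score."""
    baseline = sigmoid(params["baseline_logit"])
    shifted = sigmoid(params["baseline_logit"] + params["logit_per_sd"])
    return shifted - baseline

per_disease = {name: risk_change_per_sd(p) for name, p in diseases.items()}
net_change = sum(per_disease.values())

for name, delta in per_disease.items():
    print(f"{name}: {100 * delta:+.2f} percentage points")
print(f"net change: {100 * net_change:+.2f} percentage points")
```

A trade-off would show up as per-disease changes of opposite sign that partly cancel in the sum; synergistic pleiotropy shows up as changes that share a sign.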

In general, we do not see trade-offs in genetic predisposition to routine physical measurements. That is, we do not see situations where a high biomarker polygenic score increases risk for one disease while decreasing risk for another. Instead, the data suggest a pattern of synergistic pleiotropy, in which genetic variants controlling a biomarker that are associated with increased risk for one disease often contribute to increased risk for other related diseases as well. This points to the existence of a genetic general factor underlying overall health: a shared genetic architecture that influences disease susceptibility across multiple conditions.
Interestingly, the polygenic scores for pulse rate and for triglyceride levels contribute very little to the prediction of general disease risk. The latter observation might be explained by HDL cholesterol being a stronger predictor of disease risk than triglycerides. eGFR continues to act as a kind of biological clock for general disease risk. Kidney filtration rate declines strongly with age, and having a lower eGFR than average for your age cohort tends to come with a higher-than-average risk of disease for that cohort.

You can download the estimated effects on absolute risk at age 70 of a one-standard-deviation increase in the polygenic score for each biometric.
Is the genetic architecture really additive?
For those new to the field, it might appear that the additive model is merely a compromise forced by limited sample sizes. However, this assumption is incorrect. Despite strong efforts to find nonlinear genetic effects, researchers have largely failed to do so. For instance, Kelemen et al. (2025) demonstrated through simulations that if the heritability of complex traits were primarily driven by epistasis (interactions between genetic loci), such effects should be detectable with neural networks. However, when they examined a wide range of UK Biobank phenotypes, they did not detect nonlinear genetic effects. The figure below, taken from their study, illustrates that across numerous traits and diseases in the UK Biobank, neural network models do not surpass the performance of the basic additive model.
It is a deep and important discovery of the 21st century that the genetic effects that make one individual different from another are approximately additive. This is not to say that gene × gene interactions do not occur in basic biological processes; we know that such interactions are important and can be disrupted in very rare Mendelian disorders. Rather, it is to say that common heritable differences between individuals within a population appear to be driven by genes acting additively within and between loci.
In light of evolution, this discovery is not so surprising. In 1930, Ronald Fisher published a differential equation that he named the fundamental theorem of natural selection, detailed in a contemporary review by Grafen (2020). It states that the rate at which a population can adapt to its environment is governed by the amount of additive genetic variance in the population. That is, the greater the differences between members of a species controlled by additive genetic effects, the faster that species can evolve in response to selection pressure. By comparison, natural selection acts much more slowly on nonlinear adaptations: locus-locus effects are harder to transmit to offspring because recombination during meiosis is likely to separate the interacting alleles.
The image below shows a pictogram of Fisher's fundamental theorem: the rate of change of average fitness over time (dF/dt) is approximately equal to the additive genetic variance (σ²_A), i.e., the amount of population variance in fitness controlled by additive genetic effects. When σ²_A is large (top panel), the rate of change of fitness over time is fast, and when σ²_A is small (bottom panel), the rate is slow.
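A toy calculation makes the relationship concrete. The sketch below assumes the simplest possible setting, which is an illustrative choice and not Fisher's general statement: one locus, two alleles, haploid individuals, and discrete generations. In that setting all genetic variance in fitness is additive, and the gain in mean fitness over one generation equals the variance in fitness divided by the mean fitness. The fitness values and function names are hypothetical.

```python
def mean_fitness(p, w1, w2):
    """Population mean fitness when allele 1 has frequency p."""
    return p * w1 + (1 - p) * w2

def fitness_variance(p, w1, w2):
    """Variance in fitness; all of it is additive in this haploid model."""
    return p * (1 - p) * (w1 - w2) ** 2

def next_frequency(p, w1, w2):
    """Frequency of allele 1 after one generation of selection."""
    return p * w1 / mean_fitness(p, w1, w2)

def fitness_gain(p, w1, w2):
    """Change in mean fitness over one generation of selection."""
    p_next = next_frequency(p, w1, w2)
    return mean_fitness(p_next, w1, w2) - mean_fitness(p, w1, w2)

w1, w2 = 1.1, 1.0  # hypothetical fitnesses of the two alleles

# High variance (intermediate allele frequency) vs low variance (rare allele).
for p in (0.5, 0.05):
    gain = fitness_gain(p, w1, w2)
    predicted = fitness_variance(p, w1, w2) / mean_fitness(p, w1, w2)
    print(f"p = {p}: gain = {gain:.6f}, variance / mean fitness = {predicted:.6f}")
```

The two printed columns agree for any starting frequency, and the gain is larger when the allele is at intermediate frequency, where the variance in fitness is greatest.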

Fisher's fundamental theorem provides an evolutionary explanation for why the genetic architecture of population traits we see today is dominated by additive genetic effects. Genetic architectures that failed to adapt swiftly to environmental changes were outpaced by those that could. By the logic of natural selection, the architectures prevalent today should be those that facilitate rapid evolution, namely additive genetic architectures. Another perspective on Fisher's fundamental theorem is provided in the diagram below from Shafee (2014). It is easier to adapt modular molecular software with sparse additive features than brittle genetic code with highly interdependent components. The greater the influence of epistasis on trait variation, the slower and more challenging it is to evolve the trait.

Fisher showed us that the fastest way to shift a population trait is to change the frequencies of the alleles that have additive effects. This suggests that any population trait which underwent rapid change in recent evolutionary history is likely to be underpinned by an additive genetic architecture. We can tentatively conclude that the sparse and linear genetic architecture of complex traits that we are discovering today is a product of the evolution of our ancestral lineage. We now have the chance to map out that architecture in high resolution.