Predicting Gene Expression From Sequence

Determining how noncoding DNA influences gene expression remains an open problem. For a few years, the state of the art in predicting gene expression from DNA sequence was Enformer [1], a 2021 model from Google DeepMind. However, recent DNA foundation models such as Nucleotide Transformer [2], HyenaDNA [3], Caduceus [4], and Lyra [5], as well as models that incorporate additional epigenomic signals, such as EPInformer [6] and Seq2Exp [7], are challenging Enformer’s reign.

Early work on this problem primarily applied CNNs. However, because convolutions are local, these models could typically capture regulatory interactions only up to 20 kb (20,000 bases) away from the transcription start site. The Enformer creators were the first to apply transformers to this problem, crafting a model that could capture the effects of regulatory elements such as enhancers, repressors, and insulators, which can influence the expression of highly distal genes. They estimated that Enformer’s larger receptive field allows it to capture 84% of relevant enhancer elements, up from the 47% typically captured by CNN models.

But before getting into all the modeling details, let’s cover some of the biological background for context.

Biology Background

Enhancers

We start with some of the regulatory elements that commonly appear in DNA. An enhancer is a 50-1500 bp region of DNA that increases the likelihood that a particular gene is transcribed. According to Wikipedia, enhancers can be located up to 1 million base pairs away from their target gene, which shows why CNN models that capture regulatory interactions only up to 20 kb away are insufficient for this problem. While enhancers may be distant from their regulatory targets in sequence, they are often close to them in space. This is because DNA is folded together with proteins into chromatin within eukaryotic cells, and this folding can bring enhancer regions into spatial proximity with their targets. Enhancers can also be involved in disease, e.g. in myelosuppression. Enhancers work in coordination with other regulatory elements such as promoters, silencers, and insulators to modulate gene expression.

Promoters

A promoter is the region of DNA to which proteins, such as RNA polymerase and other transcription factors, bind to initiate the transcription of a gene. Promoters are typically 100-1000 bp in length. Transcription factors bind specific activator or repressor sequences in particular promoters. Core and proximal promoters are located at, or close to, the transcription start site (TSS). Some biologists use the term distal promoters to refer to enhancers and silencers, but we will use their specific names and not refer to them as promoters. Promoters often contain motifs such as the TATA box, which is usually located within 30-40 bp of the TSS. Core promoters often contain the TSS, a binding site for RNA polymerase (I, II, or III), transcription factor (TF) binding sites, and other possible motifs. Proximal promoters occur further from the TSS (~250 bp upstream) and contain TF binding sites and other regulatory elements. Finally, the regions that some call distal promoters often contain further regulatory elements that have a less significant influence on transcription and can be located multiple kilobases upstream of the TSS.

Other Sources of Epigenomic cis-regulation

Other signals that regulate and influence gene expression include transcription factor binding, chromatin contacts and accessibility, DNA methylation, and histone modifications. DNA is stored in the nucleus of the cell in a complex called chromatin. Chromatin is simply DNA wrapped around protein complexes called histones. Chromatin can exist in one of two forms: euchromatin is less condensed and often actively transcribed whereas heterochromatin is highly condensed and usually not transcribed. Changes in the structure and folding of chromatin influence the accessibility of DNA and, ultimately, transcription and gene expression.

Sourcing Data

Epigenomic Data

Much of the existing epigenomic data is sourced from one of three databases: ENCODE, FANTOM, and GENCODE. This data comes in a number of types, most prominently H3K27ac, DNase, ChIP-seq, and Hi-C. The data predicted by models, in contrast, is gene expression data, most commonly CAGE-seq or RNA-seq.

H3K27ac

H3K27ac is a marker of histone acetylation, specifically the acetylation of the lysine residue at position 27 in the N-terminal tail of the histone H3 protein. Acetylation of this sort can modify protein activity and alter control of cellular signaling. H3K27ac is an active enhancer mark that can be found in both distal and proximal regions of the target genes.

ChIP-seq

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a method for mapping genome-wide protein-DNA interactions. It allows researchers to pinpoint where transcription factors, histone modifications, or other chromatin-associated proteins bind along the genome in a given biological context.

ChIP-seq addresses questions about where specific proteins interact with DNA, which is critical to understanding gene regulation, epigenetics, and chromatin biology. Researchers can use ChIP-seq to understand where transcription factors bind, which genomic regions have active or repressive histone marks, what distinguishes cell-type specific enhancers or promoters, how chromatin state changes in disease or development, and what epigenomic features correlate with gene expression patterns.

The following terms are often used when assessing ChIP-seq data.

| Feature | Description |
| --- | --- |
| Peaks | Regions of enriched read density that suggest binding sites or modified chromatin |
| Motif analysis | Discover sequence motifs enriched within TF binding peaks |
| Peak annotation | Associate peaks with gene features (e.g., promoters, enhancers, introns) |
| Differential binding | Identify condition- or cell-type-specific binding changes |
| Signal tracks | Continuous signal for visualization in genome browsers (e.g., IGV, UCSC) |
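
To make the notion of a peak concrete, here is a minimal sketch that calls peaks as contiguous runs of positions whose read coverage meets a threshold. The function name `call_peaks`, the toy coverage values, and the hard cutoff are all illustrative assumptions; real peak callers compare the signal against a background or control model rather than a fixed threshold.

```python
def call_peaks(coverage, threshold, min_width=1):
    """Return half-open (start, end) intervals where coverage >= threshold.

    Hypothetical helper for illustration only; production peak callers
    model background signal statistically instead of using a hard cutoff.
    """
    peaks = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= threshold and start is None:
            start = pos                      # a run of enriched positions begins
        elif depth < threshold and start is not None:
            if pos - start >= min_width:
                peaks.append((start, pos))   # the run ended just before pos
            start = None
    # Close a run that extends to the end of the region.
    if start is not None and len(coverage) - start >= min_width:
        peaks.append((start, len(coverage)))
    return peaks


# Toy per-base read depth over a 12 bp region (made-up numbers).
toy_coverage = [1, 2, 9, 11, 10, 2, 1, 0, 8, 9, 1, 1]
toy_peaks = call_peaks(toy_coverage, threshold=8, min_width=2)
print(toy_peaks)
```

Each returned interval could then be annotated against gene features or scanned for motifs, as in the table above.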

DNase

DNases are a group of enzymes that degrade DNA. DNase-seq is a method to identify open (accessible) regions of chromatin by sequencing fragments of DNA that are hypersensitive to cleavage by DNase I, an enzyme that preferentially cuts DNA in regions not protected by nucleosomes or protein complexes. These DNase I hypersensitive sites (DHSs) are often promoters, enhancers, insulators, locus control regions, or other regulatory elements.

DNase-seq is used because chromatin is not uniformly accessible. In active regulatory regions, nucleosomes are displaced or absent, exposing DNA to enzymatic cleavage. DNase-seq data helps researchers to understand where active cis-regulatory elements are in the genome, which genes may be actively regulated in a given cell type, how chromatin accessibility changes across conditions, development, or disease, and where the potential transcription factor binding sites are located. The emphasis here is on the active part. While other types of sequencing data might point to where regulatory elements are located in the sequence generally, they don’t often showcase which of these regulatory elements are active in a given transcription process. That’s where DNase comes in.

| Feature | Description |
| --- | --- |
| DHS peaks | Locations with high DNase I sensitivity; inferred to be accessible |
| Footprints | Small protected regions within DHSs, suggesting transcription factor binding (TFs block DNase cutting locally) |
| Differential DHSs | Changes in chromatin accessibility between conditions, cell types, or time points |
| Motif enrichment | Overrepresented DNA motifs in DHSs hint at regulatory TFs involved |

Hi-C

Hi-C is a genome-wide chromosome conformation capture technique that tells you which parts of the genome are physically close together in 3D space, even if they’re far apart linearly. It reveals the 3D organization of chromatin in the nucleus — including loops, domains, compartments, and chromosome territories.

Hi-C provides a contact map (matrix) where each cell indicates the number of times two loci interact. From this, we can extract several biological insights:

| Feature | Description | Resolution Needed |
| --- | --- | --- |
| Chromosome territories | Individual chromosomes occupy distinct nuclear regions | Low (Mb) |
| Compartments (A/B) | Large-scale active (A) vs. inactive (B) chromatin | ~100 kb |
| TADs (Topologically Associating Domains) | Self-interacting domains where regions interact more within than outside | ~10–100 kb |
| Loops | Specific point-to-point interactions, often enhancer-promoter | High (~1–10 kb) |

Hi-C sequencing can be used by researchers to understand how the genome is physically configured in three dimensions, whether regulatory elements such as enhancers physically contact their target genes, how chromatin architecture changes across cell types, differentiation, or disease, and whether structural variants, such as deletions or inversions, are disrupting TADs or loops.
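
As a rough illustration of what a contact map is, the sketch below bins a handful of pairwise contacts into a symmetric count matrix. The function `build_contact_map`, the toy coordinates, and the 10 kb bin size are assumptions made for this example; real Hi-C pipelines also filter, deduplicate, and normalize reads, none of which is shown here.

```python
def build_contact_map(contacts, region_length, bin_size):
    """Bin (pos_a, pos_b) contact pairs into a symmetric count matrix.

    Hypothetical helper: positions are 0-based coordinates within one
    region, and the matrix is a plain list of lists.
    """
    n_bins = -(-region_length // bin_size)   # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1                # mirror so the map stays symmetric
    return matrix


# Made-up contacts in a 100 kb region, binned at 10 kb resolution.
toy_contacts = [
    (1_000, 95_000),    # two distal loci observed together...
    (2_500, 97_000),    # ...twice, as a loop might produce
    (12_000, 18_000),   # short-range contact within one bin
    (40_000, 41_000),
    (5_000, 45_000),
]
contact_map = build_contact_map(toy_contacts, region_length=100_000, bin_size=10_000)
for row in contact_map:
    print(row)
```

The bin size plays the role of the resolution column in the table above: smaller bins can resolve individual loops but need many more reads to fill the matrix.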

Common Benchmarks

The genomics field has gotten better about establishing robust benchmarks within the past few years. For example, the popular Genomic Benchmarks [8] collection contains nine datasets focusing on “regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm.” The tasks include sequence classification (e.g. whether a sequence is coding or intergenic), predicting enhancers and promoters, and classifying regulatory elements.

The Nucleotide Transformer paper also introduced the Nucleotide Transformer Benchmark. This benchmark consists of 18 tasks grouped into roughly three categories: “splice site prediction tasks (GENCODE), promoter tasks (Eukaryotic Promoter Database), and histone modification and enhancer tasks (ENCODE).” A more detailed description of some of the tasks is as follows:

  1. Epigenetic marks prediction using Histone ChIP-seq data for ten marks in the K562 human cell line.
  2. Promoter sequence prediction using human promoters from the Eukaryotic Promoter Database.
  3. Enhancer sequence prediction using human enhancer elements retrieved from ENCODE’s SCREEN database.
  4. Splice site prediction using human annotated splice sites obtained from GENCODE V44 gene annotation.
  5. Chromatin profile prediction using the DeepSEA dataset.
  6. The SpliceAI benchmark.
  7. Enhancer activity prediction using the DeepSTARR enhancer activity dataset.

Related Work

The Rise of Genomic Foundation Models

In the past couple of years, we’ve seen the deep learning field shift its focus from task-specific models such as Enformer to large, pretrained foundation models that capture general information about a domain and can then be finetuned to perform a multitude of downstream tasks. Four of the most compelling works in this vein are Nucleotide Transformer [2], HyenaDNA [3], Caduceus [4], and Lyra [5].

HyenaDNA

HyenaDNA was perhaps the first work to eschew the transformer-based approach to foundation modeling of genomic sequences. Its main block is the Hyena operator, a subquadratic layer built on long convolutions and closely related to SSMs, which lets it reach context lengths of up to 1 million base pairs.

Caduceus

Caduceus is an interesting model that combines a number of novel architectural improvements tailor-made for modeling DNA. Its basis is the long-range Mamba [9] block, the selective SSM that brought this family of models to wide attention. SSMs such as Mamba are a natural fit for DNA modeling because they can model long-range sequences with $O(N)$ complexity, where $N$ is the sequence length. This scales much better than transformer models, whose attention blocks scale as $O(N^2)$. The authors claim that Caduceus outperforms transformer models that are 10x larger (by parameter count).
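
To get a feel for how differently these costs grow, the sketch below compares quadratic, $N \log N$, and linear operation counts at a few sequence lengths. These are asymptotic proxies with constant factors ignored, not measured runtimes of any model, and `scaling_table` is a hypothetical helper written for this example.

```python
import math


def scaling_table(lengths):
    """Return idealized operation counts for each sequence length N."""
    rows = []
    for n in lengths:
        rows.append({
            "N": n,
            "quadratic": n * n,            # attention-style O(N^2)
            "n_log_n": n * math.log2(n),   # FFT-style O(N log N)
            "linear": n,                   # linear-time O(N)
        })
    return rows


for row in scaling_table([1_000, 100_000, 1_000_000]):
    ratio = row["quadratic"] / row["n_log_n"]
    print(f'N={row["N"]:>9,}  N^2 / (N log2 N) = {ratio:,.0f}x')
```

The gap widens as sequences get longer, which is why subquadratic blocks become attractive at the context lengths relevant to distal regulation.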

The Mamba blocks in Caduceus are tailored to DNA sequences. The authors introduce two sub-architectures: BiMamba and MambaDNA. BiMamba is hardware efficient and supports bi-directional sequence modeling. This is important because DNA is composed of two strands that are read in opposite directions. MambaDNA, in turn, is reverse-complement (RC) equivariant: feeding the model the reverse complement of a sequence yields correspondingly reverse-complemented outputs, so both strands are treated consistently. Caduceus also uses reverse-complement data augmentation during model training. The model is pretrained with the masked language modeling objective that is standard for language models, applied to DNA sequences.
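
For readers unfamiliar with the reverse-complement operation, here is a minimal sketch of it, together with a simple averaging trick that makes any strand-dependent score give the same answer for both strands. This is not how Caduceus builds equivariance into its layers; `rc_averaged` and the toy scorer `a_fraction` are hypothetical names used only for illustration.

```python
# Map each base to its Watson-Crick partner; N (unknown) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def a_fraction(seq):
    """Toy strand-dependent score: the fraction of bases that are A."""
    return seq.count("A") / len(seq)


def rc_averaged(score_fn, seq):
    """Average a score over a sequence and its reverse complement.

    The result is identical whichever strand is passed in, which is the
    invariance that RC augmentation encourages a model to learn.
    """
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


forward = "AAAC"
reverse = reverse_complement(forward)
print(forward, reverse)
print(a_fraction(forward), a_fraction(reverse))    # differ between strands
print(rc_averaged(a_fraction, forward), rc_averaged(a_fraction, reverse))
```

Reverse-complement augmentation during training amounts to randomly replacing an input with `reverse_complement(seq)` (and adjusting strand-specific labels accordingly), whereas an equivariant architecture guarantees the consistency by construction.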

Specialized Epigenomic Models

In addition to foundation models, another crop of methods is excelling by incorporating multimodal epigenomic information into predictions of gene expression.

EPInformer

EPInformer integrates specialized knowledge about epigenomics into its modeling choices. The model can combine genomic sequences, epigenomic signals, and chromatin contacts to facilitate better predictions of gene expression. Architecturally, the model is split into four modules: “a sequence encoder, a feature fusion layer, a promoter-enhancer interaction encoder, and a predictor module.” The model is designed to detect promoter regions within 1 kb of the TSS and enhancer regions up to 100 kb from the TSS.

  1. The key purpose of the sequence encoder is to derive embeddings for DNA sequences. It uses residual convolution layers to learn DNA motifs in promoter and enhancer sequences and dilated convolutional layers to learn motif cooperation. These elements are then combined via a series of convolution and pooling operations to produce a full sequence embedding.
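
The distance windows mentioned above (promoters within 1 kb of the TSS, enhancers up to 100 kb away) can be sketched as a simple filter over candidate elements. This is an illustrative assumption about how such windows might be applied, not EPInformer's actual preprocessing; the function `classify_elements`, the use of element midpoints, and the toy coordinates are all hypothetical.

```python
PROMOTER_MAX_DISTANCE = 1_000      # bp from the TSS, per the window above
ENHANCER_MAX_DISTANCE = 100_000    # bp from the TSS, per the window above


def classify_elements(tss, elements):
    """Group (name, start, end) elements by midpoint distance to the TSS.

    Assumption: distance is measured from the element midpoint; a real
    pipeline might use the nearest edge or a strand-aware offset instead.
    """
    groups = {"promoter": [], "enhancer_candidate": [], "out_of_range": []}
    for name, start, end in elements:
        distance = abs((start + end) // 2 - tss)
        if distance <= PROMOTER_MAX_DISTANCE:
            groups["promoter"].append(name)
        elif distance <= ENHANCER_MAX_DISTANCE:
            groups["enhancer_candidate"].append(name)
        else:
            groups["out_of_range"].append(name)
    return groups


# Made-up elements around a TSS at position 500,000.
toy_elements = [
    ("e1", 499_800, 500_200),   # overlaps the TSS
    ("e2", 540_000, 540_500),   # about 40 kb downstream
    ("e3", 350_000, 350_400),   # about 150 kb upstream
    ("e4", 400_100, 400_300),   # just inside the 100 kb window
]
groups = classify_elements(tss=500_000, elements=toy_elements)
print(groups)
```

Elements in the `enhancer_candidate` group are the ones a promoter-enhancer interaction module would then need to weigh against the promoter.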

Real-World Use

Perhaps the most exciting result of deep learning’s application to the prediction of gene expression and regulation is that predictions from these models are actually being used to design real-world, tissue-specific regulatory elements such as enhancers. For example, in “Cell-type-directed design of synthetic enhancers,” published in Nature, the authors use predictions from deep learning models to design enhancers that target glial cells in the fruit fly brain [10].


  1. Avsec, Ž., Agarwal, V., Visentin, D. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods 18, 1196–1203 (2021). https://doi.org/10.1038/s41592-021-01252-x

  2. Dalla-Torre, H., Gonzalez, L., Mendoza-Revilla, J. et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods 22, 287–297 (2025). https://doi.org/10.1038/s41592-024-02523-z

  3. Nguyen, E., Poli, M., Faizi, M. et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems 36, 43177–43201 (2023).

  4. Schiff, Y. et al. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. arXiv preprint arXiv:2403.03234 (2024).

  5. Ramesh, K. et al. Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences. arXiv preprint arXiv:2503.16351 (2025).

  6. Lin, J., Luo, R., Pinello, L. EPInformer: a scalable deep learning framework for gene expression prediction by integrating promoter-enhancer sequences with multimodal epigenomic data. bioRxiv preprint 2024.08.01.606099 (2024). https://doi.org/10.1101/2024.08.01.606099. PMID: 39131276; PMCID: PMC11312614.

  7. Su, X., Yu, H., Zhi, D., Ji, S. Learning to Discover Regulatory Elements for Gene Expression Prediction. International Conference on Learning Representations (2025).

  8. Grešová, K., Martinek, V., Čechák, D. et al. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genom Data 24, 25 (2023). https://doi.org/10.1186/s12863-023-01123-8

  9. Gu, A., Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).

  10. Taskiran, I.I., Spanier, K.I., Dickmänken, H. et al. Cell-type-directed design of synthetic enhancers. Nature 626, 212–220 (2024). https://doi.org/10.1038/s41586-023-06936-2

Daniel McNeela
Machine Learning Researcher and Engineer

My research interests include patent and IP law, geometric deep learning, and computational drug discovery.