I try to keep this list up to date, but if I'm behind, check out my Google Scholar for the complete list!

2025

Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25

Meng Lu, Catherine Chen, Carsten Eickhoff

EMNLP 2025

Mechanistic interpretation has greatly contributed to a more detailed understanding of generative language models, enabling significant progress in identifying structures that implement key behaviors through interactions between internal components. In contrast, interpretability in information retrieval (IR) remains relatively coarse-grained, and much is still unknown as to how IR models determine whether a document is relevant to a query. In this work, we address this gap by mechanistically analyzing how one commonly used model, a cross-encoder, estimates relevance. We find that the model extracts traditional relevance signals, such as term frequency and inverse document frequency, in early-to-middle layers. These concepts are then combined in later layers, similar to the well-known probabilistic ranking function, BM25. Overall, our analysis offers a more nuanced understanding of how IR models compute relevance. Isolating these components lays the groundwork for future interventions that could enhance transparency, mitigate safety risks, and improve scalability.
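
For context on the comparison above, one standard formulation of the BM25 ranking function (exact parameterizations vary across implementations) is:

    \mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
        \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
    \qquad
    \mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)

where f(t, d) is the frequency of term t in document d, |d| is the document length, avgdl the average document length, n(t) the number of documents containing t, N the collection size, and k_1, b are free parameters. The paper's finding is that the cross-encoder extracts semantic analogues of these term-frequency and IDF signals in early-to-middle layers and combines them in later layers in a BM25-like fashion.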

How Role-Play Shapes Relevance Judgment in Zero-Shot LLM Rankers

Yumeng Wang, Jirui Qi, Panagiotis Eustratiadis, Catherine Chen, Suzan Verberne

Accepted to ECIR 2026

Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while interacting only weakly with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
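
As a concrete illustration of the kind of prompt variation studied here, the sketch below contrasts a plain zero-shot relevance prompt with a role-play variant. The wording and example texts are my own hypothetical choices, not the prompts evaluated in the paper.

    # Hypothetical prompt templates contrasting a plain zero-shot relevance
    # prompt with a role-play variant (illustrative wording only).
    def ranking_prompt(query, doc, role=None):
        role_line = f"You are {role}.\n" if role else ""
        return (
            f"{role_line}"
            f"Query: {query}\n"
            f"Document: {doc}\n"
            "Is the document relevant to the query? Answer 'Yes' or 'No'."
        )

    q = "treatment options for type 2 diabetes"
    d = "Metformin is commonly prescribed as a first-line therapy."
    print(ranking_prompt(q, d))                                              # no role
    print(ranking_prompt(q, d, role="a medical librarian judging search results"))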

Towards Best Practices of Axiomatic Activation Patching in Information Retrieval

Gregory Polyakov, Catherine Chen, Carsten Eickhoff

SIGIR 2025

Mechanistic interpretability research, which aims to uncover the internal processes of machine learning models, has gained significant attention. One state-of-the-art technique, activation patching, has been applied to analyzing neural ranker behavior in relation to information retrieval (IR) axioms. To date, however, this remains a rapidly evolving topic in IR, with no established methodology for measuring results or constructing datasets to ensure pronounced, robust, and consistent patching effects. In this study, based on experimental results, we provide recommendations on measuring patching effects and designing diagnostic datasets for investigating term frequency. We identify the rarity and informativeness of injected terms as key factors influencing the magnitude of patching effects. Additionally, we find that low score differences between baseline and perturbed documents introduce significant noise, which can be mitigated by filtering or applying penalty scores to the metric. More generally, we provide practical recommendations for the reliable application of activation patching in IR, advancing future interpretability research of neural ranking models. Our code is available at https://github.com/polgrisha/best-practices-ir-patching.
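
To make the setup concrete, here is a minimal activation-patching sketch in the spirit of this line of work: score a baseline document and a term-injected perturbation with a cross-encoder, then patch one layer's activations from the perturbed run into the baseline run and see how much of the score gap is restored. The model choice, patched layer, fixed-length padding, and normalized effect metric are illustrative assumptions, not the paper's exact protocol.

    # Minimal activation-patching sketch for a BERT-based cross-encoder.
    # Assumptions (not the paper's exact setup): model choice, layer index,
    # fixed-length padding to keep activation shapes aligned, and the
    # normalized patching-effect metric.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # assumed example ranker
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()

    query = "symptoms of influenza"
    doc = "General information about seasonal illnesses and their spread."
    baseline = doc                               # no query term injected
    perturbed = doc + " influenza influenza"     # term-injection perturbation

    LAYER = 3  # arbitrary choice of the layer whose output gets patched

    def run(document, patch=None):
        """Return (relevance score, cached layer output), optionally patching."""
        cache = {}
        def hook(module, inputs, output):
            cache["h"] = output[0].detach()
            if patch is not None:
                return (patch,) + output[1:]
        handle = model.bert.encoder.layer[LAYER].register_forward_hook(hook)
        enc = tok(query, document, return_tensors="pt",
                  padding="max_length", max_length=64, truncation=True)
        with torch.no_grad():
            score = model(**enc).logits[0, 0].item()
        handle.remove()
        return score, cache["h"]

    s_base, _ = run(baseline)
    s_pert, h_pert = run(perturbed)
    s_patch, _ = run(baseline, patch=h_pert)     # patch perturbed activations in
    effect = (s_patch - s_base) / (s_pert - s_base + 1e-9)
    print(f"baseline={s_base:.3f}  perturbed={s_pert:.3f}  "
          f"patched={s_patch:.3f}  normalized effect={effect:.3f}")

A larger normalized effect for a given component suggests it carries more of the term-frequency signal, which is the kind of measurement whose robustness the paper examines.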

MechIR: A Mechanistic Interpretability Framework for Information Retrieval

Andrew Parry, Catherine Chen, Carsten Eickhoff, Sean MacAvaney

ECIR 2025

Mechanistic interpretability is an emerging diagnostic approach for neural models that has gained traction in broader natural language processing domains. This paradigm aims to provide attribution to components of neural systems where causal relationships between hidden layers and output were previously uninterpretable. As the use of neural models in IR for retrieval and evaluation becomes ubiquitous, we need to ensure that we can interpret why a model produces a given output for both transparency and the betterment of systems. This work comprises a flexible framework for diagnostic analysis and intervention within these highly parametric neural systems specifically tailored for IR tasks and architectures. In providing such a framework, we look to facilitate further research in interpretable IR with a broader scope for practical interventions derived from mechanistic interpretability. We provide preliminary analysis and look to demonstrate our framework through an axiomatic lens to show its applications and ease of use for those IR practitioners inexperienced in this emerging paradigm.

2024

Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models

Catherine Chen, Jack Merullo, Carsten Eickhoff

SIGIR 2024

Neural models have demonstrated remarkable performance across diverse ranking tasks. However, the processes and internal mechanisms along which they determine relevance are still largely unknown. Existing approaches for analyzing neural ranker behavior with respect to IR properties rely either on assessing overall model behavior or employing probing methods that may offer an incomplete understanding of causal mechanisms. To provide a more granular understanding of internal model decision-making processes, we propose the use of causal interventions to reverse engineer neural rankers, and demonstrate how mechanistic interpretability methods can be used to isolate components satisfying term-frequency axioms within a ranking model. We identify a group of attention heads that detect duplicate tokens in earlier layers of the model, then communicate with downstream heads to compute overall document relevance. More generally, we propose that this style of mechanistic analysis opens up avenues for reverse engineering the processes neural retrieval models use to compute relevance. This work aims to initiate granular interpretability efforts that will not only benefit retrieval model development and training, but ultimately ensure safer deployment of these models.
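
A lightweight way to build intuition for the "duplicate token" heads described above is to inspect attention patterns directly: for each head, check how strongly a repeated term's later occurrence attends to its earlier one. The sketch below is only an illustrative heuristic on a generic encoder, not the causal patching analysis the paper performs.

    # Heuristic scan for candidate duplicate-token heads: per head, measure
    # the attention a repeated word's second occurrence pays to its first.
    # Illustrative only; the paper identifies such heads via causal interventions.
    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "bert-base-uncased"  # assumed example encoder
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_attentions=True).eval()

    text = "the cat sat on the mat because the cat was tired"
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**enc).attentions    # per layer: [1, heads, seq, seq]

    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    first, second = [i for i, t in enumerate(tokens) if t == "cat"][:2]

    for layer, att in enumerate(attentions):
        weights = att[0, :, second, first]      # later duplicate -> earlier one
        head = int(weights.argmax())
        print(f"layer {layer:2d}: head {head:2d} attends to the earlier "
              f"duplicate with weight {weights[head].item():.2f}")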

Evaluating Search System Explainability with Psychometrics and Crowdsourcing

Catherine Chen, Carsten Eickhoff

SIGIR 2024

Information retrieval (IR) systems have become an integral part of our everyday lives. As search engines, recommender systems, and conversational agents are employed across various domains from recreational search to clinical decision support, there is an increasing need for transparent and explainable systems to guarantee accountable, fair, and unbiased results. Despite many recent advances towards explainable AI and IR techniques, there is no consensus on what it means for a system to be explainable. Although a growing body of literature suggests that explainability comprises multiple subfactors, virtually all existing approaches treat it as a singular notion. In this paper, we examine explainability in Web search systems, leveraging psychometrics and crowdsourcing to identify human-centered factors of explainability.
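
For readers less familiar with the psychometric side, the core analysis pattern is factor analysis over crowdsourced questionnaire responses. Below is a minimal scikit-learn sketch on synthetic Likert-style data; it illustrates the general approach only, not the paper's instrument, sample, or statistical model.

    # Minimal factor-analysis sketch on synthetic Likert-style ratings
    # (participants x questionnaire items). Illustrative of the general
    # psychometric approach, not the paper's instrument or analysis.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n_participants, n_items = 200, 9
    # synthetic data: three latent factors, three items loading on each
    latent = rng.normal(size=(n_participants, 3))
    loadings = np.kron(np.eye(3), np.ones((1, 3)))          # block structure
    responses = latent @ loadings + 0.5 * rng.normal(size=(n_participants, n_items))

    fa = FactorAnalysis(n_components=3, rotation="varimax")
    fa.fit(responses)
    # items with large absolute loadings on the same factor group together,
    # suggesting a shared underlying subfactor of explainability
    print(np.round(fa.components_.T, 2))    # item-by-factor loading matrix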

2023

Outlier Dimensions Encode Task Specific Knowledge

William Rudman, Catherine Chen, Carsten Eickhoff

EMNLP 2023

Representations from large language models (LLMs) are known to be dominated by a small subset of dimensions with exceedingly high variance. Previous works have argued that, although ablating these outlier dimensions in LLM representations hurts downstream performance, outlier dimensions are detrimental to the representational quality of embeddings. In this study, we investigate how fine-tuning impacts outlier dimensions and show that 1) outlier dimensions that occur in pre-training persist in fine-tuned models and 2) a single outlier dimension can complete downstream tasks with a minimal error rate. Our results suggest that outlier dimensions can encode crucial task-specific knowledge and that the value of a representation in a single outlier dimension drives downstream model decisions.
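
As a quick illustration of what "outlier dimensions" look like in practice, the sketch below mean-pools sentence embeddings from a generic encoder and flags dimensions whose variance sits far above the per-dimension average. The model, toy sentences, and 5x-mean threshold are my own illustrative choices, not the paper's setup.

    # Flag candidate outlier dimensions: dimensions whose activation variance
    # is far above the average. Model, sentences, and the 5x-mean threshold
    # are illustrative choices, not the paper's experimental setup.
    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "bert-base-uncased"  # assumed example model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()

    sentences = [
        "the movie was wonderful",
        "the plot made no sense at all",
        "an instant classic",
        "i want those two hours back",
    ]
    enc = tok(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state           # [batch, seq, dim]
    mask = enc["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(1) / mask.sum(1)            # mean-pooled embeddings

    var = emb.var(dim=0)                                  # variance per dimension
    outliers = (var > 5 * var.mean()).nonzero().flatten()
    print("candidate outlier dimensions:", outliers.tolist())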

Quantifying and Advancing Information Retrieval System Explainability

Catherine Chen

SIGIR 2023
