Publications | Paiheng Xu

2026

Check The Scoreboard: An Analysis of Scoring Schemes on Multiple-Choice Evaluation

Nishant Balepur, Paiheng Xu, Wei Ai, Eunsol Choi, Rachel Rudinger, and Jordan Boyd-Graber

Under review, 2026

NLP benchmarks multiple-choice question answering (MCQA) via number-right scoring (i.e., accuracy), but in educational testing, the scoring scheme is a key design choice that dictates which abilities to reward. We examine how alternatives to number-right scoring change what MCQA measures, using six education-inspired schemes that assess abilities beyond accuracy: distractor elimination, abstention, confidence calibration, and self-correction. On LLM benchmarks, the schemes: 1) shift LLM rankings beyond prompt sensitivity under number-right scoring; 2) better predict the LLMs users prefer in LLM Arena; and 3) reveal unique model behaviors, for example that GPT-5 rarely abstains and readily self-corrects, while weaker open-weight models often abstain and hesitate to eliminate choices. Given the benefits of alternative scoring schemes, we discuss ways to blueprint them in tasks beyond MCQA.
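To make the contrast with number-right scoring concrete, here is a minimal sketch of one classic scheme from educational testing that rewards abstention: formula scoring, also called correction for guessing. Whether it matches any of the six schemes in the paper is an assumption. The function names and the convention that `None` marks an abstention are hypothetical.

```python
def number_right(responses, answers):
    """Number-right scoring: one point per correct response (accuracy times n)."""
    return sum(1 for r, a in zip(responses, answers) if r == a)


def formula_score(responses, answers, num_choices):
    """Formula scoring (correction for guessing).

    A correct answer earns 1, an abstention (None) earns 0, and a wrong
    answer costs 1 / (num_choices - 1), so blind guessing has an expected
    score of zero.
    """
    right = number_right(responses, answers)
    wrong = sum(
        1 for r, a in zip(responses, answers) if r is not None and r != a
    )
    return right - wrong / (num_choices - 1)


# Illustrative data only: four 4-choice questions, one abstention.
answers = ["A", "B", "C", "D"]
responses = ["A", "C", None, "D"]
print(number_right(responses, answers))
print(formula_score(responses, answers, num_choices=4))
```

Under number-right scoring, the abstention and the wrong answer count the same. Under formula scoring, the wrong answer is penalized and the abstention is not.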
Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

Paiheng Xu, Jing Liu, and Wei Ai

arXiv:2606.03029, 2026

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers’ domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature–covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
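The two stratum-level operations named in the abstract can be illustrated generically. The sketch below is not the paper's implementation: it shows standard within-stratum demeaning and one common form of inverse-frequency weighting, with hypothetical function names and toy data.

```python
from collections import Counter, defaultdict


def demean_within_strata(values, strata):
    """Subtract each stratum's mean from the values in that stratum."""
    sums = defaultdict(float)
    counts = Counter(strata)
    for v, s in zip(values, strata):
        sums[s] += v
    means = {s: sums[s] / counts[s] for s in counts}
    return [v - means[s] for v, s in zip(values, strata)]


def inverse_frequency_weights(strata):
    """Weight each example by 1 / (K * n_s), where K is the number of strata
    and n_s the size of the example's stratum, so every stratum carries the
    same total weight (1 / K) and all weights sum to 1."""
    counts = Counter(strata)
    k = len(counts)
    return [1.0 / (k * counts[s]) for s in strata]


# Toy data (assumed): stratum "a" has two examples, stratum "b" has one.
values = [1.0, 3.0, 10.0]
strata = ["a", "a", "b"]
print(demean_within_strata(values, strata))
print(inverse_frequency_weights(strata))
```

Demeaning removes between-stratum level differences, and the weights keep a large stratum from dominating a small one.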
Does Geo-co-location Matter? A Case Study of Public Health Conversations during COVID-19

Paiheng Xu, Louiqa Raschid, and Vanessa Frias-Martinez

In ICWSM, May 2026

Social media platforms like Twitter (now X) have been pivotal in information dissemination and public engagement, especially during COVID-19. A key goal for public health experts was to encourage prosocial behavior that could impact local outcomes such as masking and social distancing. Given the importance of local news and guidance during COVID-19, the objective of our research is to analyze the effect of localized engagement on social media conversations. This study examines the impact of geographic co-location, as a proxy for localized engagement between public health experts (PHEs) and the public, on social media. We analyze a Twitter conversation dataset from January 2020 to November 2021, comprising over 19K tweets from nearly five hundred PHEs, along with approximately 800K replies from 350K participants. Our findings reveal that geo-co-location is associated with higher engagement rates, especially in conversations on topics including masking, lockdowns, and education, and in conversations with academic and medical professionals. Lexical features associated with emotion and personal experiences were more common in geo-co-located contexts. This research provides insights into how geographic co-location influences social media engagement and can inform strategies to improve public health messaging.
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Paiheng Xu, Gang Wu, Xiang Chen, Tong Yu, Chang Xiao, Franck Dernoncourt, and 3 more authors

In Findings of the Association for Computational Linguistics: EACL 2026, Mar 2026

Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset—a collection of verified scripts—by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.

2025

DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, and 4 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2025, Nov 2025

Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups—assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models

Zongxia Li, Lorena Calvo-Bartolomé, Alexander Miserlis Hoyle, Paiheng Xu, Daniel Kofi Stephens, Juan Francisco Fung, and 2 more authors

In ACL, Jul 2025

A common use of NLP is to facilitate the understanding of large document collections, with models based on Large Language Models (LLMs) replacing probabilistic topic models. Yet the effectiveness of LLM-based approaches in real-world applications remains underexplored. This study measures the knowledge users acquire with topic models (including traditional, unsupervised, and supervised LLM-based approaches) on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to LLM-based topic models improves data exploration by addressing hallucination and genericity but requires more human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. This paper provides best practices (there is no one right model; the choice of model is situation-specific) and suggests potential improvements for scalable LLM-based topic models.
Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Xiaoyu Liu*, Paiheng Xu*, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, and 7 more authors

Findings of the Association for Computational Linguistics: NAACL, Mar 2025

Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs’ reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs’ strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems.
Emojis decoded: Leveraging ChatGPT for enhanced understanding in social media communications

Yuhang Zhou, Paiheng Xu, Xiyao Wang, Xuan Lu, Ge Gao, and Wei Ai

ICWSM, Jun 2025

Emojis, which encapsulate semantics beyond mere words or phrases, have become prevalent in social network communications. This has spurred increasing scholarly interest in exploring their attributes and functionalities. However, emoji-related research and application face two primary challenges. First, researchers typically rely on crowd-sourcing to annotate emojis in order to understand their sentiments, usage intentions, and semantic meanings. Second, subjective interpretations by users can often lead to misunderstandings of emojis and create communication barriers. Large Language Models (LLMs) have achieved significant success in various annotation tasks, with ChatGPT demonstrating expertise across multiple domains. In our study, we assess ChatGPT’s effectiveness in handling previously annotated and downstream tasks. Our objective is to validate the hypothesis that ChatGPT can serve as a viable alternative to human annotators in emoji research and that its ability to explain emoji meanings can enhance clarity and transparency in online communications. Our findings indicate that ChatGPT has extensive knowledge of emojis. It is adept at elucidating the meaning of emojis across various application scenarios and demonstrates the potential to replace human annotators in a range of tasks.

2024

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP, Nov 2024

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students’ reasoning capabilities. However, current methods struggle with sequence-level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education

Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, and Wei Ai

In NAACL, Jun 2024

Assessing instruction quality is a fundamental component of any improvement effort in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis: noisy and long input data, and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performance comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.
Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Bang An, Wei Ai, and Furong Huang

In ACL, Aug 2024

Language models (LMs) have achieved notable success in numerous NLP tasks, employing both fine-tuning and in-context learning (ICL) methods. While language models demonstrate exceptional performance, they face robustness challenges due to spurious correlations arising from imbalanced label distributions in training data or ICL exemplars. Previous research has primarily concentrated on word, phrase, and syntax features, neglecting the concept level, often due to the absence of concept labels and difficulty in identifying conceptual content in input texts. This paper introduces two main contributions. First, we employ ChatGPT to assign concept labels to texts, assessing concept bias in models during fine-tuning or ICL on test data. We find that LMs, when encountering spurious correlations between a concept and a label in training or prompts, resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing label distribution and mitigating spurious correlations. Our method’s efficacy, surpassing traditional token removal approaches, is validated through extensive testing.
Twitter social mobility data reveal demographic variations in social distancing practices during the COVID-19 pandemic

Paiheng Xu, David A Broniatowski, and Mark Dredze

Scientific Reports, Jan 2024

The COVID-19 pandemic demonstrated the importance of social distancing practices to stem the spread of the virus. However, compliance with public health guidelines was mixed. Understanding what factors are associated with differences in compliance can improve public health messaging, since messages could be targeted and tailored to different population segments. We utilize Twitter data on social mobility during COVID-19 to reveal which populations practiced social distancing and what factors correlated with this practice. We analyze correlations of demographics and political affiliation with reductions in physical mobility measured by public geolocation tweets. We find significant differences in mobility reduction between these groups in the United States. We observe that males, Asian and Latinx individuals, older individuals, Democrats, and people from higher population density states exhibited larger reductions in movement. Our study also reveals meaningful insights into the interactions between different groups. We hope these findings will provide evidence to support public health policy-making.

2023

GFairHint: improving individual fairness for graph neural networks via fairness hint

Paiheng Xu*, Yuhang Zhou*, Bang An, Wei Ai, and Furong Huang

ACM Transactions on Knowledge Discovery from Data, May 2023

Given the growing concerns about fairness in machine learning and the impressive performance of Graph Neural Networks (GNNs) on graph data learning, algorithmic fairness in GNNs has attracted significant attention. While many existing studies improve fairness at the group level, only a few works promote individual fairness, which renders similar outcomes for similar individuals. A desirable framework that promotes individual fairness should (1) balance between fairness and performance, (2) accommodate two commonly used individual similarity measures (externally annotated and computed from input features), (3) generalize across various GNN models, and (4) be computationally efficient. Unfortunately, none of the prior work achieves all of these desiderata. In this work, we propose a novel method, GFairHint, which promotes individual fairness in GNNs and achieves all of the aforementioned desiderata. GFairHint learns fairness representations through an auxiliary link prediction task, and then concatenates the representations with the learned node embeddings in original GNNs as a "fairness hint". Through extensive experimental investigations on five real-world graph datasets under three prevalent GNN models covering both individual similarity measures above, GFairHint achieves the best fairness results in almost all combinations of datasets with various backbone models, while generating comparable utility results, with much less computational cost compared to the previous state-of-the-art (SoTA) method.

2022

A Machine Learning Approach For Discovering Tobacco Brands, Products, and Manufacturers in the United States

Adam Poliak, Paiheng Xu, Eric Leas, Mario Navarro, Stephanie Pitts, Andie Malterud, and 2 more authors

In Annual Meeting of the Society for Research on Nicotine and Tobacco, May 2022

2021

Using Noisy Self-Reports to Predict Twitter User Demographics

Zach Wood-Doughty*, Paiheng Xu*, Xiao Liu, and Mark Dredze

In Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, Jun 2021

Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g., Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite errors inherent in automated supervision, we produce models with good performance when measured on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.

2020

The Twitter social mobility index: Measuring social distancing practices with geolocated tweets

Paiheng Xu, Mark Dredze, and David A Broniatowski

Journal of Medical Internet Research, Dec 2020

2019

On predictability of time series

Paiheng Xu, Likang Yin, Zhongtao Yue, and Tao Zhou

Physica A: Statistical Mechanics and its Applications, Feb 2019

The method for estimating the predictability of human mobility proposed in Song et al. (2010) has been widely adopted for exploring the predictability of disparate time series. However, the ambiguous description in the original paper has led to some misunderstandings, including inconsistent logarithm bases in the entropy estimator and the entropy-predictability conversion equation, as well as in the details of calculating the Lempel–Ziv estimator, which in turn result in remarkably overestimated predictability. This paper demonstrates the degree of overestimation using four different types of theoretically generated time series and an empirical data set, and shows the intrinsic deviation of the Lempel–Ziv estimator for highly random time series. This work provides a clear picture of this issue and thus helps researchers correctly estimate the predictability of time series.
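The logarithm-base issue can be made concrete with the entropy-predictability conversion from Song et al. (2010), which solves S = H(Π) + (1 − Π) log2(N − 1) for the maximum predictability Π, where H is the binary entropy. The sketch below is an illustration, not the paper's released code. It assumes the entropy S is expressed in bits, so every logarithm is base 2, and that N is the number of distinct states. The function names are hypothetical.

```python
import math


def fano_rhs(p, n):
    """H(p) + (1 - p) * log2(n - 1), with every logarithm in base 2."""
    h = 0.0
    if 0.0 < p < 1.0:
        h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return h + (1.0 - p) * math.log2(n - 1)


def max_predictability(entropy_bits, n, tol=1e-12):
    """Solve fano_rhs(p, n) = entropy_bits for p in [1/n, 1] by bisection.

    On that interval the left-hand side decreases monotonically from
    log2(n) to 0, so the root is unique.
    """
    if n < 2:
        raise ValueError("need at least two distinct states")
    if not 0.0 <= entropy_bits <= math.log2(n) + 1e-9:
        raise ValueError("entropy must lie in [0, log2(n)] bits")
    lo, hi = 1.0 / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano_rhs(mid, n) > entropy_bits:
            lo = mid  # still too much entropy: predictability is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


print(max_predictability(0.0, 4))  # fully regular series
print(max_predictability(2.0, 4))  # uniformly random over 4 states
```

If the entropy were estimated in nats but passed to this function as bits, its numerical value would be too small and the returned predictability too high. This is the kind of base mismatch the abstract describes.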

2018

A novel visibility graph transformation of time series into weighted networks

Paiheng Xu, Rong Zhang, and Yong Deng

Chaos, Solitons & Fractals, Nov 2018

Analyzing time series from the perspective of complex networks has interested many scientists. In this paper, based on visibility graph theory, a novel method of constructing weighted complex networks from time series is proposed. The first step is to determine the weights of vertices in the time series by linearly combining the weights generated by the induced ordered weighted averaging aggregation operator (IOWA) and the visibility graph aggregation operator (VGA). Then, two strategies, an averaging strategy and a gravity strategy, are proposed to construct the weighted network. To test the validity of the proposed method, an artificial case is adopted, in which link prediction is used to evaluate the performance of the weighted network. It is shown that the weighted network constructed by the proposed method greatly outperforms the unweighted network obtained by traditional visibility graph theory.
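For context, the unweighted baseline mentioned above can be sketched with the standard natural visibility criterion: two points are linked when every point between them lies strictly below the straight line connecting them. This sketch covers only that traditional construction, not the paper's IOWA/VGA weighting or its averaging and gravity strategies. It assumes evenly spaced samples indexed 0..n-1, and the function name is hypothetical.

```python
def visibility_edges(series):
    """Return the edges (a, b), a < b, of the natural visibility graph.

    Points a and b see each other if, for every c with a < c < b,
    series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a).
    """
    n = len(series)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            ya, yb = series[a], series[b]
            visible = all(
                series[c] < yb + (ya - yb) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.append((a, b))
    return edges


# A dip in the middle leaves the endpoints mutually visible.
print(visibility_edges([3, 1, 2]))
# A peak in the middle blocks the endpoints.
print(visibility_edges([1, 3, 1]))
```

This direct implementation checks every pair and is cubic in the series length in the worst case, which is fine for short illustrative series.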