Publications | Paiheng Xu

2025

Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, and 1 more author

arXiv preprint arXiv:2502.14748, 2025

A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised, supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Xiaoyu Liu*, Paiheng Xu*, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, and 7 more authors

Findings of the Association for Computational Linguistics: NAACL 2025, Mar 2025

Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs’ reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs’ strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems.

Emojis Decoded: Leveraging ChatGPT for Enhanced Understanding in Social Media Communications

Yuhang Zhou, Paiheng Xu, Xiyao Wang, Xuan Lu, Ge Gao, and Wei Ai

ICWSM, Mar 2025

Emojis, which encapsulate semantics beyond mere words or phrases, have become prevalent in social network communications. This has spurred increasing scholarly interest in exploring their attributes and functionalities. However, emoji-related research and application face two primary challenges. First, researchers typically rely on crowd-sourcing to annotate emojis in order to understand their sentiments, usage intentions, and semantic meanings. Second, subjective interpretations by users can often lead to misunderstandings of emojis and create communication barriers. Large Language Models (LLMs) have achieved significant success in various annotation tasks, with ChatGPT demonstrating expertise across multiple domains. In our study, we assess ChatGPT’s effectiveness in handling previously annotated and downstream tasks. Our objective is to validate the hypothesis that ChatGPT can serve as a viable alternative to human annotators in emoji research and that its ability to explain emoji meanings can enhance clarity and transparency in online communications. Our findings indicate that ChatGPT has extensive knowledge of emojis. It is adept at elucidating the meaning of emojis across various application scenarios and demonstrates the potential to replace human annotators in a range of tasks.

2024

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students’ reasoning capabilities. However, current methods struggle with sequence-level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.

Does Geo-co-location Matter? A Case Study of Public Health Conversations during COVID-19

Paiheng Xu, Louiqa Raschid, and Vanessa Frias-Martinez

arXiv preprint arXiv:2405.17710, May 2024

Social media platforms like Twitter (now X) have been pivotal in information dissemination and public engagement, especially during COVID-19. A key goal for public health experts was to encourage prosocial behavior that could impact local outcomes such as masking and social distancing. Given the importance of local news and guidance during COVID-19, the objective of our research is to analyze the effect of localized engagement on social media conversations. This study examines the impact of geographic co-location, as a proxy for localized engagement between public health experts (PHEs) and the public, on social media. We analyze a Twitter conversation dataset from January 2020 to November 2021, comprising over 19K tweets from nearly five hundred PHEs, along with approximately 800K replies from 350K participants. Our findings reveal that geo-co-location is associated with higher engagement rates, especially in conversations on topics including masking, lockdowns, and education, and in conversations with academic and medical professionals. Lexical features associated with emotion and personal experiences were more common in geo-co-located contexts. This research provides insights into how geographic co-location influences social media engagement and can inform strategies to improve public health messaging.

The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education

Paiheng Xu, Jing Liu, Nathan Jones, Julie Cohen, and Wei Ai

In NAACL, Jun 2024

Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.

Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

Yuhang Zhou, Paiheng Xu, Xiaoyu Liu, Bang An, Wei Ai, and Furong Huang

In ACL, Aug 2024

Language models (LMs) have achieved notable success in numerous NLP tasks, employing both fine-tuning and in-context learning (ICL) methods. While language models demonstrate exceptional performance, they face robustness challenges due to spurious correlations arising from imbalanced label distributions in training data or ICL exemplars. Previous research has primarily concentrated on word, phrase, and syntax features, neglecting the concept level, often due to the absence of concept labels and difficulty in identifying conceptual content in input texts. This paper introduces two main contributions. First, we employ ChatGPT to assign concept labels to texts, assessing concept bias in models during fine-tuning or ICL on test data. We find that LMs, when encountering spurious correlations between a concept and a label in training or prompts, resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing label distribution and mitigating spurious correlations. Our method’s efficacy, surpassing traditional token removal approaches, is validated through extensive testing.

Twitter social mobility data reveal demographic variations in social distancing practices during the COVID-19 pandemic

Paiheng Xu, David A Broniatowski, and Mark Dredze

Scientific Reports, Jan 2024

The COVID-19 pandemic demonstrated the importance of social distancing practices to stem the spread of the virus. However, compliance with public health guidelines was mixed. Understanding what factors are associated with differences in compliance can improve public health messaging since messages could be targeted and tailored to different population segments. We utilize Twitter data on social mobility during COVID-19 to reveal which populations practiced social distancing and what factors correlated with this practice. We analyze correlations between demographic and political affiliation with reductions in physical mobility measured by public geolocation tweets. We find significant differences in mobility reduction between these groups in the United States. We observe that males, Asian and Latinx individuals, older individuals, Democrats, and people from higher population density states exhibited larger reductions in movement. Furthermore, our study also unveils meaningful insights into the interactions between different groups. We hope these findings will provide evidence to support public health policy-making.

2023

GFairHint: improving individual fairness for graph neural networks via fairness hint

Paiheng Xu*, Yuhang Zhou*, Bang An, Wei Ai, and Furong Huang

ACM Transactions on Knowledge Discovery from Data, May 2023

Given the growing concerns about fairness in machine learning and the impressive performance of Graph Neural Networks (GNNs) on graph data learning, algorithmic fairness in GNNs has attracted significant attention. While many existing studies improve fairness at the group level, only a few works promote individual fairness, which renders similar outcomes for similar individuals. A desirable framework that promotes individual fairness should (1) balance between fairness and performance, (2) accommodate two commonly-used individual similarity measures (externally annotated and computed from input features), (3) generalize across various GNN models, and (4) be computationally efficient. Unfortunately, none of the prior work achieves all the desirables. In this work, we propose a novel method, GFairHint, which promotes individual fairness in GNNs and achieves all aforementioned desirables. GFairHint learns fairness representations through an auxiliary link prediction task, and then concatenates the representations with the learned node embeddings in original GNNs as a "fairness hint". Through extensive experimental investigations on five real-world graph datasets under three prevalent GNN models covering both individual similarity measures above, GFairHint achieves the best fairness results in almost all combinations of datasets with various backbone models, while generating comparable utility results, with much less computational cost compared to the previous state-of-the-art (SoTA) method.

2022

A Machine Learning Approach For Discovering Tobacco Brands, Products, and Manufacturers in the United States

Adam Poliak, Paiheng Xu, Eric Leas, Mario Navarro, Stephanie Pitts, Andie Malterud, and 2 more authors

In Annual Meeting of the Society for Research on Nicotine and Tobacco, May 2022

2021

Using Noisy Self-Reports to Predict Twitter User Demographics

Zach Wood-Doughty*, Paiheng Xu*, Xiao Liu, and Mark Dredze

In Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, Jun 2021

Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g., Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite errors inherent in automated supervision, we produce models with good performance when measured on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.

2020

The Twitter social mobility index: Measuring social distancing practices with geolocated tweets

Paiheng Xu, Mark Dredze, and David A Broniatowski

Journal of Medical Internet Research, Dec 2020

2019

On predictability of time series

Paiheng Xu, Likang Yin, Zhongtao Yue, and Tao Zhou

Physica A: Statistical Mechanics and its Applications, Feb 2019

The method to estimate the predictability of human mobility was proposed in Song et al. (2010), which is extensively followed in exploring the predictability of disparate time series. However, the ambiguous description in the original paper leads to some misunderstandings, including the inconsistent logarithm bases in the entropy estimator and the entropy-predictability-conversion equation, as well as the details in the calculation of the Lempel–Ziv estimator, which further results in remarkably overestimated predictability. This paper demonstrates the degree of overestimation by four different types of theoretically generated time series and an empirical data set, and shows the intrinsic deviation of the Lempel–Ziv estimator for highly random time series. This work provides a clear picture on this issue and thus helps researchers in correctly estimating the predictability of time series.

2018

A novel visibility graph transformation of time series into weighted networks

Paiheng Xu, Rong Zhang, and Yong Deng

Chaos, Solitons & Fractals, Nov 2018

Analyzing time series from the perspective of complex networks has interested many scientists. In this paper, a novel method of constructing a weighted complex network from time series is proposed, based on visibility graph theory. The first step is to determine the weights of vertices in the time series by linearly combining the weights generated by the induced ordered weighted averaging aggregation operator (IOWA) and the visibility graph aggregation operator (VGA). Then, two strategies, an averaging strategy and a gravity strategy, are proposed to construct the weighted network. To test the validity of the proposed method, an artificial case is adopted, in which link prediction is used to evaluate the performance of the weighted network. It is shown that the weighted network constructed by the proposed method greatly outperforms the unweighted network obtained by traditional visibility graph theory.