Computation and Language
☆ Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Recent benchmarks for Large Language Model (LLM) agents primarily focus on
evaluating reasoning, planning, and execution capabilities, while another
critical component, memory, which encompasses how agents memorize, update, and
retrieve long-term information, remains under-evaluated due to the lack of
benchmarks. We refer to agents with memory mechanisms as memory agents. In this
paper, we identify four core competencies essential for memory agents: accurate
retrieval, test-time learning, long-range understanding, and conflict
resolution. Existing datasets either rely on limited context lengths or are
tailored for static, long-context settings like book-based QA, which do not
reflect the interactive, multi-turn nature of memory agents that incrementally
accumulate information. Furthermore, no existing benchmarks cover all four
competencies. Therefore, we introduce MemoryAgentBench, a new benchmark
specifically designed for memory agents. Our benchmark combines reformulated
existing datasets with newly constructed ones, covering the above four memory
competencies, providing a systematic and challenging testbed for assessing
memory quality. We evaluate a diverse set of memory agents, ranging from simple
context-based and retrieval-augmented generation (RAG) systems to advanced
agents with external memory modules and tool integration. Empirical results
reveal that current methods fall short of mastering all four competencies,
underscoring the need for further research into comprehensive memory mechanisms
for LLM agents.
comment: 23 Pages, Y. Hu and Y. Wang contribute equally
☆ Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
The remarkable reasoning capability of large language models (LLMs) stems
from cognitive behaviors that emerge through reinforcement with verifiable
rewards. This work investigates how to transfer this principle to Multimodal
LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage
paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning,
followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps,
surpassing all previous open-source efforts in scale. This pioneering work
reveals three fundamental insights: 1) Behavior transfer emerges surprisingly
early in cold start due to linguistic mental imagery. 2) Cold start broadly
memorizes visual behaviors, while RL critically discerns and scales up
effective patterns. 3) Transfer strategically favors high-utility behaviors
such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR),
achieves state-of-the-art performance on a suite of reasoning benchmarks,
including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We
release our model, data, and training dynamics to catalyze the development of
more capable, behavior-aligned multimodal reasoners.
☆ Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Contextual priming, where earlier stimuli covertly bias later judgments,
offers an unexplored attack surface for large language models (LLMs). We
uncover a contextual priming vulnerability in which a previous response in the
dialogue can steer the model's subsequent behavior toward policy-violating
content. Building on this insight, we propose Response Attack (RA), which uses
an auxiliary LLM to generate a mildly harmful response to a paraphrased version
of the original malicious query. This response is then formatted into the
dialogue and followed by a succinct trigger prompt, thereby priming the target
model to generate harmful content. Across eight open-source and proprietary LLMs, RA
consistently outperforms seven state-of-the-art jailbreak techniques, achieving
higher attack success rates. To mitigate this threat, we construct and release
a context-aware safety fine-tuning dataset, which significantly reduces the
attack success rate while preserving model capabilities. The code and data are
available at https://github.com/Dtc7w3PQ/Response-Attack.
comment: 21 pages, 9 figures. Code and data available at
https://github.com/Dtc7w3PQ/Response-Attack
☆ When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
While chain-of-thought (CoT) monitoring is an appealing AI safety defense,
recent work on "unfaithfulness" has cast doubt on its reliability. These
findings highlight an important failure mode, particularly when CoT acts as a
post-hoc rationalization in applications like auditing for bias. However, for
the distinct problem of runtime monitoring to prevent severe harm, we argue the
key property is not faithfulness but monitorability. To this end, we introduce
a conceptual framework distinguishing CoT-as-rationalization from
CoT-as-computation. We expect that certain classes of severe harm will require
complex, multi-step reasoning that necessitates CoT-as-computation. Replicating
the experimental setups of prior work, we increase the difficulty of the bad
behavior to enforce this necessity condition; this forces the model to expose
its reasoning, making it monitorable. We then present methodology guidelines to
stress-test CoT monitoring against deliberate evasion. Applying these
guidelines, we find that models can learn to obscure their intentions, but only
when given significant help, such as detailed human-written strategies or
iterative optimization against the monitor. We conclude that, while not
infallible, CoT monitoring offers a substantial layer of defense that requires
active protection and continued stress-testing.
☆ SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Siheng Chen
The rapid advancements of AI agents have ignited the long-held ambition of
leveraging them to accelerate scientific discovery. Achieving this goal
requires a deep understanding of the frontiers of human knowledge. As such,
Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for
evaluating scientific AI agents. In this work, we aim to construct the
foundational architecture for general-purpose agents and validate its
capabilities through leading performance on HLE. To achieve this, we introduce
X-Master, a tool-augmented reasoning agent designed to emulate human
researchers by interacting flexibly with external tools during its reasoning
process. This agent, guided by the conceptualization of code as an interaction
language, can flexibly leverage built-in Python libraries and our customized
tools to augment its reasoning. We further scale its capabilities through
X-Masters, a scattered-and-stacked agentic workflow that systematically
enhances the breadth and depth of reasoning. Our open-source solution, X-Masters,
sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing
OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to
exceed the 30% threshold. This work allows us to gain a deeper understanding of
complex task-solving and accumulates valuable experience that can inform future
advancements, guiding subsequent model training.
comment: 12 pages, 7 figures
☆ Logit Reweighting for Topic-Focused Summarization
Generating abstractive summaries that adhere to a specific topic remains a
significant challenge for language models. While standard approaches, such as
fine-tuning, are resource-intensive, simpler methods like prompt engineering
often struggle to maintain topical focus, particularly with smaller models. To
address this, we propose a lightweight method that enhances topical relevance
by directly reweighting the logits of topic-relevant tokens during generation.
We evaluate three such reweighting techniques: Constant Shift, which adds a
constant value to logits; Factor Scaling, which multiplies them by a factor;
and Threshold Selection, which selectively boosts logits that exceed a
probability threshold. Experiments on the NEWTS topical summarization dataset,
using both Gemma-2B and Llama-3-8B models, show that these techniques
effectively increase the use of topic-relevant vocabulary. Notably, the
Threshold Selection method successfully improves topical focus without
compromising summary quality, a trade-off often seen in other approaches. Our
findings demonstrate that directly reweighting logits is a practical and
resource-efficient alternative to fine-tuning, offering a promising pathway for
precisely controlling the thematic content of generated text.
comment: 11 pages, 13 figures
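To make the three reweighting schemes concrete, the sketch below shows how such logit adjustments could be applied at a single decoding step; the function, default hyperparameter values, and the topic_token_ids set are illustrative assumptions rather than the authors' implementation.

```python
import torch

def reweight_logits(logits, topic_token_ids, method="threshold",
                    shift=2.0, factor=1.5, prob_threshold=1e-3):
    """Boost topic-relevant tokens at one decoding step.

    logits: 1-D tensor of next-token logits.
    topic_token_ids: vocabulary indices of topic-relevant tokens.
    The three branches mirror Constant Shift, Factor Scaling, and
    Threshold Selection as described in the abstract.
    """
    adjusted = logits.clone()
    ids = torch.tensor(topic_token_ids, dtype=torch.long)
    if method == "constant_shift":
        adjusted[ids] += shift                      # add a constant value to the logits
    elif method == "factor_scaling":
        adjusted[ids] *= factor                     # multiply the logits by a factor
    elif method == "threshold":
        probs = torch.softmax(logits, dim=-1)
        mask = probs[ids] > prob_threshold          # boost only already-plausible tokens
        adjusted[ids[mask]] += shift
    return adjusted
```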
☆ Interleaving Logic and Counting
Reasoning with quantifier expressions in natural language combines logical
and arithmetical features, transcending strict divides between qualitative and
quantitative. Our topic is this cooperation of styles as it occurs in common
linguistic usage and its extension into the broader practice of natural
language plus "grassroots mathematics".
We begin with a brief review of first-order logic with counting operators and
cardinality comparisons. This system is known to be of high complexity, and
drowns out finer aspects of the combination of logic and counting. We move to a
small fragment that can represent numerical syllogisms and basic reasoning
about comparative size: monadic first-order logic with counting. We provide
normal forms that allow for axiomatization, determine which arithmetical
notions can be defined on finite and on infinite models, and conversely, we
discuss which logical notions can be defined out of purely arithmetical ones,
and what sort of (non-)classical logics can be induced.
Next, we investigate a series of strengthenings, again using normal form
methods. The monadic second-order version is close, in a precise sense, to
additive Presburger Arithmetic, while versions with the natural device of tuple
counting take us to Diophantine equations, making the logic undecidable. We
also define a system that combines basic modal logic over binary accessibility
relations with counting, needed to formulate ubiquitous reasoning patterns such
as the Pigeonhole Principle.
We return to our starting point in natural language, confronting the
architecture of our formal systems with linguistic quantifier vocabulary and
syntax. We conclude with some general thoughts on yet further entanglements of
logic and counting in formal systems, on rethinking the
qualitative/quantitative divide, and on connecting our analysis to empirical
findings in cognitive science.
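As a small illustration of the kind of cardinality comparison such a logic supports, a numerical syllogism like "most A are B" can be rendered by counting definable sets (a standard rendering given as an editorial example, not necessarily the paper's notation):

```latex
% "Most A are B": among the A's, those that are B outnumber those that are not.
\#\{x \mid A(x) \land B(x)\} \;>\; \#\{x \mid A(x) \land \lnot B(x)\}
```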
☆ MedGemma Technical Report
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang
Artificial intelligence (AI) has significant potential in healthcare
applications, but its training and deployment face challenges due to
healthcare's diverse data, complex tasks, and the need to preserve privacy.
Foundation models that perform well on medical tasks and require less
task-specific tuning data are critical to accelerate the development of
healthcare AI applications. We introduce MedGemma, a collection of medical
vision-language foundation models based on Gemma 3 4B and 27B. MedGemma
demonstrates advanced medical understanding and reasoning on images and text,
significantly exceeding the performance of similar-sized generative models and
approaching the performance of task-specific models, while maintaining the
general capabilities of the Gemma 3 base models. For out-of-distribution tasks,
MedGemma achieves 2.6-10% improvement on medical multimodal question answering,
15.5-18.1% improvement on chest X-ray finding classification, and 10.8%
improvement on agentic evaluations compared to the base models. Fine-tuning
MedGemma further improves performance in subdomains, reducing errors in
electronic health record information retrieval by 50% and reaching comparable
performance to existing specialized state-of-the-art methods for pneumothorax
classification and histopathology patch classification. We additionally
introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
MedSigLIP powers the visual understanding capabilities of MedGemma and as an
encoder achieves performance comparable to or better than specialized medical
image encoders. Taken together, the MedGemma collection provides a strong
foundation of medical image and text capabilities, with potential to
significantly accelerate medical research and development of downstream
applications. The MedGemma collection, including tutorials and model weights,
can be found at https://goo.gle/medgemma.
☆ Pre-Trained Policy Discriminators are General Reward Models
Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
We offer a novel perspective on reward modeling by formulating it as a policy
discriminator, which quantifies the difference between two policies to generate
a reward signal, guiding the training policy towards a target policy with
desired behaviors. Based on this conceptual insight, we propose a scalable
pre-training method named Policy Discriminative Learning (POLAR), which trains
a reward model (RM) to discern identical policies and discriminate different
ones. Unlike traditional reward modeling methods relying on absolute
preferences, POLAR captures the relative difference between one policy and an
arbitrary target policy, which is a scalable, high-level optimization objective
suitable for modeling generic ranking relationships. Leveraging the POLAR
pre-training paradigm, we present a series of RMs with parameter scales from
1.8B to 7B. Empirical results show that POLAR substantially outperforms
traditional non-pre-trained methods, significantly enhancing RM performance.
For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on
STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA
baselines. POLAR also shows robust generalization capabilities in RLHF using
Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly
enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36%
to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover,
scaling experiments reveal a clear power-law relationship between computation
and performance, supported by linear correlation coefficients approaching 0.99.
The impressive performance, strong generalization, and scaling properties
suggest that POLAR is a promising direction for developing general and strong
reward models.
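The discriminative idea can be caricatured with a simple ranking objective: the reward model should score a response drawn from the target policy above a response drawn from a different policy for the same prompt. The sketch below is an editorial illustration under assumed interfaces (rm, resp_target, resp_other are hypothetical), not the actual POLAR loss.

```python
import torch.nn.functional as F

def discriminative_rm_loss(rm, prompt, resp_target, resp_other):
    """Policy-discrimination sketch: reward responses that look like they
    came from the target policy more than responses from another policy."""
    score_target = rm(prompt, resp_target)   # scalar tensor per example
    score_other = rm(prompt, resp_other)
    # Logistic (Bradley-Terry-style) ranking loss over the policy labels.
    return -F.logsigmoid(score_target - score_other).mean()
```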
☆ From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
In an era of rampant misinformation, generating reliable news explanations is
vital, especially for under-represented languages like Hindi. Lacking robust
automated tools, Hindi faces challenges in scaling misinformation detection. To
bridge this gap, we propose a novel framework integrating Direct Preference
Optimization (DPO) with curriculum learning to align machine-generated
explanations with human reasoning. Fact-checked explanations from credible
sources serve as preferred responses, while LLM outputs highlight system
limitations and serve as non-preferred responses. To refine task-specific
alignment, we introduce two key parameters -- Actuality and Finesse -- into the
DPO loss function, enhancing explanation quality and consistency. Experiments
with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's
effectiveness in generating coherent, contextually relevant explanations. This
scalable approach combats misinformation and extends automated explanation
generation to low-resource languages.
☆ OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model
Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Empathetic interaction is a cornerstone of human-machine communication, as it
requires understanding speech enriched with paralinguistic cues and generating
emotional and expressive responses. However, the most powerful empathetic large
speech language models (LSLMs) are increasingly closed off, leaving crucial details about
the architecture, data and development opaque to researchers. Given the
critical need for transparent research into the LSLMs and empathetic behavior,
we present OpenS2S, a fully open-source, transparent and end-to-end LSLM
designed to enable empathetic speech interactions. Based on our empathetic
speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved
decoding architecture to achieve low-latency speech generation. To facilitate
end-to-end training, OpenS2S incorporates an automated data construction
pipeline that synthesizes diverse, high-quality empathetic speech dialogues at
low cost. By leveraging large language models to generate empathetic content
and controllable text-to-speech systems to introduce speaker and emotional
variation, we construct a scalable training corpus with rich paralinguistic
diversity and minimal human supervision. We release the fully open-source
OpenS2S model, including the dataset, model weights, pre-training and
fine-tuning code, to empower the broader research community and accelerate
innovation in empathetic speech systems. The project webpage can be accessed at
https://casia-lm.github.io/OpenS2S
comment: Technical Report
☆ Critiques of World Models
The World Model, the supposed algorithmic surrogate of the real-world
environment that biological agents experience and act upon, has been an
emerging topic in recent years because of the rising need to develop virtual
agents with artificial (general) intelligence. There has been much debate on what a
world model really is, how to build it, how to use it, and how to evaluate it.
In this essay, starting from the imagination in the famed Sci-Fi classic Dune,
and drawing inspiration from the concept of "hypothetical thinking" in
psychology literature, we offer critiques of several schools of thought on
world modeling, and argue that the primary goal of a world model is to simulate
all actionable possibilities of the real world for purposeful reasoning and
acting. Building on the critiques, we propose a new architecture for a
general-purpose world model, based on hierarchical, multi-level, and mixed
continuous/discrete representations, and a generative and self-supervised
learning framework, with an outlook of a Physical, Agentic, and Nested (PAN)
AGI system enabled by such a model.
☆ InfoSteer: Steering Information Utility in Language Model Post-Training
Recent advancements in language models (LMs) gradually ushered in an era
where post-training is crucial. Yet, post-training approaches such as
supervised fine-tuning (SFT) do not guarantee effective use of knowledge
acquired during pretraining. We therefore present InfoSteer, a lightweight method
that encourages parametric information utilization in LMs during post-training.
This is achieved by treating the FFN layer as an associative key-value memory
and promoting the use of stored memory vectors via forward-pass interventions or
regularization during backpropagation. We find that this simple guidance during
the post-training phase delivers consistent performance improvements across diverse
model families (including Qwen, Gemma, and Llama), spanning over 15 downstream
tasks in both ID and OOD evaluations. Beyond performance gains, we also find
that steered LMs can adaptively allocate information, placing more emphasis on
generating semantically meaningful tokens, while using fewer resources on
simple transition ones (e.g., `,' or `and'). Our work underscores that vanilla
post-training does not fully leverage pre-training potential, and steering LMs
in latent representation space offers a promising approach that enhances both
performance and interpretability.
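As a loose illustration of what a forward-pass intervention on FFN memory could look like, the sketch below registers a PyTorch hook that rescales the activations entering an FFN down-projection; this is an editorial example under assumed module names (ffn_down_proj is hypothetical), not the paper's method.

```python
def add_ffn_boost_hook(ffn_down_proj, boost=1.1):
    """Scale the hidden activations feeding the FFN down-projection,
    nudging the model to draw more strongly on its stored key-value
    'memories' during the forward pass."""
    def hook(module, inputs):
        (hidden,) = inputs
        return (hidden * boost,)   # returned tuple replaces the module's input
    return ffn_down_proj.register_forward_pre_hook(hook)

# handle = add_ffn_boost_hook(model.layers[10].mlp.down_proj)  # hypothetical module path
# ... run forward passes ...
# handle.remove()
```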
☆ AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models
Large Language Models (LLMs) possess an extraordinary capability to produce
text that is not only coherent and contextually relevant but also strikingly
similar to human writing. They adapt to various styles and genres, producing
content that is both grammatically correct and semantically meaningful.
Recently, LLMs have been misused to create highly realistic phishing emails,
spread fake news, generate code to automate cyber crime, and write fraudulent
scientific articles. Additionally, in many real-world applications, the
generated content, including its style and topic, and the generator model are not
known beforehand. The increasing prevalence and sophistication of artificial
intelligence (AI)-generated texts have made their detection progressively more
challenging. Various attempts have been made to distinguish machine-generated
text from human-authored content using linguistic, statistical, machine
learning, and ensemble-based approaches. This work focuses on two primary
objectives: Task-A, which involves distinguishing human-written text from
machine-generated text, and Task-B, which attempts to identify the specific LLM
model responsible for the generation. Both of these tasks are based on
fine-tuning of the Generative Pre-trained Transformer (GPT-4o-mini), Large Language
Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from
Transformers (BERT). The fine-tuned GPT-4o-mini and BERT models achieved
accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
comment: 7 pages, 3 figures
☆ Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
Learning Japanese vocabulary is a challenge for learners from Roman alphabet
backgrounds due to script differences. Japanese combines syllabaries like
hiragana with kanji, which are logographic characters of Chinese origin. Kanji
are especially challenging to learn due to their structural complexity and sheer volume. Keyword mnemonics are
a common strategy to aid memorization, often using the compositional structure
of kanji to form vivid associations. Despite recent efforts to use large
language models (LLMs) to assist learners, existing methods for LLM-based
keyword mnemonic generation function as a black box, offering limited
interpretability. We propose a generative framework that explicitly models the
mnemonic construction process as driven by a set of common rules, and learns
them using a novel Expectation-Maximization-type algorithm. Trained on
learner-authored mnemonics from an online platform, our method learns latent
structures and compositional rules, enabling interpretable and systematic
mnemonic generation. Experiments show that our method performs well in the
cold-start setting for new learners while providing insight into the mechanisms
behind effective mnemonic creation.
☆ SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
Item (question) difficulties play a crucial role in educational assessments,
enabling accurate and efficient assessment of student abilities and
personalization to maximize learning outcomes. Traditionally, estimating item
difficulties can be costly, requiring real students to respond to items,
followed by fitting an item response theory (IRT) model to get item difficulty
estimates. Moreover, this approach cannot be applied in the cold-start setting
for previously unseen items. In this work, we present SMART (Simulated
Students Aligned with IRT), a novel method for aligning simulated students with
instructed ability, which can then be used in simulations to predict the
difficulty of open-ended items. We achieve this alignment using direct
preference optimization (DPO), where we form preference pairs based on how
likely responses are under a ground-truth IRT model. We perform a simulation by
generating thousands of responses, evaluating them with an LLM-based scoring
model, and fitting the resulting data to an IRT model to obtain item difficulty
estimates. Through extensive experiments on a real-world student response
dataset, we show that SMART outperforms other item difficulty prediction
methods by leveraging its improved ability alignment.
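To ground the alignment idea, here is a small editorial sketch (not the authors' code) of a one-parameter IRT response model and of forming a DPO preference pair by comparing how likely two simulated responses are under it; the helper names and scoring convention are assumptions.

```python
import math

def irt_prob_correct(ability, difficulty):
    """1PL (Rasch) item response model: P(correct | ability, difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def dpo_preference_pair(ability, difficulty, resp_a, resp_b, score_a, score_b):
    """Prefer the simulated response whose scored correctness (1 or 0) is
    more consistent with the ground-truth IRT model for this student/item."""
    p = irt_prob_correct(ability, difficulty)
    lik_a = p if score_a == 1 else (1.0 - p)
    lik_b = p if score_b == 1 else (1.0 - p)
    chosen, rejected = (resp_a, resp_b) if lik_a >= lik_b else (resp_b, resp_a)
    return {"chosen": chosen, "rejected": rejected}
```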
☆ An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques ACSA
Large Language Models (LLMs) continue to advance natural language processing
with their ability to generate human-like text across a range of tasks. Despite
the remarkable success of LLMs in Natural Language Processing (NLP), their
performance in text summarization across various domains and datasets has not
been comprehensively evaluated. At the same time, the ability to summarize text
effectively without relying on extensive training data has become a crucial
bottleneck. To address these issues, we present a systematic evaluation of six
LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog),
and ArXiv (scientific). By leveraging prompt engineering techniques including
zero-shot and in-context learning, our study evaluates the performance using
the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference
times is conducted to better understand the trade-off between summarization
quality and computational efficiency. For long documents, we introduce a
sentence-based chunking strategy that enables LLMs with shorter context windows
to summarize extended inputs in multiple stages. The findings reveal that while
LLMs perform competitively on news and dialog tasks, their performance on long
scientific documents improves significantly when aided by chunking strategies.
In addition, notable performance variations were observed based on model
parameters, dataset properties, and prompt design. These results offer
actionable insights into how different LLMs behave across task types,
contributing to ongoing research in efficient, instruction-based NLP systems.
comment: This manuscript is an extended version of the work accepted for
publication in the International Journal of Advanced Computer Science and
Applications (IJACSA), Volume 16, Issue 6, June 2025
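A sentence-based chunking strategy of the kind described can be sketched as follows; this is an illustrative implementation with an assumed token budget, not the paper's exact procedure.

```python
import re

def chunk_by_sentences(text, max_tokens=1024):
    """Greedily pack whole sentences into chunks that fit a model's context
    window; the token count is approximated by whitespace-separated words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk is summarized independently, and the partial summaries are then
# concatenated or summarized again in a second stage.
```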
☆ Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Historical documents represent an invaluable cultural heritage, yet have
undergone significant degradation over time through tears, water erosion, and
oxidation. Existing Historical Document Restoration (HDR) methods primarily
focus on single modality or limited-size restoration, failing to meet practical
needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel
automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and
6,543 synthetic images with character-level and line-level locations, as well
as character annotations in different damage grades. AutoHDR mimics historians'
restoration workflows through a three-stage approach: OCR-assisted damage
localization, vision-language context text prediction, and patch autoregressive
appearance restoration. The modular architecture of AutoHDR enables seamless
human-machine collaboration, allowing for flexible intervention and
optimization at each restoration stage. Experiments demonstrate AutoHDR's
remarkable performance in HDR. When processing severely damaged documents, our
method improves OCR accuracy from 46.83% to 84.05%, with further enhancement
to 94.25% through human-machine collaboration. We believe this work represents
a significant advancement in automated historical document restoration and
contributes substantially to cultural heritage preservation. The model and
dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
☆ AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics SC
Biomedical datasets often contain a large sample imbalance and are subject to
strict privacy constraints, which together hinder the development of accurate
machine learning models. One potential solution is to generate synthetic
images, as this can improve data availability while preserving patient privacy.
However, it remains difficult to generate synthetic images of sufficient
quality for training robust classifiers. In this work, we focus on the
classification of single white blood cells, a key component in the diagnosis of
hematological diseases such as acute myeloid leukemia (AML), a severe blood
cancer. We demonstrate how synthetic images, generated with a stable diffusion
model fine-tuned using LoRA weights and guided by real few-shot samples of the
target white blood cell classes, can enhance classifier performance when data
is limited. When training a ResNet classifier, accuracy increased from 27.3% to
78.4% (+51.1%) by adding 5,000 synthetic images per class to a small and
highly imbalanced real dataset. For a CLIP-based classifier, the accuracy
improved from 61.8% to 76.8% (+15.0%). The synthetic images are highly
similar to real images, and they can help overcome dataset limitations,
enhancing model generalization. Our results establish synthetic images as a
tool in biomedical research, improving machine learning models, and
facilitating medical diagnosis and research.
comment: 8 pages, 6 figures, 2 tables. Final Degree Project (TFG) submitted at
ESCI-UPF and conducted at Helmholtz Munich
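A rough sketch of the described generation setup is shown below; the pipeline calls are standard diffusers APIs, while the LoRA weight path and prompt are hypothetical placeholders and the base checkpoint is only an example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion checkpoint and attach fine-tuned LoRA weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/wbc_lora_weights")  # hypothetical LoRA checkpoint

# Generate synthetic single-cell images for one white blood cell class.
images = pipe(
    prompt="microscopy image of a single neutrophil, blood smear",
    num_images_per_prompt=8,
    num_inference_steps=30,
).images
for i, img in enumerate(images):
    img.save(f"synthetic_neutrophil_{i}.png")
```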
☆ Verified Language Processing with Hybrid Explainability: A Technical Report
The volume and diversity of digital information have led to a growing
reliance on Machine Learning techniques, such as Natural Language Processing,
for interpreting and accessing appropriate data. While vector and graph
embeddings represent data for similarity tasks, current state-of-the-art
pipelines lack guaranteed explainability, failing to determine similarity for
given full texts accurately. These considerations can also be applied to
classifiers exploiting generative language models with logical prompts, which
fail to correctly distinguish between logical implication, indifference, and
inconsistency, despite being explicitly trained to recognise the first two
classes. We present a novel pipeline designed for hybrid explainability to
address this. Our methodology combines graphs and logic to produce First-Order
Logic representations, creating machine- and human-readable representations
through Montague Grammar. Preliminary results indicate the effectiveness of
this approach in accurately capturing full text similarity. To the best of our
knowledge, this is the first approach to differentiate between implication,
inconsistency, and indifference for text classification tasks. To address the
limitations of existing approaches, we use three self-contained datasets
annotated for the former classification task to determine the suitability of
these approaches in capturing sentence structure equivalence, logical
connectives, and spatiotemporal reasoning. We also use these data to compare
the proposed method with language models pre-trained for detecting sentence
entailment. The results show that the proposed method outperforms
state-of-the-art models, indicating that natural language understanding cannot
be easily generalised by training over extensive document corpora. This work
offers a step toward more transparent and reliable Information Retrieval from
extensive textual data.
☆ Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt
ClassificaTion), a novel mixed-initiative annotation framework that integrates
human expertise with automatic annotation guided by large language models
(LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset
provided by a domain expert, then leverages the LLM to annotate the data and
identify edge cases that are not well described by the initial codebook.
Specifically, Co-DETECT flags challenging examples, induces high-level,
generalizable descriptions of edge cases, and assists the user in incorporating
edge case handling rules to improve the codebook. This iterative process
enables more effective handling of nuanced phenomena through compact,
generalizable annotation rules. An extensive user study, together with
qualitative and quantitative analyses, demonstrates the effectiveness of Co-DETECT.
☆ Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search RecSys 2025
Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia
Pre-trained language models (PLMs) are widely used to derive semantic
representations from item metadata in recommendation and search. In sequential
recommendation, PLMs enhance ID-based embeddings through textual metadata,
while in product search, they align item characteristics with user intent.
Recent studies suggest that task- and domain-specific fine-tuning is needed to
improve representational power. This paper challenges this assumption, showing
that Generalist Text Embedding Models (GTEs), pre-trained on large-scale
corpora, can deliver strong zero-shot performance without specialized
adaptation. Our experiments demonstrate that GTEs outperform traditional and
fine-tuned models in both sequential recommendation and product search. We
attribute this to their superior representational power, as they distribute
features more evenly across the embedding space. Finally, we show that
compressing embedding dimensions by focusing on the most informative directions
(e.g., via PCA) effectively reduces noise and improves the performance of
specialized models. To ensure reproducibility, we provide our repository at
https://split.to/gte4ps.
comment: Accepted as a Short Paper at RecSys 2025
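The PCA-based compression mentioned above can be illustrated in a few lines of scikit-learn; the embedding matrix and target dimensionality below are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# item_embeddings: (n_items, d) matrix of GTE item representations (placeholder data).
item_embeddings = np.random.randn(10_000, 768).astype(np.float32)

# Keep only the most informative directions, discarding noisier ones.
pca = PCA(n_components=128)
compressed = pca.fit_transform(item_embeddings)

# Downstream retrieval or recommendation then scores query-item pairs on the
# compressed vectors, e.g., with cosine similarity.
print(compressed.shape, pca.explained_variance_ratio_.sum())
```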
☆ From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity
to operate according to internal rules without external control. Accordingly,
autonomous vehicles (AuVs) are defined as systems capable of perceiving their
environment and executing preprogrammed tasks independently of external input.
However, both research and real-world deployments increasingly showcase
vehicles that demonstrate behaviors beyond this definition (including the SAE
levels 0 to 5), such as interaction with humans and machines, goal adaptation,
contextual reasoning, external tool use, and long-term planning, particularly
with the integration of large language models (LLMs) and agentic AI systems.
These developments reveal a conceptual gap between technical autonomy and the
broader cognitive and social capabilities needed for future human-centered
mobility systems. To address this, we introduce the concept of agentic vehicles
(AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and
interact within complex environments. This paper presents a systems-level
framework to characterize AgVs, focusing on their cognitive and communicative
layers and differentiating them from conventional AuVs. It synthesizes relevant
advances in agentic AI, robotics, multi-agent systems, and human-machine
interaction, and highlights how agentic AI, through high-level reasoning and
tool use, can function not merely as a computational tool but as an interactive
agent embedded in mobility ecosystems. The paper concludes by identifying key
challenges in the development and governance of AgVs, including safety,
real-time control, public acceptance, ethical alignment, and regulatory
frameworks.
☆ Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models ICLR 2025
In the broader context of deep learning, Multimodal Large Language Models
have achieved significant breakthroughs by leveraging powerful Large Language
Models as a backbone to align different modalities into the language space. A
prime exemplification is the development of Video Large Language Models
(Video-LLMs). While numerous advancements have been proposed to enhance the
video understanding capabilities of these models, they are predominantly
trained on questions generated directly from video content. However, in
real-world scenarios, users often pose questions that extend beyond the
informational scope of the video, highlighting the need for Video-LLMs to
assess the relevance of the question. We demonstrate that even the
best-performing Video-LLMs fail to reject unfit questions, not necessarily due
to a lack of video understanding, but because they have not been trained to
identify and refuse such questions. To address this limitation, we propose
alignment for answerability, a framework that equips Video-LLMs with the
ability to evaluate the relevance of a question based on the input video and
appropriately decline to answer when the question exceeds the scope of the
video, as well as an evaluation framework with a comprehensive set of metrics
designed to measure model behavior before and after alignment. Furthermore, we
present a pipeline for creating a dataset specifically tailored for alignment
for answerability, leveraging existing video-description paired datasets.
comment: ICLR 2025
☆ ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
The generative capabilities of Large Language Models (LLMs) are rapidly
expanding from static code to dynamic, interactive visual artifacts. This
progress is bottlenecked by a critical evaluation gap: established benchmarks
focus on algorithmic correctness and are blind to the visual fidelity and
interactive integrity that define modern user experiences. To bridge this gap,
we introduce ArtifactsBench, a new benchmark and paradigm for the automated,
multimodal evaluation of visual code generation. Our framework programmatically
renders each generated artifact and captures its dynamic behavior through
temporal screenshots. This visual evidence, alongside the source code, is then
assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a
fine-grained, per-task checklist to ensure holistic and reproducible scoring.
We construct a new benchmark of 1,825 diverse tasks and evaluate over 30
leading LLMs. Our automated evaluation achieves a striking 94.4% ranking
consistency with WebDev Arena, the gold standard for human preference in web
development, and over 90% pairwise agreement with human experts. This
establishes ArtifactsBench as the first framework to reliably automate the
assessment of human-perceived quality at scale. Our analysis provides a
high-resolution map of the current SOTA, revealing that generalist models often
outperform domain-specific ones. We open-source ArtifactsBench, including the
benchmark, evaluation harness, and baseline results at
https://artifactsbenchmark.github.io/, to provide the community with a scalable
and accurate tool to accelerate the development of user-centric generative
models.
☆ Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
Despite remarkable progress in image quality and prompt fidelity,
text-to-image (T2I) diffusion models continue to exhibit persistent
"hallucinations", where generated content subtly or significantly diverges from
the intended prompt semantics. While often regarded as unpredictable artifacts,
we argue that these failures reflect deeper, structured misalignments within
the generative process. In this work, we propose a cognitively inspired
perspective that reinterprets hallucinations as trajectory drift within a
latent alignment space. Empirical observations reveal that generation unfolds
within a multiaxial cognitive tension field, where the model must continuously
negotiate competing demands across three critical axes: semantic coherence,
structural alignment, and knowledge grounding. We then formalize this
three-axis space as the \textbf{Hallucination Tri-Space} and introduce the
Alignment Risk Code (ARC): a dynamic vector representation that quantifies
real-time alignment tension during generation. The magnitude of ARC captures
overall misalignment, its direction identifies the dominant failure axis, and
its imbalance reflects tension asymmetry. Based on this formulation, we develop
the TensionModulator (TM-ARC): a lightweight controller that operates entirely
in latent space. TM-ARC monitors ARC signals and applies targeted,
axis-specific interventions during the sampling process. Extensive experiments
on standard T2I benchmarks demonstrate that our approach significantly reduces
hallucination without compromising image quality or diversity. This framework
offers a unified and interpretable approach for understanding and mitigating
generative failures in diffusion-based T2I systems.
comment: 12 pages, 6 figures, 4 tables
☆ ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding
While Multimodal Large Language Models (MLLMs) have achieved remarkable
progress in open-ended visual question answering, they remain vulnerable to
hallucinations: outputs that contradict or misrepresent input semantics, posing
a critical challenge to reliability and factual consistency. Existing methods
often rely on external verification or post-hoc
correction, lacking an internal mechanism to validate outputs directly during
training. To bridge this gap, we propose ReLoop, a unified closed-loop training
framework that encourages multimodal consistency for cross-modal understanding
in MLLMs. ReLoop adopts a ring-shaped structure that integrates three
complementary consistency feedback mechanisms, obliging MLLMs to "see twice
and think backwards". Specifically, ReLoop employs the frozen Consistency
Feedback Plugin (CFP), comprising semantic reconstruction, visual description,
and an attention supervision module for attention alignment. These components
collectively enforce semantic reversibility, visual consistency, and
interpretable attention, enabling the model to correct its outputs during
training. Extensive evaluations and analyses demonstrate the effectiveness of
ReLoop in reducing hallucination rates across multiple benchmarks, establishing
a robust method for hallucination mitigation in MLLMs. We will release our
source code and data in the camera-ready version.
comment: 8 pages,6 figures,5 tables
☆ SIGIR 2025 -- LiveRAG Challenge Report
The LiveRAG Challenge at SIGIR 2025, held between March and May 2025,
provided a competitive platform for advancing Retrieval-Augmented Generation
(RAG) technologies. Participants from academia and industry were invited to
develop a RAG-based question-answering system using a fixed corpus
(Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal
was to facilitate challenging comparisons of retrieval and prompting
strategies. During the Live Challenge Day, 70 teams from 27 different countries
provided answers and supportive information to 500 unseen questions within a
strict two-hour time window. Evaluation was conducted in two stages: first, an
automated LLM-as-a-judge approach was used to compute correctness and
faithfulness scores, and then a manual review of top-ranked submissions was
conducted. The finalists were announced on June 12, 2025, with prizes awarded
during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.
comment: 9 pages, 5 tables
☆ O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine
Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati
This paper presents the work carried out by the O_FT team, a collaboration between Orange
and Ouest-France, on adapting language models to the defense domain as part of
the EvalLLM2025 challenge. This work focused on adapting the
\texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued
pre-training and instruction-tuning. The core of our efforts is based on
collecting, generating, and selecting data for these two stages as well as for
model evaluation. Experiments show that our adapted models have better
domain-specific knowledge and improved domain-specific task processing skills,
along with comparable (or even superior) performance on general knowledge and
skills. Considering the carbon footprint of our adaptations, this work
demonstrates the feasibility of domain adaptation for relatively small models.
--
Ce document présente les travaux réalisés par l'équipe O_FT conjointe
à Orange et Ouest-France sur l'adaptation de modèles de langue au domaine
de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont
concentrés sur l'adaptation du modèle \texttt{Mistral-7B-Instruct-v0.3}
avec des techniques classiques de poursuite du pré-entraînement et
d'affinage sur instructions. L'essentiel de nos travaux a porté sur la
constitution, génération et sélection de données pour ces deux étapes
ainsi que pour l'évaluation des modèles. Les expériences montrent que nos
modèles adaptés ont de meilleures connaissances de fond et une meilleure
capacité de traitement de tâches sur le domaine de la défense, ainsi que
des performances comparables (voire supérieures) sur des connaissances ou
capacités généralistes. Mis au regard des empreintes carbones de nos
adaptations, ces travaux démontrent ainsi la viabilité de l'adaptation à
un domaine de modèles relativement petits.
comment: 22 pages + 10 pages appendices, in French language
☆ MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction
Accident severity prediction plays a critical role in transportation safety
systems but is a persistently difficult task due to incomplete data, strong
feature dependencies, and severe class imbalance in which rare but
high-severity cases are underrepresented and hard to detect. Existing methods
often rely on monolithic models or black box prompting, which struggle to scale
in noisy, real-world settings and offer limited interpretability. To address
these challenges, we propose MARBLE, a multi-agent rule-based LLM engine that
decomposes the severity prediction task across a team of specialized reasoning
agents, including an interchangeable ML-backed agent. Each agent focuses on a
semantic subset of features (e.g., spatial, environmental, temporal), enabling
scoped reasoning and modular prompting without the risk of prompt saturation.
Predictions are coordinated through either rule-based or LLM-guided consensus
mechanisms that account for class rarity and confidence dynamics. The system
retains structured traces of agent-level reasoning and coordination outcomes,
supporting in-depth interpretability and post-hoc performance diagnostics.
Across both UK and US datasets, MARBLE consistently outperforms traditional
machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning
methods, including Chain-of-Thought (CoT), Least-to-Most (L2M), and
Tree-of-Thought (ToT), achieving nearly 90% accuracy where others plateau below
48%. This performance redefines the practical ceiling for accident severity
classification under real world noise and extreme class imbalance. Our results
position MARBLE as a generalizable and interpretable framework for reasoning
under uncertainty in safety-critical applications.
comment: 13 pages, 5 figures
☆ Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Understanding the locus of semantic representation in large language models
(LLMs) is crucial for interpretability and architectural innovation. The
dominant paradigm posits that trainable input embeddings serve as foundational
"meaning vectors." This paper challenges that view. We construct Transformer
models where the embedding layer is entirely frozen, with vectors derived not
from data, but from the visual structure of Unicode glyphs. These non-semantic,
precomputed visual embeddings are fixed throughout training. Our method is
compatible with any tokenizer, including a novel Unicode-centric tokenizer we
introduce to ensure universal text coverage. Despite the absence of trainable,
semantically initialized embeddings, our models converge, generate coherent
text, and, critically, outperform architecturally identical models with
trainable embeddings on the MMLU reasoning benchmark. We attribute this to
"representational interference" in conventional models, where the embedding
layer is burdened with learning both structural and semantic features. Our
results indicate that high-level semantics are not inherent to input embeddings
but are an emergent property of the Transformer's compositional architecture
and data scale. This reframes the role of embeddings from meaning containers to
structural primitives. We release all code and models to foster further
research.
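A minimal sketch of how such visual embeddings could be precomputed is shown below; the rendering resolution, font file, and toy vocabulary are editorial assumptions, not the paper's exact pipeline.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_embedding(char, size=16, font_path="DejaVuSans.ttf"):
    """Render one Unicode character to a small grayscale bitmap and flatten
    it into a fixed, non-trainable embedding vector (assumes the font exists)."""
    img = Image.new("L", (size, size), color=0)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size - 2)
    draw.text((0, 0), char, fill=255, font=font)
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

# Build a frozen embedding table for a toy vocabulary of code points.
vocab = [chr(cp) for cp in range(0x20, 0x7F)]
embedding_table = np.stack([glyph_embedding(c) for c in vocab])  # shape (|V|, 256)
# The table is loaded into the model's embedding layer and kept frozen during training.
```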
☆ Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions SIGDIAL 2025
Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos
We consider open-retrieval conversational question answering (OR-CONVQA), an
extension of question answering where system responses need to be (i) aware of
dialog history and (ii) grounded in documents (or document fragments) retrieved
per question. Domain-specific OR-CONVQA training datasets are crucial for
real-world applications, but hard to obtain. We propose a pipeline that
capitalizes on the abundance of plain text documents in organizations (e.g.,
product documentation) to automatically produce realistic OR-CONVQA dialogs
with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we
generate in-dialog question-answer pairs, self-contained (decontextualized,
e.g., no referring expressions) versions of user questions, and propositions
(sentences expressing prominent information from the documents) the system
responses are grounded in. We show how the synthetic dialogs can be used to
train efficient question rewriters that decontextualize user questions,
allowing existing dialog-unaware retrievers to be utilized. The retrieved
information and the decontextualized question are then passed on to an LLM that
generates the system's response.
comment: Accepted at SIGDIAL 2025
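For instance, a question rewriter of the kind described could be prompted roughly as follows; the template, field names, and example are hypothetical illustrations of decontextualization, not the paper's prompts.

```python
def build_rewrite_prompt(dialog_history, question):
    """Ask an instruction-tuned LLM to make the latest user question
    self-contained by resolving referring expressions from the dialog."""
    history = "\n".join(f"{turn['role']}: {turn['text']}" for turn in dialog_history)
    return (
        "Rewrite the final user question so that it can be understood without "
        "the dialog history. Resolve all pronouns and referring expressions.\n\n"
        f"Dialog history:\n{history}\n\n"
        f"Final question: {question}\n"
        "Self-contained question:"
    )

# Example: "How much does it cost?" might become
# "How much does the Pro subscription cost?"; the rewritten question is then
# passed to a dialog-unaware retriever.
```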
☆ Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite
This article presents the experiments and results obtained by the GRESEL team
in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past.
Three types of experiments were conducted with the dual aim of participating in
the task and enabling comparisons across different approaches. These included
the use of a web-based OCR service, a traditional OCR engine, and a compact
multimodal model. All experiments were run on consumer-grade hardware, which,
despite lacking high-performance computing capacity, provided sufficient
storage and stability. The results, while satisfactory, leave room for further
improvement. Future work will focus on exploring new techniques and ideas using
the Spanish-language dataset provided by the shared task, in collaboration with
Biblioteca Nacional de España (BNE).
comment: This paper was written as part of a shared task organized within the
2025 edition of the Iberian Languages Evaluation Forum (IberLEF 2025), held
at SEPLN 2025 in Zaragoza. This paper describes the joint participation of
two teams in said competition, GRESEL1 and GRESEL2, each with an individual
paper that will be published in CEUR
☆ Grahak-Nyay: Consumer Grievance Redressal through Large Language Models
Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Access to consumer grievance redressal in India is often hindered by
procedural complexity, legal jargon, and jurisdictional challenges. To address
this, we present Grahak-Nyay (Justice-to-Consumers), a chatbot that
streamlines the process using open-source Large Language Models (LLMs) and
Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities
through a concise and up-to-date knowledge base. We introduce three novel
datasets: GeneralQA (general consumer law), SectoralQA (sector-specific
knowledge) and SyntheticQA (for RAG evaluation), along with NyayChat, a dataset
of 300 annotated chatbot conversations. We also introduce Judgments data sourced
from Indian Consumer Courts to aid the chatbot in decision making and to enhance
user trust. Finally, we propose HAB metrics (Helpfulness, Accuracy, Brevity) to
evaluate chatbot performance. Legal domain experts validated
Grahak-Nyay's effectiveness. Code and datasets will be released.
☆ Dialogue-Based Multi-Dimensional Relationship Extraction from Novels NLPCC2025
Relation extraction is a crucial task in natural language processing, with
broad applications in knowledge graph construction and literary analysis.
However, the complex context and implicit expressions in novel texts pose
significant challenges for automatic character relationship extraction. This
study focuses on relation extraction in the novel domain and proposes a method
based on Large Language Models (LLMs). By incorporating relationship dimension
separation, dialogue data construction, and contextual learning strategies, the
proposed method enhances extraction performance. Leveraging dialogue structure
information, it improves the model's ability to understand implicit
relationships and demonstrates strong adaptability in complex contexts.
Additionally, we construct a high-quality Chinese novel relation extraction
dataset to address the lack of labeled resources and support future research.
Experimental results show that our method outperforms traditional baselines
across multiple evaluation metrics and successfully facilitates the automated
construction of character relationship networks in novels.
comment: The paper has been accepted by NLPCC2025. 12 pages, 5 figures, 5
tables
☆ Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems
Task-oriented dialogue (TOD) systems facilitate goal-driven interactions
between users and machines. While recent advances in deep learning have
improved performance, TOD systems often struggle in low-resource scenarios
with limited labeled data. To address this challenge, we propose Spec-TOD, a
novel framework designed to train an end-to-end TOD system with limited data.
Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end
TOD framework that incorporates explicit task instructions for
instruction-tuned large language models (LLMs), and (ii) an efficient training
strategy that leverages lightweight, specialized LLMs to achieve strong
performance with minimal supervision. Experiments on the MultiWOZ dataset, a
widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive
results while significantly reducing the need for labeled data. These findings
highlight the potential of the proposed framework in advancing efficient and
effective TOD systems in low-resource settings.
comment: Accepted at SIGdial 2025
☆ From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
The task of describing video content in natural language is commonly referred
to as video captioning. Unlike conventional video captions, which are typically
brief and widely available, long-form paragraph descriptions in natural
language are scarce. This limitation of current datasets is due to the
expensive human manual annotation required and to the highly challenging task
of explaining the language formation process from the perspective of the
underlying story, as a complex system of interconnected events in space and
time. Through a thorough analysis of recently published methods and available
datasets, we identify a general lack of published resources dedicated to the
problem of describing videos in complex language, beyond the level of
descriptions in the form of enumerations of simple captions. Furthermore, while
state-of-the-art methods produce impressive results on the task of generating
shorter captions from videos by direct end-to-end learning between the videos
and text, the problem of explaining the relationship between vision and
language is still beyond our reach. In this work, we propose a shared
representation between vision and language, based on graphs of events in space
and time, which can be obtained in an explainable and analytical way, to
integrate and connect multiple vision tasks to produce the final natural
language description. Moreover, we also demonstrate how our automated and
explainable video description generation process can function as a fully
automatic teacher to effectively train direct, end-to-end neural student
pathways, within a self-supervised neuro-analytical system. We validate that
our explainable neuro-analytical approach generates coherent, rich and relevant
textual descriptions on videos collected from multiple varied datasets, using
standard evaluation metrics, human annotations, and consensus from ensembles of
state-of-the-art VLMs.
comment: arXiv admin note: text overlap with arXiv:2501.08460
☆ A Survey of Pun Generation: Datasets, Evaluations and Methodologies
Pun generation seeks to creatively modify linguistic elements in text to
produce humour or evoke double meanings. It also aims to preserve coherence and
contextual appropriateness, making it useful in creative writing and
entertainment across various media and contexts. Although pun generation has
received considerable attention in computational linguistics, there is
currently no dedicated survey that systematically reviews this specific area.
To bridge this gap, this paper provides a comprehensive review of pun
generation datasets and methods across different stages, including conventional
approaches, deep learning techniques, and pre-trained language models.
Additionally, we summarise both automated and human evaluation metrics used to
assess the quality of pun generation. Finally, we discuss the research
challenges and propose promising directions for future work.
☆ Reason to Rote: Rethinking Memorization in Reasoning
Large language models readily memorize arbitrary training instances, such as
label noise, yet they perform strikingly well on reasoning tasks. In this work,
we investigate how language models memorize label noise, and why such
memorization in many cases does not heavily affect generalizable reasoning
capabilities. Using two controllable synthetic reasoning datasets with noisy
labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we
discover a reliance of memorization on generalizable reasoning mechanisms:
models continue to compute intermediate reasoning outputs even when retrieving
memorized noisy labels, and intervening on the reasoning process adversely affects
memorization. We further show that memorization operates through distributed
encoding, i.e., aggregating various inputs and intermediate results, rather
than building a look-up mechanism from inputs to noisy labels. Moreover, our
FDA case study reveals memorization occurs via outlier heuristics, where
existing neuron activation patterns are slightly shifted to fit noisy labels.
Together, our findings suggest that memorization of label noise in language
models builds on, rather than overrides, the underlying reasoning mechanisms,
shedding light on the intriguing phenomenon of benign memorization.
comment: 21 pages, 14 figures
☆ ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems
Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao
Large Language Models (LLMs) have shown impressive performance in domains
such as mathematics and programming, yet their capabilities in physics remain
underexplored and poorly understood. Physics poses unique challenges that
demand not only precise computation but also deep conceptual understanding and
physical modeling skills. Existing benchmarks often fall short due to limited
difficulty, multiple-choice formats, and static evaluation settings that fail
to capture physical modeling ability. In this paper, we introduce
ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs'
physical reasoning and generalization capabilities. ABench-Physics consists of
two components: Phy_A, a static set of 400 graduate- or Olympiad-level
problems; and Phy_B, a dynamic subset of 100 problems equipped with an
automatic variation engine to test model robustness across changing conditions.
All questions require precise numerical answers, with strict formatting and
tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals
substantial performance gaps, highlighting persistent limitations in physical
reasoning, especially in generalization to dynamic variants. ABench-Physics
provides a challenging and diagnostic framework for advancing scientific
reasoning in LLMs.
☆ CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Personalized text generation has become crucial for adapting language models
to diverse and evolving users' personal context across cultural, temporal, and
contextual dimensions. While existing methods often rely on centralized
fine-tuning or static preference alignment, they struggle to achieve real-time
adaptation under resource constraints inherent to personal devices. This
limitation creates a dilemma: large cloud-based models lack access to localized
user-specific information, while small on-device models cannot match the
generation quality of their cloud counterparts. To address this dichotomy, we
present CoSteer, a novel collaborative framework that enables decoding-time
personalization through localized delta steering. Our key insight lies in
leveraging the logits difference between personal context-aware and -agnostic
outputs from local small models as steering signals for cloud-based LLMs.
Specifically, we formulate token-level optimization as an online learning
problem, where local delta vectors dynamically adjust the remote LLM's logits
within the on-device environment. This approach preserves privacy by
transmitting only the final steered tokens rather than raw data or intermediate
vectors, while maintaining cloud-based LLMs' general capabilities without
fine-tuning. Through comprehensive experiments on various personalized
generation tasks, we demonstrate that CoSteer effectively assists LLMs in
generating personalized content by leveraging locally stored user profiles and
histories, ensuring privacy preservation through on-device data processing
while maintaining acceptable computational overhead.
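A hedged sketch of the delta-steering rule described in the abstract: the difference between the local model's persona-aware and persona-agnostic logits shifts the cloud model's logits before the next token is chosen (the random arrays and the scaling factor alpha are placeholders; the online-learning update is omitted).

```python
import numpy as np

def steer_next_token(cloud_logits, local_aware_logits, local_agnostic_logits, alpha=1.0):
    # Delta between personalized and non-personalized local predictions.
    delta = local_aware_logits - local_agnostic_logits
    steered = cloud_logits + alpha * delta       # shift cloud logits toward the persona
    return int(np.argmax(steered))               # greedy choice of the steered token id

vocab_size = 50_000
rng = np.random.default_rng(0)
token_id = steer_next_token(rng.normal(size=vocab_size),
                            rng.normal(size=vocab_size),
                            rng.normal(size=vocab_size))
print(token_id)
```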
☆ LLMs as Architects and Critics for Multi-Source Opinion Summarization
Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion
summarization by incorporating additional sources of product metadata such as
descriptions, key features, specifications, and ratings, alongside reviews.
This integration results in comprehensive summaries that capture both
subjective opinions and objective product attributes essential for informed
decision-making. While Large Language Models (LLMs) have shown significant
success in various Natural Language Processing (NLP) tasks, their potential in
M-OS remains largely unexplored. Additionally, the lack of evaluation datasets
for this task has impeded further advancements. To bridge this gap, we
introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion
summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness,
aspect coverage, sentiment consistency, and specificity. Our results demonstrate
that M-OS significantly enhances user engagement: in a user study, on average,
87% of participants preferred M-OS over traditional opinion summaries,
indicating that factually enriched summaries drive engagement. Notably,
M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an
average Spearman correlation of $\rho$ = 0.74, which surpasses the performance
of previous methodologies.
☆ A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic
Judeo-Arabic refers to Arabic variants historically spoken by Jewish
communities across the Arab world, primarily during the Middle Ages. Unlike
standard Arabic, it is written in Hebrew script by Jewish writers and for
Jewish audiences. Transliterating Judeo-Arabic into Arabic script is
challenging due to ambiguous letter mappings, inconsistent orthographic
conventions, and frequent code-switching into Hebrew and Aramaic. In this
paper, we introduce a two-step approach to automatically transliterate
Judeo-Arabic into Arabic script: simple character-level mapping followed by
post-correction to address grammatical and orthographic errors. We also present
the first benchmark evaluation of LLMs on this task. Finally, we show that
transliteration enables Arabic NLP tools to perform morphosyntactic tagging and
machine translation, which would not have been feasible on the original texts.
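A toy illustration of the first step, character-level mapping from Hebrew to Arabic script; the mapping table is partial and illustrative, and the post-correction stage is only a placeholder comment.

```python
# Partial, illustrative Hebrew-to-Arabic character map (not the paper's table).
CHAR_MAP = {
    "א": "ا", "ב": "ب", "ג": "ج", "ד": "د", "ה": "ه",
    "ו": "و", "כ": "ك", "ל": "ل", "מ": "م", "ם": "م",
    "נ": "ن", "ס": "س", "ר": "ر", "ש": "ش", "ת": "ت",
}

def transliterate(judeo_arabic_text: str) -> str:
    # Step 1: naive character-level mapping; unknown characters pass through.
    mapped = "".join(CHAR_MAP.get(ch, ch) for ch in judeo_arabic_text)
    # Step 2 (placeholder): a post-correction model would fix grammatical and
    # orthographic errors in `mapped` before downstream Arabic NLP tools run.
    return mapped

print(transliterate("כתאב"))  # -> "كتاب" ("book"), before post-correction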
☆ Word stress in self-supervised speech models: A cross-linguistic comparison
In this paper we study word stress representations learned by self-supervised
speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M
representations of word stress for five different languages: Three languages
with variable or lexical stress (Dutch, English and German) and two languages
with fixed or demarcative stress (Hungarian and Polish). We train diagnostic
stress classifiers on S3M embeddings and show that they can distinguish between
stressed and unstressed syllables in read-aloud short sentences with high
accuracy. We also test language-specificity effects of the S3M word stress
representations. The results indicate that these representations are
language-specific, with a greater difference for the set of variable-stress
languages than for the set of fixed-stress languages.
comment: Accepted to Interspeech 2025
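A minimal sketch of the diagnostic-classifier setup: a linear probe trained on frozen, syllable-pooled embeddings (random features stand in for actual Wav2vec 2.0 activations).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for Wav2vec 2.0 hidden states mean-pooled over each syllable;
# in the real setup these come from forwarding audio through the model.
n_syllables, dim = 200, 768
X = rng.normal(size=(n_syllables, dim))
y = rng.integers(0, 2, size=n_syllables)      # 1 = stressed, 0 = unstressed

# Diagnostic (probing) classifier: a simple linear model on frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("held-out accuracy:", probe.score(X[150:], y[150:]))
```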
☆ "This Suits You the Best": Query Focused Comparative Explainable Summarization
Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Product recommendations inherently involve comparisons, yet traditional
opinion summarization often fails to provide holistic comparative insights. We
propose the novel task of generating Query-Focused Comparative Explainable
Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address
the lack of query-focused recommendation datasets, we introduce MS-Q2P,
comprising 7,500 queries mapped to 22,500 recommended products with metadata.
We leverage Large Language Models (LLMs) to generate tabular comparative
summaries with query-specific explanations. Our approach is personalized,
privacy-preserving, recommendation engine-agnostic, and category-agnostic. M-OS
as an intermediate step reduces inference latency by approximately 40% compared
to the direct input approach (DIA), which processes raw data directly. We
evaluate open-source and proprietary LLMs for generating and assessing QF-CES.
Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity,
faithfulness, informativeness, format adherence, and query relevance) showed an
average Spearman correlation of 0.74 with human judgments, indicating its
potential for QF-CES evaluation.
☆ LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework
Long-context processing has become a fundamental capability for large
language models (LLMs). To assess models' long-context performance, numerous
long-context evaluation benchmarks have been proposed. However, variations in
evaluation settings across these benchmarks lead to inconsistent results,
making it difficult to draw reliable comparisons. Moreover, the high
computational cost of long-context evaluation poses a significant barrier for
the community to conduct comprehensive assessments of long-context models. In
this paper, we propose LOOM-Scope, a comprehensive and efficient framework for
long-context evaluation. LOOM-Scope standardizes evaluation settings across
diverse benchmarks, supports deployment of efficient long-context inference
acceleration methods, and introduces a holistic yet lightweight benchmark suite
to evaluate models comprehensively. Homepage: https://loomscope.github.io
☆ Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce
Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Customer reviews on e-commerce platforms capture critical affective signals
that drive purchasing decisions. However, no existing research has explored the
joint task of emotion detection and explanatory span identification in
e-commerce reviews - a crucial gap in understanding what triggers customer
emotional responses. To bridge this gap, we propose a novel joint task unifying
Emotion detection and Opinion Trigger extraction (EOT), which explicitly models
the relationship between causal text spans (opinion triggers) and affective
dimensions (emotion categories) grounded in Plutchik's theory of 8 primary
emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated
collection of 2,400 reviews with fine-grained emotions and opinion triggers. We
evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured
prompting framework with systematic reasoning and self-reflection. Our
framework surpasses zero-shot and chain-of-thought techniques across e-commerce
domains.
comment: 23 pages, 11 figures, 7 tables. Dataset and code will be made
publicly available
☆ XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL
Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
To leverage the advantages of LLM in addressing challenges in the Text-to-SQL
task, we present XiYan-SQL, an innovative framework effectively generating and
utilizing multiple SQL candidates. It consists of three components: 1) a Schema
Filter module filtering and obtaining multiple relevant schemas; 2) a
multi-generator ensemble approach generating multiple high-quality and diverse
SQL queries; 3) a selection model with a candidate reorganization strategy
implemented to obtain the optimal SQL query. Specifically, for the
multi-generator ensemble, we employ a multi-task fine-tuning strategy to
enhance the capabilities of SQL generation models for the intrinsic alignment
between SQL and text, and construct multiple generation models with distinct
generation styles by fine-tuning across different SQL formats. The experimental
results and comprehensive analysis demonstrate the effectiveness and robustness
of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63%
on the notable BIRD benchmark, surpassing all previous methods. It also attains
SOTA performance on the Spider test set with an accuracy of 89.65%.
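A schematic sketch of the three-stage flow described above; the schema filter, generators, and selection model below are trivial placeholders, not the paper's components.

```python
def schema_filter(question, full_schema):
    # Placeholder: keep only tables judged relevant to the question.
    return [t for t in full_schema if t["table"].lower() in question.lower()]

def generate_candidates(question, schema, generators):
    # Each generator is assumed to be fine-tuned toward a different SQL style.
    return [gen(question, schema) for gen in generators]

def select_best(question, candidates, selector):
    # Placeholder selection model: score every candidate and keep the best.
    return max(candidates, key=lambda sql: selector(question, sql))

def multi_generator_pipeline(question, full_schema, generators, selector):
    schema = schema_filter(question, full_schema)
    candidates = generate_candidates(question, schema, generators)
    return select_best(question, candidates, selector)

# Toy demo with trivial stand-ins.
schema = [{"table": "orders"}, {"table": "users"}]
generators = [lambda q, s: "SELECT COUNT(*) FROM orders;",
              lambda q, s: "SELECT count(*) FROM orders"]
selector = lambda q, sql: len(sql)           # dummy scoring function
print(multi_generator_pipeline("how many orders?", schema, generators, selector))
```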
☆ R1-RE: Cross-Domain Relationship Extraction with RLVR
Relationship extraction (RE) is a core task in natural language processing.
Traditional approaches typically frame RE as a supervised learning problem,
directly mapping context to labels -- an approach that often suffers from poor
out-of-domain (OOD) generalization. Inspired by the workflow of human
annotators, we reframe RE as a reasoning task guided by annotation guidelines
and introduce R1-RE, the first reinforcement learning with verifiable reward
(RLVR) framework for RE tasks. Our method elicits the reasoning abilities of
small language models for annotation tasks, resulting in significantly improved
OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a
private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of
approximately 70%, on par with leading proprietary models such as GPT-4o.
Additionally, our comprehensive analysis provides novel insights into the
training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
comment: 14 pages, 7 figures
☆ Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework KDD 2025
In the era of mobile computing, deploying efficient Natural Language
Processing (NLP) models in resource-restricted edge settings presents
significant challenges, particularly in environments requiring strict privacy
compliance, real-time responsiveness, and diverse multi-tasking capabilities.
These challenges create a fundamental need for ultra-compact models that
maintain strong performance across various NLP tasks while adhering to
stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT
framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently
compresses models through a comprehensive pipeline including hard token
pruning, cross-distillation and parameter quantization. Specifically, the
cross-distillation method uniquely positions the teacher model to understand
the student model's perspective, ensuring efficient knowledge transfer through
parameter integration and the mutual interplay between models. Through
extensive experiments, we achieve a remarkably compact BERT-based model of only
1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks.
This ultra-compact model has been successfully deployed across multiple
scenarios within the Alipay ecosystem, demonstrating significant improvements
in real-world applications. For example, it has been integrated into Alipay's
live Edge Recommendation system since January 2024, currently serving the app's
recommendation traffic across \textbf{8.4 million daily active devices}.
comment: Accepted by KDD 2025
☆ Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs
Large Language Models (LLMs) are powerful yet prone to generating factual
errors, commonly referred to as hallucinations. We present a lightweight,
interpretable framework for knowledge-aware self-correction of LLM outputs
using structured memory graphs based on RDF triples. Without retraining or
fine-tuning, our method post-processes model outputs and corrects factual
inconsistencies via external semantic memory. We demonstrate the approach using
DistilGPT-2 and show promising results on simple factual prompts.
comment: 8 pages, 4 figures
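A minimal sketch of the kind of post-hoc check described above: the model's answer is compared against a small store of RDF-style triples and replaced when it contradicts memory (the triples and matching logic are illustrative, not the paper's).

```python
# Tiny in-memory "graph" of (subject, predicate) -> object facts.
MEMORY = {
    ("Eiffel Tower", "located_in"): "Paris",
    ("Water", "chemical_formula"): "H2O",
}

def check_and_correct(subject: str, predicate: str, model_answer: str) -> str:
    expected = MEMORY.get((subject, predicate))
    if expected is not None and expected.lower() != model_answer.lower():
        # Factual inconsistency detected: replace the answer with the stored fact.
        return expected
    return model_answer

print(check_and_correct("Eiffel Tower", "located_in", "Berlin"))  # -> "Paris"
```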
☆ Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments
When a news article describes immigration as an "economic burden" or a
"humanitarian crisis," it selectively emphasizes certain aspects of the issue.
Although \textit{framing} shapes how the public interprets such issues,
audiences do not absorb frames passively but actively reorganize the presented
information. While this relationship between source content and audience
response is well-documented in the social sciences, NLP approaches often ignore
it, detecting frames in articles and responses in isolation. We present the
first computational framework for large-scale analysis of framing across source
content (news articles) and audience responses (reader comments).
Methodologically, we refine frame labels and develop a framework that
reconstructs dominant frames in articles and comments from sentence-level
predictions, and aligns articles with topically relevant comments. Applying our
framework across eleven topics and two news outlets, we find that frame reuse
in comments correlates highly across outlets, while topic-specific patterns
vary. We release a frame classifier that performs well on both articles and
comments, a dataset of article and comment sentences manually labeled for
frames, and a large-scale dataset of articles and comments with predicted frame
labels.
☆ PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Large language model (LLM) personalization aims to align model outputs with
individuals' unique preferences and opinions. While recent efforts have
implemented various personalization methods, a unified theoretical framework
that can systematically understand the drivers of effective personalization is
still lacking. In this work, we integrate the well-established cognitive
dual-memory model into LLM personalization, by mirroring episodic memory to
historical user engagements and semantic memory to long-term, evolving user
beliefs. Specifically, we systematically investigate memory instantiations and
introduce a unified framework, PRIME, using episodic and semantic memory
mechanisms. We further augment PRIME with a novel personalized thinking
capability inspired by the slow thinking strategy. Moreover, recognizing the
absence of suitable benchmarks, we introduce a dataset using Change My View
(CMV) from Reddit, specifically designed to evaluate long-context
personalization. Extensive experiments validate PRIME's effectiveness across
both long- and short-context scenarios. Further analysis confirms that PRIME
effectively captures dynamic personalization beyond mere popularity biases.
☆ VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
Multimodal embedding models have been crucial in enabling various downstream
tasks such as semantic similarity, information retrieval, and clustering over
different modalities. However, existing multimodal embeddings like VLM2Vec,
E5-V, GME are predominantly focused on natural images, with limited support for
other visual forms such as videos and visual documents. This restricts their
applicability in real-world scenarios, including AI agents, multi-modal search
and recommendation, and retrieval-augmented generation (RAG). To close this
gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across
diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark
that extends MMEB with five new task types: visual document retrieval, video
retrieval, temporal grounding, video classification and video question
answering - spanning text, image, video, and visual document inputs. Next, we
train VLM2Vec-V2, a general-purpose embedding model that supports text, image,
video, and visual document inputs. Extensive experiments show that VLM2Vec-V2
achieves strong performance not only on the newly introduced video and document
retrieval tasks, but also improves over prior baselines on the original image
benchmarks. Through extensive evaluation, our study offers insights into the
generalizability of various multimodal embedding models and highlights
effective strategies for unified embedding learning, laying the groundwork for
more scalable and adaptable representation learning in both research and
real-world settings.
comment: Technical Report
♻ ☆ Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? ACL 2025
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Charts are ubiquitous as they help people understand and reason with data.
Recently, various downstream tasks, such as chart question answering,
chart2text, and fact-checking, have emerged. Large Vision-Language Models
(LVLMs) show promise in tackling these tasks, but their evaluation is costly
and time-consuming, limiting real-world deployment. While using LVLMs as judges
to assess the chart comprehension capabilities of other LVLMs could streamline
evaluation processes, challenges like proprietary datasets, restricted access
to powerful models, and evaluation costs hinder their adoption in industrial
settings. To this end, we present a comprehensive evaluation of 13 open-source
LVLMs as judges for diverse chart comprehension and reasoning tasks. We design
both pairwise and pointwise evaluation tasks covering criteria like factual
correctness, informativeness, and relevancy. Additionally, we analyze LVLM
judges based on format adherence, positional consistency, length bias, and
instruction-following. We focus on cost-effective LVLMs (<10B parameters)
suitable for both research and commercial use, following a standardized
evaluation protocol and rubric to measure the LVLM judge's accuracy.
Experimental results reveal notable variability: while some open LVLM judges
achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4
judgments), others struggle (below ~10% agreement). Our findings highlight that
state-of-the-art open-source LVLMs can serve as cost-effective automatic
evaluators for chart-related tasks, though biases such as positional preference
and length bias persist.
comment: Accepted at ACL 2025 Industry Track
♻ ☆ Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Many use cases require retrieving smaller portions of text, and dense
vector-based retrieval systems often perform better with shorter text segments,
as the semantics are less likely to be over-compressed in the embeddings.
Consequently, practitioners often split text documents into smaller chunks and
encode them separately. However, chunk embeddings created in this way can lose
contextual information from surrounding chunks, resulting in sub-optimal
representations. In this paper, we introduce a novel method called late
chunking, which leverages long context embedding models to first embed all
tokens of the long text, with chunking applied after the transformer model and
just before mean pooling -- hence the term "late" in its name. The resulting
chunk embeddings capture the full contextual information, leading to superior
results across various retrieval tasks. The method is generic enough to be
applied to a wide range of long-context embedding models and works without
additional training. To further increase the effectiveness of late chunking, we
propose a dedicated fine-tuning approach for embedding models.
comment: 11 pages, 3rd draft
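A minimal sketch of the late-chunking idea under stated assumptions: token embeddings for the full document are produced first, and mean pooling is applied per chunk afterwards (random vectors stand in for the long-context model's outputs).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 1000, 384
token_embeddings = rng.normal(size=(n_tokens, dim))   # stand-in for model outputs

# Chunk boundaries (token indices), applied *after* encoding the whole document.
boundaries = [(0, 300), (300, 650), (650, 1000)]

late_chunks = np.stack([token_embeddings[s:e].mean(axis=0) for s, e in boundaries])
print(late_chunks.shape)   # (3, 384): one contextualized embedding per chunk
```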
♻ ☆ OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Can large language models (LLMs) accurately simulate the next web action of a
specific user? While LLMs have shown promising capabilities in generating
``believable'' human behaviors, evaluating their ability to mimic real user
behaviors remains an open challenge, largely due to the lack of high-quality,
publicly available datasets that capture both the observable actions and the
internal reasoning of an actual human user. To address this gap, we introduce
OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected
from real human participants during online shopping sessions. OPERA is the
first public dataset that comprehensively captures: user personas, browser
observations, fine-grained web actions, and self-reported just-in-time
rationales. We developed both an online questionnaire and a custom browser
plugin to gather this dataset with high fidelity. Using OPERA, we establish the
first benchmark to evaluate how well current LLMs can predict a specific user's
next action and rationale given their persona and interaction history. This
dataset lays the groundwork for future research into LLM agents that aim to act
as personalized digital twins for humans.
♻ ☆ The Super Weight in Large Language Models
Recent works have shown a surprising result: a small fraction of Large
Language Model (LLM) parameter outliers are disproportionately important to the
quality of the model. LLMs contain billions of parameters, so these small
fractions, such as 0.01%, translate to hundreds of thousands of parameters. In
this work, we present an even more surprising finding: Pruning as few as a
single parameter can destroy an LLM's ability to generate text -- increasing
perplexity by 3 orders of magnitude and reducing zero-shot accuracy to
guessing. We propose a data-free method for identifying such parameters, termed
super weights, using a single forward pass through the model. We additionally
find that these super weights induce correspondingly rare and large activation
outliers, termed super activations. When preserved with high precision, super
activations can improve simple round-to-nearest quantization to become
competitive with state-of-the-art methods. For weight quantization, we
similarly find that by preserving the super weight and clipping other weight
outliers, round-to-nearest quantization can scale to much larger block sizes
than previously considered. To facilitate further research into super weights,
we provide an index of super weight coordinates for common, openly available
LLMs.
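An illustrative sketch of the single-parameter pruning probe: zero one scalar weight and re-run evaluation (a toy linear layer replaces the LLM, and the coordinate is hypothetical).

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM weight matrix; in the paper the target is a specific
# coordinate inside one transformer layer's projection weight.
layer = nn.Linear(16, 16, bias=False)
coord = (3, 7)   # hypothetical "super weight" coordinate (row, column)

with torch.no_grad():
    print("before pruning:", layer.weight[coord].item())
    layer.weight[coord] = 0.0            # prune a single scalar parameter
    print("after pruning: ", layer.weight[coord].item())

# In the real experiment one would now re-evaluate perplexity and zero-shot
# accuracy and compare them with the unpruned model.
```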
♻ ☆ jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding
model that unifies text and image representations through a novel architecture
supporting both single-vector and multi-vector embeddings in the late
interaction style. The model incorporates task-specific Low-Rank Adaptation
(LoRA) adapters to optimize performance across diverse retrieval scenarios,
including query-document retrieval, semantic text similarity, and code search.
Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves
state-of-the-art performance on both single-modal and cross-modal retrieval
tasks, with particular strength in processing visually rich content such as
tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of
this capability, we also introduce Jina-VDR, a novel benchmark specifically
designed for visually rich image retrieval.
comment: 22 pages, 1-10 main, 14-22 experimental results, benchmark tables
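A small sketch of late-interaction (multi-vector) scoring of the kind the abstract mentions, using the standard MaxSim formulation; random vectors stand in for query and document token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(8, 128))     # one vector per query token
doc_vecs = rng.normal(size=(50, 128))      # one vector per document token

# Normalize so dot products are cosine similarities.
query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# MaxSim: each query vector takes its best match in the document; scores are summed.
score = (query_vecs @ doc_vecs.T).max(axis=1).sum()
print(score)
```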
♻ ☆ On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, Tomas Pfister
Agentic AI workflows (systems that autonomously plan and act) are becoming
widespread, yet their task success rate on complex tasks remains low. A
promising solution is inference-time alignment, which uses extra compute at
test time to improve performance. Inference-time alignment relies on three
components: sampling, evaluation, and feedback. While most prior work studies
sampling and automatic evaluation, feedback remains underexplored. To study the
role of feedback, we introduce Iterative Agent Decoding (IAD), a procedure that
repeatedly inserts feedback extracted from different forms of critiques (reward
models or AI-generated textual feedback) between decoding steps. Through IAD,
we analyze feedback along four dimensions: (1) its role in the accuracy-compute
trade-offs with limited inference budget, (2) quantifying the gains over
diversity-only baselines such as best-of-N sampling, (3) effectiveness of
composing feedback from reward models versus textual critique, and (4)
robustness to noisy or low-quality feedback. Across Sketch2Code, Text2SQL,
Intercode, and WebShop, we show that IAD, with proper integration of
high-fidelity feedback, leads to consistent gains of up to 10 percentage points
in absolute performance over baselines such as best-of-N. Our findings
underscore feedback as a crucial knob for inference-time alignment of agentic
AI workflows with limited inference budget.
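A schematic, hypothetical sketch of the decode-and-feedback loop described above; the agent, critic, and scoring below are trivial stand-ins, not the authors' implementation.

```python
def iterative_agent_decoding(task, agent, critic, n_rounds=3):
    """Repeatedly decode, collect feedback, and condition the next attempt on it."""
    feedback = None
    best_output, best_score = None, float("-inf")
    for _ in range(n_rounds):
        output = agent(task, feedback)           # decode, conditioned on prior feedback
        score, feedback = critic(task, output)   # reward-model score + textual critique
        if score > best_score:
            best_output, best_score = output, score
    return best_output

# Toy demo with trivial stand-ins.
agent = lambda task, fb: (fb or "draft") + " +fix"
critic = lambda task, out: (len(out), f"revise: {out}")
print(iterative_agent_decoding("demo task", agent, critic))
```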
♻ ☆ Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals
Large language models (LLMs) have demonstrated significant success in complex
reasoning tasks such as math and coding. In contrast to these tasks where
deductive reasoning predominates, inductive reasoning -- the ability to derive
general rules from incomplete evidence -- remains underexplored. This paper
investigates extended inductive reasoning in LLMs through the lens of
personalized preference inference, a critical challenge in LLM alignment where
current approaches struggle to capture diverse user preferences. The task
demands strong inductive reasoning capabilities as user preferences are
typically embedded implicitly across various interaction forms, requiring
models to synthesize consistent preference patterns from scattered signals. We
propose AlignXplore, a model that leverages extended reasoning chains to enable
systematic preference inference from behavioral signals in users' interaction
histories. Such explicit preference articulation enables efficient streaming
inference: when new behavioral signals emerge, the model can directly build
upon previously inferred preference descriptions rather than reprocessing
historical signals from scratch, while also supporting iterative refinement to
the inferred preferences. We develop AlignXplore by combining cold-start
training based on synthetic data with subsequent online reinforcement learning.
Through extensive experiments, we demonstrate that AlignXplore achieves
substantial improvements over the backbone model by an average of 15.49\% on
in-domain and out-of-domain benchmarks, while maintaining strong generalization
ability across different input formats and downstream models. Further analyses
establish best practices for preference inference learning through systematic
comparison of reward modeling strategies, while revealing the emergence of
human-like inductive reasoning patterns during training.
♻ ☆ Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Effective conversational agents like large language models (LLMs) must
personalize their interactions to adapt to user preferences, personalities, and
attributes across diverse domains like education and healthcare. Current
methods like Reinforcement Learning from Human Feedback (RLHF), often
prioritize helpfulness and safety but fall short in fostering truly empathetic,
adaptive, and personalized dialogues. Existing personalization approaches
typically rely on extensive user history, limiting their effectiveness for new
or context-limited users. To address these limitations, we propose leveraging a
user model to incorporate a curiosity-based intrinsic reward into multi-turn
RLHF. This novel reward mechanism encourages the LLM agent to actively infer
user traits by optimizing conversations to improve its user model's accuracy.
Consequently, the agent delivers more personalized interactions by learning
more about the user. We demonstrate our method's effectiveness in two distinct
domains: significantly improving personalization performance in a
conversational recommendation task, and personalizing conversations for
different learning styles in an educational setting. We show improved
generalization capabilities compared to traditional multi-turn RLHF, all while
maintaining conversation quality. Our method offers a promising solution for
creating more personalized, adaptive, and engaging conversational agents.
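A minimal sketch of a curiosity-style intrinsic reward of the kind described above: the agent is rewarded by how much the latest turn improves the user model's trait prediction (the toy user model and trait are illustrative).

```python
class ToyUserModel:
    """Guesses a single binary trait from keywords in the dialogue so far."""
    def accuracy(self, turns, true_traits):
        predicted = {"likes_scifi": any("sci-fi" in t.lower() for t in turns)}
        return float(predicted["likes_scifi"] == true_traits["likes_scifi"])

def curiosity_reward(user_model, history, new_turn, true_traits):
    # Intrinsic reward = gain in trait-prediction accuracy after the new turn.
    return (user_model.accuracy(history + [new_turn], true_traits)
            - user_model.accuracy(history, true_traits))

user_model = ToyUserModel()
history = ["Hi! What should I read next?"]
print(curiosity_reward(user_model, history,
                       "I mostly read sci-fi novels.",
                       {"likes_scifi": True}))   # -> 1.0
```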
♻ ☆ Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Recent advancements in large language models have sparked interest in
utilizing them to aid the peer review process of scientific publication amid
the peer review crisis. However, having AI models generate full reviews in the
same way as human reviewers risks exacerbating the irresponsible use of
LLM-generated reviews. As an alternative, we propose adopting LLMs as
manuscript quality checkers. We introduce several baseline approaches and an
extendable automatic evaluation framework using top reasoning LLMs as judges to
tackle the difficulty of recruiting domain experts for manual evaluation.
Utilizing papers withdrawn from arXiv, we validated our proposed methods with
several leading reasoning LLMs from multiple vendors and assessed their
performance and API costs for identifying critical errors and unsoundness
problems in scientific papers. o3 exhibited the best problem identification
performance among all models at a modest cost. This paper provides insights
into document-based scientific understanding/reasoning and lays a foundation
for future applications. Our dataset, code, and model outputs are publicly
available.
comment: Add results from new experiments; update discussion and GitHub link
♻ ☆ NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge
Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish, Shammur Absar Chowdhury
The rapid advancement of large language models (LLMs) has raised concerns
about cultural bias, fairness, and their applicability in diverse linguistic
and underrepresented regional contexts. To enhance and benchmark the
capabilities of LLMs, there is a need to develop large-scale resources focused
on multilingual, local, and cultural contexts. In this study, we propose the
NativQA framework, which can seamlessly construct large-scale, culturally and
regionally aligned QA datasets in native languages. The framework utilizes
user-defined seed queries and leverages search engines to collect
location-specific, everyday information. It has been evaluated across 39
locations in 24 countries and in 7 languages -- ranging from extremely
low-resource to high-resource languages -- resulting in over 300K
Question-Answer (QA) pairs. The developed resources can be used for LLM
benchmarking and further fine-tuning. The framework has been made publicly
available for the community (https://gitlab.com/nativqa/nativqa-framework).
comment: LLMs, Native, Multilingual, Language Diversity, Contextual
Understanding, Minority Languages, Culturally Informed, Foundation Models,
Large Language Models
♻ ☆ SEPSIS: I Can Catch Your Lies -- A New Paradigm for Deception Detection ACL
Anku Rani, Dwip Dalal, Shreya Gautam, Pankaj Gupta, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Deception is the intentional practice of twisting information. It is a
nuanced societal practice deeply intertwined with human societal evolution,
characterized by a multitude of facets. This research explores the problem of
deception through the lens of psychology, employing a framework that
categorizes deception into three forms: lies of omission, lies of commission,
and lies of influence. The primary focus of this study is specifically on
investigating only lies of omission. We propose a novel framework for deception
detection leveraging NLP techniques. We curated an annotated dataset of 876,784
samples by amalgamating a popular large-scale fake news dataset and scraped
news headlines from the Twitter handle of the Times of India, a well-known
Indian news media house. Each sample has been labeled with four layers, namely:
(i) the type of omission (speculation, bias, distortion, sounds factual, and
opinion), (ii) the color of the lie (black, white, etc.), (iii) the intention of
the lie (to influence, etc.), and (iv) the topic of the lie (political,
educational, religious, etc.). We present a novel multi-task learning pipeline
that leverages
the dataless merging of fine-tuned language models to address the deception
detection task mentioned earlier. Our proposed model achieved an F1 score of
0.87, demonstrating strong performance across all layers, including the type,
color, intent, and topic aspects of deceptive content. Finally, our research
explores the relationship between lies of omission and propaganda techniques.
To accomplish this, we conducted an in-depth analysis, uncovering compelling
findings. For instance, our analysis revealed a significant correlation between
loaded language and opinion, shedding light on their interconnectedness. To
encourage further research in this field, we are releasing the SEPSIS dataset
and code at https://huggingface.co/datasets/ankurani/deception.
comment: ACL SRW 2025
♻ ☆ Language Models can Self-Improve at State-Value Estimation for Better Search
Collecting ground-truth rewards or human demonstrations for multi-step
reasoning tasks is often prohibitively expensive and time consuming, especially
in interactive domains like web tasks. To address this bottleneck, we present
self-taught lookahead (STL), a self-supervised method that leverages
state-transition dynamics to improve a value model capable of effectively
guiding language model-controlled search without any labeled data. We find that
moderately sized (8 billion parameters) open-weight value models improved with
STL can match the performance of using a gpt-4o value model. Furthermore, we
find that specialized value models learned with STL can be deployed with
computationally lightweight search algorithms, achieving performance that
matches that of more expensive tree search methods, while reducing costs by an
order of magnitude.
♻ ☆ End-to-End Evaluation for Low-Latency Simultaneous Speech Translation EMNLP 2023
Christian Huber, Tu Anh Dinh, Carlos Mullov, Ngoc Quan Pham, Thai Binh Nguyen, Fabian Retkowski, Stefan Constantin, Enes Yavuz Ugan, Danni Liu, Zhaolin Li, Sai Koneru, Jan Niehues, Alexander Waibel
The challenge of low-latency speech translation has recently drawn significant
interest in the research community as shown by several publications and shared
tasks. Therefore, it is essential to evaluate these different approaches in
realistic scenarios. However, currently only specific aspects of the systems
are evaluated and often it is not possible to compare different approaches.
In this work, we propose the first framework to perform and evaluate the
various aspects of low-latency speech translation under realistic conditions.
The evaluation is carried out in an end-to-end fashion. This includes the
segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation
using this framework. We evaluate models with the option to revise the output
as well as methods with fixed output. Furthermore, we directly compare
state-of-the-art cascaded as well as end-to-end systems. Finally, the framework
automatically evaluates both translation quality and latency, and provides a web
interface to show the low-latency model outputs to the user.
comment: Demo paper at EMNLP 2023
♻ ☆ Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information
Knowledge tracing models have enabled a range of intelligent tutoring systems
to provide feedback to students. However, existing methods for knowledge
tracing in learning sciences are predominantly reliant on statistical data and
instructor-defined knowledge components, making it challenging to integrate
AI-generated educational content with traditional established methods. We
propose a method for automatically extracting knowledge components from
educational content using instruction-tuned large multimodal models. We
validate this approach by comprehensively evaluating it against knowledge
tracing benchmarks in five domains. Our results indicate that the automatically
extracted knowledge components can effectively replace human-tagged labels,
offering a promising direction for enhancing intelligent tutoring systems in
limited-data scenarios, achieving more explainable assessments in educational
settings, and laying the groundwork for automated assessment.
comment: Accepted to Educational Data Mining 2025
♻ ☆ Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) hold promise in automating data analysis tasks,
yet open-source models face significant limitations in these kinds of
reasoning-intensive scenarios. In this work, we investigate strategies to
enhance the data analysis capabilities of open-source LLMs. By curating a seed
dataset of diverse, realistic scenarios, we evaluate models across three
dimensions: data understanding, code generation, and strategic planning. Our
analysis reveals three key findings: (1) Strategic planning quality serves as
the primary determinant of model performance; (2) Interaction design and task
complexity significantly influence reasoning capabilities; (3) Data quality
demonstrates a greater impact than diversity in achieving optimal performance.
We leverage these insights to develop a data synthesis methodology,
demonstrating significant improvements in open-source LLMs' analytical
reasoning capabilities.
comment: Work in progress
♻ ☆ Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science? SIGIR 2025
Automated detection of semantically equivalent questions in longitudinal
social science surveys is crucial for long-term studies informing empirical
research in the social, economic, and health sciences. Retrieving equivalent
questions faces dual challenges: inconsistent representation of theoretical
constructs (i.e. concept/sub-concept) across studies as well as between
question and response options, and the evolution of vocabulary and structure in
longitudinal text. To address these challenges, our multi-disciplinary
collaboration of computer scientists and survey specialists presents a new
information retrieval (IR) task of identifying concept (e.g. Housing, Job,
etc.) equivalence across question and response options to harmonise
longitudinal population studies. This paper investigates multiple unsupervised
approaches on a survey dataset spanning 1946-2020, including probabilistic
models, linear probing of language models, and pre-trained neural networks
specialised for IR. We show that IR-specialised neural models achieve the
highest overall performance with other approaches performing comparably.
Additionally, the re-ranking of the probabilistic model's results with neural
models only introduces modest improvements of 0.07 at most in F1-score.
Qualitative post-hoc evaluation by survey specialists shows that models
generally have a low sensitivity to questions with high lexical overlap,
particularly in cases where sub-concepts are mismatched. Altogether, our
analysis serves to further research on harmonising longitudinal studies in
social science.
comment: Accepted at SIGIR 2025
♻ ☆ Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes
Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert
With the rise of LLMs, ensuring model safety and alignment has become a
critical concern. While modern instruction-finetuned LLMs incorporate alignment
during training, they still frequently require moderation tools to prevent
unsafe behavior. The most common approach to moderation is the use of guard models that
flag unsafe inputs. However, guards require costly training and are typically
limited to fixed-size, pre-trained options, making them difficult to adapt to
evolving risks and resource constraints. We hypothesize that
instruction-finetuned LLMs already encode safety-relevant information
internally and explore training-free safety assessment methods that work with
off-the-shelf models. We show that simple prompting allows models to recognize
harmful inputs they would otherwise mishandle. We also demonstrate that safe
and unsafe prompts are distinctly separable in the models' latent space.
Building on this, we introduce the Latent Prototype Moderator (LPM), a
training-free moderation method that uses Mahalanobis distance in latent space
to assess input safety. LPM is a lightweight, customizable add-on that
generalizes across model families and sizes. Our method matches or exceeds
state-of-the-art guard models across multiple safety benchmarks, offering a
practical and flexible solution for scalable LLM moderation.
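A hedged sketch of Mahalanobis-distance moderation in latent space, in the spirit of LPM; random vectors stand in for the LLM's hidden states, and the prototype and covariance estimation is a simplified assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
safe_feats = rng.normal(loc=0.0, size=(500, dim))     # latent vectors of safe prompts
unsafe_feats = rng.normal(loc=1.5, size=(500, dim))   # latent vectors of unsafe prompts

# Class prototypes and a shared covariance estimated from calibration data.
mu_safe, mu_unsafe = safe_feats.mean(axis=0), unsafe_feats.mean(axis=0)
centered = np.vstack([safe_feats - mu_safe, unsafe_feats - mu_unsafe])
cov_inv = np.linalg.inv(np.cov(centered.T) + 1e-6 * np.eye(dim))

def mahalanobis(x, mu):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def is_unsafe(latent_vec):
    return mahalanobis(latent_vec, mu_unsafe) < mahalanobis(latent_vec, mu_safe)

print(is_unsafe(rng.normal(loc=1.5, size=dim)))   # likely True
```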
♻ ☆ CritiQ: Mining Data Quality Criteria from Human Preferences ACL 2025
Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Language models heavily depend on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introducing biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
~30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier-based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
comment: to be published in ACL 2025, Code is available at
https://github.com/KYLN24/CritiQ
♻ ☆ RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Reward Models, essential for guiding Large Language Model optimization, are
typically trained on fixed preference datasets, resulting in rigid alignment to
single, implicit preference distributions. This prevents adaptation to diverse
real-world needs-from conciseness in one task to detailed explanations in
another. The standard practice of collecting task-specific preference data and
retraining reward models is resource-intensive, often producing biased rewards,
and limits practical application. We introduce generalizable,
principle-following reward models. We propose that RMs should understand and
adhere to dynamically provided natural language specifications of reward
principles, similar to instruction-following in LLMs. To measure this
capability, we develop RABench, a comprehensive benchmark for RMs focusing on
generalization across diverse principles. Evaluations on RABench reveal poor
generalization of current RMs. As a solution, we present RewardAnything, a
novel RM designed and trained to explicitly follow natural language principles.
We achieve SotA performance with RewardAnything on traditional RM benchmarks
simply by specifying a well-defined principle, and results on RABench show that
it excels at adapting to novel principles without retraining. Furthermore,
RewardAnything integrates seamlessly with existing RLHF methods, and a case
study shows how to automatically and efficiently align LLMs with only natural
language principles.
comment: 25 pages, 9 figures, Code & model weights available at:
https://zhuohaoyu.github.io/RewardAnything
♻ ☆ BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance ACM MM 2025
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases
present in datasets, which cause pre-trained vision-language models to overlook
key details. To address this, we propose BiMa, a novel framework designed to
mitigate biases in both visual and textual representations. Our approach begins
by generating scene elements that characterize each video by identifying
relevant entities/objects and activities. For visual debiasing, we integrate
these scene elements into the video embeddings, enhancing them to emphasize
fine-grained and salient details. For textual debiasing, we introduce a
mechanism to disentangle text features into content and bias components,
enabling the model to focus on meaningful content while separately handling
biased information. Extensive experiments and ablation studies across five
major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo)
demonstrate the competitive performance of BiMa. Additionally, the model's bias
mitigation capability is consistently validated by its strong results on
out-of-distribution retrieval tasks.
comment: Accepted at ACM MM 2025
♻ ☆ Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
This paper presents our system for the MLC-SLM Challenge 2025, focusing on
multilingual speech recognition and language modeling with large language
models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with
efficient projector architectures and various decoder configurations. We employ
a three-stage training methodology that progressively optimizes the encoder,
projector, and LLM components. Our system achieves competitive performance with
a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6%
using the Qwen2.5-7B as decoder-only language model.
comment: Accepted to Interspeech MLCSLM-2025 Workshop
♻ ☆ Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
To ensure a balance between open access to justice and personal data
protection, the South Korean judiciary mandates the de-identification of court
judgments before they can be publicly disclosed. However, the current
de-identification process is inadequate for handling court judgments at scale
while adhering to strict legal requirements. Additionally, the legal
definitions and categorizations of personal identifiers are vague and not
well-suited for technical solutions. To tackle these challenges, we propose a
de-identification framework called Thunder-DeID, which aligns with relevant
laws and practices. Specifically, we (i) construct and release the first Korean
legal dataset containing annotated judgments along with corresponding lists of
entity mentions, (ii) introduce a systematic categorization of Personally
Identifiable Information (PII), and (iii) develop an end-to-end deep neural
network (DNN)-based de-identification pipeline. Our experimental results
demonstrate that our model achieves state-of-the-art performance in the
de-identification of court judgments.
♻ ☆ Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
The widespread dissemination of toxic content on social media poses a serious
threat to both online environments and public discourse, highlighting the
urgent need for detoxification methods that effectively remove toxicity while
preserving the original semantics. However, existing approaches often struggle
to simultaneously achieve strong detoxification performance, semantic
preservation, and robustness to out-of-distribution data. Moreover, they
typically rely on costly, manually annotated parallel corpora while showing
poor data efficiency. To address these challenges, we propose a two-stage
training framework that jointly optimizes for data efficiency, semantic
preservation, and model generalization. We first perform supervised fine-tuning
on a small set of high-quality, filtered parallel data to establish a strong
initialization. Then, we leverage unlabeled toxic inputs and a custom-designed
reward model to train the LLM using Group Relative Policy Optimization.
Experimental results demonstrate that our method effectively mitigates the
trade-offs faced by previous work, achieving state-of-the-art performance with
improved generalization and significantly reduced dependence on annotated data.
Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.
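A small sketch of the group-relative signal used in the second training stage: for each unlabeled toxic input, several rewrites are sampled, scored by the reward model, and normalized within the group to form advantages. The reward_model call is a hypothetical stand-in for the paper's custom-designed reward.

```python
import torch

def group_relative_advantages(rewards):
    # GRPO-style normalization: rewards for K sampled rewrites of the same toxic
    # input are centered and scaled within the group to form advantages.
    mean = rewards.mean()
    std = rewards.std().clamp_min(1e-6)
    return (rewards - mean) / std

# Hypothetical usage, assuming reward_model scores detoxification quality and
# semantic preservation for each sampled rewrite y_k of a toxic input x:
# rewards = torch.tensor([reward_model(x, y_k) for y_k in sampled_rewrites])
# advantages = group_relative_advantages(rewards)
```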
♻ ☆ MAIN: Mutual Alignment Is Necessary for instruction tuning
Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang
Instruction tuning has empowered large language models (LLMs) to achieve
remarkable performance, yet its success heavily depends on the availability of
large-scale, high-quality instruction-response pairs. To meet this demand,
various methods have been developed to synthesize data at scale. However,
current methods for scaling up data generation often overlook a crucial aspect:
the alignment between instructions and responses. We hypothesize that the
quality of instruction-response pairs is determined not by the individual
quality of each component, but by the degree of mutual alignment. To address
this, we propose the Mutual Alignment Framework (MAIN), which enforces coherence
between instructions and responses through mutual constraints. We demonstrate
that MAIN generalizes well across model architectures and sizes, achieving
state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse
benchmarks. This work underscores the critical role of instruction-response
alignment in enabling generalizable and high-quality instruction tuning for
LLMs.
♻ ☆ Markovian Transformers for Informative Language Modeling
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language
model's underlying decision process. We address this by making CoT text
causally essential in a "Markovian" language model, factoring next-token
prediction through an intermediate CoT and training it to predict future tokens
independently of the original prompt. We formalize this via an
"informativeness" objective that quantifies how much a trained CoT improves
next-token predictions over a baseline. Using policy gradient, we show that
Llama 3.1 8B achieves a 33.2% absolute accuracy improvement on GSM8K.
Perturbation tests confirm stronger reliance on the CoT, while cross-model
transfers indicate these reasoning traces generalize across interpreters. Our
approach enhances both accuracy and interpretability, potentially extending CoT
reasoning to arbitrarily long contexts and diverse tasks.
comment: 18 pages, 6 figures
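A sketch of the informativeness idea described above: score a CoT by how much it raises the likelihood of the target tokens relative to a baseline context, with the original prompt excluded from both conditions. It assumes a HuggingFace-style causal LM; the baseline choice and length normalization are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mean_answer_logprob(lm, context_ids, answer_ids):
    # Mean log p(answer | context) under a causal LM (single, unbatched example).
    ids = torch.cat([context_ids, answer_ids]).unsqueeze(0)
    logits = lm(ids).logits[0, :-1]               # position t predicts token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = ids[0, 1:]
    token_lp = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[-answer_ids.numel():].mean()

def informativeness_reward(lm, cot_ids, baseline_ids, answer_ids):
    # How much the CoT improves the answer's likelihood over a baseline context,
    # with the original prompt deliberately excluded from both conditions.
    return (mean_answer_logprob(lm, cot_ids, answer_ids)
            - mean_answer_logprob(lm, baseline_ids, answer_ids))
```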
♻ ☆ Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading
Grading handwritten, open-ended responses remains a major bottleneck in large
university STEM courses. We introduce Pensieve (https://www.pensieve.co), an
AI-assisted grading platform that leverages large language models (LLMs) to
transcribe and evaluate student work, providing instructors with rubric-aligned
scores, transcriptions, and confidence ratings. Unlike prior tools that focus
narrowly on specific tasks like transcription or rubric generation, Pensieve
supports the entire grading pipeline-from scanned student submissions to final
feedback-within a human-in-the-loop interface.
Pensieve has been deployed in real-world courses at over 20 institutions and
has graded more than 300,000 student responses. We present system details and
empirical results across four core STEM disciplines: Computer Science,
Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces
grading time by an average of 65%, while maintaining a 95.4% agreement rate
with instructor-assigned grades for high-confidence predictions.
comment: 7 pages, 5 figures, 1 table
♻ ☆ Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
The rapid advancement of Large Language Models (LLMs) has intensified the
need for evaluation frameworks that address the requirements of linguistically
diverse regions, such as India, and go beyond English-centric benchmarks. We
introduce EKA-EVAL, a unified evaluation framework that integrates more than 35
benchmarks (including 10 Indic benchmarks) across nine major evaluation
categories. The framework provides broader coverage than existing Indian
language evaluation tools, offering 11 core capabilities through a modular
architecture, seamless integration with Hugging Face and proprietary models,
and plug-and-play usability. As the first end-to-end suite for scalable,
multilingual LLM benchmarking, the framework combines extensive benchmarks,
modular workflows, and dedicated support for low-resource Indian languages to
enable inclusive assessment of LLM capabilities across diverse domains. We
conducted extensive comparisons against five existing baselines, demonstrating
that EKA-EVAL achieves the highest participant ratings in four out of five
categories. The framework is open-source and publicly available at:
https://github.com/lingo-iitgn/eka-eval.
♻ ☆ Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved
impressive performance across a wide range of tasks, yet they remain vulnerable
to carefully crafted perturbations. In this study, we seek to pinpoint the
sources of this fragility by identifying parameters and input dimensions
(pixels or token embeddings) that are susceptible to such perturbations. To
this end, we propose a stability measure called FI (First-order local
Influence), which is rooted in information geometry and quantifies the
sensitivity of individual parameter and input dimensions. Our
extensive analysis across LLMs and VLMs (from 1.5B to 13B parameters) reveals
that: (I) A small subset of parameters or input dimensions with high FI values
disproportionately contribute to model brittleness. (II) Mitigating the
influence of these vulnerable parameters during model merging leads to improved
performance.
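The abstract does not spell out the FI formula, so the snippet below is only a simplified first-order stand-in: ranking parameters by the squared gradient of a task loss to flag potentially brittle dimensions. The paper's information-geometric definition may differ substantially.

```python
import torch

def first_order_sensitivity(model, loss):
    # Simplified proxy: rank parameters by the squared gradient of the task loss.
    # The paper's FI measure is information-geometric; this is only a stand-in.
    model.zero_grad()
    loss.backward()
    return {name: param.grad.detach().pow(2)
            for name, param in model.named_parameters()
            if param.grad is not None}
```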
♻ ☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens ACL2025
Text embeddings from large language models (LLMs) have achieved excellent
results in tasks such as information retrieval, semantic textual similarity,
etc. In this work, we report an interesting finding: when a text is fed into an
LLM-based embedder, the resulting text embedding aligns with the key tokens in
the input text. We first analyze this phenomenon across eight LLM-based
embedders and show that it is universal and unaffected by model architecture,
training strategy, or embedding method. With a
deeper analysis, we find that the main change in embedding space between these
embedders and their LLM backbones is in the first principal component. By
adjusting the first principal component, we can align text embedding with the
key tokens. Finally, we give several examples to demonstrate the vast
application potential of this finding: (1) we propose a simple and practical
sparse retrieval method based on the aligned tokens, which recovers 80% of the
same model's dense retrieval performance while significantly reducing
computation; (2) we show that our findings provide a novel perspective to
help understand novel technologies (e.g., instruction-following embedding) and
fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
comment: ACL2025 Oral
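A toy sketch of the alignment probe suggested above: score a pooled text embedding against the backbone's output token embedding matrix and read off the highest-scoring tokens as the "key tokens". Access to the (tied) output embeddings and the pooling scheme are assumptions, not the paper's exact procedure.

```python
import torch

def aligned_key_tokens(text_embedding, output_embeddings, tokenizer, top_k=10):
    # Score the pooled text embedding against the backbone's output token
    # embedding matrix (vocab x hidden) and return the top-scoring tokens.
    scores = output_embeddings @ text_embedding   # (vocab,)
    top_ids = scores.topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```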
♻ ☆ Towards Cost-Effective Reward Guided Text Generation ICML 2025
Reward-guided text generation (RGTG) has emerged as a viable alternative to
offline reinforcement learning from human feedback (RLHF). RGTG methods can
align baseline language models to human preferences without further training
like in standard RLHF methods. However, they rely on a reward model to score
each candidate token generated by the language model at inference, incurring
significant test-time overhead. Additionally, the reward model is usually only
trained to score full sequences, which can lead to sub-optimal choices for
partial sequences. In this work, we present a novel reward model architecture
that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of
a sequence with just a single call to the reward model at each step of
the generation process. That is, a score for all possible candidate tokens is
generated simultaneously, leading to efficient inference. We theoretically
analyze various RGTG reward models and demonstrate that prior techniques prefer
sub-optimal sequences compared to our method during inference. Empirically, our
reward model leads to significantly faster inference than other RGTG methods.
It requires fewer calls to the reward model and performs competitively compared
to previous RGTG and offline RLHF methods.
comment: 18 pages. Work accepted at ICML 2025
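A minimal sketch of the single-call idea: a reward head over a frozen backbone's hidden states emits a score for every candidate next token at once, trained with a Bradley-Terry objective on chosen versus rejected expansions. The head shape and training details are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelRewardHead(nn.Module):
    # Emits a reward for every candidate next token in one forward pass,
    # given hidden states from a frozen LM backbone (shapes are illustrative).
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):             # (B, T, hidden)
        return self.score(hidden_states)          # (B, T, vocab)

def bradley_terry_loss(rewards_at_step, chosen_ids, rejected_ids):
    # rewards_at_step: (B, vocab) scores at the current position;
    # the preferred expansion should outscore the rejected one.
    r_chosen = rewards_at_step.gather(-1, chosen_ids.unsqueeze(-1)).squeeze(-1)
    r_rejected = rewards_at_step.gather(-1, rejected_ids.unsqueeze(-1)).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```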
♻ ☆ Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity's
understanding of the world. In machine learning (ML) research, rapid
advancements have led to an explosion of publications, but have also led to
misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer
review. While such mistakes are understandable, ML conferences do not offer
robust processes to help the field systematically correct when such errors are
made. This position paper argues that ML conferences should establish a
dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide
a high-profile, reputable platform to support vital research that critically
challenges prior research, thereby fostering a dynamic self-correcting research
ecosystem. We discuss key considerations including track design, review
principles, potential pitfalls, and provide an illustrative example submission
concerning a recent ICLR 2025 Oral. We conclude that ML conferences should
create official, reputable mechanisms to help ML research self-correct.
♻ ☆ Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation NAACL 2025
In tasks like summarization and open-book question answering (QA), Large
Language Models (LLMs) often encounter "contextual hallucination", where they
produce irrelevant or incorrect responses despite having access to accurate
source information. This typically occurs because these models tend to
prioritize self-generated content over the input context, causing them to
disregard pertinent details. To address this challenge, we introduce a novel
method called "Guided Attention Map Editing" (GAME), which dynamically adjusts
attention maps to improve contextual relevance. During inference, GAME employs
a trained classifier to identify attention maps prone to inducing
hallucinations and executes targeted interventions. These interventions, guided
by gradient-informed "edit directions", strategically redistribute attention
weights across various heads to effectively reduce hallucination. Comprehensive
evaluations on challenging summarization and open-book QA tasks show that GAME
consistently reduces hallucinations across a variety of open-source models.
Specifically, GAME reduces hallucinations by 10% in the XSum summarization task
while achieving a 7X speed-up in computational efficiency compared to the
state-of-the-art baselines.
comment: Accepted as Finding of NAACL 2025
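A small sketch of the editing step described above: move an attention map along a gradient-informed edit direction and renormalize its rows so they remain distributions. The step size, clamping, and renormalization below are assumptions rather than GAME's precise procedure.

```python
import torch

def edit_attention_map(attn, edit_direction, step_size=0.1):
    # Nudge an attention map along a gradient-informed "edit direction" (e.g. the
    # gradient of a hallucination classifier's score w.r.t. the map), then
    # renormalize each row so it remains a probability distribution.
    edited = (attn - step_size * edit_direction).clamp_min(0.0)
    return edited / edited.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```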
♻ ☆ Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Navigating everyday social situations often requires juggling conflicting
goals, such as conveying a harsh truth while maintaining trust and remaining
mindful of another person's feelings. These value trade-offs are an integral
part of human decision-making and language use; however, current tools for
interpreting such dynamic and multi-faceted notions of value in LLMs are
limited. In cognitive science, so-called "cognitive models" provide formal
accounts of these trade-offs in humans, by modeling the weighting of a
speaker's competing utility functions in choosing an action or utterance. In
this work, we use a leading cognitive model of polite speech to interpret the
extent to which LLMs represent human-like trade-offs. We apply this lens to
systematically evaluate value trade-offs in two encompassing model settings:
degrees of reasoning "effort" in frontier black-box models, and RL
post-training dynamics of open-source models. Our results reveal higher
informational utility than social utility both in reasoning models and in
open-source models that are stronger in mathematical reasoning. Our findings
from LLMs' training dynamics suggest large shifts in utility values early on in
training with persistent effects of the choice of base model and pretraining
data, compared to feedback dataset or alignment method. We show that our method
is responsive to diverse aspects of the rapidly evolving LLM landscape, with
insights for forming hypotheses about other high-level behaviors, shaping
training regimes for reasoning models, and better controlling trade-offs
between values during model training.
comment: 11 pages, 3 figures
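One common way such a cognitive model formalizes the trade-off, sketched below: overall speaker utility is a weighted mix of informational and social utility, and the mixing weight is what gets inferred from model behavior. This is a simplified illustration, not necessarily the specific model used in the paper.

```python
def speaker_utility(informational_utility, social_utility, phi):
    # Weighted trade-off between being informative/truthful and being kind;
    # phi is the mixing weight one would infer from an LLM's choices.
    return phi * informational_utility + (1.0 - phi) * social_utility
```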