MyArxiv
Computation and Language
☆ dMel: Speech Tokenization made Simple
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel), that performs better than other existing speech tokenization methods. Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR), speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
comment: under review
☆ J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure hiqh-quality speech generation, the data must be spontaneous like in-wild data and must be acoustically clean with noise removed. Despite the critical need, no open-source corpus meeting all these criteria has been available. This study addresses this gap by constructing and releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly accessible. Furthermore, this paper presents a language-independent method for corpus construction and describes experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the collected data from multiple domains by our method improve the naturalness and meaningfulness of dialogue generation.
comment: 8 pages, 6 figures
☆ Perceptions of Linguistic Uncertainty by Language Models and Humans
Uncertainty expressions such as ``probably'' or ``highly unlikely'' are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans interpret these expressions, there has been little inquiry into the abilities of language models to interpret such expressions. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We evaluate both humans and 10 popular language models on a task created to assess these abilities. Unexpectedly, we find that 8 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI alignment and AI-AI communication.
comment: In submission
☆ FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones
Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers as more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune 30 Hz MediaPipe Holistic landmark inputs into ByT5-Small and achieve 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks: plausible optimizations to help models run on device in real time.
comment: Access FSboard at https://www.kaggle.com/datasets/googleai/fsboard
☆ Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach
Financial news plays a crucial role in decision-making processes across the financial sector, yet the efficient processing of this information into a structured format remains challenging. This paper presents a novel approach to financial news processing that leverages Large Language Models (LLMs) to overcome limitations that previously prevented the extraction of structured data from unstructured financial news. We introduce a system that extracts relevant company tickers from raw news article content, performs sentiment analysis at the company level, and generates summaries, all without relying on pre-structured data feeds. Our methodology combines the generative capabilities of LLMs, and recent prompting techniques, with a robust validation framework that uses a tailored string similarity approach. Evaluation on a dataset of 5530 financial news articles demonstrates the effectiveness of our approach, with 90% of articles not missing any tickers compared with current data providers, and 22% of articles having additional relevant tickers. In addition to this paper, the methodology has been implemented at scale with the resulting processed data made available through a live API endpoint, which is updated in real-time with the latest news. To the best of our knowledge, we are the first data provider to offer granular, per-company sentiment analysis from news articles, enhancing the depth of information available to market participants. We also release the evaluation dataset of 5530 processed articles as a static file, which we hope will facilitate further research leveraging financial news.
comment: 7 pages, 6 figures
☆ Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade-off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP can learn steerable models that effectively trade-off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through an extensive set of experiments and ablations, we show that the CLP framework learns steerable models that outperform and Pareto-dominate the current state-of-the-art approaches for multi-objective finetuning.
comment: 40 pages
☆ LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Albeit the progress, few public benchmark is available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, as part of the question, it contains a referring query that references related video contexts, called referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that the LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when they are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
comment: 29 pages
☆ OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
When immigrating to a new country, it is easy to feel overwhelmed by the need to obtain information on financial support, housing, schooling, language courses, and other issues. If relocation is rushed or even forced, the necessity for high-quality answers to such questions is all the more urgent. Official immigration counselors are usually overbooked, and online systems could guide newcomers to the requested information or a suitable counseling service. To this end, we present OMoS-QA, a dataset of German and English questions paired with relevant trustworthy documents and manually annotated answers, specifically tailored to this scenario. Questions are automatically generated with an open-source large language model (LLM) and answer sentences are selected by crowd workers with high agreement. With our data, we conduct a comparison of 5 pretrained LLMs on the task of extractive question answering (QA) in German and English. Across all models and both languages, we find high precision and low-to-mid recall in selecting answer sentences, which is a favorable trade-off to avoid misleading users. This performance even holds up when the question language does not match the document language. When it comes to identifying unanswerable questions given a context, there are larger differences between the two languages.
comment: Accepted to KONVENS 2024
☆ DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design
Text conditioned generative models for images have yielded impressive results. Text conditioned floorplan generation as a special type of raster image generation task also received particular attention. However there are many use cases in floorpla generation where numerical properties of the generated result are more important than the aesthetics. For instance, one might want to specify sizes for certain rooms in a floorplan and compare the generated floorplan with given specifications Current approaches, datasets and commonly used evaluations do not support these kinds of constraints. As such, an attractive strategy is to generate an intermediate data structure that contains numerical properties of a floorplan which can be used to generate the final floorplan image. To explore this setting we (1) construct a new dataset for this data-structure to data-structure formulation of floorplan generation using two popular image based floorplan datasets RPLAN and ProcTHOR-10k, and provide the tools to convert further procedurally generated ProcTHOR floorplan data into our format. (2) We explore the task of floorplan generation given a partial or complete set of constraints and we design a series of metrics and benchmarks to enable evaluating how well samples generated from models respect the constraints. (3) We create multiple baselines by finetuning a large language model (LLM), Llama3, and demonstrate the feasibility of using floorplan data structure conditioned LLMs for the problem of floorplan generation respecting numerical constraints. We hope that our new datasets and benchmarks will encourage further research on different ways to improve the performance of LLMs and other generative modelling techniques for generating designs where quantitative constraints are only partially specified, but must be respected.
☆ Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite LLM's tremendous success, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open question and largely ununderstood. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks that include linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) For simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, while scaling up the model enhances this ability; (2) for more complex composite tasks that involving reasoning multiple steps, where each step represent one task, models typically underperform, and scaling up generally provide no improvements. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{https://github.com/OliverXUZY/LLM_Compose}}.
☆ AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
☆ Supporting the Digital Autonomy of Elders Through LLM Assistance
The internet offers tremendous access to services, social connections, and needed products. However, to those without sufficient experience, engaging with businesses and friends across the internet can be daunting due to the ever present danger of scammers and thieves, to say nothing of the myriad of potential computer viruses. Like a forest rich with both edible and poisonous plants, those familiar with the norms inhabit it safely with ease while newcomers need a guide. However, reliance on a human digital guide can be taxing and often impractical. We propose and pilot a simple but unexplored idea: could an LLM provide the necessary support to help the elderly who are separated by the digital divide safely achieve digital autonomy?
☆ Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)
The widespread adoption of large language models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web, equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for an indic language Hindi. Our major contributions are in four folds: i) examined 26 LLMs to evaluate their proficiency in generating Hindi text, ii) introducing the AI-generated news article in Hindi ($AG_{hi}$) dataset, iii) evaluated the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, iv) proposed Hindi AI Detectability Index ($ADI_{hi}$) which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi. We will make the codes and datasets available to encourage further research.
☆ Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
comment: Code and data: https://github.com/joyheyueya/psychometric-alignment
☆ RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time and incorporated them into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
☆ Can GPT-4 learn to analyze moves in research article abstracts?
One of the most powerful and enduring ideas in written discourse analysis is that genres can be described in terms of the moves which structure a writer's purpose. Considerable research has sought to identify these distinct communicative acts, but analyses have been beset by problems of subjectivity, reliability and the time-consuming need for multiple coders to confirm analyses. In this paper we employ the affordances of GPT-4 to automate the annotation process by using natural language prompts. Focusing on abstracts from articles in four applied linguistics journals, we devise prompts which enable the model to identify moves effectively. The annotated outputs of these prompts were evaluated by two assessors with a third addressing disagreements. The results show that an 8-shot prompt was more effective than one using two, confirming that the inclusion of examples illustrating areas of variability can enhance GPT-4's ability to recognize multiple moves in a single sentence and reduce bias related to textual position. We suggest that GPT-4 offers considerable potential in automating this annotation process, when human actors with domain specific linguistic expertise inform the prompting process.
☆ StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation ICDAR 2024
In this study, we introduce StylusAI, a novel architecture leveraging diffusion models in the domain of handwriting style generation. StylusAI is specifically designed to adapt and integrate the stylistic nuances of one language's handwriting into another, particularly focusing on blending English handwriting styles into the context of the German writing system. This approach enables the generation of German text in English handwriting styles and German handwriting styles into English, enriching machine-generated handwriting diversity while ensuring that the generated text remains legible across both languages. To support the development and evaluation of StylusAI, we present the \lq{Deutscher Handschriften-Datensatz}\rq~(DHSD), a comprehensive dataset encompassing 37 distinct handwriting styles within the German language. This dataset provides a fundamental resource for training and benchmarking in the realm of handwritten text generation. Our results demonstrate that StylusAI not only introduces a new method for style adaptation in handwritten text generation but also surpasses existing models in generating handwriting samples that improve both text quality and stylistic fidelity, evidenced by its performance on the IAM database and our newly proposed DHSD. Thus, StylusAI represents a significant advancement in the field of handwriting style generation, offering promising avenues for future research and applications in cross-linguistic style adaptation for languages with similar scripts.
comment: Accepted in ICDAR 2024
☆ Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts
Cross-lingual entity alignment (EA) enables the integration of multiple knowledge graphs (KGs) across different languages, providing users with seamless access to diverse and comprehensive knowledge.Existing methods, mostly supervised, face challenges in obtaining labeled entity pairs. To address this, recent studies have shifted towards a self-supervised and unsupervised frameworks. Despite their effectiveness, these approaches have limitations: (1) they mainly focus on entity features, neglecting the semantic information of relations, (2) they assume isomorphism between source and target graphs, leading to noise and reduced alignment accuracy, and (3) they are susceptible to noise in the textual features, especially when encountering inconsistent translations or Out-Of-Vocabulary (OOV) problems. In this paper, we propose ERAlign, an unsupervised and robust cross-lingual EA framework that jointly performs Entity-level and Relation-level Alignment using semantic textual features of relations and entities. Its refinement process iteratively enhances results by fusing entity-level and relation-level alignments based on neighbor triple matching. The additional verification process examines the entities' neighbor triples as the linearized text. This \textit{Align-and-Verify} pipeline that rigorously assesses alignment results, achieving near-perfect alignment even in the presence of noisy textual features of entities. Our extensive experiments demonstrate that robustness and general applicability of \proposed improved the accuracy and effectiveness of EA tasks, contributing significantly to knowledge-oriented applications.
☆ An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper delves into the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of Generative dialogue models. RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhanced the model's information extraction and logical reasoning abilities. We evaluated the RAFT method across multiple datasets and analysed its performance in various reasoning tasks, including long-form QA and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, it addresses the gaps in previous research regarding long-form QA tasks and Chinese datasets. Moreover, we also evaluate the benefit of the chain-of-thought (CoT) in the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.
comment: 5 pages, 4 figures
☆ SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning
Text style transfer, an important research direction in natural language processing, aims to adapt the text to various preferences but often faces challenges with limited resources. In this work, we introduce a novel method termed Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning (SETTP) for effective style transfer in low-resource scenarios. First, SETTP learns source style-level prompts containing fundamental style characteristics from high-resource style transfer. During training, the source style-level prompts are transferred through an attention module to derive a target style-level prompt for beneficial knowledge provision in low-resource style transfer. Additionally, we propose instance-level prompts obtained by clustering the target resources based on the semantic content to reduce semantic bias. We also propose an automated evaluation approach of style similarity based on alignment with human evaluations using ChatGPT-4. Our experiments across three resourceful styles show that SETTP requires only 1/20th of the data volume to achieve performance comparable to state-of-the-art methods. In tasks involving scarce data like writing style and role style, SETTP outperforms previous methods by 16.24\%.
☆ Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
☆ Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping dreeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33\% of attention layers in a 13B Llama2 model results in a 1.8\% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers except the latter layers reduces performances for more layers skipped, except for skipping the attention layers.
☆ Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
Large Language Models (LLMs) showcase remarkable performance and robust deductive capabilities, yet their expansive size complicates deployment and raises environmental concerns due to substantial resource consumption. The recent development of a quantization technique known as Learnable Singular-value Increment (LSI) has addressed some of these quantization challenges. Leveraging insights from LSI and our extensive research, we have developed innovative methods that enhance the performance of quantized LLMs, particularly in low-bit settings. Our methods consistently deliver state-of-the-art results across various quantization scenarios and offer deep theoretical insights into the quantization process, elucidating the potential of quantized models for widespread application.
comment: Effecient Quantization Methods for LLMs
☆ Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models
We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the criticality of query-aware prompt compression, where the compressor has knowledge of the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose a query-aware, variable-rate adaptation of a prior work to close the gap. We extend our experiments to a small natural language dataset to further confirm our findings on our synthetic dataset.
comment: 40 pages, 15 figures. Under review
☆ Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction
Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) \textit{Random Replacement} with the guidance of confusion sets and (2) \textit{OCR/ASR-based Generation} that simulates character misusing. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we set up impressive state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).
☆ Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing the best practices in pretraining has therefore become a major point of focus for much of NLP research -- especially since the insights developed for monolingual English models need not carry to more complex multilingual. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pre-training objective under the right conditions. We make our code, data, and model weights available at \texttt{\url{https://github.com/Helsinki-NLP/lm-vs-mt}}.
☆ Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval
Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-toend recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.
☆ Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned
Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, our team has crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.
☆ Empirical Capacity Model for Self-Attention Neural Networks
Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.
comment: Submitted to BNAIC'24, 14 pages + refs
☆ LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
We introduces LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation(E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code and models in https://github.com/openaudiolab/LLaST.
☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
comment: Ongoing work (v1); 34 pages, 5 figures
☆ Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?
☆ ALLaM: Large Language Models for Arabic and English
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.
☆ The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA) LREC2022
Pronunciation dictionaries are an important component in the process of speech forced alignment. The accuracy of these dictionaries has a strong effect on the aligned speech data since they help the mapping between orthographic transcriptions and acoustic signals. In this paper, I present the creation of a comprehensive pronunciation dictionary in Spanish (ESPADA) that can be used in most of the dialect variants of Spanish data. Current dictionaries focus on specific regional variants, but with the flexible nature of our tool, it can be readily applied to capture the most common phonetic differences across major dialectal variants. We propose improvements to current pronunciation dictionaries as well as mapping other relevant annotations such as morphological and lexical information. In terms of size, it is currently the most complete dictionary with more than 628,000 entries, representing words from 16 countries. All entries come with their corresponding pronunciations, morphological and lexical tagging, and other relevant information for phonetic analysis: stress patterns, phonotactics, IPA transcriptions, and more. This aims to equip socio-phonetic researchers with a complete open-source tool that enhances dialectal research within socio-phonetic frameworks in the Spanish language.
comment: Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
☆ ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
Social Media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from any part in the world, and coming from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics. There is currently a wide range of projects which have been able to successfully create corpora from social media. In this paper, we present the development and deployment of a linguistic corpus from Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n- grams. The information is presented through a range of powerful visualisations for users to explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.
comment: Conference on Language Technologies & Digital Humanities Ljubljana, 2022
☆ A Network Analysis Approach to Conlang Research Literature
The field of conlang has evidenced an important growth in the last decades. This has been the product of a wide interest in the use and study of conlangs for artistic purposes. However, one important question is what it is happening with conlang in the academic world. This paper aims to have an overall understanding of the literature on conlang research. With this we aim to give a realistic picture of the field in present days. We have implemented a computational linguistic approach, combining bibliometrics and network analysis to examine all publications available in the Scopus database. Analysing over 2300 academic publications since 1927 until 2022, we have found that Esperanto is by far the most documented conlang. Three main authors have contributed to this: Garv\'ia R., Fiedler S., and Blanke D. The 1970s and 1980s have been the decades where the foundations of current research have been built. In terms of methodologies, language learning and experimental linguistics are the ones contributing to most to the preferred approaches of study in the field. We present the results and discuss our limitations and future work.
☆ Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias
The common toxicity and societal bias in contents generated by large language models (LLMs) necessitate strategies to reduce harm. Present solutions often demand white-box access to the model or substantial training, which is impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting methods depend on external tool feedback and fail to simultaneously lessen toxicity and bias. Motivated by social psychology principles, we propose a novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that inspires LLMs to integrate diverse human perspectives and self-regulate their responses. This self-correction mechanism can significantly diminish toxicity (up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less harmful responses, outperforming five strong baselines.
☆ Dissecting Multiplication in Transformers: Insights into LLMs
Transformer-based large language models have achieved remarkable performance across various natural language processing tasks. However, they often struggle with seemingly easy tasks like arithmetic despite their vast capabilities. This stark disparity raise human's concerns about their safe and ethical use, hinder their widespread adoption.In this paper, we focus on a typical arithmetic task, integer multiplication, to explore and explain the imperfection of transformers in this domain. We provide comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication. Our observations indicate that the model decomposes multiplication task into multiple parallel subtasks, sequentially optimizing each subtask for each digit to complete the final multiplication. Based on observation and analysis, we infer the reasons of transformers deficiencies in multiplication tasks lies in their difficulty in calculating successive carryovers and caching intermediate results, and confirmed this inference through experiments. Guided by these findings, we propose improvements to enhance transformers performance on multiplication tasks. These enhancements are validated through rigorous testing and mathematical modeling, not only enhance transformer's interpretability, but also improve its performance, e.g., we achieve over 99.9% accuracy on 5-digit integer multiplication with a tiny transformer, outperform LLMs GPT-4. Our method contributes to the broader fields of model understanding and interpretability, paving the way for analyzing more complex tasks and Transformer models. This work underscores the importance of explainable AI, helping to build trust in large language models and promoting their adoption in critical applications.
comment: 8 pages, 5 figures
☆ UF-HOBI at "Discharge Me!": A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models ACL 2024
Automatic generation of discharge summaries presents significant challenges due to the length of clinical documentation, the dispersed nature of patient information, and the diverse terminology used in healthcare. This paper presents a hybrid solution for generating discharge summary sections as part of our participation in the "Discharge Me!" Challenge at the BioNLP 2024 Shared Task. We developed a two-stage generation method using both extractive and abstractive techniques, in which we first apply name entity recognition (NER) to extract key clinical concepts, which are then used as input for a prompt-tuning-based GatorTronGPT model to generate coherent text for two important sections including "Brief Hospital Course" and "Discharge Instructions". Our system was ranked 5th in this challenge, achieving an overall score of 0.284. The results demonstrate the effectiveness of our hybrid solution in improving the quality of automated discharge section generation.
comment: BIONLP 2024 and Shared Tasks @ ACL 2024
☆ Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA
Retrieval augmented generation (RAG) enhances the accuracy and reliability of generative AI models by sourcing factual information from external databases, which is extensively employed in document-grounded question-answering (QA) tasks. Off-the-shelf RAG flows are well pretrained on general-purpose documents, yet they encounter significant challenges when being applied to knowledge-intensive vertical domains, such as electronic design automation (EDA). This paper addresses such issue by proposing a customized RAG framework along with three domain-specific techniques for EDA tool documentation QA, including a contrastive learning scheme for text embedding model fine-tuning, a reranker distilled from proprietary LLM, and a generative LLM fine-tuned with high-quality domain corpus. Furthermore, we have developed and released a documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced RTL-to-GDSII design platform. Experimental results demonstrate that our proposed RAG flow and techniques have achieved superior performance on ORD-QA as well as on a commercial tool, compared with state-of-the-arts. The ORD-QA benchmark and the training dataset for our customized RAG flow are open-source at https://github.com/lesliepy99/RAG-EDA.
☆ MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
Event Factuality Detection (EFD) task determines the factuality of textual events, i.e., classifying whether an event is a fact, possibility, or impossibility, which is essential for faithfully understanding and utilizing event knowledge. However, due to the lack of high-quality large-scale data, event factuality detection is under-explored in event understanding research, which limits the development of EFD community. To address these issues and provide faithful event understanding, we introduce MAVEN-Fact, a large-scale and high-quality EFD dataset based on the MAVEN dataset. MAVEN-Fact includes factuality annotations of 112,276 events, making it the largest EFD dataset. Extensive experiments demonstrate that MAVEN-Fact is challenging for both conventional fine-tuned models and large language models (LLMs). Thanks to the comprehensive annotations of event arguments and relations in MAVEN, MAVEN-Fact also supports some further analyses and we find that adopting event arguments and relations helps in event factuality detection for fine-tuned models but does not benefit LLMs. Furthermore, we preliminarily study an application case of event factuality detection and find it helps in mitigating event-related hallucination in LLMs. Our dataset and codes can be obtained from \url{https://github.com/lcy2723/MAVEN-FACT}
comment: Under review
☆ LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem.
comment: Preprint Paper with 13 pages
☆ Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and external knowledge base with the original complex question, then generate answers with Large Language Models (LLMs). However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the ``forward-only'' answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with the above limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses LLM's feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and decompose the original complex question into two simple sub-questions: Image-based sub-question and Knowledge-based sub-question. Then we use the two sub-questions to retrieve knowledge from the image and knowledge base, respectively. In this way, two knowledge acquisition models can focus on the content that corresponds to them and avoid disturbance of irrelevant elements in the original complex question, which can help to provide more precise knowledge and better align the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at \url{https://github.com/Lackel/DKA}.
comment: Pre-print
☆ Improving Minimum Bayes Risk Decoding with Multi-Prompt
While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single "best" prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks, and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.
☆ ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning
The DimABSA task requires fine-grained sentiment intensity prediction for restaurant reviews, including scores for Valence and Arousal dimensions for each Aspect Term. In this study, we propose a Coarse-to-Fine In-context Learning(CFICL) method based on the Baichuan2-7B model for the DimABSA task in the SIGHAN 2024 workshop. Our method improves prediction accuracy through a two-stage optimization process. In the first stage, we use fixed in-context examples and prompt templates to enhance the model's sentiment recognition capability and provide initial predictions for the test data. In the second stage, we encode the Opinion field using BERT and select the most similar training data as new in-context examples based on similarity. These examples include the Opinion field and its scores, as well as related opinion words and their average scores. By filtering for sentiment polarity, we ensure that the examples are consistent with the test data. Our method significantly improves prediction accuracy and consistency by effectively utilizing training data and optimizing in-context examples, as validated by experimental results.
☆ Deep Learning for Economists
Deep learning provides powerful methods to impute structured information from large-scale, unstructured text and image datasets. For example, economists might wish to detect the presence of economic activity in satellite images, or to measure the topics or entities mentioned in social media, the congressional record, or firm filings. This review introduces deep neural networks, covering methods such as classifiers, regression models, generative AI, and embedding models. Applications include classification, document digitization, record linkage, and methods for data exploration in massive scale text and image corpora. When suitable methods are used, deep learning models can be cheap to tune and can scale affordably to problems involving millions or billions of data points.. The review is accompanied by a companion website, EconDL, with user-friendly demo notebooks, software resources, and a knowledge base that provides technical details and additional applications.
♻ ☆ Foundation Models for Autonomous Robots in Unstructured Environments
Automating activities through robots in unstructured environments, such as construction sites, has been a long-standing desire. However, the high degree of unpredictable events in these settings has resulted in far less adoption compared to more structured settings, such as manufacturing, where robots can be hard-coded or trained on narrowly defined datasets. Recently, pretrained foundation models, such as Large Language Models (LLMs), have demonstrated superior generalization capabilities by providing zero-shot solutions for problems do not present in the training data, proposing them as a potential solution for introducing robots to unstructured environments. To this end, this study investigates potential opportunities and challenges of pretrained foundation models from a multi-dimensional perspective. The study systematically reviews application of foundation models in two field of robotic and unstructured environment and then synthesized them with deliberative acting theory. Findings showed that linguistic capabilities of LLMs have been utilized more than other features for improving perception in human-robot interactions. On the other hand, findings showed that the use of LLMs demonstrated more applications in project management and safety in construction, and natural hazard detection in disaster management. Synthesizing these findings, we located the current state-of-the-art in this field on a five-level scale of automation, placing them at conditional automation. This assessment was then used to envision future scenarios, challenges, and solutions toward autonomous safe unstructured environments. Our study can be seen as a benchmark to track our progress toward that future.
comment: arXiv admin note: text overlap with arXiv:2312.07843, arXiv:2402.05741 by other authors
♻ ☆ Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that fine-tuning models with synthetic data generated by the Rainbow Teaming method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.
♻ ☆ ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
♻ ☆ Who Shares Fake News? Uncovering Insights from Social Media Users' Post Histories
We propose that social-media users' own post histories are an underused yet valuable resource for studying fake-news sharing. By extracting textual cues from their prior posts, and contrasting their prevalence against random social-media users and others (e.g., those with similar socio-demographics, political news-sharers, and fact-check sharers), researchers can identify cues that distinguish fake-news sharers, predict those most likely to share fake news, and identify promising constructs to build interventions. Our research includes studies along these lines. In Study 1, we explore the distinctive language patterns of fake-news sharers, highlighting elements such as their higher use of anger and power-related words. In Study 2, we show that adding textual cues into predictive models enhances their accuracy in predicting fake-news sharers. In Study 3, we explore the contrasting role of trait and situational anger, and show trait anger is associated with a greater propensity to share both true and fake news. In Study 4, we introduce a way to authenticate Twitter accounts in surveys, before using it to explore how crafting an ad copy that resonates with users' sense of power encourages the adoption of fact-checking tools. We hope to encourage the use of novel research methods for marketers and misinformation researchers.
comment: 108 pages
♻ ☆ Beyond Memorization: The Challenge of Random Memory Access in Language Models ACL 2024
Recent developments in Language Models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks. However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully-designed synthetic tasks, covering the scenarios of full recitation, selective recitation and grounded question answering, we reveal that LMs manage to sequentially access their memory while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic scenarios of open-domain question answering, we validate that enhancing random access by recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at https://github.com/sail-sg/lm-random-memory-access.
comment: 9 pages, 4 figures; accepted by ACL 2024 (oral)
♻ ☆ Fast and Effective Weight Update for Pruned Large Language Models
Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
♻ ☆ GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents' performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores but not comparable to human levels.
♻ ☆ Chain of Code: Reasoning with a Language Model-Augmented Code Emulator ICML 2024
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
comment: ICML 2024 Oral; Project webpage: https://chain-of-code.github.io
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
comment: 20 pages; 13 visualizations (nine figures, four tables)
♻ ☆ AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
Despite the advances in the abstractive summarization task using Large Language Models (LLM), there is a lack of research that asses their abilities to easily adapt to different domains. We evaluate the domain adaptation abilities of a wide range of LLMs on the summarization task across various domains in both fine-tuning and in-context learning settings. We also present AdaptEval, the first domain adaptation evaluation suite. AdaptEval includes a domain benchmark and a set of metrics to facilitate the analysis of domain adaptation. Our results demonstrate that LLMs exhibit comparable performance in the in-context learning setting, regardless of their parameter scale.
♻ ☆ Mitigating Entity-Level Hallucination in Large Language Models
The emergence of Large Language Models (LLMs) has revolutionized how users access information, shifting from traditional search engines to direct question-and-answer interactions with LLMs. However, the widespread adoption of LLMs has revealed a significant challenge known as hallucination, wherein LLMs generate coherent yet factually inaccurate responses. This hallucination phenomenon has led to users' distrust in information retrieval systems based on LLMs. To tackle this challenge, this paper proposes Dynamic Retrieval Augmentation based on hallucination Detection (DRAD) as a novel method to detect and mitigate hallucinations in LLMs. DRAD improves upon traditional retrieval augmentation by dynamically adapting the retrieval process based on real-time hallucination detection. It features two main components: Real-time Hallucination Detection (RHD) for identifying potential hallucinations without external models, and Self-correction based on External Knowledge (SEK) for correcting these errors using external knowledge. Experiment results show that DRAD demonstrates superior performance in both detecting and mitigating hallucinations in LLMs. All of our code and data are open-sourced at https://github.com/oneal2000/EntityHallucination.
♻ ☆ A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs 2) How to measure the medical task performance of LLMs in a medical setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How to more effectively develop and deploy medical LLMs? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide
comment: Preprint. Version 6. Update Figures 1-5; Tables 2-3; 31 pages
♻ ☆ Cross-Speaker Encoding Network for Multi-Talker Speech Recognition ICASSP2024
End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output training (SOT). In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. Furthermore, the CSE model is integrated with SOT to leverage both the advantages of SIMO and SISO while mitigating their drawbacks. To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CES model reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech compared to the SOT model. Code is available at https://github.com/kjw11/CSEnet-ASR.
comment: Accepted by ICASSP2024
♻ ☆ TTSDS -- Text-to-Speech Distribution Score
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
comment: Under review for SLT 2024
♻ ☆ Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection
The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with the recent LLMs and their advanced generation capabilities, many of the detectable features (e.g., style-conversion attacks) can be altered, making it more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.
comment: 8 pages
♻ ☆ Unipa-GPT: Large Language Models for university-oriented QA in Italian
This paper illustrates the architecture and training of Unipa-GPT, a chatbot relying on a Large Language Model, developed for assisting students in choosing a bachelor/master degree course at the University of Palermo. Unipa-GPT relies on gpt-3.5-turbo, it was presented in the context of the European Researchers' Night (SHARPER night). In our experiments we adopted both the Retrieval Augmented Generation (RAG) approach and fine-tuning to develop the system. The whole architecture of Unipa-GPT is presented, both the RAG and the fine-tuned systems are compared, and a brief discussion on their performance is reported. Further comparison with other Large Language Models and the experimental results during the SHARPER night are illustrated.
♻ ☆ MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models ACL 2024
Parameter Efficient Finetuning (PEFT) has emerged as a viable solution for improving the performance of Large Language Models (LLMs) without requiring massive resources and compute. Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages. Further, there is also a large gap between the performance of smaller open-source models and larger LLMs. Finetuning can be an effective way to bridge this gap and make language models more equitable. In this work, we finetune the LLama-2-7B and Mistral-7B models on two synthetic multilingual instruction tuning datasets to determine its effect on model performance on six downstream tasks covering forty languages in all. Additionally, we experiment with various parameters, such as rank for low-rank adaptation and values of quantisation to determine their effects on downstream performance and find that higher rank and higher quantisation values benefit low-resource languages. We find that PEFT of smaller open-source models sometimes bridges the gap between the performance of these models and the larger ones, however, English performance can take a hit. We also find that finetuning sometimes improves performance on low-resource languages, while degrading performance on high-resource languages.
comment: 46 pages, 23 figures, 45 tables. Accepted in ACL 2024 findings
♻ ☆ General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History
Machine learning (ML) has recently shown promising results in medical predictions using electronic health records (EHRs). However, since ML models typically have a limited capability in terms of input sizes, selecting specific medical events from EHRs for use as input is necessary. This selection process, often relying on expert opinion, can cause bottlenecks in development. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate unlimited medical events, select the relevant ones, and make predictions. This allows for an unrestricted input size, eliminating the need for manual event selection. We verified these properties through experiments involving 27 clinical prediction tasks across four independent cohorts, where REMed outperformed the baselines. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
comment: The source codes corresponding to this paper are available at: https://github.com/starmpcc/REMed
♻ ☆ Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model
Large Language Models (LLMs) have demonstrated exceptional proficiency in mathematical reasoning tasks due to their extensive parameter counts and training on vast datasets. Despite these capabilities, deploying LLMs is hindered by their computational demands. Distilling LLM mathematical reasoning into Smaller Language Models (SLMs) has emerged as a solution to this challenge, although these smaller models often suffer from errors in calculation and semantic understanding. Prior work has proposed Program-of-Thought Distillation (PoTD) to avoid calculation error. To further address semantic understanding errors, we propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD). KPDD enhances the reasoning performance of SLMs by breaking down the problem-solving process into three stages: Core Question Extraction, Problem-Solving Information Extraction, and Step-by-Step Solution. This method is further divided into KPDD-CoT, which generates Chain-of-Thought rationales, and KPDD-PoT, which creates Program-of-Thought rationales. The experiment results show that KPDD-CoT significantly improves reasoning abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical reasoning tasks. Our approach effectively mitigates misunderstanding errors, advancing the deployment of efficient and capable SLMs.
♻ ☆ Meta-Task Prompting Elicits Embeddings from Large Language Models ACL 2024
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation (MetaEOL), for generating high-quality sentence embeddings from Large Language Models (LLMs) without the need for model fine-tuning. Leveraging meta-task prompting, MetaEOL guides LLMs to produce embeddings through a series of carefully designed prompts that address multiple representational aspects. Our comprehensive experiments demonstrate that embeddings averaged from various meta-tasks are versatile embeddings that yield competitive performance on Semantic Textual Similarity (STS) benchmarks and excel in downstream tasks, surpassing contrastive-trained models. Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
comment: ACL 2024
♻ ☆ EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation ACL 2022
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior performance against the conventional MNMT by constructing multi-way aligned corpus, i.e., aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, since exactly identical sentences from different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To handle this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences; and then generate the final aligned examples from the candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus whose diversity is almost identical to the original bilingual corpus. Experiments on two publicly available datasets i.e., WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets respectively.
comment: Accepted as a long paper at ACL 2022
♻ ☆ From Black Boxes to Conversations: Incorporating XAI in a Conversational Agent
The goal of Explainable AI (XAI) is to design methods to provide insights into the reasoning process of black-box models, such as deep neural networks, in order to explain them to humans. Social science research states that such explanations should be conversational, similar to human-to-human explanations. In this work, we show how to incorporate XAI in a conversational agent, using a standard design for the agent comprising natural language understanding and generation components. We build upon an XAI question bank, which we extend by quality-controlled paraphrases, to understand the user's information needs. We further systematically survey the literature for suitable explanation methods that provide the information to answer those questions, and present a comprehensive list of suggestions. Our work is the first step towards truly natural conversations about machine learning models with an explanation agent. The comprehensive list of XAI questions and the corresponding explanation methods may support other researchers in providing the necessary information to address users' demands. To facilitate future work, we release our source code and data.
comment: Accepted at The World Conference on eXplainable Artificial Intelligence 2023 (XAI-2023)
♻ ☆ Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models ICML 2024
Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
comment: ICML 2024. Website: https://hanlin-zhang.com/impossibility-watermarks
♻ ☆ TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
As large language models (LLMs) become increasingly prevalent in critical applications, the need for interpretable AI has grown. We introduce TokenSHAP, a novel method for interpreting LLMs by attributing importance to individual tokens or substrings within input prompts. This approach adapts Shapley values from cooperative game theory to natural language processing, offering a rigorous framework for understanding how different parts of an input contribute to a model's response. TokenSHAP leverages Monte Carlo sampling for computational efficiency, providing interpretable, quantitative measures of token importance. We demonstrate its efficacy across diverse prompts and LLM architectures, showing consistent improvements over existing baselines in alignment with human judgments, faithfulness to model behavior, and consistency. Our method's ability to capture nuanced interactions between tokens provides valuable insights into LLM behavior, enhancing model transparency, improving prompt engineering, and aiding in the development of more reliable AI systems. TokenSHAP represents a significant step towards the necessary interpretability for responsible AI deployment, contributing to the broader goal of creating more transparent, accountable, and trustworthy AI systems.
♻ ☆ MarkLLM: An Open-Source Toolkit for LLM Watermarking
LLM watermarking, which embeds imperceptible yet algorithmically detectable signals in model outputs to identify LLM-generated text, has become crucial in mitigating the potential misuse of large language models. However, the abundance of LLM watermarking algorithms, their intricate mechanisms, and the complex evaluation procedures and perspectives pose challenges for researchers and the community to easily experiment with, understand, and assess the latest advancements. To address these issues, we introduce MarkLLM, an open-source toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework for implementing LLM watermarking algorithms, while providing user-friendly interfaces to ensure ease of access. Furthermore, it enhances understanding by supporting automatic visualization of the underlying mechanisms of these algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines. Through MarkLLM, we aim to support researchers while improving the comprehension and involvement of the general public in LLM watermarking technology, fostering consensus and driving further advancements in research and application. Our code is available at https://github.com/THU-BPM/MarkLLM.
comment: 16 pages, 5 figures, 6 tables
♻ ☆ UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs ACL 2024
Evaluation is pivotal for refining Large Language Models (LLMs), pinpointing their capabilities, and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. However, considering various implementation details, developing a comprehensive evaluation platform is never easy. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into research workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models owing to a unified HTTP service and provides sufficient inference acceleration. UltraEval is now available for researchers publicly.
comment: Accepted by ACL 2024 System Demostration Track, update
♻ ☆ Video Understanding with Large Language Models: A Survey
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
♻ ☆ A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Current Large Language Models (LLMs) are not only limited to some maximum context length, but also are not able to robustly consume long inputs. To address these limitations, we propose ReadAgent, an LLM agent system that increases effective context length up to 20x in our experiments. Inspired by how humans interactively read long documents, we implement ReadAgent as a simple prompting system that uses the advanced language capabilities of LLMs to (1) decide what content to store together in a memory episode, (2) compress those memory episodes into short episodic memories called gist memories, and (3) take actions to look up passages in the original text if ReadAgent needs to remind itself of relevant details to complete a task. We evaluate ReadAgent against baselines using retrieval methods, using the original long contexts, and using the gist memories. These evaluations are performed on three long-document reading comprehension tasks: QuALITY, NarrativeQA, and QMSum. ReadAgent outperforms the baselines on all three tasks while extending the effective context window by 3.5-20x.
comment: Website: https://read-agent.github.io
♻ ☆ FineSurE: Fine-grained Summarization Evaluation using LLMs ACL 2024
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
comment: Accepted at ACL 2024 (main, long)
♻ ☆ Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization
This work focuses on the task of query-based meeting summarization in which the summary of a context (meeting transcript) is generated in response to a specific query. When using Large Language Models (LLMs) for this task, usually a new call to the LLM inference endpoint/API is triggered for each new query, even if the context stays the same. However, repeated calls to the LLM inference endpoints would significantly increase the costs of using them in production, making LLMs impractical for many real-world use cases. To address this problem, in this paper, we investigate whether combining the queries for the same input context in a single prompt to minimize repeated calls can be successfully used in meeting summarization. In this regard, we conduct extensive experiments by comparing the performance of various popular LLMs: GPT-4, Gemini, Claude-3, LLaMA-2, Mistral, Phi-3, and Qwen-2 in single-query and multi-query settings. We observe that 100% reliability in generating the response in the expected format is usually limited to certain closed-source LLMs, with most open-source LLMs lagging behind (except a few 7B parameters LLMs like Mistral and Phi-3). We conclude that multi-query prompting could be useful to significantly optimize the inference costs in meeting summarization.
♻ ☆ GSQA: An End-to-End Model for Generative Spoken Question Answering
In recent advancements in spoken question answering (QA), end-to-end models have made significant strides. However, previous research has primarily focused on extractive span selection. While this extractive-based approach is effective when answers are present directly within the input, it falls short in addressing abstractive questions, where answers are not directly extracted but inferred from the given information. To bridge this gap, we introduce the first end-to-end Generative Spoken Question Answering (GSQA) model that empowers the system to engage in abstractive reasoning. The challenge in training our GSQA model lies in the absence of a spoken abstractive QA dataset. We propose using text models for initialization and leveraging the extractive QA dataset to transfer knowledge from the text generative model to the spoken generative model. Experimental results indicate that our model surpasses the previous extractive model by 3% on extractive QA datasets. Furthermore, the GSQA model has only been fine-tuned on the spoken extractive QA dataset. Despite not having seen any spoken abstractive QA data, it can still closely match the performance of the cascade model. In conclusion, our GSQA model shows the potential to generalize to a broad spectrum of questions, thus further expanding the spoken question answering capabilities of abstractive QA. Our code is available at https://voidful.github.io/GSQA
comment: 5 pages, 2 figures, Interspeech 2024
♻ ☆ Sketch-Guided Constrained Decoding for Boosting Blackbox Large Language Models without Logit Access ACL 2024
Constrained decoding, a technique for enforcing constraints on language model outputs, offers a way to control text generation without retraining or architectural modifications. Its application is, however, typically restricted to models that give users access to next-token distributions (usually via softmax logits), which poses a limitation with blackbox large language models (LLMs). This paper introduces sketch-guided constrained decoding (SGCD), a novel approach to constrained decoding for blackbox LLMs, which operates without access to the logits of the blackbox LLM. SGCD utilizes a locally hosted auxiliary model to refine the output of an unconstrained blackbox LLM, effectively treating this initial output as a "sketch" for further elaboration. This approach is complementary to traditional logit-based techniques and enables the application of constrained decoding in settings where full model transparency is unavailable. We demonstrate the efficacy of SGCD through experiments in closed information extraction and constituency parsing, showing how it enhances the utility and flexibility of blackbox LLMs for complex NLP tasks.
comment: Accepted to ACL 2024 Oral
♻ ☆ $\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks
This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, the availability of which undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, thus making them inadequate for determining the safety or correctness of their generated outputs. We introduce a new approach that utilizes context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and perform closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: 43 pages, MLRegTest benchmark available at https://doi.org/10.5061/dryad.dncjsxm4h , associated code at https://github.com/heinz-jeffrey/subregular-learning
Artificial Intelligence
☆ WayEx: Waypoint Exploration using a Single Demonstration ICRA 2024
We propose WayEx, a new method for learning complex goal-conditioned robotics tasks from a single demonstration. Our approach distinguishes itself from existing imitation learning methods by demanding fewer expert examples and eliminating the need for information about the actions taken during the demonstration. This is accomplished by introducing a new reward function and employing a knowledge expansion technique. We demonstrate the effectiveness of WayEx, our waypoint exploration strategy, across six diverse tasks, showcasing its applicability in various environments. Notably, our method significantly reduces training time by 50% as compared to traditional reinforcement learning methods. WayEx obtains a higher reward than existing imitation learning methods given only a single demonstration. Furthermore, we demonstrate its success in tackling complex environments where standard approaches fall short. More information is available at: https://waypoint-ex.github.io.
comment: ICRA 2024
☆ LLMmap: Fingerprinting For Large Language Models
We introduce LLMmap, a first-generation fingerprinting attack targeted at LLM-integrated applications. LLMmap employs an active fingerprinting approach, sending carefully crafted queries to the application and analyzing the responses to identify the specific LLM model in use. With as few as 8 interactions, LLMmap can accurately identify LLMs with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLMs operating under various system prompts, stochastic sampling hyperparameters, and even complex generation frameworks such as RAG or Chain-of-Thought.
comment: version 0.1
☆ Reconstructing Training Data From Real World Models Trained with Transfer Learning
Current methods for reconstructing training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in realistic settings for models trained on high-resolution images. Our method adapts the reconstruction scheme of arXiv:2206.07758 to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning.
☆ CarFormer: Self-Driving with Learned Object-Centric Representations ECCV 2024
The choice of representation plays a key role in self-driving. Bird's eye view (BEV) representations have shown remarkable performance in recent years. In this paper, we propose to learn object-centric representations in BEV to distill a complex scene into more actionable information for self-driving. We first learn to place objects into slots with a slot attention model on BEV sequences. Based on these object-centric representations, we then train a transformer to learn to drive as well as reason about the future of other vehicles. We found that object-centric slot representations outperform both scene-level and object-level approaches that use the exact attributes of objects. Slot representations naturally incorporate information about objects from their spatial and temporal context such as position, heading, and speed without explicitly providing it. Our model with slots achieves an increased completion rate of the provided routes and, consequently, a higher driving score, with a lower variance across multiple runs, affirming slots as a reliable alternative in object-centric approaches. Additionally, we validate our model's performance as a world model through forecasting experiments, demonstrating its capability to predict future slot representations accurately. The code and the pre-trained models can be found at https://kuis-ai.github.io/CarFormer/.
comment: Accepted to ECCV 2024, code and the pre-trained models can be found at https://kuis-ai.github.io/CarFormer/
☆ HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning ECCV 2024
Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
comment: To be presented at ECCV 2024
☆ Importance Sampling-Guided Meta-Training for Intelligent Agents in Highly Interactive Environments
Training intelligent agents to navigate highly interactive environments presents significant challenges. While guided meta reinforcement learning (RL) approach that first trains a guiding policy to train the ego agent has proven effective in improving generalizability across various levels of interaction, the state-of-the-art method tends to be overly sensitive to extreme cases, impairing the agents' performance in the more common scenarios. This study introduces a novel training framework that integrates guided meta RL with importance sampling (IS) to optimize training distributions for navigating highly interactive driving scenarios, such as T-intersections. Unlike traditional methods that may underrepresent critical interactions or overemphasize extreme cases during training, our approach strategically adjusts the training distribution towards more challenging driving behaviors using IS proposal distributions and applies the importance ratio to de-bias the result. By estimating a naturalistic distribution from real-world datasets and employing a mixture model for iterative training refinements, the framework ensures a balanced focus across common and extreme driving scenarios. Experiments conducted with both synthetic dataset and T-intersection scenarios from the InD dataset demonstrate not only accelerated training but also improvement in agent performance under naturalistic conditions, showcasing the efficacy of combining IS with meta RL in training reliable autonomous agents for highly interactive navigation tasks.
☆ Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images. It excels in region-aware learning and provides strong initializations for various tasks, but struggles to capture high-level semantics without further supervised fine-tuning, likely due to the low-level nature of its pixel reconstruction objective. A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets. However, this approach poses significant training challenges as the reconstruction targets are learned in conjunction with the model, potentially leading to trivial or suboptimal solutions.Our study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM. Through a series of carefully designed experiments and extensive analysis, we identify the source of these challenges, including representation collapsing for joint online/target optimization, learning objectives, the high region correlation in latent space and decoding conditioning. By sequentially addressing these issues, we demonstrate that Latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.
☆ dMel: Speech Tokenization made Simple
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel), that performs better than other existing speech tokenization methods. Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR), speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
comment: under review
☆ NV-Retriever: Improving text embedding models with effective hard-negative mining
Text embedding models have been popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). Those models are typically Transformer models that are fine-tuned with contrastive learning objectives. Many papers introduced new embedding model architectures and training approaches, however, one of the key ingredients, the process of mining negative passages, remains poorly explored or described. One of the challenging aspects of fine-tuning embedding models is the selection of high quality hard-negative passages for contrastive learning. In this paper we propose a family of positive-aware mining methods that leverage the positive relevance score for more effective false negatives removal. We also provide a comprehensive ablation study on hard-negative mining methods over their configurations, exploring different teacher and base models. We demonstrate the efficacy of our proposed methods by introducing the NV-Retriever-v1 model, which scores 60.9 on MTEB Retrieval (BEIR) benchmark and 0.65 points higher than previous methods. The model placed 1st when it was published to MTEB Retrieval on July 07, 2024.
☆ On shallow planning under partial observability
Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting a discount factor for the learning objective (discounted cumulative rewards), which articulates the planning horizon of the agent. This work investigates the impact of the discount factor on the biasvariance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
comment: Presented at deployable RL (RLC conference 2024)
☆ Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose \textbf{Maniwhere}, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulate objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
comment: Webpage: https://gemcollector.github.io/maniwhere/
☆ Perceptions of Linguistic Uncertainty by Language Models and Humans
Uncertainty expressions such as ``probably'' or ``highly unlikely'' are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans interpret these expressions, there has been little inquiry into the abilities of language models to interpret such expressions. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We evaluate both humans and 10 popular language models on a task created to assess these abilities. Unexpectedly, we find that 8 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI alignment and AI-AI communication.
comment: In submission
☆ Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only \$1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than stable diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
comment: 41 pages, 28 figures, 5 tables
☆ CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning BMVC 2024
With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This led to the development of numerous prompting strategies to effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains significantly deviating from the pre-training and preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM, which exploits generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.
comment: 15 pages, 1 figure. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK
☆ Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels
Recent advances in reinforcement learning (RL) have predominantly leveraged neural network-based policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into neural networks. However, a significant limitation in prior work is the assumption that human annotations for these concepts are readily available during training, necessitating continuous real-time input from human annotators. To overcome this limitation, we introduce a novel training scheme that enables RL algorithms to efficiently learn a concept-based policy by only querying humans to label a small set of data, or in the extreme case, without any human labels. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using a concept ensembles to actively select informative data points for labeling, and decorrelating the concept data with a simple strategy. We show how LICORICE reduces manual labeling efforts to to 500 or fewer concept labels in three environments. Finally, we present an initial study to explore how we can use powerful vision-language models to infer concepts from raw visual inputs without explicit labels at minimal cost to performance.
comment: 23 pages, 6 figures, 9 tables
☆ Diffusion Model Based Resource Allocation Strategy in Ultra-Reliable Wireless Networked Control Systems
Diffusion models are vastly used in generative AI, leveraging their capability to capture complex data distributions. However, their potential remains largely unexplored in the field of resource allocation in wireless networks. This paper introduces a novel diffusion model-based resource allocation strategy for Wireless Networked Control Systems (WNCSs) with the objective of minimizing total power consumption through the optimization of the sampling period in the control system, and blocklength and packet error probability in the finite blocklength regime of the communication system. The problem is first reduced to the optimization of blocklength only based on the derivation of the optimality conditions. Then, the optimization theory solution collects a dataset of channel gains and corresponding optimal blocklengths. Finally, the Denoising Diffusion Probabilistic Model (DDPM) uses this collected dataset to train the resource allocation algorithm that generates optimal blocklength values conditioned on the channel state information (CSI). Via extensive simulations, the proposed approach is shown to outperform previously proposed Deep Reinforcement Learning (DRL) based approaches with close to optimal performance regarding total power consumption. Moreover, an improvement of up to eighteen-fold in the reduction of critical constraint violations is observed, further underscoring the accuracy of the solution.
comment: 5 pages, 4 figures
☆ Explaining Decisions in ML Models: a Parameterized Complexity Analysis KR 2024
This paper presents a comprehensive theoretical investigation into the parameterized complexity of explanation problems in various machine learning (ML) models. Contrary to the prevalent black-box perception, our study focuses on models with transparent internal mechanisms. We address two principal types of explanation problems: abductive and contrastive, both in their local and global variants. Our analysis encompasses diverse ML models, including Decision Trees, Decision Sets, Decision Lists, Ordered Binary Decision Diagrams, Random Forests, and Boolean Circuits, and ensembles thereof, each offering unique explanatory challenges. This research fills a significant gap in explainable AI (XAI) by providing a foundational understanding of the complexities of generating explanations for these models. This work provides insights vital for further research in the domain of XAI, contributing to the broader discourse on the necessity of transparency and accountability in AI systems.
comment: A short version of the paper has been accepted at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024)
☆ Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
This paper addresses the challenge of robotic grasping of general objects. Similar to prior research, the task reads a single-view 3D observation (i.e., point clouds) captured by a depth camera as input. Crucially, the success of object grasping highly demands a comprehensive understanding of the shape of objects within the scene. However, single-view observations often suffer from occlusions (including both self and inter-object occlusions), which lead to gaps in the point clouds, especially in complex cluttered scenes. This renders incomplete perception of the object shape and frequently causes failures or inaccurate pose estimation during object grasping. In this paper, we tackle this issue with an effective albeit simple solution, namely completing grasping-related scene regions through local occupancy prediction. Following prior practice, the proposed model first runs by proposing a number of most likely grasp points in the scene. Around each grasp point, a module is designed to infer any voxel in its neighborhood to be either void or occupied by some object. Importantly, the occupancy map is inferred by fusing both local and global cues. We implement a multi-group tri-plane scheme for efficiently aggregating long-distance contextual information. The model further estimates 6-DoF grasp poses utilizing the local occupancy-enhanced object shape information and returns the top-ranked grasp proposal. Comprehensive experiments on both the large-scale GraspNet-1Billion benchmark and real robotic arm demonstrate that the proposed method can effectively complete the unobserved parts in cluttered and occluded scenes. Benefiting from the occupancy-enhanced feature, our model clearly outstrips other competing methods under various performance metrics such as grasping average precision.
☆ Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade-off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP can learn steerable models that effectively trade-off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through an extensive set of experiments and ablations, we show that the CLP framework learns steerable models that outperform and Pareto-dominate the current state-of-the-art approaches for multi-objective finetuning.
comment: 40 pages
☆ Model editing for distribution shifts in uranium oxide morphological analysis CVPR 2024
Deep learning still struggles with certain kinds of scientific data. Notably, pretraining data may not provide coverage of relevant distribution shifts (e.g., shifts induced via the use of different measurement instruments). We consider deep learning models trained to classify the synthesis conditions of uranium ore concentrates (UOCs) and show that model editing is particularly effective for improving generalization to distribution shifts common in this domain. In particular, model editing outperforms finetuning on two curated datasets comprising of micrographs taken of U$_{3}$O$_{8}$ aged in humidity chambers and micrographs acquired with different scanning electron microscopes, respectively.
comment: Presented at CV4MS @ CVPR 2024
☆ MoRSE: Bridging the Gap in Cybersecurity Expertise with Retrieval Augmented Generation
In this paper, we introduce MoRSE (Mixture of RAGs Security Experts), the first specialised AI chatbot for cybersecurity. MoRSE aims to provide comprehensive and complete knowledge about cybersecurity. MoRSE uses two RAG (Retrieval Augmented Generation) systems designed to retrieve and organize information from multidimensional cybersecurity contexts. MoRSE differs from traditional RAGs by using parallel retrievers that work together to retrieve semantically related information in different formats and structures. Unlike traditional Large Language Models (LLMs) that rely on Parametric Knowledge Bases, MoRSE retrieves relevant documents from Non-Parametric Knowledge Bases in response to user queries. Subsequently, MoRSE uses this information to generate accurate answers. In addition, MoRSE benefits from real-time updates to its knowledge bases, enabling continuous knowledge enrichment without retraining. We have evaluated the effectiveness of MoRSE against other state-of-the-art LLMs, evaluating the system on 600 cybersecurity specific questions. The experimental evaluation has shown that the improvement in terms of relevance and correctness of the answer is more than 10\% compared to known solutions such as GPT-4 and Mixtral 7x8.
☆ Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond ECCV 2024
In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.
comment: ECCV 2024 - Benchmark page: https://ade-ood.github.io/
☆ Parallel Split Learning with Global Sampling
The expansion of IoT devices and the demands of Deep Learning have highlighted significant challenges in Distributed Deep Learning (DDL) systems. Parallel Split Learning (PSL) has emerged as a promising derivative of Split Learning that is well suited for distributed learning on resource-constrained devices. However, PSL faces several obstacles, such as large effective batch sizes, non-IID data distributions, and the straggler effect. We view these issues as a sampling dilemma and propose to address them by orchestrating the mini-batch sampling process on the server side. We introduce the Uniform Global Sampling (UGS) method to decouple the effective batch size from the number of clients and reduce mini-batch deviation in non-IID settings. To address the straggler effect, we introduce the Latent Dirichlet Sampling (LDS) method, which generalizes UGS to balance the trade-off between batch deviation and training time. Our simulations reveal that our proposed methods enhance model accuracy by up to 34.1% in non-IID settings and reduce the training time in the presence of stragglers by up to 62%. In particular, LDS effectively mitigates the straggler effect without compromising model accuracy or adding significant computational overhead compared to UGS. Our results demonstrate the potential of our methods as a promising solution for DDL in real applications.
☆ TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON
TaskGen is an open-sourced agentic framework which uses an Agent to solve an arbitrary task by breaking them down into subtasks. Each subtask is mapped to an Equipped Function or another Agent to execute. In order to reduce verbosity (and hence token usage), TaskGen uses StrictJSON that ensures JSON output from the Large Language Model (LLM), along with additional features such as type checking and iterative error correction. Key to the philosophy of TaskGen is the management of information/memory on a need-to-know basis. We empirically evaluate TaskGen on various environments such as 40x40 dynamic maze navigation with changing obstacle locations (100% solve rate), TextWorld escape room solving with dense rewards and detailed goals (96% solve rate), web browsing (69% of actions successful), solving the MATH dataset (71% solve rate over 100 Level-5 problems), Retrieval Augmented Generation on NaturalQuestions dataset (F1 score of 47.03%)
comment: 53 pages
☆ DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design
Text conditioned generative models for images have yielded impressive results. Text conditioned floorplan generation as a special type of raster image generation task also received particular attention. However there are many use cases in floorpla generation where numerical properties of the generated result are more important than the aesthetics. For instance, one might want to specify sizes for certain rooms in a floorplan and compare the generated floorplan with given specifications Current approaches, datasets and commonly used evaluations do not support these kinds of constraints. As such, an attractive strategy is to generate an intermediate data structure that contains numerical properties of a floorplan which can be used to generate the final floorplan image. To explore this setting we (1) construct a new dataset for this data-structure to data-structure formulation of floorplan generation using two popular image based floorplan datasets RPLAN and ProcTHOR-10k, and provide the tools to convert further procedurally generated ProcTHOR floorplan data into our format. (2) We explore the task of floorplan generation given a partial or complete set of constraints and we design a series of metrics and benchmarks to enable evaluating how well samples generated from models respect the constraints. (3) We create multiple baselines by finetuning a large language model (LLM), Llama3, and demonstrate the feasibility of using floorplan data structure conditioned LLMs for the problem of floorplan generation respecting numerical constraints. We hope that our new datasets and benchmarks will encourage further research on different ways to improve the performance of LLMs and other generative modelling techniques for generating designs where quantitative constraints are only partially specified, but must be respected.
☆ Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite LLM's tremendous success, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open question and largely ununderstood. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks that include linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) For simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, while scaling up the model enhances this ability; (2) for more complex composite tasks that involving reasoning multiple steps, where each step represent one task, models typically underperform, and scaling up generally provide no improvements. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{https://github.com/OliverXUZY/LLM_Compose}}.
☆ GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI
Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classification requires the inclusion of extensive multimodal data, such as assessment scales and various neuroimaging techniques like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). However, consistently tracking the diagnosis of the same individual over time and simultaneously collecting multimodal data poses significant challenges. To address this issue, we introduce GFE-Mamba, a classifier based on Generative Feature Extraction (GFE). This classifier effectively integrates data from assessment scales, MRI, and PET, enabling deeper multimodal fusion. It efficiently extracts both long and short sequence information and incorporates additional information beyond the pixel space. This approach not only improves classification accuracy but also enhances the interpretability and stability of the model. We constructed datasets of over 3000 samples based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) for a two-step training process. Our experimental results demonstrate that the GFE-Mamba model is effective in predicting the conversion from MCI to AD and outperforms several state-of-the-art methods. Our source code and ADNI dataset processing code are available at https://github.com/Tinysqua/GFE-Mamba.
comment: 35 pages, 4 figures
☆ Mamba meets crack segmentation
Cracks pose safety risks to infrastructure and cannot be overlooked. The prevailing structures in existing crack segmentation networks predominantly consist of CNNs or Transformers. However, CNNs exhibit a deficiency in global modeling capability, hindering the representation to entire crack features. Transformers can capture long-range dependencies but suffer from high and quadratic complexity. Recently, Mamba has garnered extensive attention due to its linear spatial and computational complexity and its powerful global perception. This study explores the representation capabilities of Mamba to crack features. Specifically, this paper uncovers the connection between Mamba and the attention mechanism, providing a profound insight, an attention perspective, into interpreting Mamba and devising a novel Mamba module following the principles of attention blocks, namely CrackMamba. We compare CrackMamba with the most prominent visual Mamba modules, Vim and Vmamba, on two datasets comprising asphalt pavement and concrete pavement cracks, and steel cracks, respectively. The quantitative results show that CrackMamba stands out as the sole Mamba block consistently enhancing the baseline model's performance across all evaluation measures, while reducing its parameters and computational costs. Moreover, this paper substantiates that Mamba can achieve global receptive fields through both theoretical analysis and visual interpretability. The discoveries of this study offer a dual contribution. First, as a plug-and-play and simple yet effective Mamba module, CrackMamba exhibits immense potential for integration into various crack segmentation models. Second, the proposed innovative Mamba design concept, integrating Mamba with the attention mechanism, holds significant reference value for all Mamba-based computer vision models, not limited to crack segmentation networks, as investigated in this study.
comment: 32 pages, 8 figures. Preprint submitted to Elsevier
☆ SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams
The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still need to be improved in the utilization of the rich temporal detail and the restoration of the details of the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.
☆ Predicting the Best of N Visual Trackers
We observe that the performance of SOTA visual trackers surprisingly strongly varies across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects a predicted best performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker which keeps predicting best performer after regular temporal intervals. The TP2N is based on self-supervised learning architectures MocoV2, SwAv, BT, and DINO; experiments show that the DINO with ViT-S as a backbone performs the best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks - LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker effectively handling variations in the tracking scenarios within long sequences. For instance, on GOT-10k, BofN meta-tracker average overlap is 88.7% and 91.1% with video and frame-level settings respectively. The best performing tracker, RTS, achieves 85.20% AO. On VOT2022, BofN expected average overlap is 67.88% and 70.98% with video and frame level settings, compared to the best performing ARTrack, 64.12%. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available on https://github.com/BasitAlawode/Best_of_N_Trackers.
☆ A Life-long Learning Intrusion Detection System for 6G-Enabled IoV
The introduction of 6G technology into the Internet of Vehicles (IoV) promises to revolutionize connectivity with ultra-high data rates and seamless network coverage. However, this technological leap also brings significant challenges, particularly for the dynamic and diverse IoV landscape, which must meet the rigorous reliability and security requirements of 6G networks. Furthermore, integrating 6G will likely increase the IoV's susceptibility to a spectrum of emerging cyber threats. Therefore, it is crucial for security mechanisms to dynamically adapt and learn new attack patterns, keeping pace with the rapid evolution and diversification of these threats - a capability currently lacking in existing systems. This paper presents a novel intrusion detection system leveraging the paradigm of life-long (or continual) learning. Our methodology combines class-incremental learning with federated learning, an approach ideally suited to the distributed nature of the IoV. This strategy effectively harnesses the collective intelligence of Connected and Automated Vehicles (CAVs) and edge computing capabilities to train the detection system. To the best of our knowledge, this study is the first to synergize class-incremental learning with federated learning specifically for cyber attack detection. Through comprehensive experiments on a recent network traffic dataset, our system has exhibited a robust adaptability in learning new cyber attack patterns, while effectively retaining knowledge of previously encountered ones. Additionally, it has proven to maintain high accuracy and a low false positive rate.
☆ AI-Driven Fast and Early Detection of IoT Botnet Threats: A Comprehensive Network Traffic Analysis Approach
In the rapidly evolving landscape of cyber threats targeting the Internet of Things (IoT) ecosystem, and in light of the surge in botnet-driven Distributed Denial of Service (DDoS) and brute force attacks, this study focuses on the early detection of IoT bots. It specifically addresses the detection of stealth bot communication that precedes and orchestrates attacks. This study proposes a comprehensive methodology for analyzing IoT network traffic, including considerations for both unidirectional and bidirectional flow, as well as packet formats. It explores a wide spectrum of network features critical for representing network traffic and characterizing benign IoT traffic patterns effectively. Moreover, it delves into the modeling of traffic using various semi-supervised learning techniques. Through extensive experimentation with the IoT-23 dataset - a comprehensive collection featuring diverse botnet types and traffic scenarios - we have demonstrated the feasibility of detecting botnet traffic corresponding to different operations and types of bots, specifically focusing on stealth command and control (C2) communications. The results obtained have demonstrated the feasibility of identifying C2 communication with a 100% success rate through packet-based methods and 94% via flow based approaches, with a false positive rate of 1.53%.
☆ HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning ECCV 2024
Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
comment: Accepted as a main conference paper at ECCV 2024 (https://github.com/google/haloquest)
☆ Flow-guided Motion Prediction with Semantics and Dynamic Occupancy Grid Maps SC
Accurate prediction of driving scenes is essential for road safety and autonomous driving. Occupancy Grid Maps (OGMs) are commonly employed for scene prediction due to their structured spatial representation, flexibility across sensor modalities and integration of uncertainty. Recent studies have successfully combined OGMs with deep learning methods to predict the evolution of scene and learn complex behaviours. These methods, however, do not consider prediction of flow or velocity vectors in the scene. In this work, we propose a novel multi-task framework that leverages dynamic OGMs and semantic information to predict both future vehicle semantic grids and the future flow of the scene. This incorporation of semantic flow not only offers intermediate scene features but also enables the generation of warped semantic grids. Evaluation on the real-world NuScenes dataset demonstrates improved prediction capabilities and enhanced ability of the model to retain dynamic vehicles within the scene.
comment: Accepted for publication at the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC) (ITSC 2024)
☆ Problems in AI, their roots in philosophy, and implications for science and society
Artificial Intelligence (AI) is one of today's most relevant emergent technologies. In view thereof, this paper proposes that more attention should be paid to the philosophical aspects of AI technology and its use. It is argued that this deficit is generally combined with philosophical misconceptions about the growth of knowledge. To identify these misconceptions, reference is made to the ideas of the philosopher of science Karl Popper and the physicist David Deutsch. The works of both thinkers aim against mistaken theories of knowledge, such as inductivism, empiricism, and instrumentalism. This paper shows that these theories bear similarities to how current AI technology operates. It also shows that these theories are very much alive in the (public) discourse on AI, often called Bayesianism. In line with Popper and Deutsch, it is proposed that all these theories are based on mistaken philosophies of knowledge. This includes an analysis of the implications of these mistaken philosophies for the use of AI in science and society, including some of the likely problem situations that will arise. This paper finally provides a realistic outlook on Artificial General Intelligence (AGI) and three propositions on A(G)I and philosophy (i.e., epistemology).
☆ SLVideo: A Sign Language Video Moment Retrieval Framework
Sign Language Recognition has been studied and developed throughout the years to help the deaf and hard-of-hearing people in their day-to-day lives. These technologies leverage manual sign recognition algorithms, however, most of them lack the recognition of facial expressions, which are also an essential part of Sign Language as they allow the speaker to add expressiveness to their dialogue or even change the meaning of certain manual signs. SLVideo is a video moment retrieval software for Sign Language videos with a focus on both hands and facial signs. The system extracts embedding representations for the hand and face signs from video frames to capture the language signs in full. This will then allow the user to search for a specific sign language video segment with text queries, or to search by similar sign language videos. To test this system, a collection of five hours of annotated Sign Language videos is used as the dataset, and the initial results are promising in a zero-shot setting.SLVideo is shown to not only address the problem of searching sign language videos but also supports a Sign Language thesaurus with a search by similarity technique. Project web page: https://novasearch.github.io/SLVideo/
comment: 5 pages, 3 figures, 1 table
☆ How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?
We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution q via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, Linearity, etc.) permutation-equivalence naturally appears in machine learning problems, and to benefit from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establish when the number of observations $(n_k)_{k\in \mathcal K}$ are large enough so that the output confidence sets improve over initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of confidence sets shrink at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in K} n_{k})$, respectively for elements inside and outside the support of q, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation equivalence on a reinforcement learning task.
☆ Evaluation of Reinforcement Learning for Autonomous Penetration Testing using A3C, Q-learning and DQN
Penetration testing is the process of searching for security weaknesses by simulating an attack. It is usually performed by experienced professionals, where scanning and attack tools are applied. By automating the execution of such tools, the need for human interaction and decision-making could be reduced. In this work, a Network Attack Simulator (NASim) was used as an environment to train reinforcement learning agents to solve three predefined security scenarios. These scenarios cover techniques of exploitation, post-exploitation and wiretapping. A large hyperparameter grid search was performed to find the best hyperparameter combinations. The algorithms Q-learning, DQN and A3C were used, whereby A3C was able to solve all scenarios and achieve generalization. In addition, A3C could solve these scenarios with fewer actions than the baseline automated penetration testing. Although the training was performed on rather small scenarios and with small state and action spaces for the agents, the results show that a penetration test can successfully be performed by the RL agent.
☆ Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
comment: Code and data: https://github.com/joyheyueya/psychometric-alignment
☆ RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time and incorporated them into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
☆ Norface: Improving Facial Expression Analysis by Identity Normalization ECCV2024
Facial Expression Analysis remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network struggles to directly remove the above task-irrelevant noise, by maintaining facial expression consistency but normalizing all original images to a common identity with consistent pose, and background. Then, these additional normalized images are fed into the classification network. Due to consistent identity and other factors (e.g. head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset tasks. For the normalized datasets and code please visit {https://norface-fea.github.io/}.
comment: Accepted by ECCV2024
☆ Sustainable broadcasting in Blockchain Network with Reinforcement Learning
Recent estimates put the carbon footprint of Bitcoin and Ethereum at an average of 64 and 26 million tonnes of CO2 per year, respectively. To address this growing problem, several possible approaches have been proposed in the literature: creating alternative blockchain consensus mechanisms, applying redundancy reduction techniques, utilizing renewable energy sources, and employing energy-efficient devices, etc. In this paper, we follow the second avenue and propose an efficient approach based on reinforcement learning that improves the block broadcasting scheme in blockchain networks. The analysis and experimental results confirmed that the proposed improvement of the block propagation scheme could cleverly handle network dynamics and achieve better results than the default approach. Additionally, our technical integration of the simulator and developed RL environment can be used as a complete solution for further study of new schemes and protocols that use RL or other ML techniques.
comment: 7 pages, 4 figures
☆ Can GPT-4 learn to analyze moves in research article abstracts?
One of the most powerful and enduring ideas in written discourse analysis is that genres can be described in terms of the moves which structure a writer's purpose. Considerable research has sought to identify these distinct communicative acts, but analyses have been beset by problems of subjectivity, reliability and the time-consuming need for multiple coders to confirm analyses. In this paper we employ the affordances of GPT-4 to automate the annotation process by using natural language prompts. Focusing on abstracts from articles in four applied linguistics journals, we devise prompts which enable the model to identify moves effectively. The annotated outputs of these prompts were evaluated by two assessors with a third addressing disagreements. The results show that an 8-shot prompt was more effective than one using two, confirming that the inclusion of examples illustrating areas of variability can enhance GPT-4's ability to recognize multiple moves in a single sentence and reduce bias related to textual position. We suggest that GPT-4 offers considerable potential in automating this annotation process, when human actors with domain specific linguistic expertise inform the prompting process.
☆ A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism
Neural architecture search (NAS) enables re-searchers to automatically explore vast search spaces and find efficient neural networks. But NAS suffers from a key bottleneck, i.e., numerous architectures need to be evaluated during the search process, which requires a lot of computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the evaluation time of neural architectures. However, they are not efficient enough and still only focus on the accuracy of architectures. In addition to the classification accuracy, more efficient and smaller network architectures are required in real-world applications. To address the above problems, we propose the SMEM-NAS, a pairwise com-parison relation-assisted multi-objective evolutionary algorithm based on a multi-population mechanism. In the SMEM-NAS, a surrogate model is constructed based on pairwise compari-son relations to predict the accuracy ranking of architectures, rather than the absolute accuracy. Moreover, two populations cooperate with each other in the search process, i.e., a main population guides the evolution, while a vice population expands the diversity. Our method aims to provide high-performance models that take into account multiple optimization objectives. We conduct a series of experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets to verify its effectiveness. With only a single GPU searching for 0.17 days, competitive architectures can be found by SMEM-NAS which achieves 78.91% accuracy with the MAdds of 570M on the ImageNet. This work makes a significant advance in the important field of NAS.
☆ Discrete Flow Matching
Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it allows for a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser ($x$-prediction) and noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.
☆ Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts
Cross-lingual entity alignment (EA) enables the integration of multiple knowledge graphs (KGs) across different languages, providing users with seamless access to diverse and comprehensive knowledge.Existing methods, mostly supervised, face challenges in obtaining labeled entity pairs. To address this, recent studies have shifted towards a self-supervised and unsupervised frameworks. Despite their effectiveness, these approaches have limitations: (1) they mainly focus on entity features, neglecting the semantic information of relations, (2) they assume isomorphism between source and target graphs, leading to noise and reduced alignment accuracy, and (3) they are susceptible to noise in the textual features, especially when encountering inconsistent translations or Out-Of-Vocabulary (OOV) problems. In this paper, we propose ERAlign, an unsupervised and robust cross-lingual EA framework that jointly performs Entity-level and Relation-level Alignment using semantic textual features of relations and entities. Its refinement process iteratively enhances results by fusing entity-level and relation-level alignments based on neighbor triple matching. The additional verification process examines the entities' neighbor triples as the linearized text. This \textit{Align-and-Verify} pipeline that rigorously assesses alignment results, achieving near-perfect alignment even in the presence of noisy textual features of entities. Our extensive experiments demonstrate that robustness and general applicability of \proposed improved the accuracy and effectiveness of EA tasks, contributing significantly to knowledge-oriented applications.
☆ Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
☆ Large-scale Time-Varying Portfolio Optimisation using Graph Attention Networks
Apart from assessing individual asset performance, investors in financial markets also need to consider how a set of firms performs collectively as a portfolio. Whereas traditional Markowitz-based mean-variance portfolios are widespread, network-based optimisation techniques have built upon these developments. However, most studies do not contain firms at risk of default and remove any firms that drop off indices over a certain time. This is the first study to incorporate risky firms and use all the firms in portfolio optimisation. We propose and empirically test a novel method that leverages Graph Attention networks (GATs), a subclass of Graph Neural Networks (GNNs). GNNs, as deep learning-based models, can exploit network data to uncover nonlinear relationships. Their ability to handle high-dimensional features and accommodate customised layers for specific purposes makes them particularly appealing for large-scale problems such as mid- and small-cap portfolio optimization. This study utilises 30 years of data on mid-cap firms, creating graphs of firms using distance correlation and the Triangulated Maximally Filtered Graph approach. These graphs are the inputs to a GAT model that we train using custom layers which impose weight and allocation constraints and a loss function derived from the Sharpe ratio, thus directly maximising portfolio risk-adjusted returns. This new model is benchmarked against a network characteristic-based portfolio, a mean variance-based portfolio, and an equal-weighted portfolio. The results show that the portfolio produced by the GAT-based model outperforms all benchmarks and is consistently superior to other strategies over a long period while also being informative of market dynamics.
comment: 37 pages, 7 figures, v1
☆ Interpretable Concept-Based Memory Reasoning
The lack of transparency in the decision-making processes of deep learning systems presents a significant challenge in modern artificial intelligence (AI), as it impairs users' ability to rely on and verify these systems. To address this challenge, Concept Bottleneck Models (CBMs) have made significant progress by incorporating human-interpretable concepts into deep learning architectures. This approach allows predictions to be traced back to specific concept patterns that users can understand and potentially intervene on. However, existing CBMs' task predictors are not fully interpretable, preventing a thorough analysis and any form of formal verification of their decision-making process prior to deployment, thereby raising significant reliability concerns. To bridge this gap, we introduce Concept-based Memory Reasoner (CMR), a novel CBM designed to provide a human-understandable and provably-verifiable task prediction process. Our approach is to model each task prediction as a neural selection mechanism over a memory of learnable logic rules, followed by a symbolic evaluation of the selected rule. The presence of an explicit memory and the symbolic evaluation allow domain experts to inspect and formally verify the validity of certain global properties of interest for the task prediction process. Experimental results demonstrate that CMR achieves comparable accuracy-interpretability trade-offs to state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification.
☆ Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.
☆ TOM: A Development Platform For Wearable Intelligent Assistants
Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. This system was developed collaboratively with AR/MR researchers, HCI researchers, AI/Robotic researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.
comment: 14 pages, 6 figures, 2 tables
☆ Future-Proofing Mobile Networks: A Digital Twin Approach to Multi-Signal Management
Digital Twins (DTs) are set to become a key enabling technology in future wireless networks, with their use in network management increasing significantly. We developed a DT framework that leverages the heterogeneity of network access technologies as a resource for enhanced network performance and management, enabling smart data handling in the physical network. Tested in a \textit{Campus Area Network} environment, our framework integrates diverse data sources to provide real-time, holistic insights into network performance and environmental sensing. We also envision that traditional analytics will evolve to rely on emerging AI models, such as Generative AI (GenAI), while leveraging current analytics capabilities. This capacity can simplify analytics processes through advanced ML models, enabling descriptive, diagnostic, predictive, and prescriptive analytics in a unified fashion. Finally, we present specific research opportunities concerning interoperability aspects and envision aligning advancements in DT technology with evolved AI integration.
comment: A shortened version of this paper is currently under review for publication in an IEEE magazine. If accepted, the copyright will be transferred to IEEE
☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
☆ Algebraic anti-unification
Abstraction is key to human and artificial intelligence as it allows one to see common structure in otherwise distinct objects or situations and as such it is a key element for generality in AI. Anti-unification (or generalization) is \textit{the} part of theoretical computer science and AI studying abstraction. It has been successfully applied to various AI-related problems, most importantly inductive logic programming. Up to this date, anti-unification is studied only from a syntactic perspective in the literature. The purpose of this paper is to initiate an algebraic (i.e. semantic) theory of anti-unification within general algebras. This is motivated by recent applications to similarity and analogical proportions.
☆ Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
Large Language Models (LLMs) showcase remarkable performance and robust deductive capabilities, yet their expansive size complicates deployment and raises environmental concerns due to substantial resource consumption. The recent development of a quantization technique known as Learnable Singular-value Increment (LSI) has addressed some of these quantization challenges. Leveraging insights from LSI and our extensive research, we have developed innovative methods that enhance the performance of quantized LLMs, particularly in low-bit settings. Our methods consistently deliver state-of-the-art results across various quantization scenarios and offer deep theoretical insights into the quantization process, elucidating the potential of quantized models for widespread application.
comment: Effecient Quantization Methods for LLMs
☆ In-Context Learning Improves Compositional Understanding of Vision-Language Models
Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this work, we investigate the reasons for such a lack of capability by performing an extensive bench-marking of compositional understanding in VLMs. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Furthermore, we leverage In-Context Learning (ICL) as a way to improve the ability of VLMs to perform more complex reasoning and understanding given an image. Our extensive experiments demonstrate that our proposed approach outperforms baseline models across multiple compositional understanding datasets.
☆ A Multi-Level Corroborative Approach for Verification and Validation of Autonomous Robotic Swarms
Modelling and characterizing emergent behaviour within a swarm can pose significant challenges in terms of 'assurance'. Assurance tasks encompass adherence to standards, certification processes, and the execution of verification and validation (V&V) methods, such as model checking. In this study, we propose a holistic, multi-level modelling approach for formally verifying and validating autonomous robotic swarms, which are defined at the macroscopic formal modelling, low-fidelity simulation, high-fidelity simulation, and real-robot levels. Our formal macroscopic models, used for verification, are characterized by data derived from actual simulations, ensuring both accuracy and traceability across different system models. Furthermore, our work combines formal verification with experimental validation involving real robots. In this way, our corroborative approach for V&V seeks to enhance confidence in the evidence, in contrast to employing these methods separately. We explore our approach through a case study focused on a swarm of robots operating within a public cloakroom.
comment: 15 pages, 11 figures
Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs KDD'24
The text-attributed graph (TAG) is one kind of important real-world graph-structured data with each node associated with raw texts. For TAGs, traditional few-shot node classification methods directly conduct training on the pre-processed node features and do not consider the raw texts. The performance is highly dependent on the choice of the feature pre-processing method. In this paper, we propose P2TAG, a framework designed for few-shot node classification on TAGs with graph pre-training and prompting. P2TAG first pre-trains the language model (LM) and graph neural network (GNN) on TAGs with self-supervised loss. To fully utilize the ability of language models, we adapt the masked language modeling objective for our framework. The pre-trained model is then used for the few-shot node classification with a mixed prompt method, which simultaneously considers both text and graph information. We conduct experiments on six real-world TAGs, including paper citation networks and product co-purchasing networks. Experimental results demonstrate that our proposed framework outperforms existing graph few-shot learning methods on these datasets with +18.98% ~ +35.98% improvements.
comment: Accepted to KDD'24
☆ Decoding BACnet Packets: A Large Language Model Approach for Packet Interpretation
The Industrial Control System (ICS) environment encompasses a wide range of intricate communication protocols, posing substantial challenges for Security Operations Center (SOC) analysts tasked with monitoring, interpreting, and addressing network activities and security incidents. Conventional monitoring tools and techniques often struggle to provide a clear understanding of the nature and intent of ICS-specific communications. To enhance comprehension, we propose a software solution powered by a Large Language Model (LLM). This solution currently focused on BACnet protocol, processes a packet file data and extracts context by using a mapping database, and contemporary context retrieval methods for Retrieval Augmented Generation (RAG). The processed packet information, combined with the extracted context, serves as input to the LLM, which generates a concise packet file summary for the user. The software delivers a clear, coherent, and easily understandable summary of network activities, enabling SOC analysts to better assess the current state of the control system.
comment: 12 pages
☆ YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images
With the rapid growth of the PCB manufacturing industry, there is an increasing demand for computer vision inspection to detect defects during production. Improving the accuracy and generalization of PCB defect detection models remains a significant challenge. This paper proposes a high-precision, robust, and real-time end-to-end method for PCB defect detection based on deep Convolutional Neural Networks (CNN). Traditional methods often suffer from low accuracy and limited applicability. We propose a novel approach combining YOLOv5 and multiscale modules for hierarchical residual-like connections. In PCB defect detection, noise can confuse the background and small targets. The YOLOv5 model provides a strong foundation with its real-time processing and accurate object detection capabilities. The multi-scale module extends traditional approaches by incorporating hierarchical residual-like connections within a single block, enabling multiscale feature extraction. This plug-and-play module significantly enhances performance by extracting features at multiple scales and levels, which are useful for identifying defects of varying sizes and complexities. Our multi-scale architecture integrates feature extraction, defect localization, and classification into a unified network. Experiments on a large-scale PCB dataset demonstrate significant improvements in precision, recall, and F1-score compared to existing methods. This work advances computer vision inspection for PCB defect detection, providing a reliable solution for high-precision, robust, real-time, and domain-adaptive defect detection in the PCB manufacturing industry.
☆ Empirical Capacity Model for Self-Attention Neural Networks
Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.
comment: Submitted to BNAIC'24, 14 pages + refs
☆ Integrating IP Broadcasting with Audio Tags- Workflow and Challenges
The broadcasting industry is increasingly adopting IP techniques, revolutionising both live and pre-recorded content production, from news gathering to live music events. IP broadcasting allows for the transport of audio and video signals in an easily configurable way, aligning with modern networking techniques. This shift towards an IP workflow allows for much greater flexibility, not only in routing signals but with the integration of tools using standard web development techniques. One possible tool could include the use of live audio tagging, which has a number of uses in the production of content. These include from automated closed captioning to identifying unwanted sound events within a scene. In this paper, we describe the process of containerising an audio tagging model into a microservice, a small segregated code module that can be integrated into a multitude of different network setups. The goal is to develop a modular, accessible, and flexible tool capable of seamless deployment into broadcasting workflows of all sizes, from small productions to large corporations. Challenges surrounding latency of the selected audio tagging model and its effect on the usefulness of the end product are discussed.
comment: Submitted to DCASE 2024 Workshop
☆ Planning behavior in a recurrent neural network that plays Sokoban ICML 2024
To predict how advanced neural networks generalize to novel situations, it is essential to understand how they reason. Guez et al. (2019, "An investigation of model-free planning") trained a recurrent neural network (RNN) to play Sokoban with model-free reinforcement learning. They found that adding extra computation steps to the start of episodes at test time improves the RNN's success rate. We further investigate this phenomenon, finding that it rapidly emerges early on in training and then slowly fades, but only for comparatively easier levels. The RNN also often takes redundant actions at episode starts, and these are reduced by adding extra computation steps. Our results suggest that the RNN learns to take time to think by `pacing', despite the per-step penalties, indicating that training incentivizes planning capabilities. The small size (1.29M parameters) and interesting behavior of this model make it an excellent model organism for mechanistic interpretability.
comment: Mechanistic Interpretability workshop, ICML 2024
☆ Automated Road Safety: Enhancing Sign and Surface Damage Detection with AI
Public transportation plays a crucial role in our lives, and the road network is a vital component in the implementation of smart cities. Recent advancements in AI have enabled the development of advanced monitoring systems capable of detecting anomalies in road surfaces and road signs, which, if unaddressed, can lead to serious road accidents. This paper presents an innovative approach to enhance road safety through the detection and classification of traffic signs and road surface damage using advanced deep learning techniques. This integrated approach supports proactive maintenance strategies, improving road safety and resource allocation for the Molise region and the city of Campobasso. The resulting system, developed as part of the Casa delle Tecnologie Emergenti (House of Emergent Technologies) Molise (Molise CTE) research project funded by the Italian Minister of Economic Growth (MIMIT), leverages cutting-edge technologies such as Cloud Computing and High Performance Computing with GPU utilization. It serves as a valuable tool for municipalities, enabling quick detection of anomalies and the prompt organization of maintenance operations
comment: 16 pages, 10 figures
☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
comment: Ongoing work (v1); 34 pages, 5 figures
☆ Offline Imitation Learning Through Graph Search and Retrieval
Imitation learning is a powerful machine learning algorithm for a robot to acquire manipulation skills. Nevertheless, many real-world manipulation tasks involve precise and dexterous robot-object interactions, which make it difficult for humans to collect high-quality expert demonstrations. As a result, a robot has to learn skills from suboptimal demonstrations and unstructured interactions, which remains a key challenge. Existing works typically use offline deep reinforcement learning (RL) to solve this challenge, but in practice these algorithms are unstable and fragile due to the deadly triad issue. To overcome this problem, we propose GSR, a simple yet effective algorithm that learns from suboptimal demonstrations through Graph Search and Retrieval. We first use pretrained representation to organize the interaction experience into a graph and perform a graph search to calculate the values of different behaviors. Then, we apply a retrieval-based procedure to identify the best behavior (actions) on each state and use behavior cloning to learn that behavior. We evaluate our method in both simulation and real-world robotic manipulation tasks with complex visual inputs, covering various precise and dexterous manipulation skills with objects of different physical properties. GSR can achieve a 10% to 30% higher success rate and over 30% higher proficiency compared to baselines. Our project page is at https://zhaohengyin.github.io/gsr.
comment: Robotics: Science and Systems (RSS) 2024
☆ Tackling Selfish Clients in Federated Learning ECAI
Federated Learning (FL) is a distributed machine learning paradigm facilitating participants to collaboratively train a model without revealing their local data. However, when FL is deployed into the wild, some intelligent clients can deliberately deviate from the standard training process to make the global model inclined toward their local model, thereby prioritizing their local data distribution. We refer to this novel category of misbehaving clients as selfish. In this paper, we propose a Robust aggregation strategy for FL server to mitigate the effect of Selfishness (in short RFL-Self). RFL-Self incorporates an innovative method to recover (or estimate) the true updates of selfish clients from the received ones, leveraging robust statistics (median of norms) of the updates at every round. By including the recovered updates in aggregation, our strategy offers strong robustness against selfishness. Our experimental results, obtained on MNIST and CIFAR-10 datasets, demonstrate that just 2% of clients behaving selfishly can decrease the accuracy by up to 36%, and RFL-Self can mitigate that effect without degrading the global model performance.
comment: 10 pages, 16 figures. European Conference on Artificial Intelligence (ECAI) 2024
☆ Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?
☆ Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation ECCV 2024
The scene graph generation (SGG) task involves detecting objects within an image and predicting predicates that represent the relationships between the objects. However, in SGG benchmark datasets, each subject-object pair is annotated with a single predicate even though a single predicate may exhibit diverse semantics (i.e., semantic diversity), existing SGG models are trained to predict the one and only predicate for each pair. This in turn results in the SGG models to overlook the semantic diversity that may exist in a predicate, thus leading to biased predictions. In this paper, we propose a novel model-agnostic Semantic Diversity-aware Prototype-based Learning (DPL) framework that enables unbiased predictions based on the understanding of the semantic diversity of predicates. Specifically, DPL learns the regions in the semantic space covered by each predicate to distinguish among the various different semantics that a single predicate can represent. Extensive experiments demonstrate that our proposed model-agnostic DPL framework brings significant performance improvement on existing SGG models, and also effectively understands the semantic diversity of predicates.
comment: 14 pages, ECCV 2024
☆ ALLaM: Large Language Models for Arabic and English
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.
☆ Towards Robust Vision Transformer via Masked Adaptive Ensemble
Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a classifier with two encoders is employed for extracting visual representations respectively from clean images and adversarial examples, with our adaptive ensemble to adaptively adjust the proportion of visual representations from the two encoders for accurate classification. This design enables our ViT architecture to achieve a better trade-off between standard accuracy and robustness. Besides, our adaptive ensemble technique allows us to mask off a random subset of image patches within input data, boosting our ViT's robustness against adaptive attacks, while maintaining high standard accuracy. Experimental results exhibit that our ViT architecture, on CIFAR-10, achieves the best standard accuracy and adversarial robustness of 90.3% and 49.8%, respectively.
comment: 9 pages
☆ Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias
The common toxicity and societal bias in contents generated by large language models (LLMs) necessitate strategies to reduce harm. Present solutions often demand white-box access to the model or substantial training, which is impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting methods depend on external tool feedback and fail to simultaneously lessen toxicity and bias. Motivated by social psychology principles, we propose a novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that inspires LLMs to integrate diverse human perspectives and self-regulate their responses. This self-correction mechanism can significantly diminish toxicity (up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less harmful responses, outperforming five strong baselines.
☆ A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Remarkable strides in computational pathology have been made in the task-agnostic foundation model that advances the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or vision-captions data, disregarding invaluable pathology reports and gene expression profiles which respectively offer distinct knowledge for versatile clinical applications. Second, the current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H\&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm which injects multimodal knowledge at the whole-slide context into the pathology FM, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the workflow of pretraining for CPath, which enables the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch-level to slide-level. To systematically evaluate the capabilities of mSTAR, extensive experiments including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. The average performance in various slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared to SOTA FMs.
comment: 44 pages, 9 figures
☆ X-Recon: Learning-based Patient-specific High-Resolution CT Reconstruction from Orthogonal X-Ray Images
Rapid and accurate diagnosis of pneumothorax, utilizing chest X-ray and computed tomography (CT), is crucial for assisted diagnosis. Chest X-ray is commonly used for initial localization of pneumothorax, while CT ensures accurate quantification. However, CT scans involve high radiation doses and can be costly. To achieve precise quantitative diagnosis while minimizing radiation exposure, we proposed X-Recon, a CT ultra-sparse reconstruction network based on ortho-lateral chest X-ray images. X-Recon integrates generative adversarial networks (GANs), including a generator with a multi-scale fusion rendering module and a discriminator enhanced by 3D coordinate convolutional layers, designed to facilitate CT reconstruction. To improve precision, a projective spatial transformer is utilized to incorporate multi-angle projection loss. Additionally, we proposed PTX-Seg, a zero-shot pneumothorax segmentation algorithm, combining image processing techniques with deep-learning models for the segmentation of air-accumulated regions and lung structures. Experiments on a large-scale dataset demonstrate its superiority over existing approaches. X-Recon achieved a significantly higher reconstruction resolution with a higher average spatial resolution and a lower average slice thickness. The reconstruction metrics achieved state-of-the-art performance in terms of several metrics including peak signal-to-noise ratio. The zero-shot segmentation algorithm, PTX-Seg, also demonstrated high segmentation precision for the air-accumulated region, the left lung, and the right lung. Moreover, the consistency analysis for the pneumothorax chest occupancy ratio between reconstructed CT and original CT obtained a high correlation coefficient. Code will be available at: https://github.com/wangyunpengbio/X-Recon
☆ LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem.
comment: Preprint Paper with 13 pages
☆ Robust personalized pricing under uncertainty of purchase probabilities
This paper is concerned with personalized pricing models aimed at maximizing the expected revenues or profits for a single item. While it is essential for personalized pricing to predict the purchase probabilities for each consumer, these predicted values are inherently subject to unavoidable errors that can negatively impact the realized revenues and profits. To address this issue, we focus on robust optimization techniques that yield reliable solutions to optimization problems under uncertainty. Specifically, we propose a robust optimization model for personalized pricing that accounts for the uncertainty of predicted purchase probabilities. This model can be formulated as a mixed-integer linear optimization problem, which can be solved exactly using mathematical optimization solvers. We also develop a Lagrangian decomposition algorithm combined with line search to efficiently find high-quality solutions for large-scale optimization problems. Experimental results demonstrate the effectiveness of our robust optimization model and highlight the utility of our Lagrangian decomposition algorithm in terms of both computational efficiency and solution quality.
☆ Odyssey: Empowering Agents with Open-World Skills
Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce ODYSSEY, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new open-world benchmark includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task. Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.
☆ FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification
Histopathological image classification constitutes a pivotal task in computer-aided diagnostics. The precise identification and categorization of histopathological images are of paramount significance for early disease detection and treatment. In the diagnostic process of pathologists, a multi-tiered approach is typically employed to assess abnormalities in cell regions at different magnifications. However, feature extraction is often performed at a single granularity, overlooking the multi-granular characteristics of cells. To address this issue, we propose the Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN). Inspired by the multi-granular diagnostic approach of pathologists, we perform feature extraction on cell structures at coarse, medium, and fine granularity, enabling the model to fully harness the information in histopathological images. We incorporate the theory of fuzzy logic to address the challenge of redundant key information arising during multi-granular feature extraction. Cell features are described from different perspectives using multiple fuzzy membership functions, which are fused to create universal fuzzy features. A fuzzy-guided cross-attention module guides universal fuzzy features toward multi-granular features. We propagate these features through an encoder to all patch tokens, aiming to achieve enhanced classification accuracy and robustness. In experiments on multiple public datasets, our model exhibits a significant improvement in accuracy over commonly used classification methods for histopathological image classification and shows commendable interpretability.
comment: This paper has been accepted by IEEE Transactions on Fuzzy Systems for publication. Permission from IEEE must be obtained for all other uses, in any current or future media. The final version is available at [doi: 10.1109/TFUZZ.2024.3410929]
♻ ☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm. Version Update: Figure 6
♻ ☆ ACEGEN: Reinforcement learning of generative chemical agents for drug discovery
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at \url{https://github.com/acellera/acegen-open} and available for use under the MIT license.
♻ ☆ Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that fine-tuning models with synthetic data generated by the Rainbow Teaming method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.
♻ ☆ Turing's Test, a Beautiful Thought Experiment
In the wake of the latest trends of artificial intelligence (AI), there has been a resurgence of claims and questions about the Turing test and its value, which are reminiscent of decades of practical "Turing" tests. If AI were quantum physics, by now several "Schr\"odinger's" cats would have been killed. It is time for a historical reconstruction of Turing's beautiful thought experiment. This paper presents a wealth of evidence, including new archival sources, and gives original answers to several open questions about Turing's 1950 paper, including its relation with early AI.
comment: Author's accepted manuscript to appear in IEEE Annals of the History of Computing. "\copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media ..."
♻ ☆ ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
♻ ☆ DropKAN: Regularizing KANs by masking post-activations
We propose DropKAN (Dropout Kolmogorov-Arnold Networks) a regularization method that prevents co-adaptation of activation function weights in Kolmogorov-Arnold Networks (KANs). DropKAN operates by randomly masking some of the post-activations within the KANs computation graph, while scaling-up the retained post-activations. We show that this simple procedure that require minimal coding effort has a regularizing effect and consistently lead to better generalization of KANs. We analyze the adaptation of the standard Dropout with KANs and demonstrate that Dropout applied to KANs' neurons can lead to unpredictable behaviour in the feedforward pass. We carry an empirical study with real world Machine Learning datasets to validate our findings. Our results suggest that DropKAN is consistently a better alternative to using standard Dropout with KANs, and improves the generalization performance of KANs. Our implementation of DropKAN is available at: \url{https://github.com/Ghaith81/dropkan}.
♻ ☆ Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics, relying on manually annotated videos to predict fixed object categories. This limits their generalizability to unseen surgical procedures and tasks. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals for multi-modal representation learning, bypassing manual annotations. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions. We introduce SurgVLP - Surgical Vision Language Pre-training - a novel method for multi-modal representation learning. SurgVLP employs a new contrastive learning objective, aligning video clip embeddings with corresponding multiple text embeddings in a joint latent space. We demonstrate the representational capability of this space through several vision-and-language surgical tasks and vision-only tasks specific to surgery. Unlike current fully supervised approaches, SurgVLP adapts to different surgical procedures and tasks without specific fine-tuning, achieving zero-shot adaptation to tasks such as surgical tool, phase, and triplet recognition without manual annotation. These results highlight the transferability and versatility of the learned multi-modal representations in surgical video analysis. The code is available at https://github.com/CAMMA-public/SurgVLP
♻ ☆ Interrogating AI: Characterizing Emergent Playful Interactions with ChatGPT
In an era of AI's growing capabilities and influences, recent advancements are constantly reshaping HCI and CSCW's view of AI. Playful interactions with AI systems naturally emerged as an important way for users to make sense of the ever-changing technology. However, these emergent and playful interactions are underexamined. We target this gap by investigating playful interactions exhibited by users of an emerging AI technology, ChatGPT. Through a thematic analysis of 372 user-generated posts on the ChatGPT subreddit, we found that more than half of user discourse revolves around playful interactions. The analysis further allowed us to construct a preliminary framework to describe these interactions, categorizing them into six types: reflecting, jesting, imitating, challenging, tricking, and contriving; each included sub-categories. This study contributes to the field of HCI and CSCW by illuminating the multifaceted nature of playful interactions with AI, underlining their significance in helping users assess AI agency, shaping the human-AI relationship, and offering rich implications to AI system design.
comment: 23 pages
♻ ☆ FAIntbench: A Holistic and Precise Benchmark for Bias Evaluation in Text-to-Image Models
The rapid development and reduced barriers to entry for Text-to-Image (T2I) models have raised concerns about the biases in their outputs, but existing research lacks a holistic definition and evaluation framework of biases, limiting the enhancement of debiasing techniques. To address this issue, we introduce FAIntbench, a holistic and precise benchmark for biases in T2I models. In contrast to existing benchmarks that evaluate bias in limited aspects, FAIntbench evaluate biases from four dimensions: manifestation of bias, visibility of bias, acquired attributes, and protected attributes. We applied FAIntbench to evaluate seven recent large-scale T2I models and conducted human evaluation, whose results demonstrated the effectiveness of FAIntbench in identifying various biases. Our study also revealed new research questions about biases, including the side-effect of distillation. The findings presented here are preliminary, highlighting the potential of FAIntbench to advance future research aimed at mitigating the biases in T2I models. Our benchmark is publicly available to ensure the reproducibility.
♻ ☆ Explorative Imitation Learning: A Path Signature Approach for Continuous Environments ECAI
Some imitation learning methods combine behavioural cloning with self-supervision to infer actions from state pairs. However, most rely on a large number of expert trajectories to increase generalisation and human intervention to capture key aspects of the problem, such as domain constraints. In this paper, we propose Continuous Imitation Learning from Observation (CILO), a new method augmenting imitation learning with two important features: (i) exploration, allowing for more diverse state transitions, requiring less expert trajectories and resulting in fewer training iterations; and (ii) path signatures, allowing for automatic encoding of constraints, through the creation of non-parametric representations of agents and expert trajectories. We compared CILO with a baseline and two leading imitation learning methods in five environments. It had the best overall performance of all methods in all environments, outperforming the expert in two of them.
comment: This paper has been accepted in the 27th European Conference on Artificial Intelligence (ECAI) 2024
♻ ☆ SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
Face swapping has gained significant attention for its varied applications. Most previous face swapping approaches have relied on the seesaw game training scheme, also known as the target-oriented approach. However, this often leads to instability in model training and results in undesired samples with blended identities due to the target identity leakage problem. Source-oriented methods achieve more stable training with self-reconstruction objective but often fail to accurately reflect target image's skin color and illumination. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach that combines the strengths of both target-oriented and source-oriented approaches. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing clear ground truth through its self-reconstruction training regime. Our model effectively mitigates identity leakage and reflects target albedo and illumination through learned disentangled identity and non-identity features. Additionally, we closely tackle the shape misalignment and volume discrepancy problems with new techniques, including perforation confusion and random mesh scaling. SAMAE establishes a new state-of-the-art, surpassing other baseline methods, preserving both identity and non-identity attributes without sacrificing on either aspect.
♻ ☆ Beyond Memorization: The Challenge of Random Memory Access in Language Models ACL 2024
Recent developments in Language Models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks. However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully-designed synthetic tasks, covering the scenarios of full recitation, selective recitation and grounded question answering, we reveal that LMs manage to sequentially access their memory while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic scenarios of open-domain question answering, we validate that enhancing random access by recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at https://github.com/sail-sg/lm-random-memory-access.
comment: 9 pages, 4 figures; accepted by ACL 2024 (oral)
♻ ☆ Multimodal Explainability via Latent Shift applied to COVID-19 stratification
We are witnessing a widespread adoption of artificial intelligence in healthcare. However, most of the advancements in deep learning in this area consider only unimodal data, neglecting other modalities. Their multimodal interpretation necessary for supporting diagnosis, prognosis and treatment decisions. In this work we present a deep architecture, which jointly learns modality reconstructions and sample classifications using tabular and imaging data. The explanation of the decision taken is computed by applying a latent shift that, simulates a counterfactual prediction revealing the features of each modality that contribute the most to the decision and a quantitative score indicating the modality importance. We validate our approach in the context of COVID-19 pandemic using the AIforCOVID dataset, which contains multimodal data for the early identification of patients at risk of severe outcome. The results show that the proposed method provides meaningful explanations without degrading the classification performance.
♻ ☆ Predictive Coding Networks and Inference Learning: Tutorial and Survey
Recent years have witnessed a growing call for renewed emphasis on neuroscience-inspired approaches in artificial intelligence research, under the banner of NeuroAI. A prime example of this is predictive coding networks (PCNs), based on the neuroscientific framework of predictive coding. This framework views the brain as a hierarchical Bayesian inference model that minimizes prediction errors through feedback connections. Unlike traditional neural networks trained with backpropagation (BP), PCNs utilize inference learning (IL), a more biologically plausible algorithm that explains patterns of neural activity that BP cannot. Historically, IL has been more computationally intensive, but recent advancements have demonstrated that it can achieve higher efficiency than BP with sufficient parallelization. Furthermore, PCNs can be mathematically considered a superset of traditional feedforward neural networks (FNNs), significantly extending the range of trainable architectures. As inherently probabilistic (graphical) latent variable models, PCNs provide a versatile framework for both supervised learning and unsupervised (generative) modeling that goes beyond traditional artificial neural networks. This work provides a comprehensive review and detailed formal specification of PCNs, particularly situating them within the context of modern ML methods. Additionally, we introduce a Python library (PRECO) for practical implementation. This positions PC as a promising framework for future ML innovations.
comment: 46 pages, 13 figures, 8 tables
♻ ☆ GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents' performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores but not comparable to human levels.
♻ ☆ Chain of Code: Reasoning with a Language Model-Augmented Code Emulator ICML 2024
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
comment: ICML 2024 Oral; Project webpage: https://chain-of-code.github.io
♻ ☆ LLM experiments with simulation: Large Language Model Multi-Agent System for Simulation Model Parametrization in Digital Twins
This paper presents a novel design of a multi-agent system framework that applies large language models (LLMs) to automate the parametrization of simulation models in digital twins. This framework features specialized LLM agents tasked with observing, reasoning, decision-making, and summarizing, enabling them to dynamically interact with digital twin simulations to explore parametrization possibilities and determine feasible parameter settings to achieve an objective. The proposed approach enhances the usability of simulation model by infusing it with knowledge heuristics from LLM and enables autonomous search for feasible parametrization to solve a user task. Furthermore, the system has the potential to increase user-friendliness and reduce the cognitive load on human users by assisting in complex decision-making processes. The effectiveness and functionality of the system are demonstrated through a case study, and the visualized demos and codes are available at a GitHub Repository: https://github.com/YuchenXia/LLMDrivenSimulation
comment: Submitted to IEEE-ETFA2024, under peer-review
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
comment: 20 pages; 13 visualizations (nine figures, four tables)
♻ ☆ Learning Non-Vacuous Generalization Bounds from Optimization
One of the fundamental challenges in the deep learning community is to theoretically understand how well a deep neural network generalizes to unseen data. However, current approaches often yield generalization bounds that are either too loose to be informative of the true generalization error or only valid to the compressed nets. In this study, we present a simple yet non-vacuous generalization bound from the optimization perspective. We achieve this goal by leveraging that the hypothesis set accessed by stochastic gradient algorithms is essentially fractal-like and thus can derive a tighter bound over the algorithm-dependent Rademacher complexity. The main argument rests on modeling the discrete-time recursion process via a continuous-time stochastic differential equation driven by fractional Brownian motion. Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks such as ResNet and Vision Transformer, even when they are trained on a large-scale dataset (e.g. ImageNet-1K).
comment: 35pages
♻ ☆ Surrogate Neural Networks Local Stability for Aircraft Predictive Maintenance
Surrogate Neural Networks are nowadays routinely used in industry as substitutes for computationally demanding engineering simulations (e.g., in structural analysis). They allow to generate faster predictions and thus analyses in industrial applications e.g., during a product design, testing or monitoring phases. Due to their performance and time-efficiency, these surrogate models are now being developed for use in safety-critical applications. Neural network verification and in particular the assessment of their robustness (e.g., to perturbations) is the next critical step to allow their inclusion in real-life applications and certification. We assess the applicability and scalability of empirical and formal methods in the context of aircraft predictive maintenance for surrogate neural networks designed to predict the stress sustained by an aircraft part from external loads. The case study covers a high-dimensional input and output space and the verification process thus accommodates multi-objective constraints. We explore the complementarity of verification methods in assessing the local stability property of such surrogate models to input noise. We showcase the effectiveness of sequentially combining methods in one verification 'pipeline' and demonstrate the subsequent gain in runtime required to assess the targeted property.
comment: Peer-reviewed and accepted at the 29th International Conference on Formal Methods for Industrial Critical Systems (FMICS 2024) - 15 pages
♻ ☆ Graph Condensation: A Survey
The rapid growth of graph data poses significant challenges in storage, transmission, and particularly the training of graph neural networks (GNNs). To address these challenges, graph condensation (GC) has emerged as an innovative solution. GC focuses on synthesizing a compact yet highly representative graph, enabling GNNs trained on it to achieve performance comparable to those trained on the original large graph. The notable efficacy of GC and its broad prospects have garnered significant attention and spurred extensive research. This survey paper provides an up-to-date and systematic overview of GC, organizing existing research into five categories aligned with critical GC evaluation criteria: effectiveness, generalization, efficiency, fairness, and robustness. To facilitate an in-depth and comprehensive understanding of GC, this paper examines various methods under each category and thoroughly discusses two essential components within GC: optimization strategies and condensed graph generation. We also empirically compare and analyze representative GC methods with diverse optimization strategies based on the five proposed GC evaluation criteria. Finally, we explore the applications of GC in various fields, outline the related open-source libraries, and highlight the present challenges and novel insights, with the aim of promoting advancements in future research. The related resources can be found at https://github.com/XYGaoG/Graph-Condensation-Papers.
♻ ☆ A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs 2) How to measure the medical task performance of LLMs in a medical setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How to more effectively develop and deploy medical LLMs? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide
comment: Preprint. Version 6. Update Figures 1-5; Tables 2-3; 31 pages
♻ ☆ Cross-Speaker Encoding Network for Multi-Talker Speech Recognition ICASSP2024
End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output training (SOT). In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. Furthermore, the CSE model is integrated with SOT to leverage both the advantages of SIMO and SISO while mitigating their drawbacks. To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CES model reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech compared to the SOT model. Code is available at https://github.com/kjw11/CSEnet-ASR.
comment: Accepted by ICASSP2024
♻ ☆ KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search
Quantum architecture search (QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. While multi-layer perceptron-based deep Q-networks have been applied for QAS, their interpretability remains challenging due to the high number of parameters. In this work, we evaluate the practicality of Kolmogorov-Arnold Networks (KANs) in QAS problems, analyzing their efficiency in the task of quantum state preparation and quantum chemistry. In quantum state preparation, our results show that in a noiseless scenario, the probability of success and the number of optimal quantum circuit configurations to generate the multi-qubit maximally entangled states are $2\times$ to $5\times$ higher than Multi-Layer perceptions (MLPs). Moreover, in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. In tackling quantum chemistry problems, we enhance the recently proposed QAS algorithm by integrating Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the traditional MLP. This modification allows us to design a parameterized quantum circuit that contains fewer 2-qubit gates and has a shallower depth, thereby improving the efficiency of finding the ground state of a chemical Hamiltonian. Further investigation reveals that KAN requires a significantly smaller number of learnable parameters compared to MLPs; however, the average time of executing each episode for KAN is higher.
comment: 11 pages and 5 figures, 7 tables. New experiments added and typo removed
♻ ☆ Optimal Layout Synthesis for Deep Quantum Circuits on NISQ Processors with 100+ Qubits
Layout synthesis is mapping a quantum circuit to a quantum processor. SWAP gate insertions are needed for scheduling 2-qubit gates only on connected physical qubits. With the ever-increasing number of qubits in NISQ processors, scalable layout synthesis is of utmost importance. With large optimality gaps observed in heuristic approaches, scalable exact methods are needed. While recent exact and near-optimal approaches scale to moderate circuits, large deep circuits are still out of scope. In this work, we propose a SAT encoding based on parallel plans that apply 1 SWAP and a group of CNOTs at each time step. Using domain-specific information, we maintain optimality in parallel plans while scaling to large and deep circuits. From our results, we show the scalability of our approach which significantly outperforms leading exact and near-optimal approaches (up to 100x). For the first time, we can optimally map several 8, 14, and 16 qubit circuits onto 54, 80, and 127 qubit platforms with up to 17 SWAPs. While adding optimal SWAPs, we also report near-optimal depth in our mapped circuits.
comment: 7 Figures, 4 Tables, 1 Listing
♻ ☆ Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection
The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with the recent LLMs and their advanced generation capabilities, many of the detectable features (e.g., style-conversion attacks) can be altered, making it more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.
comment: 8 pages
♻ ☆ BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10%, in terms of F1-score, achieving an average score of 83.19 and 68.5 over specific and general approaches respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions over the 10.929.533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War, over a 16-month period. This dataset was created in collaboration with [Shevtsov et al., 2022a] where the original authors share anonymized tweets on the discussion of the Russo-Ukrainian war with a total amount of 127.275.386 tweets. The combination of the existing text dataset and the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
♻ ☆ Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data ICML 2024
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data. Both Ambient Diffusion and alternative SURE-based approaches for learning diffusion models from corrupted data resort to approximations which deteriorate performance. We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data, solving an open problem in this space. Our key technical contribution is a method that uses a double application of Tweedie's formula and a consistency loss function that allows us to extend sampling at noise levels below the observed data noise. We also provide further evidence that diffusion models memorize from their training sets by identifying extremely corrupted images that are almost perfectly reconstructed, raising copyright and privacy concerns. Our method for training using corrupted samples can be used to mitigate this problem. We demonstrate this by fine-tuning Stable Diffusion XL to generate samples from a distribution using only noisy samples. Our framework reduces the amount of memorization of the fine-tuning dataset, while maintaining competitive performance.
comment: Accepted to ICML 2024
♻ ☆ REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent's learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's performance in test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help in the optimization of the training tasks, resulting in improved learning efficiency and final performance.
♻ ☆ MSegRNN:Enhanced SegRNN Model with Mamba for Long-Term Time Series Forecasting
The field of long-term time series forecasting demands handling extensive look-back windows and long-range prediction steps, posing significant challenges for RNN-based methodologies. Among these, SegRNN, a robust RNN-driven model, has gained considerable attention in LTSF analysis for achieving state-of-the-art results while maintaining a remarkably streamlined architecture. Concurrently, the Mamba structure has demonstrated its advantages in small to medium-sized models due to its capability for information selection. This study introduces a variant of SegRNN that preprocesses information using a fine-tuned single-layer Mamba structure. Additionally, it incorporates implicit segmentation and residual structures into the model's encoding section to further reduce the inherent data iterative cycles of RNN architectures and implicitly integrate inter-channel correlations. This variant, named MSegRNN, utilizes the Mamba structure to select useful information, resulting in a transformed sequence. The linear-strategy-adapted derivative retains the superior memory efficiency of the original SegRNN while demonstrating enhanced performance. Empirical evaluations on real-world LTSF datasets demonstrate the superior performance of our model, thereby contributing to the advancement of LTSF methodologies.
♻ ☆ Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model
Large Language Models (LLMs) have demonstrated exceptional proficiency in mathematical reasoning tasks due to their extensive parameter counts and training on vast datasets. Despite these capabilities, deploying LLMs is hindered by their computational demands. Distilling LLM mathematical reasoning into Smaller Language Models (SLMs) has emerged as a solution to this challenge, although these smaller models often suffer from errors in calculation and semantic understanding. Prior work has proposed Program-of-Thought Distillation (PoTD) to avoid calculation error. To further address semantic understanding errors, we propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD). KPDD enhances the reasoning performance of SLMs by breaking down the problem-solving process into three stages: Core Question Extraction, Problem-Solving Information Extraction, and Step-by-Step Solution. This method is further divided into KPDD-CoT, which generates Chain-of-Thought rationales, and KPDD-PoT, which creates Program-of-Thought rationales. The experiment results show that KPDD-CoT significantly improves reasoning abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical reasoning tasks. Our approach effectively mitigates misunderstanding errors, advancing the deployment of efficient and capable SLMs.
♻ ☆ Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.
♻ ☆ PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers ECCV
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require to rethink the geometric priors that can be used for unsupervised part discovery.
comment: Accepted as a main conference paper at the European Conference of Computer Vision (ECCV) 2024
♻ ☆ CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on \textit{de novo} molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of \textit{de novo} molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at \url{https://github.com/Edapinenut/CBGBench}.
comment: 9 pages main context
♻ ☆ EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation ACL 2022
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior performance against the conventional MNMT by constructing multi-way aligned corpus, i.e., aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, since exactly identical sentences from different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To handle this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences; and then generate the final aligned examples from the candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus whose diversity is almost identical to the original bilingual corpus. Experiments on two publicly available datasets i.e., WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets respectively.
comment: Accepted as a long paper at ACL 2022
♻ ☆ Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields
Understanding the interactions of atoms such as forces in 3D atomistic systems is fundamental to many applications like molecular dynamics and catalyst design. However, simulating these interactions requires compute-intensive ab initio calculations and thus results in limited data for training neural networks. In this paper, we propose to use denoising non-equilibrium structures (DeNS) as an auxiliary task to better leverage training data and improve performance. For training with DeNS, we first corrupt a 3D structure by adding noise to its 3D coordinates and then predict the noise. Different from previous works on denoising, which are limited to equilibrium structures, the proposed method generalizes denoising to a much larger set of non-equilibrium structures. The main difference is that a non-equilibrium structure does not correspond to local energy minima and has non-zero forces, and therefore it can have many possible atomic positions compared to an equilibrium structure. This makes denoising non-equilibrium structures an ill-posed problem since the target of denoising is not uniquely defined. Our key insight is to additionally encode the forces of the original non-equilibrium structure to specify which non-equilibrium structure we are denoising. Concretely, given a corrupted non-equilibrium structure and the forces of the original one, we predict the non-equilibrium structure satisfying the input forces instead of any arbitrary structures. Since DeNS requires encoding forces, DeNS favors equivariant networks, which can easily incorporate forces and other higher-order tensors in node embeddings. We study the effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17 datasets and demonstrate that DeNS can achieve new state-of-the-art results on OC20 and OC22 and significantly improve training efficiency on MD17.
♻ ☆ From Black Boxes to Conversations: Incorporating XAI in a Conversational Agent
The goal of Explainable AI (XAI) is to design methods to provide insights into the reasoning process of black-box models, such as deep neural networks, in order to explain them to humans. Social science research states that such explanations should be conversational, similar to human-to-human explanations. In this work, we show how to incorporate XAI in a conversational agent, using a standard design for the agent comprising natural language understanding and generation components. We build upon an XAI question bank, which we extend by quality-controlled paraphrases, to understand the user's information needs. We further systematically survey the literature for suitable explanation methods that provide the information to answer those questions, and present a comprehensive list of suggestions. Our work is the first step towards truly natural conversations about machine learning models with an explanation agent. The comprehensive list of XAI questions and the corresponding explanation methods may support other researchers in providing the necessary information to address users' demands. To facilitate future work, we release our source code and data.
comment: Accepted at The World Conference on eXplainable Artificial Intelligence 2023 (XAI-2023)
♻ ☆ TAPI: Towards Target-Specific and Adversarial Prompt Injection against Code LLMs
Recently, code-oriented large language models (Code LLMs) have been widely and successfully used to simplify and facilitate code programming. With these tools, developers can easily generate desired complete functional codes based on incomplete code and natural language prompts. However, a few pioneering works revealed that these Code LLMs are also vulnerable, e.g., against backdoor and adversarial attacks. The former could induce LLMs to respond to triggers to insert malicious code snippets by poisoning the training data or model parameters, while the latter can craft malicious adversarial input codes to reduce the quality of generated codes. However, both attack methods have underlying limitations: backdoor attacks rely on controlling the model training process, while adversarial attacks struggle with fulfilling specific malicious purposes. To inherit the advantages of both backdoor and adversarial attacks, this paper proposes a new attack paradigm, i.e., target-specific and adversarial prompt injection (TAPI), against Code LLMs. TAPI generates unreadable comments containing information about malicious instructions and hides them as triggers in the external source code. When users exploit Code LLMs to complete codes containing the trigger, the models will generate attacker-specified malicious code snippets at specific locations. We evaluate our TAPI attack on four representative LLMs under three representative malicious objectives and seven cases. The results show that our method is highly threatening (achieving an attack success rate of up to 98.3%) and stealthy (saving an average of 53.1% of tokens in the trigger design). In particular, we successfully attack some famous deployed code completion integrated applications, including CodeGeex and Github Copilot. This further confirms the realistic threat of our attack.
♻ ☆ ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems requiring multi-step actions. However, their ability to handle multi-dimensional difficulty levels, diverse task types, and real-world demands through APIs remains unknown. In this paper, we introduce \textsc{ShortcutsBench}, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving tasks with varying levels of difficulty, diverse task types, and real-world demands. \textsc{ShortcutsBench} includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries from shortcuts, human-annotated high-quality action sequences from shortcut developers, and accurate parameter filling values about primitive parameter types, enum parameter types, outputs from previous actions, and parameters that need to request necessary information from the system or user. Our extensive evaluation of agents built with $5$ leading open-source (size >= 57B) and $4$ closed-source LLMs (e.g. Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges that API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at \url{https://github.com/eachsheep/shortcutsbench}.
♻ ☆ LLM4ED: Large Language Models for Automatic Equation Discovery
Equation discovery is aimed at directly extracting physical laws from data and has emerged as a pivotal research domain. Previous methods based on symbolic mathematics have achieved substantial advancements, but often require the design of implementation of complex algorithms. In this paper, we introduce a new framework that utilizes natural language-based prompts to guide large language models (LLMs) in automatically mining governing equations from data. Specifically, we first utilize the generation capability of LLMs to generate diverse equations in string form, and then evaluate the generated equations based on observations. In the optimization phase, we propose two alternately iterated strategies to optimize generated equations collaboratively. The first strategy is to take LLMs as a black-box optimizer and achieve equation self-improvement based on historical samples and their performance. The second strategy is to instruct LLMs to perform evolutionary operators for global search. Experiments are extensively conducted on both partial differential equations and ordinary differential equations. Results demonstrate that our framework can discover effective equations to reveal the underlying physical laws under various nonlinear dynamic systems. Further comparisons are made with state-of-the-art models, demonstrating good stability and usability. Our framework substantially lowers the barriers to learning and applying equation discovery techniques, demonstrating the application potential of LLMs in the field of knowledge discovery.
♻ ☆ Optimal Defender Strategies for CAGE-2 using Causal Modeling and Tree Search
The CAGE-2 challenge is considered a standard benchmark to compare methods for autonomous cyber defense. Current state-of-the-art methods evaluated against this benchmark are based on model-free (offline) reinforcement learning, which does not provide provably optimal defender strategies. We address this limitation and present a formal (causal) model of CAGE-2 together with a method that produces a provably optimal defender strategy, which we call Causal Partially Observable Monte-Carlo Planning (C-POMCP). It has two key properties. First, it incorporates the causal structure of the target system, i.e., the causal relationships among the system variables. This structure allows for a significant reduction of the search space of defender strategies. Second, it is an online method that updates the defender strategy at each time step via tree search. Evaluations against the CAGE-2 benchmark show that C-POMCP achieves state-of-the-art performance with respect to effectiveness and is two orders of magnitude more efficient in computing time than the closest competitor method.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ CAD-Prompted Generative Models: A Pathway to Feasible and Novel Engineering Designs
Text-to-image generative models have increasingly been used to assist designers during concept generation in various creative domains, such as graphic design, user interface design, and fashion design. However, their applications in engineering design remain limited due to the models' challenges in generating images of feasible designs concepts. To address this issue, this paper introduces a method that improves the design feasibility by prompting the generation with feasible CAD images. In this work, the usefulness of this method is investigated through a case study with a bike design task using an off-the-shelf text-to-image model, Stable Diffusion 2.1. A diverse set of bike designs are produced in seven different generation settings with varying CAD image prompting weights, and these designs are evaluated on their perceived feasibility and novelty. Results demonstrate that the CAD image prompting successfully helps text-to-image models like Stable Diffusion 2.1 create visibly more feasible design images. While a general tradeoff is observed between feasibility and novelty, when the prompting weight is kept low around 0.35, the design feasibility is significantly improved while its novelty remains on par with those generated by text prompts alone. The insights from this case study offer some guidelines for selecting the appropriate CAD image prompting weight for different stages of the engineering design process. When utilized effectively, our CAD image prompting method opens doors to a wider range of applications of text-to-image models in engineering design.
comment: 11 pages, 3 figures, 4 tables
♻ ☆ Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL ECCV 2024
Embodied visual tracking is to follow a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models(VFM) and offline reinforcement learning(offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., Nvidia RTX 3090. We evaluate our agent on several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
comment: ECCV 2024
♻ ☆ Socially Integrated Navigation: A Social Acting Robot with Deep Reinforcement Learning IROS
Mobile robots are being used on a large scale in various crowded situations and become part of our society. The socially acceptable navigation behavior of a mobile robot with individual human consideration is an essential requirement for scalable applications and human acceptance. Deep Reinforcement Learning (DRL) approaches are recently used to learn a robot's navigation policy and to model the complex interactions between robots and humans. We propose to divide existing DRL-based navigation approaches based on the robot's exhibited social behavior and distinguish between social collision avoidance with a lack of social behavior and socially aware approaches with explicit predefined social behavior. In addition, we propose a novel socially integrated navigation approach where the robot's social behavior is adaptive and emerges from the interaction with humans. The formulation of our approach is derived from a sociological definition, which states that social acting is oriented toward the acting of others. The DRL policy is trained in an environment where other agents interact socially integrated and reward the robot's behavior individually. The simulation results indicate that the proposed socially integrated navigation approach outperforms a socially aware approach in terms of ego navigation performance while significantly reducing the negative impact on all agents within the environment.
comment: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ Intersymbolic AI: Interlinking Symbolic AI and Subsymbolic AI
This perspective piece calls for the study of the new field of Intersymbolic AI, by which we mean the combination of symbolic AI, whose building blocks have inherent significance/meaning, with subsymbolic AI, whose entirety creates significance/effect despite the fact that individual building blocks escape meaning. Canonical kinds of symbolic AI are logic, games and planning. Canonical kinds of subsymbolic AI are (un)supervised machine and reinforcement learning. Intersymbolic AI interlinks the worlds of symbolic AI with its compositional symbolic significance and meaning and of subsymbolic AI with its summative significance or effect to enable culminations of insights from both worlds by going between and across symbolic AI insights with subsymbolic AI techniques that are being helped by symbolic AI principles. For example, Intersymbolic AI may start with symbolic AI to understand a dynamic system, continue with subsymbolic AI to learn its control, and end with symbolic AI to safely use the outcome of the learned subsymbolic AI controller in the dynamic system. Intersymbolic AI combines both symbolic and subsymbolic AI to increase the effectiveness of AI compared to either kind of AI alone, in much the same way that the combination of both conscious and subconscious thought increases the effectiveness of human thought compared to either kind of thought alone. Some successful contributions to the Intersymbolic AI paradigm are surveyed here but many more are considered possible by advancing Intersymbolic AI.
♻ ☆ A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Current Large Language Models (LLMs) are not only limited to some maximum context length, but also are not able to robustly consume long inputs. To address these limitations, we propose ReadAgent, an LLM agent system that increases effective context length up to 20x in our experiments. Inspired by how humans interactively read long documents, we implement ReadAgent as a simple prompting system that uses the advanced language capabilities of LLMs to (1) decide what content to store together in a memory episode, (2) compress those memory episodes into short episodic memories called gist memories, and (3) take actions to look up passages in the original text if ReadAgent needs to remind itself of relevant details to complete a task. We evaluate ReadAgent against baselines using retrieval methods, using the original long contexts, and using the gist memories. These evaluations are performed on three long-document reading comprehension tasks: QuALITY, NarrativeQA, and QMSum. ReadAgent outperforms the baselines on all three tasks while extending the effective context window by 3.5-20x.
comment: Website: https://read-agent.github.io
♻ ☆ ARCO:Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design
This paper presents ARCO, an adaptive Multi-Agent Reinforcement Learning (MARL)-based co-optimizing compilation framework designed to enhance the efficiency of mapping machine learning (ML) models - such as Deep Neural Networks (DNNs) - onto diverse hardware platforms. The framework incorporates three specialized actor-critic agents within MARL, each dedicated to a distinct aspect of compilation/optimization at an abstract level: one agent focuses on hardware, while two agents focus on software optimizations. This integration results in a collaborative hardware/software co-optimization strategy that improves the precision and speed of DNN deployments. Concentrating on high-confidence configurations simplifies the search space and delivers superior performance compared to current optimization methods. The ARCO framework surpasses existing leading frameworks, achieving a throughput increase of up to 37.95% while reducing the optimization time by up to 42.2% across various DNNs.
comment: Under review
♻ ☆ CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on the FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: multi-kernel design to maximize the bandwidth, mainly targeting benefits of multi DDR memory banks, approximate non-linear functions that exhibit minimal accuracy degradation, and efficient use of available logic blocks on the FPGA, and efficient compiler to maximize the performance and memory-efficiency of the computing kernels by presenting a novel algorithm for design space exploration to find optimal hardware configuration that achieves optimal throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.
♻ ☆ A Survey on 3D Gaussian Splatting
3D Gaussian splatting (GS) has recently emerged as a transformative technique in the realm of explicit radiance field and computer graphics. This innovative approach, characterized by the utilization of millions of learnable 3D Gaussians, represents a significant departure from mainstream neural radiance field approaches, which predominantly use implicit, coordinate-based models to map spatial coordinates to pixel values. 3D GS, with its explicit scene representation and differentiable rendering algorithm, not only promises real-time rendering capability but also introduces unprecedented levels of editability. This positions 3D GS as a potential game-changer for the next generation of 3D reconstruction and representation. In the present paper, we provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS. We begin with a detailed exploration of the underlying principles and the driving forces behind the emergence of 3D GS, laying the groundwork for understanding its significance. A focal point of our discussion is the practical applicability of 3D GS. By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond. This is complemented by a comparative analysis of leading 3D GS models, evaluated across various benchmark tasks to highlight their performance and practical utility. The survey concludes by identifying current challenges and suggesting potential avenues for future research in this domain. Through this survey, we aim to provide a valuable resource for both newcomers and seasoned researchers, fostering further exploration and advancement in applicable and explicit radiance field representation.
comment: Ongoing project
♻ ☆ FineSurE: Fine-grained Summarization Evaluation using LLMs ACL 2024
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
comment: Accepted at ACL 2024 (main, long)
♻ ☆ Graph Signal Processing for Cross-Domain Recommendation
Cross-domain recommendation (CDR) extends conventional recommender systems by leveraging user-item interactions from dense domains to mitigate data sparsity and the cold start problem. While CDR offers substantial potential for enhancing recommendation performance, most existing CDR methods suffer from sensitivity to the ratio of overlapping users and intrinsic discrepancy between source and target domains. To overcome these limitations, in this work, we explore the application of graph signal processing (GSP) in CDR scenarios. We propose CGSP, a unified CDR framework based on GSP, which employs a cross-domain similarity graph constructed by flexibly combining target-only similarity and source-bridged similarity. By processing personalized graph signals computed for users from either the source or target domain, our framework effectively supports both inter-domain and intra-domain recommendations. Our empirical evaluation demonstrates that CGSP consistently outperforms various encoder-based CDR approaches in both intra-domain and inter-domain recommendation scenarios, especially when the ratio of overlapping users is low, highlighting its significant practical implication in real-world applications.
♻ ☆ Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization
Reinforcement learning (RL) has demonstrated impressive performance in decision-making tasks like embodied control, autonomous driving and financial trading. In many decision-making tasks, the agents often encounter the problem of executing actions under limited budgets. However, classic RL methods typically overlook the challenges posed by such sparse-executing actions. They operate under the assumption that all actions can be taken for a unlimited number of times, both in the formulation of the problem and in the development of effective algorithms. To tackle the issue of limited action execution in RL, this paper first formalizes the problem as a Sparse Action Markov Decision Process (SA-MDP), in which specific actions in the action space can only be executed for a limited time. Then, we propose a policy optimization algorithm, Action Sparsity REgularization (ASRE), which adaptively handles each action with a distinct preference. ASRE operates through two steps: First, ASRE evaluates action sparsity by constrained action sampling. Following this, ASRE incorporates the sparsity evaluation into policy learning by way of an action distribution regularization. We provide theoretical identification that validates the convergence of ASRE to a regularized optimal value function. Experiments on tasks with known sparse-executing actions, where classical RL algorithms struggle to train policy efficiently, ASRE effectively constrains the action sampling and outperforms baselines. Moreover, we present that ASRE can generally improve the performance in Atari games, demonstrating its broad applicability.
♻ ☆ IWISDM: Assessing instruction following in multimodal models at scale
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions with that of humans.The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.
♻ ☆ GSQA: An End-to-End Model for Generative Spoken Question Answering
In recent advancements in spoken question answering (QA), end-to-end models have made significant strides. However, previous research has primarily focused on extractive span selection. While this extractive-based approach is effective when answers are present directly within the input, it falls short in addressing abstractive questions, where answers are not directly extracted but inferred from the given information. To bridge this gap, we introduce the first end-to-end Generative Spoken Question Answering (GSQA) model that empowers the system to engage in abstractive reasoning. The challenge in training our GSQA model lies in the absence of a spoken abstractive QA dataset. We propose using text models for initialization and leveraging the extractive QA dataset to transfer knowledge from the text generative model to the spoken generative model. Experimental results indicate that our model surpasses the previous extractive model by 3% on extractive QA datasets. Furthermore, the GSQA model has only been fine-tuned on the spoken extractive QA dataset. Despite not having seen any spoken abstractive QA data, it can still closely match the performance of the cascade model. In conclusion, our GSQA model shows the potential to generalize to a broad spectrum of questions, thus further expanding the spoken question answering capabilities of abstractive QA. Our code is available at https://voidful.github.io/GSQA
comment: 5 pages, 2 figures, Interspeech 2024
♻ ☆ Logical Reasoning with Relation Network for Inductive Knowledge Graph Completion KDD
Inductive knowledge graph completion (KGC) aims to infer the missing relation for a set of newly-coming entities that never appeared in the training set. Such a setting is more in line with reality, as real-world KGs are constantly evolving and introducing new knowledge. Recent studies have shown promising results using message passing over subgraphs to embed newly-coming entities for inductive KGC. However, the inductive capability of these methods is usually limited by two key issues. (i) KGC always suffers from data sparsity, and the situation is even exacerbated in inductive KGC where new entities often have few or no connections to the original KG. (ii) Cold-start problem. It is over coarse-grained for accurate KG reasoning to generate representations for new entities by gathering the local information from few neighbors. To this end, we propose a novel iNfOmax RelAtion Network, namely NORAN, for inductive KG completion. It aims to mine latent relation patterns for inductive KG completion. Specifically, by centering on relations, NORAN provides a hyper view towards KG modeling, where the correlations between relations can be naturally captured as entity-independent logical evidence to conduct inductive KGC. Extensive experiment results on five benchmarks show that our framework substantially outperforms the state-of-the-art KGC methods.
comment: 30th SIGKDD Conference on Knowledge Discovery and Data Mining
♻ ☆ Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for the brain of embodied agents. However, there is no comprehensive survey for Embodied AI in the era of MLMs. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering the state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in dynamic digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss their potential future directions. We hope this survey will serve as a foundational reference for the research community and inspire continued innovation. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.
comment: The first comprehensive review of Embodied AI in the era of MLMs, 35 pages. We also provide the paper list for Embodied AI: https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List
♻ ☆ Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.
comment: Code and model: https://github.com/OpenDriveLab/Vista, video demos: https://vista-demo.github.io
♻ ☆ $\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks
This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, the availability of which undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, thus making them inadequate for determining the safety or correctness of their generated outputs. We introduce a new approach that utilizes context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and perform closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.
Machine Learning
☆ Reconstructing Training Data From Real World Models Trained with Transfer Learning
Current methods for reconstructing training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in realistic settings for models trained on high-resolution images. Our method adapts the reconstruction scheme of arXiv:2206.07758 to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning.
☆ HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning ECCV 2024
Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
comment: To be presented at ECCV 2024
☆ Perceptions of Linguistic Uncertainty by Language Models and Humans
Uncertainty expressions such as ``probably'' or ``highly unlikely'' are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans interpret these expressions, there has been little inquiry into the abilities of language models to interpret such expressions. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We evaluate both humans and 10 popular language models on a task created to assess these abilities. Unexpectedly, we find that 8 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI alignment and AI-AI communication.
comment: In submission
☆ Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only \$1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than stable diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
comment: 41 pages, 28 figures, 5 tables
☆ CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning BMVC 2024
With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This led to the development of numerous prompting strategies to effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains significantly deviating from the pre-training and preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM, which exploits generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.
comment: 15 pages, 1 figure. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK
☆ Robust Mixture Learning when Outliers Overwhelm Small Groups
We study the problem of estimating the means of well-separated mixtures when an adversary may add arbitrary outliers. While strong guarantees are available when the outlier fraction is significantly smaller than the minimum mixing weight, much less is known when outliers may crowd out low-weight clusters - a setting we refer to as list-decodable mixture learning (LD-ML). In this case, adversarial outliers can simulate additional spurious mixture components. Hence, if all means of the mixture must be recovered up to a small error in the output list, the list size needs to be larger than the number of (true) components. We propose an algorithm that obtains order-optimal error guarantees for each mixture mean with a minimal list-size overhead, significantly improving upon list-decodable mean estimation, the only existing method that is applicable for LD-ML. Although improvements are observed even when the mixture is non-separated, our algorithm achieves particularly strong guarantees when the mixture is separated: it can leverage the mixture structure to partially cluster the samples before carefully iterating a base learner for list-decodable mean estimation at different scales.
☆ Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels
Recent advances in reinforcement learning (RL) have predominantly leveraged neural network-based policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into neural networks. However, a significant limitation in prior work is the assumption that human annotations for these concepts are readily available during training, necessitating continuous real-time input from human annotators. To overcome this limitation, we introduce a novel training scheme that enables RL algorithms to efficiently learn a concept-based policy by only querying humans to label a small set of data, or in the extreme case, without any human labels. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using a concept ensembles to actively select informative data points for labeling, and decorrelating the concept data with a simple strategy. We show how LICORICE reduces manual labeling efforts to to 500 or fewer concept labels in three environments. Finally, we present an initial study to explore how we can use powerful vision-language models to infer concepts from raw visual inputs without explicit labels at minimal cost to performance.
comment: 23 pages, 6 figures, 9 tables
☆ In Search of Quantum Advantage: Estimating the Number of Shots in Quantum Kernel Methods
Quantum Machine Learning (QML) has gathered significant attention through approaches like Quantum Kernel Machines. While these methods hold considerable promise, their quantum nature presents inherent challenges. One major challenge is the limited resolution of estimated kernel values caused by the finite number of circuit runs performed on a quantum device. In this study, we propose a comprehensive system of rules and heuristics for estimating the required number of circuit runs in quantum kernel methods. We introduce two critical effects that necessitate an increased measurement precision through additional circuit runs: the spread effect and the concentration effect. The effects are analyzed in the context of fidelity and projected quantum kernels. To address these phenomena, we develop an approach for estimating desired precision of kernel values, which, in turn, is translated into the number of circuit runs. Our methodology is validated through extensive numerical simulations, focusing on the problem of exponential value concentration. We stress that quantum kernel methods should not only be considered from the machine learning performance perspective, but also from the context of the resource consumption. The results provide insights into the possible benefits of quantum kernel methods, offering a guidance for their application in quantum machine learning tasks.
comment: 18 + 13 pages, 8 figures. This manuscript is a first release that will be improved in future versions. We wanted to provide this preview now as we recently became aware of extensive modifications in arXiv:2208.11060
☆ STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay ECCV 2024
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during the open-world inference process, there are inevitably test data instances from unknown classes, commonly referred to as outliers. This paper pays attention to the problem that conducts both sample recognition and outlier rejection during inference while outliers exist. To address this problem, we propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch. In particular, the memory bank is dynamically updated by selecting low-entropy and label-consistent samples in a class-balanced manner. In addition, we develop a self-weighted entropy minimization strategy that assigns higher weight to low-entropy samples. Extensive results demonstrate that STAMP outperforms existing TTA methods in terms of both recognition and outlier detection performance. The code is released at https://github.com/yuyongcan/STAMP.
comment: Accepted by ECCV 2024
☆ Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade-off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP can learn steerable models that effectively trade-off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through an extensive set of experiments and ablations, we show that the CLP framework learns steerable models that outperform and Pareto-dominate the current state-of-the-art approaches for multi-objective finetuning.
comment: 40 pages
☆ Model editing for distribution shifts in uranium oxide morphological analysis CVPR 2024
Deep learning still struggles with certain kinds of scientific data. Notably, pretraining data may not provide coverage of relevant distribution shifts (e.g., shifts induced via the use of different measurement instruments). We consider deep learning models trained to classify the synthesis conditions of uranium ore concentrates (UOCs) and show that model editing is particularly effective for improving generalization to distribution shifts common in this domain. In particular, model editing outperforms finetuning on two curated datasets comprising of micrographs taken of U$_{3}$O$_{8}$ aged in humidity chambers and micrographs acquired with different scanning electron microscopes, respectively.
comment: Presented at CV4MS @ CVPR 2024
☆ LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Albeit the progress, few public benchmark is available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, as part of the question, it contains a referring query that references related video contexts, called referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that the LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when they are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
comment: 29 pages
☆ Robustness of Speech Separation Models for Similar-pitch Speakers
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal. Building on earlier findings by Ditter and Gerkmann, which identified a significant performance drop for the 2018 Chimera++ under similar-pitch conditions, our study extends the analysis to more recent and sophisticated Neural Network models. Our experiments reveal that modern models have substantially reduced the performance gap for matched training and testing conditions. However, a substantial performance gap persists under mismatched conditions, with models performing well for large pitch differences but showing worse performance if the speakers' pitches are similar. These findings motivate further research into the generalizability of speech separation models to similar-pitch speakers and unseen data.
☆ Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond ECCV 2024
In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.
comment: ECCV 2024 - Benchmark page: https://ade-ood.github.io/
☆ Parallel Split Learning with Global Sampling
The expansion of IoT devices and the demands of Deep Learning have highlighted significant challenges in Distributed Deep Learning (DDL) systems. Parallel Split Learning (PSL) has emerged as a promising derivative of Split Learning that is well suited for distributed learning on resource-constrained devices. However, PSL faces several obstacles, such as large effective batch sizes, non-IID data distributions, and the straggler effect. We view these issues as a sampling dilemma and propose to address them by orchestrating the mini-batch sampling process on the server side. We introduce the Uniform Global Sampling (UGS) method to decouple the effective batch size from the number of clients and reduce mini-batch deviation in non-IID settings. To address the straggler effect, we introduce the Latent Dirichlet Sampling (LDS) method, which generalizes UGS to balance the trade-off between batch deviation and training time. Our simulations reveal that our proposed methods enhance model accuracy by up to 34.1% in non-IID settings and reduce the training time in the presence of stragglers by up to 62%. In particular, LDS effectively mitigates the straggler effect without compromising model accuracy or adding significant computational overhead compared to UGS. Our results demonstrate the potential of our methods as a promising solution for DDL in real applications.
☆ Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks
We propose the Artificial Intelligence Velocimetry-Thermometry (AIVT) method to infer hidden temperature fields from experimental turbulent velocity data. This physics-informed machine learning method enables us to infer continuous temperature fields using only sparse velocity data, hence eliminating the need for direct temperature measurements. Specifically, AIVT is based on physics-informed Kolmogorov-Arnold Networks (not neural networks) and is trained by optimizing a combined loss function that minimizes the residuals of the velocity data, boundary conditions, and the governing equations. We apply AIVT to a unique set of experimental volumetric and simultaneous temperature and velocity data of Rayleigh-B\'enard convection (RBC) that we acquired by combining Particle Image Thermometry and Lagrangian Particle Tracking. This allows us to compare AIVT predictions and measurements directly. We demonstrate that we can reconstruct and infer continuous and instantaneous velocity and temperature fields from sparse experimental data at a fidelity comparable to direct numerical simulations (DNS) of turbulence. This, in turn, enables us to compute important quantities for quantifying turbulence, such as fluctuations, viscous and thermal dissipation, and QR distribution. This paradigm shift in processing experimental data using AIVT to infer turbulent fields at DNS-level fidelity is a promising avenue in breaking the current deadlock of quantitative understanding of turbulence at high Reynolds numbers, where DNS is computationally infeasible.
comment: 50 pages, 8 figures
☆ Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning
In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, dataset diversity is incompletely understood. The current best practice is to try to maximize dataset size and class balance. Yet large, class-balanced datasets are not guaranteed to be diverse: images can still be arbitrarily similar. We hypothesized that, for a given model architecture, better model performance can be achieved by maximizing dataset diversity more directly. This could open a path for performance improvement without additional computational resources or architectural advances. To test this hypothesis, we introduce a comprehensive framework of diversity measures, developed in ecology, that generalizes familiar quantities like Shannon entropy by accounting for similarities and differences among images. (Dataset size and class balance emerge from this framework as special cases.) By analyzing thousands of subsets from seven medical datasets representing ultrasound, X-ray, CT, and pathology images, we found that the best correlates of performance were not size or class balance but $A$ -- ``big alpha'' -- a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for similarities among images. One of these, $A_0$, explained 67\% of the variance in balanced accuracy across all subsets, vs. 54\% for class balance and just 39\% for size. The best pair was size and $A_1$ (79\%), which outperformed size and class balance (74\%). $A$ performed best for subsets from individual datasets as well as across datasets, supporting the generality of these results. We propose maximizing $A$ as a potential new way to improve the performance of deep learning in medical imaging.
comment: 11 pages, 5 figures, 3 tables
☆ Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite LLM's tremendous success, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open question and largely ununderstood. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks that include linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) For simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, while scaling up the model enhances this ability; (2) for more complex composite tasks that involving reasoning multiple steps, where each step represent one task, models typically underperform, and scaling up generally provide no improvements. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{https://github.com/OliverXUZY/LLM_Compose}}.
☆ Estimating Probability Densities with Transformer and Denoising Diffusion ICML 2024
Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining full probabilistic outputs is crucial to many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows conditioning the output probability density on arbitrary combinations of inputs and it is thus a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.
comment: Accepted at the ICML 2024 Workshop on Foundation Models in the Wild
☆ SoftCVI: contrastive variational inference with self-generated soft labels
Estimating a distribution given access to its unnormalized density is pivotal in Bayesian inference, where the posterior is generally known only up to an unknown normalizing constant. Variational inference and Markov chain Monte Carlo methods are the predominant tools for this task; however, both methods are often challenging to apply reliably, particularly when the posterior has complex geometry. Here, we introduce Soft Contrastive Variational Inference (SoftCVI), which allows a family of variational objectives to be derived through a contrastive estimation framework. These objectives have zero variance gradient when the variational approximation is exact, without the need for specialized gradient estimators. The approach involves parameterizing a classifier in terms of the variational distribution, which allows the inference task to be reframed as a contrastive estimation problem, aiming to identify a single true posterior sample among a set of samples. Despite this framing, we do not require positive or negative samples, but rather learn by sampling the variational distribution and computing ground truth soft classification labels from the unnormalized posterior itself. We empirically investigate the performance on a variety of Bayesian inference tasks, using both using both simple (e.g. normal) and expressive (normalizing flow) variational distributions. We find that SoftCVI objectives often outperform other commonly used variational objectives.
☆ HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning ECCV 2024
Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
comment: Accepted as a main conference paper at ECCV 2024 (https://github.com/google/haloquest)
☆ vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly bounded by memory. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using paged Attention mechanisms, they still suffer from inefficient memory and computational operations due to the tightly coupled page management and computation kernels. This study introduces the vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to SGLang Triton prefix-prefilling kernels and vLLM paged Attention kernel, respectively. Furthermore, it frees approximately 71.25% (57GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
comment: 16 pages, 12 figures
☆ A spatiotemporal deep learning framework for prediction of crack dynamics in heterogeneous solids: efficient mapping of concrete microstructures to its fracture properties
A spatiotemporal deep learning framework is proposed that is capable of 2D full-field prediction of fracture in concrete mesostructures. This framework not only predicts fractures but also captures the entire history of the fracture process, from the crack initiation in the interfacial transition zone to the subsequent propagation of the cracks in the mortar matrix. In addition, a convolutional neural network is developed which can predict the averaged stress-strain curve of the mesostructures. The UNet modeling framework, which comprises an encoder-decoder section with skip connections, is used as the deep learning surrogate model. Training and test data are generated from high-fidelity fracture simulations of randomly generated concrete mesostructures. These mesostructures include geometric variabilities such as different aggregate particle geometrical features, spatial distribution, and the total volume fraction of aggregates. The fracture simulations are carried out in Abaqus, utilizing the cohesive phase-field fracture modeling technique as the fracture modeling approach. In this work, to reduce the number of training datasets, the spatial distribution of three sets of material properties for three-phase concrete mesostructures, along with the spatial phase-field damage index, are fed to the UNet to predict the corresponding stress and spatial damage index at the subsequent step. It is shown that after the training process using this methodology, the UNet model is capable of accurately predicting damage on the unseen test dataset by using 470 datasets. Moreover, another novel aspect of this work is the conversion of irregular finite element data into regular grids using a developed pipeline. This approach allows for the implementation of less complex UNet architecture and facilitates the integration of phase-field fracture equations into surrogate models for future developments.
☆ How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?
We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution q via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, Linearity, etc.) permutation-equivalence naturally appears in machine learning problems, and to benefit from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establish when the number of observations $(n_k)_{k\in \mathcal K}$ are large enough so that the output confidence sets improve over initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of confidence sets shrink at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in K} n_{k})$, respectively for elements inside and outside the support of q, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation equivalence on a reinforcement learning task.
☆ MuTT: A Multimodal Trajectory Transformer for Robot Skills
High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.
☆ Link Polarity Prediction from Sparse and Noisy Labels via Multiscale Social Balance
Signed Graph Neural Networks (SGNNs) have recently gained attention as an effective tool for several learning tasks on signed networks, i.e., graphs where edges have an associated polarity. One of these tasks is to predict the polarity of the links for which this information is missing, starting from the network structure and the other available polarities. However, when the available polarities are few and potentially noisy, such a task becomes challenging. In this work, we devise a semi-supervised learning framework that builds around the novel concept of \emph{multiscale social balance} to improve the prediction of link polarities in settings characterized by limited data quantity and quality. Our model-agnostic approach can seamlessly integrate with any SGNN architecture, dynamically reweighting the importance of each data sample while making strategic use of the structural information from unlabeled edges combined with social balance theory. Empirical validation demonstrates that our approach outperforms established baseline models, effectively addressing the limitations imposed by noisy and sparse data. This result underlines the benefits of incorporating multiscale social balance into SGNNs, opening new avenues for robust and accurate predictions in signed network analysis.
☆ Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.
comment: 8 pages, 2 figures. Accepted to the 25th Conference of the International Society for Music Information Retrieval (ISMIR)
☆ RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time and incorporated them into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
☆ Dual Test-time Training for Out-of-distribution Recommender System
Deep learning has been widely applied in recommender systems, which has achieved revolutionary progress recently. However, most existing learning-based methods assume that the user and item distributions remain unchanged between the training phase and the test phase. However, the distribution of user and item features can naturally shift in real-world scenarios, potentially resulting in a substantial decrease in recommendation performance. This phenomenon can be formulated as an Out-Of-Distribution (OOD) recommendation problem. To address this challenge, we propose a novel Dual Test-Time-Training framework for OOD Recommendation, termed DT3OR. In DT3OR, we incorporate a model adaptation mechanism during the test-time phase to carefully update the recommendation model, allowing the model to specially adapt to the shifting user and item features. To be specific, we propose a self-distillation task and a contrastive task to assist the model learning both the user's invariant interest preferences and the variant user/item characteristics during the test-time phase, thus facilitating a smooth adaptation to the shifting features. Furthermore, we provide theoretical analysis to support the rationale behind our dual test-time training framework. To the best of our knowledge, this paper is the first work to address OOD recommendation via a test-time-training strategy. We conduct experiments on three datasets with various backbones. Comprehensive experimental results have demonstrated the effectiveness of DT3OR compared to other state-of-the-art baselines.
☆ Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets
Feature selection poses a challenge in small-sample high-dimensional datasets, where the number of features exceeds the number of observations, as seen in microarray, gene expression, and medical datasets. There isn't a universally optimal feature selection method applicable to any data distribution, and as a result, the literature consistently endeavors to address this issue. One recent approach in feature selection is termed frequency-based feature selection. However, existing methods in this domain tend to overlook feature values, focusing solely on the distribution in the response variable. In response, this paper introduces the Distance-based Mutual Congestion (DMC) as a filter method that considers both the feature values and the distribution of observations in the response variable. DMC sorts the features of datasets, and the top 5% are retained and clustered by KMeans to mitigate multicollinearity. This is achieved by randomly selecting one feature from each cluster. The selected features form the feature space, and the search space for the Genetic Algorithm with Adaptive Rates (GAwAR) will be approximated using this feature space. GAwAR approximates the combination of the top 10 features that maximizes prediction accuracy within a wrapper scheme. To prevent premature convergence, GAwAR adaptively updates the crossover and mutation rates. The hybrid DMC-GAwAR is applicable to binary classification datasets, and experimental results demonstrate its superiority over some recent works. The implementation and corresponding data are available at https://github.com/hnematzadeh/DMC-GAwAR
☆ Discrete Flow Matching
Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it allows for a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser ($x$-prediction) and noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.
☆ Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 800 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.
☆ Data driven weather forecasts trained and initialised directly from observations
Skilful Machine Learned weather forecasts have challenged our approach to numerical weather prediction, demonstrating competitive performance compared to traditional physics-based approaches. Data-driven systems have been trained to forecast future weather by learning from long historical records of past weather such as the ECMWF ERA5. These datasets have been made freely available to the wider research community, including the commercial sector, which has been a major factor in the rapid rise of ML forecast systems and the levels of accuracy they have achieved. However, historical reanalyses used for training and real-time analyses used for initial conditions are produced by data assimilation, an optimal blending of observations with a physics-based forecast model. As such, many ML forecast systems have an implicit and unquantified dependence on the physics-based models they seek to challenge. Here we propose a new approach, training a neural network to predict future weather purely from historical observations with no dependence on reanalyses. We use raw observations to initialise a model of the atmosphere (in observation space) learned directly from the observations themselves. Forecasts of crucial weather parameters (such as surface temperature and wind) are obtained by predicting weather parameter observations (e.g. SYNOP surface data) at future times and arbitrary locations. We present preliminary results on forecasting observations 12-hours into the future. These already demonstrate successful learning of time evolutions of the physical processes captured in real observations. We argue that this new approach, by staying purely in observation space, avoids many of the challenges of traditional data assimilation, can exploit a wider range of observations and is readily expanded to simultaneous forecasting of the full Earth system (atmosphere, land, ocean and composition).
☆ Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing
We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
☆ A New Theoretical Perspective on Data Heterogeneity in Federated Optimization ICML 2024
In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence rate grows dramatically when the number of local updates becomes large, especially when the product of the gradient divergence and local Lipschitz constant is large. However, empirical studies can show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new and weaker assumption compared to the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this and the gradient divergence assumptions can jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to the existing works, local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective function is quadratic, more insights on the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence can be arbitrarily large. Our findings are validated using experiments.
comment: ICML 2024
☆ The Rlign Algorithm for Enhanced Electrocardiogram Analysis through R-Peak Alignment for Explainable Classification and Clustering
Electrocardiogram (ECG) recordings have long been vital in diagnosing different cardiac conditions. Recently, research in the field of automatic ECG processing using machine learning methods has gained importance, mainly by utilizing deep learning methods on raw ECG signals. A major advantage of models like convolutional neural networks (CNNs) is their ability to effectively process biomedical imaging or signal data. However, this strength is tempered by challenges related to their lack of explainability, the need for a large amount of training data, and the complexities involved in adapting them for unsupervised clustering tasks. In addressing these tasks, we aim to reintroduce shallow learning techniques, including support vector machines and principal components analysis, into ECG signal processing by leveraging their semi-structured, cyclic form. To this end, we developed and evaluated a transformation that effectively restructures ECG signals into a fully structured format, facilitating their subsequent analysis using shallow learning algorithms. In this study, we present this adaptive transformative approach that aligns R-peaks across all signals in a dataset and resamples the segments between R-peaks, both with and without heart rate dependencies. We illustrate the substantial benefit of this transformation for traditional analysis techniques in the areas of classification, clustering, and explainability, outperforming commercial software for median beat transformation and CNN approaches. Our approach demonstrates a significant advantage for shallow machine learning methods over CNNs, especially when dealing with limited training data. Additionally, we release a fully tested and publicly accessible code framework, providing a robust alignment pipeline to support future research, available at https://github.com/ imi-ms/rlign.
☆ Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
☆ Inverted Activations
The scaling of neural networks with increasing data and model sizes necessitates more efficient deep learning algorithms. This paper addresses the memory footprint challenge in neural network training by proposing a modification to the handling of activation tensors in pointwise nonlinearity layers. Traditionally, these layers save the entire input tensor for the backward pass, leading to substantial memory use. Our method involves saving the output tensor instead, reducing the memory required when the subsequent layer also saves its input tensor. This approach is particularly beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. Application of our method involves taken an inverse function of nonlinearity. To the best of our knowledge, that can not be done analitically and instead we buid an accurate approximations using simpler functions. Experimental results confirm that our method significantly reduces memory usage without affecting training accuracy. The implementation is available at https://github.com/PgLoLo/optiacts.
☆ Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints IJCAI 2024
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraints penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
comment: To be published in the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)
☆ Interpretable Concept-Based Memory Reasoning
The lack of transparency in the decision-making processes of deep learning systems presents a significant challenge in modern artificial intelligence (AI), as it impairs users' ability to rely on and verify these systems. To address this challenge, Concept Bottleneck Models (CBMs) have made significant progress by incorporating human-interpretable concepts into deep learning architectures. This approach allows predictions to be traced back to specific concept patterns that users can understand and potentially intervene on. However, existing CBMs' task predictors are not fully interpretable, preventing a thorough analysis and any form of formal verification of their decision-making process prior to deployment, thereby raising significant reliability concerns. To bridge this gap, we introduce Concept-based Memory Reasoner (CMR), a novel CBM designed to provide a human-understandable and provably-verifiable task prediction process. Our approach is to model each task prediction as a neural selection mechanism over a memory of learnable logic rules, followed by a symbolic evaluation of the selected rule. The presence of an explicit memory and the symbolic evaluation allow domain experts to inspect and formally verify the validity of certain global properties of interest for the task prediction process. Experimental results demonstrate that CMR achieves comparable accuracy-interpretability trade-offs to state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification.
☆ Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.
☆ Multiple importance sampling for stochastic gradient estimation
We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.
comment: 13 pages, 11 figures
☆ Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping dreeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33\% of attention layers in a 13B Llama2 model results in a 1.8\% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers except the latter layers reduces performances for more layers skipped, except for skipping the attention layers.
☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
☆ Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models
We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the criticality of query-aware prompt compression, where the compressor has knowledge of the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose a query-aware, variable-rate adaptation of a prior work to close the gap. We extend our experiments to a small natural language dataset to further confirm our findings on our synthetic dataset.
comment: 40 pages, 15 figures. Under review
☆ Affordance Labeling and Exploration: A Manifold-Based Approach
The advancement in computing power has significantly reduced the training times for deep learning, fostering the rapid development of networks designed for object recognition. However, the exploration of object utility, which is the affordance of the object, as opposed to object recognition, has received comparatively less attention. This work focuses on the problem of exploration of object affordances using existing networks trained on the object classification dataset. While pre-trained networks have proven to be instrumental in transfer learning for classification tasks, this work diverges from conventional object classification methods. Instead, it employs pre-trained networks to discern affordance labels without the need for specialized layers, abstaining from modifying the final layers through the addition of classification layers. To facilitate the determination of affordance labels without such modifications, two approaches, i.e. subspace clustering and manifold curvature methods are tested. These methods offer a distinct perspective on affordance label recognition. Especially, manifold curvature method has been successfully tested with nine distinct pre-trained networks, each achieving an accuracy exceeding 95%. Moreover, it is observed that manifold curvature and subspace clustering methods explore affordance labels that are not marked in the ground truth, but object affords in various cases.
comment: 17 Pages, 3 Figures, 3 Tables
☆ MODRL-TA:A Multi-Objective Deep Reinforcement Learning Framework for Traffic Allocation in E-Commerce Search
Traffic allocation is a process of redistributing natural traffic to products by adjusting their positions in the post-search phase, aimed at effectively fostering merchant growth, precisely meeting customer demands, and ensuring the maximization of interests across various parties within e-commerce platforms. Existing methods based on learning to rank neglect the long-term value of traffic allocation, whereas approaches of reinforcement learning suffer from balancing multiple objectives and the difficulties of cold starts within realworld data environments. To address the aforementioned issues, this paper propose a multi-objective deep reinforcement learning framework consisting of multi-objective Q-learning (MOQ), a decision fusion algorithm (DFM) based on the cross-entropy method(CEM), and a progressive data augmentation system(PDA). Specifically. MOQ constructs ensemble RL models, each dedicated to an objective, such as click-through rate, conversion rate, etc. These models individually determine the position of items as actions, aiming to estimate the long-term value of multiple objectives from an individual perspective. Then we employ DFM to dynamically adjust weights among objectives to maximize long-term value, addressing temporal dynamics in objective preferences in e-commerce scenarios. Initially, PDA trained MOQ with simulated data from offline logs. As experiments progressed, it strategically integrated real user interaction data, ultimately replacing the simulated dataset to alleviate distributional shifts and the cold start problem. Experimental results on real-world online e-commerce systems demonstrate the significant improvements of MODRL-TA, and we have successfully deployed MODRL-TA on an e-commerce search platform.
☆ The Diversity Bonus: Learning from Dissimilar Distributed Clients in Personalized Federated Learning
Personalized Federated Learning (PFL) is a commonly used framework that allows clients to collaboratively train their personalized models. PFL is particularly useful for handling situations where data from different clients are not independent and identically distributed (non-IID). Previous research in PFL implicitly assumes that clients can gain more benefits from those with similar data distributions. Correspondingly, methods such as personalized weight aggregation are developed to assign higher weights to similar clients during training. We pose a question: can a client benefit from other clients with dissimilar data distributions and if so, how? This question is particularly relevant in scenarios with a high degree of non-IID, where clients have widely different data distributions, and learning from only similar clients will lose knowledge from many other clients. We note that when dealing with clients with similar data distributions, methods such as personalized weight aggregation tend to enforce their models to be close in the parameter space. It is reasonable to conjecture that a client can benefit from dissimilar clients if we allow their models to depart from each other. Based on this idea, we propose DiversiFed which allows each client to learn from clients with diversified data distribution in personalized federated learning. DiversiFed pushes personalized models of clients with dissimilar data distributions apart in the parameter space while pulling together those with similar distributions. In addition, to achieve the above effect without using prior knowledge of data distribution, we design a loss function that leverages the model similarity to determine the degree of attraction and repulsion between any two models. Experiments on several datasets show that DiversiFed can benefit from dissimilar clients and thus outperform the state-of-the-art methods.
comment: 14 pages, 9 figures
☆ Efficient Retrieval with Learned Similarities
Retrieval plays a fundamental role in recommendation systems, search, and natural language processing by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such retrieval tasks, thanks to Maximum Inner Product Search (MIPS) that enabled efficient retrieval based on dot products. However, state-of-the-art retrieval algorithms have migrated to learned similarities. Such algorithms vary in form; the queries can be represented with multiple embeddings, complex neural networks can be deployed, the item ids can be decoded directly from queries using beam search, and multiple approaches can be combined in hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work investigates techniques for approximate nearest neighbor search with learned similarity functions. We first prove that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions. We next propose techniques to retrieve the approximate top K results using MoL with a tight bound. We finally compare our techniques with existing approaches, showing that MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency, while achieving > .99 recall rate of exact algorithms.
☆ Score matching for bridges without time-reversals
We propose a new algorithm for learning a bridged diffusion process using score-matching methods. Our method relies on reversing the dynamics of the forward process and using this to learn a score function, which, via Doob's $h$-transform, gives us a bridged diffusion process; that is, a process conditioned on an endpoint. In contrast to prior methods, ours learns the score term $\nabla_x \log p(t, x; T, y)$, for given $t, Y$ directly, completely avoiding the need for first learning a time reversal. We compare the performance of our algorithm with existing methods and see that it outperforms using the (learned) time-reversals to learn the score term. The code can be found at https://github.com/libbylbaker/forward_bridge.
☆ Regression under demographic parity constraints via unlabeled post-processing
We address the problem of performing regression while ensuring demographic parity, even without access to sensitive attributes during inference. We present a general-purpose post-processing algorithm that, using accurate estimates of the regression function and a sensitive attribute predictor, generates predictions that meet the demographic parity constraint. Our method involves discretization and stochastic minimization of a smooth convex function. It is suitable for online post-processing and multi-class classification tasks only involving unlabeled data for the post-processing. Unlike prior methods, our approach is fully theory-driven. We require precise control over the gradient norm of the convex function, and thus, we rely on more advanced techniques than standard stochastic gradient descent. Our algorithm is backed by finite-sample analysis and post-processing bounds, with experimental results validating our theoretical findings.
☆ GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs CIKM 2024
Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.
comment: Published in the Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), 8 Pages, 12 Figures
☆ Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays
We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing, and online advertising, where immediate feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays where the feedback delays are independent of the rewards, and reward-dependent delays where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.
comment: 28 pages, 9 figures, accepted for 27th European Conference on Artificial Intelligence
Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs KDD'24
The text-attributed graph (TAG) is one kind of important real-world graph-structured data with each node associated with raw texts. For TAGs, traditional few-shot node classification methods directly conduct training on the pre-processed node features and do not consider the raw texts. The performance is highly dependent on the choice of the feature pre-processing method. In this paper, we propose P2TAG, a framework designed for few-shot node classification on TAGs with graph pre-training and prompting. P2TAG first pre-trains the language model (LM) and graph neural network (GNN) on TAGs with self-supervised loss. To fully utilize the ability of language models, we adapt the masked language modeling objective for our framework. The pre-trained model is then used for the few-shot node classification with a mixed prompt method, which simultaneously considers both text and graph information. We conduct experiments on six real-world TAGs, including paper citation networks and product co-purchasing networks. Experimental results demonstrate that our proposed framework outperforms existing graph few-shot learning methods on these datasets with +18.98% ~ +35.98% improvements.
comment: Accepted to KDD'24
☆ Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training
Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving training approaches such as federated learning (FL). However, compared to conventional unimodal learning, multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models that demand significant resources. This presents a substantial challenge for FL clients operating with limited computational resources and communication bandwidth. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach, which decomposes the training process into multiple steps. Each step focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL scenarios and multimodal learning setups to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also introduce a progressive training approach called Prog-FedMML. While it offers lesser resource efficiency than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.
☆ Empirical Capacity Model for Self-Attention Neural Networks
Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.
comment: Submitted to BNAIC'24, 14 pages + refs
☆ Planning behavior in a recurrent neural network that plays Sokoban ICML 2024
To predict how advanced neural networks generalize to novel situations, it is essential to understand how they reason. Guez et al. (2019, "An investigation of model-free planning") trained a recurrent neural network (RNN) to play Sokoban with model-free reinforcement learning. They found that adding extra computation steps to the start of episodes at test time improves the RNN's success rate. We further investigate this phenomenon, finding that it rapidly emerges early on in training and then slowly fades, but only for comparatively easier levels. The RNN also often takes redundant actions at episode starts, and these are reduced by adding extra computation steps. Our results suggest that the RNN learns to take time to think by `pacing', despite the per-step penalties, indicating that training incentivizes planning capabilities. The small size (1.29M parameters) and interesting behavior of this model make it an excellent model organism for mechanistic interpretability.
comment: Mechanistic Interpretability workshop, ICML 2024
☆ Weights Shuffling for Improving DPSGD in Transformer-based Models
Differential Privacy (DP) mechanisms, especially in high-dimensional settings, often face the challenge of maintaining privacy without compromising the data utility. This work introduces an innovative shuffling mechanism in Differentially-Private Stochastic Gradient Descent (DPSGD) to enhance the utility of large models at the same privacy guarantee of the unshuffled case. Specifically, we reveal that random shuffling brings additional randomness to the trajectory of gradient descent while not impacting the model accuracy by the permutation invariance property -- the model can be equivalently computed in both forward and backward propagations under permutation. We show that permutation indeed improves the privacy guarantee of DPSGD in theory, but tracking the exact privacy loss on shuffled model is particularly challenging. Hence we exploit the approximation on sum of lognormal distributions to derive the condition for the shuffled DPSGD to meet the DP guarantee. Auditing results show that our condition offers a DP guarantee quite close to the audited privacy level, demonstrating our approach an effective estimation in practice. Experimental results have verified our theoretical derivation and illustrate that our mechanism improves the accuracy of DPSGD over the state-of-the-art baselines on a variety of models and tasks.
☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
comment: Ongoing work (v1); 34 pages, 5 figures
☆ Offline Imitation Learning Through Graph Search and Retrieval
Imitation learning is a powerful machine learning algorithm for a robot to acquire manipulation skills. Nevertheless, many real-world manipulation tasks involve precise and dexterous robot-object interactions, which make it difficult for humans to collect high-quality expert demonstrations. As a result, a robot has to learn skills from suboptimal demonstrations and unstructured interactions, which remains a key challenge. Existing works typically use offline deep reinforcement learning (RL) to solve this challenge, but in practice these algorithms are unstable and fragile due to the deadly triad issue. To overcome this problem, we propose GSR, a simple yet effective algorithm that learns from suboptimal demonstrations through Graph Search and Retrieval. We first use pretrained representation to organize the interaction experience into a graph and perform a graph search to calculate the values of different behaviors. Then, we apply a retrieval-based procedure to identify the best behavior (actions) on each state and use behavior cloning to learn that behavior. We evaluate our method in both simulation and real-world robotic manipulation tasks with complex visual inputs, covering various precise and dexterous manipulation skills with objects of different physical properties. GSR can achieve a 10% to 30% higher success rate and over 30% higher proficiency compared to baselines. Our project page is at https://zhaohengyin.github.io/gsr.
comment: Robotics: Science and Systems (RSS) 2024
☆ Tackling Selfish Clients in Federated Learning ECAI
Federated Learning (FL) is a distributed machine learning paradigm facilitating participants to collaboratively train a model without revealing their local data. However, when FL is deployed into the wild, some intelligent clients can deliberately deviate from the standard training process to make the global model inclined toward their local model, thereby prioritizing their local data distribution. We refer to this novel category of misbehaving clients as selfish. In this paper, we propose a Robust aggregation strategy for FL server to mitigate the effect of Selfishness (in short RFL-Self). RFL-Self incorporates an innovative method to recover (or estimate) the true updates of selfish clients from the received ones, leveraging robust statistics (median of norms) of the updates at every round. By including the recovered updates in aggregation, our strategy offers strong robustness against selfishness. Our experimental results, obtained on MNIST and CIFAR-10 datasets, demonstrate that just 2% of clients behaving selfishly can decrease the accuracy by up to 36%, and RFL-Self can mitigate that effect without degrading the global model performance.
comment: 10 pages, 16 figures. European Conference on Artificial Intelligence (ECAI) 2024
☆ Poisoning with A Pill: Circumventing Detection in Federated Learning
Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training, named as pill construction, pill poisoning, and pill injection accordingly. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, and can gain an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.
☆ LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem.
comment: Preprint Paper with 13 pages
☆ Fever Detection with Infrared Thermography: Enhancing Accuracy through Machine Learning Techniques
The COVID-19 pandemic has underscored the necessity for advanced diagnostic tools in global health systems. Infrared Thermography (IRT) has proven to be a crucial non-contact method for measuring body temperature, vital for identifying febrile conditions associated with infectious diseases like COVID-19. Traditional non-contact infrared thermometers (NCITs) often exhibit significant variability in readings. To address this, we integrated machine learning algorithms with IRT to enhance the accuracy and reliability of temperature measurements. Our study systematically evaluated various regression models using heuristic feature engineering techniques, focusing on features' physiological relevance and statistical significance. The Convolutional Neural Network (CNN) model, utilizing these techniques, achieved the lowest RMSE of 0.2223, demonstrating superior performance compared to results reported in previous literature. Among non-neural network models, the Binning method achieved the best performance with an RMSE of 0.2296. Our findings highlight the potential of combining advanced feature engineering with machine learning to improve diagnostic tools' effectiveness, with implications extending to other non-contact or remote sensing biomedical applications. This paper offers a comprehensive analysis of these methodologies, providing a foundation for future research in the field of non-invasive medical diagnostics.
☆ U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the H\'ajek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
♻ ☆ Multicell-Fold: geometric learning in folding multicellular life
During developmental processes such as embryogenesis, how a group of cells fold into specific structures, is a central question in biology that defines how living organisms form. Establishing tissue-level morphology critically relies on how every single cell decides to position itself relative to its neighboring cells. Despite its importance, it remains a major challenge to understand and predict the behavior of every cell within the living tissue over time during such intricate processes. To tackle this question, we propose a geometric deep learning model that can predict multicellular folding and embryogenesis, accurately capturing the highly convoluted spatial interactions among cells. We demonstrate that multicellular data can be represented with both granular and foam-like physical pictures through a unified graph data structure, considering both cellular interactions and cell junction networks. We successfully use our model to achieve two important tasks, interpretable 4-D morphological sequence alignment, and predicting local cell rearrangements before they occur at single-cell resolution. Furthermore, using an activation map and ablation studies, we demonstrate that cell geometries and cell junction networks together regulate local cell rearrangement which is critical for embryo morphogenesis. This approach provides a novel paradigm to study morphogenesis, highlighting a unified data structure and harnessing the power of geometric deep learning to accurately model the mechanisms and behaviors of cells during development. It offers a pathway toward creating a unified dynamic morphological atlas for a variety of developmental processes such as embryogenesis.
♻ ☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm. Version Update: Figure 6
♻ ☆ ACEGEN: Reinforcement learning of generative chemical agents for drug discovery
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at \url{https://github.com/acellera/acegen-open} and available for use under the MIT license.
♻ ☆ Uncertainty Quantification and Propagation in Surrogate-based Bayesian Inference
Surrogate models are statistical or conceptual approximations for more complex simulation models. In this context, it is crucial to propagate the uncertainty induced by limited simulation budget and surrogate approximation error to predictions, inference, and subsequent decision-relevant quantities. However, quantifying and then propagating the uncertainty of surrogates is usually limited to special analytic cases or is otherwise computationally very expensive. In this paper, we propose a framework enabling a scalable, Bayesian approach to surrogate modeling with thorough uncertainty quantification, propagation, and validation. Specifically, we present three methods for Bayesian inference with surrogate models given measurement data. This is a task where the propagation of surrogate uncertainty is especially relevant, because failing to account for it may lead to biased and/or overconfident estimates of the parameters of interest. We showcase our approach in three detailed case studies for linear and nonlinear real-world modeling scenarios. Uncertainty propagation in surrogate models enables more reliable and safe approximation of expensive simulators and will therefore be useful in various fields of applications.
♻ ☆ Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. Focusing on the safety domain, we use Rainbow Teaming to target various state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90% across all tested models. Furthermore, we demonstrate that fine-tuning models with synthetic data generated by the Rainbow Teaming method significantly enhances their safety without sacrificing general performance or helpfulness. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity, showcasing its potential to drive robust open-ended self-improvement in a wide range of applications.
♻ ☆ ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
♻ ☆ DropKAN: Regularizing KANs by masking post-activations
We propose DropKAN (Dropout Kolmogorov-Arnold Networks) a regularization method that prevents co-adaptation of activation function weights in Kolmogorov-Arnold Networks (KANs). DropKAN operates by randomly masking some of the post-activations within the KANs computation graph, while scaling-up the retained post-activations. We show that this simple procedure that require minimal coding effort has a regularizing effect and consistently lead to better generalization of KANs. We analyze the adaptation of the standard Dropout with KANs and demonstrate that Dropout applied to KANs' neurons can lead to unpredictable behaviour in the feedforward pass. We carry an empirical study with real world Machine Learning datasets to validate our findings. Our results suggest that DropKAN is consistently a better alternative to using standard Dropout with KANs, and improves the generalization performance of KANs. Our implementation of DropKAN is available at: \url{https://github.com/Ghaith81/dropkan}.
♻ ☆ Uncertainty-aware transfer across tasks using hybrid model-based successor feature reinforcement learning
Sample efficiency is central to developing practical reinforcement learning (RL) for complex and large-scale decision-making problems. The ability to transfer and generalize knowledge gained from previous experiences to downstream tasks can significantly improve sample efficiency. Recent research indicates that successor feature (SF) RL algorithms enable knowledge generalization between tasks with different rewards but identical transition dynamics. It has recently been hypothesized that combining model-based (MB) methods with SF algorithms can alleviate the limitation of fixed transition dynamics. Furthermore, uncertainty-aware exploration is widely recognized as another appealing approach for improving sample efficiency. Putting together two ideas of hybrid model-based successor feature (MB-SF) and uncertainty leads to an approach to the problem of sample efficient uncertainty-aware knowledge transfer across tasks with different transition dynamics or/and reward functions. In this paper, the uncertainty of the value of each action is approximated by a Kalman filter (KF)-based multiple-model adaptive estimation. This KF-based framework treats the parameters of a model as random variables. To the best of our knowledge, this is the first attempt at formulating a hybrid MB-SF algorithm capable of generalizing knowledge across large or continuous state space tasks with various transition dynamics while requiring less computation at decision time than MB methods. The number of samples required to learn the tasks was compared to recent SF and MB baselines. The results show that our algorithm generalizes its knowledge across different transition dynamics, learns downstream tasks with significantly fewer samples than starting from scratch, and outperforms existing approaches.
comment: 40 pages
♻ ☆ Simulation-Based Inference with Quantile Regression ICML 2024
We present Neural Quantile Estimation (NQE), a novel Simulation-Based Inference (SBI) method based on conditional quantile regression. NQE autoregressively learns individual one dimensional quantiles for each posterior dimension, conditioned on the data and previous posterior dimensions. Posterior samples are obtained by interpolating the predicted quantiles using monotonic cubic Hermite spline, with specific treatment for the tail behavior and multi-modal distributions. We introduce an alternative definition for the Bayesian credible region using the local Cumulative Density Function (CDF), offering substantially faster evaluation than the traditional Highest Posterior Density Region (HPDR). In case of limited simulation budget and/or known model misspecification, a post-processing calibration step can be integrated into NQE to ensure the unbiasedness of the posterior estimation with negligible additional computational cost. We demonstrate that NQE achieves state-of-the-art performance on a variety of benchmark problems.
comment: 9+13 pages, 8+8 figures, ICML 2024
♻ ☆ Explorative Imitation Learning: A Path Signature Approach for Continuous Environments ECAI
Some imitation learning methods combine behavioural cloning with self-supervision to infer actions from state pairs. However, most rely on a large number of expert trajectories to increase generalisation and human intervention to capture key aspects of the problem, such as domain constraints. In this paper, we propose Continuous Imitation Learning from Observation (CILO), a new method augmenting imitation learning with two important features: (i) exploration, allowing for more diverse state transitions, requiring less expert trajectories and resulting in fewer training iterations; and (ii) path signatures, allowing for automatic encoding of constraints, through the creation of non-parametric representations of agents and expert trajectories. We compared CILO with a baseline and two leading imitation learning methods in five environments. It had the best overall performance of all methods in all environments, outperforming the expert in two of them.
comment: This paper has been accepted in the 27th European Conference on Artificial Intelligence (ECAI) 2024
♻ ☆ Geometric Analysis of Unconstrained Feature Models with $d=K$
Recently, interesting empirical phenomena known as Neural Collapse have been observed during the final phase of training deep neural networks for classification tasks. We examine this issue when the feature dimension d is equal to the number of classes K. We demonstrate that two popular unconstrained feature models are strict saddle functions, with every critical point being either a global minimum or a strict saddle point that can be exited using negative curvatures. The primary findings conclusively confirm the conjecture on the unconstrained feature models in previous articles.
♻ ☆ Multimodal Explainability via Latent Shift applied to COVID-19 stratification
We are witnessing a widespread adoption of artificial intelligence in healthcare. However, most of the advancements in deep learning in this area consider only unimodal data, neglecting other modalities. Their multimodal interpretation necessary for supporting diagnosis, prognosis and treatment decisions. In this work we present a deep architecture, which jointly learns modality reconstructions and sample classifications using tabular and imaging data. The explanation of the decision taken is computed by applying a latent shift that, simulates a counterfactual prediction revealing the features of each modality that contribute the most to the decision and a quantitative score indicating the modality importance. We validate our approach in the context of COVID-19 pandemic using the AIforCOVID dataset, which contains multimodal data for the early identification of patients at risk of severe outcome. The results show that the proposed method provides meaningful explanations without degrading the classification performance.
♻ ☆ Predictive Coding Networks and Inference Learning: Tutorial and Survey
Recent years have witnessed a growing call for renewed emphasis on neuroscience-inspired approaches in artificial intelligence research, under the banner of NeuroAI. A prime example of this is predictive coding networks (PCNs), based on the neuroscientific framework of predictive coding. This framework views the brain as a hierarchical Bayesian inference model that minimizes prediction errors through feedback connections. Unlike traditional neural networks trained with backpropagation (BP), PCNs utilize inference learning (IL), a more biologically plausible algorithm that explains patterns of neural activity that BP cannot. Historically, IL has been more computationally intensive, but recent advancements have demonstrated that it can achieve higher efficiency than BP with sufficient parallelization. Furthermore, PCNs can be mathematically considered a superset of traditional feedforward neural networks (FNNs), significantly extending the range of trainable architectures. As inherently probabilistic (graphical) latent variable models, PCNs provide a versatile framework for both supervised learning and unsupervised (generative) modeling that goes beyond traditional artificial neural networks. This work provides a comprehensive review and detailed formal specification of PCNs, particularly situating them within the context of modern ML methods. Additionally, we introduce a Python library (PRECO) for practical implementation. This positions PC as a promising framework for future ML innovations.
comment: 46 pages, 13 figures, 8 tables
♻ ☆ A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits NeurIPS 2022
We study the stochastic multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution. Specifically, our policy design (i) enjoys the worst-case optimality for the expected regret at order $O(\sqrt{KT\ln T})$ and (ii) has the worst-case tail probability of incurring a regret larger than any $x>0$ being upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$, a rate that we prove to be best achievable with respect to $T$ for all worst-case optimal policies. Our proposed policy achieves a delicate balance between doing more exploration at the beginning of the time horizon and doing more exploitation when approaching the end, compared to standard confidence-bound-based policies. We also enhance the policy design to accommodate the "any-time" setting where $T$ is unknown a priori, and prove equivalently desired policy performances as compared to the "fixed-time" setting with known $T$. Numerical experiments are conducted to illustrate the theoretical findings. We find that from a managerial perspective, our new policy design yields better tail distributions and is preferable than celebrated policies especially when (i) there is a risk of under-estimating the volatility profile, or (ii) there is a challenge of tuning policy hyper-parameters. We conclude by extending our proposed policy design to the stochastic linear bandit setting that leads to both worst-case optimality in terms of expected regret and light-tailed risk on the regret distribution.
comment: Preliminary version appeared in NeurIPS 2022
♻ ☆ Fast and Effective Weight Update for Pruned Large Language Models
Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
♻ ☆ Automated Deterministic Auction Design with Objective Decomposition
Identifying high-revenue mechanisms that are both dominant strategy incentive compatible (DSIC) and individually rational (IR) is a fundamental challenge in auction design. While theoretical approaches have encountered bottlenecks in multi-item auctions, there has been much empirical progress in automated designing such mechanisms using machine learning. However, existing research primarily focuses on randomized auctions, with less attention given to the more practical deterministic auctions. Therefore, this paper investigates the automated design of deterministic auctions and introduces OD-VVCA, an objective decomposition approach for automated designing Virtual Valuations Combinatorial Auctions (VVCAs). Firstly, we restrict our mechanism to deterministic VVCAs, which are inherently DSIC and IR. Afterward, we utilize a parallelizable dynamic programming algorithm to compute the allocation and revenue outcomes of a VVCA efficiently. We then decompose the revenue objective function into continuous and piecewise constant discontinuous components, optimizing each using distinct methods. Extensive experiments show that OD-VVCA achieves high revenue in multi-item auctions, especially in large-scale settings where it outperforms both randomized and deterministic baselines, indicating its efficacy and scalability.
♻ ☆ Chain of Code: Reasoning with a Language Model-Augmented Code Emulator ICML 2024
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
comment: ICML 2024 Oral; Project webpage: https://chain-of-code.github.io
♻ ☆ An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification
Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical data-centric AI approaches in the modeling process to improve prediction performance. However, there have been debates and questions about the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance, called the Rashomon effect, in model selection, and they may even produce different predictions for the same observations. Selecting one of these models without considering the predictive multiplicity -- which is the case of yielding conflicting models' predictions for any sample -- can result in blind selection. In this paper, the impact of balancing methods on predictive multiplicity is examined using the Rashomon effect. It is crucial because the blind model selection in data-centric AI is risky from a set of approximately equally accurate models. This may lead to severe problems in model selection, validation, and explanation. To tackle this matter, we conducted real dataset experiments to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect by using a newly proposed metric obscurity in addition to the existing ones: ambiguity and discrepancy. Our findings showed that balancing methods inflate the predictive multiplicity and yield varying results. To monitor the trade-off between the prediction performance and predictive multiplicity for conducting the modeling process responsibly, we proposed using the extended version of the performance-gain plot when balancing the training data.
comment: 16 pages, 6 figures
♻ ☆ Quantum Normalizing Flows for Anomaly Detection
A Normalizing Flow computes a bijective mapping from an arbitrary distribution to a predefined (e.g. normal) distribution. Such a flow can be used to address different tasks, e.g. anomaly detection, once such a mapping has been learned. In this work we introduce Normalizing Flows for Quantum architectures, describe how to model and optimize such a flow and evaluate our method on example datasets. Our proposed models show competitive performance for anomaly detection compared to classical methods, esp. those ones where there are already quantum inspired algorithms available. In the experiments we compare our performance to isolation forests (IF), the local outlier factor (LOF) or single-class SVMs.
comment: v3: 15 pages, 8 figures
♻ ☆ 360VFI: A Dataset and Benchmark for Omnidirectional Video Frame Interpolation
With the development of VR-related techniques, viewers can enjoy a realistic and immersive experience through a head-mounted display, while omnidirectional video with a low frame rate can lead to user dizziness. However, the prevailing plane frame interpolation methodologies are unsuitable for Omnidirectional Video Interpolation, chiefly due to the lack of models tailored to such videos with strong distortion, compounded by the scarcity of valuable datasets for Omnidirectional Video Frame Interpolation. In this paper, we introduce the benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation. We present a practical implementation that introduces a distortion prior from omnidirectional video into the network to modulate distortions. We especially propose a pyramid distortion-sensitive feature extractor that uses the unique characteristics of equirectangular projection (ERP) format as prior information. Moreover, we devise a decoder that uses an affine transformation to facilitate the synthesis of intermediate frames further. 360VFI is the first dataset and benchmark that explores the challenge of Omnidirectional Video Frame Interpolation. Through our benchmark analysis, we presented four different distortion conditions scenes in the proposed 360VFI dataset to evaluate the challenge triggered by distortion during interpolation. Besides, experimental results demonstrate that Omnidirectional Video Interpolation can be effectively improved by modeling for omnidirectional distortion.
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
comment: 20 pages; 13 visualizations (nine figures, four tables)
♻ ☆ Ridge Estimation with Nonlinear Transformations
Ridge estimation is an important manifold learning technique. The goal of this paper is to examine the effects of nonlinear transformations on the ridge sets. The main result proves the inclusion relationship between ridges: $\cR(f\circ p)\subseteq \cR(p)$, provided that the transformation $f$ is strictly increasing and concave on the range of the function $p$. Additionally, given an underlying true manifold $\cM$, we show that the Hausdorff distance between $\cR(f\circ p)$ and its projection onto $\cM$ is smaller than the Hausdorff distance between $\cR(p)$ and the corresponding projection. This motivates us to apply an increasing and concave transformation before the ridge estimation. In specific, we show that the power transformations $f^{q}(y)=y^q/q,-\infty
comment: There are some flaws in the proofs for Lemma 1 and Theorem 1. We want to withdraw this version to prevent any potential misunderstanding for readers
♻ ☆ Learning Non-Vacuous Generalization Bounds from Optimization
One of the fundamental challenges in the deep learning community is to theoretically understand how well a deep neural network generalizes to unseen data. However, current approaches often yield generalization bounds that are either too loose to be informative of the true generalization error or only valid to the compressed nets. In this study, we present a simple yet non-vacuous generalization bound from the optimization perspective. We achieve this goal by leveraging that the hypothesis set accessed by stochastic gradient algorithms is essentially fractal-like and thus can derive a tighter bound over the algorithm-dependent Rademacher complexity. The main argument rests on modeling the discrete-time recursion process via a continuous-time stochastic differential equation driven by fractional Brownian motion. Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks such as ResNet and Vision Transformer, even when they are trained on a large-scale dataset (e.g. ImageNet-1K).
comment: 35pages
♻ ☆ Sample-Efficient Linear Representation Learning from Non-IID Non-Isotropic Data ICLR 2024
A powerful concept behind much of the recent progress in machine learning is the extraction of common features across data from heterogeneous sources or tasks. Intuitively, using all of one's data to learn a common representation function benefits both computational effort and statistical generalization by leaving a smaller number of parameters to fine-tune on a given task. Toward theoretically grounding these merits, we propose a general setting of recovering linear operators $M$ from noisy vector measurements $y = Mx + w$, where the covariates $x$ may be both non-i.i.d. and non-isotropic. We demonstrate that existing isotropy-agnostic representation learning approaches incur biases on the representation update, which causes the scaling of the noise terms to lose favorable dependence on the number of source tasks. This in turn can cause the sample complexity of representation learning to be bottlenecked by the single-task data size. We introduce an adaptation, $\texttt{De-bias & Feature-Whiten}$ ($\texttt{DFW}$), of the popular alternating minimization-descent scheme proposed independently in Collins et al., (2021) and Nayer and Vaswani (2022), and establish linear convergence to the optimal representation with noise level scaling down with the $\textit{total}$ source data size. This leads to generalization bounds on the same order as an oracle empirical risk minimizer. We verify the vital importance of $\texttt{DFW}$ on various numerical simulations. In particular, we show that vanilla alternating-minimization descent fails catastrophically even for iid, but mildly non-isotropic data. Our analysis unifies and generalizes prior work, and provides a flexible framework for a wider range of applications, such as in controls and dynamical systems.
comment: Appeared at ICLR 2024 (spotlight presentation)
♻ ☆ Machine-learning-based particle identification with missing data
In this work, we introduce a novel method for Particle Identification (PID) within the scope of the ALICE experiment at the Large Hadron Collider at CERN. Identifying products of ultrarelativisitc collisions delivered by the LHC is one of the crucial objectives of ALICE. Typically employed PID methods rely on hand-crafted selections, which compare experimental data to theoretical simulations. To improve the performance of the baseline methods, novel approaches use machine learning models that learn the proper assignment in a classification task. However, because of the various detection techniques used by different subdetectors, as well as the limited detector efficiency and acceptance, produced particles do not always yield signals in all of the ALICE components. This results in data with missing values. Machine learning techniques cannot be trained with such examples, so a significant part of the data is skipped during training. In this work, we propose the first method for PID that can be trained with all of the available data examples, including incomplete ones. Our approach improves the PID purity and efficiency of the selected sample for all investigated particle species.
♻ ☆ Surrogate Neural Networks Local Stability for Aircraft Predictive Maintenance
Surrogate Neural Networks are nowadays routinely used in industry as substitutes for computationally demanding engineering simulations (e.g., in structural analysis). They allow to generate faster predictions and thus analyses in industrial applications e.g., during a product design, testing or monitoring phases. Due to their performance and time-efficiency, these surrogate models are now being developed for use in safety-critical applications. Neural network verification and in particular the assessment of their robustness (e.g., to perturbations) is the next critical step to allow their inclusion in real-life applications and certification. We assess the applicability and scalability of empirical and formal methods in the context of aircraft predictive maintenance for surrogate neural networks designed to predict the stress sustained by an aircraft part from external loads. The case study covers a high-dimensional input and output space and the verification process thus accommodates multi-objective constraints. We explore the complementarity of verification methods in assessing the local stability property of such surrogate models to input noise. We showcase the effectiveness of sequentially combining methods in one verification 'pipeline' and demonstrate the subsequent gain in runtime required to assess the targeted property.
comment: Peer-reviewed and accepted at the 29th International Conference on Formal Methods for Industrial Critical Systems (FMICS 2024) - 15 pages
♻ ☆ Graph Condensation: A Survey
The rapid growth of graph data poses significant challenges in storage, transmission, and particularly the training of graph neural networks (GNNs). To address these challenges, graph condensation (GC) has emerged as an innovative solution. GC focuses on synthesizing a compact yet highly representative graph, enabling GNNs trained on it to achieve performance comparable to those trained on the original large graph. The notable efficacy of GC and its broad prospects have garnered significant attention and spurred extensive research. This survey paper provides an up-to-date and systematic overview of GC, organizing existing research into five categories aligned with critical GC evaluation criteria: effectiveness, generalization, efficiency, fairness, and robustness. To facilitate an in-depth and comprehensive understanding of GC, this paper examines various methods under each category and thoroughly discusses two essential components within GC: optimization strategies and condensed graph generation. We also empirically compare and analyze representative GC methods with diverse optimization strategies based on the five proposed GC evaluation criteria. Finally, we explore the applications of GC in various fields, outline the related open-source libraries, and highlight the present challenges and novel insights, with the aim of promoting advancements in future research. The related resources can be found at https://github.com/XYGaoG/Graph-Condensation-Papers.
♻ ☆ TTSDS -- Text-to-Speech Distribution Score
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
comment: Under review for SLT 2024
♻ ☆ KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search
Quantum architecture search (QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. While multi-layer perceptron-based deep Q-networks have been applied for QAS, their interpretability remains challenging due to the high number of parameters. In this work, we evaluate the practicality of Kolmogorov-Arnold Networks (KANs) in QAS problems, analyzing their efficiency in the task of quantum state preparation and quantum chemistry. In quantum state preparation, our results show that in a noiseless scenario, the probability of success and the number of optimal quantum circuit configurations to generate the multi-qubit maximally entangled states are $2\times$ to $5\times$ higher than Multi-Layer perceptions (MLPs). Moreover, in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. In tackling quantum chemistry problems, we enhance the recently proposed QAS algorithm by integrating Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the traditional MLP. This modification allows us to design a parameterized quantum circuit that contains fewer 2-qubit gates and has a shallower depth, thereby improving the efficiency of finding the ground state of a chemical Hamiltonian. Further investigation reveals that KAN requires a significantly smaller number of learnable parameters compared to MLPs; however, the average time of executing each episode for KAN is higher.
comment: 11 pages and 5 figures, 7 tables. New experiments added and typo removed
♻ ☆ BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10%, in terms of F1-score, achieving an average score of 83.19 and 68.5 over specific and general approaches respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions over the 10.929.533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War, over a 16-month period. This dataset was created in collaboration with [Shevtsov et al., 2022a] where the original authors share anonymized tweets on the discussion of the Russo-Ukrainian war with a total amount of 127.275.386 tweets. The combination of the existing text dataset and the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
♻ ☆ Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data ICML 2024
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data. Both Ambient Diffusion and alternative SURE-based approaches for learning diffusion models from corrupted data resort to approximations which deteriorate performance. We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data, solving an open problem in this space. Our key technical contribution is a method that uses a double application of Tweedie's formula and a consistency loss function that allows us to extend sampling at noise levels below the observed data noise. We also provide further evidence that diffusion models memorize from their training sets by identifying extremely corrupted images that are almost perfectly reconstructed, raising copyright and privacy concerns. Our method for training using corrupted samples can be used to mitigate this problem. We demonstrate this by fine-tuning Stable Diffusion XL to generate samples from a distribution using only noisy samples. Our framework reduces the amount of memorization of the fine-tuning dataset, while maintaining competitive performance.
comment: Accepted to ICML 2024
♻ ☆ One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning ICML 2024
In real-world continual learning (CL) scenarios, tasks often exhibit intricate and unpredictable semantic shifts, posing challenges for fixed prompt management strategies which are tailored to only handle semantic shifts of uniform degree (i.e., uniformly mild or uniformly abrupt). To address this limitation, we propose an adaptive prompting approach that effectively accommodates semantic shifts of varying degree where mild and abrupt shifts are mixed. AdaPromptCL employs the assign-and-refine semantic grouping mechanism that dynamically manages prompt groups in accordance with the semantic similarity between tasks, enhancing the quality of grouping through continuous refinement. Our experiment results demonstrate that AdaPromptCL outperforms existing prompting methods by up to 21.3%, especially in the benchmark datasets with diverse semantic shifts between tasks.
comment: ICML 2024
♻ ☆ Harnessing Quantum Support Vector Machines for Cross-Domain Classification of Quantum States
In the present study, we use cross-domain classification using quantum machine learning for quantum advantages to readdress the entanglement versus separability paradigm. The inherent structure of quantum states and its relation to a particular class of quantum states are used to intuitively classify testing states from domains different from training states, called \textit{cross-domain classification}. Using our quantum machine learning algorithm, we demonstrate efficient classifications of two-qubit mixed states into entangled and separable classes. For analyzing the quantumness of correlations, our model adequately classifies Bell diagonal states as zero and non-zero discord states. In addition, we also extend our analysis to evaluate the robustness of our model using random local unitary transformations. Our results demonstrate the potential of the quantum support vector machine for classifying quantum states across the multi-dimensional Hilbert space in comparison to classical support vector machines and neural networks.
♻ ☆ General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History
Machine learning (ML) has recently shown promising results in medical predictions using electronic health records (EHRs). However, since ML models typically have a limited capability in terms of input sizes, selecting specific medical events from EHRs for use as input is necessary. This selection process, often relying on expert opinion, can cause bottlenecks in development. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate unlimited medical events, select the relevant ones, and make predictions. This allows for an unrestricted input size, eliminating the need for manual event selection. We verified these properties through experiments involving 27 clinical prediction tasks across four independent cohorts, where REMed outperformed the baselines. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
comment: The source codes corresponding to this paper are available at: https://github.com/starmpcc/REMed
♻ ☆ Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
♻ ☆ Semantic Communication for Cooperative Multi-Task Processing over Wireless Networks
In this paper, we investigated semantic communication for multi-task processing using an information-theoretic approach. We introduced the concept of a "semantic source", allowing multiple semantic interpretations from a single observation. We formulated an end-to-end optimization problem taking into account the communication channel, maximizing mutual information (infomax) to design the semantic encoding and decoding process exploiting the statistical relations between semantic variables. To solve the problem we perform data-driven deep learning employing variational approximation techniques. Our semantic encoder is divided into a common unit and multiple specific units to facilitate cooperative multi-task processing. Simulation results demonstrate the effectiveness of our proposed semantic source and system design when statistical relationships exist, comparing cooperative task processing with independent task processing. However, our findings highlight that cooperative multi-tasking is not always beneficial, emphasizing the importance of statistical relationships between tasks and indicating the need for further investigation into the semantically processing of multiple tasks.
comment: This work has been submitted to the IEEE Wireless Communications Letters for possible publication
♻ ☆ MSegRNN:Enhanced SegRNN Model with Mamba for Long-Term Time Series Forecasting
The field of long-term time series forecasting demands handling extensive look-back windows and long-range prediction steps, posing significant challenges for RNN-based methodologies. Among these, SegRNN, a robust RNN-driven model, has gained considerable attention in LTSF analysis for achieving state-of-the-art results while maintaining a remarkably streamlined architecture. Concurrently, the Mamba structure has demonstrated its advantages in small to medium-sized models due to its capability for information selection. This study introduces a variant of SegRNN that preprocesses information using a fine-tuned single-layer Mamba structure. Additionally, it incorporates implicit segmentation and residual structures into the model's encoding section to further reduce the inherent data iterative cycles of RNN architectures and implicitly integrate inter-channel correlations. This variant, named MSegRNN, utilizes the Mamba structure to select useful information, resulting in a transformed sequence. The linear-strategy-adapted derivative retains the superior memory efficiency of the original SegRNN while demonstrating enhanced performance. Empirical evaluations on real-world LTSF datasets demonstrate the superior performance of our model, thereby contributing to the advancement of LTSF methodologies.
♻ ☆ DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation
Federated learning (FL) allows clients to collaboratively train a global model without sharing their local data with a server. However, clients' contributions to the server can still leak sensitive information. Differential privacy (DP) addresses such leakage by providing formal privacy guarantees, with mechanisms that add randomness to the clients' contributions. The randomness makes it infeasible to train large transformer-based models, common in modern federated learning systems. In this work, we empirically evaluate the practicality of fine-tuning large scale on-device transformer-based models with differential privacy in a federated learning system. We conduct comprehensive experiments on various system properties for tasks spanning a multitude of domains: speech recognition, computer vision (CV) and natural language understanding (NLU). Our results show that full fine-tuning under differentially private federated learning (DP-FL) generally leads to huge performance degradation which can be alleviated by reducing the dimensionality of contributions through parameter-efficient fine-tuning (PEFT). Our benchmarks of existing DP-PEFT methods show that DP-Low-Rank Adaptation (DP-LoRA) consistently outperforms other methods. An even more promising approach, DyLoRA, which makes the low rank variable, when naively combined with FL would straightforwardly break differential privacy. We therefore propose an adaptation method that can be combined with differential privacy and call it DP-DyLoRA. Finally, we are able to reduce the accuracy degradation and word error rate (WER) increase due to DP to less than 2% and 7% respectively with 1 million clients and a stringent privacy budget of $\epsilon=2$.
comment: 16 pages, 10 figures, 5 tables
♻ ☆ Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning CVPR'24
Zero-shot learning (ZSL) recognizes the unseen classes by conducting visual-semantic interactions to transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT), which fail to learn matched visual-semantic correspondences for representing semantic-related visual features as lacking of the guidance of semantic information, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discover the semantic-related visual representations explicitly, and ii) discard the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. The extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Codes are available at: https://github.com/shiming-chen/ZSLViT .
comment: Accepted to CVPR'24
♻ ☆ RepCodec: A Speech Representation Codec for Speech Tokenization ACL 2024
With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
comment: ACL 2024 (Main)
♻ ☆ PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers ECCV
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require to rethink the geometric priors that can be used for unsupervised part discovery.
comment: Accepted as a main conference paper at the European Conference of Computer Vision (ECCV) 2024
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in a MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences with a sensitivity over 90%.
comment: Accepted in CEDI 2024 (VII Congreso Espa\~nol de Inform\'atica), A Coru\~na, Spain
♻ ☆ Towards diffusion models for large-scale sea-ice modelling ICML 2024
We make the first steps towards diffusion models for unconditional generation of multivariate and Arctic-wide sea-ice states. While targeting to reduce the computational costs by diffusion in latent space, latent diffusion models also offer the possibility to integrate physical knowledge into the generation process. We tailor latent diffusion models to sea-ice physics with a censored Gaussian distribution in data space to generate data that follows the physical bounds of the modelled variables. Our latent diffusion models reach similar scores as the diffusion model trained in data space, but they smooth the generated fields as caused by the latent mapping. While enforcing physical bounds cannot reduce the smoothing, it improves the representation of the marginal ice zone. Therefore, for large-scale Earth system modelling, latent diffusion models can have many advantages compared to diffusion in data space if the significant barrier of smoothing can be resolved.
comment: 21 pages, 5 Figures, Camera-ready version for the ICML 2024 Machine Learning for Earth System Modeling workshop
♻ ☆ MidiCaps: A large-scale MIDI dataset with text captions
Generative models guided by text prompts are increasingly becoming more popular. However, no text-to-MIDI models currently exist due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting, the first openly available large-scale MIDI dataset with text captions. MIDI (Musical Instrument Digital Interface) files are widely used for encoding musical information and can capture the nuances of musical composition. They are widely used by music producers, composers, musicologists, and performers alike. Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions. Each MIDI caption describes the musical content, including tempo, chord progression, time signature, instruments, genre, and mood, thus facilitating multi-modal exploration and analysis. The dataset encompasses various genres, styles, and complexities, offering a rich data source for training and evaluating models for tasks such as music information retrieval, music understanding, and cross-modal translation. We provide detailed statistics about the dataset and have assessed the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research at the intersection of music and natural language processing, fostering advancements in both fields.
comment: Accepted in ISMIR2024
♻ ☆ CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on \textit{de novo} molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of \textit{de novo} molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at \url{https://github.com/Edapinenut/CBGBench}.
comment: 9 pages main context
♻ ☆ Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields
Understanding the interactions of atoms such as forces in 3D atomistic systems is fundamental to many applications like molecular dynamics and catalyst design. However, simulating these interactions requires compute-intensive ab initio calculations and thus results in limited data for training neural networks. In this paper, we propose to use denoising non-equilibrium structures (DeNS) as an auxiliary task to better leverage training data and improve performance. For training with DeNS, we first corrupt a 3D structure by adding noise to its 3D coordinates and then predict the noise. Different from previous works on denoising, which are limited to equilibrium structures, the proposed method generalizes denoising to a much larger set of non-equilibrium structures. The main difference is that a non-equilibrium structure does not correspond to local energy minima and has non-zero forces, and therefore it can have many possible atomic positions compared to an equilibrium structure. This makes denoising non-equilibrium structures an ill-posed problem since the target of denoising is not uniquely defined. Our key insight is to additionally encode the forces of the original non-equilibrium structure to specify which non-equilibrium structure we are denoising. Concretely, given a corrupted non-equilibrium structure and the forces of the original one, we predict the non-equilibrium structure satisfying the input forces instead of any arbitrary structures. Since DeNS requires encoding forces, DeNS favors equivariant networks, which can easily incorporate forces and other higher-order tensors in node embeddings. We study the effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17 datasets and demonstrate that DeNS can achieve new state-of-the-art results on OC20 and OC22 and significantly improve training efficiency on MD17.
♻ ☆ What is Dataset Distillation Learning? ICML 2024
Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide an framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding on how they can be effectively utilized.
comment: ICML 2024
♻ ☆ Decentralized Entropic Optimal Transport for Distributed Distribution Comparison
Distributed distribution comparison aims to measure the distance between the distributions whose data are scattered across different agents in a distributed system and cannot even be shared directly among the agents. In this study, we propose a novel decentralized entropic optimal transport (DEOT) method, which provides a communication-efficient and privacy-preserving solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the DEOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a decentralized kernel approximation method, in which each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. Besides computing entropic Wasserstein distance, we show that the proposed MRBCD scheme and kernel approximation method also apply to entropic Gromov-Wasserstein distance. We analyze our method's communication complexity and, under mild assumptions, provide a theoretical bound for the approximation error caused by the convergence error, the estimated kernel, and the mismatch between the storage and communication protocols. In addition, we discuss the trade-off between the precision of the EOT distance and the strength of privacy protection when implementing our method. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.
♻ ☆ Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models ICML 2024
Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
comment: ICML 2024. Website: https://hanlin-zhang.com/impossibility-watermarks
♻ ☆ Meta-Learning and representation learner: A short theoretical note
Meta-learning, or "learning to learn," is a subfield of machine learning where the goal is to develop models and algorithms that can learn from various tasks and improve their learning process over time. Unlike traditional machine learning methods focusing on learning a specific task, meta-learning aims to leverage experience from previous tasks to enhance future learning. This approach is particularly beneficial in scenarios where the available data for a new task is limited, but there exists abundant data from related tasks. By extracting and utilizing the underlying structure and patterns across these tasks, meta-learning algorithms can achieve faster convergence and better performance with fewer data. The following notes are mainly inspired from \cite{vanschoren2018meta}, \cite{baxter2019learning}, and \cite{maurer2005algorithmic}.
♻ ☆ Analyzing the Generalization and Reliability of Steering Vectors
Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain many technical difficulties of applying steering vectors to guide models' behaviour at scale.
♻ ☆ SemiSFL: Split Federated Learning on Unlabeled and Non-IID Data
Federated Learning (FL) has emerged to allow multiple clients to collaboratively train machine learning models on their private data at the network edge. However, training and deploying large-scale models on resource-constrained devices is challenging. Fortunately, Split Federated Learning (SFL) offers a feasible solution by alleviating the computation and/or communication burden on clients. However, existing SFL works often assume sufficient labeled data on clients, which is usually impractical. Besides, data non-IIDness poses another challenge to ensure efficient model training. To our best knowledge, the above two issues have not been simultaneously addressed in SFL. Herein, we propose a novel Semi-supervised SFL system, termed SemiSFL, which incorporates clustering regularization to perform SFL with unlabeled and non-IID client data. Moreover, our theoretical and experimental investigations into model convergence reveal that the inconsistent training processes on labeled and unlabeled data have an influence on the effectiveness of clustering regularization. To mitigate the training inconsistency, we develop an algorithm for dynamically adjusting the global updating frequency, so as to improve training performance. Extensive experiments on benchmark models and datasets show that our system provides a 3.8x speed-up in training time, reduces the communication cost by about 70.3% while reaching the target accuracy, and achieves up to 5.8% improvement in accuracy under non-IID scenarios compared to the state-of-the-art baselines.
comment: 16 pages
♻ ☆ A Benchmark Study of Deep-RL Methods for Maximum Coverage Problems over Graphs VLDB 2024
Recent years have witnessed a growing trend toward employing deep reinforcement learning (Deep-RL) to derive heuristics for combinatorial optimization (CO) problems on graphs. Maximum Coverage Problem (MCP) and its probabilistic variant on social networks, Influence Maximization (IM), have been particularly prominent in this line of research. In this paper, we present a comprehensive benchmark study that thoroughly investigates the effectiveness and efficiency of five recent Deep-RL methods for MCP and IM. These methods were published in top data science venues, namely S2V-DQN, Geometric-QN, GCOMB, RL4IM, and LeNSE. Our findings reveal that, across various scenarios, the Lazy Greedy algorithm consistently outperforms all Deep-RL methods for MCP. In the case of IM, theoretically sound algorithms like IMM and OPIM demonstrate superior performance compared to Deep-RL methods in most scenarios. Notably, we observe an abnormal phenomenon in IM problem where Deep-RL methods slightly outperform IMM and OPIM when the influence spread nearly does not increase as the budget increases. Furthermore, our experimental results highlight common issues when applying Deep-RL methods to MCP and IM in practical settings. Finally, we discuss potential avenues for improving Deep-RL methods. Our benchmark study sheds light on potential challenges in current deep reinforcement learning research for solving combinatorial optimization problems.
comment: This paper has been accepted by VLDB 2024
♻ ☆ Ensemble Kalman Filtering Meets Gaussian Process SSM for Non-Mean-Field and Online Inference
The Gaussian process state-space models (GPSSMs) represent a versatile class of data-driven nonlinear dynamical system models. However, the presence of numerous latent variables in GPSSM incurs unresolved issues for existing variational inference approaches, particularly under the more realistic non-mean-field (NMF) assumption, including extensive training effort, compromised inference accuracy, and infeasibility for online applications, among others. In this paper, we tackle these challenges by incorporating the ensemble Kalman filter (EnKF), a well-established model-based filtering technique, into the NMF variational inference framework to approximate the posterior distribution of the latent states. This novel marriage between EnKF and GPSSM not only eliminates the need for extensive parameterization in learning variational distributions, but also enables an interpretable, closed-form approximation of the evidence lower bound (ELBO). Moreover, owing to the streamlined parameterization via the EnKF, the new GPSSM model can be easily accommodated in online learning applications. We demonstrate that the resulting EnKF-aided online algorithm embodies a principled objective function by ensuring data-fitting accuracy while incorporating model regularizations to mitigate overfitting. We also provide detailed analysis and fresh insights for the proposed algorithms. Comprehensive evaluation across diverse real and synthetic datasets corroborates the superior learning and inference performance of our EnKF-aided variational inference algorithms compared to existing methods.
comment: Gaussian process, state-space model, ensemble Kalman filter, online learning, variational inference
♻ ☆ Speed-accuracy trade-off for the diffusion models: Wisdom from nonequilibrium thermodynamics and optimal transport
We discuss a connection between a generative model, called the diffusion model, and nonequilibrium thermodynamics for the Fokker-Planck equation, called stochastic thermodynamics. Based on the techniques of stochastic thermodynamics, we derive the speed-accuracy trade-off for the diffusion models, which is a trade-off relationship between the speed and accuracy of data generation in diffusion models. Our result implies that the entropy production rate in the forward process affects the errors in data generation. From a stochastic thermodynamic perspective, our results provide quantitative insight into how best to generate data in diffusion models. The optimal learning protocol is introduced by the conservative force in stochastic thermodynamics and the geodesic of space by the 2-Wasserstein distance in optimal transport theory. We numerically illustrate the validity of the speed-accuracy trade-off for the diffusion models with different noise schedules such as the cosine schedule, the conditional optimal transport, and the optimal transport.
comment: 26 pages, 5 figures
♻ ☆ LLM4ED: Large Language Models for Automatic Equation Discovery
Equation discovery is aimed at directly extracting physical laws from data and has emerged as a pivotal research domain. Previous methods based on symbolic mathematics have achieved substantial advancements, but often require the design of implementation of complex algorithms. In this paper, we introduce a new framework that utilizes natural language-based prompts to guide large language models (LLMs) in automatically mining governing equations from data. Specifically, we first utilize the generation capability of LLMs to generate diverse equations in string form, and then evaluate the generated equations based on observations. In the optimization phase, we propose two alternately iterated strategies to optimize generated equations collaboratively. The first strategy is to take LLMs as a black-box optimizer and achieve equation self-improvement based on historical samples and their performance. The second strategy is to instruct LLMs to perform evolutionary operators for global search. Experiments are extensively conducted on both partial differential equations and ordinary differential equations. Results demonstrate that our framework can discover effective equations to reveal the underlying physical laws under various nonlinear dynamic systems. Further comparisons are made with state-of-the-art models, demonstrating good stability and usability. Our framework substantially lowers the barriers to learning and applying equation discovery techniques, demonstrating the application potential of LLMs in the field of knowledge discovery.
♻ ☆ Optimal Defender Strategies for CAGE-2 using Causal Modeling and Tree Search
The CAGE-2 challenge is considered a standard benchmark to compare methods for autonomous cyber defense. Current state-of-the-art methods evaluated against this benchmark are based on model-free (offline) reinforcement learning, which does not provide provably optimal defender strategies. We address this limitation and present a formal (causal) model of CAGE-2 together with a method that produces a provably optimal defender strategy, which we call Causal Partially Observable Monte-Carlo Planning (C-POMCP). It has two key properties. First, it incorporates the causal structure of the target system, i.e., the causal relationships among the system variables. This structure allows for a significant reduction of the search space of defender strategies. Second, it is an online method that updates the defender strategy at each time step via tree search. Evaluations against the CAGE-2 benchmark show that C-POMCP achieves state-of-the-art performance with respect to effectiveness and is two orders of magnitude more efficient in computing time than the closest competitor method.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ MergeSFL: Split Federated Learning with Feature Merging and Batch Size Regulation
Recently, federated learning (FL) has emerged as a popular technique for edge AI to mine valuable knowledge in edge computing (EC) systems. To mitigate the computing/communication burden on resource-constrained workers and protect model privacy, split federated learning (SFL) has been released by integrating both data and model parallelism. Despite resource limitations, SFL still faces two other critical challenges in EC, i.e., statistical heterogeneity and system heterogeneity. To address these challenges, we propose a novel SFL framework, termed MergeSFL, by incorporating feature merging and batch size regulation in SFL. Concretely, feature merging aims to merge the features from workers into a mixed feature sequence, which is approximately equivalent to the features derived from IID data and is employed to promote model accuracy. While batch size regulation aims to assign diverse and suitable batch sizes for heterogeneous workers to improve training efficiency. Moreover, MergeSFL explores to jointly optimize these two strategies upon their coupled relationship to better enhance the performance of SFL. Extensive experiments are conducted on a physical platform with 80 NVIDIA Jetson edge devices, and the experimental results show that MergeSFL can improve the final model accuracy by 5.82% to 26.22%, with a speedup by about 1.74x to 4.14x, compared to the baselines.
♻ ☆ LabObf: A Label Protection Scheme for Vertical Federated Learning Through Label Obfuscation
Split Neural Network, as one of the most common architectures used in vertical federated learning, is popular in industry due to its privacy-preserving characteristics. In this architecture, the party holding the labels seeks cooperation from other parties to improve model performance due to insufficient feature data. Each of these participants has a self-defined bottom model to learn hidden representations from its own feature data and uploads the embedding vectors to the top model held by the label holder for final predictions. This design allows participants to conduct joint training without directly exchanging data. However, existing research points out that malicious participants may still infer label information from the uploaded embeddings, leading to privacy leakage. In this paper, we first propose an embedding extension attack manipulating embeddings to undermine existing defense strategies, which rely on constraining the correlation between the embeddings uploaded by participants and the labels. Subsequently, we propose a new label obfuscation defense strategy, called `LabObf', which randomly maps each original integer-valued label to multiple real-valued soft labels with values intertwined, significantly increasing the difficulty for attackers to infer the labels. We conduct experiments on four different types of datasets, and the results show that LabObf significantly reduces the attacker's success rate compared to raw models while maintaining desirable model accuracy.
♻ ☆ Retrieval Augmented Deep Anomaly Detection for Tabular Data CIKM 2024
Deep learning for tabular data has garnered increasing attention in recent years, yet employing deep models for structured data remains challenging. While these models excel with unstructured data, their efficacy with structured data has been limited. Recent research has introduced retrieval-augmented models to address this gap, demonstrating promising results in supervised tasks such as classification and regression. In this work, we investigate using retrieval-augmented models for anomaly detection on tabular data. We propose a reconstruction-based approach in which a transformer model learns to reconstruct masked features of \textit{normal} samples. We test the effectiveness of KNN-based and attention-based modules to select relevant samples to help in the reconstruction process of the target sample. Our experiments on a benchmark of 31 tabular datasets reveal that augmenting this reconstruction-based anomaly detection (AD) method with sample-sample dependencies via retrieval modules significantly boosts performance. The present work supports the idea that retrieval module are useful to augment any deep AD method to enhance anomaly detection on tabular data.
comment: Accepted at CIKM 2024
♻ ☆ Socially Integrated Navigation: A Social Acting Robot with Deep Reinforcement Learning IROS
Mobile robots are being used on a large scale in various crowded situations and become part of our society. The socially acceptable navigation behavior of a mobile robot with individual human consideration is an essential requirement for scalable applications and human acceptance. Deep Reinforcement Learning (DRL) approaches are recently used to learn a robot's navigation policy and to model the complex interactions between robots and humans. We propose to divide existing DRL-based navigation approaches based on the robot's exhibited social behavior and distinguish between social collision avoidance with a lack of social behavior and socially aware approaches with explicit predefined social behavior. In addition, we propose a novel socially integrated navigation approach where the robot's social behavior is adaptive and emerges from the interaction with humans. The formulation of our approach is derived from a sociological definition, which states that social acting is oriented toward the acting of others. The DRL policy is trained in an environment where other agents interact socially integrated and reward the robot's behavior individually. The simulation results indicate that the proposed socially integrated navigation approach outperforms a socially aware approach in terms of ego navigation performance while significantly reducing the negative impact on all agents within the environment.
comment: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ ARCO:Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design
This paper presents ARCO, an adaptive Multi-Agent Reinforcement Learning (MARL)-based co-optimizing compilation framework designed to enhance the efficiency of mapping machine learning (ML) models - such as Deep Neural Networks (DNNs) - onto diverse hardware platforms. The framework incorporates three specialized actor-critic agents within MARL, each dedicated to a distinct aspect of compilation/optimization at an abstract level: one agent focuses on hardware, while two agents focus on software optimizations. This integration results in a collaborative hardware/software co-optimization strategy that improves the precision and speed of DNN deployments. Concentrating on high-confidence configurations simplifies the search space and delivers superior performance compared to current optimization methods. The ARCO framework surpasses existing leading frameworks, achieving a throughput increase of up to 37.95% while reducing the optimization time by up to 42.2% across various DNNs.
comment: Under review
♻ ☆ Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
While prior research has proposed a plethora of methods that build neural classifiers robust against adversarial robustness, practitioners are still reluctant to adopt them due to their unacceptably severe clean accuracy penalties. This paper significantly alleviates this accuracy-robustness trade-off by mixing the output probabilities of a standard classifier and a robust classifier, where the standard network is optimized for clean accuracy and is not robust in general. We show that the robust base classifier's confidence difference for correct and incorrect examples is the key to this improvement. In addition to providing intuitions and empirical evidence, we theoretically certify the robustness of the mixed classifier under realistic assumptions. Furthermore, we adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models, further reducing the accuracy penalty of achieving robustness. The proposed flexible method, termed "adaptive smoothing", can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection. Our empirical evaluation considers strong attack methods, including AutoAttack and adaptive attack. On the CIFAR-100 dataset, our method achieves an 85.21% clean accuracy while maintaining a 38.72% $\ell_\infty$-AutoAttacked ($\epsilon = 8/255$) accuracy, becoming the second most robust method on the RobustBench CIFAR-100 benchmark as of submission, while improving the clean accuracy by ten percentage points compared with all listed models. The code that implements our method is available at https://github.com/Bai-YT/AdaptiveSmoothing.
♻ ☆ Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization
Reinforcement learning (RL) has demonstrated impressive performance in decision-making tasks like embodied control, autonomous driving and financial trading. In many decision-making tasks, the agents often encounter the problem of executing actions under limited budgets. However, classic RL methods typically overlook the challenges posed by such sparse-executing actions. They operate under the assumption that all actions can be taken for a unlimited number of times, both in the formulation of the problem and in the development of effective algorithms. To tackle the issue of limited action execution in RL, this paper first formalizes the problem as a Sparse Action Markov Decision Process (SA-MDP), in which specific actions in the action space can only be executed for a limited time. Then, we propose a policy optimization algorithm, Action Sparsity REgularization (ASRE), which adaptively handles each action with a distinct preference. ASRE operates through two steps: First, ASRE evaluates action sparsity by constrained action sampling. Following this, ASRE incorporates the sparsity evaluation into policy learning by way of an action distribution regularization. We provide theoretical identification that validates the convergence of ASRE to a regularized optimal value function. Experiments on tasks with known sparse-executing actions, where classical RL algorithms struggle to train policy efficiently, ASRE effectively constrains the action sampling and outperforms baselines. Moreover, we present that ASRE can generally improve the performance in Atari games, demonstrating its broad applicability.
♻ ☆ Advancing TTP Analysis: Harnessing the Power of Large Language Models with Retrieval Augmented Generation
Tactics, Techniques, and Procedures (TTPs) outline the methods attackers use to exploit vulnerabilities. The interpretation of TTPs in the MITRE ATT&CK framework can be challenging for cybersecurity practitioners due to presumed expertise and complex dependencies. Meanwhile, advancements with Large Language Models (LLMs) have led to recent surge in studies exploring its uses in cybersecurity operations. It is, however, unclear how LLMs can be used in an efficient and proper way to provide accurate responses for critical domains such as cybersecurity. This leads us to investigate how to better use two types of LLMs: small-scale encoder-only (e.g., RoBERTa) and larger decoder-only (e.g., GPT-3.5) LLMs to comprehend and summarize TTPs with the intended purposes (i.e., tactics) of a cyberattack procedure. This work studies and compares the uses of supervised fine-tuning (SFT) of encoder-only LLMs vs. Retrieval Augmented Generation (RAG) for decoder-only LLMs (without fine-tuning). Both SFT and RAG techniques presumably enhance the LLMs with relevant contexts for each cyberattack procedure. Our studies show decoder-only LLMs with RAG achieves better performance than encoder-only models with SFT, particularly when directly relevant context is extracted by RAG. The decoder-only results could suffer low `Precision' while achieving high `Recall'. Our findings further highlight a counter-intuitive observation that more generic prompts tend to yield better predictions of cyberattack tactics than those that are more specifically tailored.
♻ ☆ Investigating the Indirect Object Identification circuit in Mamba
How well will current interpretability techniques generalize to future models? A relevant case study is Mamba, a recent recurrent architecture with scaling comparable to Transformers. We adapt pre-Mamba techniques to Mamba and partially reverse-engineer the circuit responsible for the Indirect Object Identification (IOI) task. Our techniques provide evidence that 1) Layer 39 is a key bottleneck, 2) Convolutions in layer 39 shift names one position forward, and 3) The name entities are stored linearly in Layer 39's SSM. Finally, we adapt an automatic circuit discovery tool, positional Edge Attribution Patching, to identify a Mamba IOI circuit. Our contributions provide initial evidence that circuit-based mechanistic interpretability tools work well for the Mamba architecture.
♻ ☆ On the Federated Learning Framework for Cooperative Perception
Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
♻ ☆ Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for the brain of embodied agents. However, there is no comprehensive survey for Embodied AI in the era of MLMs. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering the state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in dynamic digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss their potential future directions. We hope this survey will serve as a foundational reference for the research community and inspire continued innovation. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.
comment: The first comprehensive review of Embodied AI in the era of MLMs, 35 pages. We also provide the paper list for Embodied AI: https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List
♻ ☆ DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLM) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.
♻ ☆ Multi-Objective Latent Space Optimization of Generative Molecular Design Models
Molecular design based on generative models, such as variational autoencoders (VAEs), has become increasingly popular in recent years due to its efficiency for exploring high-dimensional molecular space to identify molecules with desired properties. While the efficacy of the initial model strongly depends on the training data, the sampling efficiency of the model for suggesting novel molecules with enhanced properties can be further enhanced via latent space optimization. In this paper, we propose a multi-objective latent space optimization (LSO) method that can significantly enhance the performance of generative molecular design (GMD). The proposed method adopts an iterative weighted retraining approach, where the respective weights of the molecules in the training data are determined by their Pareto efficiency. We demonstrate that our multi-objective GMD LSO method can significantly improve the performance of GMD for jointly optimizing multiple molecular properties.
comment: 23 pages, 9 figures
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: 43 pages, MLRegTest benchmark available at https://doi.org/10.5061/dryad.dncjsxm4h , associated code at https://github.com/heinz-jeffrey/subregular-learning
Computer Vision and Pattern Recognition
☆ AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising a LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
comment: Project Page: https://www.robots.ox.ac.uk/~vgg/research/autoad-zero/
☆ BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes SIGGRAPH 2024
While Neural Radiance Fields (NeRFs) have demonstrated exceptional quality, their protracted training duration remains a limitation. Generalizable and MVS-based NeRFs, although capable of mitigating training time, often incur tradeoffs in quality. This paper presents a novel approach called BoostMVSNeRFs to enhance the rendering quality of MVS-based NeRFs in large-scale scenes. We first identify limitations in MVS-based NeRF methods, such as restricted viewport coverage and artifacts due to limited input views. Then, we address these limitations by proposing a new method that selects and combines multiple cost volumes during volume rendering. Our method does not require training and can adapt to any MVS-based NeRF methods in a feed-forward fashion to improve rendering quality. Furthermore, our approach is also end-to-end trainable, allowing fine-tuning on specific scenes. We demonstrate the effectiveness of our method through experiments on large-scale datasets, showing significant rendering quality improvements in large-scale scenes and unbounded outdoor scenarios. We release the source code of BoostMVSNeRFs at https://su-terry.github.io/BoostMVSNeRFs/.
comment: SIGGRAPH 2024 Conference Papers. Project page: https://su-terry.github.io/BoostMVSNeRFs/
☆ Reconstructing Training Data From Real World Models Trained with Transfer Learning
Current methods for reconstructing training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in realistic settings for models trained on high-resolution images. Our method adapts the reconstruction scheme of arXiv:2206.07758 to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning.
☆ CarFormer: Self-Driving with Learned Object-Centric Representations ECCV 2024
The choice of representation plays a key role in self-driving. Bird's eye view (BEV) representations have shown remarkable performance in recent years. In this paper, we propose to learn object-centric representations in BEV to distill a complex scene into more actionable information for self-driving. We first learn to place objects into slots with a slot attention model on BEV sequences. Based on these object-centric representations, we then train a transformer to learn to drive as well as reason about the future of other vehicles. We found that object-centric slot representations outperform both scene-level and object-level approaches that use the exact attributes of objects. Slot representations naturally incorporate information about objects from their spatial and temporal context such as position, heading, and speed without explicitly providing it. Our model with slots achieves an increased completion rate of the provided routes and, consequently, a higher driving score, with a lower variance across multiple runs, affirming slots as a reliable alternative in object-centric approaches. Additionally, we validate our model's performance as a world model through forecasting experiments, demonstrating its capability to predict future slot representations accurately. The code and the pre-trained models can be found at https://kuis-ai.github.io/CarFormer/.
comment: Accepted to ECCV 2024, code and the pre-trained models can be found at https://kuis-ai.github.io/CarFormer/
☆ HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning ECCV 2024
Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
comment: To be presented at ECCV 2024
☆ Artist: Aesthetically Controllable Text-Driven Stylization without Training
Diffusion models entangle content and style generation during the denoising process, leading to undesired content modification when directly applied to stylization tasks. Existing methods struggle to effectively control the diffusion model to meet the aesthetic-level requirements for stylization. In this paper, we introduce \textbf{Artist}, a training-free approach that aesthetically controls the content and style generation of a pretrained diffusion model for text-driven stylization. Our key insight is to disentangle the denoising of content and style into separate diffusion processes while sharing information between them. We propose simple yet effective content and style control methods that suppress style-irrelevant content generation, resulting in harmonious stylization results. Extensive experiments demonstrate that our method excels at achieving aesthetic-level stylization requirements, preserving intricate details in the content image and aligning well with the style prompt. Furthermore, we showcase the highly controllability of the stylization strength from various perspectives. Code will be released, project home page: https://DiffusionArtist.github.io
comment: WIP,webpage: https://DiffusionArtist.github.io
☆ SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
comment: Technical report
☆ MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets include the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instructions and image diversity: the limited range of instruction types and the lack of diversity in image data may impact the model's ability to generate diversified and closer to real-world scenarios outputs. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experiment validation and ablation experiments, we demonstrate that MMInstruct could significantly improve the performance of VLLMs, e.g., the model fine-tuning on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data shall be available at https://github.com/yuecao0119/MMInstruct.
comment: 18 pages, 8 figures, technical report
☆ Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images. It excels in region-aware learning and provides strong initializations for various tasks, but struggles to capture high-level semantics without further supervised fine-tuning, likely due to the low-level nature of its pixel reconstruction objective. A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets. However, this approach poses significant training challenges as the reconstruction targets are learned in conjunction with the model, potentially leading to trivial or suboptimal solutions.Our study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM. Through a series of carefully designed experiments and extensive analysis, we identify the source of these challenges, including representation collapsing for joint online/target optimization, learning objectives, the high region correlation in latent space and decoding conditioning. By sequentially addressing these issues, we demonstrate that Latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.
☆ Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spacial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performances, competitive to existing approaches in a series of benchmarks.
☆ Enhancing Cell Instance Segmentation in Scanning Electron Microscopy Images via a Deep Contour Closing Operator
Accurately segmenting and individualizing cells in SEM images is a highly promising technique for elucidating tissue architecture in oncology. While current AI-based methods are effective, errors persist, necessitating time-consuming manual corrections, particularly in areas where the quality of cell contours in the image is poor and requires gap filling. This study presents a novel AI-driven approach for refining cell boundary delineation to improve instance-based cell segmentation in SEM images, also reducing the necessity for residual manual correction. A CNN COp-Net is introduced to address gaps in cell contours, effectively filling in regions with deficient or absent information. The network takes as input cell contour probability maps with potentially inadequate or missing information and outputs corrected cell contour delineations. The lack of training data was addressed by generating low integrity probability maps using a tailored PDE. We showcase the efficacy of our approach in augmenting cell boundary precision using both private SEM images from PDX hepatoblastoma tissues and publicly accessible images datasets. The proposed cell contour closing operator exhibits a notable improvement in tested datasets, achieving respectively close to 50% (private data) and 10% (public data) increase in the accurately-delineated cell proportion compared to state-of-the-art methods. Additionally, the need for manual corrections was significantly reduced, therefore facilitating the overall digitalization process. Our results demonstrate a notable enhancement in the accuracy of cell instance segmentation, particularly in highly challenging regions where image quality compromises the integrity of cell boundaries, necessitating gap filling. Therefore, our work should ultimately facilitate the study of tumour tissue bioarchitecture in onconanotomy field.
comment: 13 pages, 8 figures, 2 tables
☆ Efficient and generalizable prediction of molecular alterations in multiple cancer cohorts using H&E whole slide images
Molecular testing of tumor samples for targetable biomarkers is restricted by a lack of standardization, turnaround-time, cost, and tissue availability across cancer types. Additionally, targetable alterations of low prevalence may not be tested in routine workflows. Algorithms that predict DNA alterations from routinely generated hematoxylin and eosin (H&E)-stained images could prioritize samples for confirmatory molecular testing. Costs and the necessity of a large number of samples containing mutations limit approaches that train individual algorithms for each alteration. In this work, models were trained for simultaneous prediction of multiple DNA alterations from H&E images using a multi-task approach. Compared to biomarker-specific models, this approach performed better on average, with pronounced gains for rare mutations. The models reasonably generalized to independent temporal-holdout, externally-stained, and multi-site TCGA test sets. Additionally, whole slide image embeddings derived using multi-task models demonstrated strong performance in downstream tasks that were not a part of training. Overall, this is a promising approach to develop clinically useful algorithms that provide multiple actionable predictions from a single slide.
☆ Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose \textbf{Maniwhere}, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulate objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
comment: Webpage: https://gemcollector.github.io/maniwhere/
☆ Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only \$1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than stable diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
comment: 41 pages, 28 figures, 5 tables
☆ Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems AAAI
Facial Recognition Systems (FRSs) are being developed and deployed globally at unprecedented rates. Most platforms are designed in a limited set of countries but deployed in worldwide, without adequate checkpoints. This is especially problematic for Global South countries which lack strong legislation to safeguard persons facing disparate performance of these systems. A combination of unavailability of datasets, lack of understanding of FRS functionality and low-resource bias mitigation measures accentuate the problem. In this work, we propose a new face dataset composed of 6,579 unique male and female sportspersons from eight countries around the world. More than 50% of the dataset comprises individuals from the Global South countries and is demographically diverse. To aid adversarial audits and robust model training, each image has four adversarial variants, totaling over 40,000 images. We also benchmark five popular FRSs, both commercial and open-source, for the task of gender prediction (and country prediction for one of the open-source models as an example of red-teaming). Experiments on industrial FRSs reveal accuracies ranging from 98.2%--38.1%, with a large disparity between males and females in the Global South (max difference of 38.5%). Biases are also observed in all FRSs between females of the Global North and South (max difference of ~50%). Grad-CAM analysis identifies the nose, forehead and mouth as the regions of interest on one of the open-source FRSs. Utilizing this insight, we design simple, low-resource bias mitigation solutions using few-shot and novel contrastive learning techniques significantly improving the accuracy with disparity between males and females reducing from 50% to 1.5% in one of the settings. In the red-teaming experiment with the open-source Deepface model, contrastive learning proves more effective than simple fine-tuning.
comment: This work has been accepted for publication at AAAI/ACM AIES 2024
☆ FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones
Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers as more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune 30 Hz MediaPipe Holistic landmark inputs into ByT5-Small and achieve 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks: plausible optimizations to help models run on device in real time.
comment: Access FSboard at https://www.kaggle.com/datasets/googleai/fsboard
☆ Adaptive Extensions of Unbiased Risk Estimators for Unsupervised Magnetic Resonance Image Denoising
The application of Deep Neural Networks (DNNs) to image denoising has notably challenged traditional denoising methods, particularly within complex noise scenarios prevalent in medical imaging. Despite the effectiveness of traditional and some DNN-based methods, their reliance on high-quality, noiseless ground truth images limits their practical utility. In response to this, our work introduces and benchmarks innovative unsupervised learning strategies, notably Stein's Unbiased Risk Estimator (SURE), its extension (eSURE), and our novel implementation, the Extended Poisson Unbiased Risk Estimator (ePURE), within medical imaging frameworks. This paper presents a comprehensive evaluation of these methods on MRI data afflicted with Gaussian and Poisson noise types, a scenario typical in medical imaging but challenging for most denoising algorithms. Our main contribution lies in the effective adaptation and implementation of the SURE, eSURE, and particularly the ePURE frameworks for medical images, showcasing their robustness and efficacy in environments where traditional noiseless ground truth cannot be obtained.
☆ Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation
The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner (i.e., the speaker). Current methodologies typically assume continuous availability of speech and facial modality data, neglecting real-world scenarios where these data may be intermittently unavailable, which often results in model failures. Furthermore, despite utilising advanced deep learning models to extract information from the speaker's multimodal inputs, these models fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, thereby preserving their advantages while significantly enhancing performance and robustness in scenarios with missing modalities. Our framework ensures resilience when faced with missing modality data through the Compensatory Modality Alignment (CMA) module. It also generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the entire encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2\% compared to the original model structure. In scenarios where speech modality data is missing, the performance of appropriate generation shows an improvement, and when facial data is missing, it only exhibits minimal degradation.
☆ MILAN: Milli-Annotations for Lidar Semantic Segmentation
Annotating lidar point clouds for autonomous driving is a notoriously expensive and time-consuming task. In this work, we show that the quality of recent self-supervised lidar scan representations allows a great reduction of the annotation cost. Our method has two main steps. First, we show that self-supervised representations allow a simple and direct selection of highly informative lidar scans to annotate: training a network on these selected scans leads to much better results than a random selection of scans and, more interestingly, to results on par with selections made by SOTA active learning methods. In a second step, we leverage the same self-supervised representations to cluster points in our selected scans. Asking the annotator to classify each cluster, with a single click per cluster, then permits us to close the gap with fully-annotated training sets, while only requiring one thousandth of the point labels.
☆ AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection ECCV 2024
Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://github.com/caoyunkang/AdaCLIP.
comment: Accepted by ECCV 2024
☆ Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video
Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring an extensive training dataset of object masks, relying instead on coarse video labels indicating object presence. Current state-of-the-art methods either require multiple independent stages of processing that employ motion cues or, in the case of end-to-end trainable networks, lack in segmentation accuracy, in part due to the difficulty of learning segmentation maps from videos with transient object presence. This limits the application of WSVOS for semantic annotation of surgical videos where multiple surgical tools frequently move in and out of the field of view, a problem that is more difficult than typically encountered in WSVOS. This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework to disentangle spatiotemporal information using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network designed to resolve temporal conflicts when specifics about object location and timing in the video are not provided works with a student network that integrates information over time by leveraging temporal dependencies. We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60\% of annotated frames. Our method outperforms state-of-the-art techniques and generates superior segmentation masks under video-level weak supervision.
comment: 13 pages, 6 figures, 8 tables
☆ CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning BMVC 2024
With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This led to the development of numerous prompting strategies to effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains significantly deviating from the pre-training and preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM, which exploits generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.
comment: 15 pages, 1 figure. Accepted at the The 35th British Machine Vision Conference 2024 (BMVC 2024), Glasgow, UK
☆ RADA: Robust and Accurate Feature Learning with Domain Adaptation
Recent advancements in keypoint detection and descriptor extraction have shown impressive performance in local feature learning tasks. However, existing methods generally exhibit suboptimal performance under extreme conditions such as significant appearance changes and domain shifts. In this study, we introduce a multi-level feature aggregation network that incorporates two pivotal components to facilitate the learning of robust and accurate features with domain adaptation. First, we employ domain adaptation supervision to align high-level feature distributions across different domains to achieve invariant domain representations. Second, we propose a Transformer-based booster that enhances descriptor robustness by integrating visual and geometric information through wave position encoding concepts, effectively handling complex conditions. To ensure the accuracy and robustness of features, we adopt a hierarchical architecture to capture comprehensive information and apply meticulous targeted supervision to keypoint detection, descriptor extraction, and their coupled processing. Extensive experiments demonstrate that our method, RADA, achieves excellent results in image matching, camera pose estimation, and visual localization tasks.
☆ Unsupervised Mastoidectomy for Cochlear CT Mesh Reconstruction Using Highly Noisy Data
Cochlear Implant (CI) procedures involve inserting an array of electrodes into the cochlea located inside the inner ear. Mastoidectomy is a surgical procedure that uses a high-speed drill to remove part of the mastoid region of the temporal bone, providing safe access to the cochlea through the middle and inner ear. We aim to develop an intraoperative navigation system that registers plans created using 3D preoperative Computerized Tomography (CT) volumes with the 2D surgical microscope view. Herein, we propose a method to synthesize the mastoidectomy volume using only the preoperative CT scan, where the mastoid is intact. We introduce an unsupervised learning framework designed to synthesize mastoidectomy. For model training purposes, this method uses postoperative CT scans to avoid manual data cleaning or labeling, even when the region removed during mastoidectomy is visible but affected by metal artifacts, low signal-to-noise ratio, or electrode wiring. Our approach estimates mastoidectomy regions with a mean dice score of 70.0%. This approach represents a major step forward for CI intraoperative navigation by predicting realistic mastoidectomy-removed regions in preoperative planning that can be used to register the pre-surgery plan to intraoperative microscopy.
☆ STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay ECCV 2024
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during the open-world inference process, there are inevitably test data instances from unknown classes, commonly referred to as outliers. This paper pays attention to the problem that conducts both sample recognition and outlier rejection during inference while outliers exist. To address this problem, we propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch. In particular, the memory bank is dynamically updated by selecting low-entropy and label-consistent samples in a class-balanced manner. In addition, we develop a self-weighted entropy minimization strategy that assigns higher weight to low-entropy samples. Extensive results demonstrate that STAMP outperforms existing TTA methods in terms of both recognition and outlier detection performance. The code is released at https://github.com/yuyongcan/STAMP.
comment: Accepted by ECCV 2024
☆ Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis ECCV 2024
Object detection is a pivotal task in computer vision that has received significant attention in previous years. Nonetheless, the capability of a detector to localise objects out of the training distribution remains unexplored. Whilst recent approaches in object-level out-of-distribution (OoD) detection heavily rely on class labels, such approaches contradict truly open-world scenarios where the class distribution is often unknown. In this context, anomaly detection focuses on detecting unseen instances rather than classifying detections as OoD. This work aims to bridge this gap by leveraging an open-world object detector and an OoD detector via virtual outlier synthesis. This is achieved by using the detector backbone features to first learn object pseudo-classes via self-supervision. These pseudo-classes serve as the basis for class-conditional virtual outlier sampling of anomalous features that are classified by an OoD head. Our approach empowers our overall object detector architecture to learn anomaly-aware feature representations without relying on class labels, hence enabling truly open-world object anomaly detection. Empirical validation of our approach demonstrates its effectiveness across diverse datasets encompassing various imaging modalities (visible, infrared, and X-ray). Moreover, our method establishes state-of-the-art performance on object-level anomaly detection, achieving an average recall score improvement of over 5.4% for natural images and 23.5% for a security X-ray dataset compared to the current approaches. In addition, our method detects anomalies in datasets where current approaches fail. Code available at https://github.com/KostadinovShalon/oln-ssos.
comment: 35 pages, 21 figures, includes supplementary material, accepted at ECCV 2024
☆ LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Albeit the progress, few public benchmark is available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, as part of the question, it contains a referring query that references related video contexts, called referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that the LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when they are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
comment: 29 pages
☆ Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond ECCV 2024
In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.
comment: ECCV 2024 - Benchmark page: https://ade-ood.github.io/
☆ Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders
Despite the proliferation of large vision-language foundation models, estimation of the learning and forgetting outcomes following fine-tuning of these models remains largely unexplored. Inspired by work highlighting the significance of the modality gap in contrastive dual-encoders, we propose the Inter-Intra Modal Measure (IIMM). Combining terms quantifying the similarity between image embeddings and the similarity between incorrect image and label embedding pairs, the IIMM functions as a strong predictor of performance changes with fine-tuning. Our extensive empirical analysis across four state-of-the-art vision-language models (CLIP, SigLIP, CoCa, EVA-02-CLIP) and five fine-tuning techniques (full fine-tuning, BitFit, attention-weight tuning, LoRA, CLIP-Adapter) demonstrates a strong, statistically significant linear relationship: fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation, with some parameter-efficient fine-tuning (PEFT) methods showing extreme forgetting. We compare our measure against transfer scores from state-of-the-art model selection methods and show that the IIMM is significantly more predictive of accuracy gains. With only a single forward pass of the target data, practitioners can leverage this key insight to heuristically evaluate the degree to which a model can be expected to improve following fine-tuning. Given additional knowledge about the model's performance on a few diverse tasks, this heuristic further evolves into a strong predictor of expected performance changes when training for new tasks.
☆ SAM2CLIP2SAM: Vision Language Model for Segmentation of 3D CT Scans for Covid-19 Detection
This paper presents a new approach for effective segmentation of images that can be integrated into any model and methodology; the paradigm that we choose is classification of medical images (3-D chest CT scans) for Covid-19 detection. Our approach includes a combination of vision-language models that segment the CT scans, which are then fed to a deep neural architecture, named RACNet, for Covid-19 detection. In particular, a novel framework, named SAM2CLIP2SAM, is introduced for segmentation that leverages the strengths of both Segment Anything Model (SAM) and Contrastive Language-Image Pre-Training (CLIP) to accurately segment the right and left lungs in CT scans, subsequently feeding these segmented outputs into RACNet for classification of COVID-19 and non-COVID-19 cases. At first, SAM produces multiple part-based segmentation masks for each slice in the CT scan; then CLIP selects only the masks that are associated with the regions of interest (ROIs), i.e., the right and left lungs; finally SAM is given these ROIs as prompts and generates the final segmentation mask for the lungs. Experiments are presented across two Covid-19 annotated databases which illustrate the improved performance obtained when our method has been used for segmentation of the CT scans.
☆ Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning
In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, dataset diversity is incompletely understood. The current best practice is to try to maximize dataset size and class balance. Yet large, class-balanced datasets are not guaranteed to be diverse: images can still be arbitrarily similar. We hypothesized that, for a given model architecture, better model performance can be achieved by maximizing dataset diversity more directly. This could open a path for performance improvement without additional computational resources or architectural advances. To test this hypothesis, we introduce a comprehensive framework of diversity measures, developed in ecology, that generalizes familiar quantities like Shannon entropy by accounting for similarities and differences among images. (Dataset size and class balance emerge from this framework as special cases.) By analyzing thousands of subsets from seven medical datasets representing ultrasound, X-ray, CT, and pathology images, we found that the best correlates of performance were not size or class balance but $A$ -- ``big alpha'' -- a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for similarities among images. One of these, $A_0$, explained 67\% of the variance in balanced accuracy across all subsets, vs. 54\% for class balance and just 39\% for size. The best pair was size and $A_1$ (79\%), which outperformed size and class balance (74\%). $A$ performed best for subsets from individual datasets as well as across datasets, supporting the generality of these results. We propose maximizing $A$ as a potential new way to improve the performance of deep learning in medical imaging.
comment: 11 pages, 5 figures, 3 tables
☆ GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI
Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classification requires the inclusion of extensive multimodal data, such as assessment scales and various neuroimaging techniques like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). However, consistently tracking the diagnosis of the same individual over time and simultaneously collecting multimodal data poses significant challenges. To address this issue, we introduce GFE-Mamba, a classifier based on Generative Feature Extraction (GFE). This classifier effectively integrates data from assessment scales, MRI, and PET, enabling deeper multimodal fusion. It efficiently extracts both long and short sequence information and incorporates additional information beyond the pixel space. This approach not only improves classification accuracy but also enhances the interpretability and stability of the model. We constructed datasets of over 3000 samples based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) for a two-step training process. Our experimental results demonstrate that the GFE-Mamba model is effective in predicting the conversion from MCI to AD and outperforms several state-of-the-art methods. Our source code and ADNI dataset processing code are available at https://github.com/Tinysqua/GFE-Mamba.
comment: 35 pages, 4 figures
☆ Harmonizing Flows: Leveraging normalizing flows for unsupervised and source-free MRI harmonization
Lack of standardization and various intrinsic parameters for magnetic resonance (MR) image acquisition results in heterogeneous images across different sites and devices, which adversely affects the generalization of deep neural networks. To alleviate this issue, this work proposes a novel unsupervised harmonization framework that leverages normalizing flows to align MR images, thereby emulating the distribution of a source domain. The proposed strategy comprises three key steps. Initially, a normalizing flow network is trained to capture the distribution characteristics of the source domain. Then, we train a shallow harmonizer network to reconstruct images from the source domain via their augmented counterparts. Finally, during inference, the harmonizer network is updated to ensure that the output images conform to the learned source domain distribution, as modeled by the normalizing flow network. Our approach, which is unsupervised, source-free, and task-agnostic is assessed in the context of both adults and neonatal cross-domain brain MRI segmentation, as well as neonatal brain age estimation, demonstrating its generalizability across tasks and population demographics. The results underscore its superior performance compared to existing methodologies. The code is available at https://github.com/farzad-bz/Harmonizing-Flows
☆ Mamba meets crack segmentation
Cracks pose safety risks to infrastructure and cannot be overlooked. The prevailing structures in existing crack segmentation networks predominantly consist of CNNs or Transformers. However, CNNs exhibit a deficiency in global modeling capability, hindering the representation to entire crack features. Transformers can capture long-range dependencies but suffer from high and quadratic complexity. Recently, Mamba has garnered extensive attention due to its linear spatial and computational complexity and its powerful global perception. This study explores the representation capabilities of Mamba to crack features. Specifically, this paper uncovers the connection between Mamba and the attention mechanism, providing a profound insight, an attention perspective, into interpreting Mamba and devising a novel Mamba module following the principles of attention blocks, namely CrackMamba. We compare CrackMamba with the most prominent visual Mamba modules, Vim and Vmamba, on two datasets comprising asphalt pavement and concrete pavement cracks, and steel cracks, respectively. The quantitative results show that CrackMamba stands out as the sole Mamba block consistently enhancing the baseline model's performance across all evaluation measures, while reducing its parameters and computational costs. Moreover, this paper substantiates that Mamba can achieve global receptive fields through both theoretical analysis and visual interpretability. The discoveries of this study offer a dual contribution. First, as a plug-and-play and simple yet effective Mamba module, CrackMamba exhibits immense potential for integration into various crack segmentation models. Second, the proposed innovative Mamba design concept, integrating Mamba with the attention mechanism, holds significant reference value for all Mamba-based computer vision models, not limited to crack segmentation networks, as investigated in this study.
comment: 32 pages, 8 figures. Preprint submitted to Elsevier
☆ SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams
The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still need to be improved in the utilization of the rich temporal detail and the restoration of the details of the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.
☆ Predicting the Best of N Visual Trackers
We observe that the performance of SOTA visual trackers surprisingly strongly varies across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects a predicted best performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker which keeps predicting best performer after regular temporal intervals. The TP2N is based on self-supervised learning architectures MocoV2, SwAv, BT, and DINO; experiments show that the DINO with ViT-S as a backbone performs the best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks - LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker effectively handling variations in the tracking scenarios within long sequences. For instance, on GOT-10k, BofN meta-tracker average overlap is 88.7% and 91.1% with video and frame-level settings respectively. The best performing tracker, RTS, achieves 85.20% AO. On VOT2022, BofN expected average overlap is 67.88% and 70.98% with video and frame level settings, compared to the best performing ARTrack, 64.12%. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available on https://github.com/BasitAlawode/Best_of_N_Trackers.
☆ Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition
Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
☆ YOLOv10 for Automated Fracture Detection in Pediatric Wrist Trauma X-rays
Wrist fractures are highly prevalent among children and can significantly impact their daily activities, such as attending school, participating in sports, and performing basic self-care tasks. If not treated properly, these fractures can result in chronic pain, reduced wrist functionality, and other long-term complications. Recently, advancements in object detection have shown promise in enhancing fracture detection, with systems achieving accuracy comparable to, or even surpassing, that of human radiologists. The YOLO series, in particular, has demonstrated notable success in this domain. This study is the first to provide a thorough evaluation of various YOLOv10 variants to assess their performance in detecting pediatric wrist fractures using the GRAZPEDWRI-DX dataset. It investigates how changes in model complexity, scaling the architecture, and implementing a dual-label assignment strategy can enhance detection performance. Experimental results indicate that our trained model achieved mean average precision (mAP@50-95) of 51.9\% surpassing the current YOLOv9 benchmark of 43.3\% on this dataset. This represents an improvement of 8.6\%. The implementation code is publicly available at https://github.com/ammarlodhi255/YOLOv10-Fracture-Detection
comment: The code will soon be made publicly available
☆ Differentiable Convex Polyhedra Optimization from Multi-view Images ECCV2024
This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differentiable computation of hyperplane intersection through duality transform with differentiable optimization for vertex positioning with three-plane intersection, enabling gradient-based optimization without the need for 3D implicit fields. This allows for efficient shape representation across a range of applications, from shape parsing to compact mesh reconstruction. This work not only overcomes the challenges of previous approaches but also sets a new standard for representing shapes with convex polyhedra.
comment: ECCV2024 https://github.com/kimren227/DiffConvex
☆ Enhancing Transferability of Targeted Adversarial Examples: A Self-Universal Perspective
Transfer-based targeted adversarial attacks against black-box deep neural networks (DNNs) have been proven to be significantly more challenging than untargeted ones. The impressive transferability of current SOTA, the generative methods, comes at the cost of requiring massive amounts of additional data and time-consuming training for each targeted label. This results in limited efficiency and flexibility, significantly hindering their deployment in practical applications. In this paper, we offer a self-universal perspective that unveils the great yet underexplored potential of input transformations in pursuing this goal. Specifically, transformations universalize gradient-based attacks with intrinsic but overlooked semantics inherent within individual images, exhibiting similar scalability and comparable results to time-consuming learning over massive additional data from diverse classes. We also contribute a surprising empirical insight that one of the most fundamental transformations, simple image scaling, is highly effective, scalable, sufficient, and necessary in enhancing targeted transferability. We further augment simple scaling with orthogonal transformations and block-wise applicability, resulting in the Simple, faSt, Self-universal yet Strong Scale Transformation (S$^4$ST) for self-universal TTA. On the ImageNet-Compatible benchmark dataset, our method achieves a 19.8% improvement in the average targeted transfer success rate against various challenging victim models over existing SOTA transformation methods while only consuming 36% time for attacking. It also outperforms resource-intensive attacks by a large margin in various challenging settings.
comment: 8 pages and 9 figures
☆ HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning ECCV 2024
Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
comment: Accepted as a main conference paper at ECCV 2024 (https://github.com/google/haloquest)
☆ Flow-guided Motion Prediction with Semantics and Dynamic Occupancy Grid Maps SC
Accurate prediction of driving scenes is essential for road safety and autonomous driving. Occupancy Grid Maps (OGMs) are commonly employed for scene prediction due to their structured spatial representation, flexibility across sensor modalities and integration of uncertainty. Recent studies have successfully combined OGMs with deep learning methods to predict the evolution of scene and learn complex behaviours. These methods, however, do not consider prediction of flow or velocity vectors in the scene. In this work, we propose a novel multi-task framework that leverages dynamic OGMs and semantic information to predict both future vehicle semantic grids and the future flow of the scene. This incorporation of semantic flow not only offers intermediate scene features but also enables the generation of warped semantic grids. Evaluation on the real-world NuScenes dataset demonstrates improved prediction capabilities and enhanced ability of the model to retain dynamic vehicles within the scene.
comment: Accepted for publication at the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC) (ITSC 2024)
☆ SLVideo: A Sign Language Video Moment Retrieval Framework
Sign Language Recognition has been studied and developed throughout the years to help the deaf and hard-of-hearing people in their day-to-day lives. These technologies leverage manual sign recognition algorithms, however, most of them lack the recognition of facial expressions, which are also an essential part of Sign Language as they allow the speaker to add expressiveness to their dialogue or even change the meaning of certain manual signs. SLVideo is a video moment retrieval software for Sign Language videos with a focus on both hands and facial signs. The system extracts embedding representations for the hand and face signs from video frames to capture the language signs in full. This will then allow the user to search for a specific sign language video segment with text queries, or to search by similar sign language videos. To test this system, a collection of five hours of annotated Sign Language videos is used as the dataset, and the initial results are promising in a zero-shot setting.SLVideo is shown to not only address the problem of searching sign language videos but also supports a Sign Language thesaurus with a search by similarity technique. Project web page: https://novasearch.github.io/SLVideo/
comment: 5 pages, 3 figures, 1 table
☆ MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics
Place recognition is a challenging task in computer vision, crucial for enabling autonomous vehicles and robots to navigate previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors. We employ a late fusion approach to integrate these modalities, providing a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition model performance compared to single modality approaches and leads to state-of-the-art quality. We also show that separate usage of visual or textual semantics (which are more compact representations of sensory data) can achieve promising results in place recognition. The code for our method is publicly available: https://github.com/alexmelekhin/MSSPlace
comment: This work has been submitted to the IEEE for possible publication
☆ DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving
In autonomous driving, deep models have shown remarkable performance across various visual perception tasks with the demand of high-quality and huge-diversity training datasets. Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects. However, manually collecting these data presents huge challenges and expensive cost. With the rapid development of large generative models, we propose DriveDiTFit, a novel method for efficiently generating autonomous Driving data by Fine-tuning pre-trained Diffusion Transformers (DiTs). Specifically, DriveDiTFit utilizes a gap-driven modulation technique to carefully select and efficiently fine-tune a few parameters in DiTs according to the discrepancy between the pre-trained source data and the target driving data. Additionally, DriveDiTFit develops an effective weather and lighting condition embedding module to ensure diversity in the generated data, which is initialized by a nearest-semantic-similarity initialization approach. Through progressive tuning scheme to refined the process of detail generation in early diffusion process and enlarging the weights corresponding to small objects in training loss, DriveDiTFit ensures high-quality generation of small moving objects in the generated data. Extensive experiments conducted on driving datasets confirm that our method could efficiently produce diverse real driving data. The source codes will be available at https://github.com/TtuHamg/DriveDiTFit.
☆ TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly ECCV 2024
Inferring step-wise actions to assemble 3D objects with primitive bricks from images is a challenging task due to complex constraints and the vast number of possible combinations. Recent studies have demonstrated promising results on sequential LEGO brick assembly through the utilization of LEGO-Graph modeling to predict sequential actions. However, existing approaches are class-specific and require significant computational and 3D annotation resources. In this work, we first propose a computationally efficient breadth-first search (BFS) LEGO-Tree structure to model the sequential assembly actions by considering connections between consecutive layers. Based on the LEGO-Tree structure, we then design a class-agnostic tree-transformer framework to predict the sequential assembly actions from the input multi-view images. A major challenge of the sequential brick assembly task is that the step-wise action labels are costly and tedious to obtain in practice. We mitigate this problem by leveraging synthetic-to-real transfer learning. Specifically, our model is first pre-trained on synthetic data with full supervision from the available action labels. We then circumvent the requirement for action labels in the real data by proposing an action-to-silhouette projection that replaces action labels with input image silhouettes for self-supervision. Without any annotation on the real data, our model outperforms existing methods with 3D supervision by 7.8% and 11.3% in mIoU on the MNIST and ModelNet Construction datasets, respectively.
comment: ECCV 2024
☆ SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection
Automotive simulation can potentially compensate for a lack of training data in computer vision applications. However, there has been little to no image quality evaluation of automotive simulation and the impact of optical degradations on simulation is little explored. In this work, we investigate Virtual KITTI and the impact of applying variations of Gaussian blur on image sharpness. Furthermore, we consider object detection, a common computer vision application on three different state-of-the-art models, thus allowing us to characterize the relationship between object detection and sharpness. It was found that while image sharpness (MTF50) degrades from an average of 0.245cy/px to approximately 0.119cy/px; object detection performance stays largely robust within 0.58\%(Faster RCNN), 1.45\%(YOLOF) and 1.93\%(DETR) across all respective held-out test sets.
comment: 8 pages, 2 figures, 2 tables
☆ Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
comment: Project webpage: https://maxin-cn.github.io/cinemo_project/
☆ A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control ECCV 2024
Virtual interventions enable the physics-based simulation of device deployment within coronary arteries. This framework allows for counterfactual reasoning by deploying the same device in different arterial anatomies. However, current methods to create such counterfactual arteries face a trade-off between controllability and realism. In this study, we investigate how Latent Diffusion Models (LDMs) can custom synthesize coronary anatomy for virtual intervention studies based on mid-level anatomic constraints such as topological validity, local morphological shape, and global skeletal structure. We also extend diffusion model guidance strategies to the context of morpho-skeletal conditioning and propose a novel guidance method for continuous attributes that adaptively updates the negative guiding condition throughout sampling. Our framework enables the generation and editing of coronary anatomy in a controllable manner, allowing device designers to derive mechanistic insights regarding anatomic variation and simulated device deployment.
comment: Accepted to ECCV 2024
☆ Reinforcement Learning Meets Visual Odometry
Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.
☆ Norface: Improving Facial Expression Analysis by Identity Normalization ECCV2024
Facial Expression Analysis remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network struggles to directly remove the above task-irrelevant noise, by maintaining facial expression consistency but normalizing all original images to a common identity with consistent pose, and background. Then, these additional normalized images are fed into the classification network. Due to consistent identity and other factors (e.g. head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset tasks. For the normalized datasets and code please visit {https://norface-fea.github.io/}.
comment: Accepted by ECCV2024
☆ Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning
Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial association.
comment: Accepted to ACM International Conference on Multimedia (MM) 2024
☆ Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models
Foundation models (FMs) are large neural networks trained on broad datasets, excelling in downstream tasks with minimal fine-tuning. Human activity recognition in video has advanced with FMs, driven by competition among different architectures. However, high accuracies on standard benchmarks can draw an artificially rosy picture, as they often overlook real-world factors like changing camera perspectives. Popular benchmarks, mostly from YouTube or movies, offer diverse views but only coarse actions, which are insufficient for use-cases needing fine-grained, domain-specific actions. Domain-specific datasets (e.g., for industrial assembly) typically use data from limited static perspectives. This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition. We compare multiple backbone architectures and design choices, including image- and video- based models, and various strategies for temporal information fusion, including commonly used score averaging and more novel attention-based temporal aggregation mechanisms. This is the first systematic study of different foundation models and specific design choices for human activity recognition from unknown views, conducted with the goal to provide guidance for backbone- and temporal- fusion scheme selection. Code and models will be made publicly available to the community.
☆ Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
Accurate localization in diverse environments is a fundamental challenge in computer vision and robotics. The task involves determining a sensor's precise position and orientation, typically a camera, within a given space. Traditional localization methods often rely on passive sensing, which may struggle in scenarios with limited features or dynamic environments. In response, this paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy. Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications. Our results demonstrate that our method performs better than the existing one, targeting similar problems and generalizing on synthetic and real data. We also release an open-source implementation to benefit the community.
comment: www.github.com/rvp-group/learning-where-to-look
☆ All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism
In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UMBEnet includes a Dual-Stream (DS) structure that fuses inherent prompts with a Prompt Pool and a Sparse Feature Fusion (SFF) module. The design of the Prompt Pool is aimed at integrating information from different modalities, while inherent prompts are intended to enhance the system's predictive guidance capabilities and effectively manage knowledge related to emotion classification. Moreover, considering the sparsity of effective information across different modalities, the SSF module aims to make full use of all available sensory data through the sparse integration of modality fusion prompts and inherent prompts, maintaining high adaptability and sensitivity to complex emotional states. Extensive experiments on the largest benchmark datasets in the Dynamic Facial Expression Recognition (DFER) field, including DFEW, FERV39k, and MAFW, have proven that UMBEnet consistently outperforms the current state-of-the-art methods. Notably, in scenarios of Modality Missingness and multimodal contexts, UMBEnet significantly surpasses the leading current methods, demonstrating outstanding performance and adaptability in tasks that involve complex emotional understanding with rich multimodal information.
☆ Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Object-centric (OC) representations, which represent the state of a visual scene by modeling it as a composition of objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have not been thoroughly analyzed yet. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains from language to computer vision, marking them as a potential cornerstone of future research for a multitude of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, and demonstrate a viable way to achieve the best of both worlds. The extensiveness of our study, encompassing over 800 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.
☆ Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval
The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall rankings of relevant videos at the top list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) The current similarity measure and AP-based loss are suboptimal for video retrieval; b) The noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results present that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for the multi-media application.
☆ Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping ECCV 2024
Learning efficient representations of local features is a key challenge in feature volume-based 3D neural mapping, especially in large-scale environments. In this paper, we introduce Decomposition-based Neural Mapping (DNMap), a storage-efficient large-scale 3D mapping method that employs a discrete representation based on a decomposition strategy. This decomposition strategy aims to efficiently capture repetitive and representative patterns of shapes by decomposing each discrete embedding into component vectors that are shared across the embedding space. Our DNMap optimizes a set of component vectors, rather than entire discrete embeddings, and learns composition rather than indexing the discrete embeddings. Furthermore, to complement the mapping quality, we additionally learn low-resolution continuous embeddings that require tiny storage space. By combining these representations with a shallow neural network and an efficient octree-based feature volume, our DNMap successfully approximates signed distance functions and compresses the feature volume while preserving mapping quality. Our source code is available at https://github.com/minseong-p/dnmap.
comment: ECCV 2024
☆ Differentiable Product Quantization for Memory Efficient Camera Relocalization ECCV
Camera relocalization relies on 3D models of the scene with a large memory footprint that is incompatible with the memory budget of several applications. One solution to reduce the scene memory size is map compression by removing certain 3D points and descriptor quantization. This achieves high compression but leads to performance drop due to information loss. To address the memory performance trade-off, we train a light-weight scene-specific auto-encoder network that performs descriptor quantization-dequantization in an end-to-end differentiable manner updating both product quantization centroids and network parameters through back-propagation. In addition to optimizing the network for descriptor reconstruction, we encourage it to preserve the descriptor-matching performance with margin-based metric loss functions. Results show that for a local descriptor memory of only 1MB, the synergistic combination of the proposed network and map compression achieves the best performance on the Aachen Day-Night compared to existing compression methods.
comment: Accepted to the European Conference on Computer Vision (ECCV) 2024
☆ Double Deep Learning-based Event Data Coding and Classification
Event cameras have the ability to capture asynchronous per-pixel brightness changes, called "events", offering advantages over traditional frame-based cameras for computer vision applications. Efficiently coding event data is critical for transmission and storage, given the significant volume of events. This paper proposes a novel double deep learning-based architecture for both event data coding and classification, using a point cloud-based representation for events. In this context, the conversions from events to point clouds and back to events are key steps in the proposed solution, and therefore its impact is evaluated in terms of compression and classification performance. Experimental results show that it is possible to achieve a classification performance of compressed events which is similar to one of the original events, even after applying a lossy point cloud codec, notably the recent learning-based JPEG Pleno Point Cloud Coding standard, with a clear rate reduction. Experimental results also demonstrate that events coded using JPEG PCC achieve better classification performance than those coded using the conventional lossy MPEG Geometry-based Point Cloud Coding standard. Furthermore, the adoption of learning-based coding offers high potential for performing computer vision tasks in the compressed domain, which allows skipping the decoding stage while mitigating the impact of coding artifacts.
☆ Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.
☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
☆ SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time
Generating high-resolution images with generative models has recently been made widely accessible by leveraging diffusion models pre-trained on large-scale datasets. Various techniques, such as MultiDiffusion and SyncDiffusion, have further pushed image generation beyond training resolutions, i.e., from square images to panorama, by merging multiple overlapping diffusion paths or employing gradient descent to maintain perceptual coherence. However, these methods suffer from significant computational inefficiencies due to generating and averaging numerous predictions, which is required in practice to produce high-quality and seamless images. This work addresses this limitation and presents a novel approach that eliminates the need to generate and average numerous overlapping denoising predictions. Our method shifts non-overlapping denoising windows over time, ensuring that seams in one timestep are corrected in the next. This results in coherent, high-resolution images with fewer overall steps. We demonstrate the effectiveness of our approach through qualitative and quantitative evaluations, comparing it with MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our method offers several key benefits, including improved computational efficiency and faster inference times while producing comparable or better image quality.
☆ WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation ECCV 2024
In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results.
comment: Accepted at ECCV 2024. The dataset and code can be accessed at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/WebRPG
☆ TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping
Generative AI technologies produce hyper-realistic imagery that can be used for nefarious purposes such as producing misleading or harmful content, among others. This makes Synthetic Image Detection (SID) an essential tool for defending against AI-generated harmful content. Current SID methods typically resize input images to a fixed resolution or perform center-cropping due to computational concerns, leading to challenges in effectively detecting artifacts in high-resolution images. To this end, we propose TextureCrop, a novel image pre-processing technique. By focusing on high-frequency image parts where generation artifacts are prevalent, TextureCrop effectively enhances SID accuracy while maintaining manageable memory requirements. Experimental results demonstrate a consistent improvement in AUC across various detectors by 5.7% compared to center cropping and by 14% compared to resizing, across high-resolution images from the Forensynths and Synthbuster datasets.
comment: 17 pages, 6 images
☆ DiffX: Guide Your Layout to Cross-Modal Generative Modeling
Diffusion models have made significant strides in text-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, including chromatic contrast, thermal illumination, and depth information. In this paper, we introduce a novel diffusion model for general layout-guided cross-modal "RGB+X" generation, called DiffX. We firstly construct the cross-modal image datasets with text descriptions using the LLaVA model for image captioning, supplemented by manual corrections. Notably, DiffX presents a simple yet effective cross-modal generative modeling pipeline, which conducts diffusion and denoising processes in the modality-shared latent space, facilitated by our Dual-Path Variational AutoEncoder (DP-VAE). Furthermore, we incorporate the gated cross-attention mechanism to connect the layout and text conditions, leveraging Long-CLIP for embedding long captions to enhance user guidance. Through extensive experiments, DiffX demonstrates robustness and flexibility in cross-modal generation across three RGB+X datasets: FLIR, MFNet, and COME15K, guided by various layout types. It also shows the potential for adaptive generation of "RGB+X+Y" or more diverse modalities. Our code and processed image captions are available at https://github.com/zeyuwang-zju/DiffX.
☆ In-Context Learning Improves Compositional Understanding of Vision-Language Models
Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this work, we investigate the reasons for such a lack of capability by performing an extensive bench-marking of compositional understanding in VLMs. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Furthermore, we leverage In-Context Learning (ICL) as a way to improve the ability of VLMs to perform more complex reasoning and understanding given an image. Our extensive experiments demonstrate that our proposed approach outperforms baseline models across multiple compositional understanding datasets.
☆ Subthalamic Nucleus segmentation in high-field Magnetic Resonance data. Is space normalization by template co-registration necessary?
Deep Brain Stimulation (DBS) is one of the most successful methods to diminish late-stage Parkinson's Disease (PD) symptoms. It is a delicate surgical procedure which requires detailed pre-surgical patient's study. High-field Magnetic Resonance Imaging (MRI) has proven its improved capacity of capturing the Subthalamic Nucleus (STN) - the main target of DBS in PD - in greater detail than lower field images. Here, we present a comparison between the performance of two different Deep Learning (DL) automatic segmentation architectures, one based in the registration to a brain template and the other performing the segmentation in in the MRI acquisition native space. The study was based on publicly available high-field 7 Tesla (T) brain MRI datasets of T1-weighted and T2-weighted sequences. nnUNet was used on the segmentation step of both architectures, while the data pre and post-processing pipelines diverged. The evaluation metrics showed that the performance of the segmentation directly in the native space yielded better results for the STN segmentation, despite not showing any advantage over the template-based method for the to other analysed structures: the Red Nucleus (RN) and the Substantia Nigra (SN).
☆ 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model ECCV 2024
We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF) that also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each ellipsoid that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of each ellipsoid, which in turn is used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best scoring bundle of rays, which their intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the necessity of an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates near real-time, reaching 15fps on consumer hardware.
comment: Project page: https://mbortolon97.github.io/6dgs/ Accepted to ECCV 2024
☆ Diverse Image Harmonization
Image harmonization aims to adjust the foreground illumination in a composite image to make it harmonious. The existing harmonization methods can only produce one deterministic result for a composite image, ignoring that a composite image could have multiple plausible harmonization results due to multiple plausible reflectances. In this work, we first propose a reflectance-guided harmonization network, which can achieve better performance with the guidance of ground-truth foreground reflectance. Then, we also design a diverse reflectance generation network to predict multiple plausible foreground reflectances, leading to multiple plausible harmonization results. The extensive experiments on the benchmark datasets demonstrate the effectiveness of our method.
☆ Affordance Labeling and Exploration: A Manifold-Based Approach
The advancement in computing power has significantly reduced the training times for deep learning, fostering the rapid development of networks designed for object recognition. However, the exploration of object utility, which is the affordance of the object, as opposed to object recognition, has received comparatively less attention. This work focuses on the problem of exploration of object affordances using existing networks trained on the object classification dataset. While pre-trained networks have proven to be instrumental in transfer learning for classification tasks, this work diverges from conventional object classification methods. Instead, it employs pre-trained networks to discern affordance labels without the need for specialized layers, abstaining from modifying the final layers through the addition of classification layers. To facilitate the determination of affordance labels without such modifications, two approaches, i.e. subspace clustering and manifold curvature methods are tested. These methods offer a distinct perspective on affordance label recognition. Especially, manifold curvature method has been successfully tested with nine distinct pre-trained networks, each achieving an accuracy exceeding 95%. Moreover, it is observed that manifold curvature and subspace clustering methods explore affordance labels that are not marked in the ground truth, but object affords in various cases.
comment: 17 Pages, 3 Figures, 3 Tables
☆ Learning deep illumination-robust features from multispectral filter array images
Multispectral (MS) snapshot cameras equipped with a MS filter array (MSFA), capture multiple spectral bands in a single shot, resulting in a raw mosaic image where each pixel holds only one channel value. The fully-defined MS image is estimated from the raw one through $\textit{demosaicing}$, which inevitably introduces spatio-spectral artifacts. Moreover, training on fully-defined MS images can be computationally intensive, particularly with deep neural networks (DNNs), and may result in features lacking discrimination power due to suboptimal learning of spatio-spectral interactions. Furthermore, outdoor MS image acquisition occurs under varying lighting conditions, leading to illumination-dependent features. This paper presents an original approach to learn discriminant and illumination-robust features directly from raw images. It involves: $\textit{raw spectral constancy}$ to mitigate the impact of illumination, $\textit{MSFA-preserving}$ transformations suited for raw image augmentation to train DNNs on diverse raw textures, and $\textit{raw-mixing}$ to capture discriminant spatio-spectral interactions in raw images. Experiments on MS image classification show that our approach outperforms both handcrafted and recent deep learning-based methods, while also requiring significantly less computational effort.~The source code will be available.
☆ Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions ECCV24
Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the use of paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges associated with annotation on low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary-teacher networks to generate more reliable pseudo labels, enabling our model achieves competitive performance on extremely low-light images without the need for training with low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces the pseudo labels for relatively visible cases, while a keypoints-based complementary teacher focuses on producing the pseudo labels for the missed persons of the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation to challenge a student model in training to outperform the teachers. Experimental results on real low-light dataset (ExLPose-OCN) show, our method achieves 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, despite no low-light ground-truth data is used in our approach, in contrast to the SOTA method. Our code will be available at:https://github.com/ayh015-dev/DA-LLPose.
comment: 18 pages, 3 figure. Accepted by ECCV24
☆ SIGMA:Sinkhorn-Guided Masked Video Modeling ECCV 24
Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods. Our project website with code is available at: https://quva-lab.github.io/SIGMA.
comment: Accepted at ECCV 24
☆ Text2Place: Affordance-aware Text Guided Human Placement ECCV 2024
For a given scene, humans can easily reason for the locations and pose to place objects. Designing a computational model to reason about these affordances poses a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion in a given background scene termed as \textbf{Semantic Human Placement}. This task is extremely challenging given the diverse backgrounds, scale, and pose of the generated person and, finally, the identity preservation of the person. We divide the problem into the following two stages \textbf{i)} learning \textit{semantic masks} using text guidance for localizing regions in the image to place humans and \textbf{ii)} subject-conditioned inpainting to place a given subject adhering to the scene affordance within the \textit{semantic masks}. For learning semantic masks, we leverage rich object-scene priors learned from the text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first ones to provide an effective solution for realistic human placements in diverse real-world scenes. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity. Further, we present results for several downstream tasks - scene hallucination from a single or multiple generated persons and text-based attribute editing. With extensive comparisons against strong baselines, we show the superiority of our method in realistic human placement.
comment: ECCV 2024, Project Page: https://rishubhpar.github.io/Text2Place/
☆ Enhancement of 3D Gaussian Splatting using Raw Mesh for Photorealistic Recreation of Architectures
The photorealistic reconstruction and rendering of architectural scenes have extensive applications in industries such as film, games, and transportation. It also plays an important role in urban planning, architectural design, and the city's promotion, especially in protecting historical and cultural relics. The 3D Gaussian Splatting, due to better performance over NeRF, has become a mainstream technology in 3D reconstruction. Its only input is a set of images but it relies heavily on geometric parameters computed by the SfM process. At the same time, there is an existing abundance of raw 3D models, that could inform the structural perception of certain buildings but cannot be applied. In this paper, we propose a straightforward method to harness these raw 3D models to guide 3D Gaussians in capturing the basic shape of the building and improve the visual quality of textures and details when photos are captured non-systematically. This exploration opens up new possibilities for improving the effectiveness of 3D reconstruction techniques in the field of architectural design.
☆ Learning at a Glance: Towards Interpretable Data-limited Continual Semantic Segmentation via Semantic-Invariance Modelling
Continual semantic segmentation (CSS) based on incremental learning (IL) is a great endeavour in developing human-like segmentation models. However, current CSS approaches encounter challenges in the trade-off between preserving old knowledge and learning new ones, where they still need large-scale annotated data for incremental training and lack interpretability. In this paper, we present Learning at a Glance (LAG), an efficient, robust, human-like and interpretable approach for CSS. Specifically, LAG is a simple and model-agnostic architecture, yet it achieves competitive CSS efficiency with limited incremental data. Inspired by human-like recognition patterns, we propose a semantic-invariance modelling approach via semantic features decoupling that simultaneously reconciles solid knowledge inheritance and new-term learning. Concretely, the proposed decoupling manner includes two ways, i.e., channel-wise decoupling and spatial-level neuron-relevant semantic consistency. Our approach preserves semantic-invariant knowledge as solid prototypes to alleviate catastrophic forgetting, while also constraining sample-specific contents through an asymmetric contrastive learning method to enhance model robustness during IL steps. Experimental results in multiple datasets validate the effectiveness of the proposed method. Furthermore, we introduce a novel CSS protocol that better reflects realistic data-limited CSS settings, and LAG achieves superior performance under multiple data-limited conditions.
☆ YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images
With the rapid growth of the PCB manufacturing industry, there is an increasing demand for computer vision inspection to detect defects during production. Improving the accuracy and generalization of PCB defect detection models remains a significant challenge. This paper proposes a high-precision, robust, and real-time end-to-end method for PCB defect detection based on deep Convolutional Neural Networks (CNN). Traditional methods often suffer from low accuracy and limited applicability. We propose a novel approach combining YOLOv5 and multiscale modules for hierarchical residual-like connections. In PCB defect detection, noise can confuse the background and small targets. The YOLOv5 model provides a strong foundation with its real-time processing and accurate object detection capabilities. The multi-scale module extends traditional approaches by incorporating hierarchical residual-like connections within a single block, enabling multiscale feature extraction. This plug-and-play module significantly enhances performance by extracting features at multiple scales and levels, which are useful for identifying defects of varying sizes and complexities. Our multi-scale architecture integrates feature extraction, defect localization, and classification into a unified network. Experiments on a large-scale PCB dataset demonstrate significant improvements in precision, recall, and F1-score compared to existing methods. This work advances computer vision inspection for PCB defect detection, providing a reliable solution for high-precision, robust, real-time, and domain-adaptive defect detection in the PCB manufacturing industry.
☆ Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attention
With the widespread deployment of video surveillance devices and the demand for intelligent system development, video anomaly detection (VAD) has become an important part of constructing intelligent surveillance systems. Expanding the discriminative boundary between normal and abnormal events to enhance performance is the common goal and challenge of VAD. To address this problem, we propose a Bidirectional Skip-frame Prediction (BiSP) network based on a dual-stream autoencoder, from the perspective of learning the intra-domain disparity between different features. The BiSP skips frames in the training phase to achieve the forward and backward frame prediction respectively, and in the testing phase, it utilizes bidirectional consecutive frames to co-predict the same intermediate frames, thus expanding the degree of disparity between normal and abnormal events. The BiSP designs the variance channel attention and context spatial attention from the perspectives of movement patterns and object scales, respectively, thus ensuring the maximization of the disparity between normal and abnormal in the feature extraction and delivery with different dimensions. Extensive experiments from four benchmark datasets demonstrate the effectiveness of the proposed BiSP, which substantially outperforms state-of-the-art competing methods.
comment: 7 pages
☆ Local All-Pair Correspondence for Point Tracking ECCV 2024
We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
comment: ECCV 2024. Project page: https://ku-cvlab.github.io/locotrack Code: https://github.com/KU-CVLAB/locotrack
☆ Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models ECCV 2024
With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider temporal elements in motion-language alignment.
comment: To appear at ECCV 2024. Project page: https://kfworks.com/CAR-WP/
☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
comment: Ongoing work (v1); 34 pages, 5 figures
☆ Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation ECCV 2024
The scene graph generation (SGG) task involves detecting objects within an image and predicting predicates that represent the relationships between the objects. However, in SGG benchmark datasets, each subject-object pair is annotated with a single predicate even though a single predicate may exhibit diverse semantics (i.e., semantic diversity), existing SGG models are trained to predict the one and only predicate for each pair. This in turn results in the SGG models to overlook the semantic diversity that may exist in a predicate, thus leading to biased predictions. In this paper, we propose a novel model-agnostic Semantic Diversity-aware Prototype-based Learning (DPL) framework that enables unbiased predictions based on the understanding of the semantic diversity of predicates. Specifically, DPL learns the regions in the semantic space covered by each predicate to distinguish among the various different semantics that a single predicate can represent. Extensive experiments demonstrate that our proposed model-agnostic DPL framework brings significant performance improvement on existing SGG models, and also effectively understands the semantic diversity of predicates.
comment: 14 pages, ECCV 2024
☆ Towards Robust Vision Transformer via Masked Adaptive Ensemble
Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a classifier with two encoders is employed for extracting visual representations respectively from clean images and adversarial examples, with our adaptive ensemble to adaptively adjust the proportion of visual representations from the two encoders for accurate classification. This design enables our ViT architecture to achieve a better trade-off between standard accuracy and robustness. Besides, our adaptive ensemble technique allows us to mask off a random subset of image patches within input data, boosting our ViT's robustness against adaptive attacks, while maintaining high standard accuracy. Experimental results exhibit that our ViT architecture, on CIFAR-10, achieves the best standard accuracy and adversarial robustness of 90.3% and 49.8%, respectively.
comment: 9 pages
☆ Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data ECCV 2024
This paper aims to adapt the source model to the target environment, leveraging small user feedback (i.e., labeled target data) readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often suffer from poorly improved adaptation performance when directly utilizing such feedback data, as shown in Figure 1. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage this feedback while avoiding the issue, we propose a scalable adapting approach, Retrieval Latent Defending. This approach helps existing SemiSDA methods to adapt the model with a balanced supervised signal by utilizing latent defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments reveal that integrating our approach with multiple state-of-the-art SemiSDA methods leads to significant performance improvements.
comment: Accepted to ECCV 2024, Project page: https://sites.google.com/view/junha/nbf-rld
☆ Iterative approach to reconstructing neural disparity fields from light-field data
This study proposes a neural disparity field (NDF) that establishes an implicit, continuous representation of scene disparity based on a neural field and an iterative approach to address the inverse problem of NDF reconstruction from light-field data. NDF enables seamless and precise characterization of disparity variations in three-dimensional scenes and can discretize disparity at any arbitrary resolution, overcoming the limitations of traditional disparity maps that are prone to sampling errors and interpolation inaccuracies. The proposed NDF network architecture utilizes hash encoding combined with multilayer perceptrons to capture detailed disparities in texture levels, thereby enhancing its ability to represent the geometric information of complex scenes. By leveraging the spatial-angular consistency inherent in light-field data, a differentiable forward model to generate a central view image from the light-field data is developed. Based on the forward model, an optimization scheme for the inverse problem of NDF reconstruction using differentiable propagation operators is established. Furthermore, an iterative solution method is adopted to reconstruct the NDF in the optimization scheme, which does not require training datasets and applies to light-field data captured by various acquisition methods. Experimental results demonstrate that high-quality NDF can be reconstructed from light-field data using the proposed method. High-resolution disparity can be effectively recovered by NDF, demonstrating its capability for the implicit, continuous representation of scene disparities.
comment: 12 pages, 7 figures
☆ Sparse Prior Is Not All You Need: When Differential Directionality Meets Saliency Coherence for Infrared Small Target Detection
Infrared small target detection is crucial for the efficacy of infrared search and tracking systems. Current tensor decomposition methods emphasize representing small targets with sparsity but struggle to separate targets from complex backgrounds due to insufficient use of intrinsic directional information and reduced target visibility during decomposition. To address these challenges, this study introduces a Sparse Differential Directionality prior (SDD) framework. SDD leverages the distinct directional characteristics of targets to differentiate them from the background, applying mixed sparse constraints on the differential directional images and continuity difference matrix of the temporal component, both derived from Tucker decomposition. We further enhance target detectability with a saliency coherence strategy that intensifies target contrast against the background during hierarchical decomposition. A Proximal Alternating Minimization-based (PAM) algorithm efficiently solves our proposed model. Experimental results on several real-world datasets validate our method's effectiveness, outperforming ten state-of-the-art methods in target detection and clutter suppression. Our code is available at https://github.com/GrokCV/SDD.
comment: Submitted to IEEE TIM, Minor Revision
☆ A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Remarkable strides in computational pathology have been made in the task-agnostic foundation model that advances the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or vision-captions data, disregarding invaluable pathology reports and gene expression profiles which respectively offer distinct knowledge for versatile clinical applications. Second, the current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H\&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm which injects multimodal knowledge at the whole-slide context into the pathology FM, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the workflow of pretraining for CPath, which enables the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch-level to slide-level. To systematically evaluate the capabilities of mSTAR, extensive experiments including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. The average performance in various slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared to SOTA FMs.
comment: 44 pages, 9 figures
☆ X-Recon: Learning-based Patient-specific High-Resolution CT Reconstruction from Orthogonal X-Ray Images
Rapid and accurate diagnosis of pneumothorax, utilizing chest X-ray and computed tomography (CT), is crucial for assisted diagnosis. Chest X-ray is commonly used for initial localization of pneumothorax, while CT ensures accurate quantification. However, CT scans involve high radiation doses and can be costly. To achieve precise quantitative diagnosis while minimizing radiation exposure, we proposed X-Recon, a CT ultra-sparse reconstruction network based on ortho-lateral chest X-ray images. X-Recon integrates generative adversarial networks (GANs), including a generator with a multi-scale fusion rendering module and a discriminator enhanced by 3D coordinate convolutional layers, designed to facilitate CT reconstruction. To improve precision, a projective spatial transformer is utilized to incorporate multi-angle projection loss. Additionally, we proposed PTX-Seg, a zero-shot pneumothorax segmentation algorithm, combining image processing techniques with deep-learning models for the segmentation of air-accumulated regions and lung structures. Experiments on a large-scale dataset demonstrate its superiority over existing approaches. X-Recon achieved a significantly higher reconstruction resolution with a higher average spatial resolution and a lower average slice thickness. The reconstruction metrics achieved state-of-the-art performance in terms of several metrics including peak signal-to-noise ratio. The zero-shot segmentation algorithm, PTX-Seg, also demonstrated high segmentation precision for the air-accumulated region, the left lung, and the right lung. Moreover, the consistency analysis for the pneumothorax chest occupancy ratio between reconstructed CT and original CT obtained a high correlation coefficient. Code will be available at: https://github.com/wangyunpengbio/X-Recon
☆ Attention Beats Linear for Fast Implicit Neural Representation Generation ECCV 2024
Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which exhibit lower efficiency in inference, the adoption of hyper-network for generating parameters in Multi-Layer Perceptrons (MLP), responsible for executing INR functions, has surfaced as a promising and efficient alternative. However, as a global continuous function, MLP is challenging in modeling highly discontinuous signals, resulting in slow convergence during the training phase and inaccurate reconstruction performance. Moreover, MLP requires massive representation parameters, which implies inefficiencies in data representation. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and a global MLP that integrates coordinate features with data features and converts them to meaningful outputs. Subsequently, we design an instance representation framework that delivers a transformer-like hyper-network to represent data instances as a compact representation vector. With instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as a continuous function. We further address aliasing artifacts with variational coordinates when obtaining the super-resolution inference results. Extensive experimentation across four datasets showcases the notable efficacy of our ANR method, e.g. enhancing the PSNR value from 37.95dB to 47.25dB on the CelebA dataset. Code is released at https://github.com/Roninton/ANR.
comment: Accept by ECCV 2024
☆ Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection ECCV 2024
The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement.
comment: Accepted to ECCV 2024. Project page: https://github.com/zlichen/VectorFormer
☆ WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding ECCV24
In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also pro-vide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development.
comment: ECCV24. Website: https://woven-visionai.github.io/wts-dataset-homepage/
☆ RoadPainter: Points Are Ideal Navigators for Topology transformER ECCV 2024
Topology reasoning aims to provide a precise understanding of road scenes, enabling autonomous systems to identify safe and efficient routes. In this paper, we present RoadPainter, an innovative approach for detecting and reasoning the topology of lane centerlines using multi-view images. The core concept behind RoadPainter is to extract a set of points from each centerline mask to improve the accuracy of centerline prediction. We start by implementing a transformer decoder that integrates a hybrid attention mechanism and a real-virtual separation strategy to predict coarse lane centerlines and establish topological associations. Then, we generate centerline instance masks guided by the centerline points from the transformer decoder. Moreover, we derive an additional set of points from each mask and combine them with previously detected centerline points for further refinement. Additionally, we introduce an optional module that incorporates a Standard Definition (SD) map to further optimize centerline detection and enhance topological reasoning performance. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of RoadPainter.
comment: 17 pages, 5 figures, Accepted by ECCV 2024
♻ ☆ NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo
In this work we present a novel multi-view photometric stereo (MVPS) method. Like many works in 3D reconstruction we are leveraging neural shape representations and learnt renderers. However, our work differs from the state-of-the-art multi-view PS methods such as PS-NeRF or Supernormal in that we explicitly leverage per-pixel intensity renderings rather than relying mainly on estimated normals. We model point light attenuation and explicitly raytrace cast shadows in order to best approximate the incoming radiance for each point. The estimated incoming radiance is used as input to a fully neural material renderer that uses minimal prior assumptions and it is jointly optimised with the surface. Estimated normals and segmentation maps are also incorporated in order to maximise the surface accuracy. Our method is among the first (along with Supernormal) to outperform the classical MVPS approach proposed by the DiLiGenT-MV benchmark and achieves average 0.2mm Chamfer distance for objects imaged at approx 1.5m distance away with approximate 400x400 resolution. Moreover, our method shows high robustness to the sparse MVPS setup (6 views, 6 lights) greatly outperforming the SOTA competitor (0.38mm vs 0.61mm), illustrating the importance of neural rendering in multi-view photometric stereo.
♻ ☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm. Version Update: Figure 6
♻ ☆ On the Matrix Form of the Quaternion Fourier Transform and Quaternion Convolution
We study matrix forms of quaternionic versions of the Fourier Transform and Convolution operations. Quaternions offer a powerful representation unit, however they are related to difficulties in their use that stem foremost from non-commutativity of quaternion multiplication, and due to that $\mu^2 = -1$ possesses infinite solutions in the quaternion domain. Handling of quaternionic matrices is consequently complicated in several aspects (definition of eigenstructure, determinant, etc.). Our research findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extend on which well-known complex-domain theorems extend to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution), and the eigenstructure of the latter. A proof-of-concept application that makes direct use of our theoretical results is presented, where we present a method to bound the Lipschitz constant of a Quaternionic Convolutional Neural Network. Code is publicly available at: \url{https://github.com/sfikas/quaternion-fourier-convolution-matrix}.
comment: 25 pages, 4 figures
♻ ☆ Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics, relying on manually annotated videos to predict fixed object categories. This limits their generalizability to unseen surgical procedures and tasks. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals for multi-modal representation learning, bypassing manual annotations. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions. We introduce SurgVLP - Surgical Vision Language Pre-training - a novel method for multi-modal representation learning. SurgVLP employs a new contrastive learning objective, aligning video clip embeddings with corresponding multiple text embeddings in a joint latent space. We demonstrate the representational capability of this space through several vision-and-language surgical tasks and vision-only tasks specific to surgery. Unlike current fully supervised approaches, SurgVLP adapts to different surgical procedures and tasks without specific fine-tuning, achieving zero-shot adaptation to tasks such as surgical tool, phase, and triplet recognition without manual annotation. These results highlight the transferability and versatility of the learned multi-modal representations in surgical video analysis. The code is available at https://github.com/CAMMA-public/SurgVLP
♻ ☆ Reducing Texture Bias of Deep Neural Networks via Edge Enhancing Diffusion
Convolutional neural networks (CNNs) for image processing tend to focus on localized texture patterns, commonly referred to as texture bias. While most of the previous works in the literature focus on the task of image classification, we go beyond this and study the texture bias of CNNs in semantic segmentation. In this work, we propose to train CNNs on pre-processed images with less texture to reduce the texture bias. Therein, the challenge is to suppress image texture while preserving shape information. To this end, we utilize edge enhancing diffusion (EED), an anisotropic image diffusion method initially introduced for image compression, to create texture reduced duplicates of existing datasets. Extensive numerical studies are performed with both CNNs and vision transformer models trained on original data and EED-processed data from the Cityscapes dataset and the CARLA driving simulator. We observe strong texture-dependence of CNNs and moderate texture-dependence of transformers. Training CNNs on EED-processed images enables the models to become completely ignorant with respect to texture, demonstrating resilience with respect to texture re-introduction to any degree. Additionally we analyze the performance reduction in depth on a level of connected components in the semantic segmentation and study the influence of EED pre-processing on domain generalization as well as adversarial robustness.
♻ ☆ FAIntbench: A Holistic and Precise Benchmark for Bias Evaluation in Text-to-Image Models
The rapid development and reduced barriers to entry for Text-to-Image (T2I) models have raised concerns about the biases in their outputs, but existing research lacks a holistic definition and evaluation framework of biases, limiting the enhancement of debiasing techniques. To address this issue, we introduce FAIntbench, a holistic and precise benchmark for biases in T2I models. In contrast to existing benchmarks that evaluate bias in limited aspects, FAIntbench evaluate biases from four dimensions: manifestation of bias, visibility of bias, acquired attributes, and protected attributes. We applied FAIntbench to evaluate seven recent large-scale T2I models and conducted human evaluation, whose results demonstrated the effectiveness of FAIntbench in identifying various biases. Our study also revealed new research questions about biases, including the side-effect of distillation. The findings presented here are preliminary, highlighting the potential of FAIntbench to advance future research aimed at mitigating the biases in T2I models. Our benchmark is publicly available to ensure the reproducibility.
♻ ☆ Condition-Invariant Semantic Segmentation
Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide marginal gains in condition-level adaptation, being outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization in performing feature-level adaptation by aligning the internal network features extracted by the encoder of the network from the original and the stylized view of each input image with a novel feature invariance loss. In this way, we encourage the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features and not on further abstracting from the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve outstanding results on condition-level adaptation. In particular, CISS sets the new state of the art in the popular daytime-to-nighttime Cityscapes$\to$Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes$\to$ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night. Code is publicly available at https://github.com/SysCV/CISS .
comment: Submitted for review to IEEE T-PAMI
♻ ☆ SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
Face swapping has gained significant attention for its varied applications. Most previous face swapping approaches have relied on the seesaw game training scheme, also known as the target-oriented approach. However, this often leads to instability in model training and results in undesired samples with blended identities due to the target identity leakage problem. Source-oriented methods achieve more stable training with self-reconstruction objective but often fail to accurately reflect target image's skin color and illumination. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach that combines the strengths of both target-oriented and source-oriented approaches. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing clear ground truth through its self-reconstruction training regime. Our model effectively mitigates identity leakage and reflects target albedo and illumination through learned disentangled identity and non-identity features. Additionally, we closely tackle the shape misalignment and volume discrepancy problems with new techniques, including perforation confusion and random mesh scaling. SAMAE establishes a new state-of-the-art, surpassing other baseline methods, preserving both identity and non-identity attributes without sacrificing on either aspect.
♻ ☆ SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation ECCV-24
In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target. The resulting models can be generalized seamlessly to novel segmentation tasks, significantly reducing the labeling and training costs compared with conventional pipelines. However, in-context segmentation is more challenging than classic ones requiring the model to learn segmentation rules conditioned on a few samples. Unlike previous work with ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM). In particular, SEGIC leverages the emergent correspondence within VFM to capture dense relationships between target images and in-context samples. As such, information from in-context samples is then extracted into three types of instructions, i.e. geometric, visual, and meta instructions, serving as explicit conditions for the final mask prediction. SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks. Notably, SEGIC can be easily generalized to diverse tasks, including video object segmentation and open-vocabulary segmentation. Code will be available at https://github.com/MengLcool/SEGIC.
comment: ECCV-24 camera-ready
♻ ☆ PIPNet3D: Interpretable Detection of Alzheimer in MRI Scans MICCAI 2024
Information from neuroimaging examinations is increasingly used to support diagnoses of dementia, e.g., Alzheimer's disease. While current clinical practice is mainly based on visual inspection and feature engineering, Deep Learning approaches can be used to automate the analysis and to discover new image-biomarkers. Part-prototype neural networks (PP-NN) are an alternative to standard blackbox models, and have shown promising results in general computer vision. PP-NN's base their reasoning on prototypical image regions that are learned fully unsupervised, and combined with a simple-to-understand decision layer. We present PIPNet3D, a PP-NN for volumetric images. We apply PIPNet3D to the clinical diagnosis of Alzheimer's Disease from structural Magnetic Resonance Imaging (sMRI). We assess the quality of prototypes under a systematic evaluation framework, propose new functionally grounded metrics to evaluate brain prototypes and develop an evaluation scheme to assess their coherency with domain experts. Our results show that PIPNet3D is an interpretable, compact model for Alzheimer's diagnosis with its reasoning well aligned to medical domain knowledge. Notably, PIPNet3D achieves the same accuracy as its blackbox counterpart; and removing the remaining clinically irrelevant prototypes from its decision process does not decrease predictive performance.
comment: Accepted at iMIMIC workshop @MICCAI 2024
♻ ☆ Patch-based Intuitive Multimodal Prototypes Network (PIMPNet) for Alzheimer's Disease classification
Volumetric neuroimaging examinations like structural Magnetic Resonance Imaging (sMRI) are routinely applied to support the clinical diagnosis of dementia like Alzheimer's Disease (AD). Neuroradiologists examine 3D sMRI to detect and monitor abnormalities in brain morphology due to AD, like global and/or local brain atrophy and shape alteration of characteristic structures. There is a strong research interest in developing diagnostic systems based on Deep Learning (DL) models to analyse sMRI for AD. However, anatomical information extracted from an sMRI examination needs to be interpreted together with patient's age to distinguish AD patterns from the regular alteration due to a normal ageing process. In this context, part-prototype neural networks integrate the computational advantages of DL in an interpretable-by-design architecture and showed promising results in medical imaging applications. We present PIMPNet, the first interpretable multimodal model for 3D images and demographics applied to the binary classification of AD from 3D sMRI and patient's age. Despite age prototypes do not improve predictive performance compared to the single modality model, this lays the foundation for future work in the direction of the model's design and multimodal prototype training process
comment: Accepted "late-breaking work" at XAI-2024
♻ ☆ Context-Aware Indoor Point Cloud Object Generation through User Instructions
Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our model revolutionizes scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.
♻ ☆ Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning
Image retrieval methods based on CNN descriptors rely on metric learning from a large number of diverse examples of positive and negative image pairs. Domains, such as night-time images, with limited availability and variability of training data suffer from poor retrieval performance even with methods performing well on standard benchmarks. We propose to train a GAN-based synthetic-image generator, translating available day-time image examples into night images. Such a generator is used in metric learning as a form of augmentation, supplying training data to the scarce domain. Various types of generators are evaluated and analyzed. We contribute with a novel light-weight GAN architecture that enforces the consistency between the original and translated image through edge consistency. The proposed architecture also allows a simultaneous training of an edge detector that operates on both night and day images. To further increase the variability in the training examples and to maximize the generalization of the trained model, we propose a novel method of diverse anchor mining. The proposed method improves over the state-of-the-art results on a standard Tokyo 24/7 day-night retrieval benchmark while preserving the performance on Oxford and Paris datasets. This is achieved without the need of training image pairs of matching day and night images. The source code is available at https://github.com/mohwald/gandtr .
comment: paper with supplementary, 26 pages, 8 figures, 13 tables
♻ ☆ DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing ECCV 2024
We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling. A major disadvantage of this approach is the slow convergence caused by aggregating inconsistent information across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.
comment: ECCV 2024. Project Page: https://silent-chen.github.io/DGE/
♻ ☆ Score Normalization for Demographic Fairness in Face Recognition
Fair biometric algorithms have similar verification performance across different demographic groups given a single decision threshold. Unfortunately, for state-of-the-art face recognition networks, score distributions differ between demographics. Contrary to work that tries to align those distributions by extra training or fine-tuning, we solely focus on score post-processing methods. As proved, well-known sample-centered score normalization techniques, Z-norm and T-norm, do not improve fairness for high-security operating points. Thus, we extend the standard Z/T-norm to integrate demographic information in normalization. Additionally, we investigate several possibilities to incorporate cohort similarities for both genuine and impostor pairs per demographic to improve fairness across different operating points. We run experiments on two datasets with different demographics (gender and ethnicity) and show that our techniques generally improve the overall fairness of five state-of-the-art pre-trained face recognition networks, without downgrading verification performance. We also indicate that an equal contribution of False Match Rate (FMR) and False Non-Match Rate (FNMR) in fairness evaluation is required for the highest gains. Code and protocols are available.
comment: Accepted for presentation at IJCB 2024
♻ ☆ 360VFI: A Dataset and Benchmark for Omnidirectional Video Frame Interpolation
With the development of VR-related techniques, viewers can enjoy a realistic and immersive experience through a head-mounted display, while omnidirectional video with a low frame rate can lead to user dizziness. However, the prevailing plane frame interpolation methodologies are unsuitable for Omnidirectional Video Interpolation, chiefly due to the lack of models tailored to such videos with strong distortion, compounded by the scarcity of valuable datasets for Omnidirectional Video Frame Interpolation. In this paper, we introduce the benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation. We present a practical implementation that introduces a distortion prior from omnidirectional video into the network to modulate distortions. We especially propose a pyramid distortion-sensitive feature extractor that uses the unique characteristics of equirectangular projection (ERP) format as prior information. Moreover, we devise a decoder that uses an affine transformation to facilitate the synthesis of intermediate frames further. 360VFI is the first dataset and benchmark that explores the challenge of Omnidirectional Video Frame Interpolation. Through our benchmark analysis, we presented four different distortion conditions scenes in the proposed 360VFI dataset to evaluate the challenge triggered by distortion during interpolation. Besides, experimental results demonstrate that Omnidirectional Video Interpolation can be effectively improved by modeling for omnidirectional distortion.
♻ ☆ LDConv: Linear deformable convolution for improving convolutional neural networks
Neural networks based on convolutional operations have achieved remarkable results in the field of deep learning, but there are two inherent flaws in standard convolutional operations. On the one hand, the convolution operation is confined to a local window, so it cannot capture information from other locations, and its sampled shapes is fixed. On the other hand, the size of the convolutional kernel are fixed to k $\times$ k, which is a fixed square shape, and the number of parameters tends to grow squarely with size. Although Deformable Convolution (Deformable Conv) address the problem of fixed sampling of standard convolutions, the number of parameters also tends to grow in a squared manner. In response to the above questions, the Linear Deformable Convolution (LDConv) is explored in this work, which gives the convolution kernel an arbitrary number of parameters and arbitrary sampled shapes to provide richer options for the trade-off between network overhead and performance. In LDConv, a novel coordinate generation algorithm is defined to generate different initial sampled positions for convolutional kernels of arbitrary size. To adapt to changing targets, offsets are introduced to adjust the shape of the samples at each position. LDConv corrects the growth trend of the number of parameters for standard convolution and Deformable Conv to a linear growth. Moreover, it completes the process of efficient feature extraction by irregular convolutional operations and brings more exploration options for convolutional sampled shapes. Object detection experiments on representative datasets COCO2017, VOC 7+12, and VisDrone-DET2021 fully demonstrate the advantages of LDConv. LDConv is a plug-and-play convolutional operation that can replace the convolutional operation to improve network performance. The code for the relevant tasks can be found at https://github.com/CV-ZhangXin/LDConv.
comment: 14 pages, 9 figures
♻ ☆ TokenPacker: Efficient Visual Projector for Multimodal LLM
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly. Some recent works have introduced resampler or abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. In specific, we first interpolate the visual features as a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89%, while achieves comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source codes can be found at https://github.com/CircleRadon/TokenPacker.
comment: 16 pages, Codes:https://github.com/CircleRadon/TokenPacker
♻ ☆ DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction
This paper explores the problem of reconstructing temporally consistent surfaces from a 3D point cloud sequence without correspondence. To address this challenging task, we propose DynoSurf, an unsupervised learning framework integrating a template surface representation with a learnable deformation field. Specifically, we design a coarse-to-fine strategy for learning the template surface based on the deformable tetrahedron representation. Furthermore, we propose a learnable deformation representation based on the learnable control points and blending weights, which can deform the template surface non-rigidly while maintaining the consistency of the local shape. Experimental results demonstrate the significant superiority of DynoSurf over current state-of-the-art approaches, showcasing its potential as a powerful tool for dynamic mesh reconstruction. The code is publicly available at https://github.com/yaoyx689/DynoSurf.
♻ ☆ Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data ICML 2024
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data. Both Ambient Diffusion and alternative SURE-based approaches for learning diffusion models from corrupted data resort to approximations which deteriorate performance. We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data, solving an open problem in this space. Our key technical contribution is a method that uses a double application of Tweedie's formula and a consistency loss function that allows us to extend sampling at noise levels below the observed data noise. We also provide further evidence that diffusion models memorize from their training sets by identifying extremely corrupted images that are almost perfectly reconstructed, raising copyright and privacy concerns. Our method for training using corrupted samples can be used to mitigate this problem. We demonstrate this by fine-tuning Stable Diffusion XL to generate samples from a distribution using only noisy samples. Our framework reduces the amount of memorization of the fine-tuning dataset, while maintaining competitive performance.
comment: Accepted to ICML 2024
♻ ☆ SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments ECCV 2024
In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image and requires minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which only comprises approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to the state of the art methods and also estimates more degrees of freedom of the camera pose. Our source code is publicly available at https://fraunhoferhhi.github.io/spvloc .
comment: ECCV 2024. 24 pages, 11 figures, 8 tables. Includes paper and supplementary material
♻ ☆ Vision-Based Power Line Cables and Pylons Detection for Low Flying Aircrafts
Power lines are dangerous for low-flying aircrafts, especially in low-visibility conditions. Thus, a vision-based system able to analyze the aircraft's surroundings and to provide the pilots with a "second pair of eyes" can contribute to enhancing their safety. To this end, we have developed a deep learning approach to jointly detect power line cables and pylons from images captured at distances of several hundred meters by aircraft-mounted cameras. In doing so, we have combined a modern convolutional architecture with transfer learning and a loss function adapted to curvilinear structure delineation. We use a single network for both detection tasks and demonstrated its performance on two benchmarking datasets. We have integrated it within an onboard system and run it in flight, and have demonstrated with our experiments that it outperforms the prior distant cable detection method on both datasets, while also successfully detecting pylons, given their annotations are available for the data.
♻ ☆ Exploring connections of spectral analysis and transfer learning in medical imaging
In this paper, we use spectral analysis to investigate transfer learning and study model sensitivity to frequency shortcuts in medical imaging. By analyzing the power spectrum density of both pre-trained and fine-tuned model gradients, as well as artificially generated frequency shortcuts, we observe notable differences in learning priorities between models pre-trained on natural vs medical images, which generally persist during fine-tuning. We find that when a model's learning priority aligns with the power spectrum density of an artifact, it results in overfitting to that artifact. Based on these observations, we show that source data editing can alter the model's resistance to shortcut learning.
♻ ☆ A Survey on Video Prediction: From Deterministic to Generative Approaches
Video prediction, a fundamental task in computer vision, aims to enable models to generate sequences of future frames based on existing video content. This task has garnered widespread application across various domains. In this paper, we comprehensively survey both historical and contemporary works in this field, encompassing the most widely used datasets and algorithms. Our survey scrutinizes the challenges and evolving landscape of video prediction within the realm of computer vision. We propose a novel taxonomy centered on the stochastic nature of video prediction algorithms. This taxonomy accentuates the gradual transition from deterministic to generative prediction methodologies, underlining significant advancements and shifts in approach.
comment: under review
♻ ☆ Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning CVPR'24
Zero-shot learning (ZSL) recognizes the unseen classes by conducting visual-semantic interactions to transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT), which fail to learn matched visual-semantic correspondences for representing semantic-related visual features as lacking of the guidance of semantic information, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discover the semantic-related visual representations explicitly, and ii) discard the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. The extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Codes are available at: https://github.com/shiming-chen/ZSLViT .
comment: Accepted to CVPR'24
♻ ☆ A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks ECCV 2024
Model Inversion (MI) attacks aim to reconstruct privacy-sensitive training data from released models by utilizing output information, raising extensive concerns about the security of Deep Neural Networks (DNNs). Recent advances in generative adversarial networks (GANs) have contributed significantly to the improved performance of MI attacks due to their powerful ability to generate realistic images with high fidelity and appropriate semantics. However, previous MI attacks have solely disclosed private information in the latent space of GAN priors, limiting their semantic extraction and transferability across multiple target models and datasets. To address this challenge, we propose a novel method, Intermediate Features enhanced Generative Model Inversion (IF-GMI), which disassembles the GAN structure and exploits features between intermediate blocks. This allows us to extend the optimization space from latent code to intermediate features with enhanced expressive capabilities. To prevent GAN priors from generating unrealistic images, we apply a L1 ball constraint to the optimization process. Experiments on multiple benchmarks demonstrate that our method significantly outperforms previous approaches and achieves state-of-the-art results under various settings, especially in the out-of-distribution (OOD) scenario. Our code is available at: https://github.com/final-solution/IF-GMI
comment: ECCV 2024
♻ ☆ Efficient Multimodal Learning from Data-centric Perspective
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which inevitably cause significant performance drops. In this paper, we demonstrate the possibility of training a smaller but better MLLM with high-quality training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from selected training data. Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks. We expect that this work can provide the community with a clean and flexible open-source tool for further research and development. The code, models, and data can be found in https://github.com/BAAI-DCAI/Bunny.
♻ ☆ CLIP-Guided Networks for Transferable Targeted Attacks ECCV 2024
Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios. Recent studies have introduced \textit{single-target} generative attacks that train a generator for each target class to generate highly transferable perturbations, resulting in substantial computational overhead when handling multiple classes. \textit{Multi-target} attacks address this by training only one class-conditional generator for multiple classes. However, the generator simply uses class labels as conditions, failing to leverage the rich semantic information of the target class. To this end, we design a \textbf{C}LIP-guided \textbf{G}enerative \textbf{N}etwork with \textbf{C}ross-attention modules (CGNC) to enhance multi-target attacks by incorporating textual knowledge of CLIP into the generator. Extensive experiments demonstrate that CGNC yields significant improvements over previous multi-target generative attacks, e.g., a 21.46\% improvement in success rate from ResNet-152 to DenseNet-121. Moreover, we propose a masked fine-tuning mechanism to further strengthen our method in attacking a single class, which surpasses existing single-target methods.
comment: ECCV 2024
♻ ☆ PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers ECCV
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require to rethink the geometric priors that can be used for unsupervised part discovery.
comment: Accepted as a main conference paper at the European Conference of Computer Vision (ECCV) 2024
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in a MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences with a sensitivity over 90%.
comment: Accepted in CEDI 2024 (VII Congreso Espa\~nol de Inform\'atica), A Coru\~na, Spain
♻ ☆ Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification
Visible-infrared person re-identification (VI-ReID) aims to match people with the same identity between visible and infrared modalities. VI-ReID is a challenging task due to the large differences in individual appearance under different modalities. Existing methods generally try to bridge the cross-modal differences at image or feature level, which lacks exploring the discriminative embeddings. Effectively minimizing these cross-modal discrepancies relies on obtaining representations that are guided by identity and consistent across modalities, while also filtering out representations that are irrelevant to identity. To address these challenges, we introduce a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings, facilitating effective bridging the gap between different modalities. Specifically, in DIAN, to pursue a semantically richer representation, we first use orthogonal projection to fuse the features from two connected coarse and fine layers. Furthermore, we first use dynamic convolution kernels to mine identity-guided and modality-consistent representations. More notably, a cross embedding balancing loss is introduced to effectively bridge cross-modal discrepancies by above embeddings. Experimental results on SYSU-MM01 and RegDB datasets show that DIAN achieves state-of-the-art performance. Specifically, for indoor search on SYSU-MM01, our method achieves 86.28% rank-1 accuracy and 87.41% mAP, respectively. Our code will be available soon.
comment: I need to further debug my code to improve accuracy
♻ ☆ NOVUM: Neural Object Volumes for Robust Object Classification ECCV 2024
Discriminative models for object classification typically learn image-based representations that do not capture the compositional and 3D nature of objects. In this work, we show that explicitly integrating 3D compositional object representations into deep networks for image classification leads to a largely enhanced generalization in out-of-distribution scenarios. In particular, we introduce a novel architecture, referred to as \OURS, that consists of a feature extractor and a \textit{neural object volume} for every target object class. Each neural object volume is a composition of 3D Gaussians that emit feature vectors. This compositional object representation allows for a highly robust and fast estimation of the object class by independently matching the features of the 3D Gaussians of each category to features extracted from an input image. Additionally, the object pose can be estimated via inverse rendering of the corresponding neural object volume. To enable the classification of objects, the neural features at each 3D Gaussian are trained discriminatively to be distinct from (i) the features of 3D Gaussians in other categories, (ii) features of other 3D Gaussians of the same object, and (iii) the background features. Our experiments show that \OURS offers intriguing advantages over standard architectures due to the 3D compositional structure of the object representation, namely: (1) An exceptional robustness across a spectrum of real-world and synthetic out-of-distribution shifts and (2) an enhanced human interpretability compared to standard models, all while maintaining real-time inference and a competitive accuracy on in-distribution data.
comment: 14 pages, 4 figures, accepted at ECCV 2024, code is accessible at https://github.com/GenIntel/NOVUM
♻ ☆ Denoising Vision Transformers ECCV2024
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
comment: Accepted to ECCV2024. Project website: https://jiawei-yang.github.io/DenoisingViT/
♻ ☆ Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.
comment: ACM Multimedia Conference - Grand Challenge
♻ ☆ T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning ECCV 2024
The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand for annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches. Codes will be released at https://github.com/codename1995/T-MAE
comment: Accepted to ECCV 2024
♻ ☆ Multi-label Image Classification using Adaptive Graph Convolutional Networks: from a Single Domain to Multiple Domains
This paper proposes an adaptive graph-based approach for multi-label image classification. Graph-based methods have been largely exploited in the field of multi-label classification, given their ability to model label correlations. Specifically, their effectiveness has been proven not only when considering a single domain but also when taking into account multiple domains. However, the topology of the used graph is not optimal as it is pre-defined heuristically. In addition, consecutive Graph Convolutional Network (GCN) aggregations tend to destroy the feature similarity. To overcome these issues, an architecture for learning the graph connectivity in an end-to-end fashion is introduced. This is done by integrating an attention-based mechanism and a similarity-preserving strategy. The proposed framework is then extended to multiple domains using an adversarial training scheme. Numerous experiments are reported on well-known single-domain and multi-domain benchmarks. The results demonstrate that our approach achieves competitive results in terms of mean Average Precision (mAP) and model size as compared to the state-of-the-art. The code will be made publicly available.
♻ ☆ Understanding Physical Dynamics with Counterfactual World Modeling ECCV 2024
The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor without finetuning on annotated datasets. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve the state-of-the-art performance on the Physion benchmark.
comment: ECCV 2024. Project page at: https://neuroailab.github.io/cwm-physics/
♻ ☆ MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks
Medical report generation from X-ray images is a challenging task, particularly in an unpaired setting where paired image-report data is unavailable for training. To address this challenge, we propose a novel model that leverages the available information in two distinct datasets, one comprising reports and the other consisting of images. The core idea of our model revolves around the notion that combining auto-encoding report generation with multi-modal (report-image) alignment can offer a solution. However, the challenge persists regarding how to achieve this alignment when pair correspondence is absent. Our proposed solution involves the use of auxiliary tasks, particularly contrastive learning and classification, to position related images and reports in close proximity to each other. This approach differs from previous methods that rely on pre-processing steps, such as using external information stored in a knowledge graph. Our model, named MedRAT, surpasses previous state-of-the-art methods, demonstrating the feasibility of generating comprehensive medical reports without the need for paired data or external tools.
♻ ☆ Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
Advancements in generative models have sparked significant interest in generating images while adhering to specific structural guidelines. Scene graph to image generation is one such task of generating images which are consistent with the given scene graph. However, the complexity of visual scenes poses a challenge in accurately aligning objects based on specified relations within the scene graph. Existing methods approach this task by first predicting a scene layout and generating images from these layouts using adversarial training. In this work, we introduce a novel approach to generate images from scene graphs which eliminates the need of predicting intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. Towards this, we first pre-train our graph encoder to align graph features with CLIP features of corresponding images using a GAN based training. Further, we fuse the graph features with CLIP embedding of object labels present in the given scene graph to create a graph consistent CLIP guided conditioning signal. In the conditioning input, object embeddings provide coarse structure of the image and graph features provide structural alignment based on relationships among objects. Finally, we fine tune a pre-trained diffusion model with the graph consistent conditioning signal with reconstruction and CLIP alignment loss. Elaborate experiments reveal that our method outperforms existing methods on standard benchmarks of COCO-stuff and Visual Genome dataset.
♻ ☆ A Survey on Continual Semantic Segmentation: Theory, Challenge, Method and Application
Continual learning, also known as incremental learning or life-long learning, stands at the forefront of deep learning and AI systems. It breaks through the obstacle of one-way training on close sets and enables continuous adaptive learning on open-set conditions. In the recent decade, continual learning has been explored and applied in multiple fields especially in computer vision covering classification, detection and segmentation tasks. Continual semantic segmentation (CSS), of which the dense prediction peculiarity makes it a challenging, intricate and burgeoning task. In this paper, we present a review of CSS, committing to building a comprehensive survey on problem formulations, primary challenges, universal datasets, neoteric theories and multifarious applications. Concretely, we begin by elucidating the problem definitions and primary challenges. Based on an in-depth investigation of relevant approaches, we sort out and categorize current CSS models into two main branches including data-replay and data-free sets. In each branch, the corresponding approaches are similarity-based clustered and thoroughly analyzed, following qualitative comparison and quantitative reproductions on relevant datasets. Besides, we also introduce four CSS specialities with diverse application scenarios and development tendencies. Furthermore, we develop a benchmark for CSS encompassing representative references, evaluation results and reproductions, which is available at~\url{https://github.com/YBIO/SurveyCSS}. We hope this survey can serve as a reference-worthy and stimulating contribution to the advancement of the life-long learning field, while also providing valuable perspectives for related fields.
comment: 20 pages, 11 figures
♻ ☆ Video Understanding with Large Language Models: A Survey
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
♻ ☆ Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL ECCV 2024
Embodied visual tracking is to follow a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models(VFM) and offline reinforcement learning(offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., Nvidia RTX 3090. We evaluate our agent on several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
comment: ECCV 2024
♻ ☆ Long-CLIP: Unlocking the Long-Text Capability of CLIP ECCV 2024
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities, including (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. With leveraging just one million extra long text-image pairs, Long-CLIP has shown the superiority to CLIP for about 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
comment: ECCV 2024. All codes and models are publicly available at https://github.com/beichenzbc/Long-CLIP
♻ ☆ Fully Test-Time rPPG Estimation via Synthetic Signal-Guided Feature Learning
Many remote photoplethysmography (rPPG) estimation models have achieved promising performance on the training domain but often fail to measure the physiological signals or heart rates (HR) on test domains. Domain generalization (DG) or domain adaptation (DA) techniques are therefore adopted in the offline training stage to adapt the model to the unobserved or observed test domain by referring to all the available source domain data. However, in rPPG estimation problems, the adapted model usually confronts challenges of estimating target data with various domain information, such as different video capturing settings, individuals of different age ranges, or of different HR distributions. In contrast, Test-Time Adaptation (TTA), by online adapting to unlabeled target data without referring to any source data, enables the model to adaptively estimate rPPG signals of various unseen domains. In this paper, we first propose a novel TTA-rPPG benchmark, which encompasses various domain information and HR distributions, to simulate the challenges encountered in rPPG estimation. Next, we propose a novel synthetic signal-guided rPPG estimation framework with a two-fold purpose. First, we design an effective spectral-based entropy minimization to enforce the rPPG model to learn new target domain information. Second, we develop a synthetic signal-guided feature learning, by synthesizing pseudo rPPG signals as pseudo ground-truths to guide a conditional generator to generate latent rPPG features. The synthesized rPPG signals and the generated rPPG features are used to guide the rPPG model to broadly cover various HR distributions. Our extensive experiments on the TTA-rPPG benchmark show that the proposed method achieves superior performance and outperforms previous DG and DA methods across most protocols of the proposed TTA-rPPG benchmark.
♻ ☆ MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis
The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures usually involves fine-tuning from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the practice of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities. Additionally, in real-world scenarios, the deployments of multiple CAD models can be troublesome due to problems such as limited storage space and time-consuming model switching. To address these challenges, we propose a new method MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By fixing the weight of ViT models and only adding small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities using only a few trainable parameters. Specifically, our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets using about 0.17% trainable parameters. Moreover, MeLo adds only about 0.5MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).
comment: IEEE International Symposium on Biomedical Imaging (ISBI 2024)
♻ ☆ RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection
Robustness against noisy imaging is crucial for practical image anomaly detection systems. This study introduces a Robust Anomaly Detection (RAD) dataset with free views, uneven illuminations, and blurry collections to systematically evaluate the robustness of current anomaly detection methods. Specifically, RAD aims to identify foreign objects on working platforms as anomalies. The collection process incorporates various sources of imaging noise, such as viewpoint changes, uneven illuminations, and blurry collections, to replicate real-world inspection scenarios. Subsequently, we assess and analyze 11 state-of-the-art unsupervised and zero-shot methods on RAD. Our findings indicate that: 1) Variations in viewpoint, illumination, and blurring affect anomaly detection methods to varying degrees; 2) Methods relying on memory banks and assisted by synthetic anomalies demonstrate stronger robustness; 3) Effectively leveraging the general knowledge of foundational models is a promising avenue for enhancing the robustness of anomaly detection methods. The dataset is available at https://github.com/hustCYQ/RAD-dataset.
comment: 6 pages, 5 figures
♻ ☆ DragVideo: Interactive Drag-style Video Editing
Video generation models have shown their superior ability to generate photo-realistic video. However, how to accurately control (or edit) the video remains a formidable challenge. The main issues are: 1) how to perform direct and accurate user control in editing; 2) how to execute editings like changing shape, expression, and layout without unsightly distortion and artifacts to the edited content; and 3) how to maintain spatio-temporal consistency of video after editing. To address the above issues, we propose DragVideo, a general drag-style video editing framework. Inspired by DragGAN, DragVideo addresses issues 1) and 2) by proposing the drag-style video latent optimization method which gives desired control by updating noisy video latent according to drag instructions through video-level drag objective function. We amend issue 3) by integrating the video diffusion model with sample-specific LoRA and Mutual Self-Attention in DragVideo to ensure the edited result is spatio-temporally consistent. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion, skeleton editing, etc, underscoring DragVideo can edit video in an intuitive, faithful to the user's intention manner, with nearly unnoticeable distortion and artifacts, while maintaining spatio-temporal consistency. While traditional prompt-based video editing fails to do the former two and directly applying image drag editing fails in the last, DragVideo's versatility and generality are emphasized. Github link: https://github.com/RickySkywalker/DragVideo-Official.
♻ ☆ MedIAnomaly: A comparative study of anomaly detection in medical images
Anomaly detection (AD) aims at detecting abnormal samples that deviate from the expected normal patterns. Generally, it can be trained merely on normal data, without a requirement for abnormal samples, and thereby plays an important role in the recognition of rare diseases and health screening in the medical domain. Despite the emergence of numerous methods for medical AD, we observe a lack of a fair and comprehensive evaluation, which causes ambiguous conclusions and hinders the development of this field. To address this problem, this paper builds a benchmark with unified comparison. Seven medical datasets with five image modalities, including chest X-rays, brain MRIs, retinal fundus images, dermatoscopic images, and histopathology whole slide images, are curated for extensive evaluation. Thirty typical AD methods, including reconstruction and self-supervised learning-based methods, are involved in comparison of image-level anomaly classification and pixel-level anomaly segmentation. Furthermore, for the first time, we formally explore the effect of key components in existing methods, clearly revealing unresolved challenges and potential future directions. The datasets and code are available at \url{https://github.com/caiyu6666/MedIAnomaly}.
comment: Under submission
♻ ☆ CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference
Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework to address these challenges and offer an automated framework for ViT deployment on the FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: multi-kernel design to maximize the bandwidth, mainly targeting benefits of multi DDR memory banks, approximate non-linear functions that exhibit minimal accuracy degradation, and efficient use of available logic blocks on the FPGA, and efficient compiler to maximize the performance and memory-efficiency of the computing kernels by presenting a novel algorithm for design space exploration to find optimal hardware configuration that achieves optimal throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.
♻ ☆ A Survey on 3D Gaussian Splatting
3D Gaussian splatting (GS) has recently emerged as a transformative technique in the realm of explicit radiance field and computer graphics. This innovative approach, characterized by the utilization of millions of learnable 3D Gaussians, represents a significant departure from mainstream neural radiance field approaches, which predominantly use implicit, coordinate-based models to map spatial coordinates to pixel values. 3D GS, with its explicit scene representation and differentiable rendering algorithm, not only promises real-time rendering capability but also introduces unprecedented levels of editability. This positions 3D GS as a potential game-changer for the next generation of 3D reconstruction and representation. In the present paper, we provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS. We begin with a detailed exploration of the underlying principles and the driving forces behind the emergence of 3D GS, laying the groundwork for understanding its significance. A focal point of our discussion is the practical applicability of 3D GS. By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond. This is complemented by a comparative analysis of leading 3D GS models, evaluated across various benchmark tasks to highlight their performance and practical utility. The survey concludes by identifying current challenges and suggesting potential avenues for future research in this domain. Through this survey, we aim to provide a valuable resource for both newcomers and seasoned researchers, fostering further exploration and advancement in applicable and explicit radiance field representation.
comment: Ongoing project
♻ ☆ Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video
Generating dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring off-the-shelf image generation models such as score distillation sampling.However, this approach would be slow and expensive to scale due to the need for back-propagating the information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content through a 4D Gaussian splatting model. Importantly, our method can achieve real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce inconsistency-aware confidence-weighted loss design, along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed when compared to prior art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, vs 120 minutes by the previous art model Consistent4D.
comment: Technical report
♻ ☆ PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving ECCV 2024
We present a new interaction mechanism of prediction and planning for end-to-end autonomous driving, called PPAD (Iterative Interaction of Prediction and Planning Autonomous Driving), which considers the timestep-wise interaction to better integrate prediction and planning. An ego vehicle performs motion planning at each timestep based on the trajectory prediction of surrounding agents (e.g., vehicles and pedestrians) and its local road conditions. Unlike existing end-to-end autonomous driving frameworks, PPAD models the interactions among ego, agents, and the dynamic environment in an autoregressive manner by interleaving the Prediction and Planning processes at every timestep, instead of a single sequential process of prediction followed by planning. Specifically, we design ego-to-agent, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions. The experiments on the nuScenes benchmark show that our approach outperforms state-of-the-art methods.
comment: Accepted to ECCV 2024. Project page: https://github.com/zlichen/PPAD
♻ ☆ Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
While prior research has proposed a plethora of methods that build neural classifiers robust against adversarial robustness, practitioners are still reluctant to adopt them due to their unacceptably severe clean accuracy penalties. This paper significantly alleviates this accuracy-robustness trade-off by mixing the output probabilities of a standard classifier and a robust classifier, where the standard network is optimized for clean accuracy and is not robust in general. We show that the robust base classifier's confidence difference for correct and incorrect examples is the key to this improvement. In addition to providing intuitions and empirical evidence, we theoretically certify the robustness of the mixed classifier under realistic assumptions. Furthermore, we adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models, further reducing the accuracy penalty of achieving robustness. The proposed flexible method, termed "adaptive smoothing", can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection. Our empirical evaluation considers strong attack methods, including AutoAttack and adaptive attack. On the CIFAR-100 dataset, our method achieves an 85.21% clean accuracy while maintaining a 38.72% $\ell_\infty$-AutoAttacked ($\epsilon = 8/255$) accuracy, becoming the second most robust method on the RobustBench CIFAR-100 benchmark as of submission, while improving the clean accuracy by ten percentage points compared with all listed models. The code that implements our method is available at https://github.com/Bai-YT/AdaptiveSmoothing.
♻ ☆ Diffusion Model is a Good Pose Estimator from 3D RF-Vision
Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without revealing privacy (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, the mmWave radar has a limited resolution with severe noise, leading to inaccurate and inconsistent human pose estimation. This work proposes mmDiff, a novel diffusion-based pose estimator tailored for noisy radar data. Our approach aims to provide reliable guidance as conditions to diffusion models. Two key challenges are addressed by mmDiff: (1) miss-detection of parts of human bodies, which is addressed by a module that isolates feature extraction from different body parts, and (2) signal inconsistency due to environmental interference, which is tackled by incorporating prior knowledge of body structure and motion. Several modules are designed to achieve these goals, whose features work as the conditions for the subsequent diffusion model, eliminating the miss-detection and instability of HPE based on RF-vision. Extensive experiments demonstrate that mmDiff outperforms existing methods significantly, achieving state-of-the-art performances on public datasets.
♻ ☆ OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion
Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
comment: Technical Report
♻ ☆ A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene, has drawn increasing attention in the 3D research community over the past few years. Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing. In this survey, we attempt to provide a comprehensive overview of the T-3DVG progress, including its fundamental elements, recent research advances, and future research directions. To the best of our knowledge, this is the first systematic survey on the T-3DVG task. Specifically, we first provide a general structure of the T-3DVG pipeline with detailed components in a tutorial style, presenting a complete background overview. Then, we summarize the existing T-3DVG approaches into different categories and analyze their strengths and weaknesses. We also present the benchmark datasets and evaluation metrics to assess their performances. Finally, we discuss the potential limitations of existing T-3DVG and share some insights on several promising research directions. The latest papers are continually collected at https://github.com/liudaizong/Awesome-3D-Visual-Grounding.
♻ ☆ Multi-Sentence Grounding for Long-term Instructional Video
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale instructional dataset and construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep. We make the following contributions: (i) improving the quality of sentences in dataset by upgrading ASR systems to reduce errors from speech recognition and prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to corresponding video segments. To measure the quality of our curated datasets, we train models for the task of multi-sentence grounding on it, i.e., given a long-form video, and associated multiple sentences, to determine their corresponding timestamps in the video simultaneously, as a result, the model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely, 9.0% on HT-Step, 5.1% on HTM-Align and 1.9% on CrossTask. All codes, models, and the resulting dataset have been publicly released.
Computers and Society
☆ Inequalities in Computational Thinking Among Incoming Students in an STEM Chilean University
While computational thinking arises as an essential skill worldwide, formal primary and secondary education in Latin America rarely incorporates mechanisms to develop it in their curricula. The extent to which students in the region acquire computational thinking skills remains largely unknown. To start addressing this void, this article presents findings from a cross-sectional study that characterizes the computational thinking abilities of incoming students at a Chilean university with a strong emphasis on STEM disciplines. Based on more than 500 responses, this study provides evidence of significant inequalities in computational thinking across gender, type of school (private or no), and prior programming knowledge. The discussion offers insights into how these disparities relate to contextual factors of the country, such as a highly socio-economically segregated educational system, public policies focused mainly on technology access, and heavy reliance on voluntary initiatives, to develop computational thinking. The findings can enlighten upcoming research endeavors and formulate strategies to create a more equitable field for students entering STEM degrees in nations facing similar circumstances.
comment: Pre-print. Final version published in: IEEE Transactions on Education
☆ Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems AAAI
Facial Recognition Systems (FRSs) are being developed and deployed globally at unprecedented rates. Most platforms are designed in a limited set of countries but deployed in worldwide, without adequate checkpoints. This is especially problematic for Global South countries which lack strong legislation to safeguard persons facing disparate performance of these systems. A combination of unavailability of datasets, lack of understanding of FRS functionality and low-resource bias mitigation measures accentuate the problem. In this work, we propose a new face dataset composed of 6,579 unique male and female sportspersons from eight countries around the world. More than 50% of the dataset comprises individuals from the Global South countries and is demographically diverse. To aid adversarial audits and robust model training, each image has four adversarial variants, totaling over 40,000 images. We also benchmark five popular FRSs, both commercial and open-source, for the task of gender prediction (and country prediction for one of the open-source models as an example of red-teaming). Experiments on industrial FRSs reveal accuracies ranging from 98.2%--38.1%, with a large disparity between males and females in the Global South (max difference of 38.5%). Biases are also observed in all FRSs between females of the Global North and South (max difference of ~50%). Grad-CAM analysis identifies the nose, forehead and mouth as the regions of interest on one of the open-source FRSs. Utilizing this insight, we design simple, low-resource bias mitigation solutions using few-shot and novel contrastive learning techniques significantly improving the accuracy with disparity between males and females reducing from 50% to 1.5% in one of the settings. In the red-teaming experiment with the open-source Deepface model, contrastive learning proves more effective than simple fine-tuning.
comment: This work has been accepted for publication at AAAI/ACM AIES 2024
☆ Examining Inequality in Park Quality for Promoting Health Across 35 Global Cities
Urban parks provide significant health benefits by offering spaces and facilities for various recreational and leisure activities. However, the capacity of specific park spaces and elements to foster health remains underexamined. Traditional studies have focused on parks' size, greenery, and accessibility, often overlooking their ability to facilitate specific health-promoting activities. To address this gap, we propose a taxonomy consisting of six categories of health-promoting activities in parks: physical, mind-body, nature appreciation, environmental, social, and cultural. We estimate the capacity of parks in 35 global cities to promote health by establishing a lexicon linking park spaces and elements with specific health-promoting activities from our taxonomy. Using this lexicon, we collected data on elements and spaces in all parks in 35 cities from OpenStreetMap. Our analysis covers 23,477 parks with a total of 827,038 elements and spaces. By first comparing similarly sized parks across cities, we found that North American parks offer more spaces for physical activities, while European parks focus more on nature appreciation. Second, by scoring parks based on both elements and spaces, we investigated the variability in their health-promoting potential. We found the most uniform provision across parks for physical activities and the highest disparities regarding social activities. Additionally, parks offering a variety of activities are usually located in city centers, while offerings diminish in parks towards the suburbs. Lastly, we identified significant inequalities in park standards across cities, regardless of their continental location: Tokyo and Paris offer the most uniform park standards, while Copenhagen and Rio de Janeiro exhibit the most pronounced disparities. Our study provides insights for making urban parks more equitable, engaging, and health-promoting.
comment: 29 pages main paper, 10 pages appendix
☆ Coca4ai: checking energy behaviors on AI data centers
Monitoring energy behaviors in AI data centers is crucial, both to reduce their energy consumption and to raise awareness among their users which are key actors in the AI field. This paper shows a proof of concept of easy and lightweight monitoring of energy behaviors at the scale of a whole data center, a user or a job submission. Our system uses software wattmeters and we validate our setup with per node accurate external wattmeters. Results show that there is an interesting potential from the efficiency point of view, providing arguments to create user engagement thanks to energy monitoring.
comment: accepted at ICT4S 2024
☆ Dressed to Gamble: How Poker Drives the Dynamics of Wearables and Visits on Decentraland's Social Virtual World
Decentraland is a blockchain-based social virtual world touted to be a creative space owned by its community, unlike previous virtual worlds. Its users can create and publish wearables, virtual garments to customize avatars, which can be then sold or given away via the blockchain. Decentral Games (DG), a single project owning two prominent in-world casinos, has by far created the most wearables, with these being necessary to earn cryptocurrency in their flagship game ICE Poker. We thus present a comprehensive study that investigates how DG and ICE Poker influence the overall dynamics of Decentraland wearables and in-world visits. To this end, we analyzed 5.9 million wearable transfers made on the Polygon blockchain (and related sales) over a two-year period, and 677 million log events of in-world user positions in an overlapping 10-month period. We found that the influence of DG and Ice Poker is not only significant, but also substantial for transfers and sales monetary value of wearables, and very large for daily unique visitors and time spent in the virtual world. Despite several alternative in-world economic and artistic initiatives in Decentraland, some of which have attracted much attention from the general public, online poker appears to be the main driver of the analyzed dynamics. Our work thus contributes to the current understanding of user behavior in social virtual worlds and it is among the first to study the emerging phenomenon of blockchain-based online gambling in virtual spaces.
☆ FAIR evaluation of ten widely used chemical datasets: Lessons learned and recommendations
This document focuses on databases disseminating data on (hazardous) substances found on the North American and the European (EU) market. The goal is to analyse the FAIRness (Findability, Accessibility, Interoperability and Reusability) of published open data on these substances and to qualitatively evaluate to what extend the selected databases already fulfil the criteria set out in the commission draft regulation on a common data chemicals platform. We implemented two complementary approaches: Manual, and Automatic. The manual approach is based on online questionnaires. These questionnaires provide a structured approach to evaluating FAIRness by guiding users through a series of questions related to the FAIR principles. They are particularly useful for initiating discussions on FAIR implementation within research teams and for identifying areas that require further attention. Automated tools for FAIRness assessment, such as F-UJI and FAIR Checker, are gaining prominence and are continuously under development. Unlike manual tools, automated tools perform a series of tests automatically starting from a dereferenceable URL to the data resource to be evaluated. We analysed ten widely adopted datasets managed in Europe and North America. The highest score from automatic analysis was 54/100. The manual analysis shows that several FAIR metrics were satisfied, but not detectable by automatic tools because there is no metadata, or the format of the information was not a standard one. Thus, it was not interpretable by the tool. We present the details of the analysis and tables summarizing the outcomes, the issues, and the suggestions to address these issues.
☆ Regression under demographic parity constraints via unlabeled post-processing
We address the problem of performing regression while ensuring demographic parity, even without access to sensitive attributes during inference. We present a general-purpose post-processing algorithm that, using accurate estimates of the regression function and a sensitive attribute predictor, generates predictions that meet the demographic parity constraint. Our method involves discretization and stochastic minimization of a smooth convex function. It is suitable for online post-processing and multi-class classification tasks only involving unlabeled data for the post-processing. Unlike prior methods, our approach is fully theory-driven. We require precise control over the gradient norm of the convex function, and thus, we rely on more advanced techniques than standard stochastic gradient descent. Our algorithm is backed by finite-sample analysis and post-processing bounds, with experimental results validating our theoretical findings.
☆ Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias
The common toxicity and societal bias in contents generated by large language models (LLMs) necessitate strategies to reduce harm. Present solutions often demand white-box access to the model or substantial training, which is impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting methods depend on external tool feedback and fail to simultaneously lessen toxicity and bias. Motivated by social psychology principles, we propose a novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that inspires LLMs to integrate diverse human perspectives and self-regulate their responses. This self-correction mechanism can significantly diminish toxicity (up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less harmful responses, outperforming five strong baselines.
♻ ☆ Turing's Test, a Beautiful Thought Experiment
In the wake of the latest trends of artificial intelligence (AI), there has been a resurgence of claims and questions about the Turing test and its value, which are reminiscent of decades of practical "Turing" tests. If AI were quantum physics, by now several "Schr\"odinger's" cats would have been killed. It is time for a historical reconstruction of Turing's beautiful thought experiment. This paper presents a wealth of evidence, including new archival sources, and gives original answers to several open questions about Turing's 1950 paper, including its relation with early AI.
comment: Author's accepted manuscript to appear in IEEE Annals of the History of Computing. "\copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media ..."
♻ ☆ Who Shares Fake News? Uncovering Insights from Social Media Users' Post Histories
We propose that social-media users' own post histories are an underused yet valuable resource for studying fake-news sharing. By extracting textual cues from their prior posts, and contrasting their prevalence against random social-media users and others (e.g., those with similar socio-demographics, political news-sharers, and fact-check sharers), researchers can identify cues that distinguish fake-news sharers, predict those most likely to share fake news, and identify promising constructs to build interventions. Our research includes studies along these lines. In Study 1, we explore the distinctive language patterns of fake-news sharers, highlighting elements such as their higher use of anger and power-related words. In Study 2, we show that adding textual cues into predictive models enhances their accuracy in predicting fake-news sharers. In Study 3, we explore the contrasting role of trait and situational anger, and show trait anger is associated with a greater propensity to share both true and fake news. In Study 4, we introduce a way to authenticate Twitter accounts in surveys, before using it to explore how crafting an ad copy that resonates with users' sense of power encourages the adoption of fact-checking tools. We hope to encourage the use of novel research methods for marketers and misinformation researchers.
comment: 108 pages
♻ ☆ Dynamics of Ideological Biases of Social Media Users
Humanity for centuries has perfected skills of interpersonal interactions and evolved patterns that enable people to detect lies and deceiving behavior of others in face-to-face settings. Unprecedented growth of people's access to mobile phones and social media raises an important question: How does this new technology influence people's interactions and support the use of traditional patterns? In this article, we answer this question for homophily-driven patterns in social media. In our previous studies, we found that, on a university campus, changes in student opinions were driven by the desire to hold popular opinions. Here, we demonstrate that the evolution of online platform-wide opinion groups is driven by the same desire. We focus on two social media: Twitter and Parler, on which we tracked the political biases of their users. On Parler, an initially stable group of Right-biased users evolved into a permanent Right-leaning echo chamber dominating weaker, transient groups of members with opposing political biases. In contrast, on Twitter, the initial presence of two large opposing bias groups led to the evolution of a bimodal bias distribution, with a high degree of polarization. We capture the movement of users from the initial to final bias groups during the tracking period. We also show that user choices are influenced by side-effects of homophily. Users entering the platform attempt to find a sufficiently large group whose members hold political biases within the range sufficiently close to their own. If successful, they stabilize their biases and become permanent members of the group. Otherwise, they leave the platform. We believe that the dynamics of users' behavior uncovered in this article create a foundation for technical solutions supporting social groups on social media and socially aware networks.
comment: 7 pages, 3 figures. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: https://doi.org/10.1109/MCOM.001.2300333/
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
comment: 20 pages; 13 visualizations (nine figures, four tables)
♻ ☆ The Efficacy of Conversational Artificial Intelligence in Rectifying the Theory of Mind and Autonomy Biases: Comparative Analysis
Background: The increasing deployment of Conversational Artificial Intelligence (CAI) in mental health interventions necessitates an evaluation of their efficacy in rectifying cognitive biases and recognizing affect in human-AI interactions. These biases, including theory of mind and autonomy biases, can exacerbate mental health conditions such as depression and anxiety. Objective: This study aimed to assess the effectiveness of therapeutic chatbots (Wysa, Youper) versus general-purpose language models (GPT-3.5, GPT-4, Gemini Pro) in identifying and rectifying cognitive biases and recognizing affect in user interactions. Methods: The study employed virtual case scenarios simulating typical user-bot interactions. Cognitive biases assessed included theory of mind biases (anthropomorphism, overtrust, attribution) and autonomy biases (illusion of control, fundamental attribution error, just-world hypothesis). Responses were evaluated on accuracy, therapeutic quality, and adherence to Cognitive Behavioral Therapy (CBT) principles, using an ordinal scale. The evaluation involved double review by cognitive scientists and a clinical psychologist. Results: The study revealed that general-purpose chatbots outperformed therapeutic chatbots in rectifying cognitive biases, particularly in overtrust bias, fundamental attribution error, and just-world hypothesis. GPT-4 achieved the highest scores across all biases, while therapeutic bots like Wysa scored the lowest. Affect recognition showed similar trends, with general-purpose bots outperforming therapeutic bots in four out of six biases. However, the results highlight the need for further refinement of therapeutic chatbots to enhance their efficacy and ensure safe, effective use in digital mental health interventions. Future research should focus on improving affective response and addressing ethical considerations in AI-based therapy.
comment: 28 pages, 5 tables, 6 figures
♻ ☆ The Dark Side of NFTs: A Large-Scale Empirical Study of Wash Trading
NFTs (Non-Fungible Tokens) have seen significant growth since they first captured public attention in 2021. However, the NFT market is plagued by fake transactions and economic bubbles, e.g., NFT wash trading. Wash trading typically refers to a transaction involving the same person or two colluding individuals, and has become a major threat to the NFT ecosystem. Previous studies only detect NFT wash trading from the financial aspect, while the real-world wash trading cases are much more complicated (e.g., not aiming at inflating the market value). There is still a lack of multi-dimension analysis to better understand NFT wash trading. Therefore, we present the most comprehensive study of NFT wash trading, analyzing 8,717,031 transfer events and 3,830,141 sale events from 2,701,883 NFTs. We first optimize the dataset collected via the OpenSea API. Next, we identify three types of NFT wash trading and propose identification algorithms. Our experimental results reveal 824 transfer events and 5,330 sale events (accounting for a total of \$8,857,070.41) and 370 address pairs related to NFT wash trading behaviors, causing a minimum loss of \$3,965,247.13. Furthermore, we provide insights from six aspects, i.e., marketplace design, profitability, NFT project design, payment token, user behavior, and NFT ecosystem.
♻ ☆ A Decentralised Digital Token Architecture for Public Transport
Digitisation is often viewed as beneficial to a user. Whereas traditionally, people would physically have to identify to a service, pay for a ticket in cash, or go into a library to access a book, people can now achieve all of this through a click of a button. Such actions may seem functionally identical to their analogue counterparts, but in the digital case, a user's actions are automatically recorded. The recording of user's interactions presents a problem because once the information is collected, it is outside of the control of the person whom it concerns. This issue is only exacerbated by the centralisation of the authentication mechanisms underpinning the aforementioned services, permitting the aggregation and analysis of even more data. This work aims to motivate the need and establish the feasibility of the application of a privacy-enhancing digital token management service to public transit. A proof-of-concept implementation is developed, building upon a design proposed by Goodell and Aste. This implementation was optimised for the public transport use case. Its performance is tested in a local environment to better understand the technical challenges and assess the technical feasibility of the system in a production setting. It was observed that for loads between one and five requests per second the proof-of-concept performs comparably to other contactless payment systems, with a maximum median response time less than two seconds. Due to hardware bottlenecks, reliable throughput in our test environment was limited to five requests per second. The demonstrated throughput and latency indicate that the system can feasibly compete with solutions currently in use. Yet, further work is needed to demonstrate their performance characteristics in an environment similar to that experienced in production.
comment: 16 pages, 4 figures, 3 tables
♻ ☆ Hierarchical accompanying and inhibiting patterns on the spatial arrangement of taxis' local hotspots
The spatial arrangement of taxi hotspots indicates their inherent distribution relationships, reflecting spatial organization structure and has received attention in urban studies. Previous studies mainly explore large-scale hotspots by visual analysis or simple indexes, where hotspots usually cover the entire central business district, train stations, or dense residential areas, reaching a radius of hundreds or even thousands of meters. However, the spatial arrangement patterns of small-scale hotspots, reflecting the specific popular pick-up and drop-off locations, have not received much attention. This study quantitatively examines the spatial arrangement of fine-grained local hotspots in Wuhan and Beijing, China, using taxi trajectory data. Hotspots are adaptatively identified with sizes of 90m*90m in Wuhan and 105m*105m in Beijing according to identification method. Findings show popular hotspots are typically surrounded by less popular ones, though regions with many popular hotspots inhibit the presence of less popular ones. We term these configurations as hierarchical accompany and inhibiting patterns. Finally, inspired by both patterns, a KNN-based model is developed to describe these relationships, successfully reproducing the spatial distribution of less popular hotspots based on the most popular ones. These insights enhance understanding of local urban structures and support urban planning.
Computation and Language
☆ Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection ECCV 2024
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.
comment: ECCV 2024
☆ Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, the process of how injecting self-correction instructions can modify the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What are the internal mechanisms of LLMs, e.g., hidden states, that are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states. Through empirical investigation with tasks of language generation and multi-choice question answering, we conclude: (i) LLMs exhibit good performance across both tasks, and self-correction instructions are particularly beneficial when the correct answer is already top-ranked; (ii) The morality levels in intermediate hidden states are strong indicators as to whether one instruction would be more effective than another; (iii) Based on our analysis of intermediate hidden states and task case studies of self-correction behaviors, we are first to propose the hypothesis that intrinsic moral self-correction is in fact superficial.
☆ SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant commonsense persona knowledge, but retrieving this knowledge is challenging due to complex contexts and the implicit nature of commonsense. This paper presents our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge, addressing the critical need for integrating persona and commonsense knowledge in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that leverages Large Language Models to generate high-quality synthetic datasets for training commonsense persona knowledge linkers. To demonstrate the efficacy of our approach, we present SynCPKL, a new dataset specifically designed for this task. Our experiments validate the effectiveness of SynCPKL for training commonsense persona knowledge linkers. Additionally, our top-performing model, Derberta-SynCPKL, secured first place in the CPKL challenge by a 16% improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at https://github.com/irislin1006/CPKL.
☆ Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
Multimodal foundation models hold significant potential for automating radiology report generation, thereby assisting clinicians in diagnosing cardiac diseases. However, generated reports often suffer from serious factual inaccuracy. In this paper, we introduce a fact-aware multimodal retrieval-augmented pipeline in generating accurate radiology reports (FactMM-RAG). We first leverage RadGraph to mine factual report pairs, then integrate factual knowledge to train a universal multimodal retriever. Given a radiology image, our retriever can identify high-quality reference reports to augment multimodal foundation models, thus enhancing the factual completeness and correctness of report generation. Experiments on two benchmark datasets show that our multimodal retriever outperforms state-of-the-art retrievers on both language generation and radiology-specific metrics, up to 6.5% and 2% score in F1CheXbert and F1RadGraph. Further analysis indicates that employing our factually-informed training strategy imposes an effective supervision signal, without relying on explicit diagnostic label guidance, and successfully propagates fact-aware capabilities from the multimodal retriever to the multimodal foundation model in radiology report generation.
☆ XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models
In this survey, we address the key challenges in Large Language Models (LLM) research, focusing on the importance of interpretability. Driven by increasing interest from AI and business sectors, we highlight the need for transparency in LLMs. We examine the dual paths in current LLM research and eXplainable Artificial Intelligence (XAI): enhancing performance through XAI and the emerging focus on model interpretability. Our paper advocates for a balanced approach that values interpretability equally with functional advancements. Recognizing the rapid development in LLM research, our survey includes both peer-reviewed and preprint (arXiv) papers, offering a comprehensive overview of XAI's role in LLM research. We conclude by urging the research community to advance both LLM and XAI fields together.
☆ Two eyes, Two views, and finally, One summary! Towards Multi-modal Multi-tasking Knowledge-Infused Medical Dialogue Summarization
We often summarize a multi-party conversation in two stages: chunking with homogeneous units and summarizing the chunks. Thus, we hypothesize that there exists a correlation between homogeneous speaker chunking and overall summarization tasks. In this work, we investigate the effectiveness of a multi-faceted approach that simultaneously produces summaries of medical concerns, doctor impressions, and an overall view. We introduce a multi-modal, multi-tasking, knowledge-infused medical dialogue summary generation (MMK-Summation) model, which is incorporated with adapter-based fine-tuning through a gated mechanism for multi-modal information integration. The model, MMK-Summation, takes dialogues as input, extracts pertinent external knowledge based on the context, integrates the knowledge and visual cues from the dialogues into the textual content, and ultimately generates concise summaries encompassing medical concerns, doctor impressions, and a comprehensive overview. The introduced model surpasses multiple baselines and traditional summarization models across all evaluation metrics (including human evaluation), which firmly demonstrates the efficacy of the knowledge-guided multi-tasking, multimodal medical conversation summarization. The code is available at https://github.com/NLP-RL/MMK-Summation.
☆ TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples' quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.
comment: Preprint. Our code and models are available at: https://github.com/2003pro/TAGCOS
☆ The Hitchhiker's Guide to Human Alignment with *PO
With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence of the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO.
comment: 10 pages
☆ A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech ACL 2024
Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community, contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-Asian violence-provoking speech. Using data from ~420k Twitter posts spanning a 3-year duration (January 1, 2020 to February 1, 2023), we develop a codebook to characterize anti-Asian violence-provoking speech and collect a community-crowdsourced dataset to facilitate its large-scale detection using state-of-the-art classifiers. We contrast the capabilities of natural language processing classifiers, ranging from BERT-based to LLM-based classifiers, in detecting violence-provoking speech with their capabilities to detect anti-Asian hateful speech. In contrast to prior work that has demonstrated the effectiveness of such classifiers in detecting hateful speech ($F_1 = 0.89$), our work shows that accurate and reliable detection of violence-provoking speech is a challenging task ($F_1 = 0.69$). We discuss the implications of our findings, particularly the need for proactive interventions to support Asian communities during public health crises. The resources related to the study are available at https://claws-lab.github.io/violence-provoking-speech/.
comment: Accepted to ACL 2024 Main
☆ When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
☆ A Survey on Employing Large Language Models for Text-to-SQL Tasks
The increasing volume of data stored in relational databases has led to the need for efficient querying and utilization of this data in various sectors. However, writing SQL queries requires specialized knowledge, which poses a challenge for non-professional users trying to access and query databases. Text-to-SQL parsing solves this issue by converting natural language queries into SQL queries, thus making database access more accessible for non-expert users. To take advantage of the recent developments in Large Language Models (LLMs), a range of new methods have emerged, with a primary focus on prompt engineering and fine-tuning. This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions. We hope this review will enable readers to gain a broader understanding of the recent advances in this field and offer some insights into its future trajectory.
☆ Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment
Large language models (LLMs) increasingly find their way into the most diverse areas of our everyday lives. They indirectly influence people's decisions or opinions through their daily use. Therefore, understanding how and which moral judgements these LLMs make is crucial. However, morality is not universal and depends on the cultural background. This raises the question of whether these cultural preferences are also reflected in LLMs when prompted in different languages or whether moral decision-making is consistent across different languages. So far, most research has focused on investigating the inherent values of LLMs in English. While a few works conduct multilingual analyses of moral bias in LLMs in a multilingual setting, these analyses do not go beyond atomic actions. To the best of our knowledge, a multilingual analysis of moral bias in dilemmas has not yet been conducted. To address this, our paper builds on the moral machine experiment (MME) to investigate the moral preferences of five LLMs, Falcon, Gemini, Llama, GPT, and MPT, in a multilingual setting and compares them with the preferences collected from humans belonging to different cultures. To accomplish this, we generate 6500 scenarios of the MME and prompt the models in ten languages on which action to take. Our analysis reveals that all LLMs inhibit different moral biases to some degree and that they not only differ from the human preferences but also across multiple languages within the models themselves. Moreover, we find that almost all models, particularly Llama 3, divert greatly from human values and, for instance, prefer saving fewer people over saving more.
comment: to be published in AIES 2024 Proceedings
☆ Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope
The maximum supported context length is a critical bottleneck limiting the practical application of the Large Language Model (LLM). Although existing length extrapolation methods can extend the context of LLMs to millions of tokens, these methods all have an explicit upper bound. In this work, we propose LongCache, a training-free approach that enables LLM to support an infinite context with finite context scope, through full-context cache selection and training-free integration. This effectively frees LLMs from the length extrapolation issue. We validate LongCache on the LongBench and L-Eval and demonstrate its performance is on par with traditional full-attention mechanisms. Furthermore, we have applied LongCache on mainstream LLMs, including LLaMA3 and Mistral-v0.3, enabling them to support context lengths of at least 400K in Needle-In-A-Haystack tests. We will improve the efficiency of LongCache by GPU-aware optimization soon.
comment: 8 pages, 7 figures
☆ When Can Transformers Count to n?
Large language models based on the transformer architectures can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, that involve counting how many times a token in the vocabulary have appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our empirical results demonstrate the same phase-transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.
☆ Fine-grained Gender Control in Machine Translation with Large Language Models NAACL 2024
In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input have been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover an emergence of gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation.
comment: NAACL 2024 Main track long paper
☆ A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available^1.
☆ DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding specific weighted layer penalties and redistribution, offering an economical and effective solution without additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.
☆ Multi-Agent Causal Discovery Using Large Language Models
Large Language Models (LLMs) have demonstrated significant potential in causal discovery tasks by utilizing their vast expert knowledge from extensive text corpora. However, the multi-agent capabilities of LLMs in causal discovery remain underexplored. This paper introduces a general framework to investigate this potential. The first is the Meta Agents Model, which relies exclusively on reasoning and discussions among LLM agents to conduct causal discovery. The second is the Coding Agents Model, which leverages the agents' ability to plan, write, and execute code, utilizing advanced statistical libraries for causal discovery. The third is the Hybrid Model, which integrates both the Meta Agents Model and CodingAgents Model approaches, combining the statistical analysis and reasoning skills of multiple agents. Our proposed framework shows promising results by effectively utilizing LLMs expert knowledge, reasoning capabilities, multi-agent cooperation, and statistical causal methods. By exploring the multi-agent potential of LLMs, we aim to establish a foundation for further research in utilizing LLMs multi-agent for solving causal-related problems.
☆ Relational Database Augmented Large Language Model
Large language models (LLMs) excel in many natural language processing (NLP) tasks. However, since LLMs can only incorporate new knowledge through training or supervised fine-tuning processes, they are unsuitable for applications that demand precise, up-to-date, and private information not available in the training corpora. This precise, up-to-date, and private information is typically stored in relational databases. Thus, a promising solution is to augment LLMs with the inclusion of relational databases as external memory. This can ensure the timeliness, correctness, and consistency of data, and assist LLMs in performing complex arithmetic operations beyond their inherent capabilities. However, bridging the gap between LLMs and relational databases is challenging. It requires the awareness of databases and data values stored in databases to select correct databases and issue correct SQL queries. Besides, it is necessary for the external memory to be independent of the LLM to meet the needs of real-world applications. We introduce a novel LLM-agnostic memory architecture comprising a database selection memory, a data value memory, and relational databases. And we design an elegant pipeline to retrieve information from it. Besides, we carefully design the prompts to instruct the LLM to maximize the framework's potential. To evaluate our method, we compose a new dataset with various types of questions. Experimental results show that our framework enables LLMs to effectively answer database-related questions, which is beyond their direct ability.
☆ Natural Language Task-Oriented Dialog System 2.0
Task-oriented dialog (TOD) systems play a crucial role in facilitating efficient interactions between users and machines by focusing on achieving specific goals through natural language communication. These systems traditionally rely on manually annotated metadata, such as dialog states and policy annotations, which is labor-intensive, expensive, inconsistent, and prone to errors, thereby limiting the potential to leverage the vast amounts of available conversational data. A critical aspect of TOD systems involves accessing and integrating information from external sources to effectively engage users. The process of determining when and how to query external resources represents a fundamental challenge in system design, however existing approaches expect this information to provided in the context. In this paper, we introduce Natural Language Task Oriented Dialog System (NL-ToD), a novel model that removes the dependency on manually annotated turn-wise data by utilizing dialog history and domain schemas to create a Zero Shot Generalizable TOD system. We also incorporate query generation as a core task of the system, where the output of the system could be a response to the user or an API query to communicate with an external resource. To achieve a more granular analysis of the system output, we classify the output into multiple categories: slot filling, retrieval, and query generation. Our analysis reveals that slot filling is the most challenging TOD task for all models. Experimental results on three popular TOD datasets (SGD, KETOD and BiToD) shows the effectiveness of our approach as NL-ToD outperforms state-of-the-art approaches, particularly with a \textbf{31.4\%} and \textbf{82.1\%} improvement in the BLEU-4 score on the SGD and KETOD dataset.
☆ End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of multimedia processing, requiring intricate interactions between visual and textual modalities. Simply uniformly sampling frames or indiscriminately aggregating frame-level visual features often falls short in capturing the nuanced and relevant contexts of videos to well perform VideoQA. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video. Furthermore, we design a differentiable adaptive frame sampling mechanism to facilitate end-to-end training for the frame selector and answer generator. The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new SOTA across NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Furthermore, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.
☆ Audio-visual training for improved grounding in video-text LLMs
Recent advances in multimodal LLMs, have led to several video-text models being proposed for critical video-related tasks. However, most of the previous works support visual input only, essentially muting the audio signal in the video. Few models that support both audio and visual input, are not explicitly trained on audio data. Hence, the effect of audio towards video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparison with vision-only baselines, and other audio-visual models showcase that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset, with audio-aware question-answer pairs.
☆ Enhancing Incremental Summarization with Structured Representations
Large language models (LLMs) often struggle with processing extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to incrementally process these contexts, but they still suffer from information overload due to the volume of unstructured data handled. In our study, we introduce structured knowledge representations ($GU_{json}$), which significantly improve summarization performance by 40% and 14% across two public datasets. Most notably, we propose the Chain-of-Key strategy ($CoK_{json}$) that dynamically updates or augments these representations with new information, rather than recreating the structured memory for each new source. This method further enhances performance by 7% and 4% on the datasets.
☆ Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that an inability to separate answer symbol tokens in vocabulary space is a property of models unable to perform formatted MCQA tasks.
comment: Preprint. Code will be available at https://github.com/allenai/understanding_mcqa
♻ ☆ OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
♻ ☆ Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine
Language models (LMs), including large language models (such as ChatGPT), have the potential to assist clinicians in generating various clinical notes. However, LMs are prone to produce ``hallucinations'', i.e., generated content that is not aligned with facts and knowledge. In this paper, we propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning to enable LMs to generate faithful clinical texts. We demonstrate the effectiveness of our method in generating patient discharge instructions. It requires the LMs not to only understand the patients' long clinical documents, i.e., the health records during hospitalization, but also to generate critical instructional information provided both to carers and to the patient at the time of discharge. The proposed Re$^3$Writer imitates the working patterns of physicians to first \textbf{re}trieve related working experience from historical instructions written by physicians, then \textbf{re}ason related medical knowledge. Finally, it \textbf{re}fines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the discharge instructions for previously-unseen patients. Our experiments show that, using our method, the performance of five representative LMs can be substantially boosted across all metrics. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of fluency, faithfulness, and comprehensiveness.
♻ ☆ DPO Meets PPO: Reinforced Token Optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
♻ ☆ Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space ACL 2024
Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images with the language modality. As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes to inform the design of future models and streamline the interpretability efforts on the current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
comment: Accepted at ACL 2024 (Main, Short)
♻ ☆ Superpixel Semantics Representation and Pre-training for Vision-Language Task
The key to integrating visual language tasks is to establish a good alignment strategy. Recently, visual semantic representation has achieved fine-grained visual understanding by dividing grids or image patches. However, the coarse-grained semantic interactions in image space should not be ignored, which hinders the extraction of complex contextual semantic relations at the scene boundaries. This paper proposes superpixels as comprehensive and robust visual primitives, which mine coarse-grained semantic interactions by clustering perceptually similar pixels, speeding up the subsequent processing of primitives. To capture superpixel-level semantic features, we propose a Multiscale Difference Graph Convolutional Network (MDGCN). It allows parsing the entire image as a fine-to-coarse visual hierarchy. To reason actual semantic relations, we reduce potential noise interference by aggregating difference information between adjacent graph nodes. Finally, we propose a multi-level fusion rule in a bottom-up manner to avoid understanding deviation by mining complementary spatial information at different levels. Experiments show that the proposed method can effectively promote the learning of multiple downstream tasks. Encouragingly, our method outperforms previous methods on all metrics. Our code will be released upon publication.
♻ ☆ The Oscars of AI Theater: A Survey on Role-Playing with Language Models
This survey explores the burgeoning field of role-playing with language models, focusing on their development from early persona-based models to advanced character-driven simulations facilitated by Large Language Models (LLMs). Initially confined to simple persona consistency due to limited model capabilities, role-playing tasks have now expanded to embrace complex character portrayals involving character consistency, behavioral alignment, and overall attractiveness. We provide a comprehensive taxonomy of the critical components in designing these systems, including data, models and alignment, agent architecture and evaluation. This survey not only outlines the current methodologies and challenges, such as managing dynamic personal profiles and achieving high-level persona consistency but also suggests avenues for future research in improving the depth and realism of role-playing applications. The goal is to guide future research by offering a structured overview of current methodologies and identifying potential areas for improvement. Related resources and papers are available at https://github.com/nuochenpku/Awesome-Role-Play-Papers.
comment: 28 pages
♻ ☆ Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Large Language Models have excelled in various fields but encounter efficiency limitations due to the substantial Key-Value (KV) cache required for long-sequence inference. Recent efforts try to evict non-critical cache elements during runtime, thereby reducing cache size within given memory budgets while preserving generation quality. Our reexamination of foundational principles reveals that prevailing methods aim to minimize an upper bound of eviction loss, quantified as the L1 distance between the pre- and post-eviction outputs of multi-head self-attention mechanisms. Moreover, our analysis indicates that the common practices of uniformly assigning budgets across different attention heads during cache eviction hinder their budget utilization, negatively impacting generation quality. In light of these findings, we propose a simple yet effective adaptive budget allocation algorithm. This algorithm not only optimizes the loss upper bound in theory but also reduces the eviction loss in practice by aligning with the intrinsic patterns of self-attention mechanisms. Integrating this algorithm into two advanced methods, we develop Ada-SnapKV and Ada-Pyramid. Extensive evaluations on 16 datasets and the Needle-in-a-Haystack test confirm that they both significantly boost performance across various tasks.
♻ ☆ FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models
Emotional Support Conversation (ESC) is a typical dialogue that can effectively assist the user in mitigating emotional pressures. However, owing to the inherent subjectivity involved in analyzing emotions, current non-artificial methodologies face challenges in effectively appraising the emotional support capability. These metrics exhibit a low correlation with human judgments. Concurrently, manual evaluation methods extremely will cause high costs. To solve these problems, we propose a novel model FEEL (Framework for Evaluating Emotional Support Capability with Large Lan-guage Models), employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities. The model meticulously considers various evaluative aspects of ESC to apply a more comprehensive and accurate evaluation method for ESC. Additionally, it employs a probability distribution approach for a more stable result and integrates an ensemble learning strategy, leveraging multiple LLMs with assigned weights to enhance evaluation accuracy. To appraise the performance of FEEL, we conduct extensive experiments on existing ESC model dialogues. Experimental results demonstrate our model exhibits a substantial enhancement in alignment with human evaluations compared to the baselines. Our source code is available at https://github.com/Ansisy/FEEL.
comment: Accepted to ICIC 2024
♻ ☆ VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency ICML 2024
Large Language Models (LLMs), combined with program-based solving techniques, are increasingly demonstrating proficiency in mathematical reasoning. For example, closed-source models such as OpenAI GPT-4 and Claude show excellent results in solving math word problems. However, progress in math word problem-solving for open-source LLMs is limited, and the challenges these models face are not well-studied. In this paper, we study the performance of strong open-source LLMs, including Llama 2 (7B), Code Llama (7B), and Mistral (7B) on math word problems using program-based solving techniques. Specifically, we analyze the outputs of these models when applied to math word problems and identify a category of problems that pose a significant challenge, particularly those involving quantities spanning multiple units. To address this issue, we propose a systematic approach by defining the units for each quantity and ensuring the consistency of these units during mathematical operations. We developed Unit Consistency Programs (UCPs), an annotated dataset of math word problems, each paired with programs containing unit specifications and unit verification routines. We fine-tuned Llama 2 (7B), Code Llama (7B), and Mistral (7B) models with UCPs to produce theirVerityMath variants. Our findings indicate that our approach, which incorporates unit consistency, currently slightly underperforms compared to an approach that does not. To understand the reasons behind this, we conduct an in-depth error analysis and suggest options for future improvements. Our code and dataset are available at https://github.com/vernontoh/VerityMath.
comment: AI4MATH Workshop @ ICML 2024
♻ ☆ Towards Better Question Generation in QA-based Event Extraction ACL2024
Event Extraction (EE) is an essential information extraction task that aims to extract event-related information from unstructured texts. The paradigm of this task has shifted from conventional classification-based methods to more contemporary question-answering-based (QA-based) approaches. However, in QA-based EE, the quality of the questions dramatically affects the extraction accuracy, and how to generate high-quality questions for QA-based EE remains a challenge. In this work, to tackle this challenge, we suggest four criteria to evaluate the quality of a question and propose a reinforcement learning method, RLQG, for QA-based EE that can generate generalizable, high-quality, and context-dependent questions and provides clear guidance to QA models. The extensive experiments conducted on ACE and RAMS datasets have strongly validated our approach's effectiveness, which also demonstrates its robustness in scenarios with limited training data. The corresponding code of RLQG is released for further research.
comment: Accepted to ACL2024 Findings
♻ ☆ Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Circuit analysis of any certain model behavior is a central task in mechanistic interpretability. We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted into the model, the model's computation graph with respect to OV and MLP circuits becomes strictly linear. Our methods do not require linear approximation to compute the causal effect of each node. This fine-grained graph identifies both end-to-end and local circuits accounting for either logits or intermediate features. We can scalably apply this pipeline with a technique called Hierarchical Attribution. We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits. Our results reveal new findings underlying existing discoveries.
♻ ☆ Language Models as Science Tutors
NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.
comment: 8 pages without bibliography and appendix, 26 pages total
♻ ☆ RouteLLM: Learning to Route LLMs with Preference Data
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
♻ ☆ Active Prompting with Chain-of-Thought for Large Language Models ACL 2024
The increasing scale of large language models (LLMs) brings emergent abilities to various complex tasks requiring reasoning, such as arithmetic and commonsense reasoning. It is known that the effective design of task-specific prompts is critical for LLMs' ability to produce high-quality answers. In particular, an effective approach for complex question-and-answer tasks is example-based prompting with chain-of-thought (CoT) reasoning, which significantly improves the performance of LLMs. However, current CoT methods rely on a fixed set of human-annotated exemplars, which are not necessarily the most effective examples for different tasks. This paper proposes a new method, Active-Prompt, to adapt LLMs to different tasks with task-specific example prompts (annotated with human-designed CoT reasoning). For this purpose, we propose a solution to the key problem of determining which questions are the most important and helpful ones to annotate from a pool of task-specific queries. By borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics to characterize the uncertainty so as to select the most uncertain questions for annotation. Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art on eight complex reasoning tasks. Further analyses of different uncertainty metrics, pool sizes, zero-shot learning, and accuracy-uncertainty relationship demonstrate the effectiveness of our method. Our code will be available at https://github.com/shizhediao/active-prompt.
comment: Published in ACL 2024
♻ ☆ Will the Real Linda Please Stand up...to Large Language Models? Examining the Representativeness Heuristic in LLMs
Although large language models (LLMs) have demonstrated remarkable proficiency in modeling text and generating human-like text, they may exhibit biases acquired from training data in doing so. Specifically, LLMs may be susceptible to a common cognitive trap in human decision-making called the representativeness heuristic. This is a concept in psychology that refers to judging the likelihood of an event based on how closely it resembles a well-known prototype or typical example, versus considering broader facts or statistical evidence. This research investigates the impact of the representativeness heuristic on LLM reasoning. We created ReHeAT (Representativeness Heuristic AI Testing), a dataset containing a series of problems spanning six common types of representativeness heuristics. Experiments reveal that four LLMs applied to ReHeAT all exhibited representativeness heuristic biases. We further identify that the model's reasoning steps are often incorrectly based on a stereotype rather than on the problem's description. Interestingly, the performance improves when adding a hint in the prompt to remind the model to use its knowledge. This suggests the uniqueness of the representativeness heuristic compared to traditional biases. It can occur even when LLMs possess the correct knowledge while falling into a cognitive trap. This highlights the importance of future research focusing on the representativeness heuristic in model reasoning and decision-making and on developing solutions to address it.
comment: work in progress
♻ ☆ Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval ACL 2024
Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
comment: Accepted to TACL 2024
♻ ☆ Paths of A Million People: Extracting Life Trajectories from Wikipedia
The life trajectories of notable people have been studied to pinpoint the times and places of significant events such as birth, death, education, marriage, competition, work, speeches, scientific discoveries, artistic achievements, and battles. Understanding how these individuals interact with others provides valuable insights for broader research into human dynamics. However, the scarcity of trajectory data in terms of volume, density, and inter-person interactions, limits relevant studies from being comprehensive and interactive. We mine millions of biography pages from Wikipedia and tackle the generalization problem stemming from the variety and heterogeneity of the trajectory descriptions. Our ensemble model COSMOS, which combines the idea of semi-supervised learning and contrastive learning, achieves an F1 score of 85.95%. For this task, we also create a hand-curated dataset, WikiLifeTrajectory, consisting of 8,852 (person, time, location) triplets as ground truth. Besides, we perform an empirical analysis on the trajectories of 8,272 historians to demonstrate the validity of the extracted results. To facilitate the research on trajectory extractions and help the analytical studies to construct grand narratives, we make our code, the million-level extracted trajectories, and the WikiLifeTrajectory dataset publicly available.
comment: Accepted to ICWSM 2025. 15 pages
♻ ☆ Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation CVPR 2024
Spatial control is a core capability in controllable image generation. Advancements in layout-guided image generation have shown promising results on in-distribution (ID) datasets with similar spatial configurations. However, it is unclear how these models perform when facing out-of-distribution (OOD) samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape. We benchmark two recent representative layout-guided image generation methods and observe that the good ID layout control may not generalize well to arbitrary layouts in the wild (e.g., objects at the boundary). Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting, demonstrating stronger generalizability than existing models on OOD layouts in LayoutBench. We perform quantitative and qualitative evaluation and fine-grained analysis on the four LayoutBench skills to pinpoint the weaknesses of existing models. We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order. Lastly, we evaluate the zero-shot performance of different pretrained layout-guided image generation models on LayoutBench-COCO, our new benchmark for OOD layouts with real objects, where our IterInpaint consistently outperforms SOTA baselines in all four splits. Project website: https://layoutbench.github.io
comment: CVPR 2024 Workshop; Project website: https://layoutbench.github.io
♻ ☆ Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator
Large Language Models (LLMs) have demonstrated remarkable proficiency in human interactions, yet their application within the medical field remains insufficiently explored. Previous works mainly focus on the performance of medical knowledge with examinations, which is far from the realistic scenarios, falling short in assessing the abilities of LLMs on clinical tasks. In the quest to enhance the application of Large Language Models (LLMs) in healthcare, this paper introduces the Automated Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS), targeting the gap between traditional LLM evaluations and the nuanced demands of clinical practice. Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach offers a closer approximation to real clinical scenarios and allows for a detailed analysis of LLM behaviors in response to complex patient interactions. Our extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.
comment: 23 pages, 5 figures
♻ ☆ Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios due to the information loss introduced by eviction. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward merging set identification algorithm to identify suitable KV states for merging. Our merging set identification algorithm stimulates the second observation that KV cache sparsity, from similarity perspective, is independent of the dataset and remains persistent at the model level. Subsequently, we propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroScroll benchmarks, we compare our method with other KV cache compression techniques, including H2O and CaM, showing that our method achieves superior performance across tasks with both 50% and 35% KV cache budgets.
♻ ☆ Medical Spoken Named Entity Recognition
Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed
comment: Preprint, 41 pages
♻ ☆ BrainStorm @ iREL at #SMM4H 2024: Leveraging Translation and Topical Embeddings for Annotation Detection in Tweets ACL 2024
The proliferation of LLMs in various NLP tasks has sparked debates regarding their reliability, particularly in annotation tasks where biases and hallucinations may arise. In this shared task, we address the challenge of distinguishing annotations made by LLMs from those made by human domain experts in the context of COVID-19 symptom detection from tweets in Latin American Spanish. This paper presents BrainStorm @ iRELs approach to the SMM4H 2024 Shared Task, leveraging the inherent topical information in tweets, we propose a novel approach to identify and classify annotations, aiming to enhance the trustworthiness of annotated data.
comment: Accepted at SMM4H, colocated at ACL 2024
Artificial Intelligence
☆ Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms
Industry is rapidly moving towards fully autonomous and interconnected systems that can detect and adapt to changing conditions, including machine hardware faults. Traditional methods for adding hardware fault tolerance to machines involve duplicating components and algorithmically reconfiguring a machine's processes when a fault occurs. However, the growing interest in reinforcement learning-based robotic control offers a new perspective on achieving hardware fault tolerance. However, limited research has explored the potential of these approaches for hardware fault tolerance in machines. This paper investigates the potential of two state-of-the-art reinforcement learning algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to enhance hardware fault tolerance into machines. We assess the performance of these algorithms in two OpenAI Gym simulated environments, Ant-v2 and FetchReach-v1. Robot models in these environments are subjected to six simulated hardware faults. Additionally, we conduct an ablation study to determine the optimal method for transferring an agent's knowledge, acquired through learning in a normal (pre-fault) environment, to a (post-)fault environment in a continual learning setting. Our results demonstrate that reinforcement learning-based approaches can enhance hardware fault tolerance in simulated machines, with adaptation occurring within minutes. Specifically, PPO exhibits the fastest adaptation when retaining the knowledge within its models, while SAC performs best when discarding all acquired knowledge. Overall, this study highlights the potential of reinforcement learning-based approaches, such as PPO and SAC, for hardware fault tolerance in machines. These findings pave the way for the development of robust and adaptive machines capable of effectively operating in real-world scenarios.
☆ Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency
Graph Out-of-Distribution (OOD), requiring that models trained on biased data generalize to the unseen test data, has considerable real-world applications. One of the most mainstream methods is to extract the invariant subgraph by aligning the original and augmented data with the help of environment augmentation. However, these solutions might lead to the loss or redundancy of semantic subgraphs and result in suboptimal generalization. To address this challenge, we propose exploiting Probability of Necessity and Sufficiency (PNS) to extract sufficient and necessary invariant substructures. Beyond that, we further leverage the domain variant subgraphs related to the labels to boost the generalization performance in an ensemble manner. Specifically, we first consider the data generation process for graph data. Under mild conditions, we show that the sufficient and necessary invariant subgraph can be extracted by minimizing an upper bound, built on the theoretical advance of the probability of necessity and sufficiency. To further bridge the theory and algorithm, we devise the model called Sufficiency and Necessity Inspired Graph Learning (SNIGL), which ensembles an invariant subgraph classifier on top of latent sufficient and necessary invariant subgraphs, and a domain variant subgraph classifier specific to the test domain for generalization enhancement. Experimental results demonstrate that our SNIGL model outperforms the state-of-the-art techniques on six public benchmarks, highlighting its effectiveness in real-world scenarios.
☆ New Rules for Causal Identification with Background Knowledge
Identifying causal relations is crucial for a variety of downstream tasks. In additional to observational data, background knowledge (BK), which could be attained from human expertise or experiments, is usually introduced for uncovering causal relations. This raises an open problem that in the presence of latent variables, what causal relations are identifiable from observational data and BK. In this paper, we propose two novel rules for incorporating BK, which offer a new perspective to the open problem. In addition, we show that these rules are applicable in some typical causality tasks, such as determining the set of possible causal effects with observational data. Our rule-based approach enhances the state-of-the-art method by circumventing a process of enumerating block sets that would otherwise take exponential complexity.
☆ Explaining Decisions of Agents in Mixed-Motive Games
In recent years, agents have become capable of communicating seamlessly via natural language and navigating in environments that involve cooperation and competition, a fact that can introduce social dilemmas. Due to the interleaving of cooperation and competition, understanding agents' decision-making in such environments is challenging, and humans can benefit from obtaining explanations. However, such environments and scenarios have rarely been explored in the context of explainable AI. While some explanation methods for cooperative environments can be applied in mixed-motive setups, they do not address inter-agent competition, cheap-talk, or implicit communication by actions. In this work, we design explanation methods to address these issues. Then, we proceed to demonstrate their effectiveness and usefulness for humans, using a non-trivial mixed-motive game as a test case. Lastly, we establish generality and demonstrate the applicability of the methods to other games, including one where we mimic human game actions using large language models.
☆ Genetic Algorithm to Optimize Design of Micro-Surgical Scissors
Microrobotics is an attractive area of research as small-scale robots have the potential to improve the precision and dexterity offered by minimally invasive surgeries. One example of such a tool is a pair of micro-surgical scissors that was developed for cutting of tumors or cancerous tissues present deep inside the body such as in the brain. This task is often deemed difficult or impossible with conventional robotic tools due to their size and dexterity. The scissors are designed with two magnets placed a specific distance apart to maximize deflection and generate cutting forces. However, remote actuation and size requirements of the micro-surgical scissors limits the force that can be generated to puncture the tissue. To address the limitation of small output forces, we use an evolutionary algorithm to further optimize the performance of the scissors. In this study, the design of the previously developed untethered micro-surgical scissors has been modified and their performance is enhanced by determining the optimal position of the magnets as well as the direction of each magnetic moment. The developed algorithm is successfully applied to a 4-magnet configuration which results in increased net torque. This improvement in net torque is directly translated into higher cutting forces. The new configuration generates a cutting force of 58 mN from 80 generations of the evolutionary algorithm which is a 1.65 times improvement from the original design. Furthermore, the developed algorithm has the advantage that it can be deployed with minor modifications to other microrobotic tools and systems, opening up new possibilities for various medical procedures and applications.
comment: Accepted for presentation at the International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS) 2024, Delft, Netherlands
☆ Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
Image-text retrieval (ITR), an important task in information retrieval (IR), is driven by pretrained vision-language models (VLMs) that consistently achieve state-of-the-art performance. However, a significant challenge lies in the brittleness of existing ITR benchmarks. In standard datasets for the task, captions often provide broad summaries of scenes, neglecting detailed information about specific concepts. Additionally, the current evaluation setup assumes simplistic binary matches between images and texts and focuses on intra-modality rather than cross-modal relationships, which can lead to misinterpretations of model performance. Motivated by this gap, in this study, we focus on examining the brittleness of the ITR evaluation pipeline with a focus on concept granularity. We start by analyzing two common benchmarks, MS-COCO and Flickr30k, and compare them with their augmented versions, MS-COCO-FG and Flickr30k-FG, given a specified set of linguistic features capturing concept granularity. We discover that Flickr30k-FG and MS COCO-FG consistently achieve higher scores across all the selected features. To investigate the performance of VLMs on coarse and fine-grained datasets, we introduce a taxonomy of perturbations. We apply these perturbations to the selected datasets. We evaluate four state-of-the-art models - ALIGN, AltCLIP, CLIP, and GroupViT - on the standard and fine-grained datasets under zero-shot conditions, with and without the applied perturbations. The results demonstrate that although perturbations generally degrade model performance, the fine-grained datasets exhibit a smaller performance drop than their standard counterparts. Moreover, the relative performance drop across all setups is consistent across all models and datasets, indicating that the issue lies within the benchmarks. We conclude the paper by providing an agenda for improving ITR evaluation pipelines.
☆ Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling
Energy based models (EBMs) are appealing for their generality and simplicity in data likelihood modeling, but have conventionally been difficult to train due to the unstable and time-consuming implicit MCMC sampling during contrastive divergence training. In this paper, we present a novel energy-based generative framework, Variational Potential Flow (VAPO), that entirely dispenses with implicit MCMC sampling and does not rely on complementary latent models or cooperative training. The VAPO framework aims to learn a potential energy function whose gradient (flow) guides the prior samples, so that their density evolution closely follows an approximate data likelihood homotopy. An energy loss function is then formulated to minimize the Kullback-Leibler divergence between density evolution of the flow-driven prior and the data likelihood homotopy. Images can be generated after training the potential energy, by initializing the samples from Gaussian prior and solving the ODE governing the potential flow on a fixed time interval using generic ODE solvers. Experiment results show that the proposed VAPO framework is capable of generating realistic images on various image datasets. In particular, our proposed framework achieves competitive FID scores for unconditional image generation on the CIFAR-10 and CelebA datasets.
☆ TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples' quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.
comment: Preprint. Our code and models are available at: https://github.com/2003pro/TAGCOS
♻ ☆ OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
♻ ☆ DPO Meets PPO: Reinforced Token Optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
♻ ☆ Artificial Intelligence for Cochlear Implants: Review of Strategies, Challenges, and Perspectives
Automatic speech recognition (ASR) plays a pivotal role in our daily lives, offering utility not only for interacting with machines but also for facilitating communication for individuals with partial or profound hearing impairments. The process involves receiving the speech signal in analog form, followed by various signal processing algorithms to make it compatible with devices of limited capacities, such as cochlear implants (CIs). Unfortunately, these implants, equipped with a finite number of electrodes, often result in speech distortion during synthesis. Despite efforts by researchers to enhance received speech quality using various state-of-the-art (SOTA) signal processing techniques, challenges persist, especially in scenarios involving multiple sources of speech, environmental noise, and other adverse conditions. The advent of new artificial intelligence (AI) methods has ushered in cutting-edge strategies to address the limitations and difficulties associated with traditional signal processing techniques dedicated to CIs. This review aims to comprehensively cover advancements in CI-based ASR and speech enhancement, among other related aspects. The primary objective is to provide a thorough overview of metrics and datasets, exploring the capabilities of AI algorithms in this biomedical field, and summarizing and commenting on the best results obtained. Additionally, the review will delve into potential applications and suggest future directions to bridge existing research gaps in this domain.
♻ ☆ Computational Copyright: Towards A Royalty Model for Music Generative AI
The advancement of generative AI has given rise to pressing copyright challenges, especially within the music industry. This paper focuses on the economic aspects of these challenges, emphasizing that the economic impact constitutes a central issue in the copyright arena. Furthermore, the complexity of the black-box generative AI technologies not only suggests but necessitates algorithmic solutions. Yet, such solutions have been largely missing, exacerbating regulatory hurdles in this landscape. We seek to address this gap by proposing viable royalty models for revenue sharing on AI music generation platforms. We start by examining existing royalty models utilized by platforms like Spotify and YouTube, and then discuss how to adapt them to the unique context of AI-generated music. A significant challenge emerging from this adaptation is the attribution of AI-generated music to influential copyrighted content in the training data. To this end, we present algorithmic solutions employing data attribution techniques. We also conduct a range of experiments to verify the effectiveness and robustness of these solutions. This research is one of the early attempts to integrate technical advancements with economic and legal considerations in the field of music generative AI, offering a computational copyright solution for the challenges posed by the opaque nature of AI technologies.
♻ ☆ StuGPTViz: A Visual Analytics Approach to Understand Student-ChatGPT Interactions
The integration of Large Language Models (LLMs), especially ChatGPT, into education is poised to revolutionize students' learning experiences by introducing innovative conversational learning methodologies. To empower students to fully leverage the capabilities of ChatGPT in educational scenarios, understanding students' interaction patterns with ChatGPT is crucial for instructors. However, this endeavor is challenging due to the absence of datasets focused on student-ChatGPT conversations and the complexities in identifying and analyzing the evolutional interaction patterns within conversations. To address these challenges, we collected conversational data from 48 students interacting with ChatGPT in a master's level data visualization course over one semester. We then developed a coding scheme, grounded in the literature on cognitive levels and thematic analysis, to categorize students' interaction patterns with ChatGPT. Furthermore, we present a visual analytics system, StuGPTViz, that tracks and compares temporal patterns in student prompts and the quality of ChatGPT's responses at multiple scales, revealing significant pedagogical insights for instructors. We validated the system's effectiveness through expert interviews with six data visualization instructors and three case studies. The results confirmed StuGPTViz's capacity to enhance educators' insights into the pedagogical value of ChatGPT. We also discussed the potential research opportunities of applying visual analytics in education and developing AI-driven personalized learning solutions.
comment: 11 pages. To be published at IEEE Visualization 2024
♻ ☆ Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space ACL 2024
Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images with the language modality. As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes to inform the design of future models and streamline the interpretability efforts on the current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
comment: Accepted at ACL 2024 (Main, Short)
Machine Learning
☆ Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection ECCV 2024
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.
comment: ECCV 2024
☆ Revisiting Neighborhood Aggregation in Graph Neural Networks for Node Classification using Statistical Signal Processing
We delve into the issue of node classification within graphs, specifically reevaluating the concept of neighborhood aggregation, which is a fundamental component in graph neural networks (GNNs). Our analysis reveals conceptual flaws within certain benchmark GNN models when operating under the assumption of edge-independent node labels, a condition commonly observed in benchmark graphs employed for node classification. Approaching neighborhood aggregation from a statistical signal processing perspective, our investigation provides novel insights which may be used to design more efficient GNN models.
☆ Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms
Industry is rapidly moving towards fully autonomous and interconnected systems that can detect and adapt to changing conditions, including machine hardware faults. Traditional methods for adding hardware fault tolerance to machines involve duplicating components and algorithmically reconfiguring a machine's processes when a fault occurs. However, the growing interest in reinforcement learning-based robotic control offers a new perspective on achieving hardware fault tolerance. However, limited research has explored the potential of these approaches for hardware fault tolerance in machines. This paper investigates the potential of two state-of-the-art reinforcement learning algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to enhance hardware fault tolerance into machines. We assess the performance of these algorithms in two OpenAI Gym simulated environments, Ant-v2 and FetchReach-v1. Robot models in these environments are subjected to six simulated hardware faults. Additionally, we conduct an ablation study to determine the optimal method for transferring an agent's knowledge, acquired through learning in a normal (pre-fault) environment, to a (post-)fault environment in a continual learning setting. Our results demonstrate that reinforcement learning-based approaches can enhance hardware fault tolerance in simulated machines, with adaptation occurring within minutes. Specifically, PPO exhibits the fastest adaptation when retaining the knowledge within its models, while SAC performs best when discarding all acquired knowledge. Overall, this study highlights the potential of reinforcement learning-based approaches, such as PPO and SAC, for hardware fault tolerance in machines. These findings pave the way for the development of robust and adaptive machines capable of effectively operating in real-world scenarios.
☆ Conformal Predictions under Markovian Data
We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. Finally we present $K$-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain. Its coverage gap is reduced to $t_\mathrm{mix}/(n\ln(n))$ without really affecting the size of the prediction set. We finally test our algorithms on synthetic and real-world datasets.
☆ Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency
Graph Out-of-Distribution (OOD), requiring that models trained on biased data generalize to the unseen test data, has considerable real-world applications. One of the most mainstream methods is to extract the invariant subgraph by aligning the original and augmented data with the help of environment augmentation. However, these solutions might lead to the loss or redundancy of semantic subgraphs and result in suboptimal generalization. To address this challenge, we propose exploiting Probability of Necessity and Sufficiency (PNS) to extract sufficient and necessary invariant substructures. Beyond that, we further leverage the domain variant subgraphs related to the labels to boost the generalization performance in an ensemble manner. Specifically, we first consider the data generation process for graph data. Under mild conditions, we show that the sufficient and necessary invariant subgraph can be extracted by minimizing an upper bound, built on the theoretical advance of the probability of necessity and sufficiency. To further bridge the theory and algorithm, we devise the model called Sufficiency and Necessity Inspired Graph Learning (SNIGL), which ensembles an invariant subgraph classifier on top of latent sufficient and necessary invariant subgraphs, and a domain variant subgraph classifier specific to the test domain for generalization enhancement. Experimental results demonstrate that our SNIGL model outperforms the state-of-the-art techniques on six public benchmarks, highlighting its effectiveness in real-world scenarios.
☆ LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storagebased approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads.LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speed up on end-to-end epoch time while running large-scale GNN training
☆ TimeInf: Time Series Data Contribution via Influence Functions
Evaluating the contribution of individual data points to a model's prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and texts; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to time series datasets, the problem of estimating data contribution in such settings remains unexplored, possibly due to challenges associated with handling inherent temporal dependencies. This paper introduces TimeInf, a data contribution estimation method for time-series datasets. TimeInf uses influence functions to attribute model predictions to individual time points while preserving temporal structures. Our extensive empirical results demonstrate that TimeInf outperforms state-of-the-art methods in identifying harmful anomalies and helpful time points for forecasting. Additionally, TimeInf offers intuitive and interpretable attributions of data values, allowing us to easily distinguish diverse anomaly patterns through visualizations.
☆ Weyl Calculus and Exactly Solvable Schrödinger Bridges with Quadratic State Cost
Schr\"{o}dinger bridge--a stochastic dynamical generalization of optimal mass transport--exhibits a learning-control duality. Viewed as a stochastic control problem, the Schr\"{o}dinger bridge finds an optimal control policy that steers a given joint state statistics to another while minimizing the total control effort subject to controlled diffusion and deadline constraints. Viewed as a stochastic learning problem, the Schr\"{o}dinger bridge finds the most-likely distribution-valued trajectory connecting endpoint distributional observations, i.e., solves the two point boundary-constrained maximum likelihood problem over the manifold of probability distributions. Recent works have shown that solving the Schr\"{o}dinger bridge problem with state cost requires finding the Markov kernel associated with a reaction-diffusion PDE where the state cost appears as a state-dependent reaction rate. We explain how ideas from Weyl calculus in quantum mechanics, specifically the Weyl operator and the Weyl symbol, can help determine such Markov kernels. We illustrate these ideas by explicitly finding the Markov kernel for the case of quadratic state cost via Weyl calculus, recovering our earlier results but avoiding tedious computation with Hermite polynomials.
♻ ☆ Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine
Language models (LMs), including large language models (such as ChatGPT), have the potential to assist clinicians in generating various clinical notes. However, LMs are prone to produce ``hallucinations'', i.e., generated content that is not aligned with facts and knowledge. In this paper, we propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning to enable LMs to generate faithful clinical texts. We demonstrate the effectiveness of our method in generating patient discharge instructions. It requires the LMs not to only understand the patients' long clinical documents, i.e., the health records during hospitalization, but also to generate critical instructional information provided both to carers and to the patient at the time of discharge. The proposed Re$^3$Writer imitates the working patterns of physicians to first \textbf{re}trieve related working experience from historical instructions written by physicians, then \textbf{re}ason related medical knowledge. Finally, it \textbf{re}fines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the discharge instructions for previously-unseen patients. Our experiments show that, using our method, the performance of five representative LMs can be substantially boosted across all metrics. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of fluency, faithfulness, and comprehensiveness.
♻ ☆ DPO Meets PPO: Reinforced Token Optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
♻ ☆ Fast Risk Assessment in Power Grids through Novel Gaussian Process and Active Learning
This paper presents a graph-structured Gaussian process (GP) model for data-driven risk assessment of critical voltage constraints. The proposed GP is based on a novel kernel, named the vertex-degree kernel (VDK), that decomposes the voltage-load relationship based on the network graph. To estimate the GP efficiently, we propose a novel active learning scheme that leverages the additive structure of VDK. Further, we prove a probabilistic bound on the error in risk estimation using VDK-GP model that demonstrates that it is statistically comparable to using standard AC power flow (AC-PF), but does not require computing a large number of ACPF solutions. Simulations demonstrate that the proposed VDK-GP achieves more than two fold sample complexity reduction, compared to a generic GP on medium scale 500-Bus and large scale 1354-Bus power systems. Moreover, active learning achieves an impressive reduction of over 15 times in comparison to the time complexity of Monte-Carlo simulations (MCS), and have risk estimation error of order 1E-4 for both 500-Bus and 1354-Bus system, demonstrating its superior efficiency in risk estimation.
comment: 9 pages
♻ ☆ Convergence Analysis of Probability Flow ODE for Score-based Generative Models
Score-based generative models have emerged as a powerful approach for sampling high-dimensional probability distributions. Despite their effectiveness, their theoretical underpinnings remain relatively underdeveloped. In this work, we study the convergence properties of deterministic samplers based on probability flow ODEs from both theoretical and numerical perspectives. Assuming access to $L^2$-accurate estimates of the score function, we prove the total variation between the target and the generated data distributions can be bounded above by $\mathcal{O}(d^{3/4}\delta^{1/2})$ in the continuous time level, where $d$ denotes the data dimension and $\delta$ represents the $L^2$-score matching error. For practical implementations using a $p$-th order Runge-Kutta integrator with step size $h$, we establish error bounds of $\mathcal{O}(d^{3/4}\delta^{1/2} + d\cdot(dh)^p)$ at the discrete level. Finally, we present numerical studies on problems up to 128 dimensions to verify our theory.
comment: 37 pages, 7 figures
♻ ☆ Explainable bank failure prediction models: Counterfactual explanations to reduce the failure risk
The accuracy and understandability of bank failure prediction models are crucial. While interpretable models like logistic regression are favored for their explainability, complex models such as random forest, support vector machines, and deep learning offer higher predictive performance but lower explainability. These models, known as black boxes, make it difficult to derive actionable insights. To address this challenge, using counterfactual explanations is suggested. These explanations demonstrate how changes in input variables can alter the model output and suggest ways to mitigate bank failure risk. The key challenge lies in selecting the most effective method for generating useful counterfactuals, which should demonstrate validity, proximity, sparsity, and plausibility. The paper evaluates several counterfactual generation methods: WhatIf, Multi Objective, and Nearest Instance Counterfactual Explanation, and also explores resampling methods like undersampling, oversampling, SMOTE, and the cost sensitive approach to address data imbalance in bank failure prediction in the US. The results indicate that the Nearest Instance Counterfactual Explanation method yields higher quality counterfactual explanations, mainly using the cost sensitive approach. Overall, the Multi Objective Counterfactual and Nearest Instance Counterfactual Explanation methods outperform others regarding validity, proximity, and sparsity metrics, with the cost sensitive approach providing the most desirable counterfactual explanations. These findings highlight the variability in the performance of counterfactual generation methods across different balancing strategies and machine learning models, offering valuable strategies to enhance the utility of black box bank failure prediction models.
comment: 20 pages, 1 figure
♻ ☆ Compact Proofs of Model Performance via Mechanistic Interpretability ICML 2024
We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
comment: accepted to ICML 2024 Workshop on Mechanistic Interpretability (Spotlight)
Computers and Society
☆ PUFFLE: Balancing Privacy, Utility, and Fairness in Federated Learning
Training and deploying Machine Learning models that simultaneously adhere to principles of fairness and privacy while ensuring good utility poses a significant challenge. The interplay between these three factors of trustworthiness is frequently underestimated and remains insufficiently explored. Consequently, many efforts focus on ensuring only two of these factors, neglecting one in the process. The decentralization of the datasets and the variations in distributions among the clients exacerbate the complexity of achieving this ethical trade-off in the context of Federated Learning (FL). For the first time in FL literature, we address these three factors of trustworthiness. We introduce PUFFLE, a high-level parameterised approach that can help in the exploration of the balance between utility, privacy, and fairness in FL scenarios. We prove that PUFFLE can be effective across diverse datasets, models, and data distributions, reducing the model unfairness up to 75%, with a maximum reduction in the utility of 17% in the worst-case scenario, while maintaining strict privacy guarantees during the FL training.
☆ Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis
Panoramic cycling videos can record 360{\deg} views around the cyclists. Thus, it is essential to conduct automatic road user analysis on them using computer vision models to provide data for studies on cycling safety. However, the features of panoramic data such as severe distortions, large number of small objects and boundary continuity have brought great challenges to the existing CV models, including poor performance and evaluation methods that are no longer applicable. In addition, due to the lack of data with annotations, it is not easy to re-train the models. In response to these problems, the project proposed and implemented a three-step methodology: (1) improve the prediction performance of the pre-trained object detection models on panoramic data by projecting the original image into 4 perspective sub-images; (2) introduce supports for boundary continuity and category information into DeepSORT, a commonly used multiple object tracking model, and set an improved detection model as its detector; (3) using the tracking results, develop an application for detecting the overtaking behaviour of the surrounding vehicles. Evaluated on the panoramic cycling dataset built by the project, the proposed methodology improves the average precision of YOLO v5m6 and Faster RCNN-FPN under any input resolution setting. In addition, it raises MOTA and IDF1 of DeepSORT by 7.6\% and 9.7\% respectively. When detecting the overtakes in the test videos, it achieves the F-score of 0.88. The code is available on GitHub at github.com/cuppp1998/360_object_tracking to ensure the reproducibility and further improvements of results.
☆ Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment
Large language models (LLMs) increasingly find their way into the most diverse areas of our everyday lives. They indirectly influence people's decisions or opinions through their daily use. Therefore, understanding how and which moral judgements these LLMs make is crucial. However, morality is not universal and depends on the cultural background. This raises the question of whether these cultural preferences are also reflected in LLMs when prompted in different languages or whether moral decision-making is consistent across different languages. So far, most research has focused on investigating the inherent values of LLMs in English. While a few works conduct multilingual analyses of moral bias in LLMs in a multilingual setting, these analyses do not go beyond atomic actions. To the best of our knowledge, a multilingual analysis of moral bias in dilemmas has not yet been conducted. To address this, our paper builds on the moral machine experiment (MME) to investigate the moral preferences of five LLMs, Falcon, Gemini, Llama, GPT, and MPT, in a multilingual setting and compares them with the preferences collected from humans belonging to different cultures. To accomplish this, we generate 6500 scenarios of the MME and prompt the models in ten languages on which action to take. Our analysis reveals that all LLMs inhibit different moral biases to some degree and that they not only differ from the human preferences but also across multiple languages within the models themselves. Moreover, we find that almost all models, particularly Llama 3, divert greatly from human values and, for instance, prefer saving fewer people over saving more.
comment: to be published in AIES 2024 Proceedings
☆ The dark side of the metaverse: The role of gamification in event virtualization
The virtualization of cultural events in the metaverse creates opportunities to generate valuable and innovative experiences that replicate and extend in-person events; but the process faces associated challenges. In the absence of relevant empirical studies, the aim of this article is to analyze the positive and negative aspects of the user experience in a cultural event held in the metaverse. A mixed-methods approach is employed to test the proposed hypotheses. The results from three focus groups demonstrated the difficulty that users face in focusing their attention on the main elements of the metaverse, and the inability of this virtual sphere to convey the authenticity of a cultural event. Based on these findings, a metaverse-focused quantitative study was conducted to examine whether perceived gamification mitigate the negative effects of users failing to pay attention in their metaverse experiences. When users increased their attention levels, their ability to imagine the real experience and their perceptions of the authenticity of the cultural event increased, which produced positive behavioral intentions. This is one of the first studies to empirically analyze the tourist experience in the metaverse; managers and policymakers can benefit from the results to hold valuable virtual cultural events.
comment: Published research article
♻ ☆ Evaluation of LLMs Biases Towards Elite Universities: A Persona-Based Exploration
Elite universities are a dream destination for not just students but also top employers who get a supply of amazing talents. When we hear about top universities, the first thing that comes to mind is their academic rigor, prestigious reputation, and highly successful alumni. However, society at large is not just represented by a few elite universities, but several others. We have seen several examples where many, even without formal education, built big businesses. There are various instances in which several people, however talented, couldn't make it to top elite universities because of several resource constraints. For recruitment of candidates, we do see candidates from a few elite universities well represented in top technology companies. However, we found during our study that LLMs go overboard in representing that. This study investigates whether popular LLMs exhibit bias towards elite universities when generating personas for technology industry professionals. We employed a novel persona-based approach to compare the educational background predictions of GPT-3.5, Gemini, and Claude 3 Sonnet with actual data from LinkedIn. The study focused on various roles at Microsoft, Meta, and Google, including VP Product, Director of Engineering, and Software Engineer. We generated 432 personas across the three LLMs and analyzed the frequency of elite universities (Stanford, MIT, UC Berkeley, and Harvard) in these personas compared to LinkedIn data. Results showed that LLMs significantly overrepresented elite universities, with 72.45% of generated personas featuring these institutions, compared to only 8.56% in the actual LinkedIn data. ChatGPT 3.5 exhibited the highest bias, followed by Claude Sonnet 3, while Gemini performed best. This research highlights the need to address educational bias in LLMs and suggests strategies for mitigating such biases in AI-driven recruitment processes.
comment: 14 pages, 4 Figures
♻ ☆ Characteristics of ChatGPT users from Germany: implications for the digital divide from web tracking data
A major challenge of our time is reducing disparities in access to and effective use of digital technologies, with recent discussions highlighting the role of AI in exacerbating the digital divide. We examine user characteristics that predict usage of the AI-powered conversational agent ChatGPT. We combine behavioral and survey data in a web tracked sample of N=1376 German citizens to investigate differences in ChatGPT activity (usage, visits, and adoption) during the first 11 months from the launch of the service (November 30, 2022). Guided by a model of technology acceptance (UTAUT- 2), we examine the role of socio-demographics commonly associated with the digital divide in ChatGPT activity and explore further socio-political attributes identified via stability selection in Lasso regressions. We confirm that lower age and higher education affect ChatGPT usage, but do not find that gender or income do. We find full-time employment and more children to be barriers to ChatGPT activity. Using a variety of social media was positively associated with ChatGPT activity. In terms of political variables, political knowledge and political self-efficacy as well as some political behaviors such as voting, debating political issues online and offline and political action online were all associated with ChatGPT activity, with online political debating and political self-efficacy negatively so. Finally, need for cognition and communication skills such as writing, attending meetings, or giving presentations, were also associated with ChatGPT engagement, though chairing/organizing meetings was negatively associated. Our research informs efforts to address digital disparities and promote digital literacy among underserved populations by presenting implications, recommendations, and discussions on ethical and social issues of our findings.
♻ ☆ A Finger on the Pulse of Cardiovascular: Estimating Blood Pressure with Smartphone Photoplethysmography-Based Pulse Waveform Analysis
Utilizing mobile phone cameras for continuous blood pressure(BP) monitoring presents a cost-effective and accessible approach, yet it is challenged by limitations in accuracy and interpretability. This study introduces four innovative strategies to enhance smartphone-based photoplethysmography for BP estimation(SPW-BP), addressing the interpretability-accuracy dilemma: First, we employ often-neglected data quality improvement techniques, such as height normalization, corrupt data removal, and boundary signal reconstruction; second, we conduct an in-depth analysis of over twenty waveform indicators across three categories to identify the most predictive features; third, we employ SHapley Additive exPlanations(SHAP) analysis to ensure the transparency and explainability of machine learning outcomes; and fourth, we utilize Bland-Altman analysis alongside AAMI and BHS standards for comparative evaluation. Analysis of data from 127 participants demonstrated a significant correlation between smartphone-captured waveform features and those from standard BP monitoring devices. Employing multiple linear regression within a cross-validation framework, waveform variables predicted systolic blood pressure(SBP) with a mean absolute error(MAE) of 9.86 6.78 mmHg and diastolic blood pressure(DBP) with an MAE of 8.01 5.15 mmHg. Further application of Random Forest models significantly improved the prediction MAE for SBP to 8.91 6.30 mmHg and for DBP to 6.68 4.54 mmHg, indicating enhanced predictive accuracy. Correlation and SHAP analysis identified key features for improving BP estimation. However, Bland-Altman analysis revealed systematic biases, and MAE analysis showed that the results did not meet AAMI and BHS accuracy standards. Our findings highlight the potential of SPW-BP, yet suggest that smartphone PPG technology is not yet a viable alternative to traditional medical devices for BP measurement.
comment: 30 pages, 7 figures
Computation and Language
☆ Improving Citation Text Generation: Overcoming Limitations in Length Control
A key challenge in citation text generation is that the length of generated text often differs from the length of the target, lowering the quality of the generation. While prior works have investigated length-controlled generation, their effectiveness depends on knowing the appropriate generation length. In this work, we present an in-depth study of the limitations of predicting scientific citation text length and explore the use of heuristic estimates of desired length.
☆ Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data ICML
Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training data. Our experiments focus on three general task types: translation, question-answering, and multiple-choice reasoning. With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.
comment: ICML FM-Wild workshop version
☆ Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Vision-language models (VLMs) have achieved significant strides in recent times specially in multimodal tasks, yet they remain susceptible to adversarial attacks on their vision components. To address this, we propose Sim-CLIP, an unsupervised adversarial fine-tuning method that enhances the robustness of the widely-used CLIP vision encoder against such attacks while maintaining semantic richness and specificity. By employing a Siamese architecture with cosine similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient visual representations without requiring large batch sizes or momentum encoders. Our results demonstrate that VLMs enhanced with Sim-CLIP's fine-tuned CLIP encoder exhibit significantly enhanced robustness against adversarial attacks, while preserving semantic meaning of the perturbed images. Notably, Sim-CLIP does not require additional training or fine-tuning of the VLM itself; replacing the original vision encoder with our fine-tuned Sim-CLIP suffices to provide robustness. This work underscores the significance of reinforcing foundational models like CLIP to safeguard the reliability of downstream VLM applications, paving the way for more secure and effective multimodal systems.
☆ Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives
The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our paper contributes to providing a holistic perspective on the technical foundations, practical applications, and emerging challenges within the evolving landscape of Generative AI and LLMs. We believe that understanding the generative capabilities of AI systems and the specific context of LLMs is crucial for researchers, practitioners, and policymakers to collaboratively shape the responsible and ethical integration of these technologies into various domains. Furthermore, we identify and address main research gaps, providing valuable insights to guide future research endeavors within the AI research community.
comment: This version is accepted for publication in the journal of IEEE Transactions on artificial intelligence (TAI)
☆ Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues
Interruption in a dialogue occurs when the listener begins their speech before the current speaker finishes speaking. Interruptions can be broadly divided into two groups: cooperative (when the listener wants to support the speaker), and competitive (when the listener tries to take control of the conversation against the speaker's will). A system that automatically classifies interruptions can be used in call centers, specifically in the tasks of customer satisfaction monitoring and agent monitoring. In this study, we developed a text-based interruption classification model by preparing an in-house dataset consisting of ASR-transcribed customer support telephone dialogues in Russian. We fine-tuned Conversational RuBERT on our dataset and optimized hyperparameters, and the model performed well. With further improvements, the proposed model can be applied to automatic monitoring systems.
comment: 9 pages, 1 figure, 3 tables
☆ Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.
comment: Preprint. Under review
☆ Consent in Crisis: The Rapid Decline of the AI Data Commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.
comment: 42 pages (13 main), 5 figures, 9 tables
☆ Improving Context-Aware Preference Modeling for Language Models
While finetuning language models from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance exceeding that of GPT-4 and Llama 3 70B on tested datasets, and (3) investigate the value of context-aware preference modeling.
comment: 10 pages (28 with references and appendix)
☆ Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning
Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi utilizes advanced prototypical cross-modal self-supervised contrastive learning to enhance the accuracy, efficiency, and generalizability of forensic analyses. It was pre-trained and evaluated on a comprehensive multi-center dataset, which includes over 16 million high-resolution image patches, 2,228 vision-language pairs of post-mortem whole slide images (WSIs), and corresponding gross key findings, along with 471 distinct diagnostic outcomes. Our findings indicate that SongCi surpasses existing multi-modal AI models in many forensic pathology tasks, performs comparably to experienced forensic pathologists and significantly better than less experienced ones, and provides detailed multi-modal explainability, offering critical assistance in forensic investigations. To the best of our knowledge, SongCi is the first VLM specifically developed for forensic pathological analysis and the first large-vocabulary computational pathology (CPath) model that directly processes gigapixel WSIs in forensic science.
comment: 28 pages, 6 figures, under review
☆ Falcon2-11B Technical Report
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of the Falcon2-11B which follows a multi-stage approach where the early stages are distinguished by their context length and a final stage where we use a curated, high-quality dataset. Additionally, we report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate. The downstream performance of the foundation model is evaluated on established benchmarks, including multilingual and code datasets. The foundation model shows strong generalization across all the tasks which makes it suitable for downstream finetuning use cases. For the vision language model, we report the performance on several benchmarks and show that our model achieves a higher average score compared to open-source models of similar size. The model weights and code of both Falcon2-11B and Falcon2-11B-vlm are made available under a permissive license.
☆ Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
Multilingual sentence encoders are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of multilingual sentence encoders is the trade-off between monolingual and cross-lingual performance. Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks. In this work, we address both issues by modular training of sentence encoders, i.e., by separating monolingual specialization from cross-lingual alignment. We first efficiently train language-specific sentence encoders to avoid negative interference between languages (i.e., the curse). We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each, preventing interference with monolingual specialization from the first step. In both steps, we resort to contrastive learning on machine-translated paraphrase data. Monolingual and cross-lingual evaluations on semantic text similarity/relatedness and multiple-choice QA render our modular solution more effective than multilingual sentence encoders, especially benefiting low-resource languages.
☆ Seal: Advancing Speech Language Models to be Few-Shot Learners
Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in prompt, without requiring any additional training. In order to extend this capability to a multi-modal setting (i.e. speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which Kullback-Leibler divergence loss is performed to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.
☆ Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. Therefore, understanding how LLMs reason and make decisions is crucial for their safe deployment. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. Leveraging the insight that LLMs learn to infer latent concepts during pretraining, we propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty. We show that the uncertainty decreases as the prompt's informativeness increases, similar to epistemic uncertainty. Our detailed experimental results on real datasets validate our proposed model.
comment: 27 pages, 11 figures
Overview of AI-Debater 2023: The Challenges of Argument Generation Tasks
In this paper we present the results of the AI-Debater 2023 Challenge held by the Chinese Conference on Affect Computing (CCAC 2023), and introduce the related datasets. We organize two tracks to handle the argumentative generation tasks in different scenarios, namely, Counter-Argument Generation (Track 1) and Claim-based Argument Generation (Track 2). Each track is equipped with its distinct dataset and baseline model respectively. In total, 32 competing teams register for the challenge, from which we received 11 successful submissions. In this paper, we will present the results of the challenge and a summary of the systems, highlighting commonalities and innovations among participating systems. Datasets and baseline models of the AI-Debater 2023 Challenge have been already released and can be accessed through the official website of the challenge.
comment: Overview of AI-Debater 2023
☆ Text Style Transfer: An Introductory Overview
Text Style Transfer (TST) is a pivotal task in natural language generation to manipulate text style attributes while preserving style-independent content. The attributes targeted in TST can vary widely, including politeness, authorship, mitigation of offensive language, modification of feelings, and adjustment of text formality. TST has become a widely researched topic with substantial advancements in recent years. This paper provides an introductory overview of TST, addressing its challenges, existing approaches, datasets, evaluation measures, subtasks, and applications. This fundamental overview improves understanding of the background and fundamentals of text style transfer.
comment: Accepted at 4EU+ International Workshop on Recent Advancements in Artificial Intelligence
♻ ☆ On Context-aware Detection of Cherry-picking in News Reporting
Cherry-picking refers to the deliberate selection of evidence or facts that favor a particular viewpoint while ignoring or distorting evidence that supports an opposing perspective. Manually identifying cherry-picked statements in news stories can be challenging. In this study, we introduce a novel approach to detecting cherry-picked statements by identifying missing important statements in a target news story using language models and contextual information from other news sources. Furthermore, this research introduces a novel dataset specifically designed for training and evaluating cherry-picking detection models. Our best performing model achieves an F-1 score of about 89% in detecting important statements. Moreover, results show the effectiveness of incorporating external knowledge from alternative narratives when assessing statement importance.
♻ ☆ Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design.
♻ ☆ Exploring the Use of Abusive Generative AI Models on Civitai
The rise of generative AI is transforming the landscape of digital imagery, and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. Designed in the vein of social networks, they also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet, this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.
comment: Accepted to ACM Multimedia 2024
♻ ☆ Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
We introduce, Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through-estimator to the training. We also introduce Block Q-Sparse for batch training and inference. The key results from this work are, (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) We present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
comment: Work in progress
♻ ☆ Efficient Pre-training for Localized Instruction Generation of Videos ECCV 2024
Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instruction from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released code and dataset.
comment: ECCV 2024
♻ ☆ BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
Evaluating the bias in Large Language Models (LLMs) becomes increasingly crucial with their rapid development. However, existing evaluation methods rely on fixed-form outputs and cannot adapt to the flexible open-text generation scenarios of LLMs (e.g., sentence completion and question answering). To address this, we introduce BiasAlert, a plug-and-play tool designed to detect social bias in open-text generations of LLMs. BiasAlert integrates external human knowledge with inherent reasoning capabilities to detect bias reliably. Extensive experiments demonstrate that BiasAlert significantly outperforms existing state-of-the-art methods like GPT4-as-A-Judge in detecting bias. Furthermore, through application studies, we demonstrate the utility of BiasAlert in reliable LLM bias evaluation and bias mitigation across various scenarios. Model and code will be publicly released.
♻ ☆ Evaluating Large Language Models' Ability Using a Psychiatric Screening Tool Based on Metaphor and Sarcasm Scenarios
Metaphors and sarcasm are precious fruits of our highly evolved social communication skills. However, children with the condition then known as Asperger syndrome are known to have difficulties in comprehending sarcasm, even if they possess adequate verbal IQs for understanding metaphors. Accordingly, researchers had employed a screening test that assesses metaphor and sarcasm comprehension to distinguish Asperger syndrome from other conditions with similar external behaviors (e.g., attention-deficit/hyperactivity disorder). This study employs a standardized test to evaluate recent large language models' (LLMs) understanding of nuanced human communication. The results indicate improved metaphor comprehension with increased model parameters; however, no similar improvement was observed for sarcasm comprehension. Considering that a human's ability to grasp sarcasm has been associated with the amygdala, a pivotal cerebral region for emotional learning, a distinctive strategy for training LLMs would be imperative to imbue them with the ability in a cognitively grounded manner.
♻ ☆ Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
This article introduces contrastive alignment instructions (AlignInstruct) to address two challenges in machine translation (MT) on large language models (LLMs). One is the expansion of supported languages to previously unseen ones. The second relates to the lack of data in low-resource languages. Model fine-tuning through MT instructions (MTInstruct) is a straightforward approach to the first challenge. However, MTInstruct is limited by weak cross-lingual signals inherent in the second challenge. AlignInstruct emphasizes cross-lingual supervision via a cross-lingual discriminator built using statistical word alignments. Our results based on fine-tuning the BLOOMZ models (1b1, 3b, and 7b1) in up to 24 unseen languages showed that: (1) LLMs can effectively translate unseen languages using MTInstruct; (2) AlignInstruct led to consistent improvements in translation quality across 48 translation directions involving English; (3) Discriminator-based instructions outperformed their generative counterparts as cross-lingual instructions; (4) AlignInstruct improved performance in 30 zero-shot directions.
comment: Accepted to LoResMT 2024
♻ ☆ Korean Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering EMNLP
Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean restaurant reviews are notably lacking in the existing literature. Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating translated benchmark and unlabeled Korean data. Using a model fine-tuned on translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we applied LaBSE and MSP-based filtering to this pseudo-NLI set as implicit feature, enhancing Aspect Category Detection and Polarity determination through additional training. Incorporating dual filtering, this model bridged dataset gaps, achieving positive results in Korean ABSA with minimal resources. Through additional data injection pipelines, our approach aims to utilize high-resource data and construct effective models within communities, whether corporate or individual, in low-resource language countries. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy. We release the dataset and our code for Korean ABSA, at this link.
comment: 13 pages, EMNLP Industry Track (submitted), DMLR@ICML 2024
Computers and Society
Open Problems in Technical AI Governance
AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where intervention is needed, (b) identify and assess the efficacy of potential governance actions, and (c) enhance governance options by designing mechanisms for enforcement, incentivization, or compliance. In this paper, we explain what technical AI governance is, why it is important, and present a taxonomy and incomplete catalog of its open problems. This paper is intended as a resource for technical researchers or research funders looking to contribute to AI governance.
comment: Ben Bucknall and Anka Reuel contributed equally and share the first author position
☆ A Measure for Level of Autonomy Based on Observable System Behavior
Contemporary artificial intelligence systems are pivotal in enhancing human efficiency and safety across various domains. One such domain is autonomous systems, especially in automotive and defense use cases. Artificial intelligence brings learning and enhanced decision-making to autonomy system goal-oriented behaviors and human independence. However, the lack of clear understanding of autonomy system capabilities hampers human-machine or machine-machine interaction and interdiction. This necessitates varying degrees of human involvement for safety, accountability, and explainability purposes. Yet, measuring the level autonomous capability in an autonomous system presents a challenge. Two scales of measurement exist, yet measuring autonomy presupposes a variety of elements not available in the wild. This is why existing measures for level of autonomy are operationalized only during design or test and evaluation phases. No measure for level of autonomy based on observed system behavior exists at this time. To address this, we outline a potential measure for predicting level of autonomy using observable actions. We also present an algorithm incorporating the proposed measure. The measure and algorithm have significance to researchers and practitioners interested in a method to blind compare autonomous systems at runtime. Defense-based implementations are likewise possible because counter-autonomy depends on robust identification of autonomous systems.
comment: 9 pages, 1 figure, 3 tables
☆ From Ad Identifiers to Global Privacy Control: The Status Quo and Future of Opting Out of Ad Tracking on Android
Apps and their integrated third party libraries often collect a variety of data from people to show them personalized ads. This practice is often privacy-invasive. Since 2013, Google has therefore allowed users to limit ad tracking on Android via system settings. Further, under the 2018 California Consumer Privacy Act (CCPA), apps must honor opt-outs from ad tracking under the Global Privacy Control (GPC). The efficacy of these two methods to limit ad tracking has not been studied in prior work. Our legal and technical analysis details how the GPC applies to mobile apps and how it could be integrated directly into Android, thereby developing a reference design for GPC on Android. Our empirical analysis of 1,896 top-ranked Android apps shows that both the Android system-level opt-out and the GPC signal rarely restrict ad tracking. In our view, deleting the AdID and opting out under the CCPA has the same meaning. Thus, the current AdID setting and APIs should be evolved towards GPC and integrated into Android's Privacy Sandbox.
☆ Political Leanings in Web3 Betting: Decoding the Interplay of Political and Profitable Motives
Harnessing the transparent blockchain user behavior data, we construct the Political Betting Leaning Score (PBLS) to measure political leanings based on betting within Web3 prediction markets. Focusing on Polymarket and starting from the 2024 U.S. Presidential Election, we synthesize behaviors over 15,000 addresses across 4,500 events and 8,500 markets, capturing the intensity and direction of their political leanings by the PBLS. We validate the PBLS through internal consistency checks and external comparisons. We uncover relationships between our PBLS and betting behaviors through over 800 features capturing various behavioral aspects. A case study of the 2022 U.S. Senate election further demonstrates the ability of our measurement while decoding the dynamic interaction between political and profitable motives. Our findings contribute to understanding decision-making in decentralized markets, enhancing the analysis of behaviors within Web3 prediction environments. The insights of this study reveal the potential of blockchain in enabling innovative, multidisciplinary studies and could inform the development of more effective online prediction markets, improve the accuracy of forecast, and help the design and optimization of platform mechanisms. The data and code for the paper are accessible at the following link: https://github.com/anonymous.
☆ FairViT: Fair Vision Transformer via Adaptive Masking ECCV 2024
Vision Transformer (ViT) has achieved excellent performance and demonstrated its promising potential in various computer vision tasks. The wide deployment of ViT in real-world tasks requires a thorough understanding of the societal impact of the model. However, most ViT-based works do not take fairness into account and it is unclear whether directly applying CNN-oriented debiased algorithm to ViT is feasible. Moreover, previous works typically sacrifice accuracy for fairness. Therefore, we aim to develop an algorithm that improves accuracy without sacrificing fairness. In this paper, we propose FairViT, a novel accurate and fair ViT framework. To this end, we introduce a novel distance loss and deploy adaptive fairness-aware masks on attention layers updating with model parameters. Experimental results show \sys can achieve accuracy better than other alternatives, even with competitive computational efficiency. Furthermore, \sys achieves appreciable fairness results.
comment: 20 pages, The European Conference on Computer Vision (ECCV 2024)
☆ Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach AAAI
Our research investigates the impact of Generative Artificial Intelligence (GAI) models, specifically text-to-image generators (T2Is), on the representation of non-Western cultures, with a focus on Indian contexts. Despite the transformative potential of T2Is in content creation, concerns have arisen regarding biases that may lead to misrepresentations and marginalizations. Through a community-centered approach and grounded theory analysis of 5 focus groups from diverse Indian subcultures, we explore how T2I outputs to English prompts depict Indian culture and its subcultures, uncovering novel representational harms such as exoticism and cultural misappropriation. These findings highlight the urgent need for inclusive and culturally sensitive T2I systems. We propose design guidelines informed by a sociotechnical perspective, aiming to address these issues and contribute to the development of more equitable and representative GAI technologies globally. Our work also underscores the necessity of adopting a community-centered approach to comprehend the sociotechnical dynamics of these models, complementing existing work in this space while identifying and addressing the potential negative repercussions and harms that may arise when these models are deployed on a global scale.
comment: This is the pre-peer reviewed version, which has been accepted at the 7th AAAI ACM Conference on AI, Ethics, and Society, Oct. 21, 2024, California, USA
☆ Implementing Fairness: the view from a FairDream
In this paper, we propose an experimental investigation of the problem of AI fairness in classification. We train an AI model and develop our own fairness package FairDream to detect inequalities and then to correct for them, using income prediction as a case study. Our experiments show that it is a property of FairDream to fulfill fairness objectives which are conditional on the ground truth (Equalized Odds), even when the algorithm is set the task of equalizing positives across groups (Demographic Parity). While this may be seen as an anomaly, we explain this property by comparing our approach with a closely related fairness method (GridSearch), which can enforce Demographic Parity at the expense of Equalized Odds. We grant that a fairness metric conditioned on true labels does not give a sufficient criterion to reach fairness, but we argue that it gives us at least a necessary condition to implement Demographic Parity cautiously. We also explain why neither Equal Calibration nor Equal Precision stand as relevant fairness criteria in classification. Addressing their limitations to warn the decision-maker for any disadvantaging rate, Equalized Odds avoids the peril of strict conservatism, while keeping away the utopia of a whole redistribution of resources through algorithms.
♻ ☆ What Radio Waves Tell Us about Sleep
The ability to assess sleep at home, capture sleep stages, and detect the occurrence of apnea (without on-body sensors) simply by analyzing the radio waves bouncing off people's bodies while they sleep is quite powerful. Such a capability would allow for longitudinal data collection in patients' homes, informing our understanding of sleep and its interaction with various diseases and their therapeutic responses, both in clinical trials and routine care. In this article, we develop an advanced machine learning algorithm for passively monitoring sleep and nocturnal breathing from radio waves reflected off people while asleep. Validation results in comparison with the gold standard (i.e., polysomnography) (n=849) demonstrate that the model captures the sleep hypnogram (with an accuracy of 81% for 30-second epochs categorized into Wake, Light Sleep, Deep Sleep, or REM), detects sleep apnea (AUROC = 0.88), and measures the patient's Apnea-Hypopnea Index (ICC=0.95; 95% CI = [0.93, 0.97]). Notably, the model exhibits equitable performance across race, sex, and age. Moreover, the model uncovers informative interactions between sleep stages and a range of diseases including neurological, psychiatric, cardiovascular, and immunological disorders. These findings not only hold promise for clinical practice and interventional trials but also underscore the significance of sleep as a fundamental component in understanding and managing various diseases.
comment: The first two authors contributed equally to this work
♻ ☆ CourseAssist: Pedagogically Appropriate AI Tutor for Computer Science Education
The growing enrollments in computer science courses and increase in class sizes necessitate scalable, automated tutoring solutions to adequately support student learning. While Large Language Models (LLMs) like GPT-4 have demonstrated potential in assisting students through question-answering, educators express concerns over student overreliance, miscomprehension of generated code, and the risk of inaccurate answers. Rather than banning these tools outright, we advocate for a constructive approach that harnesses the capabilities of AI while mitigating potential risks. This poster introduces CourseAssist, a novel LLM-based tutoring system tailored for computer science education. Unlike generic LLM systems, CourseAssist uses retrieval-augmented generation, user intent classification, and question decomposition to align AI responses with specific course materials and learning objectives, thereby ensuring pedagogical appropriateness of LLMs in educational settings. We evaluated CourseAssist against a baseline of GPT-4 using a dataset of 50 question-answer pairs from a programming languages course, focusing on the criteria of usefulness, accuracy, and pedagogical appropriateness. Evaluation results show that CourseAssist significantly outperforms the baseline, demonstrating its potential to serve as an effective learning assistant. We have also deployed CourseAssist in 6 computer science courses at a large public R1 research university reaching over 500 students. Interviews with 20 student users show that CourseAssist improves computer science instruction by increasing the accessibility of course-specific tutoring help and shortening the feedback loop on their programming assignments. Future work will include extensive pilot testing at more universities and exploring better collaborative relationships between students, educators, and AI that improve computer science learning experiences.
comment: Submitted to SIGCSE Virtual 2024
Computers and Society
☆ Using Case Studies to Teach Responsible AI to Industry Practitioners
Responsible AI (RAI) is the science and the practice of making the design, development, and use of AI socially sustainable: of reaping the benefits of innovation while controlling the risks. Naturally, industry practitioners play a decisive role in our collective ability to achieve the goals of RAI. Unfortunately, we do not yet have consolidated educational materials and effective methodologies for teaching RAI to practitioners. In this paper, we propose a novel stakeholder-first educational approach that uses interactive case studies to achieve organizational and practitioner -level engagement and advance learning of RAI. We discuss a partnership with Meta, an international technology company, to co-develop and deliver RAI workshops to a diverse audience within the company. Our assessment results indicate that participants found the workshops engaging and reported a positive shift in understanding and motivation to apply RAI to their work.
☆ Large-Area Emergency Lockdowns with Automated Driving Systems
Region-wide restrictions on personal vehicle travel have a long history in the United States, from riot curfews in the late 1960s, to travel bans during snow events, to the 2013 shelter-in-place "lockdown" during the search for the perpetrator of the Boston Marathon bombing. Because lockdowns require tremendous resources to enforce, they are often limited in duration or scope. The introduction of automated driving systems may allow governments to quickly and cheaply effect large-area lockdowns by jamming wireless communications, spoofing road closures on digital maps, exploiting a vehicle's programming to obey all traffic control devices, or coordinating with vehicle developers. Future vehicles may lack conventional controls, rendering them undrivable by the public. As travel restrictions become easier to implement, governments may enforce them more frequently, over longer durations and wider areas. This article explores the practical, legal, and ethical implications of lockdowns when most driving is highly automated, and provides guidance for the development of lockdown policies.
☆ Auditing the Grid-Based Placement of Private Label Products on E-commerce Search Result Pages
E-commerce platforms support the needs and livelihoods of their two most important stakeholders -- customers and producers/sellers. Multiple algorithmic systems, like ``search'' systems mediate the interactions between these stakeholders by connecting customers to producers with relevant items. Search results include (i) private label (PL) products that are manufactured/sold by the platform itself, as well as (ii) third-party products on advertised / sponsored and organic positions. In this paper, we systematically quantify the extent of PL product promotion on e-commerce search results for the two largest e-commerce platforms operating in India -- Amazon.in and Flipkart. By analyzing snapshots of search results across the two platforms, we discover high PL promotion on the initial result pages (~ 15% PLs are advertised on the first SERP of Amazon). Both platforms use different strategies to promote their PL products, such as placing more PLs on the advertised positions -- while Amazon places them on the first, middle, and last rows of the search results, Flipkart places them on the first two positions and the (entire) last column of the search results. We discover that these product placement strategies of both platforms conform with existing user attention strategies proposed in the literature. Finally, to supplement the findings from the collected data, we conduct a survey among 68 participants on Amazon Mechanical Turk. The click pattern from our survey shows that users strongly prefer to click on products placed at positions that correspond to the PL products on the search results of Amazon, but not so strongly on Flipkart. The click-through rate follows previously proposed theoretically grounded user attention distribution patterns in a two-dimensional layout.
☆ Informational Health --Toward the Reduction of Risks in the Information Space
The modern information society, markedly influenced by the advent of the internet and subsequent developments such as WEB 2.0, has seen an explosive increase in information availability, fundamentally altering human interaction with information spaces. This transformation has facilitated not only unprecedented access to information but has also raised significant challenges, particularly highlighted by the spread of ``fake news'' during critical events like the 2016 U.S. presidential election and the COVID-19 pandemic. The latter event underscored the dangers of an ``infodemic,'' where the large amount of information made distinguishing between factual and non-factual content difficult, thereby complicating public health responses and posing risks to democratic processes. In response to these challenges, this paper introduces the concept of ``informational health,'' drawing an analogy between dietary habits and information consumption. It argues that just as balanced diets are crucial for physical health, well-considered nformation behavior is essential for maintaining a healthy information environment. This paper proposes three strategies for fostering informational health: literacy education, visualization of meta-information, and informational health assessments. These strategies aim to empower users and platforms to navigate and enhance the information ecosystem effectively. By focusing on long-term informational well-being, we highlight the necessity of addressing the social risks inherent in the current attention economy, advocating for a paradigm shift towards a more sustainable information consumption model.
comment: 8pages, 2tables
☆ Undermining Mental Proof: How AI Can Make Cooperation Harder by Making Thinking Easier
Large language models and other highly capable AI systems ease the burdens of deciding what to say or do, but this very ease can undermine the effectiveness of our actions in social contexts. We explain this apparent tension by introducing the integrative theoretical concept of "mental proof," which occurs when observable actions are used to certify unobservable mental facts. From hiring to dating, mental proofs enable people to credibly communicate values, intentions, states of knowledge, and other private features of their minds to one another in low-trust environments where honesty cannot be easily enforced. Drawing on results from economics, theoretical biology, and computer science, we describe the core theoretical mechanisms that enable people to effect mental proofs. An analysis of these mechanisms clarifies when and how artificial intelligence can make low-trust cooperation harder despite making thinking easier.
☆ Honest Computing: Achieving demonstrable data lineage and provenance for driving data and process-sensitive policies
Data is the foundation of any scientific, industrial or commercial process. Its journey typically flows from collection to transport, storage, management and processing. While best practices and regulations guide data management and protection, recent events have underscored its vulnerability. Academic research and commercial data handling have been marred by scandals, revealing the brittleness of data management. Data, despite its importance, is susceptible to undue disclosures, leaks, losses, manipulation, or fabrication. These incidents often occur without visibility or accountability, necessitating a systematic structure for safe, honest, and auditable data management. In this paper, we introduce the concept of Honest Computing as the practice and approach that emphasizes transparency, integrity, and ethical behaviour within the realm of computing and technology. It ensures that computer systems and software operate honestly and reliably without hidden agendas, biases, or unethical practices. It enables privacy and confidentiality of data and code by design and by default. We also introduce a reference framework to achieve demonstrable data lineage and provenance, contrasting it with Secure Computing, a related but differently-orientated form of computing. At its core, Honest Computing leverages Trustless Computing, Confidential Computing, Distributed Computing, Cryptography and AAA security concepts. Honest Computing opens new ways of creating technology-based processes and workflows which permit the migration of regulatory frameworks for data protection from principle-based approaches to rule-based ones. Addressing use cases in many fields, from AI model protection and ethical layering to digital currency formation for finance and banking, trading, and healthcare, this foundational layer approach can help define new standards for appropriate data custody and processing.
comment: Accepted for publication in the Data & Policy journal
☆ LLMs left, right, and center: Assessing GPT's capabilities to label political bias from web domains
This research investigates whether OpenAI's GPT-4, a state-of-the-art large language model, can accurately classify the political bias of news sources based solely on their URLs. Given the subjective nature of political labels, third-party bias ratings like those from Ad Fontes Media, AllSides, and Media Bias/Fact Check (MBFC) are often used in research to analyze news source diversity. This study aims to determine if GPT-4 can replicate these human ratings on a seven-degree scale ("far-left" to "far-right"). The analysis compares GPT-4's classifications against MBFC's, and controls for website popularity using Open PageRank scores. Findings reveal a high correlation ($\text{Spearman's } \rho = .89$, $n = 5,877$, $p < 0.001$) between GPT-4's and MBFC's ratings, indicating the model's potential reliability. However, GPT-4 abstained from classifying approximately $\frac{2}{3}$ of the dataset, particularly less popular and less biased sources. The study also identifies a slight leftward skew in GPT-4's classifications compared to MBFC's. The analysis suggests that while GPT-4 can be a scalable, cost-effective tool for political bias classification of news websites, but its use should complement human judgment to mitigate biases. Further research is recommended to explore the model's performance across different settings, languages, and additional datasets.
comment: 12 pages, 4 figures
☆ User-Creator Feature Dynamics in Recommender Systems with Dual Influence
Recommender systems present relevant contents to users and help content creators reach their target audience. The dual nature of these systems influences both users and creators: users' preferences are affected by the items they are recommended, while creators are incentivized to alter their contents such that it is recommended more frequently. We define a model, called user-creator feature dynamics, to capture the dual influences of recommender systems. We prove that a recommender system with dual influence is guaranteed to polarize, causing diversity loss in the system. We then investigate, both theoretically and empirically, approaches for mitigating polarization and promoting diversity in recommender systems. Unexpectedly, we find that common diversity-promoting approaches do not work in the presence of dual influence, while relevancy-optimizing methods like top-$k$ recommendation can prevent polarization and improve diversity of the system.
♻ ☆ From Principles to Practices: Lessons Learned from Applying Partnership on AI's (PAI) Synthetic Media Framework to 11 Use Cases
2023 was the year the world woke up to generative AI, and 2024 is the year policymakers are responding more firmly. Importantly, this policy momentum is taking place alongside real world creation and distribution of synthetic media. Social media platforms, news organizations, dating apps, image generation companies, and more are already navigating a world of AI-generated visuals and sounds, already changing hearts and minds, as policymakers try to catch up. How, then, can AI governance capture the complexity of the synthetic media landscape? How can it attend to synthetic media's myriad uses, ranging from storytelling to privacy preservation, to deception, fraud, and defamation, taking into account the many stakeholders involved in its development, creation, and distribution? And what might it mean to govern synthetic media in a manner that upholds the truth while bolstering freedom of expression? What follows is the first known collection of diverse examples of the implementation of synthetic media governance that responds to these questions, specifically through Partnership on AI's (PAI) Responsible Practices for Synthetic Media - a voluntary, normative Framework for creating, distributing, and building technology for synthetic media responsibly, launched in February 2023. In this paper, we present a case bank of real world examples that help operationalize the Framework - highlighting areas synthetic media governance can be applied, augmented, expanded, and refined for use, in practice. Read together, the cases emphasize distinct elements of AI policymaking and seven emergent best practices supporting transparency, safety, expression, and digital dignity online: consent, disclosure, and differentiation between harmful and creative use cases.
comment: 18 pages, 2 figures, 1 table
♻ ☆ MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues. Controversial topics, including vaccination, abortion, racism, and sexual orientation, often elicit opinions and attitudes that are not solely based on evidence but rather reflect moral worldviews. Recent advances in Natural Language Processing (NLP) show that moral values can be gauged in human-generated textual content. Building on the Moral Foundations Theory (MFT), this paper introduces MoralBERT, a range of language representation models fine-tuned to capture moral sentiment in social discourse. We describe a framework for both aggregated and domain-adversarial training on multiple heterogeneous MFT human-annotated datasets sourced from Twitter (now X), Reddit, and Facebook that broaden textual content diversity in terms of social media audience interests, content presentation and style, and spreading patterns. We show that the proposed framework achieves an average F1 score that is between 11% and 32% higher than lexicon-based approaches, Word2Vec embeddings, and zero-shot classification with large language models such as GPT-4 for in-domain inference. Domain-adversarial training yields better out-of domain predictions than aggregate training while achieving comparable performance to zero-shot learning. Our approach contributes to annotation-free and effective morality learning, and provides useful insights towards a more comprehensive understanding of moral narratives in controversial social debates using NLP.
♻ ☆ Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.
♻ ☆ Are you still on track!? Catching LLM Task Drift with Activations
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources. These inputs, even in a single LLM interaction, can come from a variety of sources, of varying trustworthiness and provenance. This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions. We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations. We compare the LLM's activations before and after processing the external input in order to detect whether this input caused instruction drift. We develop two probing methods and find that simply using a linear classifier can detect drift with near perfect ROC AUC on an out-of-distribution test set. We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Our setup does not require any modification of the LLM (e.g., fine-tuning) or any text generation, thus maximizing deployability and cost efficiency and avoiding reliance on unreliable model output. To foster future research on activation-based task inspection, decoding, and interpretability, we will release our large-scale TaskTracker toolkit, comprising a dataset of over 500K instances, representations from 5 SoTA language models, and inspection tools.
♻ ☆ Debiasing surgeon: fantastic weights and how to find them
Nowadays an ever-growing concerning phenomenon, the emergence of algorithmic biases that can lead to unfair models, emerges. Several debiasing approaches have been proposed in the realm of deep learning, employing more or less sophisticated approaches to discourage these models from massively employing these biases. However, a question emerges: is this extra complexity really necessary? Is a vanilla-trained model already embodying some ``unbiased sub-networks'' that can be used in isolation and propose a solution without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists, and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such specific architecture is incapable of learning a specific bias, suggesting that there are possible architectural countermeasures to the problem of biases in deep neural networks.
♻ ☆ An Analysis of European Data and AI Regulations for Automotive Organizations
This report summarizes the European Union's series of data and AI regulations and analyzes them for managers in automotive vehicle manufacturing organizations. In particular, we highlight the relevant ideas of the regulations, including how they find their roots in earlier legislation, how they contradict and complement each other, as well as the business opportunities that these regulations offer. The structure of the report is as follows. First, we address the GDPR as the cornerstone against which the requirements of other regulations are weighed and legislated. Second, we explain the EU Data Act since it directly addresses Internet of Things (IoT) for businesses in the private sector and imposes strict requirements on large data generators such as vehicle manufacturers. For manufacturers, compliance with the EU Data Act is a prerequisite for the subsequent legislation, in particular the EU AI Act. Third, we explain the Data Governance Act, Digital Services Act, Digital Markets Act, and EU AI Act in chronological order. Overall, we characterize European Union data regulations as a wave set, rooted in historical precedent, with important implications for the automotive industry.
Computers and Society
☆ "We're not all construction workers": Algorithmic Compression of Latinidad on TikTok
The Latinx diaspora in the United States is a rapidly growing and complex demographic who face intersectional harms and marginalizations in sociotechnical systems and are currently underserved in CSCW research. While the field understands that algorithms and digital content are experienced differently by marginalized populations, more investigation is needed about how Latinx people experience social media and, in particular, visual media. In this paper, we focus on how Latinx people experience the algorithmic system of the video-sharing platform TikTok. Through a bilingual interview and visual elicitation study of 19 Latinx TikTok users and 59 survey participants, we explore how Latinx individuals experience TikTok and its Latinx content. We find Latinx TikTok users actively use platform affordances to create positive and affirming identity content feeds, but these feeds are interrupted by negative content (i.e. violence, stereotypes, linguistic assumptions) due to platform affordances that have unique consequences for Latinx diaspora users. We discuss these implications on Latinx identity and representation, introduce the concept of \textit{algorithmic identity compression}, where sociotechncial systems simplify, flatten, and conflate intersection identities, resulting in compression via the loss of critical cultural data deemed unnecessary by these systems and designers of them. This study explores how Latinx individuals are particularly vulnerable to this in sociotechnical systems, such as, but not limited to, TikTok.
☆ Report on the Conference on Ethical and Responsible Design in the National AI Institutes: A Summary of Challenges
In May 2023, the Georgia Tech Ethics, Technology, and Human Interaction Center organized the Conference on Ethical and Responsible Design in the National AI Institutes. Representatives from the National AI Research Institutes that had been established as of January 2023 were invited to attend; researchers representing 14 Institutes attended and participated. The conference focused on three questions: What are the main challenges that the National AI Institutes are facing with regard to the responsible design of AI systems? What are promising lines of inquiry to address these challenges? What are possible points of collaboration? Over the course of the conference, a revised version of the first question became a focal point: What are the challenges that the Institutes face in identifying ethical and responsible design practices and in implementing them in the AI development process? This document summarizes the challenges that representatives from the Institutes in attendance highlighted.
comment: 9 pages
☆ Social Capital and Persistence in Computer Science of Google's Computer Science Summer Institute (CSSI) Students
While a lucrative and growing field, low levels of gender and racial diversity in CS remain prevalent. Education and workforce support programs with the intention to promote underrepresented students' persistence in CS exist, which teach skills, inform of career options, and grow students' network in CS. Studies demonstrate these programs' effectiveness as it relates to changes in affective outcomes, such as participants' confidence in CS skills and attitudes towards CS jobs. However, programs' longitudinal impact on participants' build-up of social capital in CS, and the resulting social capital's influence on their persistence in CS, remain unexplored. Motivated by the literature that associates demographic identifiers with access to social capital, and students' access to developmental relationships and career resources (social capital) in CS with their persistence, this study explores a CS support program's impact on persistence through capital building. We focus on Google's CSSI, which provided graduating high school students with a 3-week-long introduction to CS. We use interviews with participants who are now 2-5 years out of the program to study CSSI's impact on their social capital and long-term CS persistence. Thematic analysis reveals three program elements that influenced students' build-up of social capital, and that the resulting persistence was realized through students' progress towards internships and goals for paying-it-forward in CS. These findings inform our recommendations that future support programs and educational settings consider mentorship centered on socioemotional support, opportunities for collaboration, and time for fun social activities. Additional suggestions center on engaging socially-oriented individuals with CS support programs. These insights inform CS educators on design choices that can encourage the persistence of underrepresented students in CS.
comment: Presented at the 2024 Annual Conference of the American Society for Engineering Education
☆ Shaded Route Planning Using Active Segmentation and Identification of Satellite Images CIKM24
Heatwaves pose significant health risks, particularly due to prolonged exposure to high summer temperatures. Vulnerable groups, especially pedestrians and cyclists on sun-exposed sidewalks, motivate the development of a route planning method that incorporates somatosensory temperature effects through shade ratio consideration. This paper is the first to introduce a pipeline that utilizes segmentation foundation models to extract shaded areas from high-resolution satellite images. These areas are then integrated into a multi-layered road map, enabling users to customize routes based on a balance between distance and shade exposure, thereby enhancing comfort and health during outdoor activities. Specifically, we construct a graph-based representation of the road map, where links indicate connectivity and are updated with shade ratio data for dynamic route planning. This system is already implemented online, with a video demonstration, and will be specifically adapted to assist travelers during the 2024 Olympic Games in Paris.
comment: Paper accepted to CIKM24 demo track
☆ Decentralised Governance for Autonomous Cyber-Physical Systems
This paper examines the potential for Cyber-Physical Systems (CPS) to be governed in a decentralised manner, whereby blockchain-based infrastructure facilitates the communication between digital and physical domains through self-governing and self-organising principles. Decentralised governance paradigms that integrate computation in physical domains (such as 'Decentralised Autonomous Organisations' (DAOs)) represent a novel approach to autono-mous governance and operations. These have been described as akin to cybernetic systems. Through the lens of a case study of an autonomous cabin called "no1s1" which demonstrates self-ownership via blockchain-based control and feedback loops, this research explores the potential for blockchain infrastructure to be utilised in the management of physical systems. By highlighting the considerations and challenges of decentralised governance in managing autonomous physical spaces, the study reveals that autonomy in the governance of autonomous CPS is not merely a technological feat but also involves a complex mesh of functional and social dynamics. These findings underscore the importance of developing continuous feedback loops and adaptive governance frameworks within decentralised CPS to address both expected and emergent challenges. This investigation contributes to the fields of infra-structure studies and Cyber-Physical Systems engineering. It also contributes to the discourse on decentralised governance and autonomous management of physical spaces by offering both practical insights and providing a framework for future research.
☆ A Security Assessment tool for Quantum Threat Analysis
The rapid advancement of quantum computing poses a significant threat to many current security algorithms used for secure communication, digital authentication, and information encryption. A sufficiently powerful quantum computer could potentially exploit vulnerabilities in these algorithms, rendering data in transit insecure. This threat is expected to materialize within the next 20 years. Immediate transition to quantum-resilient cryptographic schemes is crucial, primarily to mitigate store-now-decrypt-later attacks and to ensure the security of products with decade-long operational lives. This transition requires a systematic approach to identifying and upgrading vulnerable cryptographic implementations. This work developed a quantum assessment tool for organizations, providing tailored recommendations for transitioning their security protocols into a post-quantum world. The work included a systematic evaluation of the proposed solution using qualitative feedback from network administrators and cybersecurity experts. This feedback was used to refine the accuracy and usability of the assessment process. The results demonstrate its effectiveness and usefulness in helping organizations prepare for quantum computing threats. The assessment tool is publicly available at (https://quantum-watch.soton.ac.uk).
☆ Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law
The training process of foundation models as for other classes of deep learning systems is based on minimizing the reconstruction error over a training set. For this reason, they are susceptible to the memorization and subsequent reproduction of training samples. In this paper, we introduce a training-as-compressing perspective, wherein the model's weights embody a compressed representation of the training data. From a copyright standpoint, this point of view implies that the weights could be considered a reproduction or a derivative work of a potentially protected set of works. We investigate the technical and legal challenges that emerge from this framing of the copyright of outputs generated by foundation models, including their implications for practitioners and researchers. We demonstrate that adopting an information-centric approach to the problem presents a promising pathway for tackling these emerging complex legal issues.
comment: Accepted for spotlight presentation at GenLaw'24, see https://www.genlaw.org/2024-icml-papers#training-foundation-models-as-data-compression-on-information-model-weights-and-copyright-law
☆ Personal Data Transfers to Non-EEA Domains: A Tool for Citizens and An Analysis on Italian Public Administration Websites
Six years after the entry into force of the GDPR, European companies and organizations still have difficulties complying with it: the amount of fines issued by the European data protection authorities is continuously increasing. Personal data transfers are no exception. In this work we analyse the personal data transfers from more than 20000 Italian Public Administration (PA) entities to third countries. We developed "Minos", a user-friendly application which allows to navigate the web while recording HTTP requests. Then, we used the back-end of Minos to automate the analysis. We found that about 14% of the PAs websites transferred data out of the European Economic Area (EEA). This number is an underestimation because only visits to the home pages were object of the analysis. The top 3 destinations of the data transfers are Amazon, Google and Fonticons, accounting for about the 70% of the bad requests. The most recurrent services which are the object of the requests are cloud computing services and content delivery networks (CDNs). Our results highlight that, in Italy, a relevant portion of public administrations websites transfers personal data to non EEA countries. In terms of technology policy, these results stress the need for further incentives to improve the PA digital infrastructures. Finally, while working on refinements of Minos, the version here described is openly available on Zenodo: it can be helpful to a variety of actors (citizens, researchers, activists, policy makers) to increase awareness and enlarge the investigation.
comment: International Conference on Information Technology for Social Good (GoodIT '24), September 4-6, 2024, Bremen, Germany
☆ Empirical Analysis of Sri Lankan Mobile Health Ecosystem: A Precursor to an Effective Stakeholder Engagement
Sri Lanka recently passed its first privacy legislation covering a wide range of sectors, including health. As a precursor for effective stakeholder engagement in the health domain to understand the most effective way to implement legislation in healthcare, we have analyzed 41 popular mobile apps and web portals. We found that 78% of the tested systems have third-party domains receiving sensitive health data with minimal visibility to the consumers. We discuss how this will create potential issues in preparing for the new privacy legislation.
☆ Generative AI and the problem of existential risk
Ever since the launch of ChatGPT, Generative AI has been a focal point for concerns about AI's perceived existential risk. Once a niche topic in AI research and philosophy, AI safety and existential risk has now entered mainstream debate among policy makers and leading foundation models developers, much to the chagrin of those who see it as a distraction from addressing more pressing nearer-term harms. This chapter aims to demystify the debate by highlighting the key worries that underpin existential risk fears in relation to generative AI, and spotlighting the key actions that governments and industry are taking thus far to helping address them.
comment: forthcoming in: Philipp Hacker, Andreas Engel, Sarah Hammer and Brent Mittelstadt (eds.), Oxford Handbook on the Foundations and Regulation of Generative AI (Oxford University Press, 2024) This version: 30 May 2024
☆ Evaluating Large Language Models for Anxiety and Depression Classification using Counseling and Psychotherapy Transcripts
We aim to evaluate the efficacy of traditional machine learning and large language models (LLMs) in classifying anxiety and depression from long conversational transcripts. We fine-tune both established transformer models (BERT, RoBERTa, Longformer) and more recent large models (Mistral-7B), trained a Support Vector Machine with feature engineering, and assessed GPT models through prompting. We observe that state-of-the-art models fail to enhance classification outcomes compared to traditional machine learning methods.
☆ How to quantify an examination? Evidence from physics examinations via complex networks
Given the untapped potential for continuous improvement of examinations, quantitative investigations of examinations could guide efforts to considerably improve learning efficiency and evaluation and thus greatly help both learners and educators. However, there is a general lack of quantitative methods for investigating examinations. To address this gap, we propose a new metric via complex networks; i.e., the knowledge point network (KPN) of an examination is constructed by representing the knowledge points (concepts, laws, etc.) as nodes and adding links when these points appear in the same question. Then, the topological quantities of KPNs, such as degree, centrality, and community, can be employed to systematically explore the structural properties and evolution of examinations. In this work, 35 physics examinations from the NCEE examination spanning from 2006 to 2020 were investigated as an evidence. We found that the constructed KPNs are scale-free networks that show strong assortativity and small-world effects in most cases. The communities within the KPNs are obvious, and the key nodes are mainly related to mechanics and electromagnetism. Different question types are related to specific knowledge points, leading to noticeable structural variations in KPNs. Moreover, changes in the KPN topology between examinations administered in different years may offer insights guiding college entrance examination reforms. Based on topological quantities such as the average degree, network density, average clustering coefficient, and network transitivity, the Fd is proposed to evaluate examination difficulty. All the above results show that our approach can comprehensively quantify the knowledge structures and examination characteristics. These networks may elucidate comprehensive examination knowledge graphs for educators and guide improvements in teaching.
☆ Reimagining Communities through Transnational Bengali Decolonial Discourse with YouTube Content Creators SC
Colonialism--the policies and practices wherein a foreign body imposes its ways of life on local communities--has historically impacted how collectives perceive themselves in relation to others. One way colonialism has impacted how people see themselves is through nationalism, where nationalism is often understood through shared language, culture, religion, and geopolitical borders. The way colonialism has shaped people's experiences with nationalism has shaped historical conflicts between members of different nation-states for a long time. While recent social computing research has studied how colonially marginalized people can engage in discourse to decolonize or re-imagine and reclaim themselves and their communities on their own terms--what is less understood is how technology can better support decolonial discourses in an effort to re-imagine nationalism. To understand this phenomenon, this research draws on a semi-structured interview study with YouTubers who make videos about culturally Bengali people whose lives were upended as a product of colonization and are now dispersed across Bangladesh, India, and Pakistan. This research seeks to understand people's motivations and strategies for engaging in video-mediated decolonial discourse in transnational contexts. We discuss how our work demonstrates the potential of the sociomateriality of decolonial discourse online and extends an invitation to foreground complexities of nationalism in social computing research.
comment: accepted at CSCW 2024
☆ SOMONITOR: Explainable Marketing Data Processing and Analysis with Large Language Models
Online marketing faces formidable challenges in managing and interpreting immense volumes of data necessary for competitor analysis, content research, and strategic branding. It is impossible to review hundreds to thousands of transient online content items by hand, and partial analysis often leads to suboptimal outcomes and poorly performing campaigns. We introduce an explainable AI framework SoMonitor that aims to synergize human intuition with AI-based efficiency, helping marketers across all stages of the marketing funnel, from strategic planning to content creation and campaign execution. SoMonitor incorporates a CTR prediction and ranking model for advertising content and uses large language models (LLMs) to process high-performing competitor content, identifying core content pillars such as target audiences, customer needs, and product features. These pillars are then organized into broader categories, including communication themes and targeted customer personas. By integrating these insights with data from the brand's own advertising campaigns, SoMonitor constructs a narrative for addressing new customer personas and simultaneously generates detailed content briefs in the form of user stories that can be directly applied by marketing teams to streamline content production and campaign execution. The adoption of SoMonitor in daily operations allows digital marketers to quickly parse through extensive datasets, offering actionable insights that significantly enhance campaign effectiveness and overall job satisfaction
♻ ☆ The Files are in the Computer: On Copyright, Memorization, and Generative AI
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (user intentionally causes a model to generate a near-exact copy), from "regurgitation" (model generates a near-exact copy, regardless of user intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model--to quote Zoolander, the files are in the computer.
comment: Forthcoming, Chicago-Kent Law Review
♻ ☆ Mapping the Empirical Evidence of the GDPR (In-)Effectiveness: A Systematic Review
In the realm of data protection, a striking disconnect prevails between traditional domains of doctrinal, legal, theoretical, and policy-based inquiries and a burgeoning body of empirical evidence. Much of the scholarly and regulatory discourse remains entrenched in abstract legal principles or normative frameworks, leaving the empirical landscape uncharted or minimally engaged. Since the birth of EU data protection law, a modest body of empirical evidence has been generated but remains widely scattered and unexamined. Such evidence offers vital insights into the perception, impact, clarity, and effects of data protection measures but languishes on the periphery, inadequately integrated into the broader conversation. To make a meaningful connection, we conduct a comprehensive review and synthesis of empirical research spanning nearly three decades (1995- March 2022), advocating for a more robust integration of empirical evidence into the evaluation and review of the GDPR, while laying a methodological foundation for future empirical research.
♻ ☆ The Impact of School and Family Networks on COVID-19 Infections Among Dutch Students: A Study Using Population-Level Registry Data
Understanding the impact of different social interactions is key to improving epidemic models. Here, we use extensive registry data -- including PCR test results and population-level networks -- to investigate the impact of school, family, and other social contacts on SARS-CoV-2 transmission in the Netherlands (June 2020--October 2021). We isolate and compare different contexts of potential SARS-CoV-2 transmission by matching pairs of students based on their attendance at the same or different primary school (in 2020) and secondary school (in 2021) and their geographic proximity. We then calculated the probability of temporally associated infections -- i.e. the probability of both students testing positive within a 14-day period. Our results highlight the relative importance of household and family transmission in the spread of SARS-CoV-2 compared to school settings. The probability of temporally associated infections for siblings and parent-child pairs living in the same household was 22.6--23.2\%, and 4.7--7.9\% for family members living in different household. In contrast, the probability of temporally associated infections was 0.52\% for pairs of students living nearby but not attending the same primary or secondary school, 0.66\% for pairs attending different secondary schools but having attended the same primary school, and 1.65\% for pairs attending the same secondary school. Finally, we used multilevel regression analyses to examine how individual, school, and geographic factors contribute to transmission risk. We found that the largest differences in transmission probabilities were due to unobserved individual (60\%) and school-level (34\%) factors. Only a small proportion (3\%) could be attributed to geographic proximity of students or to school size, denomination, or the median income of the school area.
♻ ☆ FairEM360: A Suite for Responsible Entity Matching
Entity matching is one the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced by the matcher at this stage can contribute to promoting fairness in downstream tasks. This demonstration showcases FairEM360, a framework for 1) auditing the output of entity matchers across a wide range of fairness measures and paradigms, 2) providing potential explanations for the underlying reasons for unfairness, and 3) providing resolutions for the unfairness issues through an exploratory process with human-in-the-loop feedback, utilizing an ensemble of matchers. We aspire for FairEM360 to contribute to the prioritization of fairness as a key consideration in the evaluation of EM pipelines.
♻ ☆ Auditing for Racial Discrimination in the Delivery of Education Ads
Digital ads on social-media platforms play an important role in shaping access to economic opportunities. Our work proposes and implements a new third-party auditing method that can evaluate racial bias in the delivery of ads for education opportunities. Third-party auditing is important because it allows external parties to demonstrate presence or absence of bias in social-media algorithms. Education is a domain with legal protections against discrimination and concerns of racial-targeting, but bias induced by ad delivery algorithms has not been previously explored in this domain. Prior audits demonstrated discrimination in platforms' delivery of ads to users for housing and employment ads. These audit findings supported legal action that prompted Meta to change their ad-delivery algorithms to reduce bias, but only in the domains of housing, employment, and credit. In this work, we propose a new methodology that allows us to measure racial discrimination in a platform's ad delivery algorithms for education ads. We apply our method to Meta using ads for real schools and observe the results of delivery. We find evidence of racial discrimination in Meta's algorithmic delivery of ads for education opportunities, posing legal and ethical concerns. Our results extend evidence of algorithmic discrimination to the education domain, showing that current bias mitigation mechanisms are narrow in scope, and suggesting a broader role for third-party auditing of social media in areas where ensuring non-discrimination is important.
comment: Published in ACM Conference on Fairness, Accountability, and Transparency 2024 (ACM FAccT 2024)
♻ ☆ Charting the Landscape of Nefarious Uses of Generative Artificial Intelligence for Online Election Interference
Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) pose significant risks, particularly in the realm of online election interference. This paper explores the nefarious applications of GenAI, highlighting their potential to disrupt democratic processes through deepfakes, botnets, targeted misinformation campaigns, and synthetic identities.
comment: The 2024 Election Integrity Initiative: HUMANS Lab - Working Paper No. 2024.1
♻ ☆ Beyond Incompatibility: Trade-offs between Mutually Exclusive Fairness Criteria in Machine Learning and Law
Fair and trustworthy AI is becoming ever more important in both machine learning and legal domains. One important consequence is that decision makers must seek to guarantee a 'fair', i.e., non-discriminatory, algorithmic decision procedure. However, there are several competing notions of algorithmic fairness that have been shown to be mutually incompatible under realistic factual assumptions. This concerns, for example, the widely used fairness measures of 'calibration within groups' and 'balance for the positive/negative class'. In this paper, we present a novel algorithm (FAir Interpolation Method: FAIM) for continuously interpolating between these three fairness criteria. Thus, an initially unfair prediction can be remedied to, at least partially, meet a desired, weighted combination of the respective fairness conditions. We demonstrate the effectiveness of our algorithm when applied to synthetic data, the COMPAS data set, and a new, real-world data set from the e-commerce sector. Finally, we discuss to what extent FAIM can be harnessed to comply with conflicting legal obligations. The analysis suggests that it may operationalize duties in traditional legal fields, such as credit scoring and criminal justice proceedings, but also for the latest AI regulations put forth in the EU, like the Digital Markets Act and the recently enacted AI Act.
comment: Recent changes (July 2024): Changes driven by peer review comments - Significant cosmetic updates to figures 1, 2 and 4 to make them easier to understand, explanatory changes aimed to bring more clarity to the motivation, theory, and results, no changes to any underlying results; v4 accidentally contained spurious blue highlighting only meant for reviewers, this has now been removed in v5
♻ ☆ A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles
The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at https://dx.doi.org/10.21227/40s8-xf63. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.
comment: 19 pages
♻ ☆ PersLLM: A Personified Training Approach for Large Language Models
Large language models exhibit aspects of human-level intelligence that catalyze their application as human-like agents in domains such as social simulations, human-machine interactions, and collaborative multi-agent systems. However, the absence of distinct personalities, such as displaying ingratiating behaviors, inconsistent opinions, and uniform response patterns, diminish LLMs utility in practical applications. Addressing this, the development of personality traits in LLMs emerges as a crucial area of research to unlock their latent potential. Existing methods to personify LLMs generally involve strategies like employing stylized training data for instruction tuning or using prompt engineering to simulate different personalities. These methods only capture superficial linguistic styles instead of the core of personalities and are therefore not stable. In this study, we propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development, into a comprehensive training methodology. We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality. Single-agent evaluation validates our method's superiority, as it produces responses more aligned with reference personalities compared to other approaches. Case studies for multi-agent communication highlight its benefits in enhancing opinion consistency within individual agents and fostering collaborative creativity among multiple agents in dialogue contexts, potentially benefiting human simulation and multi-agent cooperation. Additionally, human-agent interaction evaluations indicate that our personified models significantly enhance interactive experiences, underscoring the practical implications of our research.
comment: 10 pages for main text, 5 figures
Computers and Society
☆ Socially Assistive Robot in Sexual Health: Group and Individual Student-Robot Interaction Activities Promoting Disclosure, Learning and Positive Attitudes
Comprehensive sex education (SE) is crucial in promoting sexual health and responsible behavior among students, particularly in elementary schools. Despite its significance, teaching SE can be challenging due to students' attitudes, shyness, and emotional barriers. Socially assistive robots (SARs) sometimes are perceived as more trustworthy than humans, based on research showing that they are not anticipated as judgmental. Inspired by those evidences, this study aims to assess the success of a SAR as a facilitator for SE lessons for elementary school students. This study conducted two experiments to assess the effectiveness of a SAR in facilitating SE education for elementary school students. We conducted two experiments, a) a group activity in the school classroom where the Nao robot gave a SE lecture, and we evaluated how much information the students acquired from the lecture, and b) an individual activity where the students interacted 1:1 with the robot, and we evaluated their attitudes towards the subject of SE, and if they felt comfortable to ask SE related questions to the robot. Data collected from pre- and post-questionnaires, as well as video annotations, revealed that the SAR significantly improved students' attitudes toward SE. Furthermore, students were more open to asking SE-related questions to the robot than their human teacher. The study emphasized specific SAR characteristics, such as embodiment and non-judgmental behavior, as key factors contributing to their effectiveness in supporting SE education, paving the way for innovative and effective approaches to sexual education in schools.
☆ The Foundation Model Transparency Index v1.1: May 2024
Foundation models are increasingly consequential yet extremely opaque. To characterize the status quo, the Foundation Model Transparency Index was launched in October 2023 to measure the transparency of leading foundation model developers. The October 2023 Index (v1.0) assessed 10 major foundation model developers (e.g. OpenAI, Google) on 100 transparency indicators (e.g. does the developer disclose the wages it pays for data labor?). At the time, developers publicly disclosed very limited information with the average score being 37 out of 100. To understand how the status quo has changed, we conduct a follow-up study (v1.1) after 6 months: we score 14 developers against the same 100 indicators. While in v1.0 we searched for publicly available information, in v1.1 developers submit reports on the 100 transparency indicators, potentially including information that was not previously public. We find that developers now score 58 out of 100 on average, a 21 point improvement over v1.0. Much of this increase is driven by developers disclosing information during the v1.1 process: on average, developers disclosed information related to 16.6 indicators that was not previously public. We observe regions of sustained (i.e. across v1.0 and v1.1) and systemic (i.e. across most or all developers) opacity such as on copyright status, data access, data labor, and downstream impact. We publish transparency reports for each developer that consolidate information disclosures: these reports are based on the information disclosed to us via developers. Our findings demonstrate that transparency can be improved in this nascent ecosystem, the Foundation Model Transparency Index likely contributes to these improvements, and policymakers should consider interventions in areas where transparency has not improved.
comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/fmti
☆ The Role of Network and Identity in the Diffusion of Hashtags
Although the spread of behaviors is influenced by many social factors, existing literature tends to study the effects of single factors -- most often, properties of the social network -- on the final cascade. In order to move towards a more integrated view of cascades, this paper offers the first comprehensive investigation into the role of two social factors in the diffusion of 1,337 popular hashtags representing the production of novel culture on Twitter: 1) the topology of the Twitter social network and 2) performance of each user's probable demographic identity. Here, we show that cascades are best modeled using a combination of network and identity, rather than either factor alone. This combined model best reproduces a composite index of ten cascade properties across all 1,337 hashtags. However, there is important heterogeneity in what social factors are required to reproduce different properties of hashtag cascades. For instance, while a combined network+identity model best predicts the popularity of cascades, a network-only model has better performance in predicting cascade growth and an identity-only model in adopter composition. We are able to predict what type of hashtag is best modeled by each combination of features and use this to further improve performance. Additionally, consistent with prior literature on the combined network+identity model most outperforms the single-factor counterfactuals among hashtags used for expressing racial or regional identity, stance-taking, talking about sports, or variants of existing cultural trends with very slow- or fast-growing communicative need. In sum, our results imply the utility of multi-factor models in predicting cascades, in order to account for the varied ways in which network, identity, and other social factors play a role in the diffusion of hashtags on Twitter.
☆ The Future of Learning: Large Language Models through the Lens of Students
As Large-Scale Language Models (LLMs) continue to evolve, they demonstrate significant enhancements in performance and an expansion of functionalities, impacting various domains, including education. In this study, we conducted interviews with 14 students to explore their everyday interactions with ChatGPT. Our preliminary findings reveal that students grapple with the dilemma of utilizing ChatGPT's efficiency for learning and information seeking, while simultaneously experiencing a crisis of trust and ethical concerns regarding the outcomes and broader impacts of ChatGPT. The students perceive ChatGPT as being more "human-like" compared to traditional AI. This dilemma, characterized by mixed emotions, inconsistent behaviors, and an overall positive attitude towards ChatGPT, underscores its potential for beneficial applications in education and learning. However, we argue that despite its human-like qualities, the advanced capabilities of such intelligence might lead to adverse consequences. Therefore, it's imperative to approach its application cautiously and strive to mitigate potential harms in future developments.
☆ A Survey of Scam Exposure, Victimization, Types, Vectors, and Reporting in 12 Countries
Scams are a widespread issue with severe consequences for both victims and perpetrators, but existing data collection is fragmented, precluding global and comparative local understanding. The present study addresses this gap through a nationally representative survey (n = 8,369) on scam exposure, victimization, types, vectors, and reporting in 12 countries: Belgium, Egypt, France, Hungary, Indonesia, Mexico, Romania, Slovakia, South Africa, South Korea, Sweden, and the United Kingdom. We analyze 6 survey questions to build a detailed quantitative picture of the scams landscape in each country, and compare across countries to identify global patterns. We find, first, that residents of less affluent countries suffer financial loss from scams more often. Second, we find that the internet plays a key role in scams across the globe, and that GNI per-capita is strongly associated with specific scam types and contact vectors. Third, we find widespread under-reporting, with residents of less affluent countries being less likely to know how to report a scam. Our findings contribute valuable insights for researchers, practitioners, and policymakers in the online fraud and scam prevention space.
comment: To appear in the Journal of Online Trust and Safety
☆ Towards Understanding Unsafe Video Generation
Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output. It is important to understand their potential to produce unsafe content, such as violent or terrifying videos. In this work, we provide a comprehensive understanding of unsafe video generation. First, to confirm the possibility that these models could indeed generate unsafe videos, we choose unsafe content generation prompts collected from 4chan and Lexica, and three open-source SOTA VGMs to generate unsafe videos. After filtering out duplicates and poorly generated content, we created an initial set of 2112 unsafe videos from an original pool of 5607 videos. Through clustering and thematic coding analysis of these generated videos, we identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political. With IRB approval, we then recruit online participants to help label the generated videos. Based on the annotations submitted by 403 participants, we identified 937 unsafe videos from the initial video set. With the labeled information and the corresponding prompts, we created the first dataset of unsafe videos generated by VGMs. We then study possible defense mechanisms to prevent the generation of unsafe videos. Existing defense methods in image generation focus on filtering either input prompt or output results. We propose a new approach called Latent Variable Defense (LVD), which works within the model's internal sampling process. LVD can achieve 0.90 defense accuracy while reducing time and computing resources by 10x when sampling a large number of unsafe prompts.
comment: 18 pages
☆ Conspiracy theories and where to find them on TikTok
TikTok has skyrocketed in popularity over recent years, especially among younger audiences, thanks to its viral trends and social challenges. However, concerns have been raised about the potential of this platform to promote and amplify online harmful and dangerous content. Leveraging the official TikTok Research API and collecting a longitudinal dataset of 1.5M videos shared in the US over a period of 3 years, our study analyzes the presence of videos promoting conspiracy theories, providing a lower-bound estimate of their prevalence (approximately 0.1% of all videos) and assessing the effects of the new Creator Program, which provides new ways for creators to monetize, on the supply of conspiratorial content. We evaluate the capabilities of state-of-the-art open Large Language Models to identify conspiracy theories after extracting audio transcriptions of videos, finding that they can detect harmful content with high precision but with overall performance comparable to fine-tuned traditional language models such as RoBERTa. Our findings are instrumental for content moderation strategies that aim to understand and mitigate the spread of harmful content on rapidly evolving social media platforms like TikTok.
☆ What's Distributive Justice Got to Do with It? Rethinking Algorithmic Fairness from the Perspective of Approximate Justice
In the field of algorithmic fairness, many fairness criteria have been proposed. Oftentimes, their proposal is only accompanied by a loose link to ideas from moral philosophy -- which makes it difficult to understand when the proposed criteria should be used to evaluate the fairness of a decision-making system. More recently, researchers have thus retroactively tried to tie existing fairness criteria to philosophical concepts. Group fairness criteria have typically been linked to egalitarianism, a theory of distributive justice. This makes it tempting to believe that fairness criteria mathematically represent ideals of distributive justice and this is indeed how they are typically portrayed. In this paper, we will discuss why the current approach of linking algorithmic fairness and distributive justice is too simplistic and, hence, insufficient. We argue that in the context of imperfect decision-making systems -- which is what we deal with in algorithmic fairness -- we should not only care about what the ideal distribution of benefits/harms among individuals would look like but also about how deviations from said ideal are distributed. Our claim is that algorithmic fairness is concerned with unfairness in these deviations. This requires us to rethink the way in which we, as algorithmic fairness researchers, view distributive justice and use fairness criteria.
☆ Characterization of Political Polarized Users Attacked by Language Toxicity on Twitter SC
Understanding the dynamics of language toxicity on social media is important for us to investigate the propagation of misinformation and the development of echo chambers for political scenarios such as U.S. presidential elections. Recent research has used large-scale data to investigate the dynamics across social media platforms. However, research on the toxicity dynamics is not enough. This study aims to provide a first exploration of the potential language toxicity flow among Left, Right and Center users. Specifically, we aim to examine whether Left users were easier to be attacked by language toxicity. In this study, more than 500M Twitter posts were examined. It was discovered that Left users received much more toxic replies than Right and Center users.
comment: This work has been accepted by 2024 Conference on Computer Supported Cooperative Work and Social Computing (CSCW2024). Association for Computing Machinery (ACM), New York, NY, USA
☆ Across Platforms and Languages: Dutch Influencers and Legal Disclosures on Instagram, YouTube and TikTok
Content monetization on social media fuels a growing influencer economy. Influencer marketing remains largely undisclosed or inappropriately disclosed on social media. Non-disclosure issues have become a priority for national and supranational authorities worldwide, who are starting to impose increasingly harsher sanctions on them. This paper proposes a transparent methodology for measuring whether and how influencers comply with disclosures based on legal standards. We introduce a novel distinction between disclosures that are legally sufficient (green) and legally insufficient (yellow). We apply this methodology to an original dataset reflecting the content of 150 Dutch influencers publicly registered with the Dutch Media Authority based on recently introduced registration obligations. The dataset consists of 292,315 posts and is multi-language (English and Dutch) and cross-platform (Instagram, YouTube and TikTok). We find that influencer marketing remains generally underdisclosed on social media, and that bigger influencers are not necessarily more compliant with disclosure standards.
comment: Accept for publication at the 16th International Conference on Advances in Social Networks Analysis and Mining -ASONAM-2024
☆ A Comprehensive Sustainable Framework for Machine Learning and Artificial Intelligence ECAI 24
In financial applications, regulations or best practices often lead to specific requirements in machine learning relating to four key pillars: fairness, privacy, interpretability and greenhouse gas emissions. These all sit in the broader context of sustainability in AI, an emerging practical AI topic. However, although these pillars have been individually addressed by past literature, none of these works have considered all the pillars. There are inherent trade-offs between each of the pillars (for example, accuracy vs fairness or accuracy vs privacy), making it even more important to consider them together. This paper outlines a new framework for Sustainable Machine Learning and proposes FPIG, a general AI pipeline that allows for these critical topics to be considered simultaneously to learn the trade-offs between the pillars better. Based on the FPIG framework, we propose a meta-learning algorithm to estimate the four key pillars given a dataset summary, model architecture, and hyperparameters before model training. This algorithm allows users to select the optimal model architecture for a given dataset and a given set of user requirements on the pillars. We illustrate the trade-offs under the FPIG model on three classical datasets and demonstrate the meta-learning approach with an example of real-world datasets and models with different interpretability, showcasing how it can aid model selection.
comment: 8 pages, 3 figures, 4 tables, ECAI 24'
☆ Are Educational Escape Rooms More Effective Than Traditional Lectures for Teaching Software Engineering? A Randomized Controlled Trial
Contribution: This article analyzes the learning effectiveness of a virtual educational escape room for teaching software engineering and compares this activity with traditional teaching through a randomized controlled trial. Background: Educational escape rooms have been used across a wide variety of disciplines at all levels of education and they are becoming increasingly popular among teachers. Nevertheless, there is a clear general need for more robust empirical evidence on the learning effectiveness of these novel activities and, particularly, on their application in software engineering education. Research Questions: Is game-based learning using educational escape rooms more effective than traditional lectures for teaching software engineering? What are the perceptions of software engineering students toward game-based learning using educational escape rooms? Methodology: The study presented in this article is a randomized controlled trial with a pre-and post-test design that was completed by a total of 326 software engineering students. The 164 students belonging to the experimental group learned software modeling by playing an educational escape room whereas the 162 students belonging to the control group learned the same subject matter through a traditional lecture. Findings: The results of the randomized controlled trial show that the students who learned software modeling through the educational escape room had very positive perceptions toward this activity, significantly increased their knowledge, and outperformed those students who learned through a traditional lecture in terms of knowledge acquisition.
comment: 9 pages, 1 figure. Journal article published in IEEE Transactions on Education
☆ The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Personality psychologists have analyzed the relationship between personality and safety behaviors in human society. Although Large Language Models (LLMs) demonstrate personality traits, the relationship between personality traits and safety abilities in LLMs still remains a mystery. In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale. Meanwhile, the safety alignment generally increases various LLMs' Extraversion, Sensing, and Judging traits. According to such findings, we can edit LLMs' personality traits and improve their safety performance, e.g., inducing personality from ISTJ to ISTP resulted in a relative improvement of approximately 43% and 10% in privacy and fairness performance, respectively. Additionally, we find that LLMs with different personality traits are differentially susceptible to jailbreak. This study pioneers the investigation of LLM safety from a personality perspective, providing new insights into LLM safety enhancement.
☆ Questionable practices in machine learning
Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 43 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.
♻ ☆ Deep Generative Attacks and Countermeasures for Data-Driven Offline Signature Verification
This study investigates the vulnerabilities of data-driven offline signature verification (DASV) systems to generative attacks and proposes robust countermeasures. Specifically, we explore the efficacy of Variational Autoencoders (VAEs) and Conditional Generative Adversarial Networks (CGANs) in creating deceptive signatures that challenge DASV systems. Using the Structural Similarity Index (SSIM) to evaluate the quality of forged signatures, we assess their impact on DASV systems built with Xception, ResNet152V2, and DenseNet201 architectures. Initial results showed False Accept Rates (FARs) ranging from 0% to 5.47% across all models and datasets. However, exposure to synthetic signatures significantly increased FARs, with rates ranging from 19.12% to 61.64%. The proposed countermeasure, i.e., retraining the models with real + synthetic datasets, was very effective, reducing FARs between 0% and 0.99%. These findings emphasize the necessity of investigating vulnerabilities in security systems like DASV and reinforce the role of generative methods in enhancing the security of data-driven systems.
comment: Ten pages, 6 figures, 1 table, Signature verification, Deep generative models, attacks, generative attack explainability, data-driven verification system
♻ ☆ Collaborative Design for Job-Seekers with Autism: A Conceptual Framework for Future Research
The success of employment is highly related to a job seeker's capability of communicating and collaborating with others. While leveraging one's network during the job-seeking process is intuitive to the neurotypical, this can be challenging for people with autism. Recent empirical findings have started to show how facilitating collaboration between people with autism and their social surroundings through new design can improve their chances of employment. This work aims to provide actionable guidelines and conceptual frameworks that future researchers and practitioners can apply to improve collaborative design for job-seekers with autism. Built upon the literature on past technological interventions built for supporting job-seekers with autism, we define three major research challenges of (1) communication support, (2) employment stage-wise support, and (3) group work support. For each challenge, we review the current state-of-the-art practices and possible future solutions. We then suggest future designs that can provide breakthroughs from the interdisciplinary lens of human-AI collaboration, health services, group work, accessibility computing, and natural language processing.
♻ ☆ TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models ECCV 2024
Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such a model's ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. In addition, we complement quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.
comment: Accepted to ECCV 2024. Code and data available at https://tibet-ai.github.io
Computers and Society
GPT-4V Cannot Generate Radiology Reports Yet
GPT-4V's purported strong multimodal abilities raise interests in using it to automate radiology report writing, but there lacks thorough evaluations. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
comment: 24 pages, 3 figures, code: https://github.com/YuyangJ0/GPT-4V-evaluation-radiology-report
☆ Physical partisan proximity outweighs online ties in predicting US voting outcomes
Affective polarization and increasing social divisions affect social mixing and the spread of information across online and physical spaces, reinforcing social and electoral cleavages and influencing political outcomes. Here, using aggregated and de-identified co-location and online network data, we investigate the relationship between partisan exposure and voting patterns in the USA by comparing three dimensions of partisan exposure: physical proximity and exposure to the same social contexts, online social ties, and residential sorting. By leveraging various statistical modeling approaches, we consistently find that partisan exposure in the physical space, as captured by co-location patterns, more accurately predicts electoral outcomes in US counties, outperforming online and residential exposures across metropolitan and non-metro areas. Moreover, our results show that physical partisan proximity is the best predictor of voting patterns in swing counties, where the election results are most uncertain. We also estimate county-level experienced partisan segregation and examine its relationship with individuals' demographic and socioeconomic characteristics. Focusing on metropolitan areas, our results confirm the presence of extensive partisan segregation in the US and show that offline partisan isolation, both considering physical encounters or residential sorting, is higher than online segregation and is primarily associated with educational attainment. Our findings emphasize the importance of physical space in understanding the relationship between social networks and political behavior, in contrast to the intense scrutiny focused on online social networks and elections.
comment: 13 pages, 4 figures, SI Appendix: 26 pages
☆ Gaming and Blockchain: Hype and Reality
This paper explores the adoption of blockchain technology in the gaming industry. While supporters affirm that distributed ledger technology has potential to revolutionize gaming economies and provide players with control over their virtual assets, there are practical challenges such as energy consumption and user adoption to be addressed, and detractors question whether blockchain integration is even necessary. This report characterises popular blockchain-based gaming projects like Enjin and Axie Infinity, then compares metrics such as transaction cost and player feedback to evaluate the longevity of blockchain-integrated gaming as a whole.
☆ A Benchmark for Fairness-Aware Graph Learning
Fairness-aware graph learning has gained increasing attention in recent years. Nevertheless, there lacks a comprehensive benchmark to evaluate and compare different fairness-aware graph learning methods, which blocks practitioners from choosing appropriate ones for broader real-world applications. In this paper, we present an extensive benchmark on ten representative fairness-aware graph learning methods. Specifically, we design a systematic evaluation protocol and conduct experiments on seven real-world datasets to evaluate these methods from multiple perspectives, including group fairness, individual fairness, the balance between different fairness criteria, and computational efficiency. Our in-depth analysis reveals key insights into the strengths and limitations of existing methods. Additionally, we provide practical guidance for applying fairness-aware graph learning methods in applications. To the best of our knowledge, this work serves as an initial step towards comprehensively understanding representative fairness-aware graph learning methods to facilitate future advancements in this area.
☆ Market or Markets? Investigating Google Search's Market Shares Under Horizontal and Vertical Segmentation AAAI
Is Google Search a monopoly with gatekeeping power? Regulators from the US, UK, and Europe have argued that it is based on the assumption that Google Search dominates the market for horizontal (a.k.a. "general") web search. Google disputes this, claiming that competition extends to all vertical (a.k.a. "specialized") search engines, and that under this market definition it does not have monopoly power. In this study we present the first analysis of Google Search's market share under both horizontal and vertical segmentation of online search. We leverage observational trace data collected from a panel of US residents that includes their web browsing history and copies of the Google Search Engine Result Pages they were shown. We observe that Google Search receives 71.8% of participants' queries when compared to other horizontal search engines, and that participants' search sessions begin at Google greater than 50% of the time in 24 out of 30 vertical market segments (which comprise almost all of our participants' searches). Our results inform the consequential and ongoing debates about the market power of Google Search and the conceptualization of online markets in general.
comment: Extended version of Hu et al. paper that was published in Proceedings of the International AAAI Conference on Weblogs and Social Media (2024). Includes additional analysis of the horizontal search market that did not appear in the published manuscript
☆ Large Language Models as Misleading Assistants in Conversation ICML 2024
Large Language Models (LLMs) are able to provide assistance on a wide range of information-seeking tasks. However, model outputs may be misleading, whether unintentionally or in cases of intentional deception. We investigate the ability of LLMs to be deceptive in the context of providing assistance on a reading comprehension task, using LLMs as proxies for human users. We compare outcomes of (1) when the model is prompted to provide truthful assistance, (2) when it is prompted to be subtly misleading, and (3) when it is prompted to argue for an incorrect answer. Our experiments show that GPT-4 can effectively mislead both GPT-3.5-Turbo and GPT-4, with deceptive assistants resulting in up to a 23% drop in accuracy on the task compared to when a truthful assistant is used. We also find that providing the user model with additional context from the passage partially mitigates the influence of the deceptive model. This work highlights the ability of LLMs to produce misleading information and the effects this may have in real-world situations.
comment: Next Generation of AI Safety Workshop, 41st International Conference on Machine Learning (ICML 2024)
☆ Rethinking Fair Graph Neural Networks from Re-balancing KDD 2024
Driven by the powerful representation ability of Graph Neural Networks (GNNs), plentiful GNN models have been widely deployed in many real-world applications. Nevertheless, due to distribution disparities between different demographic groups, fairness in high-stake decision-making systems is receiving increasing attention. Although lots of recent works devoted to improving the fairness of GNNs and achieved considerable success, they all require significant architectural changes or additional loss functions requiring more hyper-parameter tuning. Surprisingly, we find that simple re-balancing methods can easily match or surpass existing fair GNN methods. We claim that the imbalance across different demographic groups is a significant source of unfairness, resulting in imbalanced contributions from each group to the parameters updating. However, these simple re-balancing methods have their own shortcomings during training. In this paper, we propose FairGB, Fair Graph Neural Network via re-Balancing, which mitigates the unfairness of GNNs by group balancing. Technically, FairGB consists of two modules: counterfactual node mixup and contribution alignment loss. Firstly, we select counterfactual pairs across inter-domain and inter-class, and interpolate the ego-networks to generate new samples. Guided by analysis, we can reveal the debiasing mechanism of our model by the causal view and prove that our strategy can make sensitive attributes statistically independent from target labels. Secondly, we reweigh the contribution of each group according to gradients. By combining these two modules, they can mutually promote each other. Experimental results on benchmark datasets show that our method can achieve state-of-the-art results concerning both utility and fairness metrics. Code is available at https://github.com/ZhixunLEE/FairGB.
comment: Accepted by SIGKDD 2024, research track
☆ Bringing AI Participation Down to Scale: A Comment on Open AIs Democratic Inputs to AI Project
This commentary piece reviews the recent Open AI Democratic Inputs programme, which funded 10 teams to design procedures for public participation in generative AI. While applauding the technical innovations in these projects, we identify several shared assumptions including the generality of LLMs, extracting abstract values, soliciting solutions not problems and equating participation with democracy. We call instead for AI participation which involves specific communities and use cases and solicits concrete problems to be remedied. We also find it important that these communities have a stake in the outcome, including ownership of data or models.
☆ Navigating the Data Trading Crossroads: An Interdisciplinary Survey
Data has been increasingly recognized as a critical factor in the future economy. However, constructing an efficient data trading market faces challenges such as privacy breaches, data monopolies, and misuse. Despite numerous studies proposing algorithms to protect privacy and methods for pricing data, a comprehensive understanding of these issues and systemic solutions remain elusive. This paper provides an extensive review and evaluation of data trading research, aiming to identify existing problems, research gaps, and propose potential solutions. We categorize the challenges into three main areas: Compliance Challenges, Collateral Consequences, and Costly Transactions (the "3C problems"), all stemming from ambiguity in data rights. Through a quantitative analysis of the literature, we observe a paradigm shift from isolated solutions to integrated approaches. Addressing the unresolved issue of right ambiguity, we introduce the novel concept of "data usufruct," which allows individuals to use and benefit from data they do not own. This concept helps reframe data as a more conventional factor of production and aligns it with established economic theories, paving the way for a comprehensive framework of research theories, technical tools, and platforms. We hope this survey provides valuable insights and guidance for researchers, practitioners, and policymakers, thereby contributing to digital economy advancements.
☆ EARN Fairness: Explaining, Asking, Reviewing and Negotiating Artificial Intelligence Fairness Metrics Among Stakeholders
Numerous fairness metrics have been proposed and employed by artificial intelligence (AI) experts to quantitatively measure bias and define fairness in AI models. Recognizing the need to accommodate stakeholders' diverse fairness understandings, efforts are underway to solicit their input. However, conveying AI fairness metrics to stakeholders without AI expertise, capturing their personal preferences, and seeking a collective consensus remain challenging and underexplored. To bridge this gap, we propose a new framework, EARN Fairness, which facilitates collective metric decisions among stakeholders without requiring AI expertise. The framework features an adaptable interactive system and a stakeholder-centered EARN Fairness process to Explain fairness metrics, Ask stakeholders' personal metric preferences, Review metrics collectively, and Negotiate a consensus on metric selection. To gather empirical results, we applied the framework to a credit rating scenario and conducted a user study involving 18 decision subjects without AI knowledge. We identify their personal metric preferences and their acceptable level of unfairness in individual sessions. Subsequently, we uncovered how they reached metric consensus in team sessions. Our work shows that the EARN Fairness framework enables stakeholders to express personal preferences and reach consensus, providing practical guidance for implementing human-centered AI fairness in high-risk contexts. Through this approach, we aim to harmonize fairness expectations of diverse stakeholders, fostering more equitable and inclusive AI fairness.
☆ Perceived Importance of ICT Proficiency for Teaching, Learning, and Career Progression among Physical Education Teachers in Pampanga
The integration of information and communication technology (ICT) has become increasingly vital across various educational fields, including physical education (PE). This study aimed to evaluate the proficiency levels of PE teachers in using various ICT applications and to examine the relationship between the perceived importance of ICT proficiency for teaching and learning, career advancement, and actual proficiency among Senior High school PE teachers in the municipality of Mexico, Pampanga. This study employed a quantitative descriptive approach. PE teachers from the municipality of Mexico, Pampanga, were selected as the respondents. This study used a two-part survey. The first section collected demographic data, such as age, gender, rank/position, and years of teaching experience, and the second section assessed ICT skill levels and the perceived importance of ICT in teaching, learning, and career progression. The results revealed that the majority of PE teachers had access to ICT resources. However, their proficiency levels with these tools varied significantly. Factors such as age, teaching experience, and professional position were found to significantly influence teachers proficiency and their perceptions of the benefits of ICT integration in PE instruction. The study provided a glimpse of the current state of ICT integration among Senior High school PE teachers in Mexico, Pampanga, Philippines. This also highlights areas of improvement. The study suggests that policymakers, administrators, and training program developers should focus on enhancing the ICT proficiency of PE teachers to improve teaching practices and student engagement. Enhancing the ICT proficiency of PE teachers is recommended to foster better teaching experiences, increase student engagement, and promote overall educational outcomes.
comment: 16 pages, 1 figure, 4 tables
☆ Mitigating biases in big mobility data: a case study of monitoring large-scale transit systems
Big mobility datasets (BMD) have shown many advantages in studying human mobility and evaluating the performance of transportation systems. However, the quality of BMD remains poorly understood. This study evaluates biases in BMD and develops mitigation methods. Using Google and Apple mobility data as examples, this study compares them with benchmark data from governmental agencies. Spatio-temporal discrepancies between BMD and benchmark are observed and their impacts on transportation applications are investigated, emphasizing the urgent need to address these biases to prevent misguided policymaking. This study further proposes and tests a bias mitigation method. It is shown that the mitigated BMD could generate valuable insights into large-scale public transit systems across 100+ US counties, revealing regional disparities of the recovery of transit systems from the COVID-19. This study underscores the importance of caution when using BMD in transportation research and presents effective mitigation strategies that would benefit practitioners.
comment: 25 pages, 10 figures. Transportation Letters. August 2024
♻ ☆ Bike Frames: Understanding the Implicit Portrayal of Cyclists in the News AAAI
Increasing cycling for transportation or recreation can boost health and reduce the environmental impacts of vehicles. However, news agencies' ideologies and reporting styles often influence public perception of cycling. For example, if news agencies overly report cycling accidents, it may make people perceive cyclists as "dangerous," reducing the number of cyclists who opt to cycle. Additionally, a decline in cycling can result in less government funding for safe infrastructure. In this paper, we develop a method for detecting the perceived perception of cyclists within news headlines. We introduce a new dataset called ``Bike Frames'' to accomplish this. The dataset consists of 31,480 news headlines and 1,500 annotations. Our focus is on analyzing 11,385 headlines from the United States. We also introduce the BikeFrame Chain-of-Code framework to predict cyclist perception, identify accident-related headlines, and determine fault. This framework uses pseudocode for precise logic and integrates news agency bias analysis for improved predictions over traditional chain-of-thought reasoning in large language models. Our method substantially outperforms other methods, and most importantly, we find that incorporating news bias information substantially impacts performance, improving the average F1 from .739 to .815. Finally, we perform a comprehensive case study on US-based news headlines, finding reporting differences between news agencies and cycling-specific websites as well as differences in reporting depending on the gender of cyclists. WARNING: This paper contains descriptions of accidents and death.
comment: Accepted to AAAI ICWSM 2025
♻ ☆ When Should Algorithms Resign? A Proposal for AI Governance
Algorithmic resignation is a strategic approach for managing the use of artificial intelligence (AI) by embedding governance directly into AI systems. It involves deliberate and informed disengagement from AI, such as restricting access AI outputs or displaying performance disclaimers, in specific scenarios to aid the appropriate and effective use of AI. By integrating algorithmic resignation as a governance mechanism, organizations can better control when and how AI is used, balancing the benefits of automation with the need for human oversight.
♻ ☆ Designing Decision Support Systems Using Counterfactual Prediction Sets ICML 2023
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets.
comment: Best paper award in the ICML 2023 AI&HCI Workshop, spotlight paper at ICML 2024
♻ ☆ Transforming Agency. On the mode of existence of Large Language Models
This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT. Between inflationary and deflationary accounts, we pay special attention to their status as agents. This requires explaining in detail the architecture, processing, and training procedures that enable LLMs to display their capacities, and the extensions used to turn LLMs into agent-like systems. After a systematic analysis we conclude that a LLM fails to meet necessary and sufficient conditions for autonomous agency in the light of embodied theories of mind: the individuality condition (it is not the product of its own activity, it is not even directly affected by it), the normativity condition (it does not generate its own norms or goals), and, partially the interactional asymmetry condition (it is not the origin and sustained source of its interaction with the environment). If not agents, then ... what are LLMs? We argue that ChatGPT should be characterized as an interlocutor or linguistic automaton, a library-that-talks, devoid of (autonomous) agency, but capable to engage performatively on non-purposeful yet purpose-structured and purpose-bounded tasks. When interacting with humans, a "ghostly" component of the human-machine interaction makes it possible to enact genuine conversational experiences with LLMs. Despite their lack of sensorimotor and biological embodiment, LLMs textual embodiment (the training corpus) and resource-hungry computational embodiment, significantly transform existing forms of human agency. Beyond assisted and extended agency, the LLM-human coupling can produce midtended forms of agency, closer to the production of intentional agency than to the extended instrumentality of any previous technologies.
♻ ☆ Socioeconomic factors of national representation in the global film festival circuit: skewed toward the large and wealthy, but small countries can beat the odds
This study analyzes how economic, demographic, and geographic factors predict the representation of different countries in the global film festival circuit. It relies on the combination of several open access datasets, including festival programming information from the Cinando platform of the Cannes Film Market, covering more than 30,000 screenings of over 20,000 films in almost 600 festivals across the world over a decade. It is shown that while the festival screen is indeed dominated by films from large affluent countries, the bias is nevertheless not fully proportional to the large demographic and economic disparities across the world, and that several small countries perform better than expected. It is further analyzed via computational simulations how much including films from smaller countries contributes to cultural diversity, and how countries differ in cultural "trade balance" dynamics, revealing differences between net exporters and importers of festival films. This research underscores the importance of balanced representation in film festivals and the public value of increasing cultural diversity. The data-driven insights and approaches to quantitative festival program and cultural event analytics are hoped to be useful for both the academic community as well as film festival organizers and policymakers aiming to foster more inclusive and diverse cultural landscapes.
♻ ☆ AI and the Iterable Epistopics of Risk
Abstract. The risks AI presents to society are broadly understood to be manageable through general calculus, i.e., general frameworks designed to enable those involved in the development of AI to apprehend and manage risk, such as AI impact assessments, ethical frameworks, emerging international standards, and regulations. This paper elaborates how risk is apprehended and managed by a regulator, developer and cyber-security expert. It reveals that risk and risk management is dependent on mundane situated practices not encapsulated in general calculus. Situated practice surfaces iterable epistopics, revealing how those involved in the development of AI know and subsequently respond to risk and uncover major challenges in their work. The ongoing discovery and elaboration of epistopics of risk in AI a) furnishes a potential program of interdisciplinary inquiry, b) provides AI developers with a means of apprehending risk, and c) informs the ongoing evolution of general calculus.
comment: 17 pages
♻ ☆ A Survey on Safe Multi-Modal Learning System
In the rapidly evolving landscape of artificial intelligence, multimodal learning systems (MMLS) have gained traction for their ability to process and integrate information from diverse modality inputs. Their expanding use in vital sectors such as healthcare has made safety assurance a critical concern. However, the absence of systematic research into their safety is a significant barrier to progress in this field. To bridge the gap, we present the first taxonomy that systematically categorizes and assesses MMLS safety. This taxonomy is structured around four fundamental pillars that are critical to ensuring the safety of MMLS: robustness, alignment, monitoring, and controllability. Leveraging this taxonomy, we review existing methodologies, benchmarks, and the current state of research, while also pinpointing the principal limitations and gaps in knowledge. Finally, we discuss unique challenges in MMLS safety. In illuminating these challenges, we aim to pave the way for future research, proposing potential directions that could lead to significant advancements in the safety protocols of MMLS.
♻ ☆ Automating psychological hypothesis generation with AI: when large language models meet causal graph
Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using a LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 potential psychological hypotheses focusing on `well-being', then compared them against research ideas conceived by doctoral scholars and those produced solely by the LLM. Interestingly, our combined approach of a LLM and causal graphs mirrored the expert-level insights in terms of novelty, clearly surpassing the LLM-only hypotheses (t(59) = 3.34, p=0.007 and t(59) = 4.32, p<0.001, respectively). This alignment was further corroborated using deep semantic analysis. Our results show that combining LLM with machine learning techniques such as causal knowledge graphs can revolutionize automated discovery in psychology, extracting novel insights from the extensive literature. This work stands at the crossroads of psychology and artificial intelligence, championing a new enriched paradigm for data-driven hypothesis generation in psychological research.
Computers and Society
☆ Conquering images and the basis of transformative action
Our rapid immersion into online life has made us all ill. Through the generation, personalization, and dissemination of enchanting imagery, artificial technologies commodify the minds and hearts of the masses with nauseating precision and scale. Online networks, artificial intelligence (AI), social media, and digital news feeds fine-tune our beliefs and pursuits by establishing narratives that subdivide and polarize our communities and identities. Meanwhile those commanding these technologies conquer the final frontiers of our interior lives, social relations, earth, and cosmos. In the Attention Economy, our agency is restricted and our vitality is depleted for their narcissistic pursuits and pleasures. Generative AI empowers the forces that homogenize and eradicate life, not through some stupid "singularity" event, but through devaluing human creativity, labor, and social life. Using a fractured lens, we will examine how narratives and networks influence us on mental, social, and algorithmic levels. We will discuss how atomizing imagery -- ideals and pursuits that alienate, rather than invigorate the individual -- hijack people's agency to sustain the forces that destroy them. We will discover how empires build digital networks that optimize society and embolden narcissists to enforce social binaries that perpetuate the ceaseless expansion of consumption, exploitation, and hierarchy. Structural hierarchy in the world is reified through hierarchy in our beliefs and thinking. Only by seeing images as images and appreciating the similarity shared by opposing narratives can we facilitate transformative action and break away from the militaristic systems plaguing our lives.
comment: I would like to thank Nick Ichien for his feedback on this manuscript. This is a working manuscript and is bound to change in the coming months
☆ (De)Noise: Moderating the Inconsistency Between Human Decision-Makers SC
Prior research in psychology has found that people's decisions are often inconsistent. An individual's decisions vary across time, and decisions vary even more across people. Inconsistencies have been identified not only in subjective matters, like matters of taste, but also in settings one might expect to be more objective, such as sentencing, job performance evaluations, or real estate appraisals. In our study, we explore whether algorithmic decision aids can be used to moderate the degree of inconsistency in human decision-making in the context of real estate appraisal. In a large-scale human-subject experiment, we study how different forms of algorithmic assistance influence the way that people review and update their estimates of real estate prices. We find that both (i) asking respondents to review their estimates in a series of algorithmically chosen pairwise comparisons and (ii) providing respondents with traditional machine advice are effective strategies for influencing human responses. Compared to simply reviewing initial estimates one by one, the aforementioned strategies lead to (i) a higher propensity to update initial estimates, (ii) a higher accuracy of post-review estimates, and (iii) a higher degree of consistency between the post-review estimates of different respondents. While these effects are more pronounced with traditional machine advice, the approach of reviewing algorithmically chosen pairs can be implemented in a wider range of settings, since it does not require access to ground truth data.
comment: To appear in CSCW 2024
☆ EyeDentify: A Dataset for Pupil Diameter Estimation based on Webcam Images
In this work, we introduce EyeDentify, a dataset specifically designed for pupil diameter estimation based on webcam images. EyeDentify addresses the lack of available datasets for pupil diameter estimation, a crucial domain for understanding physiological and psychological states traditionally dominated by highly specialized sensor systems such as Tobii. Unlike these advanced sensor systems and associated costs, webcam images are more commonly found in practice. Yet, deep learning models that can estimate pupil diameters using standard webcam data are scarce. By providing a dataset of cropped eye images alongside corresponding pupil diameter information, EyeDentify enables the development and refinement of models designed specifically for less-equipped environments, democratizing pupil diameter estimation by making it more accessible and broadly applicable, which in turn contributes to multiple domains of understanding human activity and supporting healthcare. Our dataset is available at https://vijulshah.github.io/eyedentify/.
☆ Blockchain Governance: An Empirical Analysis of User Engagement on DAOs
In this note, we examine voting on four major blockchain DAOs: Aave, Compound, Lido and Uniswap. Using data directly collected from the Ethereum blockchain, we examine voter activity. We find that in most votes, the "minimal quorum," i.e., the smallest number of active voters who could swing the vote is quite small. To understand who is actually driving these DAOs, we use data from the Ethereum Name Service (ENS), Sybil.org, and Compound, to divide voters into different categories.
☆ Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., possessing the right psychometric properties). However, item calibration is time-consuming and costly, requiring a sufficient number of respondents for the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) and various combinations of them using sampling methods to produce responses with psychometric properties similar to human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated by LLM-Respondents have high correlations (e.g. > 0.8 for GPT-3.5) compared to their human calibrated counterparts, and closely resemble the parameters of the human subset (e.g. 0.02 Spearman correlation difference). Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
☆ Innovation Resistance Theory in Action: Unveiling Barriers to Open Government Data Adoption by Public Organizations to Unlock Open Data Innovation
Open Government Data (OGD) plays a pivotal role in fostering data-driven innovation and sustainability across various sectors. Despite its potential, many public organizations are reluctant to share their data openly. While existing research has explored factors impacting the public organizations intention to share OGD, there is a paucity of research applying theoretical models to investigate the resistance by public organizations to making government data publicly available. This study addresses the gap by developing an Innovation Resistance Theory (IRT) model tailored to OGD that allows identifying predictors of resistance among public agencies. We develop an initial model based on literature and refine it through interviews with 21 public agencies across six countries. The final model describes 39 barriers related to usage, value, risks, tradition, and image. The findings contribute to the literature by adapting IRT to the context of OGD, an area where its application has been notably limited. As such, this study addresses the growing demand for novel theoretical frameworks to examine OGD adoption barriers. Practical insights are provided to support policymakers in creating data ecosystems that encourage data openness and address challenges in OGD adoption.
☆ The Missing Link: Allocation Performance in Causal Machine Learning
Automated decision-making (ADM) systems are being deployed across a diverse range of critical problem areas such as social welfare and healthcare. Recent work highlights the importance of causal ML models in ADM systems, but implementing them in complex social environments poses significant challenges. Research on how these challenges impact the performance in specific downstream decision-making tasks is limited. Addressing this gap, we make use of a comprehensive real-world dataset of jobseekers to illustrate how the performance of a single CATE model can vary significantly across different decision-making scenarios and highlight the differential influence of challenges such as distribution shifts on predictions and allocations.
☆ Exploring the Factors of "AI Guilt" Among Students -- Are You Guilty of Using AI in Your Homework?
This study explores the phenomenon of "AI guilt" among secondary school students, a form of moral discomfort arising from the use of AI tools in academic tasks traditionally performed by humans. Through qualitative methodologies, the research examines the factors contributing to AI guilt, its social and psychological impacts, and its implications for educational practices. The findings revealed three main dimensions for AI guilt - perceived laziness and authenticity, fear of judgment, and identity and self-efficacy concerns. The findings suggest a need to redefine academic integrity and shift our mindset to reconsider what we should value in education. The study also emphasizes the importance of ethical guidelines and educational support and provides implications to help students navigate the complexities of AI in education, reducing feelings of guilt while enhancing learning outcomes.
Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models
Large Language Models have recently been applied to text annotation tasks from social sciences, equalling or surpassing the performance of human workers at a fraction of the cost. However, no inquiry has yet been made on the impact of prompt selection on labelling accuracy. In this study, we show that performance greatly varies between prompts, and we apply the method of automatic prompt optimization to systematically craft high quality prompts. We also provide the community with a simple, browser-based implementation of the method at https://prompt-ultra.github.io/ .
☆ Balancing the Scales: Reinforcement Learning for Fair Classification
Fairness in classification tasks has traditionally focused on bias removal from neural representations, but recent trends favor algorithmic methods that embed fairness into the training process. These methods steer models towards fair performance, preventing potential elimination of valuable information that arises from representation manipulation. Reinforcement Learning (RL), with its capacity for learning through interaction and adjusting reward functions to encourage desired behaviors, emerges as a promising tool in this domain. In this paper, we explore the usage of RL to address bias in imbalanced classification by scaling the reward function to mitigate bias. We employ the contextual multi-armed bandit framework and adapt three popular RL algorithms to suit our objectives, demonstrating a novel approach to mitigating bias.
☆ Laypeople's Egocentric Perceptions of Copyright for AI-Generated Art
Recent breakthroughs in generative AI (GenAI) have fueled debates concerning the status of AI-generated creations under copyright law. This research investigates laypeople's perceptions ($N$ = 424) of AI-generated art concerning factors associated with copyright protection. Inspired by prior work suggesting that people show egocentric biases when evaluating their own creative outputs, we also test if the same holds for AI-generated art. Namely, we study the differences between the perceptions of those who have something to gain from copyright protection -- creators of AI-generated art -- and uninvested third parties. To answer our research questions, we held an incentivized AI art competition, in which some participants used a GenAI model to generate images for consideration while others evaluated these submissions. We find that participants are most likely to attribute authorship and copyright over AI-generated images to the users who prompted the AI system to generate the image and the artists whose creations were used for training the AI model. We also find that participants egocentrically favored their own art over other participants' art and rated their own creations higher than other people evaluated them. Moreover, our results suggest that people judge their own AI-generated art more favorably with respect to some factors (creativity and effort) but not others (skills). Our findings have implications for future debates concerning the potential copyright protection of AI-generated outputs.
♻ ☆ Bridging the Global Divide in AI Regulation: A Proposal for a Contextual, Coherent, and Commensurable Framework
As debates on potential societal harm from artificial intelligence (AI) culminate in legislation and international norms, a global divide is emerging in both AI regulatory frameworks and international governance structures. In terms of local regulatory frameworks, the European Union (E.U.), Canada, and Brazil follow a horizontal or lateral approach that postulates the homogeneity of AI, seeks to identify common causes of harm, and demands uniform human interventions. In contrast, the United States (U.S.), the United Kingdom (U.K.), Israel, and Switzerland (and potentially China) have pursued a context-specific or modular approach, tailoring regulations to the specific use cases of AI systems. This paper argues for a context-specific approach to effectively address evolving risks in diverse mission-critical domains, while avoiding social costs associated with one-size-fits-all approaches. However, to enhance the systematicity and interoperability of international norms and accelerate global harmonization, this paper proposes an alternative contextual, coherent, and commensurable (3C) framework. To ensure contextuality, the framework (i) bifurcates the AI life cycle into two phases: learning and deployment for specific tasks, instead of defining foundation or general-purpose models; and (ii) categorizes these tasks based on their application and interaction with humans as follows: autonomous, discriminative (allocative, punitive, and cognitive), and generative AI. To ensure coherency, each category is assigned specific regulatory objectives replacing 2010s vintage AI ethics. To ensure commensurability, the framework promotes the adoption of international standards for measuring and mitigating risks.
comment: 33(2) Wash. Int'l L.J. _ (2024)
♻ ☆ Reducing the Filtering Effect in Public School Admissions: A Bias-aware Analysis for Targeted Interventions
Problem definition: Traditionally, New York City's top 8 public schools have selected candidates solely based on their scores in the Specialized High School Admissions Test (SHSAT). These scores are known to be impacted by socioeconomic status of students and test preparation received in middle schools, leading to a massive filtering effect in the education pipeline. The classical mechanisms for assigning students to schools do not naturally address problems like school segregation and class diversity, which have worsened over the years. The scientific community, including policymakers, have reacted by incorporating group-specific quotas and proportionality constraints, with mixed results. The problem of finding effective and fair methods for broadening access to top-notch education is still unsolved. Methodology/results: We take an operations approach to the problem different from most established literature, with the goal of increasing opportunities for students with high economic needs. Using data from the Department of Education (DOE) in New York City, we show that there is a shift in the distribution of scores obtained by students that the DOE classifies as "disadvantaged" (following criteria mostly based on economic factors). We model this shift as a "bias" that results from an underestimation of the true potential of disadvantaged students. We analyze the impact this bias has on an assortative matching market. We show that centrally planned interventions can significantly reduce the impact of bias through scholarships or training, when they target the segment of disadvantaged students with average performance.
♻ ☆ Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP ACL 2024
Personal names simultaneously differentiate individuals and categorize them in ways that are important in a given society. While the natural language processing community has thus associated personal names with sociodemographic characteristics in a variety of tasks, researchers have engaged to varying degrees with the established methodological problems in doing so. To guide future work that uses names and sociodemographic characteristics, we provide an overview of relevant research: first, we present an interdisciplinary background on names and naming. We then survey the issues inherent to associating names with sociodemographic attributes, covering problems of validity (e.g., systematic error, construct validity), as well as ethical concerns (e.g., harms, differential impact, cultural insensitivity). Finally, we provide guiding questions along with normative recommendations to avoid validity and ethical pitfalls when dealing with names and sociodemographic characteristics in natural language processing.
comment: Gender Bias in Natural Language Processing Workshop at ACL 2024
♻ ☆ Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy
The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic LLMs-in-the-Loop strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Additionally, we design a downstream task as stance prediction by leveraging talking points in climate debates. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.
♻ ☆ Discovering Latent Themes in Social Media Messaging: A Machine-in-the-Loop Approach Integrating LLMs AAAI
Grasping the themes of social media content is key to understanding the narratives that influence public opinion and behavior. The thematic analysis goes beyond traditional topic-level analysis, which often captures only the broadest patterns, providing deeper insights into specific and actionable themes such as "public sentiment towards vaccination", "political discourse surrounding climate policies," etc. In this paper, we introduce a novel approach to uncovering latent themes in social media messaging. Recognizing the limitations of the traditional topic-level analysis, which tends to capture only overarching patterns, this study emphasizes the need for a finer-grained, theme-focused exploration. Traditional theme discovery methods typically involve manual processes and a human-in-the-loop approach. While valuable, these methods face challenges in scalability, consistency, and resource intensity in terms of time and cost. To address these challenges, we propose a machine-in-the-loop approach that leverages the advanced capabilities of Large Language Models (LLMs). To demonstrate our approach, we apply our framework to contentious topics, such as climate debate and vaccine debate. We use two publicly available datasets: (1) the climate campaigns dataset of 21k Facebook ads and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads. Our quantitative and qualitative analysis shows that our methodology yields more accurate and interpretable results compared to the baselines. Our results not only demonstrate the effectiveness of our approach in uncovering latent themes but also illuminate how these themes are tailored for demographic targeting in social media contexts. Additionally, our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.
comment: Accepted at 19th International AAAI Conference on Web and Social Media (ICWSM-2025)
♻ ☆ Finding Fake News Websites in the Wild
The battle against the spread of misinformation on the Internet is a daunting task faced by modern society. Fake news content is primarily distributed through digital platforms, with websites dedicated to producing and disseminating such content playing a pivotal role in this complex ecosystem. Therefore, these websites are of great interest to misinformation researchers. However, obtaining a comprehensive list of websites labeled as producers and/or spreaders of misinformation can be challenging, particularly in developing countries. In this study, we propose a novel methodology for identifying websites responsible for creating and disseminating misinformation content, which are closely linked to users who share confirmed instances of fake news on social media. We validate our approach on Twitter by examining various execution modes and contexts. Our findings demonstrate the effectiveness of the proposed methodology in identifying misinformation websites, which can aid in gaining a better understanding of this phenomenon and enabling competent entities to tackle the problem in various areas of society.
comment: This is a preprint version of a submitted manuscript on the Brazilian Symposium on Multimedia and the Web (WebMedia)
♻ ☆ What Do People Think about Sentient AI?
With rapid advances in machine learning, many people in the field have been discussing the rise of digital minds and the possibility of artificial sentience. Future developments in AI capabilities and safety will depend on public opinion and human-AI interaction. To begin to fill this research gap, we present the first nationally representative survey data on the topic of sentient AI: initial results from the Artificial Intelligence, Morality, and Sentience (AIMS) survey, a preregistered and longitudinal study of U.S. public opinion that began in 2021. Across one wave of data collection in 2021 and two in 2023 (total N = 3,500), we found mind perception and moral concern for AI well-being in 2021 were higher than predicted and significantly increased in 2023: for example, 71% agree sentient AI deserve to be treated with respect, and 38% support legal rights. People have become more threatened by AI, and there is widespread opposition to new technologies: 63% support a ban on smarter-than-human AI, and 69% support a ban on sentient AI. Expected timelines are surprisingly short and shortening with a median forecast of sentient AI in only five years and artificial general intelligence in only two years. We argue that, whether or not AIs become sentient, the discussion itself may overhaul human-computer interaction and shape the future trajectory of AI technologies, including existential risks and opportunities.
♻ ☆ Evaluating Algorithmic Bias in Models for Predicting Academic Performance of Filipino Students
Algorithmic bias is a major issue in machine learning models in educational contexts. However, it has not yet been studied thoroughly in Asian learning contexts, and only limited work has considered algorithmic bias based on regional (sub-national) background. As a step towards addressing this gap, this paper examines the population of 5,986 students at a large university in the Philippines, investigating algorithmic bias based on students' regional background. The university used the Canvas learning management system (LMS) in its online courses across a broad range of domains. Over the period of three semesters, we collected 48.7 million log records of the students' activity in Canvas. We used these logs to train binary classification models that predict student grades from the LMS activity. The best-performing model reached AUC of 0.75 and weighted F1-score of 0.79. Subsequently, we examined the data for bias based on students' region. Evaluation using three metrics: AUC, weighted F1-score, and MADD showed consistent results across all demographic groups. Thus, no unfairness was observed against a particular student group in the grade predictions.
comment: Published in proceedings of the 17th Educational Data Mining Conference (EDM 2024)