9:20 - 9:30 am | Opening remarks |
Timothy Baldwin (MBZUAI) |
9:30 - 10:30 am | Keynote: Matryoshka Structured Adaptive Models
Prateek Jain (Google) | |
Abstract: Current foundational models are rigidly structured and need to be trained for each scale to meet the desired quality-latency-cost tradeoff.
In this talk, we will discuss how simple "Matryoshka" or nested structured models -- aided by natural modularity in our foundational models --
allow for training one model but reading off many accurate models, letting us meet the desired tradeoffs without *any additional training*.
We illustrate this basic idea for transformer-based LLMs, ViTs, servable MLP models, embeddings, and quantized models.
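For intuition, here is a minimal sketch of the nested ("Matryoshka") training idea for a classifier, assuming a generic PyTorch encoder and hypothetical nesting dimensions; it is an illustration of the general technique, not the speaker's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical nesting granularities; the full embedding here is 512-dimensional.
NESTED_DIMS = [64, 128, 256, 512]

class MatryoshkaClassifier(nn.Module):
    """One encoder, one head per nesting level; every embedding prefix is usable."""
    def __init__(self, encoder: nn.Module, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in NESTED_DIMS)

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        z = self.encoder(x)  # full embedding, shape (batch, NESTED_DIMS[-1])
        return [head(z[:, :d]) for d, head in zip(NESTED_DIMS, self.heads)]

def matryoshka_loss(logits_per_level: list[torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
    # Summing the task loss over all nesting levels keeps every prefix accurate,
    # so smaller sub-models can later be "read off" without additional training.
    return sum(F.cross_entropy(logits, labels) for logits in logits_per_level)
```

At inference time, one simply truncates the embedding to the desired prefix (and uses the matching head) to trade accuracy against latency and cost.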
Bio: Prateek Jain is a Principal Scientist / Director at Google DeepMind India, where he leads the Machine Learning and Optimization team as well as Gemini's Long Term Model Design Research. He obtained his doctorate from UT Austin and his BTech from IIT Kanpur. He has conducted foundational research in the areas of large-scale and non-convex optimization and resource-constrained ML. Prateek regularly serves on the senior PC of top ML conferences and is on the editorial boards of top ML journals, including JMLR and SIMODS. He has won multiple best paper awards, including the 2020 Best Paper Award from the IEEE Signal Processing Society. Prateek also received the Young Alumnus Award from IIT Kanpur in 2021 and the ACM India Early Career Researcher Award in 2022. |
10:30 - 11:00 am | Coffee Break |
11:00 - 11:30 am | Does Scaling Guarantee Trustworthy LLMs? The Hidden Costs of Reasoning LLMs. |
Nouha Dziri (Allen Institute for AI) | |
Abstract: Scaling has led to the rise of the most capable large language models (LLMs) in history—systems that can perform complex tasks and demonstrate impressive step-by-step reasoning. At first glance, such reasoning capabilities might suggest that these models are reliable and safe for real-world use. But should we truly feel confident?
Despite their power, reasoning LLMs remain brittle. They can fail unexpectedly on simple tasks as complexity increases, and we still lack reliable methods to understand or control their behavior. This unpredictability poses serious safety risks, especially as these models are deployed in increasingly high-stakes environments.
In this talk, I will explore the hidden costs of reasoning LLMs. I will begin with the importance of uncovering and characterizing model risks as a first step toward safety. I will present findings on automatic red-teaming attacks, and discuss both model-level and system-level defenses (aka safeguards). I will also discuss how agentic LLM-based systems amplify these challenges and often fall short of performing safely in dynamic, multi-step settings.
Finally, I will address the broader social consequences: the erosion of human agency, the growing overreliance on model outputs—even when wrong—and the risk that LLMs may contribute to a collapse in independent human reasoning and creativity. These hidden costs must be confronted directly if we are to build truly trustworthy AI systems.
Bio: Nouha Dziri is an AI research scientist at the Allen Institute for AI (Ai2). Her research investigates a wide variety of problems across NLP and AI including building state-of-the-art language models and understanding their limits and inner workings. She also works on AI safety to ensure the responsible deployment of LLMs while enhancing their reasoning capabilities. Prior to Ai2, she worked at Google DeepMind, Microsoft Research and Mila. She earned her PhD from the University of Alberta and the Alberta Machine Intelligence Institute. Her work has been published in top-tier AI venues including NeurIPS, ICML, ICLR, TACL, ACL, NAACL and EMNLP. She was recently awarded the runner-up Best Paper Award at NAACL 2025. |
11:30 - 12:00 pm | Scalable Safety |
Ashwinee Panda (Princeton University) | |
Abstract: We consider a range of failure settings in which existing safety mechanisms are insufficient, and we examine how proficient existing models are at safety oversight.
Bio: Ashwinee is a postdoctoral fellow at the University of Maryland (UMD), working with Tom Goldstein on LLMs. |
12:00 - 2:00 pm | Lunch break
2:00 - 2:30 pm | Towards Secure and Safe LLM
Chaowei Xiao (University of Wisconsin, Madison) | |
Abstract: We are witnessing a paradigm shift in AI, transitioning from deep learning models to the era of Large Language Models (LLMs).
This shift signifies a transformative advancement in AI, enabling it to be applied to diverse real-world safety-critical applications.
Despite these impressive achievements, a fundamental question remains: is AI truly ready for safe and secure use? In this talk, I will demonstrate how my research addresses this question.
I will introduce two principles for measuring the worst-case performance of AI and for designing secure and safe AI models. Additionally, I will highlight why securing AI requires more than just focusing on model-level robustness by illustrating security vulnerabilities from a broader system perspective. Finally, I will outline my vision for ensuring AI security through comprehensive, integrated model-level and system-level approaches.
Bio: Chaowei Xiao is an Assistant Professor at the University of Wisconsin–Madison. His research focuses on building secure and trustworthy AI systems. He has received several prestigious awards, including the Schmidt Sciences AI2050 Early Career Award, the Impact Award from Argonne National Laboratory, and various industry faculty awards. His work has been recognized with several paper awards, including the USENIX Security Distinguished Paper Award (2024), the International Conference on Embedded Wireless Systems and Networks (EWSN) Best Paper Award (2021), the MobiCom Best Paper Award (2014), recognition as an ACM Gordon Bell Prize finalist (2024), and the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research (2023). Dr. Xiao's research has been cited over 15k times and has been featured in multiple media outlets, such as Nature, Wired, Fortune, and The New York Times. Additionally, one of his research outputs was exhibited at the London Science Museum. |
2:30 - 3:00 pm | Adaptively Robust and Forgery-Resistant Watermarking
Nils Lukas (MBZUAI) | |
Abstract: Watermarking is a potential solution to verify the provenance of content generated by large-scale machine learning systems. A watermark is a hidden signal embedded in the content using a secret watermarking key, whose presence can later be verified. Threats to providers include watermark evasion and stealing attacks, in which users (i) remove a watermark to evade detection or (ii) forge a watermark into content not generated by the provider, without access to the secret key, e.g., to falsely accuse or impersonate the provider.
In this talk, I will present potential solutions that enhance these security properties of watermarking by studying adaptive attacks and multi-key watermarking against restricted attackers with limited access to computational resources.
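As a toy illustration of the detection side of a keyed text watermark (a generic sketch, not the speaker's scheme; all names here are hypothetical): the provider pseudo-randomly marks roughly half of all tokens as "green" under a secret key, and detection tests whether a text over-uses that set.

```python
import hashlib
import math

def is_green(token: str, secret_key: bytes) -> bool:
    # Derive a keyed pseudo-random bit per token; ~half of tokens are "green".
    digest = hashlib.sha256(secret_key + token.encode("utf-8")).digest()
    return digest[0] % 2 == 0

def detect_watermark(tokens: list[str], secret_key: bytes, z_threshold: float = 4.0) -> bool:
    n = len(tokens)
    if n == 0:
        return False
    green = sum(is_green(t, secret_key) for t in tokens)
    # z-score against the null hypothesis that green tokens occur at rate 1/2
    z = (green - 0.5 * n) / math.sqrt(0.25 * n)
    return z > z_threshold
```

A multi-key variant would run this test under several independently sampled keys, which is one way to make forging (or stripping) a watermark statistically harder for an attacker who never sees the keys.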
Bio: Nils is an Assistant Professor in the Machine Learning department at MBZUAI whose research focuses on secure and private machine learning. His work includes developing methods for content watermarking, privacy-preserving and secure inference, alignment of ML models, and enhancing the robustness of ML systems. His dissertation was awarded the Top Mathematics Doctoral Prize, and he received the Alumni Gold Medal from the University of Waterloo. |
3:00 - 3:30 pm | Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models
Preslav Nakov (MBZUAI) | |
Abstract: First, we will argue for the need for fully transparent open-source large language models (LLMs), and we will describe the efforts of MBZUAI's Institute of Foundation Models (IFM) towards that goal, building on the LLM360 initiative.
Second, we will argue for the need for language-specific LLMs, and we will share our experience in building Jais, the world's leading open Arabic-centric foundation and instruction-tuned large language model, Nanda, our open-weights Hindi LLM, Sherkala, our open-weights Kazakh LLM, and some other models.
Third, we will argue for the need for safe LLMs, and we will present Do-Not-Answer, a dataset for evaluating the guardrails of LLMs, which is at the core of the safety mechanisms of our LLMs. Fourth, we will argue for the need for factual LLMs, and we will discuss the factuality challenges that LLMs pose. We will then present some recent relevant tools for addressing these challenges developed at MBZUAI: (i) OpenFactCheck, a framework for fact-checking LLM output, for building customized fact-checking systems, and for benchmarking LLMs for factuality, (ii) LM-Polygraph, a tool for predicting an LLM's uncertainty in its output using cheap and fast uncertainty quantification techniques, and (iii) LLM-DetectAIve, a tool for machine-generated text detection.
Finally, we will argue for the need for specialized models, and we will present the zoo of LLMs currently being developed at MBZUAI's IFM.
Bio: Preslav Nakov is Professor and Department Chair for NLP at the Mohamed bin Zayed University of Artificial Intelligence. He is part of the core team at MBZUAI's Institute of Foundation Models that developed Jais, the world's best open-source Arabic-centric LLM, Nanda, the world's best open-weights Hindi model, Sherkala, the world's best open-weights Kazakh model, and LLM360, the first truly open LLM (open weights, open data, and open code). Previously, he was Principal Scientist at the Qatar Computing Research Institute, HBKU, where he led the Tanbih mega-project, developed in collaboration with MIT, which aims to limit the impact of "fake news", propaganda and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. He received his PhD degree in Computer Science from the University of California at Berkeley, supported by a Fulbright grant. He is Chair of the European Chapter of the Association for Computational Linguistics (EACL), Secretary of ACL SIGSLAV, and Secretary of the Truth and Trust Online board of trustees. Formerly, he was PC chair of ACL 2022 and President of ACL SIGLEX. He is also a member of the editorial boards of several journals, including Computational Linguistics, TACL, ACM TOIS, IEEE TASL, IEEE TAC, CS&L, NLE, AI Communications, and Frontiers in AI. He authored a Morgan & Claypool book on Semantic Relations between Nominals, two books on computer algorithms, and 250+ research papers. He received a Best Paper Award at ACM WebSci'2022, a Best Long Paper Award at CIKM'2020, a Best Resource Paper Award at EACL'2024, a Best Demo Paper Award (Honorable Mention) at ACL'2020, a Best Task Paper Award (Honorable Mention) at SemEval'2020, a Best Poster Award at SocInfo'2019, and the Young Researcher Award at RANLP'2011. He was also the first to receive the Bulgarian President's John Atanasoff award, named after the inventor of the first automatic electronic digital computer. His research was featured by over 100 news outlets, including Reuters, Forbes, Financial Times, CNN, Boston Globe, Aljazeera, DefenseOne, Business Insider, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, WIRED, and Engadget, among others. |
3:30 - 4:00 pm | Coffee Break
4:00 - 4:30 pm | Two Pathways to Trustworthy AI: Transparent Representations and Robust Adaptation
Haolun Wu (Stanford University / Mila) | |
Abstract: As foundation models become core components of human-centric AI applications, designing trustworthy AI systems—those that are transparent, controllable, and robust—has become a critical topic. In this talk, I will present our recent works highlighting two complementary methods towards trustworthy AI systems through transparent user representations and robust model adaptation.
I will first introduce TEARS, a recommender system framework that leverages LLMs to generate editable and interpretable user summaries. These summaries provide users with a clear and direct interface for understanding and shaping their recommendations. By aligning the textual summaries with collaborative filtering signals through optimal transport, TEARS enables recommender systems that are both accurate and user-controllable. Then, I will present Plugin, a method for robustly adapting closed-source language models to domain-specific tasks with only logits access. Framing LLM adaptation as a label noise correction problem, Plugin learns a lightweight autoregressive reweighting model to steer token-level probabilities during inference. Without access to model weights or training data, the approach is simple and enables robust deployment.
Together, these two methods illustrate how we can engineer trust into AI systems built on top of foundation models. I will conclude with reflections on broader principles connecting human feedback, interpretability, and alignment in building trustworthy AI systems.
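To make the general idea of adapting a closed-source model through its logits concrete, here is a minimal sketch of token-level reweighting during decoding; the names (reweighter, reweighted_next_token) are hypothetical and this is not the Plugin implementation.

```python
import torch
import torch.nn as nn

def reweighted_next_token(base_logits: torch.Tensor,
                          reweighter: nn.Module,
                          context_ids: torch.Tensor) -> torch.Tensor:
    # base_logits:  (batch, vocab) returned by the closed-source model's API
    # context_ids:  (batch, seq_len) tokens generated so far
    # reweighter:   a small trainable model producing per-token correction scores;
    #               the base model's weights are never touched.
    correction = reweighter(context_ids)              # (batch, vocab)
    adjusted = base_logits.log_softmax(dim=-1) + correction
    return adjusted.argmax(dim=-1)                    # greedy choice of the next token
```

Because only the lightweight correction model is trained, this style of adaptation can be applied at inference time to any model that exposes its token-level logits.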
Bio: Haolun is a Ph.D. candidate in Computer Science at McGill University and Mila. He is currently a visiting scholar at Stanford University in the Human-Centered AI and Trustworthy AI Research Lab. His research centers on learning from human feedback, using ML techniques to align AI with human needs. His work spans micro-level aspects (such as data values and personalization) as well as macro-level goals (such as responsibility, social norms, and educational impact). His research is supported by multiple fellowships, has been published at top venues including NeurIPS, ICLR, ICML, The WebConf, SIGIR, EMNLP, and CHI, and has been complemented by internships and collaborations at Google Research, DeepMind, and Microsoft Research. |
4:30 - 5:00 pm | Understanding Generalization in Large Language Models through the Lens of Compression
Sanae Lotfi (New York University) | |
Abstract: Gaining insight into the mechanisms behind the generalization of deep learning models is crucial to build on their strengths, address their limitations, and deploy them in safety-critical applications.
As state-of-the-art models for various data modalities become increasingly large and are trained on internet-scale data, the notion of generalization becomes more challenging to define. In this talk, I examine generalization through the lens of Occam's razor: among models that can fit the training data, the simplest is most likely to perform well on unseen data. Compression bounds provide a principled way to capture this intuition through a trade-off between the model's training performance and its compressed size. First, I present our work on deriving state-of-the-art generalization bounds for image classification models, providing key insights into why these models generalize effectively in practice. I then explore the challenges of extending these bounds to pretrained large language models (LLMs), establishing the first non-vacuous bounds for LLMs. Our findings reveal that larger LLMs not only yield better bounds but also find simpler representations of the data. Finally, we demonstrate that LLMs retain their understanding of patterns but forget highly unstructured data more rapidly as we compress them more aggressively.
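For intuition, the trade-off between training performance and compressed size can be captured by a textbook compression (Occam) bound; the following is a generic form, not the talk's exact statement.

```latex
% Generic Occam/compression bound (for intuition only; not the talk's exact result).
% For a loss bounded in [0,1], a training sample of size n, and a model h whose
% prefix-free compressed description has length K(h) bits, with probability at
% least 1 - \delta over the sample:
\[
  R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{K(h)\,\ln 2 + \ln(1/\delta)}{2n}},
\]
% so the smaller the compressed size K(h) (the "simpler" the model), the tighter
% the guaranteed gap between training risk \hat{R}(h) and population risk R(h).
```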
Bio: Sanae Lotfi is a PhD candidate at NYU. She works on the science of deep learning, focusing on understanding the generalization properties of deep neural networks through notions related to generalization, such as model compression and loss surface analysis. Inspired by findings about generalization, Sanae works on building improved and robust deep learning models. Sanae's research has been recognized with numerous accolades, including the ICML Outstanding Paper Award and the Microsoft and Google DeepMind Fellowships. |