Keynote Title
Advancing the Science of Safe and Trustworthy AI Evaluation
AI architectures and models are increasingly evaluated through benchmarks and leaderboards, enabling broad comparison across systems while simultaneously creating a growing disconnect between measured performance and reliable real-world behavior. Systems optimized for benchmark success may still exhibit poor robustness, unexpected failures, or unsafe emergent behavior after deployment. In this keynote, drawing lessons from traditional machine learning, fairness-aware AI, and recent advances in foundation models and generative AI, I examine fundamental methodological challenges in current evaluation paradigms, including Goodhart’s law, limited construct and external validity, and the conflation of narrow competence with reliable behavior. I advocate complementing benchmarks with stress testing, out-of-distribution assessment, adversarial and interactive evaluation, uncertainty and robustness analysis, lifecycle-aware monitoring, socio-technical considerations, and human-centered evaluation methodologies. Ultimately, I argue that advancing safe and trustworthy AI requires not only more capable systems, but also a rigorous science of evaluation capable of distinguishing performed competence from genuinely reliable behavior.
Prof. Dr. Mykola Pechenizkiy is Chair of the Data Mining group at the Department of Mathematics and Computer Science, Eindhoven University of Technology. He is a founding Director of the Center for Safe AI. His research covers technical and socio-technical aspects of responsible AI, with a particular focus on evolving data and machine learning models. He has led and collaborated on several projects that have received international recognition, including the IDA 2023 Runner-up Frontier Prize, the IEEE ICDE 2023 Best Demo Award, the LoG 2022 Best Paper Award, the ALA@AAMAS 2022 Best Paper Award, and the IEEE DSAA 2022 Best Paper Award. He serves in various roles on the organizing committees, program committees, and boards of leading AI conferences and journals.