[Haiyun Lecture Series] 2026 No. 10 - Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

Published: 2026-04-13  Editor: Chen Lei


Title: Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

Speaker: Jing-Hao Xue, Department of Statistical Science, University College London

Time: 10:00-11:00, Thursday, 16 April 2026

Venue: Room 109, Building 3, West Zone, Xiang'an Campus, Xiamen University

Abstract: The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, the standard automatic evaluators, are predominantly trained on perspective image-text pairs, so whether they understand the unique characteristics of 360-degree panoramic image-text pairs remains an open question. In this talk, we will present some of our preliminary efforts to address this gap. We first introduce two concepts: 360-degree textual semantics, the semantic information conveyed by explicit format identifiers, and 360-degree visual semantics, the invariant semantics under horizontal circular shifts. We then probe CLIP's comprehension of these semantics by proposing novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Our statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we also propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, albeit with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. This is joint work with Hai Wang (UCL), Xiaochen Yang (Glasgow), and Mingzhi Dong (Bath).
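The horizontal circular shift at the heart of the probe can be sketched as follows. This is a minimal illustration only, assuming an equirectangular panorama stored as an (H, W, C) NumPy array and using `np.roll` for the shift; a toy array stands in for a real panorama, and the actual CLIP scoring pipeline from the talk is not shown:

```python
import numpy as np

def circular_shift(pano: np.ndarray, fraction: float) -> np.ndarray:
    """Horizontally roll an equirectangular panorama (H, W, C) by a
    fraction of its width. Because the panorama wraps around the full
    360 degrees, the shifted image depicts the same scene viewed from
    a horizontally rotated vantage point, so its semantics are unchanged."""
    shift = int(round(fraction * pano.shape[1]))
    return np.roll(pano, shift, axis=1)

# Toy 4 x 8 single-channel "panorama".
pano = np.arange(32).reshape(4, 8, 1)

# A full-circle shift (fraction = 1.0) must reproduce the original image.
assert np.array_equal(circular_shift(pano, 1.0), pano)

# A probe of the kind described above would compare a caption's CLIP
# similarity score against circular_shift(pano, f) for several values
# of f; a model that fully grasped 360-degree visual semantics would
# assign (near-)identical scores to every shift.
quarter = circular_shift(pano, 0.25)
print(quarter.shape)  # same shape as pano; columns rolled by 2
```

The key property exploited by the evaluation is that `circular_shift` is content-preserving: every pixel survives, only its horizontal position (viewpoint) changes, so any score variation under shifting reflects a limitation of the evaluator rather than of the image.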

Speaker Biography:

Jing-Hao Xue received his Dr.Eng. degree in signal and information processing from Tsinghua University in 1998 and his Ph.D. in statistics from the University of Glasgow in 2008. He is Professor of Statistical Pattern Recognition in the Department of Statistical Science at University College London. His research interests include statistical pattern recognition, machine learning, and computer vision. He also serves as a Senior Area Editor of IEEE T-CSVT.

Host: Prof. Yan Yan, Department of Computer Science and Technology